To explore Šolar in Sketch Engine, click on the image on the left or follow the link. To download the corpus, go to the availability section.
Corpora are digital collections of authentic texts collected for a particular purpose according to criteria defined in advance. Usually, corpora are made available together with tools for analysing and exploring language data. The Šolar corpus of school essays is a collection of authentic texts written by pupils and students in Slovene primary and secondary schools during school classes. The corpus was collected and processed in 2009-2010. It contains roughly one million words (967,477, to be exact). Based on the concept of foreign language learner corpora, it is the first corpus of this type in Slovenia. A distinctive feature of the corpus is that the language errors marked in the texts and integrated into the corpus were identified not by researchers but by teachers in class. This feature also enables researchers to assess teachers’ feedback on students’ use of language.
The Šolar corpus is a valuable resource for Slovene corpus linguistics, as it offers insight into the actual writing of the Slovene school population, whose written production was previously not available for systematic research. As a result of this gap, language reference materials and textbooks for primary and secondary schools are not based on real use of modern Slovene. The Šolar corpus is therefore a big step towards including real language problems in teaching materials for Slovene. This is due not only to the authentic texts in the corpus but also to the teachers’ comments and corrections, which offer both an analysis of students’ written production and insight into what is actually corrected (and not corrected) in the teaching process.
Šolar was created to enable research on the language skills of the Slovene school population. Corpus analysis makes it possible, for the first time, to detect general language problems and, more specifically, the problems the school population has with writing in Slovene. Appropriate didactic solutions can then be prepared on the basis of this analysis. The structure and annotation of the corpus enable research on various levels of linguistic description. In addition to linguistic annotation, the corpus includes annotation of teachers’ error corrections, using a system developed specifically for this corpus. Šolar is therefore both a tool and a resource that will enable teachers and linguists to explore pupils’ language competence during their education and to prepare new teaching material based on real language data.
Šolar consists of 2,703 texts written by students in Slovene secondary schools (age 15-19) and pupils in the 7th-9th grade of primary school (age 13-15), with a small percentage also from the 6th grade. School essays form the majority of the corpus (64.2%). Other material includes texts created during lessons, such as text recapitulations or descriptions, examples of formal applications, etc. (18%). The remainder is divided into two groups: the “answers to questions” group (16.1%) includes standard tests with questions and free-text answers written during lessons in different school subjects, and the “longer texts” group (1.7%) includes various longer texts written as parts of tests in Slovene language.
| Type of school | Number of texts | Percentage |
|---|---|---|
| Primary school | 505 | 18.7% |
| Secondary “academic” | 1,172 | 43.3% |
| Secondary “technical” | 843 | 31.2% |
| Secondary vocational | 183 | 6.8% |
Table 1: Percentage of texts according to the type of school
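The percentages in Table 1 can be recomputed from the text counts. The following minimal Python sketch does so; the variable and function names are illustrative and not part of the corpus distribution. Rounding each share to one decimal place reproduces the published values except for the secondary “academic” row, where plain rounding gives 43.4% rather than 43.3%. The published figure was presumably adjusted so that the column sums to exactly 100%, although the source does not say so.

```python
# Text counts per school type, copied from Table 1.
counts = {
    "Primary school": 505,
    'Secondary "academic"': 1172,
    'Secondary "technical"': 843,
    "Secondary vocational": 183,
}

# Total number of texts in the corpus.
total = sum(counts.values())


def share(n, total):
    """Return n as a percentage of total, rounded to one decimal place."""
    return round(100 * n / total, 1)


# Percentage share of each school type.
shares = {school: share(n, total) for school, n in counts.items()}

for school, pct in shares.items():
    print(f"{school:<24}{counts[school]:>6}{pct:>7.1f}%")
print(f"{'Total':<24}{total:>6}")
```

The counts sum to 2,703, which matches the total number of texts stated for the corpus.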
Image 1: Percentage of words per grade/year
A large proportion of the collected texts also contained teachers’ corrections, which were made when marking tests or served as feedback on the adequacy of students’ texts in terms of language norm, textual coherence or subject matter. These corrections therefore represent a ready-made analysis of students’ language problems on the one hand, and on the other offer insight into what teachers perceive as problematic in students’ texts. As we did not want to lose this valuable information, we decided to record all the corrections in a way that allows corpus tools to be used for their analysis. A system of labels was created for this purpose, and a style guide for the annotation of corrections was prepared. To illustrate the transcription process, an image of part of an original hand-written student essay is shown below, together with its transcription and a few annotations of teachers’ interventions: a correction, a textual comment and a graphical comment.
The owner of the Šolar corpus is the Ministry of Education, Science and Sport. The contract between the Ministry and the project partners stipulates that the following licence be used when transmitting databases to third parties and for the purpose of attributing copyright: “attribution” + “non-commercial” + “share alike”. This allows users to copy, distribute, transmit and alter the work and its adaptations, provided that the use is non-commercial and that users share the original work and any adaptations under the same conditions.
This work is offered under the licence: Creative Commons Attribution-NonCommercial-ShareAlike 2.5 Slovenia
The Šolar corpus is available in the CLARIN.SI repository: http://hdl.handle.net/11356/1036.
Concept and specifications: Simon Krek, Marko Stabej, Tadeja Rozman, Špela Arhar, Irena Krapš Vodopivec
Text acquisition: Tadeja Rozman and teachers:
Vanja Benko (OŠ Prežihovega Voranca, Ravne na Koroškem)
Barbara Bolarič (Gimnazija Šentvid)
Tatjana Dorman (Srednja šola za gostinstvo in turizem Maribor)
Katja Dragar and David Puc (Škofijska klasična gimnazija Ljubljana)
Andreja Dvornik and Tatjana Rupnik Hladnik (OŠ Poljane Ljubljana)
Nataša Felc and Vanda Trošt (OŠ Spodnja Idrija)
Janja Florjančič, Valentina Madjar Sitar, Katja Jović and Nada Fortuna Makar (Srednja zdravstvena in kemijska šola Novo mesto – Šolski center NM)
Vesna Gubenšek Bezgovšek (Srednja ekonomska šola Celje)
Polona Gujtman Maučec (OŠ II Murska Sobota)
Terezija Gujtman (OŠ III Murska Sobota)
Mojca Hafner and Mija Injac Ožbolt (Srednja ekonomska šola Ljubljana)
Tatjana Hafner (OŠ Sava Kladnika Sevnica)
Irena Hočevar, Tanja Luštek and Marinka Cerinšek (OŠ Frana Metelka Škocjan)
Ksenija Horvat (Srednja šola za farmacijo, kozmetiko in zdravstvo Ljubljana)
Irena Humar Kobal and Petra Gabriel (OŠ Dornberk)
Silva Kastelic, Katja Peršič and Lidija Jesenovec (Srednja zdravstvena šola Ljubljana)
Marjana Klemenčič Glavica, Darja Mlakar and Peter Prhavc (Gimnazija Ledina)
Petra Knapič (OŠ Jurija Vege Moravče)
Romana Kokošar (Gimnazija Jurija Vege Idrija)
Katja Koren Valenčič (Srednja šola Postojna – Šolski center Postojna)
Sanja Kostanjšek, Gordana Stepanovska, Jožica Jožef Beg, Tina Cvijanović, Magdalena Udovč, Barbara Grabnar Kregulj and Zlata Kocjan (Srednja elektro šola in tehniška gimnazija – Šolski center NM)
Nataša Kralj (Srednja elektro-računalniška šola Maribor)
Irena Krapš Vodopivec, Tatjana Božič and Bojana Kompara (Škofijska gimnazija Vipava)
Bernarda Kričej (Srednja šola Zagorje)
Jelka Kvartič (Gimnazija Velenje)
Katja Lasbaher (Srednja šola za elektrotehniko in računalništvo Ljubljana)
Mateja Medvešek Rjavec (Osnovna šola Milke Šobar Nataše, Črnomelj)
Andreja Mlakar and Erika Koren-Plahuta (OŠ Antona Globočnika Postojna)
Mojca Osvald (Gimnazija Bežigrad)
Katja Pobega (Pomorski in tehniški izobraževalni center Portorož)
Duška Safran (Srednja šola za gostinstvo in turizem Celje)
Suzana Skočaj Kavčič (Osnovna šola dr. Bogomirja Magajne Divača)
Mitja Spreizer (OŠ Križe)
Maja Sušin (Osnovna šola Trebnje)
Jožica Šalehar (OŠ Šentjernej)
Jana Škoda, Mateja Traven, Alenka Vene and Simona Karl (Šolski center Krško – Sevnica)
Nuša Šorn (Gimnazija Šiška)
Marija Velkovrh Petrič and Meta Rogelj (OŠ Livada Ljubljana)
Authors of the annotation system: Tadeja Rozman, Mojca Stritar, Simon Krek, Irena Krapš Vodopivec, Iztok Kosem
Annotation: Tadeja Rozman, Matic Korošec
Transcription: Marjeta Burja, Maja Dichlberger, Ana Fonda, Andreja Jankovič, Karmen Jordan, Alenka Laharnar, Melita Perkovič, Tomaž Potočnik, Eva Radič, Maja Rajh, Nina Stankovič, Simon Šuster, Andrej Tomažin, Martin Uranič, Barbara Vojsk, Urška Vranjek, Matic Korošec, Tadeja Rozman, Irena Krapš Vodopivec
Transcription review: Mojca Stritar, Melita Perkovič, Eva Radič, Matic Korošec, Tadeja Rozman
Conversion into XML format: Iztok Kosem, Mihael Arčan
Validation: Iztok Kosem, Karmen Kosem, Miro Romih
POS tagging: Peter Holozan, Miro Romih
Articles, books
Tadeja Rozman, Mojca Stritar and Iztok Kosem (2012): Šolar – korpus šolskih pisnih izdelkov. In: T. Rozman, I. Krapš Vodopivec, M. Stritar, I. Kosem: Empirični pogled na pouk slovenskega jezika. Ljubljana: Trojina, zavod za uporabno slovenistiko.
Iztok Kosem, Tadeja Rozman and Mojca Stritar (2011): How do Slovenian primary and secondary school students write and what their teachers correct: a corpus of student writing. In: Proceedings of The Corpus Linguistics Conference 2011 (Birmingham, 20-22 July 2011). Birmingham: University of Birmingham.
Iztok Kosem, Sara Može (2011): Rešitve slovničnih zagat na dosegu miške: analiza napak v besedilih učencev in dijakov za potrebe elektronskega slovničnega vira. In: S. Krajnc (ed.) Meddisciplinarnost v slovenistiki (Obdobja, Simpozij = Symposium, 30). Ljubljana: Znanstvena založba Filozofske fakultete, pp. 249-257.
Tadeja Rozman (2011): Šola(r) in slovnica. VideoLectures.net, 4 Feb. 2011.