KRES – BALANCED CORPUS

To try out the Kres web concordancer, click on the image on the left or follow the link.

WHAT IS KRES?

Corpora are digital collections of authentic texts gathered for a particular purpose according to criteria defined in advance. Corpora are usually made available together with tools for analysing and exploring language data. The Kres corpus is an extensive collection of Slovene texts of various genres: daily newspapers, magazines, all kinds of books (fiction, non-fiction, textbooks), web pages, and similar, with a balanced genre structure. It contains almost 100 million words (99,831,145 to be exact).

WHY DID WE BUILD KRES?

Kres was sampled from the Gigafida corpus and is a balanced corpus, above all by text type or genre. Corpora that aim to represent a language as a whole need to be both extensive and balanced by text type. Gigafida is an extensive reference corpus, but it is not balanced: 77% of its words come from periodicals (magazines, newspapers), while only 6% come from books (fiction, non-fiction). This structure results from the fact that almost the entire FidaPLUS corpus was included in Gigafida and all the new material was added to it. From the first stages of the project, our plans included building a balanced sub-corpus, which was later realized as the 100-million-word corpus Kres. Its structure is shown in the chart below.

[Chart: structure of Kres by text type]

Kres consists of texts published between 1990 and 2011. The larger number of words from 2010 is due to the part of the corpus that originates from the web (20%), since web crawling and processing were done in that year.

HOW DID WE BUILD KRES?

The basic sampling units were not entire corpus documents in Gigafida but random paragraphs, which ensured that the original Gigafida material is better represented in Kres. If entire documents had been chosen, some texts would not have been represented at all, while others (large documents comprising entire books or yearly collections of magazines) would have been overrepresented. Sampling was based on a table with bibliographic data about all the texts, together with the desired number of words for Kres.
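The idea of paragraph-level sampling against a target word count can be illustrated with a minimal Python sketch. It is not the procedure actually used to build Kres: the Paragraph class, the source identifiers, the whitespace word count and the rule "stop once the target is reached" are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Paragraph:
    source_id: str  # hypothetical identifier of the source text
    text: str


def word_count(text: str) -> int:
    # Whitespace tokenization; a real corpus pipeline would use a proper tokenizer.
    return len(text.split())


def sample_paragraphs(paragraphs, targets, seed=0):
    """Pick random paragraphs per source until its desired word count is reached.

    paragraphs: list of Paragraph
    targets: dict mapping source_id -> desired number of words (assumed table)
    """
    rng = random.Random(seed)

    # Group the paragraphs by the source they come from.
    by_source = {}
    for p in paragraphs:
        by_source.setdefault(p.source_id, []).append(p)

    sample = []
    for source_id, target in targets.items():
        pool = list(by_source.get(source_id, []))
        rng.shuffle(pool)  # random order, so every paragraph has an equal chance
        total = 0
        for p in pool:
            if total >= target:
                break
            sample.append(p)
            total += word_count(p.text)
    return sample
```

Because whole paragraphs are taken, a source may slightly exceed its desired word count, and a source with too little material simply contributes everything it has.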

The choice of texts for Kres, in terms of type and quantity, was determined on the basis of two sources of information: the National Reading Survey, with data on the readership of Slovene newspapers and magazines, and the Online Audience Measurement (MOSS), which determined the percentages of online text from the three most visited news portals (24ur.com, rtvslo.si, siol.net). In all other text-type categories we selected the material available in Gigafida: 71% of fiction, 36% of non-fiction, 96% of transcriptions of parliamentary debates and texts from national radio and television, and, in the web part of the corpus, 12.5% from company websites and 87.5% from governmental institutions.

WHO ARE THE USERS OF KRES?

Like Gigafida, Kres is intended for all users interested in modern Slovene: linguists and language specialists, teachers of Slovene in primary and secondary schools, their pupils and students, learners of Slovene as a second or foreign language, and all web users who try to solve a language problem by searching the web. However, because Kres is a balanced corpus, it is more relevant when users need more precise information about the general distribution of various language phenomena, as far as this can be inferred from a balanced sample of the language with a carefully planned structure. Because Kres is balanced on several levels and Gigafida is not, results from the two corpora will differ and are interesting to compare.

For Kres we used the same web concordancer as for Gigafida, which means that it includes automatic lemmatization of words in the query and immediate presentation of data in filters on the left side of the interface.

DOES KRES CONTAIN ONLY RAW TEXT?

Besides raw text, Kres contains other kinds of information. Each of the 21,456 corpus documents includes information about the source (e.g. Mladina magazine, Delo and Dnevnik newspapers), the year of publication, the text type (fiction, newspaper), and the title and author if they are known. In addition to document metadata, the corpus is tagged, which means that each word carries two additional types of information. The first is the basic form of the word, also called a lemma (e.g. jagode, jagodi, jagodam -> lemma = jagoda), and the second is a morpho-syntactic tag. This tag describes which part of speech the word belongs to (noun, verb, adjective, etc.) and what its morphological features are (e.g. gender, case, number). Since the corpus contains large quantities of text, tagging was automatic. It was done by a tagger called Obeliks, also developed within the “Communication in Slovene” project. You can test it in the web service.
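The two layers of annotation can be pictured with a small Python sketch. It is only an illustration: the Token class and its field names are hypothetical and do not reproduce the Obeliks output or the Kres XML format. The feature values show one possible reading of each word form, since forms such as jagode are ambiguous out of context.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Token:
    form: str       # the word as it appears in the text
    lemma: str      # basic (dictionary) form of the word
    pos: str        # part of speech
    features: dict  # morphological features, e.g. gender, number, case


# Illustrative analyses of the three forms mentioned above (one reading each).
tokens = [
    Token("jagode", "jagoda", "noun",
          {"gender": "feminine", "number": "plural", "case": "nominative"}),
    Token("jagodi", "jagoda", "noun",
          {"gender": "feminine", "number": "singular", "case": "dative"}),
    Token("jagodam", "jagoda", "noun",
          {"gender": "feminine", "number": "plural", "case": "dative"}),
]


def forms_by_lemma(tokens):
    """Map each lemma to the set of word forms that share it."""
    index = defaultdict(set)
    for t in tokens:
        index[t.lemma].add(t.form)
    return index


index = forms_by_lemma(tokens)
print(sorted(index["jagoda"]))  # ['jagodam', 'jagode', 'jagodi']
```

An index of this kind is what makes lemma-based searching possible: a query for any one form can be resolved to its lemma and then matched against all forms of that lemma.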

OWNERSHIP AND AVAILABILITY

The owner of the Kres corpus is the Ministry of Education, Science and Sport. The corpus is freely available online and can be accessed through various web concordancers. The corpus database in textual (XML) format is available only under a special contract signed between the owner and the user, because the copyright of the text providers needs to be protected. If you want to obtain the corpus in XML format or include it in your concordancer, write to info@slovenscina.eu. The ccKres corpus, a 9-percent part of Kres, is also available under a Creative Commons licence and can be downloaded from the open corpora page.

COLLABORATORS

Head of text acquisition and Kres sampling specifications: Nataša Logar Berginc
Text acquisition: Simon Šuster, Matic Korošec, Teja Roglič, Mateja Grča, Urška Sančanin, Tamara Ambrožič, Mitja Knapič, Nataša Gliha Komac
Text conversion: Simon Šuster
Web crawling and text processing: Miha Grčar
Linguistic annotation: Obeliks tagger (Miha Grčar, Matjaž Juršič, Simon Krek, Kaja Dobrovoljc)
XML scheme, TEI validation and Kres corpus sampling: Tomaž Erjavec
Web concordancer concept: Simon Rigač, Špela Arhar Holdt, Iztok Kosem, Simon Krek, Polona Gantar, Nataša Logar Berginc
Web concordancer programming: Rok Rejc, Simon Rigač

BIBLIOGRAPHY

Conference papers, Journal articles, Books

Tomaž Erjavec and Nataša Logar Berginc (2012): Referenčni korpusi slovenskega jezika (cc)Gigafida in (cc)KRES. In T. Erjavec, J. Žganec Gros (eds.): Zbornik Osme konference Jezikovne tehnologije. Ljubljana: Institut Jožef Stefan.

Nataša Logar Berginc, Miha Grčar, Marko Brakus, Tomaž Erjavec, Špela Arhar Holdt and Simon Krek (2012): Korpusi slovenskega jezika Gigafida, KRES, ccGigafida in ccKRES: gradnja, vsebina, uporaba. Ljubljana: Trojina, zavod za uporabno slovenistiko; Fakulteta za družbene vede.

Nataša Logar Berginc and Simon Krek (2010): New Slovene corpora within the “Communication in Slovene” project. Slavicorp conference. Warsaw.

Nataša Logar Berginc and Simon Šuster (2009): Gradnja novega korpusa slovenščine. Jezik in slovstvo 54/3–4. 57–68.