OPEN CORPORA

WHAT ARE ccGIGAFIDA AND ccKRES?

Several corpora of Slovene exist, both reference and specialized, but they can be searched only through online concordancers. This approach has limitations: it depends on the range of the concordancer’s search options, and the number of results shown is often restricted. Such access still meets the needs of most linguistic studies. It does not, however, serve language technology purposes, where the entire corpus is needed as a database in order to train or test language processing programmes, such as morphosyntactic tagging and lemmatization models.

For these reasons, we have sampled two subcorpora: one from the Gigafida corpus and one from its balanced version, the Kres corpus. The ccGigafida corpus contains approximately 9% of the Gigafida corpus, or 100 million words, and the ccKres corpus contains approximately 9% of the Kres corpus, or 10 million words. The structure of the sample corpora is the same as that of their parent corpora.

The ccGigafida and ccKres corpora enable others, including researchers abroad, to conduct in-depth linguistic and computational (language technology) analyses of the Slovene language without any restrictions.

DO ccGIGAFIDA AND ccKRES CONTAIN ONLY RAW TEXTS?

Besides raw text, both corpora contain other kinds of information. Each of the 31,722 corpus documents in ccGigafida and the 9,376 in ccKres includes information about the source (e.g. Mladina magazine, the Delo and Dnevnik newspapers), the year of publication, the text type (e.g. fiction, newspaper), and the title and author, if they are known. In addition to document metadata, the corpora are tagged, which means that each word carries two additional types of information. The first is the basic form of the word, also called a lemma (e.g. jagode, jagodi, jagodam -> lemma = jagoda), and the second is a morphosyntactic tag. This tag describes which part of speech the word belongs to (noun, verb, adjective, etc.) and what its morphological features are (e.g. gender, case, number). Since both corpora contain large quantities of text, tagging was automatic; it was done by a tagger called Obeliks, also developed within the “Communication in Slovene” project. You can test it in the web service. The tag set used for corpus tagging is described on the Linguistic Annotation of Slovene project web page. The corpora are encoded in XML TEI format (Text Encoding Initiative P5), described on the Korpusi SSJ web page.
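
As a rough illustration of how such word-level annotation might be read, the following Python sketch parses a small hand-written TEI-like fragment and collects the word form, lemma and morphosyntactic tag of each token. The element name w, the attribute names lemma and msd, and the tag value "N" are assumptions made for this example only; the actual element names and tag set should be checked against the encoding documentation and the corpus files themselves.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# TEI P5 namespace, written in the form ElementTree expects.
TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# Hand-written sample, NOT taken from the corpus. The <w> element and the
# "lemma"/"msd" attributes are assumed names; "N" is a placeholder tag.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><p><s>
    <w lemma="jagoda" msd="N">jagode</w>
    <w lemma="jagoda" msd="N">jagodi</w>
    <w lemma="jagoda" msd="N">jagodam</w>
  </s></p></body></text>
</TEI>"""


def read_tokens(xml_text):
    """Return a list of (word form, lemma, tag) tuples for every <w> element."""
    root = ET.fromstring(xml_text)
    return [
        (w.text, w.get("lemma"), w.get("msd"))
        for w in root.iter(TEI_NS + "w")
    ]


tokens = read_tokens(SAMPLE)

# Count how many word forms share each lemma.
lemma_counts = Counter(lemma for _, lemma, _ in tokens)

for form, lemma, msd in tokens:
    print(form, lemma, msd)
print(lemma_counts)
```

For a full corpus file, the same loop could be run over ET.parse(path).getroot() instead of a string, although for very large files an incremental parser such as ET.iterparse would likely be a better fit.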

OWNERSHIP AND AVAILABILITY

The owner of the ccGigafida and ccKres corpora is the Ministry of Education, Science and Sport. The contract between the Ministry and the project partners specifies the licence to be used when transmitting the databases to third parties and for attributing copyright: “attribution” + “non-commercial” + “share alike”. This licence allows users to copy, distribute, transmit, and alter the work and its adaptations, on the condition that the use is non-commercial and that users share the original work and any adaptations under the same conditions.

This work is offered under the licence: Creative Commons Attribution-NonCommercial-ShareAlike 2.5 Slovenia

1. The ccGigafida corpus is available in the CLARIN.SI repository: http://hdl.handle.net/11356/1035.

2. The ccKres corpus is available in the CLARIN.SI repository: http://hdl.handle.net/11356/1034.

COLLABORATORS

Head of Gigafida and Kres text acquisition: Nataša Logar Berginc
Text acquisition: Simon Šuster, Matic Korošec, Teja Roglič, Mateja Grča, Urška Sančanin, Tamara Ambrožič, Mitja Knapič, Nataša Gliha Komac
Text conversion: Simon Šuster
Web crawling: Miha Grčar
Linguistic annotation: Obeliks tagger (Miha Grčar, Matjaž Juršič, Simon Krek, Kaja Dobrovoljc)
XML schema, TEI validation, and sampling of the ccGigafida and ccKres corpora: Tomaž Erjavec

BIBLIOGRAPHY

Articles, books

Tomaž Erjavec and Nataša Logar Berginc (2012): Referenčni korpusi slovenskega jezika (cc)Gigafida in (cc)KRES. In T. Erjavec, J. Žganec Gros (eds.): Zbornik Osme konference Jezikovne tehnologije. Ljubljana: Institut Jožef Stefan.

Nataša Logar Berginc, Miha Grčar, Marko Brakus, Tomaž Erjavec, Špela Arhar Holdt and Simon Krek (2012): Korpusi slovenskega jezika Gigafida, KRES, ccGigafida in ccKRES: gradnja, vsebina, uporaba. Ljubljana: Trojina, zavod za uporabno slovenistiko; Fakulteta za družbene vede.