TRAINING CORPUS

WHAT IS A TRAINING CORPUS?

A training corpus is a collection of texts annotated with manually validated linguistic information. Machine-learning programmes use this information to build a statistical model, which statistical programmes can then apply to analyse new, unseen texts. The information can also be used to check the accuracy of rule-based programmes.

HOW DID WE MAKE THE ssj500k TRAINING CORPUS?

The ssj500k training corpus is based on two training corpora built within the JOS project. It contains the entire jos100k corpus and an additional 400,000 words from the million-word jos1M corpus. When a training corpus is built, the text, which consists of a sequence of characters (letters, numbers, spaces, symbols etc.), first has to be divided into meaningful units such as paragraphs, sentences, words and punctuation. These procedures are called segmentation (sentence identification) and tokenization (identification of tokens, i.e. words and punctuation). Two further types of information are attributed to each word: a base form, or lemma (jagodam, jagodami -> jagoda), and a morphosyntactic tag. The tag is an abbreviation that encodes the word class and the related morphosyntactic features, for example Somei = samostalnik (noun), občno ime (common noun), moški spol (masculine gender), ednina (singular), imenovalnik (nominative). The ssj500k corpus uses the JOS tagset, which contains 1,902 tags with combinations of categories and features according to the specifications of the JOS project.
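
The way such a tag packs features into single letters can be sketched in a few lines of Python. This is only an illustration built from the Somei example above: the lookup table covers just the five values named there, the assumption that each feature sits at a fixed position after the word-class letter is made for the sake of the sketch, and the names WORD_CLASSES, NOUN_POSITIONS and decode_tag are hypothetical. The full JOS specifications define many more categories and values.

```python
# Illustrative only: covers just the feature values named in the text
# (the "Somei" example), not the full JOS tagset.
NOUN_POSITIONS = [
    ("type", {"o": "common"}),        # občno ime
    ("gender", {"m": "masculine"}),   # moški spol
    ("number", {"e": "singular"}),    # ednina
    ("case", {"i": "nominative"}),    # imenovalnik
]

# First letter of the tag selects the word class and its feature positions.
WORD_CLASSES = {"S": ("noun", NOUN_POSITIONS)}  # samostalnik


def decode_tag(tag):
    """Expand a tag such as 'Somei' into a dict of named features."""
    if not tag or tag[0] not in WORD_CLASSES:
        raise ValueError(f"word class not covered by this sketch: {tag!r}")
    word_class, positions = WORD_CLASSES[tag[0]]
    features = {"word class": word_class}
    # Pair each remaining letter with the feature expected at that position.
    for letter, (name, values) in zip(tag[1:], positions):
        features[name] = values.get(letter, f"unknown ({letter})")
    return features


print(decode_tag("Somei"))
```

Letters that the small table does not know are reported as unknown rather than guessed, so the sketch stays honest about how little of the tagset it covers.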

A training corpus can also contain other data, such as syntactic information (subject, predicate, object, adjunct etc.), named entities (e.g. Peter, Državni zbor, NASA, Mestna občina Ljubljana) and the links between pronouns and their referents. The ssj500k training corpus contains manually validated information obtained by segmentation, tokenization, lemmatization, morphosyntactic tagging, parsing (11,411 sentences) and named entity recognition (personal name, place name, proper name).

All linguistic information (tags, lemmas, tokens) was manually re-validated during the transfer of data from the jos100k and jos1M corpora, and the number of parsed and manually checked sentences has increased. Named entity annotation was added to the jos100k texts for named entity recognition purposes. In contrast to the procedures used in building the jos100k and jos1M corpora, segmentation and tokenization were manually validated and corrected for the ssj500k corpus, which, among other things, makes it possible to measure the accuracy of the algorithms used for these two procedures. Statistical information on the elements in the ssj500k corpus is given in the table below.

Element	Description	Number
<div>	division/text	1,677
<p>	paragraph	8,137
<s>	sentence	27,829
<w>	word	500,295
<c>	punctuation/symbol	85,953
<w> + <c>	token	586,248
<links>	element containing dependency tree links	11,411
<link>	dependency tree link	235,865
<chunks>	element with named entity links	2,178
<chunk>	named entity	4,398
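
Counts like those in the table can be reproduced by walking the XML tree and tallying element names. The following Python sketch does this on a tiny hand-written fragment, not on the real corpus: the SAMPLE text, the lemma attribute and the count_elements function are assumptions made for illustration, and the fragment has no XML namespace, whereas a real TEI file would normally declare one and the tag names would then need to be matched with that namespace.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical fragment using the element names from the table;
# the attribute name "lemma" is an assumption, not taken from the corpus.
SAMPLE = """
<div>
  <p>
    <s>
      <w lemma="jagoda">jagodam</w>
      <c>,</c>
      <w lemma="jagoda">jagodami</w>
      <c>.</c>
    </s>
  </p>
</div>
"""


def count_elements(xml_text):
    """Count elements by tag name and derive the token total (<w> + <c>)."""
    root = ET.fromstring(xml_text)
    counts = Counter(element.tag for element in root.iter())
    counts["token"] = counts["w"] + counts["c"]
    return counts


counts = count_elements(SAMPLE)
for name in ("div", "p", "s", "w", "c", "token"):
    print(name, counts[name])
```

The token total is derived in the same way as the corresponding row of the table, as the sum of the word and punctuation counts.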

OWNERSHIP AND AVAILABILITY

The owner of the ssj500k training corpus is the Ministry of Education, Science and Sport. Under the contract between the Ministry and the project partners, the following licence applies when databases are passed on to third parties and for the purpose of attributing copyright: “attribution” + “non-commercial” + “share alike”. It allows users to copy, distribute, transmit and alter the work and its adaptations, provided that the use is non-commercial and that users in turn share the original work and any adaptations under the same conditions.

This work is offered under the Creative Commons Attribution-NonCommercial-ShareAlike 2.5 Slovenia licence.

The ssj500k training corpus is available in the CLARIN.SI repository: http://hdl.handle.net/11356/1029.

AUTHORS AND COLLABORATORS

Coordination of manual annotation of morpho-syntactic tags, dependency trees and named entities: Simon Krek
Morpho-syntactic tagging annotation: Kristina Bizjak, Živa Blaževič, Klara Canzutti, Lea Cibrič, Kaja Dobrovoljc, Tadeja Dušej, Ivana Fekeža, Nanika Holz, Urška Kamenšek, Andreja Košir, Robert Kuret, Andrej Lovšin, Boštjan Marhold, Nina Mikulin, Barbara Modrijan, Tanja Novak, Lea Peršič, Tanja Radovič, Simona Šinkovec, Urška Vranjek, Jerneja Umer, Petra Žalodec
Dependency treebank annotation: Kaja Dobrovoljc, Nanika Holz, Nina Ledinek, Sara Može
Named entity annotation: Nanika Holz
Manual verification of automatic segmentation and tokenization: Kaja Dobrovoljc
TEI format: Tomaž Erjavec

BIBLIOGRAPHY

Guidelines

Peter Holozan, Simon Krek, Matej Pivec, Simon Rigač, Simon Rozman, Aleš Velušček, 2008: Specifikacije za učni korpus. Projekt »Sporazumevanje v slovenskem jeziku« ESS in MŠŠ.

Videolectures

Špela Arhar Holdt: Jezikovne tehnologije in nove metode. Slovarji, več kot le besede, 2009.