DOWNLOAD

LEXICAL DATABASE FOR LANGUAGE TECHNOLOGIES

The Slovene Lexical Database, available in XML format, is also a rich source of information for natural language processing. The language technology community still often relies on raw corpus data when processing lexical information; however, there is a growing body of grammatical and semantic arguments that stress the importance of including linguistic facts and theories when processing language data.

The lexical database is a machine-readable language resource in which word meanings are linked with a range of specific lexical and syntactic information about their contexts. For example, it is possible to automatically link a meaning of a verb with its prototypical sentence pattern and other frequent (e.g. prepositional) patterns. At the same time, it is possible to link valency information for a particular meaning, identified by semantic types, with typical lexical realizations, registered as collocations and syntactic structures. Information on grammar, domain, style and connotation, together with register labels and grammatical restrictions, enables sense disambiguation when decoding text and an appropriate choice of language when encoding.

Syntactic structures, formalized as two- or three-word compounds (the latter are predominantly prepositional), contain essential information for the automatic extraction of collocational information and corpus examples. This has already been tested during the compilation of the lexical database; a description of the procedures and the results can be found in Kosem et al. (2012).
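As a rough illustration of how the formalized structures can support extraction, the Python sketch below groups collocates by their syntactic structure. The XML fragment is invented for this example: the element names follow the table in the next section, but the exact nesting, the structure notation ("pbz0 SBZ0", "gbz SBZ4") and all text content are assumptions, not data taken from the database. The function name collocates_by_structure is likewise hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Invented fragment; nesting, notation and content are assumptions.
SAMPLE_STRUCTURES = """\
<skladenjske_skupine>
  <skladenjska_struktura>
    <struktura>pbz0 SBZ0</struktura>
    <kolokacije>
      <kolokacija><k>okrogla</k>, <k>lesena</k> miza</kolokacija>
    </kolokacije>
    <zgledi>
      <zgled>Sedeli so za okroglo <i>mizo</i>.</zgled>
    </zgledi>
  </skladenjska_struktura>
  <skladenjska_struktura>
    <struktura>gbz SBZ4</struktura>
    <kolokacije>
      <kolokacija><k>pogrniti</k> mizo</kolokacija>
    </kolokacije>
    <zgledi>
      <zgled>Pogrnila je <i>mizo</i>.</zgled>
    </zgledi>
  </skladenjska_struktura>
</skladenjske_skupine>
"""


def collocates_by_structure(xml_text):
    """Map each <struktura> string to the collocates found in its <k> elements."""
    root = ET.fromstring(xml_text)
    grouped = defaultdict(list)
    for structure in root.iter("skladenjska_struktura"):
        pattern = structure.findtext("struktura")
        # Only look at <k> elements inside collocations of this structure.
        for collocate in structure.findall("kolokacije/kolokacija/k"):
            grouped[pattern].append(collocate.text)
    return dict(grouped)


result = collocates_by_structure(SAMPLE_STRUCTURES)
for pattern, collocates in result.items():
    print(pattern, collocates)
```

Grouping by the structure string keeps the grammatical pattern attached to every collocate, which is the kind of link between syntax and lexis that the paragraph above describes.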

Information in the lexical database can be linked with the data in Slovene Wordnet, FrameNet and other lexical resources, and used for the purposes of word sense disambiguation, automatic data extraction, question answering systems, automatic translation and in applications based on language data. In addition, the lexical database can be used for automatic morphosyntactic tagging, parsing, and semantic annotation of texts, and for the improvement of tools such as parsers and taggers for Slovene.

STRUCTURE OF AN ENTRY IN SLOVENE LEXICAL DATABASE (SLD) AND XML TAGS

For the lexical database, a DTD schema was developed, adapted for dictionary-writing systems such as IDM’s DPS Dictionary Writing System, iLex by the Danish EMP, TLex by TshwaneDJe from South Africa, Lingvo.Pro by the Russian ABBYY, and similar. The schema allows the information to be adjusted for different kinds of dictionaries by adding or changing categories of data and the hierarchical relations between them.

By clicking on the link you can download a zip file containing the Document Type Definition (DTD) and the W3C schema (XSD) that define the formal structure of the lexical database in XML format.

The table below describes the contents of XML Schema elements, and their tag forms. The content part briefly describes types of lexical, lexico-grammatical, style, grammar and other data, the purpose for inclusion and the relationship towards other lower or upper level elements in the schema. Hierarchically superior elements, containing different sub-elements and/or attributes, are shown in grey.

 

SLD content

DTD element

Contains the entire entry within the <glava> and <geslo> elements.

<clanek></clanek>

Contains the elements <oblika> and <zaglavje>.

<glava></glava>

Contains the elements <zapis>, <korpus> and <iztocnica>.

<oblika></oblika>

Under <oblika>: contains the spelling of the headword for the purposes of database searches and for some other (internal) purposes.

<zapis></zapis>

Under <oblika>: contains the information on the frequency of the headword in the lemma form in the Gigafida corpus.

<korpus></korpus>

Under <oblika>: contains the lemma form of the headword.

<iztocnica></iztocnica>

Contains the following elements: <besvrs> and <oznaka>.

<zaglavje></zaglavje>

Under <zaglavje>: contains the word class attributed to the lemma in the Gigafida corpus.

<besvrs></besvrs>

Under <zaglavje>: contains domain, connotation, register and grammar information on the lemma.

<oznaka tip="attribute"></oznaka>

Contains elements, specified as lexical units: <pomen> (sense) and <podpomen> (subsense), <stalne_zveze> (multi-word units) and <frazeoloske_zveze> (phraseological units).

<geslo></geslo>

Contains the elements that define a lexical unit: <indikator> and <pomenska_shema> are compulsory sub-elements.

<pomen></pomen>

Under the elements <pomen>, <podpomen>, <stalne_zveze> and <frazeoloske_zveze>: contains a short indicator of the sense, intended to create an association with the meaning of the sense and to act as part of the sense menu.

<indikator></indikator>

Under <indikator>: contains a pragmatic explanation of the sense of the word, multi-word unit or phraseological unit.

<pr></pr>

Under <pomen> or <podpomen>: contains the information on the form of the word, typical of a particular sense.

<ustaljena_oblika></ustaljena_oblika>

Under <pomen> or <podpomen>: contains domain, connotation, register and grammar information on the sense.

<oznaka tip="attribute"></oznaka>

Under <pomen> or <podpomen>: contains a semantic argument of the sense, written in whole sentence format with SEMANTIC TYPES as abstract representatives of typical roles in the valency positions.

<pomenska_shema></pomenska_shema>

Under <pomen>, <podpomen>, <stalne_zveze> and <frazeoloske_zveze>: contains the definition that explains the main semantic tendencies of the sense in a user-friendly manner, based on corpus data.

<definicija1></definicija1>

<definicija2></definicija2>

Contains <skladenjska_struktura> elements, including associated collocations, patterns and corpus examples.

<skladenjske_skupine></skladenjske_skupine>

Must contain at least one <skladenjska_struktura> element and corresponding corpus examples in <zgledi>. In the majority of cases it also contains <kolokacije> (collocations) and <vzorec> (patterns).

<skladenjska_struktura></skladenjska_struktura>

Under <skladenjska_struktura>: contains the information on the structure written as a combination of word class and case. Word class of the headword is written in capital letters.

<struktura></struktura>

Within <struktura>: contains the information on grammatical restrictions of the element in the structure, e.g. restriction to a specific number, verb form etc.

<r></r>

Under <skladenjska_struktura>: for verb headwords, it can contain several successive realizations of a prototypical valency pattern, presented in the semantic frame.

<vzorec></vzorec>

Contains at least one element <kolokacija> (collocations) and related examples. In most cases extended collocations are provided.

<kolokacije></kolokacije>

Within <kolokacije>: contains the headword and a group of semantically or morphologically related collocates in the <k></k> element.

<kolokacija><k></k></kolokacija>

Under <kolokacije>: contains the headword and a group of collocates that can be expanded by a group of its own collocates under the <k></k> element.

<r_kolokacija></r_kolokacija>

Under <skladenjske_skupine>, <skladenjske_zveze>, <stalne_zveze> and <frazeoloske_zveze>: must contain at least one <zgled> (example).

<zgledi></zgledi>

Under <zgledi>: more than one can be used; contains a corpus example attesting the recorded collocates, extended collocations and patterns. The headword in the example is written in bold in the <i></i> element.

<zgled><i></i></zgled>

Contains at least one <skladenjska_zveza> element with related multi-word units, collocations, patterns and corpus examples.

<skladenjske_zveze></skladenjske_zveze>

Must contain at least one <zveza> element and related <zgledi> (examples). Less frequently, it contains collocates and patterns.

<skladenjska_zveza></skladenjska_zveza>

Under <skladenjska_zveza>: contains semantically transparent and structurally fixed parts of the language, often with a semantically and/or morphologically predictable empty slot in the <k></k> element.

<zveza><k></k></zveza>

Under <pomen>: contains elements (other than <pomen>) that describe the lexical unit. Must contain the <indikator> and <pomenska_shema> elements.

<podpomen></podpomen>

Contains at least one <stalna_zveza> element.

<stalne_zveze></stalne_zveze>

Must contain the <zveza>, <indikator>, <struktura> and <zgledi> elements. In some cases also collocations and extended collocations.

<stalna_zveza></stalna_zveza>

Under <stalna_zveza>: contains a multi-word unit, including its variant forms, separated by a slash. Different cases of the multi-word unit and different forms of its use are listed in separate <zveza></zveza> elements.

<zveza></zveza>

Contains at least one <frazeoloska_enota> element.

<frazeoloske_zveze></frazeoloske_zveze>

Must contain elements <enota> and <indikator>. Often also collocations and labels.

<frazeoloska_enota></frazeoloska_enota>

Under <frazeoloska_enota>: contains a phraseological unit, including variants of individual elements, separated by a slash. Different cases of the phraseological unit and different forms of its use are listed in separate <enota></enota> elements.

<enota></enota>
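To make the hierarchy in the table more concrete, here is a minimal Python sketch that reads the head and the sense indicators of one entry with the standard library. The sample entry is invented for illustration: the element names come from the table above, but the nesting shown, the frequency value and all text content are assumptions and may differ from the actual files. The function name summarize_entry is hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented entry built from the element descriptions in the table above.
# All values (headword, frequency, word class, indicators) are placeholders.
SAMPLE_ENTRY = """\
<clanek>
  <glava>
    <oblika>
      <zapis>miza</zapis>
      <korpus>100</korpus>
      <iztocnica>miza</iztocnica>
    </oblika>
    <zaglavje>
      <besvrs>samostalnik</besvrs>
    </zaglavje>
  </glava>
  <geslo>
    <pomen>
      <indikator>kos pohištva</indikator>
      <pomenska_shema>placeholder semantic frame</pomenska_shema>
    </pomen>
    <pomen>
      <indikator>ljudje za mizo</indikator>
      <pomenska_shema>placeholder semantic frame</pomenska_shema>
    </pomen>
  </geslo>
</clanek>
"""


def summarize_entry(xml_text):
    """Return the headword, corpus frequency, word class and sense indicators."""
    root = ET.fromstring(xml_text)  # root is the <clanek> element
    head = root.find("glava")
    return {
        "headword": head.findtext("oblika/iztocnica"),
        # Kept as text; whether the value is always a plain integer is not assumed.
        "frequency": head.findtext("oblika/korpus"),
        "word_class": head.findtext("zaglavje/besvrs"),
        "senses": [sense.findtext("indikator")
                   for sense in root.findall("geslo/pomen")],
    }


summary = summarize_entry(SAMPLE_ENTRY)
print(summary)
```

The same pattern of path lookups should extend to the other lexical units in the table, such as <stalne_zveze> and <frazeoloske_zveze>, once their actual nesting has been checked against the DTD.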

OWNERSHIP AND AVAILABILITY

The Slovene Lexical Database is owned by the Ministry of Education, Science and Sport. Under the contract between the Ministry and the project partners, the following licence applies when the database is passed on to third parties and for the purpose of attributing copyright: “attribution” + “non-commercial” + “share alike”. It allows users to copy, distribute, transmit and alter the work and its adaptations, provided that the use is non-commercial and that users share the original work and any adaptations under the same conditions.

Creative Commons licence
This work is offered under the licence: Creative Commons Attribution-NonCommercial-ShareAlike 2.5 Slovenia

Slovene Lexical Database is available in the CLARIN.SI repository: http://hdl.handle.net/11356/1030.

AUTHORS AND COLLABORATORS

DTD file and W3C schema: Simon Krek, Iztok Kosem, Polona Gantar