GOS – SPOKEN CORPUS
To try out the GOS web concordancer, click on the image on the left or follow the link.
To download the corpus go to the availability section.
WHAT IS GOS?
GOS is a corpus of spoken Slovene that includes the transcripts of approximately 120 hours of speech that we are exposed to on a daily basis in various situations: radio and TV shows, school lessons and lectures, private conversations between friends or within the family, work meetings, consultations, conversations in buying and selling situations, etc. All speech is transcribed in two versions – with pronunciation-based spelling and with standardized spelling – and it comprises over one million words. The corpus can be searched by means of the web concordancer available on this website; furthermore, for all concordances it is possible to listen to the corresponding recordings. The corpus was created within the “Communication in Slovene” project.
WHY DID WE BUILD GOS?
Because very little is known about our everyday speech. On the one hand, Slovene grammar books, dictionaries, and school books, as well as Slovene lessons, focus mainly on the written form of Slovene and the standard spoken form. Dialectology, on the other hand, covers the (past) phonological systems, morphological paradigms, and vocabulary of “pure” dialects, which are spoken by people who are now elderly and are rapidly disappearing. However, in everyday life neither the prescribed standard Slovene nor pure dialects can be heard very often, and only very few of us can speak one or the other. So what is the Slovene that we actually speak? The only way to answer this question is to record the Slovene language in its most authentic form, transcribe it, and then listen to it and analyse it. That is why GOS was built.
WHO ARE THE USERS OF GOS?
All those who wish to analyse spoken Slovene, either from a linguistic perspective or from another one, e.g. that of sociology or language technology. However, the audience of GOS is wider: the GOS web concordancer is simple and user-friendly enough to be used also by teachers in Slovene lessons at school or in courses of Slovene as a foreign language, by speech editors in radio, television, and theatre, by interpreters, by writers, and by anyone else who encounters issues related to spoken Slovene.
WHAT IS THE COMPOSITION OF GOS?
The speech transcripts included in GOS were selected to ensure that the corpus be as representative as possible of contemporary spoken Slovene in the most common everyday situations. In addition to the representativeness of various situations, the criterion of the representativeness of speakers was taken into account in the recording of materials for GOS and therefore the part which includes recordings of private conversations comprises an appropriate share of speakers from different regions, of both genders, of different ages, and of different education levels.
It must be noted, however, that a truly representative corpus would require a much larger sample than the existing 120 hours of speech. Truly representative corpora include several million words, whereas GOS comprises only about one million. We therefore hope that the corpus will continue to grow.
THE GOS CORPUS AS A DATABASE: WHAT DOES IT CONTAIN?
1. Speech recordings
2. Pronunciation-based transcripts following the principle “write it down as you hear it” (e.g. tko)
3. Transcripts based on standardized spelling following the principle “write it down the way we write” (for the same example: tako)
4. Information about the basic form (lemma) and morphological features of a word, which is automatically added to the standardized spelling form
5. Information about the situation in which the recording was made
6. Information about the speaker
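The six components listed above can be pictured as one record per recording, with each word stored in both spellings. The following is only a minimal sketch for illustration: the class names, field names, file name, speaker label, and the "adverb" feature value are all hypothetical and do not reflect the actual XML schema or tagset used in GOS.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Token:
    """One transcribed word (hypothetical layout, not the real GOS schema)."""
    pronunciation: str  # item 2: "write it down as you hear it"
    standardized: str   # item 3: "write it down the way we write"
    lemma: str          # item 4: basic form of the word
    features: str       # item 4: morphological features (placeholder value)


@dataclass
class Recording:
    """One recording with its metadata and transcript."""
    audio_file: str     # item 1: the speech recording (hypothetical file name)
    situation: str      # item 5: situation in which the recording was made
    speaker: str        # item 6: information about the speaker
    tokens: List[Token] = field(default_factory=list)

    def differing_tokens(self) -> List[Token]:
        """Return tokens whose pronunciation-based spelling differs from the standardized one."""
        return [t for t in self.tokens if t.pronunciation != t.standardized]


# The example word pair from the list above: "tko" as heard, "tako" as written.
example = Recording(
    audio_file="example.wav",
    situation="private conversation",
    speaker="speaker-01",
    tokens=[Token("tko", "tako", "tako", "adverb")],
)

for token in example.differing_tokens():
    print(token.pronunciation, "->", token.standardized)
```

Keeping both spellings on the same token is what would make it possible to search by the standardized form while still seeing how the word was actually pronounced.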
AVAILABILITY
The owner of the GOS corpus is the Ministry of Education, Science and Sport. The contract between the Ministry and the project partners specifies the licence to be used when the database is passed on to third parties and when copyright is attributed: “attribution” + “non-commercial” + “share alike”. This licence allows users to copy, distribute, transmit, and adapt the work, provided that the use is non-commercial and that users share the original work and any adaptations under the same conditions.
This work is offered under the licence: Creative Commons Attribution-NonCommercial-ShareAlike 2.5 Slovenia
The GOS corpus is available in the CLARIN.SI repository: http://hdl.handle.net/11356/1040.
AUTHORS AND COLLABORATORS
Specifications for the GOS corpus: Simon Krek, Agnes Pisanski Peterlin, Marko Stabej, Tina Verovnik, Jana Zemljarič Miklavčič, Ana Zwitter Vitez
Recording: Ana Zwitter Vitez, Brigita Bec, Mojca Bizjak, Rebeka Dragič, Aja Barbo Gruden, Jernej Golobič, Andreja Gregorič, Pija Kapitanovič, Ana Kočevar, Katja Krapež, Jaruška Majovski, Iztok Mikulan, Alenka Mirkac, Dusán Mukics, Barbara Omahen, Neža Pahovnik, Tomaž Potočnik, Lucija Ramovš, Lucija Rap, Erika M. Roblek, Mateja Strmšek, Ivana Šlaus, Maja Štefančič, Jure Tompa, Andrej Tomše, Slavka Vesenjak, Pija Vrezner
Management of the recordings: Rebeka Dragič
Transcription – pronunciation-based spelling: Aja Barbo Gruden, Mariša Bizjak, Mojca Bizjak, Jernej Golobič, Ana Gorinšek, Katja Krapež, Jaruška Majovski, Iztok Mikulan, Alenka Mirkac, Barbara Omahen, Neža Pahovnik, Tomaž Potočnik, Erika M. Roblek, Mateja Strmšek, Maja Štefančič, Maja Šučur, Andrej Tomše, Bojana Zevnik
Transcription check – pronunciation-based spelling: Mariša Bizjak, Alenka Mirkac, Tomaž Potočnik, Andrej Tomše
Transcription validation – pronunciation-based spelling: Ana Zwitter Vitez
Transcription – standardized spelling: Ana Zwitter Vitez
XML schema for the texts: Tomaž Erjavec
Head of the Web Concordancer for the National Corpus of Spoken Slovene project: Darinka Verdonik
File processing: Amebis, d. o. o., Kamnik; Fakulteta za elektrotehniko, računalništvo in informatiko, Univerza v Mariboru
Design and programming of the GOS concordancer: Rok Rejc, Simon Rigač
Verdonik, Darinka, Kosem, Iztok, Zwitter Vitez, Ana, Krek, Simon, Stabej, Marko, 2013: Compilation, transcription and usage of a reference speech corpus: The case of the Slovene corpus GOS. Language Resources and Evaluation 47(4), ISSN 1574-020X. 1031–1048. doi: 10.1007/s10579-013-9216-5.
Verdonik, Darinka, Zwitter Vitez, Ana, 2011: Slovenski govorni korpus Gos. Ljubljana: Trojina, zavod za uporabno slovenistiko.
Zemljarič Miklavčič, Jana, Stabej, Marko, Krek, Simon, Zwitter Vitez, Ana, 2009: Kaj in zakaj v referenčni govorni korpus slovenščine. Stabej, Marko (ed.): Obdobja 28: Infrastruktura slovenščine in slovenistike. Ljubljana: Znanstvena založba Filozofske fakultete Univerze v Ljubljani. 437–442.
Zwitter Vitez, Ana, Zemljarič Miklavčič, Jana, Stabej, Marko, Krek, Simon, 2009: Načela transkribiranja in označevanja posnetkov v referenčnem govornem korpusu slovenščine. Stabej, Marko (ed.): Obdobja 28: Infrastruktura slovenščine in slovenistike. Ljubljana: Znanstvena založba Filozofske fakultete Univerze v Ljubljani. 437–442.
Zwitter Vitez, Ana, 2010: Kako in zakaj uporabljati govorni korpus slovenskega jezika. Presentation at the conference Korpusi, več kot le statistika, Ljubljana, FDV.
Verdonik, Darinka, Zwitter Vitez, Ana, Romih, Miro, Krek, Simon, 2010: Konkordančnik za govorni korpus GOS. Erjavec, Tomaž, Žganec Gros, Jerneja (eds.): Zbornik Sedme konference Jezikovne tehnologije – IS 2010. Ljubljana: Institut Jožef Stefan. 12–15.
Verdonik, Darinka, 2011: Govorni korpus kot lektorjev priročnik. Krakar Vogel, Boža (ed.): Slavistika v regijah – Maribor: Zbornik Slavističnega društva Slovenije. Ljubljana: Zveza društev Slavistično društvo Slovenije. 171–173.
Zwitter Vitez, Ana, 2011: Korpus Gos in njegova uporaba v raziskovalne, didaktične in ljubiteljske namene. Kranjc, Simona (ed.): Meddisciplinarnost v slovenistiki – Obdobja 30. Ljubljana: Center za slovenščino kot drugi/tuji jezik. 559–564.