Grammatically coded Corpus of spoken Lithuanian: methodology and development

Collection:
Scientific publications
Document Type:
Article
Language:
English
Title:
Grammatically coded Corpus of spoken Lithuanian: methodology and development
In the Journal:
Keywords:
Lexicon; Spoken language.
Summary / Abstract:

Keywords: CHILDES; Corpus of Spoken Lithuanian; Grammatical annotation; Grammatical disambiguation; Lexicon; Spoken Lithuanian corpus; Spontaneous communication.

The paper deals with the main methodological issues of the Corpus of Spoken Lithuanian, whose development began in 2006. At present, the corpus consists of 300,000 grammatically annotated word forms. The creation of the corpus consists of three main stages: collecting the data, transcribing the recorded data, and annotating it grammatically. Data collection was based on the principles of balance and naturalness. The recorded speech was transcribed according to the CHAT requirements of CHILDES. The transcripts were double-checked and annotated grammatically using CHILDES. The development of the Corpus of Spoken Lithuanian has led to a steady increase in studies of spontaneous communication, and various papers have dealt with the distribution of parts of speech, the use of different grammatical forms, variation in inflectional paradigms, the distribution of fillers, the syntactic functions of adjectives, and the mean length of utterances. [From the publication]

DOI:
10.5281/zenodo.1129916
ISSN:
1307-6892
Permalink:
https://www.lituanistika.lt/content/76914
Updated:
2020-07-28 20:26:10