BUILDING KAZAKH LANGUAGE OPEN SOURCE CORPORA USING WIKIPEDIA RESOURCES

dc.contributor.author: Chapaev D.
dc.contributor.author: Turapbekov B.
dc.date.accessioned: 2023-10-31T11:14:55Z
dc.date.available: 2023-10-31T11:14:55Z
dc.date.issued: 2018
dc.description.abstract: The lack of a free, publicly accessible Kazakh language corpus is one of the difficulties that researchers in Kazakh linguistics face. Corpora serve as a data source in statistical linguistics for extracting unigrams, bigrams, and higher-order n-grams. These data help analyze the structure of the language, for example by identifying its most frequently used words. This paper is a step towards supporting Kazakh linguistics with an open source corpus built from Wikipedia dumps, together with one of its applications, a Kazakh spell checker. The corpus currently contains over 21 million words. It is open source, and contributions and suggestions are welcome.
dc.identifier.citation: D. Chapaev, B. Turapbekov / BUILDING KAZAKH LANGUAGE OPEN SOURCE CORPORA USING WIKIPEDIA RESOURCES / СДУ хабаршысы - 2018
dc.identifier.issn: 2415-8135
dc.identifier.uri: https://repository.sdu.edu.kz/handle/123456789/675
dc.language.iso: en
dc.publisher: СДУ хабаршысы - 2018
dc.subject: N-gram, corpus, spell checker, tokenizer.
dc.subject: СДУ хабаршысы - 2018
dc.subject: №2
dc.title: BUILDING KAZAKH LANGUAGE OPEN SOURCE CORPORA USING WIKIPEDIA RESOURCES
dc.type: Article
dspace.entity.type:

Files

Original bundle
Name: 2018.2-154-160.pdf
Size: 633.25 KB
Format: Adobe Portable Document Format