BUILDING KAZAKH LANGUAGE OPEN SOURCE CORPORA USING WIKIPEDIA RESOURCES
dc.contributor.author | Chapaev D. | |
dc.contributor.author | Turapbekov B. | |
dc.date.accessioned | 2023-10-31T11:14:55Z | |
dc.date.available | 2023-10-31T11:14:55Z | |
dc.date.issued | 2018 | |
dc.description.abstract | Abstract. The lack of a free, publicly accessible Kazakh language corpus is one of the difficulties that researchers in Kazakh linguistics face. Corpora serve as a data source in statistical linguistics for extracting unigrams, bigrams, and higher-order n-grams. These data help analyze the structure of the language, for example by identifying its most frequently used words. This paper takes a step towards supporting Kazakh linguistics with an open source corpus built from Wikipedia dumps, together with one of its applications, a Kazakh spell checker. The corpus currently contains over 21 million words. It is open source, and contributions and suggestions are welcome. | |
dc.identifier.citation | D. Chapaev, B. Turapbekov / BUILDING KAZAKH LANGUAGE OPEN SOURCE CORPORA USING WIKIPEDIA RESOURCES / СДУ хабаршысы - 2018 | |
dc.identifier.issn | 2415-8135 | |
dc.identifier.uri | https://repository.sdu.edu.kz/handle/123456789/675 | |
dc.language.iso | en | |
dc.publisher | СДУ хабаршысы - 2018 | |
dc.subject | N-gram, corpus, spell checker, tokenizer. | |
dc.subject | СДУ хабаршысы - 2018 | |
dc.subject | №2 | |
dc.title | BUILDING KAZAKH LANGUAGE OPEN SOURCE CORPORA USING WIKIPEDIA RESOURCES | |
dc.type | Article | |
dspace.entity.type |
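
The abstract describes using a corpus to extract unigrams, bigrams, and n-grams. A minimal Python sketch of that counting step is given below. It is an illustration only and is not taken from the paper: the regex tokenizer, the function names `tokenize` and `ngram_counts`, and the sample sentence are all assumptions, and the authors' actual tokenizer and Wikipedia dump processing may differ.

```python
import re
from collections import Counter

# Assumed tokenizer (not the authors'): runs of Unicode letters,
# which covers Kazakh Cyrillic characters such as қ, і, ә.
TOKEN_RE = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Lowercase the text and split it into letter-only tokens."""
    return TOKEN_RE.findall(text.lower())


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens as a tuple."""
    return Counter(zip(*(tokens[i:] for i in range(n))))


# Hypothetical sample text standing in for a cleaned Wikipedia article.
sample = "қазақ тілі қазақ тілі корпусы"
tokens = tokenize(sample)
unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)

# The most frequent unigrams approximate "the most used words".
print(unigrams.most_common(2))
print(bigrams.most_common(1))
```

Frequency tables of this kind are also a common starting point for a spell checker, where candidate corrections can be ranked by how often they occur in the corpus.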