BUILDING KAZAKH LANGUAGE OPEN SOURCE CORPORA USING WIKIPEDIA RESOURCES

Chapaev D.; Turapbekov B.

Files

2018.2-154-160.pdf (633.25 KB)

Date

2018

Authors

Chapaev D.

Turapbekov B.

Publisher

СДУ хабаршысы - 2018

Abstract

Abstract. The lack of a free, publicly accessible Kazakh language corpus is one of the difficulties that Kazakh linguistics researchers face. Corpora are used as a data source in statistical linguistics for the detection of unigrams, bigrams and, more generally, n-grams. These data help analyze the structure of the language, find the most frequently used words, and so on. This paper takes a step towards supporting Kazakh linguistics with an open source corpus built from Wikipedia dumps, and presents one of its applications, a Kazakh spell checker. The corpus currently contains over 21 million words. It is open source, and contributions and suggestions are welcome.
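
As an illustration of the n-gram detection the abstract refers to, the following is a minimal sketch of counting unigrams and bigrams in tokenized Kazakh text. It is not the authors' implementation: the names `tokenize` and `ngram_counts`, the letter-run tokenization rule, and the sample sentence are all assumptions made for this example.

```python
import re
from collections import Counter

# Assumption: a token is any run of Unicode letters, which covers the
# Kazakh Cyrillic letters (ә, ғ, қ, ң, ө, ұ, ү, һ, і) as well as Latin.
TOKEN_RE = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Split text into lowercase word tokens (hypothetical helper)."""
    return [token.lower() for token in TOKEN_RE.findall(text)]


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens, returned as tuples."""
    return Counter(zip(*(tokens[i:] for i in range(n))))


# Illustrative sample text, not taken from the corpus.
sample = (
    "Қазақ тілі түркі тілдеріне жатады. "
    "Қазақ тілі Қазақстанның мемлекеттік тілі."
)

tokens = tokenize(sample)
unigrams = ngram_counts(tokens, 1)
# Note: this simple sketch lets bigrams span sentence boundaries.
bigrams = ngram_counts(tokens, 2)

print(unigrams.most_common(3))
print(bigrams.most_common(3))
```

On a real corpus the same counters could be filled incrementally, one article at a time, so that the full text never has to be held in memory.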

Keywords

N-gram, corpus, spell checker, tokenizer, СДУ хабаршысы - 2018, №2

Citation

D. Chapaev, B. Turapbekov / BUILDING KAZAKH LANGUAGE OPEN SOURCE CORPORA USING WIKIPEDIA RESOURCES / СДУ хабаршысы - 2018

URI

https://repository.sdu.edu.kz/handle/123456789/675

Collections

3. Articles and Papers
