BUILDING KAZAKH LANGUAGE OPEN SOURCE CORPORA USING WIKIPEDIA RESOURCES

Date

2018

Publisher

СДУ хабаршысы - 2018

Abstract

The lack of a free, publicly accessible Kazakh language corpus is one of the difficulties that Kazakh linguistics researchers face. Corpora are used as a data source in statistical linguistics for extracting unigrams, bigrams, and higher-order n-grams. These data help analyze the structure of the language, for example by identifying the most frequently used words. This paper takes a step towards supporting Kazakh linguistics with an open source corpus built from Wikipedia dumps, and presents one of its applications, a Kazakh spell checker. The corpus currently contains over 21 million words. It is open source, and contributions and suggestions are welcome.
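
As an illustration of how unigram and bigram counts can be derived from corpus text, the following is a minimal sketch and not the authors' implementation. The tokenizer rule (lowercased runs of Unicode letters), the function names, and the sample sentences are all assumptions made for this example; the abstract does not describe the paper's own tokenizer or spell checker.

```python
import re
from collections import Counter

# Assumed tokenizer rule: a token is a run of Unicode letters, lowercased.
# The paper's actual tokenizer is not described in the abstract.
TOKEN_RE = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Split text into lowercase word tokens."""
    return [t.lower() for t in TOKEN_RE.findall(text)]


def ngrams(tokens, n):
    """Return all contiguous n-token tuples from a token list."""
    return list(zip(*(tokens[i:] for i in range(n))))


def count_ngrams(text, n):
    """Count n-grams in text. Sentence boundaries are ignored in this sketch."""
    return Counter(ngrams(tokenize(text), n))


def unknown_words(text, vocab):
    """List tokens that do not occur in the vocabulary (a naive spell check)."""
    return [t for t in tokenize(text) if t not in vocab]


# Hypothetical sample text standing in for a Wikipedia dump extract.
sample = (
    "Қазақ тілі түркі тілдерінің бірі. "
    "Қазақ тілі Қазақстанның мемлекеттік тілі."
)

unigrams = count_ngrams(sample, 1)
bigrams = count_ngrams(sample, 2)
vocab = {word for (word,) in unigrams}

print(unigrams.most_common(3))
print(bigrams.most_common(3))
print(unknown_words("қазақ тлі", vocab))
```

With a real corpus, the same counters would be fed with text extracted from the Wikipedia dump instead of the sample string, and the vocabulary lookup could serve as the first stage of a spell checker that flags words absent from the corpus.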

Keywords

N-gram, corpus, spell checker, tokenizer, СДУ хабаршысы - 2018, №2

Citation

D. Chapaev, B. Turapbekov / BUILDING KAZAKH LANGUAGE OPEN SOURCE CORPORA USING WIKIPEDIA RESOURCES / СДУ хабаршысы - 2018