Title: Empowering Kazakh Text Generation: Developing a Meaningful Natural Language Processing Model for Kazakh Language
Author: Karimov Zh.
Date issued: 2024
Date added to repository: 2025-04-25
Citation: Karimov Zh. / Empowering Kazakh Text Generation: Developing a Meaningful Natural Language Processing Model for Kazakh Language / 2024 / 7M06102 - Computer Science
URI: https://repository.sdu.edu.kz/handle/123456789/1721
Language: en
Keywords: NLP; GPU model; Kazakh Language
Type: Thesis

Abstract:
NLP is growing increasingly popular thanks to the rapid development of computing hardware such as GPUs and CPUs. Recent GPU models in particular allow researchers to analyse and process terabytes of text and audio data in a short period of time. Data, however, is not unlimited and has its own threshold, so data augmentation is one technique of data synthesis whose development can benefit science in Kazakhstan, and more specifically the Kazakh language processing research area. This research aimed to develop and compare state-of-the-art uni-directional and bi-directional LSTM architectures to determine which is more effective at generating synthetic Kazakh-language text data. Generated synthetic data can be used in various NLP tasks such as training NLP models, content generation, text classification, and domain adaptation. By generating synthetic data, scientists can overcome data limitations, reduce model overfitting, and open new possibilities for tuning and improving model performance. The architectures used in this research are uni-directional and bi-directional LSTMs. After many training sessions and rounds of parameter tuning, the study shows that the bi-directional LSTM has its own pros and cons relative to the traditional uni-directional LSTM. The bi-directional network learns well and captures the relationships between words better than the uni-directional one, but, as expected, its training takes considerably more time because it extracts information from the input text in both directions. The dataset used in this study consists of 10,000 unique rows of text taken from the Kazakh-language Wikipedia, where each row has a median length of 80 words. It is worth mentioning that, to help the models capture the relationships between words, the study proposes using FastText pre-trained Kazakh word vectors as the embedding layer for both models. This layer turns input words into vectors in which similar words have close vectors. Finally, the research takes a step towards generating synthetic data with which future work can avoid data-limitation problems; researchers can also save time by skipping the model selection problem and relying on this study's model comparison.
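The thesis record does not include the author's code, so the following is only a minimal sketch of the setup the abstract describes: pre-trained FastText Kazakh word vectors used as a frozen embedding layer feeding either a uni-directional or a bi-directional LSTM for next-word prediction. The Keras API choice, the vector file name (cc.kk.300.vec, the publicly released FastText Kazakh vectors), the vocabulary cut-off, the LSTM width, and the sequence length are all assumptions for illustration, not the author's actual configuration.

```python
# Illustrative sketch only; hyperparameters and file paths are assumptions,
# not the configuration used in the thesis.
import numpy as np
from gensim.models import KeyedVectors
from tensorflow.keras import layers, models, Input

# Load pre-trained FastText Kazakh vectors (cc.kk.300.vec from fasttext.cc).
kk_vectors = KeyedVectors.load_word2vec_format("cc.kk.300.vec")
vocab = kk_vectors.index_to_key[:50_000]        # hypothetical vocabulary cut-off
emb_dim = kk_vectors.vector_size                # 300 for the official FastText vectors
emb_matrix = np.vstack([kk_vectors[w] for w in vocab])

def build_generator(bidirectional: bool, seq_len: int = 80) -> models.Model:
    """Next-word prediction model: frozen FastText embeddings + (Bi)LSTM + softmax."""
    recurrent = layers.LSTM(256)
    if bidirectional:
        recurrent = layers.Bidirectional(recurrent)  # reads the sequence in both directions
    model = models.Sequential([
        Input(shape=(seq_len,)),
        layers.Embedding(len(vocab), emb_dim, weights=[emb_matrix], trainable=False),
        recurrent,
        layers.Dense(len(vocab), activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    return model

uni_lstm = build_generator(bidirectional=False)  # faster to train
bi_lstm = build_generator(bidirectional=True)    # better word relationships, slower training
```

In this sketch the embedding weights are frozen (trainable=False) so that the similarity structure of the FastText vectors, in which related Kazakh words map to nearby points, is preserved rather than overwritten during training; the only difference between the two compared models is whether the recurrent layer is wrapped in Bidirectional, which mirrors the comparison reported in the abstract.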