Barlybay K.
2024-12-12
2024-12-12
2023
https://repository.sdu.edu.kz/handle/123456789/1575
Legal text processing in the Kazakh language is currently underserved: the language of the domain is specialized, and few computational resources are dedicated to it. This thesis addresses the need for an efficient model that can process, understand, and generate meaningful insights from Kazakh legal texts. As a solution, it develops and evaluates bespoke language models pre-trained on a large corpus of Kazakh legal documents. The study begins with the assembly of a corpus of over 315 million words from Kazakh legal texts, alongside a benchmark dataset of 2500 multiple-choice questions for civil service examinations in Kazakhstan. Three language models based on the BERT architecture are then pre-trained, one of them entirely from scratch. To emulate a real-world application in the legal domain, the models are assessed on the multiple-choice question-answering task. The BERT base model pre-trained from scratch with both the Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) tasks achieves an accuracy of 56.11%. This result underlines the potential of custom pre-training strategies on domain-specific corpora for improving the performance of language models in specialized areas. In conclusion, this research represents a significant advance in using AI for legal text processing in the Kazakh language. It presents a promising solution to the problem, paving the way for more efficient and better-informed decision-making in legal and civil service settings.
en
using AI, text processing, Kazakh language, civil service settings
Question Answering system on Regulatory Documents
Other