LLM-Assisted Weak Supervision for Low-Resource Kazakh Sequence Labeling: Synthetic Annotation and CRF-Refined NER/POS Models

Aigerim Aitim

doi:10.3390/app16083632

Journal article · Open Access · Apr 08, 2026

Applied Sciences Vol. 16 No. 8 pp. 3632 · MDPI AG

Abstract

Kazakh sequence labeling is constrained by limited annotated resources, while its agglutinative morphology and productive suffixation increase data sparsity and exacerbate label inconsistency in part-of-speech (POS) tagging and named entity recognition (NER). This paper proposes an LLM-assisted weak supervision framework in which a large language model generates synthetic token-level annotations that are subsequently filtered using confidence-based criteria and combined with a smaller manually verified subset to train Transformer-based sequence taggers with Conditional Random Field (CRF) decoding. The pipeline unifies corpus construction, weak-label generation, quality filtering, word-to-subword alignment, and CRF-refined structured prediction into a reproducible workflow. Experimental results show that contextual encoders and structured decoding provide strong performance for Kazakh POS and NER, while the proposed training design enables efficient convergence with diminishing returns beyond moderate epoch budgets. Error-slice analysis indicates that residual errors are concentrated in rare tokens, morphologically complex long words, longer sentences, and the ORG entity class. Overall, the findings support the use of LLM-assisted weak supervision as a scalable strategy for low-resource Kazakh sequence labeling when synthetic labels are controlled through filtering and refined by structured decoding.
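Two of the pipeline stages named in the abstract, confidence-based filtering and word-to-subword alignment, can be illustrated with a minimal sketch. The abstract does not specify the filtering criterion, the threshold, or the alignment rule, so everything below is an assumption made for illustration: the sentence-level minimum-confidence rule, the 0.9 threshold, the label set, the example tokens and scores, the hand-written `word_ids` list, and the function names `filter_by_confidence` and `align_labels` are all hypothetical. Labeling only the first subword of each word and masking the rest with -100 is one common convention, not necessarily the one used in the paper.

```python
from typing import Dict, List, Optional, Tuple

# Assumed convention: positions carrying this label are skipped by the loss.
IGNORE_INDEX = -100

# One weakly labeled token: (word, label, confidence score in [0, 1]).
WeakToken = Tuple[str, str, float]


def filter_by_confidence(
    sentences: List[List[WeakToken]], threshold: float = 0.9
) -> List[List[WeakToken]]:
    """Keep a sentence only if its least confident token meets the threshold.

    This sentence-level minimum rule is an assumption; averaging scores or
    filtering token by token would be equally plausible choices.
    """
    kept = []
    for tokens in sentences:
        if tokens and min(conf for _, _, conf in tokens) >= threshold:
            kept.append(tokens)
    return kept


def align_labels(
    word_labels: List[str],
    word_ids: List[Optional[int]],
    label_to_id: Dict[str, int],
) -> List[int]:
    """Project word-level labels onto subword positions.

    word_ids maps each subword position to the index of its source word,
    with None for special tokens. The first subword of each word receives
    the word's label id; special tokens and continuation subwords are masked.
    """
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None:
            aligned.append(IGNORE_INDEX)  # special token
        elif wid != previous:
            aligned.append(label_to_id[word_labels[wid]])  # first subword
        else:
            aligned.append(IGNORE_INDEX)  # continuation subword
        previous = wid
    return aligned


# Made-up weak annotations with made-up confidence scores.
weak_sentences = [
    [("Абай", "B-PER", 0.97), ("Алматыда", "B-LOC", 0.93), ("тұрады", "O", 0.99)],
    [("ҚазМұнайГаз", "B-ORG", 0.55), ("хабарлады", "O", 0.98)],
]
label_to_id = {"O": 0, "B-PER": 1, "B-LOC": 2, "B-ORG": 3}

kept = filter_by_confidence(weak_sentences, threshold=0.9)

# Hypothetical segmentation: special token, one piece for the first word,
# two pieces for the second word, one piece for the third, special token.
example_word_ids = [None, 0, 1, 1, 2, None]
example_labels = [label for _, label, _ in kept[0]]
aligned = align_labels(example_labels, example_word_ids, label_to_id)
print(len(kept), aligned)
```

In this toy run the second sentence is discarded because its ORG token falls below the threshold. The aligned label sequence has one entry per subword position, which is the shape a subword-level tagger consumes before a CRF layer decodes over the unmasked positions.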

Details

Published: Apr 08, 2026
Vol/Issue: 16(8)
Pages: 3632

Authors

Aigerim Aitim

Department of Information Systems, International Information Technology University, Manas str. 34/1, Almaty 050000, Kazakhstan

Funding

Ministry of Culture and Information of the Republic of Kazakhstan. Award: Tauelsizdik Urpaktary–2025

Cite This Article

Aigerim Aitim (2026). LLM-Assisted Weak Supervision for Low-Resource Kazakh Sequence Labeling: Synthetic Annotation and CRF-Refined NER/POS Models. Applied Sciences, 16(8), 3632. https://doi.org/10.3390/app16083632
