A SYSTEMATIC LITERATURE REVIEW OF NATURAL LANGUAGE PROCESSING FOR INDONESIAN REGIONAL LANGUAGES

Authors

  • Hendri Ahmadian, Universitas Islam Negeri Ar-Raniry

DOI:

https://doi.org/10.22373/jintech.v7i1.9797

Keywords:

Natural Language Processing, Indonesian Regional Languages, Systematic Literature Review, Large Language Models

Abstract

This systematic literature review (SLR) investigates the evolution of Natural Language Processing (NLP) for Indonesian regional languages from 2020 to 2025. Analyzing 13 pivotal studies, the review identifies a significant transition from fragmented studies of high-population languages, such as Sundanese and Madurese, toward inclusive, archipelago-wide frameworks covering low-resource languages like Acehnese and Nias. Architecturally, the field has progressed from classical machine learning to Transformer-based Large Language Models (LLMs), including IndoBART and GPT. Furthermore, data provenance has evolved from unstructured social media corpora to standardized multilingual benchmarks like NusaX and NusaCrowd. Despite these advancements, gaps in data standardization and large-scale pretraining resources persist. Future research should prioritize cross-lingual transfer learning and specialized benchmarks to ensure the technological sustainability of Indonesia’s diverse linguistic heritage.

References

Cahyawijaya, S., Lovenia, H., Aji, A. F., Winata, G. I., Wilie, B., Mahendra, R., Wibisono, C., Romadhony, A., Vincentio, K., Koto, F., Santoso, J., Moeljadi, D., Hudi, C. W. and F., Parmonangan, I. H., Alfina, I., Wicaksono, M. S., Putra, I. F., Oenang, S. R. and Y., Septiandri, A. A., … Purwarianti, A. (2022). NusaCrowd: Open Source Initiative for Indonesian NLP Resources.

Cahyawijaya, S., Lovenia, H., Koto, F., Adhista, D., Dave, E., Oktavianti, S., Akbar, S., Lee, J., Shadieq, N., Cenggoro, T. W., Linuwih, H., Wilie, B., Muridan, G., Winata, G., Moeljadi, D., Aji, A. F., Purwarianti, A., & Fung, P. (2023). NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages. Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), 921–945. https://doi.org/10.18653/v1/2023.ijcnlp-main.60

Cahyawijaya, S., Winata, G. I., Wilie, B., Vincentio, K., Li, X., Kuncoro, A., Ruder, S., Lim, Z. Y., Bahar, S., Khodra, M. L., Purwarianti, A., & Fung, P. (2021). IndoNLG: Benchmark and Resources for Evaluating Indonesian Natural Language Generation. http://arxiv.org/abs/2104.08200

Cohn, A. C., & Ravindranath, M. (2014). Local languages in Indonesia: Language maintenance or language shift. Linguistik Indonesia, 32(2), 131–148.

Dewi, N. P., & Ubaidi, U. (2020). Pos tagging Bahasa Madura dengan menggunakan algoritma Brill tagger. Jurnal Teknologi Informasi Dan Ilmu Komputer, 7(6), 1121–1128.

Joshi, P., Santy, S., Budhiraja, A., Bali, K., & Choudhury, M. (2020). The State and Fate of Linguistic Diversity and Inclusion in the NLP World. In D. Jurafsky, J. Chai, N. Schluter, & J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 6282–6293). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.560

Koto, F., & Koto, I. (2020). Towards Computational Linguistics in Minangkabau Language: Studies on Sentiment Analysis and Machine Translation. In M. Le Nguyen, M. C. Luong, & S. Song (Eds.), Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation (pp. 138–148). Association for Computational Linguistics. https://aclanthology.org/2020.paclic-1.17/

Novitasari, S., Tjandra, A., Sakti, S., & Nakamura, S. (2020). Cross-Lingual Machine Speech Chain for Javanese, Sundanese, Balinese, and Bataks Speech Recognition and Synthesis. In D. Beermann, L. Besacier, S. Sakti, & C. Soria (Eds.), Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL) (pp. 131–138). European Language Resources Association. https://aclanthology.org/2020.sltu-1.18/

Nugraha, A. B., & Romadhony, A. (2023). Identification of 10 Regional Indonesian Languages Using Machine Learning. Sinkron: Jurnal Dan Penelitian Teknik Informatika, 7(4), 2203–2214.

Purwarianti, A., Adhista, D., Baptiso, A., Mahfuzh, M., Sabila, Y., Adila, A., Cahyawijaya, S., & Aji, A. F. (2025). NusaDialogue: Dialogue Summarization and Generation for Underrepresented and Extremely Low-Resource Languages. Proceedings of the Second Workshop in South East Asian Language Processing, 82–100. https://aclanthology.org/2025.sealp-1.8/

Putra, O. V., Wasmanson, F. M., Harmini, T., & Utama, S. N. (2020). Sundanese Twitter Dataset for Emotion Classification. 2020 International Conference on Computer Engineering, Network, and Intelligent Multimedia (CENIM), 391–395. https://doi.org/10.1109/CENIM51130.2020.9297929

Putri, S. D. A., Ibrohim, M. O., & Budi, I. (2021). Abusive Language and Hate Speech Detection for Indonesian-Local Language in Social Media Text. In P. Meesad, S. Sodsee, W. Jitsakul, & S. Tangwannawit (Eds.), Recent Advances in Information and Communication Technology 2021 (pp. 88–98). Springer International Publishing.

Sujaini, H., & Putra, A. B. (2024). Analysis of language identification algorithms for regional Indonesian languages. IAES International Journal of Artificial Intelligence (IJ-AI), 13(2), 1741.

Sulistyo, D. A., Wibawa, A. P., Prasetya, D. D., & Ahda, F. A. (2023). LSTM-based machine translation for Madurese-Indonesian. Journal of Applied Data Sciences, 4(3), 189–199.

Wilie, B., Vincentio, K., Winata, G. I., Cahyawijaya, S., Li, X., Lim, Z. Y., Soleman, S., Mahendra, R., Fung, P., Bahar, S., & Purwarianti, A. (2020). IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding. Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, 843–857. https://doi.org/10.18653/v1/2020.aacl-main.85

Winata, G. I., Aji, A. F., Cahyawijaya, S., Mahendra, R., Koto, F., Romadhony, A., Kurniawan, K., Moeljadi, D., Prasojo, R. E., Fung, P., Baldwin, T., Lau, J. H., Sennrich, R., & Ruder, S. (2023). NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages. Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, 815–834. https://doi.org/10.18653/v1/2023.eacl-main.57

Wongso, W., Lucky, H., & Suhartono, D. (2022). Pre-trained transformer-based language models for Sundanese. Journal of Big Data, 9(1), 39. https://doi.org/10.1186/s40537-022-00590-7

Wongso, W., Setiawan, D. S., Limcorn, S., & Joyoadikusumo, A. (2025). NusaBERT: Teaching IndoBERT to be Multilingual and Multicultural. In D. Wijaya, A. F. Aji, C. Vania, G. I. Winata, & A. Purwarianti (Eds.), Proceedings of the Second Workshop in South East Asian Language Processing (pp. 10–26). Association for Computational Linguistics. https://aclanthology.org/2025.sealp-1.2/

Published

2026-02-26