Summary of recent research on LLMs and NLP for low-resource languages.
# Research on Large Language Models (LLMs) for Low-Resource Languages

## Short Description of Research Question

How do recent studies address the challenges of applying LLMs and other language technologies to low-resource languages, and what improvements do they propose?

## Summary of Work

Recent research on LLMs and NLP for low-resource languages addresses diverse challenges, including data scarcity, linguistic bias, domain specificity, and evaluation dataset creation. Major themes include:
1. Workshop Overview: The LoResLM 2025 workshop showcased 35 papers on linguistic inclusivity in NLP for low-resource languages, spanning multiple language families and research areas.
2. Dataset Creation: Automated annotation pipelines make it practical to build evaluation datasets for semantic search in low-resource, domain-specific language settings.
3. Plagiarism Detection: Combining traditional TF-IDF features with BERT embeddings in a weighted ensemble improves plagiarism detection accuracy in Marathi, demonstrating the benefits of neural methods in regional languages (a minimal sketch of such an ensemble follows this list).
4. Resource Quality & Ethics: Involving native speakers in resource creation supports meaningful, accurate, and ethical development of language resources.
5. Machine Translation: Strategies that address the rare-word problem and augment training data with monolingual corpora improve multilingual MT for low-resource pairs such as French-English-Vietnamese (one common approach, back-translation, is sketched after this list).
6. Model Performance Factors: Beyond data quantity and model size, token similarity and country similarity are critical factors for multilingual language model performance.
7. Speech Recognition: Reviews of Indian spoken language recognition highlight challenges and advances in low-resource multilingual speech processing.
8. Bias Studies: An investigation of social biases (gender, religion) in Bangla LLMs uncovers biases and provides a curated benchmark dataset for bias evaluation.
9. Lexical Annotation: Methodologies for annotating cognates and etymological origins in Turkic languages support automated translation lexicon induction.
10. QA Datasets: QA datasets like KenSwQuAD for Swahili address the need for machine comprehension resources in low-resource languages.
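
A minimal sketch of the weighted TF-IDF + embedding ensemble idea from item 3, assuming scikit-learn and sentence-transformers; the checkpoint name, the 0.5 weight, and the function interface are illustrative assumptions, not the paper's exact setup:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

def plagiarism_score(source: str, suspect: str, weight: float = 0.5) -> float:
    """Combine lexical (TF-IDF) and semantic (embedding) similarity."""
    # Lexical similarity from TF-IDF vectors fitted on the two documents.
    tfidf = TfidfVectorizer().fit([source, suspect])
    vectors = tfidf.transform([source, suspect])
    lexical = cosine_similarity(vectors[0], vectors[1])[0, 0]

    # Semantic similarity from a multilingual sentence-embedding model
    # (any encoder with Marathi coverage; this checkpoint is an assumption).
    encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    embeddings = encoder.encode([source, suspect])
    semantic = cosine_similarity([embeddings[0]], [embeddings[1]])[0, 0]

    # Weighted combination; a threshold on this score flags likely plagiarism.
    return weight * lexical + (1 - weight) * semantic
```

Higher weights favor exact lexical overlap, while lower weights favor paraphrase-level semantic similarity; the weight is a tunable trade-off rather than a fixed constant.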
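
For item 5, one common way to exploit monolingual corpora is back-translation: translating monolingual target-side text back into the source language to create synthetic parallel pairs. The sketch below assumes Hugging Face transformers MarianMT checkpoints; whether the paper uses this exact method is not stated here.

```python
from transformers import MarianMTModel, MarianTokenizer

def back_translate(sentences, model_name="Helsinki-NLP/opus-mt-vi-en"):
    """Translate monolingual Vietnamese sentences into English."""
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    batch = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    outputs = model.generate(**batch)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

# Monolingual Vietnamese sentences become (synthetic English, Vietnamese)
# pairs that can be added to the English-Vietnamese parallel training data.
monolingual_vi = ["Hôm nay trời rất đẹp."]
synthetic_pairs = list(zip(back_translate(monolingual_vi), monolingual_vi))
print(synthetic_pairs)
```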
## Papers

1. "Overview of the First Workshop on Language Models for Low-Resource Languages (LoResLM 2025)" - Hansi Hettiarachchi et al.
2. "Automated Collection of Evaluation Dataset for Semantic Search in Low-Resource Domain Language" - Anastasia Zhukova et al.
3. "Enhancing Plagiarism Detection in Marathi with a Weighted Ensemble of TF-IDF and BERT Embeddings for Low-Resource Language Processing" - Atharva Mutsaddi, Aditya Choudhary
4. "Toward More Meaningful Resources for Lower-resourced Languages" - Constantine Lignos et al.
5. "Improving Multilingual Neural Machine Translation For Low-Resource Languages: French, English - Vietnamese" - Thi-Vinh Ngo et al.
6. "Beyond Data Quantity: Key Factors Driving Performance in Multilingual Language Models" - Sina Bagheri Nezhad et al.
7. "An Overview of Indian Spoken Language Recognition from Machine Learning Perspective" - Spandan Dey et al.
8. "Social Bias in Large Language Models For Bangla: An Empirical Study on Gender and Religious Bias" - Jayanta Sadhu et al.
9. "Annotating Cognates and Etymological Origin in Turkic Languages" - Benjamin S. Mericli, Michael Bloodgood
10. "KenSwQuAD -- A Question Answering Dataset for Swahili Low Resource Language" - Barack W. Wanjawa et al.