Summary of recent research on LLMs and NLP for low-resource languages.
# Research on Large Language Models (LLMs) for Low Resource Languages
## Short Description of Research Question
How do recent studies address the challenges and improvements of LLMs and other language technologies for low-resource languages?
## Summary of Work
Recent research on LLMs and NLP for low-resource languages addresses diverse challenges including data scarcity, linguistic bias, domain specificity, and evaluation dataset creation. Major themes include:
1. Workshop Overview: The LoResLM 2025 workshop showcased 35 papers focusing on linguistic inclusivity in NLP for low-resource languages across multiple language families and research areas.
2. Dataset Creation: Automated annotation pipelines make it feasible to build evaluation datasets for semantic search in low-resource, domain-specific languages, enabling systematic measurement of retrieval quality.
3. Plagiarism Detection: Combining traditional TF-IDF features with BERT embeddings improves plagiarism detection accuracy in Marathi, demonstrating benefits of neural methods in regional languages.
4. Resource Quality & Ethics: Emphasizing the involvement of native speakers in resource creation ensures meaningful, accurate, and ethical development of language resources.
5. Machine Translation: Strategies that mitigate the rare-word problem and augment training data with monolingual corpora improve multilingual MT for low-resource pairs such as French-Vietnamese and English-Vietnamese.
6. Model Performance Factors: Beyond training-data quantity and model size, token similarity and country similarity emerge as critical factors in multilingual language model performance.
7. Speech Recognition: Reviews of Indian spoken language recognition highlight challenges and advances in low-resource multilingual speech processing.
8. Bias Studies: An empirical study of gender and religious bias in Bangla LLMs documents measurable social biases and contributes a curated benchmark dataset for bias evaluation.
9. Lexical Annotation: Methodologies for annotating cognates and etymological origins in Turkic languages support automated translation lexicon induction.
10. QA Datasets: Development of QA datasets like KenSwQuAD for Swahili addresses the need for machine comprehension resources in low-resource languages.
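To make point 2 concrete, below is a minimal sketch of how an automatically collected semantic-search evaluation set might be scored, using standard retrieval metrics (recall@k and mean reciprocal rank). The function names and toy data are illustrative assumptions, not taken from the paper.

```python
def recall_at_k(ranked, relevant, k=5):
    """Fraction of the relevant doc ids that appear in the top-k ranked results."""
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def mean_reciprocal_rank(results):
    """results: list of (ranked_ids, relevant_ids) pairs, one per query.

    For each query, take 1/rank of the first relevant hit (0 if none),
    then average over all queries.
    """
    total = 0.0
    for ranked, relevant in results:
        rr = 0.0
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(results)

# Toy example: two queries with gold relevance labels (hypothetical data).
queries = [
    (["d1", "d2", "d3"], {"d2"}),  # first relevant hit at rank 2 -> RR = 0.5
    (["d3", "d4"], {"d3"}),        # first relevant hit at rank 1 -> RR = 1.0
]
print(mean_reciprocal_rank(queries))  # averages to 0.75
```

Once such an evaluation set exists, any retrieval system for the domain language can be compared on the same footing, which is the practical payoff the dataset-creation work targets.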
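The weighted-ensemble idea in point 3 can be sketched as a convex combination of a lexical score (TF-IDF cosine similarity) and a semantic score. A real implementation would use BERT sentence embeddings for the semantic side; here a character n-gram vector stands in so the sketch stays self-contained, and the weight `alpha` is an assumed hyperparameter, not a value from the paper.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF vectors (term -> weight dicts) for a list of tokenized docs,
    using smoothed IDF so shared terms do not vanish in tiny corpora."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({t: (c / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1)
                     for t, c in tf.items()})
    return vecs

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def char_ngram_vec(text, n=3):
    """Toy stand-in for a BERT embedding: character trigram counts."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def ensemble_similarity(a, b, alpha=0.5):
    """Weighted ensemble: alpha * lexical score + (1 - alpha) * semantic score."""
    va, vb = tfidf_vectors([a.split(), b.split()])
    lexical = cosine(va, vb)
    semantic = cosine(char_ngram_vec(a), char_ngram_vec(b))
    return alpha * lexical + (1 - alpha) * semantic

print(ensemble_similarity("the quick brown fox", "the quick brown fox"))
```

Tuning `alpha` trades off surface overlap against deeper similarity: a higher weight on the embedding side helps catch paraphrased plagiarism that shares few exact tokens, which is where purely lexical TF-IDF methods fall short.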
## Papers
1. "Overview of the First Workshop on Language Models for Low-Resource Languages (LoResLM 2025)" - Hansi Hettiarachchi et al.
2. "Automated Collection of Evaluation Dataset for Semantic Search in Low-Resource Domain Language" - Anastasia Zhukova et al.
3. "Enhancing Plagiarism Detection in Marathi with a Weighted Ensemble of TF-IDF and BERT Embeddings for Low-Resource Language Processing" - Atharva Mutsaddi, Aditya Choudhary
4. "Toward More Meaningful Resources for Lower-resourced Languages" - Constantine Lignos et al.
5. "Improving Multilingual Neural Machine Translation For Low-Resource Languages: French, English - Vietnamese" - Thi-Vinh Ngo et al.
6. "Beyond Data Quantity: Key Factors Driving Performance in Multilingual Language Models" - Sina Bagheri Nezhad et al.
7. "An Overview of Indian Spoken Language Recognition from Machine Learning Perspective" - Spandan Dey et al.
8. "Social Bias in Large Language Models For Bangla: An Empirical Study on Gender and Religious Bias" - Jayanta Sadhu et al.
9. "Annotating Cognates and Etymological Origin in Turkic Languages" - Benjamin S. Mericli, Michael Bloodgood
10. "KenSwQuAD -- A Question Answering Dataset for Swahili Low Resource Language" - Barack W. Wanjawa et al.