Summary of recent research on LLMs and NLP for low-resource languages.
# Research on Large Language Models (LLMs) for Low Resource Languages
## Short Description of Research Question
How do recent studies address the challenges and improvements of LLMs and other language technologies for low-resource languages?
## Summary of Work
Recent research on LLMs and NLP for low-resource languages addresses diverse challenges including data scarcity, linguistic bias, domain specificity, and evaluation dataset creation. Major themes include:
1. Workshop Overview: The LoResLM 2025 workshop showcased 35 papers focusing on linguistic inclusivity in NLP for low-resource languages across multiple language families and research areas.
2. Dataset Creation: Automated annotation pipelines make it feasible to build evaluation datasets for semantic search in low-resource, domain-specific languages, enabling systematic measurement of retrieval quality.
3. Plagiarism Detection: Combining traditional TF-IDF features with BERT embeddings improves plagiarism detection accuracy in Marathi, demonstrating benefits of neural methods in regional languages.
4. Resource Quality & Ethics: Emphasizing the involvement of native speakers in resource creation ensures meaningful, accurate, and ethical development of language resources.
5. Machine Translation: Strategies that mitigate the rare-word problem and augment training data with monolingual corpora improve multilingual MT for low-resource pairs such as French-Vietnamese and English-Vietnamese.
6. Model Performance Factors: Beyond training-data quantity and model size, token similarity and country similarity emerge as critical factors in multilingual language model performance.
7. Speech Recognition: Reviews of Indian spoken language recognition highlight challenges and advances in low-resource multilingual speech processing.
8. Bias Studies: An empirical study of gender and religious bias in Bangla LLMs documents measurable social biases and contributes a curated benchmark dataset for bias evaluation.
9. Lexical Annotation: Methodologies for annotating cognates and etymological origins in Turkic languages support automated translation lexicon induction.
10. QA Datasets: Development of QA datasets like KenSwQuAD for Swahili addresses the need for machine comprehension resources in low-resource languages.
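To make point 2 concrete, below is a minimal sketch of how an automatically collected semantic-search evaluation set might be scored, using standard retrieval metrics (recall@k and mean reciprocal rank). The function names and toy data are illustrative assumptions, not taken from the paper.

```python
def recall_at_k(ranked, relevant, k=5):
    """Fraction of the relevant doc ids that appear in the top-k ranked results."""
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def mean_reciprocal_rank(results):
    """results: list of (ranked_ids, relevant_ids) pairs, one per query.

    For each query, take 1/rank of the first relevant hit (0 if none),
    then average over all queries.
    """
    total = 0.0
    for ranked, relevant in results:
        rr = 0.0
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(results)

# Toy example: two queries with gold relevance labels (hypothetical data).
queries = [
    (["d1", "d2", "d3"], {"d2"}),  # first relevant hit at rank 2 -> RR = 0.5
    (["d3", "d4"], {"d3"}),        # first relevant hit at rank 1 -> RR = 1.0
]
print(mean_reciprocal_rank(queries))  # averages to 0.75
```

Once such an evaluation set exists, any retrieval system for the domain language can be compared on the same footing, which is the practical payoff the dataset-creation work targets.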
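The weighted-ensemble idea in point 3 can be sketched as a convex combination of a lexical score (TF-IDF cosine similarity) and a semantic score. A real implementation would use BERT sentence embeddings for the semantic side; here a character n-gram vector stands in so the sketch stays self-contained, and the weight `alpha` is an assumed hyperparameter, not a value from the paper.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF vectors (term -> weight dicts) for a list of tokenized docs,
    using smoothed IDF so shared terms do not vanish in tiny corpora."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({t: (c / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1)
                     for t, c in tf.items()})
    return vecs

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def char_ngram_vec(text, n=3):
    """Toy stand-in for a BERT embedding: character trigram counts."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def ensemble_similarity(a, b, alpha=0.5):
    """Weighted ensemble: alpha * lexical score + (1 - alpha) * semantic score."""
    va, vb = tfidf_vectors([a.split(), b.split()])
    lexical = cosine(va, vb)
    semantic = cosine(char_ngram_vec(a), char_ngram_vec(b))
    return alpha * lexical + (1 - alpha) * semantic

print(ensemble_similarity("the quick brown fox", "the quick brown fox"))
```

Tuning `alpha` trades off surface overlap against deeper similarity: a higher weight on the embedding side helps catch paraphrased plagiarism that shares few exact tokens, which is where purely lexical TF-IDF methods fall short.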
## Papers
1. "Overview of the First Workshop on Language Models for Low-Resource Languages (LoResLM 2025)" - Hansi Hettiarachchi et al.
2. "Automated Collection of Evaluation Dataset for Semantic Search in Low-Resource Domain Language" - Anastasia Zhukova et al.
3. "Enhancing Plagiarism Detection in Marathi with a Weighted Ensemble of TF-IDF and BERT Embeddings for Low-Resource Language Processing" - Atharva Mutsaddi, Aditya Choudhary
4. "Toward More Meaningful Resources for Lower-resourced Languages" - Constantine Lignos et al.
5. "Improving Multilingual Neural Machine Translation For Low-Resource Languages: French, English - Vietnamese" - Thi-Vinh Ngo et al.
6. "Beyond Data Quantity: Key Factors Driving Performance in Multilingual Language Models" - Sina Bagheri Nezhad et al.
7. "An Overview of Indian Spoken Language Recognition from Machine Learning Perspective" - Spandan Dey et al.
8. "Social Bias in Large Language Models For Bangla: An Empirical Study on Gender and Religious Bias" - Jayanta Sadhu et al.
9. "Annotating Cognates and Etymological Origin in Turkic Languages" - Benjamin S. Mericli, Michael Bloodgood
10. "KenSwQuAD -- A Question Answering Dataset for Swahili Low Resource Language" - Barack W. Wanjawa et al.