wordcloud of the keywords
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

# Read the collected abstracts from file
with open('abstracts.txt', 'r') as f:
    text = f.read()

# Extend the default stopword list with terms that would otherwise dominate the cloud
my = {'multi', 'expressions', 'word', 'MWE'}
stops = set(STOPWORDS).union(my)

# Generate word cloud
wordcloud = WordCloud(width=800, height=400, background_color="white", stopwords=stops).generate(text)

# Display the word cloud
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")  # Hide the axes
plt.show()
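If the image is also needed on disk, the cloud can be written out directly with the wordcloud API; the filename below is just an example:

# Optionally save the rendered cloud to an image file (same pixel size as the canvas)
wordcloud.to_file("keywords_cloud.png")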
The Irish language has been deemed ‘definitely endangered’ (Moseley, 2012) and has been classified as having ‘weak or no support’ (Lynn, 2023) regarding digital resources in spite of its status as the first official and national language of the Republic of Ireland. This research develops the first named entity recognition (NER) tool for the Irish language, one of the essential tasks identified by the Digital Plan for Irish (Ní Chasaide et al., 2022). In this study, we produce a small gold-standard NER-annotated corpus and compare both monolingual and multilingual BERT models fine-tuned on this task. We experiment with different model architectures and low-resource language approaches to enrich our dataset. We test our models on a mix of single- and multi-word named entities as well as a specific multi-word named entity test set. Our proposed gaBERT model with the implementation of random data augmentation and a conditional random fields layer demonstrates significant performance improvements over baseline models, alternative architectures, and multilingual models, achieving an F1 score of 76.52. This study contributes to advancing Irish language technologies and supporting Irish language digital resources, providing a basis for Irish NER and identification of other MWE types.
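The setup described here, a BERT-style encoder with a token-classification head for NER, looks roughly like the sketch below in Hugging Face transformers. This is not the authors' code: the gaBERT checkpoint id and the label set are assumptions, the classification head is untrained as loaded, and the paper's CRF layer and data augmentation are omitted.

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]  # assumed tag set
model_name = "DCU-NLP/bert-base-irish-cased-v1"  # assumed gaBERT checkpoint id

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(labels))

# One Irish sentence; in the paper the head is fine-tuned on a gold NER corpus first
inputs = tokenizer("Rugadh Seán i mBaile Átha Cliath.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # (1, seq_len, num_labels)
for token, pred in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist()),
                       logits.argmax(dim=-1)[0]):
    print(token, labels[int(pred)])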
Language models are able to handle compositionality and, to some extent, non-compositional phenomena such as semantic idiosyncrasy, a feature most prominent in the case of idioms. This work presents the MultiCoPIE corpus that includes potentially idiomatic expressions as well as non-ambiguous idioms in Catalan, Italian, and Russian, extending language coverage of PIE corpus data. The new corpus provides additional information on idiom features, such as their semantic transparency, part-of-speech of idiom head as well as their idiomatic counterparts in English. With this new resource at hand, we first fine-tune an XLM-RoBERTa model to classify figurative and literal usage of potentially idiomatic expressions in English and then study cross-lingual transfer to the languages represented in MultiCoPIE to evaluate the model’s capability to generalize an idiom-related task to other languages not observed during fine-tuning. We show the effect of ‘cross-lingual lexical overlap’: the performance of the model, fine-tuned on English idiomatic expressions and tested on the MultiCoPIE languages, increases significantly when classifying idiomatic expressions that have direct counterparts in English. While this may raise concerns regarding cross-lingual learning and generalization, the outcomes of the experiment on PIEs without English equivalents provide clear evidence of cross-lingual transfer.
This paper presents the construction of VIDiom-PT, a corpus in European Portuguese annotated for verbal idioms (e.g. O Rui bateu a bota, lit.: Rui hit the boot, 'Rui died'). This linguistic resource aims to support the development of systems capable of processing such constructions in this language variety. To assist in the annotation effort, two tools were built. The first allows for the detection of possible instances of verbal idioms in texts, while the second provides a graphical interface for annotating them. This effort culminated in the annotation of a total of 5,178 instances of 747 different verbal idioms in more than 200,000 sentences in European Portuguese. A highly reliable inter-annotator agreement was achieved, using Krippendorff's alpha for nominal data (0.869) with 5% of the data independently annotated by 3 experts. Part of the annotated corpus is also made publicly available.
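Krippendorff's alpha for nominal data, the agreement measure reported here, can be computed with the third-party krippendorff package; a toy sketch with invented annotations (three annotators, six items, np.nan marking a missing judgment), not the paper's data:

import numpy as np
import krippendorff

# Rows = annotators, columns = annotated instances; codes are nominal categories
reliability_data = np.array([
    [1, 0, 1, 1, 0, np.nan],
    [1, 0, 1, 0, 0, 1],
    [1, 0, 1, 1, 0, 1],
])
alpha = krippendorff.alpha(reliability_data=reliability_data,
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha = {alpha:.3f}")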
Lexica of MWEs have always been a valuable resource for various NLP tasks. This paper presents the results of a comprehensive survey on multiword lexical resources that extends a previous one from 2016 to the present. We analyze a diverse set of lexica across multiple languages, reporting on aspects such as creation date, intended usage, languages covered and linguality type, content, acquisition method, accessibility, and linkage to other language resources. Our findings highlight trends in MWE lexicon development focusing on the representation level of languages. This survey aims to support future efforts in creating MWE lexica for NLP applications by identifying gaps and opportunities.
Multiword expressions pose numerous challenges to most NLP tasks, and so do their compositionality and semantic ambiguity. The need for resources that make it possible to explore such phenomena is rather pressing, even more so in the case of low-resource languages. In this paper, we present a dataset of noun-adjective compounds in Galician with compositionality scores at token level. These MWEs are ambiguous due to being potentially idiomatic expressions, as well as due to the ambiguity and productivity of their constituents. The dataset comprises 240 MWEs that amount to 322 senses, which are contextualized in two sets of sentences, one manually created and one extracted from corpora, totaling 1,858 examples. For this dataset, we gathered human judgments on compositionality levels for compounds, heads, and modifiers. Furthermore, we obtained frequency, ambiguity, and productivity data for compounds and their constituents, and we explored potential correlations between mean compositionality scores and these three properties in terms of compounds, heads, and modifiers. This valuable resource helps evaluate language models on (non-)compositionality and ambiguity, key challenges in NLP, and is especially relevant for Galician, a low-resource variety lacking annotated datasets for such linguistic phenomena.
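The correlation analysis mentioned at the end can be sketched with scipy; the coefficient choice (Spearman) and all values below are assumptions for illustration only, not the paper's data or method:

from scipy.stats import spearmanr

# Invented values: mean compositionality score and log frequency for six compounds
mean_compositionality = [0.9, 0.2, 0.7, 0.4, 0.8, 0.3]
log_frequency = [3.1, 5.2, 2.8, 4.9, 3.5, 5.0]

rho, p_value = spearmanr(mean_compositionality, log_frequency)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")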
Idiom corpora typically include both idiomatic and literal examples of potentially idiomatic expressions, but creating such corpora traditionally requires substantial expert effort and cost. In this article, we explore the use of large language models (LLMs) to generate synthetic idiom corpora as a more time- and cost-efficient alternative. We evaluate the effectiveness of synthetic data in training task-specific models and testing GPT-4 in a few-shot prompting setting with synthetic data for idiomaticity detection. Our findings reveal that although models trained on synthetic data perform worse than those trained on human-generated data, synthetic data generation offers considerable advantages in terms of cost and time. Specifically, task-specific idiomaticity detection models trained on synthetic data outperform the general-purpose LLM that generated the data when evaluated in a zero-shot setting, achieving an average improvement of 11 percentage points across four languages. Moreover, synthetic data enhances the LLM's performance, enabling it to match the task-specific models trained with synthetic data when few-shot prompting is applied.
UD_Greek-GUD (GUD) is the most recent Universal Dependencies (UD) treebank for Standard Modern Greek (SMG) and the first SMG UD treebank to annotate Verbal Multiword Expressions (VMWE). GUD contains material from fiction texts and various sites that use colloquial SMG. We describe the special annotation decisions we implemented with GUD, the pipeline we developed to facilitate the active annotation of new material, and we report on the method we designed to evaluate the performance of models trained on GUD as regards VMWE identification tasks.
This study investigates the internal representations of verb-particle combinations, called multi-word verbs, within transformer-based large language models (LLMs), specifically examining how these models capture lexical and syntactic properties at different neural network layers. Using the BERT architecture, we analyze the representations of its layers for two different verb-particle constructions: phrasal verbs like "give up" and prepositional verbs like "look at". Our methodology includes training probing classifiers on the model output to classify these categories at both word and sentence levels. The results indicate that the model’s middle layers achieve the highest classification accuracies. To further analyze the nature of these distinctions, we conduct a data separability test using the Generalized Discrimination Value (GDV). While GDV results show weak linear separability between the two verb types, probing classifiers still achieve high accuracy, suggesting that representations of these linguistic categories may be "non-linearly separable". This aligns with previous research indicating that linguistic distinctions in neural networks are not always encoded in a linearly separable manner. These findings computationally support usage-based claims on the representation of verb-particle constructions and highlight the complex interaction between neural network architectures and linguistic structures.
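Layer-wise probing of the kind described here can be sketched as follows; the base checkpoint, mean pooling, and the linear probe are assumptions, and the four sentences stand in for the paper's dataset:

import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

sentences = ["She gave up smoking.", "He looked at the painting.",
             "They put off the meeting.", "We listened to the radio."]
labels = [0, 1, 0, 1]  # 0 = phrasal verb, 1 = prepositional verb (toy labels)

# For each sentence, mean-pool every hidden layer into one vector per layer
per_sentence = []
with torch.no_grad():
    for s in sentences:
        hidden = model(**tokenizer(s, return_tensors="pt")).hidden_states
        per_sentence.append([h.mean(dim=1).squeeze(0) for h in hidden])

# Train one linear probe per layer and report (training) accuracy
for layer in range(len(per_sentence[0])):
    X = torch.stack([feats[layer] for feats in per_sentence]).numpy()
    probe = LogisticRegression(max_iter=1000).fit(X, labels)
    print(f"layer {layer:2d}: accuracy {probe.score(X, labels):.2f}")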
This paper presents an analysis of the syntagmatic productivity (SynProd) of different classes of multiword expressions (MWEs) in English scientific writing over time (mid-17th to 20th century). SynProd refers to the variability of the syntagmatic context in which a word or other kind of linguistic unit is used. To measure SynProd, we use entropy. The study reveals that, similar to single-token units of various parts of speech, MWEs exhibit an increasing trend in syntagmatic productivity over time, particularly after the mid-19th century. Furthermore, when compared to similar parts of speech (PoS), MWEs show a more pronounced increase in SynProd over time.
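Entropy as a productivity measure reduces to Shannon entropy over a distribution of observed contexts; a minimal sketch, where treating "context" as the immediately following word is an assumption about the operationalization:

from collections import Counter
from math import log2

def syntagmatic_entropy(contexts):
    """Shannon entropy (in bits) of a list of observed contexts."""
    counts = Counter(contexts)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

# Toy example: the words observed right after a given MWE in a corpus
print(syntagmatic_entropy(["of", "of", "in", "to", "of", "with"]))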
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Keywords of the papers, one line per paper
text = """
named entity recognition, named entities, multi-word expressions, BERT, data augmentation, low-resource language
language resources, idiomatic expressions, language models, cross-lingual transfer
Verbal Idioms, European Portuguese, corpus, Lexicon-Grammar
multiword expressions, mwe-aware lexicons, MWE-dedicated lexicons, representativeness, linguistic diversity
multiword expressions, compositionality, semantic ambiguity, Galician
idiomaticity detection, synthetic data generation, large language models
Standard Modern Greek treebank, VMWE identification
Multi-word verbs, Probing classifiers, Data separability
multi-word expressions, syntagmatic productivity, entropy
"""

# Exclude generic MWE vocabulary so it does not dominate the cloud
stops = {'multi', 'expressions', 'word'}

# Generate word cloud
wordcloud = WordCloud(width=800, height=400, background_color="white", stopwords=stops).generate(text)

# Display the word cloud
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")  # Hide the axes
plt.show()
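A note on the difference from the first script: WordCloud accepts any iterable of stopwords, so a plain list would behave the same as the set used here or the STOPWORDS union above; the substantive difference is only that the text comes inline rather than from abstracts.txt, and that no default English stopwords are needed for a keyword list.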