Last active
July 14, 2023 14:17
Quantitative Linguistics Examples — https://mybinder.org/v2/gist/willirath/ee78d101296339b97ae2cc5bd2337fd2/HEAD
{ | |
"cells": [ | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "c22818de-13c2-41b1-a4f0-edddcfee82e3", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# %pip install nltk pandas matplotlib numpy" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "59b43f0e-bf57-4b14-ab8d-abb860248cb8", | |
"metadata": {}, | |
"source": [ | |
"# Stats about the use of articles in English and Portugese\n", | |
"\n", | |
"(With many pinches of salt....)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "2d07b9eb-cdd3-4bc1-b4dd-ef56a602181a", | |
"metadata": {}, | |
"source": [ | |
"## Imports, downloads\n", | |
"\n", | |
"Note we first import the downloader and only later import the corpora." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"id": "37e4f03a-09bd-4be7-92a3-6bb99aebd822", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import pandas as pd\n", | |
"import numpy as np" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"id": "cda7a5ff-9c9a-478f-ad32-1ee64d843330", | |
"metadata": { | |
"tags": [] | |
}, | |
"outputs": [], | |
"source": [ | |
"import nltk" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"id": "7e7f7091-1740-417a-bb43-80401f1cb8dc", | |
"metadata": { | |
"tags": [] | |
}, | |
"outputs": [ | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"[nltk_data] Downloading package floresta to /home/jovyan/nltk_data...\n", | |
"[nltk_data] Package floresta is already up-to-date!\n", | |
"[nltk_data] Downloading package brown to /home/jovyan/nltk_data...\n", | |
"[nltk_data] Package brown is already up-to-date!\n" | |
] | |
}, | |
{ | |
"data": { | |
"text/plain": [ | |
"True" | |
] | |
}, | |
"execution_count": 3, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"nltk.download(\"floresta\")\n", | |
"nltk.download(\"brown\")" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"id": "408fd3fa-a93c-4518-a17b-7ce6bf4b8b3f", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"from nltk.corpus import brown, floresta" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "c5e9f115-3056-4b29-99b3-a236af0cbb86", | |
"metadata": {}, | |
"source": [ | |
"## Analysing the use of articles in English and Portugese\n", | |
"\n", | |
"The English _Brown_ corpus tags articles with `\"AT\"`, the Portugese _Floresta_ corpus uses `\"art\"`." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "0d10d66c-2a7e-49b5-b5a1-a2423595fd80", | |
"metadata": {}, | |
"source": [ | |
"### Let's quantify the frequency of articles." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "6260cd93-e2c6-4590-bdd1-71254a346ceb", | |
"metadata": {}, | |
"source": [ | |
"The tagged words are returned as lists of tuples with the first tuple element containing the word and the second element containing the tag. As Python starts counting elements at 0, we want to count, how often the element with the number 1 contains either `\"AT\"` for the _Brown_ corpus or `\"art\"` for the _Floresta_ corpus.\n", | |
"\n", | |
"We do this using a list comprehension mapping each tagged word to either `True` for articles or `False` for all others and then summing and normalizing over all words." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"id": "36e88599-4993-4372-960f-3a15b698f724", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"0.08541912104113704\n" | |
] | |
} | |
], | |
"source": [ | |
"number_words_in_brown = len(brown.words())\n", | |
"number_articles_in_brown = sum((\"AT\" in p[1] for p in brown.tagged_words()))\n", | |
"fraction_articles_in_brown = number_articles_in_brown / number_words_in_brown\n", | |
"print(fraction_articles_in_brown)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"id": "98368377-a708-4969-b1be-10376c14312f", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"0.1385873156732058\n" | |
] | |
} | |
], | |
"source": [ | |
"number_words_in_floresta = len(floresta.words())\n", | |
"number_articles_in_floresta = sum((\"art\" in p[1] for p in floresta.tagged_words()))\n", | |
"fraction_articles_in_floresta = number_articles_in_floresta / number_words_in_floresta\n", | |
"print(fraction_articles_in_floresta)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "468021a5-8494-4f72-b40e-1ea39f2a727d", | |
"metadata": {}, | |
"source": [ | |
"### Using a function\n", | |
"\n", | |
"As we've repeated almost exactly the same code twice (and as we might want to do the same for other copora), we could try and find a better way of re-using this logic. This is what functions are for." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"id": "66ed7699-d117-487a-8d6e-d88920a2b673", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"def fraction_of_articles(corpus, article_tag=None):\n", | |
" number_words = len(corpus.words())\n", | |
" number_articles = sum((article_tag in p[1] for p in corpus.tagged_words()))\n", | |
" fraction_articles = number_articles / number_words\n", | |
" return fraction_articles" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"id": "506d60b5-ceaf-479d-b219-95f01b69d81f", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"0.08541912104113704\n" | |
] | |
} | |
], | |
"source": [ | |
"fraction_articles_in_brown = fraction_of_articles(brown, article_tag=\"AT\")\n", | |
"print(fraction_articles_in_brown)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"id": "c7a0fb35-6e50-4b38-bc98-80276d087a25", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"0.1385873156732058\n" | |
] | |
} | |
], | |
"source": [ | |
"fraction_articles_in_floresta = fraction_of_articles(floresta, article_tag=\"art\")\n", | |
"print(fraction_articles_in_floresta)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "c6cd10b0-ba6d-4535-93a1-7e52c06bc351", | |
"metadata": {}, | |
"source": [ | |
"### Distance between two uses of articles\n", | |
"\n", | |
"Let's do statistics about the typical distance between two subsequent uses of (the same or different) articles." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 10, | |
"id": "79e21e0c-ec6d-4ebc-a7f0-4a3edcc2663e", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"def words_since_last_article(corpus, article_tag=None):\n", | |
" tagged_words = corpus.tagged_words()\n", | |
" distance = 0\n", | |
" for w in tagged_words:\n", | |
" if article_tag in w[1]:\n", | |
" yield distance # will create a generator\n", | |
" distance = 0\n", | |
" else:\n", | |
" distance = distance + 1" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 11, | |
"id": "f6c4f338-538e-46db-bea4-1554974a18b9", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"0\n", | |
"6\n", | |
"8\n" | |
] | |
} | |
], | |
"source": [ | |
"dist = words_since_last_article(brown, article_tag=\"AT\")\n", | |
"print(next(dist))\n", | |
"print(next(dist))\n", | |
"print(next(dist))" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "f6d0d1bb-2831-4917-8336-9b3f662766bf", | |
"metadata": {}, | |
"source": [ | |
"We want to put this into a Pandas datatype which has built in methods for statistics and visualisation:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 12, | |
"id": "19ae948c-795f-4ee5-a9ad-983844a5091c", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"count 99188.000000\n", | |
"mean 10.706920\n", | |
"std 10.745149\n", | |
"min 0.000000\n", | |
"25% 4.000000\n", | |
"50% 7.000000\n", | |
"75% 14.000000\n", | |
"max 387.000000\n", | |
"Name: dist_since_last_article, dtype: float64" | |
] | |
}, | |
"execution_count": 12, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"distances_brown = pd.Series(\n", | |
" words_since_last_article(brown, article_tag=\"AT\"),\n", | |
" name=\"dist_since_last_article\",\n", | |
")\n", | |
"\n", | |
"distances_brown.describe()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 13, | |
"id": "7903e43b-5c54-4a5d-a973-ddc868e875c4", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"count 29360.000000\n", | |
"mean 6.215463\n", | |
"std 8.009240\n", | |
"min 0.000000\n", | |
"25% 3.000000\n", | |
"50% 4.000000\n", | |
"75% 8.000000\n", | |
"max 1045.000000\n", | |
"Name: dist_since_last_article, dtype: float64" | |
] | |
}, | |
"execution_count": 13, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"distances_floresta = pd.Series(\n", | |
" words_since_last_article(floresta, article_tag=\"art\"),\n", | |
" name=\"dist_since_last_article\",\n", | |
")\n", | |
"\n", | |
"distances_floresta.describe()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "4f867fa4-5466-400b-84c2-69b6f0d986eb", | |
"metadata": {}, | |
"source": [ | |
"And some visualisation: We'll look at the quantiles of the distances between use of any article." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 14, | |
"id": "c191f763-dc9e-473c-a86a-4a0d77813d37", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"image/png": "\n", | |
"text/plain": [ | |
"<Figure size 700x300 with 1 Axes>" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"ax = distances_brown.quantile(np.arange(0, 1, 0.1)).plot(\n", | |
" label=\"Brown, EN\", legend=True,\n", | |
" figsize=(7, 3),\n", | |
")\n", | |
"distances_floresta.quantile(np.arange(0, 1, 0.1)).plot(\n", | |
" ax=ax,\n", | |
" label=\"Floresta, PT\", legend=True,\n", | |
" ylabel=\"distance btw. articles\",\n", | |
" xlabel=\"quantile\",\n", | |
" grid=True,\n", | |
");" | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3 (ipykernel)", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.10.6" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 5 | |
} |
matplotlib | |
nltk | |
numpy | |
pandas |