Last active
May 19, 2020 12:42
WSID with pre-trained Language Model
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 0. Preparations\n",
    "\n",
    "## Virtual Environment\n",
    "Create a virtual environment and install the package by running the following commands in a terminal:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "virtualenv --python=python3 ~/<venv folder>\n",
    "source ~/<venv folder>/bin/activate\n",
    "pip install -e git+https://github.com/semantic-web-company/ptlm_wsid.git#egg=ptlm_wsid"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## NLTK and spaCy\n",
    "Next, download the NLTK and spaCy data we need:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "python -m nltk.downloader punkt stopwords averaged_perceptron_tagger wordnet\n",
    "python -m spacy download en_core_web_sm"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## IPython\n",
    "Next, install IPython and run all subsequent commands in the IPython shell:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pip install ipython\n",
    "ipython"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1. Execution\n",
    "First, import the required modules and define a helper function that locates the target word in each context:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "from typing import List\n",
    "\n",
    "import ptlm_wsid.target_context as tc\n",
    "import ptlm_wsid.generative_factors as gf\n",
    "\n",
    "\n",
    "def prepare_target_contexts(cxt_strs: List[str],\n",
    "                            target_word: str,\n",
    "                            verbose: bool = True) -> List[tc.TargetContext]:\n",
    "    \"\"\"\n",
    "    Build a case-insensitive regex from the target word and search for it in\n",
    "    each context string. The start and end indices of the first match are\n",
    "    used to produce a TargetContext.\n",
    "\n",
    "    :param cxt_strs: list of context strings\n",
    "    :param target_word: the target word (interpreted as a regex pattern)\n",
    "    :param verbose: if True, also print the top predictions for each context\n",
    "    :return: one TargetContext per context string\n",
    "    :raises ValueError: if the target word is not found in a context\n",
    "    \"\"\"\n",
    "    tcs = []\n",
    "    for cxt_str in cxt_strs:\n",
    "        re_match = re.search(target_word, cxt_str, re.IGNORECASE)\n",
    "        if re_match is None:\n",
    "            raise ValueError(f'In \"{cxt_str}\" the target '\n",
    "                             f'\"{target_word}\" was not found')\n",
    "        start_ind, end_ind = re_match.start(), re_match.end()\n",
    "        new_tc = tc.TargetContext(\n",
    "            context=cxt_str, target_start_end_inds=(start_ind, end_ind))\n",
    "        if verbose:\n",
    "            top_predictions = new_tc.get_topn_predictions()\n",
    "            print(f'Predictions for {target_word} in {cxt_str}: '\n",
    "                  f'{top_predictions}')\n",
    "        tcs.append(new_tc)\n",
    "    return tcs\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now induce the senses from the example contexts and print them:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cxts_dicts = {\n",
    "    1: \"The jaguar's present range extends from Southwestern United States and Mexico in North America, across much of Central America, and south to Paraguay and northern Argentina in South America.\",\n",
    "    2: \"Overall, the jaguar is the largest native cat species of the New World and the third largest in the world.\",\n",
    "    3: \"Given its historical distribution, the jaguar has featured prominently in the mythology of numerous indigenous American cultures, including those of the Maya and Aztec.\",\n",
    "    4: \"The jaguar is a compact and well-muscled animal.\",\n",
    "    5: \"Melanistic jaguars are informally known as black panthers, but as with all forms of polymorphism they do not form a separate species.\",\n",
    "    6: \"The jaguar uses scrape marks, urine, and feces to mark its territory.\",\n",
    "    7: \"The word 'jaguar' is thought to derive from the Tupian word yaguara, meaning 'beast of prey'.\",\n",
    "    8: \"Jaguar's business was founded as the Swallow Sidecar Company in 1922, originally making motorcycle sidecars before developing bodies for passenger cars.\",\n",
    "    9: \"In 1990 Ford acquired Jaguar Cars and it remained in their ownership, joined in 2000 by Land Rover, till 2008.\",\n",
    "    10: \"Two of the proudest moments in Jaguar's long history in motor sport involved winning the Le Mans 24 hours race, firstly in 1951 and again in 1953.\",\n",
    "    11: \"He therefore accepted BMC's offer to merge with Jaguar to form British Motor (Holdings) Limited.\",\n",
    "    12: \"The Jaguar E-Pace is a compact SUV, officially revealed on 13 July 2017.\"}\n",
    "\n",
    "titles, cxts = zip(*cxts_dicts.items())  # split into titles and contexts\n",
    "tcs = prepare_target_contexts(cxt_strs=cxts, target_word='jaguar')\n",
    "senses = gf.induce(\n",
    "    contexts=[cxt.context for cxt in tcs],\n",
    "    target_start_end_tuples=[cxt.target_start_end_inds for cxt in tcs],\n",
    "    titles=titles,\n",
    "    target_pos='N',  # we want only nouns\n",
    "    n_sense_indicators=5,  # how many substitutes for each sense in the output\n",
    "    top_n_pred=25)  # the number of substitutes for each context\n",
    "for i, sense in enumerate(senses):\n",
    "    print(f'Sense #{i+1}')\n",
    "    print(f'Sense indicators: {\", \".join(str(x) for x in sense.intent)}')\n",
    "    print(f'Found in contexts: {\", \".join(str(x) for x in sense.extent)}')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then disambiguate each context by choosing the sense with the highest score:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sense_indicators = [list(sense.intent) for sense in senses]\n",
    "# The loop variable is named cxt so that it does not shadow the tc module alias\n",
    "for cxt, title in zip(tcs, titles):\n",
    "    scores = cxt.disambiguate(sense_clusters=sense_indicators)\n",
    "    print(f'For context: \"{str(title).upper()}. {cxt.context}\" '\n",
    "          f'the sense: {sense_indicators[scores.index(max(scores))]} '\n",
    "          f'is chosen.')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
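
A note on locating the target: `prepare_target_contexts` in the notebook passes the target word to `re.search` as a regex and keeps only the first match. The sketch below uses only the standard library and shows one possible variant that escapes the target and returns every occurrence. `find_target_spans` and its `whole_word` flag are hypothetical names introduced here for illustration; they are not part of `ptlm_wsid`.

```python
import re
from typing import List, Tuple


def find_target_spans(cxt_str: str,
                      target_word: str,
                      whole_word: bool = False) -> List[Tuple[int, int]]:
    """Return (start, end) indices of all case-insensitive occurrences."""
    # re.escape makes regex metacharacters in the target match literally.
    pattern = re.escape(target_word)
    if whole_word:
        # \b only behaves as expected when the target starts and ends
        # with a word character (letter, digit or underscore).
        pattern = rf'\b{pattern}\b'
    return [m.span() for m in re.finditer(pattern, cxt_str, re.IGNORECASE)]


cxt = "Melanistic jaguars differ from the Jaguar E-Pace."

spans = find_target_spans(cxt, 'jaguar')
print([cxt[s:e] for s, e in spans])  # ['jaguar', 'Jaguar']

strict = find_target_spans(cxt, 'jaguar', whole_word=True)
print([cxt[s:e] for s, e in strict])  # ['Jaguar']
```

Each returned span has the same `(start, end)` shape as the tuple the notebook passes to `tc.TargetContext` as `target_start_end_inds`. Substring matching is the default here because context 5 above contains only the plural "jaguars", which a whole-word search would miss.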