Last active: October 21, 2018 01:44
Fuzzy sentence matching in Python - Bommarito Consulting, LLC: http://bommaritollc.com/2014/06/fuzzy-matching-sentences-in-python
{ | |
"metadata": { | |
"name": "", | |
"signature": "sha256:f38d99b86a857487f099a5da7dde9ab7c16866c251640fe5e80b4b7218e43f90" | |
}, | |
"nbformat": 3, | |
"nbformat_minor": 0, | |
"worksheets": [ | |
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## IPython Notebook for [Bommarito Consulting](http://bommaritollc.com/) Blog Post\n", | |
"\n", | |
"### **Link**: [Fuzzy sentence matching in Python](http://bommaritollc.com/2014/06/fuzzy-match-sentences-python): http://bommaritollc.com/2014/06/fuzzy-match-sentences-python\n", | |
"\n", | |
"**Author**: [Michael J. Bommarito II](https://www.linkedin.com/in/bommarito/)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"# Imports\n", | |
"import difflib\n", | |
"import nltk" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [], | |
"prompt_number": 11 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"target_sentence = \"In the eighteenth century it was often convenient to regard man as a clockwork automaton.\"\n", | |
"\n", | |
"sentences = [\"In the eighteenth century it was often convenient to regard man as a clockwork automaton.\",\n", | |
" \"in the eighteenth century it was often convenient to regard man as a clockwork automaton\",\n", | |
" \"In the eighteenth century, it was often convenient to regard man as a clockwork automaton.\",\n", | |
" \"In the eighteenth century, it was not accepted to regard man as a clockwork automaton.\",\n", | |
" \"In the eighteenth century, it was often convenient to regard man as clockwork automata.\",\n", | |
" \"In the eighteenth century, it was often convenient to regard man as clockwork automatons.\",\n", | |
" \"It was convenient to regard man as a clockwork automaton in the eighteenth century.\",\n", | |
" \"In the 1700s, it was common to regard man as a clockwork automaton.\",\n", | |
" \"In the 1700s, it was convenient to regard man as a clockwork automaton.\",\n", | |
" \"In the eighteenth century.\",\n", | |
" \"Man as a clockwork automaton.\",\n", | |
" \"In past centuries, man was often regarded as a clockwork automaton.\",\n", | |
" \"The eighteenth century was characterized by man as a clockwork automaton.\",\n", | |
" \"Very long ago in the eighteenth century, many scholars regarded man as merely a clockwork automaton.\",]" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [], | |
"prompt_number": 12 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Example 1 - Exact Match" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"def is_exact_match(a, b):\n", | |
" \"\"\"Check if a and b are matches.\"\"\"\n", | |
" return (a == b)\n", | |
"\n", | |
"for sentence in sentences:\n", | |
" print(is_exact_match(target_sentence, sentence), sentence)" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"(True, 'In the eighteenth century it was often convenient to regard man as a clockwork automaton.')\n", | |
"(False, 'in the eighteenth century it was often convenient to regard man as a clockwork automaton')\n", | |
"(False, 'In the eighteenth century, it was often convenient to regard man as a clockwork automaton.')\n", | |
"(False, 'In the eighteenth century, it was not accepted to regard man as a clockwork automaton.')\n", | |
"(False, 'In the eighteenth century, it was often convenient to regard man as clockwork automata.')\n", | |
"(False, 'In the eighteenth century, it was often convenient to regard man as clockwork automatons.')\n", | |
"(False, 'It was convenient to regard man as a clockwork automaton in the eighteenth century.')\n", | |
"(False, 'In the 1700s, it was common to regard man as a clockwork automaton.')\n", | |
"(False, 'In the 1700s, it was convenient to regard man as a clockwork automaton.')\n", | |
"(False, 'In the eighteenth century.')\n", | |
"(False, 'Man as a clockwork automaton.')\n", | |
"(False, 'In past centuries, man was often regarded as a clockwork automaton.')\n", | |
"(False, 'The eighteenth century was characterized by man as a clockwork automaton.')\n", | |
"(False, 'Very long ago in the eighteenth century, many scholars regarded man as merely a clockwork automaton.')\n" | |
] | |
} | |
], | |
"prompt_number": 13 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Example 2.a - Exact Case-Insensitive Token Match" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"# Imports\n", | |
"import nltk.corpus\n", | |
"import nltk.tokenize.punkt\n", | |
"import string\n", | |
"\n", | |
"# Get default English stopwords and extend with punctuation\n", | |
"stopwords = nltk.corpus.stopwords.words('english')\n", | |
"stopwords.extend(string.punctuation)\n", | |
"stopwords.append('')\n", | |
"\n", | |
"# Create tokenizer\n", | |
"tokenizer = nltk.tokenize.WordPunctTokenizer()\n", | |
"\n", | |
"def is_ci_token_match(a, b):\n", | |
" \"\"\"Check if a and b are matches.\"\"\"\n", | |
" tokens_a = [token.lower().strip(string.punctuation) for token in tokenizer.tokenize(a)]\n", | |
" tokens_b = [token.lower().strip(string.punctuation) for token in tokenizer.tokenize(b)]\n", | |
"\n", | |
" return (tokens_a == tokens_b)\n", | |
"\n", | |
"for sentence in sentences:\n", | |
" print(is_ci_token_match(target_sentence, sentence), sentence)" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"(True, 'In the eighteenth century it was often convenient to regard man as a clockwork automaton.')\n", | |
"(True, 'in the eighteenth century it was often convenient to regard man as a clockwork automaton')\n", | |
"(False, 'In the eighteenth century, it was often convenient to regard man as a clockwork automaton.')\n", | |
"(False, 'In the eighteenth century, it was not accepted to regard man as a clockwork automaton.')\n", | |
"(False, 'In the eighteenth century, it was often convenient to regard man as clockwork automata.')\n", | |
"(False, 'In the eighteenth century, it was often convenient to regard man as clockwork automatons.')\n", | |
"(False, 'It was convenient to regard man as a clockwork automaton in the eighteenth century.')\n", | |
"(False, 'In the 1700s, it was common to regard man as a clockwork automaton.')\n", | |
"(False, 'In the 1700s, it was convenient to regard man as a clockwork automaton.')\n", | |
"(False, 'In the eighteenth century.')\n", | |
"(False, 'Man as a clockwork automaton.')\n", | |
"(False, 'In past centuries, man was often regarded as a clockwork automaton.')\n", | |
"(False, 'The eighteenth century was characterized by man as a clockwork automaton.')\n", | |
"(False, 'Very long ago in the eighteenth century, many scholars regarded man as merely a clockwork automaton.')\n" | |
] | |
} | |
], | |
"prompt_number": 14 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Example 2.b - Exact Case-Insensitive Token Match after Stopwording" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"# Imports\n", | |
"import nltk.corpus\n", | |
"import nltk.tokenize.punkt\n", | |
"import string\n", | |
"\n", | |
"# Get default English stopwords and extend with punctuation\n", | |
"stopwords = nltk.corpus.stopwords.words('english')\n", | |
"stopwords.extend(string.punctuation)\n", | |
"stopwords.append('')\n", | |
"\n", | |
"# Create tokenizer\n", | |
"tokenizer = nltk.tokenize.WordPunctTokenizer()\n", | |
"\n", | |
"def is_ci_token_stopword_match(a, b):\n", | |
" \"\"\"Check if a and b are matches.\"\"\"\n", | |
" tokens_a = [token.lower().strip(string.punctuation) for token in tokenizer.tokenize(a) \\\n", | |
" if token.lower().strip(string.punctuation) not in stopwords]\n", | |
" tokens_b = [token.lower().strip(string.punctuation) for token in tokenizer.tokenize(b) \\\n", | |
" if token.lower().strip(string.punctuation) not in stopwords]\n", | |
" \n", | |
" return (tokens_a == tokens_b)\n", | |
"\n", | |
"for sentence in sentences:\n", | |
" print(is_ci_token_stopword_match(target_sentence, sentence), sentence)" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"(True, 'In the eighteenth century it was often convenient to regard man as a clockwork automaton.')\n", | |
"(True, 'in the eighteenth century it was often convenient to regard man as a clockwork automaton')\n", | |
"(True, 'In the eighteenth century, it was often convenient to regard man as a clockwork automaton.')\n", | |
"(False, 'In the eighteenth century, it was not accepted to regard man as a clockwork automaton.')\n", | |
"(False, 'In the eighteenth century, it was often convenient to regard man as clockwork automata.')\n", | |
"(False, 'In the eighteenth century, it was often convenient to regard man as clockwork automatons.')\n", | |
"(False, 'It was convenient to regard man as a clockwork automaton in the eighteenth century.')\n", | |
"(False, 'In the 1700s, it was common to regard man as a clockwork automaton.')\n", | |
"(False, 'In the 1700s, it was convenient to regard man as a clockwork automaton.')\n", | |
"(False, 'In the eighteenth century.')\n", | |
"(False, 'Man as a clockwork automaton.')\n", | |
"(False, 'In past centuries, man was often regarded as a clockwork automaton.')\n", | |
"(False, 'The eighteenth century was characterized by man as a clockwork automaton.')\n", | |
"(False, 'Very long ago in the eighteenth century, many scholars regarded man as merely a clockwork automaton.')\n" | |
] | |
} | |
], | |
"prompt_number": 15 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Example 3 - Exact Token Match after Stopwording and Stemming" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"# Imports\n", | |
"import nltk.corpus\n", | |
"import nltk.tokenize.punkt\n", | |
"import nltk.stem.snowball\n", | |
"import string\n", | |
"\n", | |
"# Get default English stopwords and extend with punctuation\n", | |
"stopwords = nltk.corpus.stopwords.words('english')\n", | |
"stopwords.extend(string.punctuation)\n", | |
"stopwords.append('')\n", | |
"\n", | |
"# Create tokenizer and stemmer\n", | |
"tokenizer = nltk.tokenize.WordPunctTokenizer()\n", | |
"stemmer = nltk.stem.snowball.SnowballStemmer('english')\n", | |
"\n", | |
"def is_ci_token_stopword_stem_match(a, b):\n", | |
" \"\"\"Check if a and b are matches.\"\"\"\n", | |
" tokens_a = [token.lower().strip(string.punctuation) for token in tokenizer.tokenize(a) \\\n", | |
" if token.lower().strip(string.punctuation) not in stopwords]\n", | |
" tokens_b = [token.lower().strip(string.punctuation) for token in tokenizer.tokenize(b) \\\n", | |
" if token.lower().strip(string.punctuation) not in stopwords]\n", | |
" stems_a = [stemmer.stem(token) for token in tokens_a]\n", | |
" stems_b = [stemmer.stem(token) for token in tokens_b]\n", | |
"\n", | |
" return (stems_a == stems_b)\n", | |
"\n", | |
"for sentence in sentences:\n", | |
" print(is_ci_token_stopword_stem_match(target_sentence, sentence), sentence)" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"(True, 'In the eighteenth century it was often convenient to regard man as a clockwork automaton.')\n", | |
"(True, 'in the eighteenth century it was often convenient to regard man as a clockwork automaton')\n", | |
"(True, 'In the eighteenth century, it was often convenient to regard man as a clockwork automaton.')\n", | |
"(False, 'In the eighteenth century, it was not accepted to regard man as a clockwork automaton.')\n", | |
"(False, 'In the eighteenth century, it was often convenient to regard man as clockwork automata.')\n", | |
"(True, 'In the eighteenth century, it was often convenient to regard man as clockwork automatons.')\n", | |
"(False, 'It was convenient to regard man as a clockwork automaton in the eighteenth century.')\n", | |
"(False, 'In the 1700s, it was common to regard man as a clockwork automaton.')\n", | |
"(False, 'In the 1700s, it was convenient to regard man as a clockwork automaton.')\n", | |
"(False, 'In the eighteenth century.')\n", | |
"(False, 'Man as a clockwork automaton.')\n", | |
"(False, 'In past centuries, man was often regarded as a clockwork automaton.')\n", | |
"(False, 'The eighteenth century was characterized by man as a clockwork automaton.')\n", | |
"(False, 'Very long ago in the eighteenth century, many scholars regarded man as merely a clockwork automaton.')\n" | |
] | |
} | |
], | |
"prompt_number": 16 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Example 4 - Exact Token Match after Stopwording and Lemmatizing" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"# Imports\n", | |
"import nltk.corpus\n", | |
"import nltk.tokenize.punkt\n", | |
"import nltk.stem.snowball\n", | |
"from nltk.corpus import wordnet\n", | |
"import string\n", | |
"\n", | |
"# Get default English stopwords and extend with punctuation\n", | |
"stopwords = nltk.corpus.stopwords.words('english')\n", | |
"stopwords.extend(string.punctuation)\n", | |
"stopwords.append('')\n", | |
"\n", | |
"def get_wordnet_pos(pos_tag):\n", | |
" if pos_tag[1].startswith('J'):\n", | |
" return (pos_tag[0], wordnet.ADJ)\n", | |
" elif pos_tag[1].startswith('V'):\n", | |
" return (pos_tag[0], wordnet.VERB)\n", | |
" elif pos_tag[1].startswith('N'):\n", | |
" return (pos_tag[0], wordnet.NOUN)\n", | |
" elif pos_tag[1].startswith('R'):\n", | |
" return (pos_tag[0], wordnet.ADV)\n", | |
" else:\n", | |
" return (pos_tag[0], wordnet.NOUN)\n", | |
"\n", | |
"# Create tokenizer and lemmatizer\n", | |
"tokenizer = nltk.tokenize.WordPunctTokenizer()\n", | |
"lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()\n", | |
"\n", | |
"def is_ci_token_stopword_lemma_match(a, b):\n", | |
" \"\"\"Check if a and b are matches.\"\"\"\n", | |
" pos_a = map(get_wordnet_pos, nltk.pos_tag(tokenizer.tokenize(a)))\n", | |
" pos_b = map(get_wordnet_pos, nltk.pos_tag(tokenizer.tokenize(b)))\n", | |
" lemmae_a = [lemmatizer.lemmatize(token.lower().strip(string.punctuation), pos) for token, pos in pos_a \\\n", | |
" if token.lower().strip(string.punctuation) not in stopwords]\n", | |
" lemmae_b = [lemmatizer.lemmatize(token.lower().strip(string.punctuation), pos) for token, pos in pos_b \\\n", | |
" if token.lower().strip(string.punctuation) not in stopwords]\n", | |
"\n", | |
" return (lemmae_a == lemmae_b)\n", | |
"\n", | |
"for sentence in sentences:\n", | |
" print(is_ci_token_stopword_lemma_match(target_sentence, sentence), sentence)" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"(True, 'In the eighteenth century it was often convenient to regard man as a clockwork automaton.')\n", | |
"(True, 'in the eighteenth century it was often convenient to regard man as a clockwork automaton')\n", | |
"(True, 'In the eighteenth century, it was often convenient to regard man as a clockwork automaton.')\n", | |
"(False, 'In the eighteenth century, it was not accepted to regard man as a clockwork automaton.')\n", | |
"(True, 'In the eighteenth century, it was often convenient to regard man as clockwork automata.')" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"\n", | |
"(True, 'In the eighteenth century, it was often convenient to regard man as clockwork automatons.')\n", | |
"(False, 'It was convenient to regard man as a clockwork automaton in the eighteenth century.')\n", | |
"(False, 'In the 1700s, it was common to regard man as a clockwork automaton.')\n", | |
"(False, 'In the 1700s, it was convenient to regard man as a clockwork automaton.')" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"\n", | |
"(False, 'In the eighteenth century.')\n", | |
"(False, 'Man as a clockwork automaton.')\n", | |
"(False, 'In past centuries, man was often regarded as a clockwork automaton.')\n", | |
"(False, 'The eighteenth century was characterized by man as a clockwork automaton.')\n", | |
"(False, 'Very long ago in the eighteenth century, many scholars regarded man as merely a clockwork automaton.')" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"\n" | |
] | |
} | |
], | |
"prompt_number": 17 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Example 5 - Partial Sequence Match after Stopwording and Lemmatizing" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"# Imports\n", | |
"import nltk.corpus\n", | |
"import nltk.tokenize.punkt\n", | |
"import nltk.stem.snowball\n", | |
"from nltk.corpus import wordnet\n", | |
"import difflib\n", | |
"import string\n", | |
"\n", | |
"# Get default English stopwords and extend with punctuation\n", | |
"stopwords = nltk.corpus.stopwords.words('english')\n", | |
"stopwords.extend(string.punctuation)\n", | |
"stopwords.append('')\n", | |
"\n", | |
"def get_wordnet_pos(pos_tag):\n", | |
" if pos_tag[1].startswith('J'):\n", | |
" return (pos_tag[0], wordnet.ADJ)\n", | |
" elif pos_tag[1].startswith('V'):\n", | |
" return (pos_tag[0], wordnet.VERB)\n", | |
" elif pos_tag[1].startswith('N'):\n", | |
" return (pos_tag[0], wordnet.NOUN)\n", | |
" elif pos_tag[1].startswith('R'):\n", | |
" return (pos_tag[0], wordnet.ADV)\n", | |
" else:\n", | |
" return (pos_tag[0], wordnet.NOUN)\n", | |
"\n", | |
"# Create tokenizer and lemmatizer\n", | |
"tokenizer = nltk.tokenize.WordPunctTokenizer()\n", | |
"lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()\n", | |
"\n", | |
"def is_ci_partial_seq_token_stopword_lemma_match(a, b):\n", | |
" \"\"\"Check if a and b are matches.\"\"\"\n", | |
" pos_a = map(get_wordnet_pos, nltk.pos_tag(tokenizer.tokenize(a)))\n", | |
" pos_b = map(get_wordnet_pos, nltk.pos_tag(tokenizer.tokenize(b)))\n", | |
" lemmae_a = [lemmatizer.lemmatize(token.lower().strip(string.punctuation), pos) for token, pos in pos_a \\\n", | |
" if token.lower().strip(string.punctuation) not in stopwords]\n", | |
" lemmae_b = [lemmatizer.lemmatize(token.lower().strip(string.punctuation), pos) for token, pos in pos_b \\\n", | |
" if token.lower().strip(string.punctuation) not in stopwords]\n", | |
"\n", | |
" # Create sequence matcher\n", | |
" s = difflib.SequenceMatcher(None, lemmae_a, lemmae_b)\n", | |
" return (s.ratio() > 0.66)\n", | |
"\n", | |
"for sentence in sentences:\n", | |
" print(is_ci_partial_seq_token_stopword_lemma_match(target_sentence, sentence), sentence)" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"(True, 'In the eighteenth century it was often convenient to regard man as a clockwork automaton.')\n", | |
"(True, 'in the eighteenth century it was often convenient to regard man as a clockwork automaton')\n", | |
"(True, 'In the eighteenth century, it was often convenient to regard man as a clockwork automaton.')\n", | |
"(True, 'In the eighteenth century, it was not accepted to regard man as a clockwork automaton.')" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"\n", | |
"(True, 'In the eighteenth century, it was often convenient to regard man as clockwork automata.')\n", | |
"(True, 'In the eighteenth century, it was often convenient to regard man as clockwork automatons.')\n", | |
"(True, 'It was convenient to regard man as a clockwork automaton in the eighteenth century.')\n", | |
"(False, 'In the 1700s, it was common to regard man as a clockwork automaton.')" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"\n", | |
"(True, 'In the 1700s, it was convenient to regard man as a clockwork automaton.')\n", | |
"(False, 'In the eighteenth century.')\n", | |
"(False, 'Man as a clockwork automaton.')\n", | |
"(True, 'In past centuries, man was often regarded as a clockwork automaton.')\n", | |
"(True, 'The eighteenth century was characterized by man as a clockwork automaton.')" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"\n", | |
"(False, 'Very long ago in the eighteenth century, many scholars regarded man as merely a clockwork automaton.')\n" | |
] | |
} | |
], | |
"prompt_number": 18 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Example 6 - Partial Set Match after Stopwording and Lemmatizing" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": true, | |
"input": [ | |
"# Imports\n", | |
"import nltk.corpus\n", | |
"import nltk.tokenize.punkt\n", | |
"import nltk.stem.snowball\n", | |
"from nltk.corpus import wordnet\n", | |
"import string\n", | |
"\n", | |
"# Get default English stopwords and extend with punctuation\n", | |
"stopwords = nltk.corpus.stopwords.words('english')\n", | |
"stopwords.extend(string.punctuation)\n", | |
"stopwords.append('')\n", | |
"\n", | |
"def get_wordnet_pos(pos_tag):\n", | |
" if pos_tag[1].startswith('J'):\n", | |
" return (pos_tag[0], wordnet.ADJ)\n", | |
" elif pos_tag[1].startswith('V'):\n", | |
" return (pos_tag[0], wordnet.VERB)\n", | |
" elif pos_tag[1].startswith('N'):\n", | |
" return (pos_tag[0], wordnet.NOUN)\n", | |
" elif pos_tag[1].startswith('R'):\n", | |
" return (pos_tag[0], wordnet.ADV)\n", | |
" else:\n", | |
" return (pos_tag[0], wordnet.NOUN)\n", | |
"\n", | |
"# Create tokenizer and lemmatizer\n", | |
"tokenizer = nltk.tokenize.WordPunctTokenizer()\n", | |
"lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()\n", | |
"\n", | |
"def is_ci_partial_set_token_stopword_lemma_match(a, b):\n", | |
" \"\"\"Check if a and b are matches.\"\"\"\n", | |
" pos_a = map(get_wordnet_pos, nltk.pos_tag(tokenizer.tokenize(a)))\n", | |
" pos_b = map(get_wordnet_pos, nltk.pos_tag(tokenizer.tokenize(b)))\n", | |
" lemmae_a = [lemmatizer.lemmatize(token.lower().strip(string.punctuation), pos) for token, pos in pos_a \\\n", | |
" if token.lower().strip(string.punctuation) not in stopwords]\n", | |
" lemmae_b = [lemmatizer.lemmatize(token.lower().strip(string.punctuation), pos) for token, pos in pos_b \\\n", | |
" if token.lower().strip(string.punctuation) not in stopwords]\n", | |
"\n", | |
" # Calculate Jaccard similarity\n", | |
" ratio = len(set(lemmae_a).intersection(lemmae_b)) / float(len(set(lemmae_a).union(lemmae_b)))\n", | |
" return (ratio > 0.66)\n", | |
"\n", | |
"for sentence in sentences:\n", | |
" print(is_ci_partial_set_token_stopword_lemma_match(target_sentence, sentence), sentence)" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"(True, 'In the eighteenth century it was often convenient to regard man as a clockwork automaton.')\n", | |
"(True, 'in the eighteenth century it was often convenient to regard man as a clockwork automaton')\n", | |
"(True, 'In the eighteenth century, it was often convenient to regard man as a clockwork automaton.')\n", | |
"(True, 'In the eighteenth century, it was not accepted to regard man as a clockwork automaton.')" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"\n", | |
"(True, 'In the eighteenth century, it was often convenient to regard man as clockwork automata.')\n", | |
"(True, 'In the eighteenth century, it was often convenient to regard man as clockwork automatons.')\n", | |
"(True, 'It was convenient to regard man as a clockwork automaton in the eighteenth century.')\n", | |
"(False, 'In the 1700s, it was common to regard man as a clockwork automaton.')" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"\n", | |
"(False, 'In the 1700s, it was convenient to regard man as a clockwork automaton.')\n", | |
"(False, 'In the eighteenth century.')\n", | |
"(False, 'Man as a clockwork automaton.')\n", | |
"(True, 'In past centuries, man was often regarded as a clockwork automaton.')\n", | |
"(False, 'The eighteenth century was characterized by man as a clockwork automaton.')" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"\n", | |
"(False, 'Very long ago in the eighteenth century, many scholars regarded man as merely a clockwork automaton.')\n" | |
] | |
} | |
], | |
"prompt_number": 19 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Example 7 - Partial Noun Set Match after Stopwording and Lemmatizing" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"# Imports\n", | |
"import nltk.corpus\n", | |
"import nltk.tokenize.punkt\n", | |
"import nltk.stem.snowball\n", | |
"from nltk.corpus import wordnet\n", | |
"import string\n", | |
"\n", | |
"# Get default English stopwords and extend with punctuation\n", | |
"stopwords = nltk.corpus.stopwords.words('english')\n", | |
"stopwords.extend(string.punctuation)\n", | |
"stopwords.append('')\n", | |
"\n", | |
"def get_wordnet_pos(pos_tag):\n", | |
" if pos_tag[1].startswith('J'):\n", | |
" return (pos_tag[0], wordnet.ADJ)\n", | |
" elif pos_tag[1].startswith('V'):\n", | |
" return (pos_tag[0], wordnet.VERB)\n", | |
" elif pos_tag[1].startswith('N'):\n", | |
" return (pos_tag[0], wordnet.NOUN)\n", | |
" elif pos_tag[1].startswith('R'):\n", | |
" return (pos_tag[0], wordnet.ADV)\n", | |
" else:\n", | |
" return (pos_tag[0], wordnet.NOUN)\n", | |
"\n", | |
"# Create tokenizer and lemmatizer\n", | |
"tokenizer = nltk.tokenize.WordPunctTokenizer()\n", | |
"lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()\n", | |
"\n", | |
"def is_ci_partial_noun_set_token_stopword_lemma_match(a, b):\n", | |
" \"\"\"Check if a and b are matches.\"\"\"\n", | |
" pos_a = map(get_wordnet_pos, nltk.pos_tag(tokenizer.tokenize(a)))\n", | |
" pos_b = map(get_wordnet_pos, nltk.pos_tag(tokenizer.tokenize(b)))\n", | |
" lemmae_a = [lemmatizer.lemmatize(token.lower().strip(string.punctuation), pos) for token, pos in pos_a \\\n", | |
" if pos == wordnet.NOUN and token.lower().strip(string.punctuation) not in stopwords]\n", | |
" lemmae_b = [lemmatizer.lemmatize(token.lower().strip(string.punctuation), pos) for token, pos in pos_b \\\n", | |
" if pos == wordnet.NOUN and token.lower().strip(string.punctuation) not in stopwords]\n", | |
"\n", | |
" # Calculate Jaccard similarity\n", | |
" ratio = len(set(lemmae_a).intersection(lemmae_b)) / float(len(set(lemmae_a).union(lemmae_b)))\n", | |
" return (ratio > 0.66)\n", | |
"\n", | |
"for sentence in sentences:\n", | |
" print(is_ci_partial_noun_set_token_stopword_lemma_match(target_sentence, sentence), sentence)" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"(True, 'In the eighteenth century it was often convenient to regard man as a clockwork automaton.')\n", | |
"(True, 'in the eighteenth century it was often convenient to regard man as a clockwork automaton')\n", | |
"(True, 'In the eighteenth century, it was often convenient to regard man as a clockwork automaton.')\n", | |
"(True, 'In the eighteenth century, it was not accepted to regard man as a clockwork automaton.')\n", | |
"(True, 'In the eighteenth century, it was often convenient to regard man as clockwork automata.')" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"\n", | |
"(True, 'In the eighteenth century, it was often convenient to regard man as clockwork automatons.')\n", | |
"(True, 'It was convenient to regard man as a clockwork automaton in the eighteenth century.')\n", | |
"(False, 'In the 1700s, it was common to regard man as a clockwork automaton.')\n", | |
"(False, 'In the 1700s, it was convenient to regard man as a clockwork automaton.')" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"\n", | |
"(False, 'In the eighteenth century.')\n", | |
"(False, 'Man as a clockwork automaton.')\n", | |
"(True, 'In past centuries, man was often regarded as a clockwork automaton.')\n", | |
"(True, 'The eighteenth century was characterized by man as a clockwork automaton.')\n", | |
"(True, 'Very long ago in the eighteenth century, many scholars regarded man as merely a clockwork automaton.')" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"\n" | |
] | |
} | |
], | |
"prompt_number": 20 | |
} | |
], | |
"metadata": {} | |
} | |
] | |
} |
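The notebook above targets the legacy nbformat-3 / Python 2 toolchain. As a sanity check of the core idea in Example 6 (Jaccard similarity over normalized token sets), here is a minimal standard-library sketch; the regex tokenizer and the tiny stopword set are stand-ins for NLTK's tokenizer and stopword corpus, and lemmatization is omitted:

```python
import re

# Stand-ins for NLTK's tokenizer and English stopword corpus (assumptions,
# not the notebook's exact resources).
STOPWORDS = {"in", "the", "it", "was", "to", "as", "a", "often", "not"}

def normalize(sentence):
    """Lowercase, tokenize on alphanumeric runs, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    return {token for token in tokens if token not in STOPWORDS}

def is_jaccard_match(a, b, threshold=0.66):
    """Jaccard similarity over normalized token sets, as in Example 6."""
    set_a, set_b = normalize(a), normalize(b)
    union = set_a | set_b
    if not union:
        return False
    return len(set_a & set_b) / len(union) > threshold

target = ("In the eighteenth century it was often convenient "
          "to regard man as a clockwork automaton.")
print(is_jaccard_match(target, "In the eighteenth century, it was often "
                               "convenient to regard man as a clockwork automaton."))  # True
print(is_jaccard_match(target, "In the eighteenth century."))  # False
```

As in the notebook, the 0.66 threshold is a tunable judgment call, not a principled constant.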
I am getting an error, AttributeError: 'module' object has no attribute 'PunktWordTokenizer'. I searched for a possible solution but couldn't find one. Would really appreciate some help.
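That AttributeError comes from running the original code on NLTK 3.x, where PunktWordTokenizer was dropped from the public API. nltk.tokenize.WordPunctTokenizer() is the usual near drop-in replacement (nltk.word_tokenize also works). Its split rule is roughly a one-line regex, so you can sketch the behavior without NLTK installed:

```python
import re

def word_punct_tokenize(text):
    # Roughly the split rule WordPunctTokenizer uses: runs of word
    # characters, or runs of non-word, non-space characters (punctuation).
    return re.findall(r"\w+|[^\w\s]+", text)

print(word_punct_tokenize("In the 1700s, it was common."))
# ['In', 'the', '1700s', ',', 'it', 'was', 'common', '.']
```

In the notebook itself, swapping `tokenizer = nltk.tokenize.punkt.PunktWordTokenizer()` for `tokenizer = nltk.tokenize.WordPunctTokenizer()` should resolve the error with essentially the same tokenization for these sentences.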