NYT ingredient tagger implementation with pyCRFSuite
@manugarri · created May 5, 2016
{
"cells": [
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2016-05-05T15:37:46\n",
"\n",
"CPython 2.7.11\n",
"IPython 4.0.3\n",
"\n",
"compiler : GCC 4.4.7 20120313 (Red Hat 4.4.7-1)\n",
"system : Linux\n",
"release : 3.19.0-58-generic\n",
"machine : x86_64\n",
"processor : x86_64\n",
"CPU cores : 8\n",
"interpreter: 64bit\n"
]
}
],
"source": [
"%load_ext watermark\n",
"%watermark"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"This is a notebook showing a modification of the original [NYT Ingredient Phrase tagger](https://github.com/NYTimes/ingredient-phrase-tagger). [Here](http://open.blogs.nytimes.com/2016/04/27/structured-ingredients-data-tagging/) is the article where they talk about it.\n",
"\n",
"That github repository contains New York Time's tool for performing Named Entity Recognition via Conditional Random Fields on food recipes to extract the ingredients used on those recipes as well as the quantities.\n",
"\n",
"On their implementation they use a [CRF++](https://taku910.github.io/crfpp/) as the extractor."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here I will use pycrfsuite instead of CRF++, the main reasons being:\n",
"\n",
"* by using a full python solution (even though pycrfsuite is just a wrapper around [crfsuite](http://www.chokkan.org/software/crfsuite/)) we can deploy the model more easily, and \n",
"\n",
"* installing CRF++ proved to be a challenge in Ubuntu 14.04\n",
"\n",
"You can install pycrfsuite by doing:\n",
"\n",
"`pip install python-crfsuite`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We load the train_file with features produced by calling *(as it appears on the README)*:\n",
"\n",
"```\n",
"bin/generate_data --data-path=input.csv --count=180000 --offset=0 > tmp/train_file\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import re\n",
"import json\n",
"\n",
"from itertools import chain\n",
"import nltk\n",
"import pycrfsuite\n",
"\n",
"from lib.training import utils"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"with open('tmp/train_file') as fname:\n",
" lines = fname.readlines()\n",
" items = [line.strip('\\n').split('\\t') for line in lines]\n",
" items = [item for item in items if len(item)==6]"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[['1$1/4', 'I1', 'L20', 'NoCAP', 'NoPAREN', 'B-QTY'],\n",
" ['cups', 'I2', 'L20', 'NoCAP', 'NoPAREN', 'B-UNIT'],\n",
" ['cooked', 'I3', 'L20', 'NoCAP', 'NoPAREN', 'B-COMMENT'],\n",
" ['and', 'I4', 'L20', 'NoCAP', 'NoPAREN', 'I-COMMENT'],\n",
" ['pureed', 'I5', 'L20', 'NoCAP', 'NoPAREN', 'I-COMMENT'],\n",
" ['fresh', 'I6', 'L20', 'NoCAP', 'NoPAREN', 'I-COMMENT'],\n",
" ['butternut', 'I7', 'L20', 'NoCAP', 'NoPAREN', 'B-NAME'],\n",
" ['squash', 'I8', 'L20', 'NoCAP', 'NoPAREN', 'I-NAME'],\n",
" [',', 'I9', 'L20', 'NoCAP', 'NoPAREN', 'OTHER'],\n",
" ['or', 'I10', 'L20', 'NoCAP', 'NoPAREN', 'I-COMMENT']]"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"items[:10]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we can see, each line of the train_file follows the format:\n",
"\n",
"- token\n",
"- position on the phrase. (I1 would be first word, I2 the second, and so on)\n",
"- LX , being the length group of the token (defined by [LengthGroup](https://github.com/NYTimes/ingredient-phrase-tagger/blob/master/lib/training/utils.py#L140))\n",
"- NoCAP or YesCAP, whether the token is capitalized or not\n",
"- YesParen or NoParen, whether the token is inside parenthesis or not"
]
},
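{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the layout concrete, we can zip one row against our own names for the six columns (the field names below are just labels for this illustration; the NYT repo does not name them):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Illustrative only: our own labels for the six train_file columns.\n",
"FIELDS = ['token', 'index', 'length_group', 'capitalized', 'in_paren', 'tag']\n",
"print(dict(zip(FIELDS, items[0])))"
]
},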
{
"cell_type": "markdown",
"metadata": {},
"source": [
"PyCRFSuite expects the input to be a list of the structured items and their respective tags. So we process the items from the train file and bucket them into sentences"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"177029"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sentences = []\n",
"\n",
"sent = [items[0]]\n",
"for item in items[1:]:\n",
" if 'I1' in item:\n",
" sentences.append(sent)\n",
" sent = [item]\n",
" else:\n",
" sent.append(item)\n",
"len(sentences)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import random\n",
"random.shuffle(sentences)\n",
"test_size = 0.1\n",
"data_size = len(sentences)\n",
"\n",
"test_data = sentences[:int(test_size*data_size)]\n",
"train_data = sentences[int(test_size*data_size):]"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[['Orange', 'I1', 'L8', 'YesCAP', 'NoPAREN'],\n",
" ['peel', 'I2', 'L8', 'NoCAP', 'NoPAREN'],\n",
" [',', 'I3', 'L8', 'NoCAP', 'NoPAREN'],\n",
" ['sliced.', 'I4', 'L8', 'NoCAP', 'NoPAREN']]"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def sent2labels(sent):\n",
" return [word[-1] for word in sent]\n",
"\n",
"def sent2features(sent):\n",
" return [word[:-1] for word in sent]\n",
"\n",
"def sent2tokens(sent):\n",
" return [word[0] for word in sent] \n",
"\n",
"y_train = [sent2labels(s) for s in train_data]\n",
"X_train = [sent2features(s) for s in train_data]\n",
"X_train[1]"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"We set up the CRF trainer. We will use the default values and include all the possible joint features"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"trainer = pycrfsuite.Trainer(verbose=False)\n",
"\n",
"for xseq, yseq in zip(X_train, y_train):\n",
" trainer.append(xseq, yseq)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I obtained the following hyperparameters by performing a GridSearchCV with the scikit learn implementation of pycrfsuite."
]
},
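{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, here is a sketch of what that search could look like. It assumes the separate `sklearn-crfsuite` package (`pip install sklearn-crfsuite`), which exposes pycrfsuite as a scikit-learn estimator; it is not part of the original NYT repo, and the grid below is just an example."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# A sketch, not run here: grid search over the CRF regularization terms,\n",
"# assuming sklearn-crfsuite. Parameter names mirror the pycrfsuite ones.\n",
"import sklearn_crfsuite\n",
"from sklearn_crfsuite import metrics\n",
"from sklearn.metrics import make_scorer\n",
"from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in newer versions\n",
"\n",
"crf = sklearn_crfsuite.CRF(\n",
"    algorithm='lbfgs',\n",
"    max_iterations=100,\n",
"    all_possible_transitions=True,\n",
"    all_possible_states=True\n",
")\n",
"param_grid = {'c1': [0.1, 0.43, 1.0], 'c2': [0.001, 0.012, 0.1]}\n",
"f1_scorer = make_scorer(metrics.flat_f1_score, average='weighted')\n",
"\n",
"gs = GridSearchCV(crf, param_grid, scoring=f1_scorer, cv=3)\n",
"gs.fit(X_train, y_train)\n",
"print(gs.best_params_)"
]
},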
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"trainer.set_params(\n",
"{\n",
" 'c1': 0.43,\n",
" 'c2': 0.012,\n",
" 'max_iterations': 100,\n",
" 'feature.possible_transitions': True,\n",
" 'feature.possible_states': True,\n",
" 'linesearch': 'StrongBacktracking'\n",
" }\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We train the model (this might take a while)"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"trainer.train('tmp/trained_pycrfsuite')"
]
},
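{
"cell_type": "markdown",
"metadata": {},
"source": [
"To confirm the optimizer actually ran to completion, we can inspect the parsed training log that pycrfsuite keeps on the trainer:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# number of L-BFGS iterations performed, plus the stats of the last one\n",
"print(len(trainer.logparser.iterations), trainer.logparser.iterations[-1])"
]
},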
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we have a pretrained model that we can just deploy"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"<contextlib.closing at 0x7f2984586990>"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tagger = pycrfsuite.Tagger()\n",
"tagger.open('tmp/trained_pycrfsuite')"
]
},
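{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before wiring up the parsing helpers, a quick sanity check on the held-out split we created earlier (and have not used yet): a minimal token-level report, assuming scikit-learn is available."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# A sketch: tag the held-out sentences and report token-level metrics.\n",
"from sklearn.metrics import classification_report\n",
"\n",
"X_test = [sent2features(s) for s in test_data]\n",
"y_test = [sent2labels(s) for s in test_data]\n",
"y_pred = [tagger.tag(xseq) for xseq in X_test]\n",
"\n",
"# flatten the per-sentence label sequences into one token stream\n",
"flat_true = [label for seq in y_test for label in seq]\n",
"flat_pred = [label for seq in y_pred for label in seq]\n",
"print(classification_report(flat_true, flat_pred))"
]
},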
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we just add a wrapper function for the script found in **lib/testing/convert_to_json.py** and create a convient way to parse an ingredient sentence"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import re\n",
"import json\n",
"from lib.training import utils\n",
"from string import punctuation\n",
"\n",
"from nltk.tokenize import PunktSentenceTokenizer\n",
"\n",
"tokenizer = PunktSentenceTokenizer()\n",
"\n",
"def get_sentence_features(sent):\n",
" \"\"\"Gets the features of the sentence\"\"\"\n",
" sent_tokens = utils.tokenize(utils.cleanUnicodeFractions(sent))\n",
"\n",
" sent_features = []\n",
" for i, token in enumerate(sent_tokens):\n",
" token_features = [token]\n",
" token_features.extend(utils.getFeatures(token, i+1, sent_tokens))\n",
" sent_features.append(token_features)\n",
" return sent_features\n",
"\n",
"def format_ingredient_output(tagger_output, display=False):\n",
" \"\"\"Formats the tagger output into a more convenient dictionary\"\"\"\n",
" data = [{}]\n",
" display = [[]]\n",
" prevTag = None\n",
"\n",
"\n",
" for token, tag in tagger_output:\n",
" # turn B-NAME/123 back into \"name\"\n",
" tag = re.sub(r'^[BI]\\-', \"\", tag).lower()\n",
"\n",
" # ---- DISPLAY ----\n",
" # build a structure which groups each token by its tag, so we can\n",
" # rebuild the original display name later.\n",
"\n",
" if prevTag != tag:\n",
" display[-1].append((tag, [token]))\n",
" prevTag = tag\n",
" else:\n",
" display[-1][-1][1].append(token)\n",
" # ^- token\n",
" # ^---- tag\n",
" # ^-------- ingredient\n",
"\n",
" # ---- DATA ----\n",
" # build a dict grouping tokens by their tag\n",
"\n",
" # initialize this attribute if this is the first token of its kind\n",
" if tag not in data[-1]:\n",
" data[-1][tag] = []\n",
"\n",
" # HACK: If this token is a unit, singularize it so Scoop accepts it.\n",
" if tag == \"unit\":\n",
" token = utils.singularize(token)\n",
"\n",
" data[-1][tag].append(token)\n",
"\n",
" # reassemble the output into a list of dicts.\n",
" output = [\n",
" dict([(k, utils.smartJoin(tokens)) for k, tokens in ingredient.iteritems()])\n",
" for ingredient in data\n",
" if len(ingredient)\n",
" ]\n",
"\n",
" # Add the raw ingredient phrase\n",
" for i, v in enumerate(output):\n",
" output[i][\"input\"] = utils.smartJoin(\n",
" [\" \".join(tokens) for k, tokens in display[i]])\n",
"\n",
" return output\n",
"\n",
"def parse_ingredient(sent):\n",
" \"\"\"ingredient parsing logic\"\"\"\n",
" sentence_features = get_sentence_features(sent)\n",
" tags = tagger.tag(sentence_features)\n",
" tagger_output = zip(sent2tokens(sentence_features), tags)\n",
" parsed_ingredient = format_ingredient_output(tagger_output)\n",
" if parsed_ingredient:\n",
" parsed_ingredient[0]['name'] = parsed_ingredient[0].get('name','').strip('.')\n",
" return parsed_ingredient\n",
"\n",
"def parse_recipe_ingredients(ingredient_list):\n",
" \"\"\"Wrapper around parse_ingredient so we can call it on an ingredient list\"\"\"\n",
" sentences = tokenizer.tokenize(q)\n",
" sentences = [sent.strip('\\n') for sent in sentences]\n",
" ingredients = []\n",
" for sent in sentences:\n",
" ingredients.extend(parse_ingredient(sent))\n",
" return ingredients"
]
},
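{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick check of `parse_ingredient` on a single phrase (any ingredient string works here) before running a full ingredient list:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"parse_ingredient('3 tablespoons chopped fresh parsley')"
]
},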
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[{'input': u'2$1/4 cups all-purpose flour.',\n",
" 'name': u'all-purpose flour',\n",
" 'qty': u'2$1/4',\n",
" 'unit': u'cup'},\n",
" {'input': u'1/2 teaspoon baking soda.',\n",
" 'name': u'baking',\n",
" 'other': u'soda.',\n",
" 'qty': u'1/2',\n",
" 'unit': u'teaspoon'},\n",
" {'comment': u'(2 sticks)',\n",
" 'input': u'1 cup (2 sticks) unsalted butter, room temperature.',\n",
" 'name': u'unsalted butter',\n",
" 'other': u', room temperature.',\n",
" 'qty': u'1',\n",
" 'unit': u'cup'},\n",
" {'input': u'1/2 cup granulated sugar.',\n",
" 'name': u'granulated sugar',\n",
" 'qty': u'1/2',\n",
" 'unit': u'cup'},\n",
" {'comment': u'packed',\n",
" 'input': u'1 cup packed light-brown sugar.',\n",
" 'name': '',\n",
" 'other': u'light-brown sugar.',\n",
" 'qty': u'1',\n",
" 'unit': u'cup'},\n",
" {'input': u'1 teaspoon salt.',\n",
" 'name': '',\n",
" 'other': u'salt.',\n",
" 'qty': u'1',\n",
" 'unit': u'teaspoon'},\n",
" {'comment': u'pure',\n",
" 'input': u'2 teaspoons pure vanilla extract.',\n",
" 'name': u'vanilla',\n",
" 'other': u'extract.',\n",
" 'qty': u'2',\n",
" 'unit': u'teaspoon'},\n",
" {'comment': u'large',\n",
" 'input': u'2 large eggs.',\n",
" 'name': u'eggs',\n",
" 'qty': u'2'},\n",
" {'comment': u'(about 12 ounces) semisweet and/or',\n",
" 'input': u'2 cups (about 12 ounces) semisweet and/or milk chocolate chips.',\n",
" 'name': u'milk chocolate chips',\n",
" 'qty': u'2',\n",
" 'unit': u'cup'}]"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"q = '''\n",
"2 1/4 cups all-purpose flour.\n",
"1/2 teaspoon baking soda.\n",
"1 cup (2 sticks) unsalted butter, room temperature.\n",
"1/2 cup granulated sugar.\n",
"1 cup packed light-brown sugar.\n",
"1 teaspoon salt.\n",
"2 teaspoons pure vanilla extract.\n",
"2 large eggs.\n",
"2 cups (about 12 ounces) semisweet and/or milk chocolate chips.\n",
"'''\n",
"\n",
"parse_recipe_ingredients(q)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.11"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
@jfrux commented Aug 11, 2017

Did you ever post a fork of your changes to GitHub somewhere that we can tap into with pip?
I'd love to use your library without the CRF++ dependency.

@CubeRoshan commented
utils.tokenize() returns only unit information; where do the name and the other details come from?

@kavyvetri commented
I have the same issue. I am only getting output like this:
[{'qty': '2$1/4', 'input': '2$1/4', 'name': ''}]
Not getting the other information. Any pointers?

@manugarri (Author) commented
hey @kavyvetri, it's been 4 years since I did this, I don't remember it at all 🎉
