Examples of tagging and fuzzy matching items in text documents using Python
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Entity Recognition and Extraction Recipes\n",
"\n",
"A collection of code fragments for performing:\n",
"\n",
"- simple entity extraction from a text;\n",
"- partial and fuzzing string matching of specified entities in a text."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Simple named entity recognition\n",
"\n",
"[`spaCy`](https://spacy.io/) is a natural language processing library for Python library that includes a basic model capable of recognising (ish!) names of people, places and organisations, as well as dates and financial amounts.\n",
"\n",
"According to the [`spaCy` entity recognition](https://spacy.io/docs/usage/entity-recognition) documentation, the built in model recognises the following types of entity:\n",
"\n",
"- `PERSON`\tPeople, including fictional.\n",
"- `NORP`\tNationalities or religious or political groups.\n",
"- `FACILITY`\tBuildings, airports, highways, bridges, etc.\n",
"- `ORG`\tCompanies, agencies, institutions, etc.\n",
"- `GPE`\tCountries, cities, states. (That is, *Geo-Political Entitites*)\n",
"- `LOC`\tNon-GPE locations, mountain ranges, bodies of water.\n",
"- `PRODUCT`\tObjects, vehicles, foods, etc. (Not services.)\n",
"- `EVENT`\tNamed hurricanes, battles, wars, sports events, etc.\n",
"- `WORK_OF_ART`\tTitles of books, songs, etc.\n",
"- `LANGUAGE`\tAny named language.\n",
"- `LAW` A legislation related entity(?)\n",
"\n",
"Quantities are also recognised:\n",
"\n",
"- `DATE`\tAbsolute or relative dates or periods.\n",
"- `TIME`\tTimes smaller than a day.\n",
"- `PERCENT`\tPercentage, including \"%\".\n",
"- `MONEY`\tMonetary values, including unit.\n",
"- `QUANTITY`\tMeasurements, as of weight or distance.\n",
"- `ORDINAL`\t\"first\", \"second\", etc.\n",
"- `CARDINAL`\tNumerals that do not fall under another type.\n",
"\n",
"Custom models can also be trained, but this requires annotated training documents."
]
},
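{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough sketch of what \"annotated training documents\" look like (this is the character-offset annotation format used by later spaCy versions, not the training code itself), annotations are just example texts paired with entity spans:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#A sketch of annotated training data: texts paired with (start, end, label) character offsets\n",
"#(the offsets below are illustrative)\n",
"TRAIN_DATA = [\n",
"    ('Nestlé announced redundancies in York.',\n",
"     {'entities': [(0, 6, 'ORG'), (33, 37, 'GPE')]}),\n",
"]\n",
"\n",
"#Sanity check the offsets by slicing the text\n",
"text, annotations = TRAIN_DATA[0]\n",
"[text[start:end] for start, end, label in annotations['entities']]"
]
},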
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#!pip3 install spacy"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"from spacy.en import English\n",
"parser = English()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"example='''\n",
"That this House notes the announcement of 300 redundancies at the Nestlé manufacturing factories in \n",
"York, Fawdon, Halifax and Girvan and that production of the Blue Riband bar will be transferred to Poland; \n",
"acknowledges in the first three months of 2017 Nestlé achieved £21 billion in sales, a 0.4 per cent increase \n",
"over the same period in 2016; further notes 156 of these job losses will be in York, a city that in \n",
"the last six months has seen 2,000 job losses announced and has become the most inequitable city outside \n",
"of the South East, and a further 110 jobs from Fawdon, Newcastle; recognises the losses come within a month of\n",
"triggering Article 50, and as negotiations with the EU on the UK leaving the EU and the UK's future with \n",
"the EU are commencing; further recognises the cost of importing products, including sugar, cocoa and \n",
"production machinery, has risen due to the weakness of the pound and the uncertainty over the UK's future \n",
"relationship with the single market and customs union; and calls on the Government to intervene and work\n",
"with hon. Members, trades unions GMB and Unite and the company to avert these job losses now and prevent \n",
"further job losses across Nestlé.\n",
"'''"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"#Code \"borrowed\" from somewhere?!\n",
"def entities(example, show=False):\n",
" if show: print(example)\n",
" parsedEx = parser(example)\n",
"\n",
" print(\"-------------- entities only ---------------\")\n",
" # if you just want the entities and nothing else, you can do access the parsed examples \"ents\" property like this:\n",
" ents = list(parsedEx.ents)\n",
" tags={}\n",
" for entity in ents:\n",
" #print(entity.label, entity.label_, ' '.join(t.orth_ for t in entity))\n",
" term=' '.join(t.orth_ for t in entity)\n",
" if ' '.join(term) not in tags:\n",
" tags[term]=[(entity.label, entity.label_)]\n",
" else:\n",
" tags[term].append((entity.label, entity.label_))\n",
" print(tags)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"-------------- entities only ---------------\n",
"{'House': [(380, 'ORG')], '300': [(393, 'CARDINAL')], 'Nestlé': [(380, 'ORG')], '\\n York , Fawdon': [(381, 'GPE')], 'Halifax': [(381, 'GPE')], 'Girvan': [(381, 'GPE')], 'the Blue Riband': [(380, 'ORG')], 'Poland': [(381, 'GPE')], '\\n': [(381, 'GPE'), (381, 'GPE')], 'the first three months of 2017': [(387, 'DATE')], '£ 21 billion': [(390, 'MONEY')], '0.4 per': [(390, 'MONEY')], 'the same period in 2016': [(387, 'DATE')], '156': [(393, 'CARDINAL')], 'York': [(381, 'GPE')], '\\n the': [(381, 'GPE')], 'six': [(393, 'CARDINAL')], '2,000': [(393, 'CARDINAL')], 'the South East': [(382, 'LOC')], '110': [(393, 'CARDINAL')], 'Fawdon': [(381, 'GPE')], 'Newcastle': [(380, 'ORG')], 'a month of': [(387, 'DATE')], 'Article 50': [(21153, 'LAW')], 'EU': [(380, 'ORG')], 'UK': [(381, 'GPE')], 'GMB': [(380, 'ORG')], 'Unite': [(381, 'GPE')]}\n"
]
}
],
"source": [
"entities(example)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"-------------- entities only ---------------\n",
"{'Bob Smith': [(377, 'PERSON')]}\n"
]
}
],
"source": [
"q= \"Bob Smith was in the Houses of Parliament the other day\"\n",
"entities(q)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that the way that models are trained typically realises on cues from the correct capitalisation of named entities."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"-------------- entities only ---------------\n",
"{}\n"
]
}
],
"source": [
"entities(q.lower())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## polyglot\n",
"\n",
"A simplistic, and quite slow, tagger, supporting limited recognition of *Locations* (`I-LOC`), *Organizations* (`I-ORG`) and *Persons* (`I-PER`).\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"#!pip3 install polyglot\n",
"\n",
"##Mac ??\n",
"#!brew install icu4c\n",
"#I found I needed: pip3 install pyicu, pycld2, morfessor\n",
"##Linux\n",
"#apt-get install libicu-dev"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[polyglot_data] Downloading package embeddings2.en to\n",
"[polyglot_data] /Users/ajh59/polyglot_data...\n",
"[polyglot_data] Downloading package ner2.en to\n",
"[polyglot_data] /Users/ajh59/polyglot_data...\n"
]
}
],
"source": [
"!polyglot download embeddings2.en ner2.en"
]
},
{
"cell_type": "code",
"execution_count": 109,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[I-LOC(['York']),\n",
" I-LOC(['Fawdon']),\n",
" I-LOC(['Halifax']),\n",
" I-LOC(['Girvan']),\n",
" I-LOC(['Poland']),\n",
" I-PER(['Nestlé']),\n",
" I-LOC(['York']),\n",
" I-LOC(['Fawdon']),\n",
" I-LOC(['Newcastle']),\n",
" I-ORG(['EU']),\n",
" I-ORG(['EU']),\n",
" I-ORG(['Government']),\n",
" I-ORG(['GMB']),\n",
" I-LOC(['Nestlé'])]"
]
},
"execution_count": 109,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from polyglot.text import Text\n",
"\n",
"text = Text(example)\n",
"text.entities"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[I-PER(['Bob', 'Smith'])]"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Text(q).entities"
]
},
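{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each polyglot entity carries a tag and behaves like a list of words, so we can print the tag alongside the matched words (a small sketch using just the attributes visible in the output above):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Print each entity's tag alongside the words it spans\n",
"for entity in Text(q).entities:\n",
"    print(entity.tag, ' '.join(entity))"
]
},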
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Partial Matching Specific Entities\n",
"\n",
"Sometimes we may have a list of entities that we wish to match in a text. For example, suppose we have a list of MPs' names, or a list of ogranisations of subject terms identified in a thesaurus, and we want to tag a set of documents with those entities if the entity exists in the document.\n",
"\n",
"To do this, we can search a text for strings that exactly match any of the specified terms or where any of the specified terms match part of a longer string in the text.\n",
"\n",
"Naive implementations can take a signifcant time to find multiple strings within a tact, but the *Aho-Corasick* algorithm will efficiently match a large set of key values within a particular text."
]
},
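{
"cell_type": "markdown",
"metadata": {},
"source": [
"For comparison, here's a naive matcher that scans the text once per key term (a quick sketch; fine for a handful of terms, but the cost grows with the size of the term list, which is where *Aho-Corasick* helps):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#A naive multi-term matcher: one pass over the text per term\n",
"def naive_find(text, terms):\n",
"    hits = []\n",
"    for term in terms:\n",
"        start = text.find(term)\n",
"        while start != -1:\n",
"            hits.append((term, start))\n",
"            start = text.find(term, start + 1)\n",
"    return hits\n",
"\n",
"naive_find('Boris Johnson went off to Europe', ['Boris', 'Boris Johnson', 'Europe'])"
]
},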
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## The following recipe was hinted at via @pudo\n",
"\n",
"#!pip3 install pyahocorasick\n",
"#https://github.com/alephdata/aleph/blob/master/aleph/analyze/corasick_entity.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, construct an automaton that identifies the terms you want to detect in the target text."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"from ahocorasick import Automaton\n",
"\n",
"A=Automaton()\n",
"A.add_word(\"Europe\",('VOCAB','Europe'))\n",
"A.add_word(\"European Union\",('VOCAB','European Union'))\n",
"A.add_word(\"Boris Johnson\",('PERSON','Boris Johnson'))\n",
"A.add_word(\"Boris\",('PERSON','Boris Johnson'))\n",
"A.add_word(\"boris johnson\",('PERSON','Boris Johnson (LC)'))\n",
"\n",
"A.make_automaton()"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(4, ('PERSON', 'Boris Johnson')) Boris\n",
"(12, ('PERSON', 'Boris Johnson')) Boris Johnson\n",
"(31, ('VOCAB', 'Europe')) Boris Johnson went off to Europe\n",
"(60, ('VOCAB', 'Europe')) Boris Johnson went off to Europe to complain about the Europe\n",
"(68, ('VOCAB', 'European Union')) Boris Johnson went off to Europe to complain about the European Union\n"
]
}
],
"source": [
"q2='Boris Johnson went off to Europe to complain about the European Union'\n",
"for item in A.iter(q2):\n",
"    print(item, q2[:item[0]+1])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once again, case is important."
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(12, ('PERSON', 'Boris Johnson (LC)')) boris johnson\n"
]
}
],
"source": [
"q2l = q2.lower()\n",
"for item in A.iter(q2l):\n",
"    print(item, q2l[:item[0]+1])"
]
},
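{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we want case-insensitive matching without hand-adding lower-cased variants of every term, one option (a sketch, not a built-in `pyahocorasick` feature) is to build the automaton from lower-cased keys, keep the original-cased term as the payload, and search a lower-cased copy of the text:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Case-insensitive matching: lower-case the keys and the query text, keep the original term as the payload\n",
"B=Automaton()\n",
"for label, term in [('PERSON','Boris Johnson'), ('VOCAB','European Union')]:\n",
"    B.add_word(term.lower(), (label, term))\n",
"B.make_automaton()\n",
"\n",
"for item in B.iter(q2.lower()):\n",
"    print(item)"
]
},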
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can tweak the automata patterns to capture the length of the string match term, so we can annotate the text with matches more exactly:"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": [
"A=Automaton()\n",
"A.add_word(\"Europe\",(('VOCAB', len(\"Europe\")),'Europe'))\n",
"A.add_word(\"European Union\",(('VOCAB', len(\"European Union\")),'European Union'))\n",
"A.add_word(\"Boris Johnson\",(('PERSON', len(\"Boris Johnson\")),'Boris Johnson'))\n",
"A.add_word(\"Boris\",(('PERSON', len(\"Boris\")),'Boris Johnson'))\n",
"\n",
"A.make_automaton()"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(4, (('PERSON', 5), 'Boris Johnson')) *Boris* Jo\n",
"(12, (('PERSON', 13), 'Boris Johnson')) *Boris Johnson* we\n",
"(31, (('VOCAB', 6), 'Europe')) to *Europe* to\n",
"(60, (('VOCAB', 6), 'Europe')) he *Europe*an \n",
"(68, (('VOCAB', 14), 'European Union')) he *European Union*\n"
]
}
],
"source": [
"for item in A.iter(q2):\n",
"    start=item[0]-item[1][0][1]+1\n",
"    end=item[0]+1\n",
"    print(item, '{}*{}*{}'.format(q2[start-3:start],q2[start:end],q2[end:end+3]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Fuzzy Matching\n",
"\n",
"Whilst the *Aho-Corasick* approach will return hits for strings in the text that partially match the exact match key terms, sometimes we want to know whether there are terms in a text that *almost* match terms in specific set of terms."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Imagine a situation where we have managed to extract arbitrary named entities from a text, but they do not match strings in a specified list in an exact or partially exact way. Our next step might be to attempt to further match those entities in a *fuzzy* way with entities in a specified list."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### `fuzzyset`\n",
"The python [`fuzzyset`](https://github.com/axiak/fuzzyset) package will try to match a specified string to similar strings in a list of target strings, returning a single item from a specified target list that best matches the provided term.\n",
"\n",
"\n",
"For example, if we extract the name *Boris Johnstone* in a text, we might then try to further match that string, in a fuzzy way, with a list of correctly spelled MP names.\n",
"\n",
"A confidence value expresses the degree of match to terms in the fuzzy match set list."
]
},
{
"cell_type": "code",
"execution_count": 80,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"([(0.8666666666666667, 'Boris Johnson')],\n",
" [(0.8333333333333334, 'Diane Abbott')],\n",
" [(0.23076923076923073, 'Diane Abbott')])"
]
},
"execution_count": 80,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import fuzzyset\n",
"\n",
"fz = fuzzyset.FuzzySet()\n",
"#Create a list of terms we would like to match against in a fuzzy way\n",
"for l in [\"Diane Abbott\", \"Boris Johnson\"]:\n",
"    fz.add(l)\n",
"\n",
"#Now see if our sample term fuzzy matches any of those specified terms\n",
"sample_term='Boris Johnstone'\n",
"fz.get(sample_term), fz.get('Diana Abbot'), fz.get('Joanna Lumley')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### `fuzzywuzzy`\n",
"If we want to try to find a fuzzy match for a term *within* a text, we can use the python [`fuzzywuzzy`](https://github.com/seatgeek/fuzzywuzzy) library. Once again, we spcify a list of target items we want to try to match against."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"from fuzzywuzzy import process\n",
"from fuzzywuzzy import fuzz"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('Houses of Parliament', 90), ('Diane Abbott', 90), ('Boris Johnson', 86)]"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"terms=['Houses of Parliament', 'Diane Abbott', 'Boris Johnson']\n",
"\n",
"q= \"Diane Abbott, Theresa May and Boris Johnstone were in the Houses of Parliament the other day\"\n",
"process.extract(q,terms)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By default, we get match confidence levels for each term in the target match set, although we can limit the response to a maximum number of matches:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('Houses of Parliament', 90), ('Boris Johnson', 85)]"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"process.extract(q,terms,scorer=fuzz.partial_ratio, limit=2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A range of fuzzy match scroing algorithms are supported:\n",
"\n",
"- `WRatio` - measure of the sequences' similarity between 0 and 100, using different algorithms\n",
"- `QRatio` - Quick ratio comparison between two strings\n",
"- `UWRatio` - a measure of the sequences' similarity between 0 and 100, using different algorithms. Same as WRatio but preserving unicode\n",
"- `UQRatio` - Unicode quick ratio\n",
"- `ratio` - \n",
"- `partial_ratio - ratio of the most similar substring as a number between 0 and 100\n",
"- `token_sort_ratio` - a measure of the sequences' similarity between 0 and 100 but sorting the token before comparing\n",
"- `partial_token_set_ratio` - \n",
"- `partial_token_sort_ratio` - ratio of the most similar substring as a number between 0 and 100 but sorting the token before comparing\n"
]
},
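{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example (a minimal sketch using two of the scorers above), reordering the words in a name hits `ratio` much harder than `token_sort_ratio`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Compare a couple of scorers on the same pair of strings\n",
"print(fuzz.ratio('Boris Johnson', 'Johnson Boris'))\n",
"print(fuzz.token_sort_ratio('Boris Johnson', 'Johnson Boris'))"
]
},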
{
"cell_type": "markdown",
"metadata": {},
"source": [
"More usefully, perhaps, is to return items that match above a particular confidence level:"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('Houses of Parliament', 90), ('Diane Abbott', 90)]"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"process.extractBests(q,terms,score_cutoff=90)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"However, one problem with the `fuzzywuzzy` matcher is that it doesn't tell us where in the supplied text string the match occurred, or what string in the text was matched."
]
},
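{
"cell_type": "markdown",
"metadata": {},
"source": [
"One rough workaround (a sketch, not a `fuzzywuzzy` feature) is to slide a window of word n-grams over the text, score each window against the target term with `fuzz.ratio`, and keep the best scoring span along with its character offset:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#A sketch: locate the best fuzzy match for a term by scoring word n-gram windows of the text\n",
"#(assumes words in the text are separated by single spaces, as in the example q above)\n",
"def best_fuzzy_span(text, term, window=2):\n",
"    words = text.split()\n",
"    best = (0, None, None)  #(score, matched substring, character offset)\n",
"    for i in range(len(words) - window + 1):\n",
"        candidate = ' '.join(words[i:i + window])\n",
"        score = fuzz.ratio(term, candidate)\n",
"        if score > best[0]:\n",
"            best = (score, candidate, text.find(candidate))\n",
"    return best\n",
"\n",
"best_fuzzy_span(q, 'Boris Johnson')"
]
},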
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `fuzzywuzzy` package can also be used to try to deduplicate a list of items, returning the longest item in the duplicate list. (It might be more useful if this is optionally the *first* item in the original list?)"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [],
"source": [
"names=['Diane Abbott', 'Boris Johnson','Boris Johnstone','Diana Abbot', 'Boris Johnston','Joanna Lumley']"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['Joanna Lumley', 'Boris Johnstone', 'Diane Abbott']"
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"process.dedupe(names, threshold=80)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It might also be useful to see the candidate strings associated with each deduped item, treating the first item in the list as the canonical one:"
]
},
{
"cell_type": "code",
"execution_count": 131,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[('Diane Abbott', 100), ('Diana Abbot', 87)]\n",
"[('Boris Johnson', 100), ('Boris Johnstone', 93), ('Boris Johnston', 96)]\n"
]
}
],
"source": [
"import hashlib\n",
"\n",
"clusters={}\n",
"fuzzed=[]\n",
"for t in names:\n",
" fuzzyset=process.extractBests(t,names,score_cutoff=85)\n",
" #Generate a key based on the sorted members of the set\n",
" keyvals=sorted(set([x[0] for x in fuzzyset]),key=lambda x:names.index(x),reverse=False)\n",
" keytxt=''.join(keyvals)\n",
" key=hashlib.md5(keytxt).hexdigest()\n",
" if len(keyvals)>1 and key not in fuzzed:\n",
" clusters[key]=sorted(set([x for x in fuzzyset]),key=lambda x:names.index(x[0]),reverse=False)\n",
" fuzzed.append(key)\n",
"for cluster in clusters:\n",
" print(clusters[cluster])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## OpenRefine Clustering\n",
"\n",
"As well as running as a browser accessed application, [OpenRefine](http://openrefine.org/) also runs as a service that can be accessed from Python using the [refine-client.py](https://github.com/PaulMakepeace/refine-client-py) client libary.\n",
"\n",
"In particular, we can use the OpenRefine service to cluster fuzzily matched items within a list of items."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"#!pip install git+https://github.com/PaulMakepeace/refine-client-py.git\n",
"#NOTE - this requires a python 2 kernel"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"#Initialise the connection to the server using default or environment variable defined server settings\n",
"#REFINE_HOST = os.environ.get('OPENREFINE_HOST', os.environ.get('GOOGLE_REFINE_HOST', '127.0.0.1'))\n",
"#REFINE_PORT = os.environ.get('OPENREFINE_PORT', os.environ.get('GOOGLE_REFINE_PORT', '3333'))\n",
"from google.refine import refine, facet\n",
"server = refine.RefineServer()\n",
"orefine = refine.Refine(server)"
]
},
{
"cell_type": "code",
"execution_count": 133,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Name\r\n",
"Diane Abbott\r\n",
"Boris Johnson\r\n",
"Boris Johnstone\r\n",
"Diana Abbot\r\n",
"Boris Johnston\r\n",
"Joanna Lumley\r\n",
"Boris Johnstone\r\n"
]
}
],
"source": [
"#Create an example CSV file to load into a test OpenRefine project\n",
"project_file = 'simpledemo.csv'\n",
"with open(project_file,'w') as f:\n",
"    for t in ['Name']+names+['Boris Johnstone']:\n",
"        f.write(t+ '\\n')\n",
"!cat {project_file}"
]
},
{
"cell_type": "code",
"execution_count": 134,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[u'Name']"
]
},
"execution_count": 134,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"p=orefine.new_project(project_file=project_file)\n",
"p.columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"OpenRefine supports a range of clustering functions:\n",
"\n",
"```\n",
"- clusterer_type: binning; function: fingerprint|metaphone3|cologne-phonetic\n",
"- clusterer_type: binning; function: ngram-fingerprint; params: {'ngram-size': INT}\n",
"- clusterer_type: knn; function: levenshtein|ppm; params: {'radius': FLOAT,'blocking-ngram-size': INT}\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 136,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[{'count': 1, 'value': u'Diana Abbot'}, {'count': 1, 'value': u'Diane Abbott'}]\n",
"[{'count': 2, 'value': u'Boris Johnstone'}, {'count': 1, 'value': u'Boris Johnston'}]\n"
]
}
],
"source": [
"clusters=p.compute_clusters('Name',clusterer_type='binning',function='cologne-phonetic')\n",
"for cluster in clusters:\n",
"    print(cluster)"
]
},
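{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also try a `knn` clusterer; the following sketch assumes `compute_clusters()` passes the `params` dict through to OpenRefine as suggested by the function list above (argument handling may vary across refine-client.py versions):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#A sketch: nearest-neighbour clustering using Levenshtein distance\n",
"clusters=p.compute_clusters('Name',clusterer_type='knn',function='levenshtein',params={'radius':2})\n",
"for cluster in clusters:\n",
"    print(cluster)"
]
},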
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Topic Models\n",
"\n",
"Topic models are statistical models that attempts to categorise different \"topics\" that occur across a set of docments.\n",
"\n",
"Several python libraries provide a simple interface for the generation of topic models from text contained in multiple documents."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### `gensim`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#!pip3 install gensim"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [],
"source": [
"#https://github.com/sgsinclair/alta/blob/e5bc94f7898b3bcaf872069f164bc6534769925b/ipynb/TopicModelling.ipynb\n",
"from gensim import corpora, models\n",
"\n",
"def get_lda_from_lists_of_words(lists_of_words, **kwargs):\n",
" dictionary = corpora.Dictionary(lists_of_words) # this dictionary maps terms to integers\n",
" corpus = [dictionary.doc2bow(text) for text in lists_of_words] # create a bag of words from each document\n",
" tfidf = models.TfidfModel(corpus) # this models the significance of words using term frequency inverse document frequency\n",
" corpus_tfidf = tfidf[corpus]\n",
" kwargs[\"id2word\"] = dictionary # set the dictionary\n",
" return models.LdaModel(corpus_tfidf, **kwargs) # do the LDA topic modelling\n",
"\n",
"def print_top_terms(lda, num_terms=10):\n",
" txt=[]\n",
" num_terms=min([num_terms,lda.num_topics])\n",
" for i in range(0, num_terms):\n",
" terms = [term for term,val in lda.show_topic(i,num_terms)]\n",
" txt.append(\"\\t - top {} terms for topic #{}: {}\".format(num_terms,i,' '.join(terms)))\n",
" return '\\n'.join(txt)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To start with, let's create a list of dummy documents and then generate word lists for each document."
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {},
"outputs": [],
"source": [
"docs=['The banks still have a lot to answer for the financial crisis.',\n",
" 'This MP and that Member of Parliament were both active in the debate.',\n",
" 'The companies that work in finance need to be responsible.',\n",
" 'There is a reponsibility incumber on all participants for high quality debate in Parliament.',\n",
" 'Corporate finance is a big responsibility.']\n",
"\n",
"#Create lists of words from the text in each document\n",
"from nltk.tokenize import word_tokenize\n",
"docs = [ word_tokenize(doc.lower()) for doc in docs ]\n",
"\n",
"#Remove stop words from the wordlists\n",
"from nltk.corpus import stopwords\n",
"docs = [ [word for word in doc if word not in stopwords.words('english') ] for doc in docs ]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can generate the topic models from the list of word lists."
]
},
{
"cell_type": "code",
"execution_count": 82,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\t - top 3 terms for topic #0: parliament debate active\n",
"\t - top 3 terms for topic #1: responsible work need\n",
"\t - top 3 terms for topic #2: corporate big responsibility\n"
]
}
],
"source": [
"topicsLda = get_lda_from_lists_of_words([s for s in docs if isinstance(s,list)], num_topics=3, passes=20)\n",
"print( print_top_terms(topicsLda))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The model is randomised - if we run it again we are likely to get a different result."
]
},
{
"cell_type": "code",
"execution_count": 83,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\t - top 3 terms for topic #0: finance corporate responsibility\n",
"\t - top 3 terms for topic #1: participants quality high\n",
"\t - top 3 terms for topic #2: member mp active\n"
]
}
],
"source": [
"topicsLda = get_lda_from_lists_of_words([s for s in docs if isinstance(s,list)], num_topics=3, passes=20)\n",
"print( print_top_terms(topicsLda))"
]
},
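{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we want repeatable results, one option (assuming a gensim version that accepts it) is to pass a fixed `random_state` through the keyword arguments to `LdaModel`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Fix the random seed so repeated runs give the same topics\n",
"#(random_state is passed through **kwargs to models.LdaModel and needs a reasonably recent gensim)\n",
"topicsLda = get_lda_from_lists_of_words([s for s in docs if isinstance(s,list)], num_topics=3, passes=20, random_state=42)\n",
"print( print_top_terms(topicsLda))"
]
},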
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}