Created
December 4, 2017 16:12
Entity Highlighting in Context
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Entity Highlighting in Context"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The visualization tools in the [spaCy](https://spacy.io/) natural language toolkit can display entity annotations for an entire document.\n",
    "Here we highlight only those sentences in the document that contain the specified entities.\n",
    "\n",
    "(You will have to [install the large English language model](https://spacy.io/usage/models) separately.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 120,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import spacy\n",
    "from spacy import displacy\n",
    "from itertools import groupby\n",
    "\n",
    "nlp = spacy.load(\"en_core_web_lg\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following function displays all the sentences in a parsed document containing the specified entity types. If no entity types are specified, all entities are highlighted. If a sentence does not contain any entities of interest, it is not displayed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 121,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def entities_in_context(doc, *entity_types):\n",
    "    def highlight_entity(entity_label):\n",
    "        # With no entity types specified, every entity is highlighted.\n",
    "        if not entity_types:\n",
    "            return True\n",
    "        return entity_label in entity_types\n",
    "\n",
    "    # doc.ents is in document order, so entities from the same sentence are adjacent.\n",
    "    selected = [(entity.sent, entity) for entity in doc.ents if highlight_entity(entity.label_)]\n",
    "    for context, group in groupby(selected, key=lambda t: t[0]):\n",
    "        # Manual mode expects character offsets relative to the rendered text,\n",
    "        # so both offsets are measured from the start of the sentence.\n",
    "        entities = [{\"start\": entity.start_char - context.start_char,\n",
    "                     \"end\": entity.end_char - context.start_char,\n",
    "                     \"label\": entity.label_} for _, entity in group]\n",
    "        context_document = {\"text\": str(context), \"ents\": entities, \"title\": None}\n",
    "        displacy.render(context_document, style=\"ent\", jupyter=True, manual=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following document consists of three sentences, two of which contain dates."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 122,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "text = u\"\"\"Miles Davis was born on May 26, 1926 and died on September 28, 1991.\n",
    " He was a world-renowned musician.\n",
    " His album Kind of Blue was released on August 17, 1959.\"\"\"\n",
    "doc = nlp(text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Display only those sentences that contain DATE or PERSON entities. Note that the second sentence in the document is not displayed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "entities_in_context(doc, \"DATE\", \"PERSON\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
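
The grouping and offset arithmetic behind `entities_in_context` can be tried without loading a spaCy model. The following is a minimal sketch, not the notebook's code: `sentence_relative_entities` and `span` are hypothetical helpers, and the sentence and entity spans are hand-written character offsets standing in for what a model might return. It builds the same kind of dictionary that displaCy's manual mode accepts, with entity offsets measured from the start of each sentence.

```python
from itertools import groupby


def sentence_relative_entities(text, sentences, entities, *entity_types):
    """Group entity spans by sentence and rebase their offsets onto the sentence.

    sentences: (start_char, end_char) pairs in document order.
    entities: (start_char, end_char, label) triples in document order.
    Returns one manual-mode style dict per sentence that has a selected entity.
    """
    def containing_sentence(start, end):
        for sent_start, sent_end in sentences:
            if sent_start <= start and end <= sent_end:
                return (sent_start, sent_end)
        return None

    # Keep only the requested labels (or everything if none were given).
    selected = []
    for start, end, label in entities:
        if entity_types and label not in entity_types:
            continue
        sentence = containing_sentence(start, end)
        if sentence is not None:
            selected.append((sentence, (start, end, label)))

    # Entities are in document order, so groupby sees each sentence once.
    documents = []
    for (sent_start, sent_end), group in groupby(selected, key=lambda pair: pair[0]):
        ents = [{"start": start - sent_start, "end": end - sent_start, "label": label}
                for _, (start, end, label) in group]
        documents.append({"text": text[sent_start:sent_end], "ents": ents, "title": None})
    return documents


# Hypothetical stand-in data: spans are located by substring search, not by a model.
text = ("Miles Davis was born on May 26, 1926. "
        "He was a world-renowned musician. "
        "Kind of Blue was released on August 17, 1959.")


def span(substring):
    start = text.index(substring)
    return (start, start + len(substring))


sentences = [span("Miles Davis was born on May 26, 1926."),
             span("He was a world-renowned musician."),
             span("Kind of Blue was released on August 17, 1959.")]
entities = [span("Miles Davis") + ("PERSON",),
            span("May 26, 1926") + ("DATE",),
            span("Kind of Blue") + ("WORK_OF_ART",),
            span("August 17, 1959") + ("DATE",)]

for document in sentence_relative_entities(text, sentences, entities, "DATE", "PERSON"):
    for ent in document["ents"]:
        # Slicing the sentence text with the rebased offsets recovers the entity text.
        print(ent["label"], repr(document["text"][ent["start"]:ent["end"]]))
```

Slicing each sentence with its rebased offsets is a quick sanity check: if the slice does not reproduce the entity text, the offsets were measured from the wrong origin.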