{
"cells": [
{
"cell_type": "markdown",
"id": "3c38912a",
"metadata": {},
"source": [
"# What is it with spaCy and biblical names?\n",
"This notebook highlights an issue with spaCy and other Named Entity Recognition models not being able to accurately detect person names, especially if they are biblical names. The detection differences between regular names and biblical names are quite overwhelming.\n",
"I tried to get to the bottom of this and believe I have an answer. But first, let's do a short experiment with two spaCy models (using spaCy version 3.0.5)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7ebc1277",
"metadata": {},
"outputs": [],
"source": [
"from typing import List\n",
"import itertools\n",
"import pprint\n",
"\n",
"import spacy\n",
"from spacy.lang.en import English\n",
"import pandas as pd\n",
"\n",
"\n",
"spacy.__version__"
]
},
{
"cell_type": "markdown",
"id": "aa05d314",
"metadata": {},
"source": [
"## Compare detection rates of biblical vs. other names\n",
"\n",
"Why is there a difference in the first place?\n",
"The reason for the different detection rates could arise from:\n",
"1. The fact that biblical names are sometimes older and less common (therefore might be less frequent in the dataset the model was trained on).\n",
"2. That the surrounding sentence is less likely to co-occur with the specific name.\n",
"3. Issue with the dataset itself (over/under representation, labeling errors and more).\n",
"\n",
"To (simplistically) test hypotheses 1 and 2, we compared biblical names with both old and new names, and three templates:\n",
"- \"My name is X\"\n",
"- \"And X said, Why hast thou troubled us?\". \n",
"- \"And she conceived again, a bare a son; and she called his name X.\"\n",
"\n",
"Let's start by creating name lists and templates:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ca0ea318",
"metadata": {},
"outputs": [],
"source": [
"biblical_names = [\"David\", \"Moses\", \"Abraham\", \"Samuel\", \"Jacob\", \n",
" \"Isaac\", \"Jesus\", \"Matthew\", \n",
" \"John\", \"Judas\",\"Simon\", \"Mary\"] # Random biblical names\n",
"\n",
"other_names = [\"Beyonce\", \"Ariana\", \"Katy\", # Singers\n",
" \"Michael\", \"Lebron\", \"Coby\", # NBA players\n",
" \"William\", \"Charles\",\"Robert\", \"Margaret\",\"Frank\", \"Helen\", # Popular (non biblical) names in 1900 (https://www.ssa.gov/oact/babynames/decades/names1900s.html)\n",
" \"Ronald\", \"George\", \"Bill\", \"Barack\", \"Donald\", \"Joe\" # Presidents\n",
" ]\n",
"\n",
"template1 = \"My name is {}\"\n",
"template2 = \"And {} said, Why hast thou troubled us?\"\n",
"template3 = \"And she conceived again, a bare a son; and she called his name {}.\"\n",
"\n",
"name_sets = {\"Biblical\": biblical_names, \"Other\": other_names}\n",
"templates = (template1, template2, template3)"
]
},
{
"cell_type": "markdown",
"id": "86fd9bc2",
"metadata": {},
"source": [
"Method for running the spaCy model and checking if \"PERSON\" was detected."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e6a700b2",
"metadata": {},
"outputs": [],
"source": [
"def names_recall(nlp: spacy.lang.en.English, names: List[str], template: str):\n",
" \"\"\"\n",
" Run the spaCy NLP model on the template + name, \n",
" calculate recall for detecting the \"PERSON\" entity \n",
" and return a detailed list of detection\n",
" :param nlp: spaCy nlp model\n",
" :param names: list of names to run model on\n",
" :param template: sentence with placeholder for name (e.g. \"He calls himself {}\")\n",
" \"\"\"\n",
" results = {}\n",
" for name in names:\n",
" doc = nlp(template.format(name))\n",
" name_token = [token for token in doc if token.text == name][0]\n",
" results[name] = name_token.ent_type_ == \"PERSON\"\n",
" recall = sum(results.values()) / len(results)\n",
" print(f\"Recall: {recall:.2f}\\n\")\n",
" return results"
]
},
{
"cell_type": "markdown",
"id": "987888c8",
"metadata": {},
"source": [
"#### Model 1: spaCy's `en_core_web_lg` model\n",
"\n",
"- This model uses the original (non-transformers based) spaCy architecture. \n",
"- It was trained on the [OntoNotes 5.0](https://catalog.ldc.upenn.edu/LDC2013T19) dataset and features [0.86 F-measure on named entities](https://spacy.io/models/en#en_core_web_lg).\n",
"\n",
"Load the model:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6874cb5c",
"metadata": {},
"outputs": [],
"source": [
"en_core_web_lg = spacy.load(\"en_core_web_lg\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2e4751cc",
"metadata": {},
"outputs": [],
"source": [
"detailed_results = {}\n",
"nlp = en_core_web_lg\n",
"\n",
"print(\"Model name: en_core_web_lg\")\n",
"for template, name_set in itertools.product(templates, name_sets.items()):\n",
" print(f\"Name set: {name_set[0]}, Template: \\\"{template}\\\"\")\n",
" results = names_recall(nlp, name_set[1], template)\n",
" detailed_results[template, name_set[0]] = results\n",
"\n",
"print(\"\\nDetailed results:\")\n",
"pprint.pprint(detailed_results)"
]
},
{
"cell_type": "markdown",
"id": "f2bff4d4",
"metadata": {},
"source": [
"So there's a pretty big difference between biblical names detection and other names. \n",
"\n",
"#### Model 2: spaCy's `en_core_web_trf` model\n",
"\n",
"spaCy recently released a new model, `en_core_web_trf`, based on the huggingface transformers library, and also trained on OntoNotes 5. \n",
"\n",
"Let's try this model:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "348894ad",
"metadata": {},
"outputs": [],
"source": [
"nlp = spacy.load(\"en_core_web_trf\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6c9a0812",
"metadata": {},
"outputs": [],
"source": [
"detailed_results = {}\n",
"print(\"Model name: en_core_web_trf\")\n",
"for template, name_set in itertools.product(templates, name_sets.items()):\n",
" print(f\"Name set: {name_set[0]}, Template: \\\"{template}\\\"\")\n",
" results = names_recall(nlp, name_set[1], template)\n",
" detailed_results[template, name_set[0]] = results\n",
"\n",
"print(\"Detailed results:\")\n",
"pprint.pprint(detailed_results)\n"
]
},
{
"cell_type": "markdown",
"id": "f6888098",
"metadata": {},
"source": [
"Although the numbers are different, we still see a difference between the two sets. However, this time it seems that old names (like Helen, William or Charles) are something the model is also struggling with.\n"
]
},
{
"cell_type": "markdown",
"id": "25e9c173",
"metadata": {},
"source": [
"Let's double check our results on a few samples:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e0be2b49",
"metadata": {},
"outputs": [],
"source": [
"name = \"Simon\"\n",
"doc=nlp(f\"My name is {name}\")\n",
"print(f\"Name = {name}. Detected entities: {doc.ents}\")\n",
"\n",
"name = \"Katy\"\n",
"doc=nlp(f\"My name is {name}\")\n",
"print(f\"Name = {name}. Detected entities: {doc.ents}\")\n",
"\n",
"name = \"Moses\"\n",
"doc=nlp(f\"This is what God said to {name}\")\n",
"print(f\"Name = {name}. Detected entities: {doc.ents}\")\n",
"\n",
"name = \"Ronald\"\n",
"doc=nlp(f\"This is what God said to {name}\")\n",
"print(f\"Name = {name}. Detected entities: {doc.ents}\")\n"
]
},
{
"cell_type": "markdown",
"id": "a88a1e94",
"metadata": {},
"source": [
"### So what's going on here?\n",
"\n",
"As part of our work on [Presidio](https://aka.ms/presidio) (a tool for data-deidentification), we develop models to detect PII entities. For that purpose, [we extract template sentences](https://aka.ms/presidio-research) out of existing NER datasets, including CONLL03 and OntoNotes 5. The idea is to augment these datasets with additional entity values, for better coverage of names, cultures and ethnicities. In other words, every time we see a sentence with a tagged person name on a dataset, we extract a template sentence (e.g. `The name is [LAST_NAME], [FIRST_NAME] [LAST_NAME]`) and later replace it with multiple samples each containing different first and last names. \n",
"\n",
"When we manually went over the templating results, we figured out that there are still many names in our new templates dataset which didn't turn into templates. A majority of these names came from the biblical sentences that OntoNotes 5 contains. So many of the samples in the OntoNotes 5 did not contain any PERSON labels, even though they did contain names, an entity type the OntoNotes dataset claims to support. It seems like these models actually learn the errors in the dataset, in this case to ignore names if they are biblical.\n",
"\n",
"Obviously, these errors are found in both the train and test set, so a model that would learn that biblical names are not really names would also succeed on a similar test set. This is yet another example why [SOTA](https://paperswithcode.com/sota/named-entity-recognition-ner-on-ontonotes-v5) results are not necessarily the best way to show progress in science.\n"
]
},
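{
"cell_type": "markdown",
"id": "b3d7a1c2",
"metadata": {},
"source": [
"To make the templating idea concrete, here is a minimal, hypothetical sketch of that process (not the actual Presidio code): take a sentence whose `PERSON` character spans are known, replace each span with a placeholder, and then instantiate the template with new names."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c4e8f2a9",
"metadata": {},
"outputs": [],
"source": [
"# Minimal, hypothetical sketch of the templating idea (not the actual Presidio code).\n",
"# Given a sentence and the character spans of its PERSON entities, replace each\n",
"# span with a placeholder, then fill the template with new names.\n",
"\n",
"def to_template(text: str, person_spans: List[tuple]) -> str:\n",
"    \"\"\"Replace each (start, end) PERSON span with a [PERSON] placeholder.\"\"\"\n",
"    # Replace from the end of the string so earlier offsets stay valid.\n",
"    for start, end in sorted(person_spans, reverse=True):\n",
"        text = text[:start] + \"[PERSON]\" + text[end:]\n",
"    return text\n",
"\n",
"\n",
"def fill_template(template: str, names: List[str]) -> List[str]:\n",
"    \"\"\"Create one sample per name by filling the placeholder.\"\"\"\n",
"    return [template.replace(\"[PERSON]\", name) for name in names]\n",
"\n",
"\n",
"template = to_template(\"And Simon said, Why hast thou troubled us?\", [(4, 9)])\n",
"print(template)\n",
"print(fill_template(template, [\"Beyonce\", \"Moses\", \"Margaret\"]))"
]
},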
{
"cell_type": "markdown",
"id": "26bcb269",
"metadata": {},
"source": [
"## Is it only spaCy?\n",
"A similar evaluation on two Flair models show that the a model trained on OntoNotes achieves significantly lower results on this test. The CONLL based model actually does pretty well!"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3086d41c",
"metadata": {},
"outputs": [],
"source": [
"from flair.data import Sentence\n",
"from flair.models import SequenceTagger\n",
"from flair.tokenization import SpacyTokenizer\n",
"\n",
"ner_ontonotes = SequenceTagger.load('ner-ontonotes')\n",
"\n",
"print(\"Model name: ner-ontonotes\")\n",
"for template, name_set in itertools.product(templates, name_sets.items()):\n",
" print(f\"Name set: {name_set[0]}, Template: \\\"{template}\\\"\")\n",
" results = {}\n",
" for name in name_set[1]:\n",
" sentence = Sentence(text=template.format(name))\n",
" ner_ontonotes.predict(sentence)\n",
" name_token = [token for token in sentence if token.text == name][0]\n",
" results[name] = 'PER' in name_token.get_tag('ner').value\n",
" recall = sum(results.values()) / len(results)\n",
" print(f\"Recall: {recall:.2f}\\n\")\n",
" #pprint.pprint(results)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c4151742",
"metadata": {},
"outputs": [],
"source": [
"ner_conll = SequenceTagger.load('flair/ner-english')\n",
"\n",
"print(\"Model name: ner-english (CONLL)\")\n",
"for template, name_set in itertools.product(templates, name_sets.items()):\n",
" print(f\"Name set: {name_set[0]}, Template: \\\"{template}\\\"\")\n",
" results = {}\n",
" for name in name_set[1]:\n",
" sentence = Sentence(text=template.format(name))\n",
" ner_conll.predict(sentence)\n",
" name_token = [token for token in sentence if token.text == name][0]\n",
" results[name] = 'PER' in name_token.get_tag('ner').value\n",
" recall = sum(results.values()) / len(results)\n",
" print(f\"Recall: {recall:.2f}\\n\")\n",
" #pprint.pprint(results)\n"
]
},
{
"cell_type": "markdown",
"id": "91c6f6ee",
"metadata": {},
"source": [
"Using Flair models, it seems that the CONLL based model is superior to OntoNotes based model for this specific test."
]
},
{
"cell_type": "markdown",
"id": "8da8f45e",
"metadata": {},
"source": [
"## Conclusion\n",
"\n",
"First, I'd like to say that this is by no means a complaint to the developers and contributors of spaCy. spaCy is one of the most exciting things happening in NLP today and it's considered one of the most mature, accurate, fast and well documented NLP libraries in the world. As shown with the Flair example, this is an inherent problem in ML models and especially ML datasets.\n",
"\n",
"\n",
"Three relevant pointers to conclude:\n",
"\n",
"1. Andrew NG recently argued that [the ML community should be more data-centric and less model-centric](https://analyticsindiamag.com/big-data-to-good-data-andrew-ng-urges-ml-community-to-be-more-data-centric-and-less-model-centric/). This post is another example of why this is true.\n",
"2. This is another example of an [issue with a major ML dataset](https://www.csail.mit.edu/news/major-ml-datasets-have-tens-thousands-errors).\n",
"3. A tool like [Checklist](https://github.com/marcotcr/checklist) is really helpful to validate that your model or data doesn't suffer from similar issues. Make sure you check it out.\n"
]
}
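,
{
"cell_type": "markdown",
"id": "d9f3b7e1",
"metadata": {},
"source": [
"Below is a minimal sketch of using Checklist's `Editor` to generate such test sentences from a template and Checklist's built-in `first_name` lexicon. This assumes the `checklist` package is installed; the exact API may vary between versions."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e7a2c5f8",
"metadata": {},
"outputs": [],
"source": [
"# Sketch only: assumes `pip install checklist`; API details may differ by version.\n",
"from checklist.editor import Editor\n",
"\n",
"editor = Editor()\n",
"# Fill the template with names sampled from Checklist's built-in first_name lexicon.\n",
"ret = editor.template(\"My name is {first_name}\", nsamples=5)\n",
"for sample in ret.data:\n",
"    print(sample)"
]
}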
],
"metadata": {
"kernelspec": {
"display_name": "PyCharm (presidio-research)",
"language": "python",
"name": "pycharm-c8930cf3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.8"
}
},
"nbformat": 4,
"nbformat_minor": 5
}