@dan-zheng
Last active June 12, 2017 17:20
{
"cells": [
{
"metadata": {},
"cell_type": "markdown",
"source": "# Data Augmentation: Near Words Strategy\n\nThese are examples of data augmentation using a \"near words\" perturbation strategy.\n\n`spaCy` identifies the named entities in a sentence. For each named entity, `sense2vec` finds similar entities, which are then filtered by a similarity threshold. Each perturbation replaces the original entity with one of the remaining similar entities.\n\nNext steps:\n- Generate perturbations that contain multiple near words at once (combinatorics)\n- Use a probabilistic approach: the similarities can serve as probabilities, so the probability of a perturbation with multiple near words is the product of their similarities\n- Allow fine-grained thresholds for different entities\n- Enable perturbations for words that are not named entities (verbs, adjectives, adverbs)\n\nNotes:\n- The `sense2vec` model is trained on a corpus of Reddit comments from 2015. A model trained on a different corpus (e.g. Wikipedia) may work better; this is worth testing."
},
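{
"metadata": {},
"cell_type": "markdown",
"source": "The filter-and-replace step described above could look roughly like the sketch in the next cell. It does not call `spaCy` or `sense2vec`: the `(key, score)` list is copied from Example 2 below, and the helper names `near_words` and `perturb` are hypothetical, not taken from the original code."
},
{
"metadata": {},
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# Sketch only: the similarity list is hard-coded instead of queried from sense2vec.\n",
"def near_words(similar, original_key, threshold=0.8):\n",
"    # Strip the '|TAG' suffix and turn underscores back into spaces.\n",
"    original = original_key.split('|')[0].replace('_', ' ')\n",
"    words = set()\n",
"    for key, score in similar:\n",
"        word = key.split('|')[0].replace('_', ' ')\n",
"        # Keep entries at or above the threshold that differ from the original text.\n",
"        if score >= threshold and word != original:\n",
"            words.add(word)\n",
"    return words\n",
"\n",
"\n",
"def perturb(sentence, entity_text, words):\n",
"    # One perturbed sentence per near word, using plain substring replacement.\n",
"    return [sentence.replace(entity_text, word) for word in sorted(words)]\n",
"\n",
"\n",
"similar = [\n",
"    ('Udacity|GPE', 1.0000001192092896),\n",
"    ('Coursera|PERSON', 0.8691951036453247),\n",
"    ('Coursera|ORG', 0.866468608379364),\n",
"    ('Udacity|ORG', 0.8635085225105286),\n",
"    ('Coursera|GPE', 0.8596188426017761),\n",
"    ('coursera|NOUN', 0.826501727104187),\n",
"    ('EdX|NOUN', 0.8229740858078003),\n",
"    ('edX|NOUN', 0.8110880851745605),\n",
"    ('Udacity|NOUN', 0.797046422958374),\n",
"    ('Udemy|GPE', 0.7920954823493958),\n",
"]\n",
"sentence = 'Elon Musk is secretly meeting in San Francisco with executives at Udacity.'\n",
"for line in perturb(sentence, 'Udacity', near_words(similar, 'Udacity|GPE')):\n",
"    print(line)"
]
},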
{
"metadata": {},
"cell_type": "markdown",
"source": "## Example 1"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "```\nOriginal sentence: `Beijing is the favorite city of Yao Ming.`\nSimilarity threshold: 0.8\n\nPerturbations:\n\nOriginal entity: `Beijing|GPE`\nSimilar entities: `[('Beijing|GPE', 0.9999998807907104), ('Shanghai|GPE', 0.9067670106887817), ('Guangzhou|GPE', 0.900281548500061), ('Taipei|GPE', 0.897108793258667), ('Seoul|GPE', 0.8812267780303955), ('Hong_Kong|GPE', 0.8809748888015747), ('Tianjin|GPE', 0.8762707114219666), ('Hangzhou|GPE', 0.8597891330718994), ('Shenzhen|GPE', 0.8579272627830505), ('Chengdu|GPE', 0.8511044979095459)]`\nSimilar entities (as words): `{'Hong Kong', 'Chengdu', 'Tianjin', 'Guangzhou', 'Taipei', 'Hangzhou', 'Shenzhen', 'Shanghai', 'Seoul'}`\nHong Kong is the favorite city of Yao Ming.\nChengdu is the favorite city of Yao Ming.\nTianjin is the favorite city of Yao Ming.\nGuangzhou is the favorite city of Yao Ming.\nTaipei is the favorite city of Yao Ming.\nHangzhou is the favorite city of Yao Ming.\nShenzhen is the favorite city of Yao Ming.\nShanghai is the favorite city of Yao Ming.\nSeoul is the favorite city of Yao Ming.\n\nOriginal entity: `Yao_Ming|PERSON`\nSimilar entities: `[('Yao_Ming|PERSON', 1.0), ('Steve_Nash|PERSON', 0.8354597091674805), ('Michael_Jordan|PERSON', 0.8310054540634155), ('Shaq|NOUN', 0.8274571299552917), ('Yao|PERSON', 0.8235625624656677), ('Lebron_James|PERSON', 0.8233107924461365), ('Bill_Russell|PERSON', 0.8205811381340027), ('Ben_Wallace|PERSON', 0.818983793258667), ('Karl_Malone|PERSON', 0.8135460019111633), ('Allen_Iverson|PERSON', 0.8113048672676086)]`\nSimilar entities (as words): `{'Shaq', 'Lebron James', 'Bill Russell', 'Steve Nash', 'Yao', 'Allen Iverson', 'Michael Jordan', 'Ben Wallace', 'Karl Malone'}`\nBeijing is the favorite city of Shaq.\nBeijing is the favorite city of Lebron James.\nBeijing is the favorite city of Bill Russell.\nBeijing is the favorite city of Steve Nash.\nBeijing is the favorite city of Yao.\nBeijing is the favorite city of Allen Iverson.\nBeijing is the favorite city of Michael Jordan.\nBeijing is the favorite city of Ben Wallace.\nBeijing is the favorite city of Karl Malone.\n```"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Example 2"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "```\nOriginal sentence: `Elon Musk is secretly meeting in San Francisco with executives at Udacity.`\nSimilarity threshold: 0.8\n\nPerturbations:\n\nOriginal entity: `Elon_Musk|PERSON`\nSimilar entities: `[('Elon_Musk|PERSON', 1.0), ('Elon_Musk|NOUN', 0.925355076789856), ('Musk|PERSON', 0.8884874582290649), ('Elon_Musk|ORG', 0.8819447159767151), ('Elon|PERSON', 0.8731524348258972), ('Musk|ORG', 0.8508132100105286), ('Elon_musk|NOUN', 0.8425478339195251), ('Musk|GPE', 0.8399838805198669), ('Elon|NOUN', 0.8373492956161499), ('Elon|LOC', 0.8332192897796631)]`\nSimilar entities (as words): `{'Elon', 'Elon musk', 'Musk'}`\nElon is secretly meeting in San Francisco with executives at Udacity.\nElon musk is secretly meeting in San Francisco with executives at Udacity.\nMusk is secretly meeting in San Francisco with executives at Udacity.\n\nOriginal entity: `San_Francisco|GPE`\nSimilar entities: `[('San_Francisco|GPE', 0.9999998807907104), ('Los_Angeles|GPE', 0.9458021521568298), ('San_Diego|GPE', 0.9291024804115295), ('San_Fransisco|GPE', 0.926132082939148), ('San_Fran|GPE', 0.9151375889778137), ('Chicago|GPE', 0.9096716642379761), ('New_York|GPE', 0.9040541052818298), ('Minneapolis|GPE', 0.9033447504043579), ('Orange_County|GPE', 0.9032890200614929), ('L.A.|GPE', 0.9026137590408325)]`\nSimilar entities (as words): `{'San Fran', 'Chicago', 'Minneapolis', 'Orange County', 'New York', 'Los Angeles', 'San Fransisco', 'L.A.', 'San Diego'}`\nElon Musk is secretly meeting in San Fran with executives at Udacity.\nElon Musk is secretly meeting in Chicago with executives at Udacity.\nElon Musk is secretly meeting in Minneapolis with executives at Udacity.\nElon Musk is secretly meeting in Orange County with executives at Udacity.\nElon Musk is secretly meeting in New York with executives at Udacity.\nElon Musk is secretly meeting in Los Angeles with executives at Udacity.\nElon Musk is secretly meeting in San Fransisco with executives at Udacity.\nElon Musk is secretly meeting in L.A. with executives at Udacity.\nElon Musk is secretly meeting in San Diego with executives at Udacity.\n\nOriginal entity: `Udacity|GPE`\nSimilar entities: `[('Udacity|GPE', 1.0000001192092896), ('Coursera|PERSON', 0.8691951036453247), ('Coursera|ORG', 0.866468608379364), ('Udacity|ORG', 0.8635085225105286), ('Coursera|GPE', 0.8596188426017761), ('coursera|NOUN', 0.826501727104187), ('EdX|NOUN', 0.8229740858078003), ('edX|NOUN', 0.8110880851745605), ('Udacity|NOUN', 0.797046422958374), ('Udemy|GPE', 0.7920954823493958)]`\nSimilar entities (as words): `{'coursera', 'EdX', 'edX', 'Coursera'}`\nElon Musk is secretly meeting in San Francisco with executives at coursera.\nElon Musk is secretly meeting in San Francisco with executives at EdX.\nElon Musk is secretly meeting in San Francisco with executives at edX.\nElon Musk is secretly meeting in San Francisco with executives at Coursera.\n```"
},
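{
"metadata": {},
"cell_type": "markdown",
"source": "The first two \"next steps\" from the introduction (several near words per perturbation, scored by the product of their similarities) might be sketched as in the next cell. The candidate lists are a hand-picked subset of the Example 2 output above, and `combined_perturbations` is a hypothetical helper, not part of the original code."
},
{
"metadata": {},
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"import itertools\n",
"\n",
"sentence = 'Elon Musk is secretly meeting in San Francisco with executives at Udacity.'\n",
"\n",
"# Candidate replacements per entity, each paired with its similarity score.\n",
"# Listing the original with score 1.0 lets an entity stay unchanged in a combination.\n",
"candidates = {\n",
"    'San Francisco': [\n",
"        ('San Francisco', 1.0),\n",
"        ('Los Angeles', 0.9458021521568298),\n",
"        ('San Diego', 0.9291024804115295),\n",
"    ],\n",
"    'Udacity': [\n",
"        ('Udacity', 1.0),\n",
"        ('Coursera', 0.866468608379364),\n",
"        ('EdX', 0.8229740858078003),\n",
"    ],\n",
"}\n",
"\n",
"\n",
"def combined_perturbations(sentence, candidates):\n",
"    entities = list(candidates)\n",
"    results = []\n",
"    # Cartesian product: one candidate chosen for every entity.\n",
"    for choice in itertools.product(*(candidates[e] for e in entities)):\n",
"        text, score = sentence, 1.0\n",
"        for entity, (word, similarity) in zip(entities, choice):\n",
"            # Plain substring replacement; assumes entity texts do not overlap.\n",
"            text = text.replace(entity, word)\n",
"            score *= similarity\n",
"        if text != sentence:  # skip the combination that changes nothing\n",
"            results.append((text, score))\n",
"    # Highest combined score first.\n",
"    return sorted(results, key=lambda pair: pair[1], reverse=True)\n",
"\n",
"\n",
"for text, score in combined_perturbations(sentence, candidates):\n",
"    print(f'{score:.3f}  {text}')"
]
},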
{
"metadata": {},
"cell_type": "markdown",
"source": "## Example 3"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "```\n# Example of failure: no perturbations are produced for named entities that have no near words\n# A possible alternative for \"rare\" named entities is to substitute other entities from the same NER category\nOriginal sentence: `Tim Estes is the CEO of Digital Reasoning.`\nSimilarity threshold: 0.8\n\nPerturbations:\n\nOriginal entity: `Tim_Estes|PERSON`\nSimilar entities: `[]`\nSimilar entities (as words): `[]`\n\nOriginal entity: `Digital_Reasoning|ORG`\nSimilar entities: `[]`\nSimilar entities (as words): `[]`\n```"
}
],
"metadata": {
"kernelspec": {
"name": "python3",
"display_name": "Python 3",
"language": "python"
},
"language_info": {
"mimetype": "text/x-python",
"nbconvert_exporter": "python",
"name": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1",
"file_extension": ".py",
"codemirror_mode": {
"version": 3,
"name": "ipython"
}
},
"gist_id": "9cef43165d706666431fac69c5190287"
},
"nbformat": 4,
"nbformat_minor": 2
}