{
"cells": [
{
"metadata": {},
"cell_type": "markdown",
"source": "# Turby - Data Augmentation for Text\nTurby is a Python library that provides data augmentation for text using a variety of strategies. It is highly customizable and can be easily added to a machine learning pipeline after a bit of configuration."
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Spelling Perturbation\n\n### Description\nTurby provides spelling perturbation for producing random spelling errors in a text.\n\nTypo generation can be configured with a spelling dictionary that maps incorrectly spelled words to their correctly spelled version. The spelling dictionary is parsed and the frequency, location, and types of typos are extracted based on string edit distance and stored using a character level n-gram model.\n\nThis model represents a probability distribution for typos based on position in a word (beginning/middle/end), types of characters of involved (vowels/consonants/punctuation/whitespace), and the type of typo (character insertion/deletion/substitution/transposition). In spelling perturbation, typos are probabilistically generated in a text based on this model.\n\nBy default, the spelling model uses the [Wikipedia list of common misspellings](https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines) as the spelling dictionary and a n-gram order of 3 characters.\n\n### Configuration\n- `typo_dist`: typo probability distribution produced from spelling dictionary\n - default: *trained based on Wikipedia list of common misspellings*\n- `ngram_order`: order of the character level n-gram, heavily dependent on `typo_dist`\n - default: 3\n- `typo_freq_factor`: float between 0 and 1 that controls frequency of typos\n - default: 0.3\n- `match_char_norm`: attempt matching characters based on their normalized version (uppercase characters become lowercase, used as a fallback for exact character matching)\n - default: `false`, matching is as strict as possible\n- `match_char_type`: attempt matching characters based on character type, such as vowels/consonants (used as a fallback for exact character matching)\n - default: `false`, matching is as strict as possible\n- `decapitalize_prob`: probability of randomly decapitalizing capital letters\n - default: 0\n- `drop_punct_prob`: probability of randomly dropping 
punctuation characters\n - default: 0\n\n### Demo\n```\nOriginal text: `The quick brown fox jumped over the lazy dog. Teens perceive unsurprisingly.`\nThe quick brown fox jumped over the lazy dog. Teens perceive unsurprizingly.\nThe quick brown fox jumped over the lazy dog Teens percieve unsurprisingly.\nTe quick brown fox jumpd over hte lazy dog. Teens perceive unsuprisingly.\nThe quick brwon fox jumped over tghe lazy dog. Teens preceive unsuprisingly.\nthe quick brown fox jumped over tjhe lazy dog. Teens perceive unsuprisingly.\nThe quick brown fox jumed over the lazy dog. Teens percieve unsupresingy.\nThge quick brown fox jumpped over the lazy dog. Teens precieve unsuprisingly.\nThe quick brown fox jumped over tjhe lazy dog. Teens perceive unsurprisingly.\nThe quick brwon fox jumpped over the lazy dog. Teens percieve unsurprisingly.\nThe quick brwon fox jumed over the lazy dog. Teens percieve unsurprisingly.\nThe quick brown fox jumped over the lazy dog. Teens percieve unsurprisingly.\nThe quick brown fox jumped over the lazy dog. Teens perceive unsurprisingly.\nThe quick brown fox jumped ovor tjhe lazy dog. Teens perceive unsurprisingy.\nThe quick brown fox jumped over tghe lazy dog. Teens percive unsuprisingy.\nThe quick brown fox jumpped ovr te lazy dog. Teens perceive unsurprisingly.\n```"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Coreference-Related Perturbation\n\n### Description\n*in progress*\n\n### Configuration\n- `permute_prob`: probability of randomly permuting coreferent mentions\n - default: 1\n- `representative_prob`: probability of randomly replacing a mention with its representative mention\n - default: 0\n- `use_representative_first`: force representative mention to appear first in a coreference chain (prevents pronoun mention from occuring before its antecendents)\n - default: True\n\n### Demo\n```\nOriginal sentences:\n(0): `Bob Smith works at Digital Reasoning.`\n(1): `Smith's work ethic is excellent and he is a team player.`\n(2): `Sometimes Bob brings cookies.`\n\nCoreference chains:\nCHAIN1-[\"Digital Reasoning\" in sentence 0]\nCHAIN3-[\"Smith's work ethic\" in sentence 1]\nCHAIN5-[\"a team player\" in sentence 1]\nCHAIN6-[\"Bob Smith\" in sentence 0, \"Smith's\" in sentence 1, \"he\" in sentence 1, \"Bob\" in sentence 2]\nCHAIN7-[\"cookies\" in sentence 2]\n\nBob Smith works at Digital Reasoning. Smith's work ethic is excellent and he is a team player. Sometimes Bob brings cookies.\nBob Smith works at Digital Reasoning. His work ethic is excellent and Bob is a team player. Sometimes Smith brings cookies.\nBob Smith works at Digital Reasoning. His work ethic is excellent and Smith is a team player. Sometimes Bob brings cookies.\nBob Smith works at Digital Reasoning. Bob's work ethic is excellent and Smith is a team player. Sometimes he brings cookies.\nBob Smith works at Digital Reasoning. Bob's work ethic is excellent and Smith is a team player. Sometimes he brings cookies.\nBob Smith works at Digital Reasoning. Bob's work ethic is excellent and he is a team player. Sometimes Smith brings cookies.\nBob Smith works at Digital Reasoning. His work ethic is excellent and Smith is a team player. Sometimes Bob brings cookies.\nBob Smith works at Digital Reasoning. Bob's work ethic is excellent and he is a team player. 
Sometimes Smith brings cookies.\nBob Smith works at Digital Reasoning. Smith's work ethic is excellent and Bob is a team player. Sometimes he brings cookies.\nBob Smith works at Digital Reasoning. His work ethic is excellent and Smith is a team player. Sometimes Bob brings cookies.\n```"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Near-Words Perturbation\n\n### Description\n*in progress*\n\n### Todo\n\n- Make near-words perturbation sensitive to coreference chains so that all coreferent mentions are changed and stay consistent\n- Develop a probabilistic model for near-words perturbation based on similarity between sense2vec vectors.\n\n### Demo\n```\nOriginal text: `Elon Musk is secretly meeting in San Francisco with executives at Udacity.`\nNear word perturbations:\nElon is secretly meeting in San Francisco with executives at Udacity.\nElon Musk is secretly meeting in New York with executives at Udacity.\nElon Musk is secretly meeting in Orange County with executives at Udacity.\nElon Musk is secretly meeting in San Fransisco with executives at Udacity.\nElon Musk is secretly meeting in Minneapolis with executives at Udacity.\nElon Musk is secretly meeting in Los Angeles with executives at Udacity.\nElon Musk is secretly meeting in Chicago with executives at Udacity.\nElon Musk is secretly meeting in L.A. with executives at Udacity.\nElon Musk is secretly meeting in San Diego with executives at Udacity.\nElon Musk is secretly meeting in San Fran with executives at Udacity.\nElon Musk is secretly meeting in San Francisco with CEOs at Udacity.\nElon Musk is secretly meeting in San Francisco with execs at Udacity.\nElon Musk is secretly meeting in San Francisco with executives at EdX.\nElon Musk is secretly meeting in San Francisco with executives at Coursera.\n```"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Full Pipeline Demonstration\n\n```\nImported turby: 1.7846322059631348\nOpened text: 1.7846689224243164\nIntialized TextPerturber: 15.281015157699585\n\noriginal: `Bob Smith works at Digital Reasoning. Smith is happy, his work ethic is excellent, and Bob is a team player. And he brings cookies.`\n\ncoref: `Bob Smith works at Digital Reasoning. Bob is happy, his work ethic is excellent, and Smith is a team player. And he brings cookies.`\n\nnearwords: `Bob Smith works at Digital Reasoning. Bob is happy, his work ethic is excellent, and Adams is a team player. And he brings cookies.`\n\nspelling: `Bob Smith works at Digital Reasong. Bob is happy, his wokr ethic is excellent, and Adams is a team player. Ad he brrings cookes.`\n\nfinal result: `Bob Smith works at Digital Reasong. Bob is happy, his wokr ethic is excellent, and Adams is a team player. Ad he brrings cookes.`\n\nFinished printing perturbations: 18.229521989822388\n```"
}
],
"metadata": {
"kernelspec": {
"name": "python3",
"display_name": "Python 3",
"language": "python"
},
"language_info": {
"mimetype": "text/x-python",
"nbconvert_exporter": "python",
"name": "python",
"pygments_lexer": "ipython2",
"version": "2.7.13",
"file_extension": ".py",
"codemirror_mode": {
"version": 2,
"name": "ipython"
}
},
"gist_id": "2b551ba100bb026975e3c591814ecdd7"
},
"nbformat": 4,
"nbformat_minor": 2
}