Created
May 21, 2018 19:29
Experiment with Google Ngram
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"**Try working with Google Books Ngram and evaluate the reliability of the data obtained (using your own examples).**" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Suppose you want to track the rise of a word like \"telephone\" and its clipped form \"phone\", but you are only interested in how \"telephone\" and \"phone\" developed as verbs. The graph indicates that \"telephone\" held strong as a verb for much of the 20th century but is now on its way out." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"from IPython.display import Image\n", | |
"from IPython.display import IFrame" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 15, | |
"metadata": { | |
"scrolled": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"\n", | |
" <iframe\n", | |
" width=\"1000\"\n", | |
" height=\"500\"\n", | |
" src=\"https://books.google.com/ngrams/interactive_chart?content=phone_VERB%2C+telephone_VERB&year_start=1880&year_end=2008&corpus=17&smoothing=3&share=&direct_url=t1%3B%2Cphone_VERB%3B%2Cc0%3B.t1%3B%2Ctelephone_VERB%3B%2Cc0\"\n", | |
" frameborder=\"0\"\n", | |
" allowfullscreen\n", | |
" ></iframe>\n", | |
" " | |
], | |
"text/plain": [ | |
"<IPython.lib.display.IFrame at 0x20552767470>" | |
] | |
}, | |
"execution_count": 15, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"IFrame(\"https://books.google.com/ngrams/interactive_chart?\" \\\n", | |
" \"content=phone_VERB%2C+telephone_VERB&year_start=1880&year_end=2008\"\\\n", | |
" \"&corpus=17&smoothing=3&share=&direct_url=t1%3B%2Cphone_VERB%3B%2Cc0%3B.t1%3B%2Ctelephone_VERB%3B%2Cc0\",\n", | |
" width = 1000,\n", | |
" height = 500)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"##### How often \"Star Wars\" appears between 1880 and 2008" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 178, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"\n", | |
" <iframe\n", | |
" width=\"1000\"\n", | |
" height=\"500\"\n", | |
" src=\"https://books.google.com/ngrams/interactive_chart?content=Star+Wars&year_start=1880&year_end=2008&corpus=15&smoothing=3&share=&direct_url=t1%3B%2CStar%20Wars%3B%2Cc0\"\n", | |
" frameborder=\"0\"\n", | |
" allowfullscreen\n", | |
" ></iframe>\n", | |
" " | |
], | |
"text/plain": [ | |
"<IPython.lib.display.IFrame at 0x20557432f28>" | |
] | |
}, | |
"execution_count": 178, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"IFrame(\"https://books.google.com/ngrams/interactive_chart?\"\\\n", | |
" \"content=Star+Wars&year_start=1880&year_end=2008&\"\\\n", | |
" \"corpus=15&smoothing=3&share=&direct_url=t1%3B%2CStar%20Wars%3B%2Cc0\",\n", | |
" width=1000, height=500 )" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"According to Wikipedia (https://en.wikipedia.org/wiki/Star_Wars), the \"Star Wars\" series was first shown in 1977. The graph shows that before that time there is no data for \"Star Wars\"." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"The default smoothing setting is 3, which means that the value for a given year is the average of that year itself and the 3 preceding and 3 following years. The problem with this setting is that it makes rare terms, which may appear 100 times in one year and only 5 times in the next, look stable. \n", | |
"\n", | |
"Next, I changed the smoothing from 3 to 0 to see how reliable the findings are." | |
] | |
}, | |
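The effect of the smoothing setting can be illustrated with a small centered moving average. This is only a sketch of the idea described above, not Google's implementation: the function name `smooth`, the handling of the series edges (fewer neighbours are averaged there) and the sample counts are all assumptions made for illustration.

```python
def smooth(values, window):
    """Centered moving average: each point is averaged with up to `window`
    neighbours on each side (fewer at the edges of the series)."""
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - window)
        hi = min(len(values), i + window + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


# Hypothetical yearly counts for a rare term: one spike of 100, then 5 per year.
counts = [0, 0, 0, 100, 5, 5, 5]
print(smooth(counts, 0))  # smoothing=0 leaves the raw values untouched
print(smooth(counts, 3))  # smoothing=3 spreads the spike over neighbouring years
```

With a window of 3 the single spike leaks into years in which the term did not occur at all, which is why the unsmoothed series is the safer basis for judging reliability.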
{ | |
"cell_type": "code", | |
"execution_count": 180, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"\n", | |
" <iframe\n", | |
" width=\"1000\"\n", | |
" height=\"500\"\n", | |
" src=\"https://books.google.com/ngrams/interactive_chart?content=Star+Wars&year_start=1880&year_end=2008&corpus=15&smoothing=0&share=&direct_url=t1%3B%2CStar%20Wars%3B%2Cc0\"\n", | |
" frameborder=\"0\"\n", | |
" allowfullscreen\n", | |
" ></iframe>\n", | |
" " | |
], | |
"text/plain": [ | |
"<IPython.lib.display.IFrame at 0x20557432cf8>" | |
] | |
}, | |
"execution_count": 180, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"IFrame(\"https://books.google.com/ngrams/interactive_chart?\"\\\n", | |
" \"content=Star+Wars&year_start=1880&year_end=2008&\"\\\n", | |
" \"corpus=15&smoothing=0&share=&direct_url=t1%3B%2CStar%20Wars%3B%2Cc0\",\n", | |
" width=1000, height=500 )" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"With smoothing set to 0, the graph seems to show the actual yearly values for \"Star Wars\"." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"I downloaded the \"total_counts\" file for the English corpus to check the absolute number of occurrences of \"Star Wars\"." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 24, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import pandas as pd" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 84, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"total_counts = pd.read_table(\"googlebooks-eng-all-totalcounts-20120701.txt\")" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 85, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th>1505,32059,231,1</th>\n", | |
" <th>1507,49586,477,1</th>\n", | |
" <th>1515,289011,2197,1</th>\n", | |
" <th>1520,51783,223,1</th>\n", | |
" <th>1524,287177,1275,1</th>\n", | |
" <th>1525,3559,69,1</th>\n", | |
" <th>1527,4375,39,1</th>\n", | |
" <th>1541,5272,59,1</th>\n", | |
" <th>1563,213843,931,1</th>\n", | |
" <th>...</th>\n", | |
" <th>2000,11190986329,54799233,103405</th>\n", | |
" <th>2001,11349375656,55886251,104147</th>\n", | |
" <th>2002,12519922882,62335467,117207</th>\n", | |
" <th>2003,13632028136,68561620,127066</th>\n", | |
" <th>2004,14705541576,73346714,139616</th>\n", | |
" <th>2005,14425183957,72756812,138132</th>\n", | |
" <th>2006,15310495914,77883896,148342</th>\n", | |
" <th>2007,16206118071,82969746,155472</th>\n", | |
" <th>2008,19482936409,108811006,206272</th>\n", | |
" <th>Unnamed: 426</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"<p>0 rows × 427 columns</p>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
"Empty DataFrame\n", | |
"Columns: [ , 1505,32059,231,1, 1507,49586,477,1, 1515,289011,2197,1, 1520,51783,223,1, 1524,287177,1275,1, 1525,3559,69,1, 1527,4375,39,1, 1541,5272,59,1, 1563,213843,931,1, 1564,70755,387,1, 1568,153095,1124,2, 1572,177484,797,1, 1574,62235,689,1, 1575,186706,1067,1, 1579,203074,1143,3, 1581,708458,2824,6, 1582,151000,537,1, 1584,151925,393,1, 1587,248361,762,2, 1588,41548,634,2, 1589,36290,238,2, 1590,564921,2260,2, 1592,96955,814,4, 1593,39997,328,2, 1594,11106,67,1, 1595,33664,347,3, 1597,10923,101,1, 1598,85051,768,2, 1600,405205,985,1, 1602,3292,47,1, 1603,69050,561,1, 1605,14493,131,1, 1606,62921,601,3, 1607,381763,1600,2, 1610,6258,75,1, 1611,49641,457,1, 1612,52898,593,1, 1614,8777,57,1, 1618,20166,147,1, 1619,55192,467,1, 1620,229054,2371,3, 1621,64197,679,3, 1623,120443,896,2, 1624,145470,899,3, 1625,69296,551,1, 1626,41890,259,1, 1628,6425,43,1, 1629,288773,1250,2, 1630,152568,1463,3, 1631,474458,1899,1, 1632,43064,299,1, 1634,141378,777,3, 1635,244673,1385,3, 1636,31714,252,2, 1637,681719,2315,3, 1638,243942,876,2, 1640,60550,425,3, 1641,45397,536,2, 1642,137346,769,3, 1643,177489,1238,6, 1644,1018174,4031,5, 1645,252714,1263,3, 1646,55522,253,1, 1647,312270,2015,5, 1648,458975,2306,4, 1649,260987,1796,4, 1650,192820,1161,7, 1651,540758,2221,3, 1652,168692,1023,3, 1653,379618,2677,7, 1654,36496,256,2, 1655,280899,1789,5, 1656,688699,3142,4, 1657,310453,2551,5, 1658,834659,4509,9, 1659,543657,2331,3, 1660,130457,1085,5, 1661,128825,931,3, 1662,239762,1471,3, 1663,208750,2021,5, 1664,290743,2670,6, 1665,269608,2689,11, 1666,81564,843,3, 1667,751217,3449,9, 1668,1065563,3920,6, 1669,342820,2276,4, 1670,734354,3127,5, 1671,149851,1276,4, 1672,425998,2665,5, 1673,935178,5517,11, 1674,126602,643,3, 1675,1644156,8918,14, 1676,1801615,8433,15, 1677,799238,5380,12, 1678,1966870,8516,18, 1679,1112022,6347,13, 1680,1099854,6122,22, 1681,2614565,11444,28, 1682,3667945,15570,30, ...]\n", | |
"Index: []\n", | |
"\n", | |
"[0 rows x 427 columns]" | |
] | |
}, | |
"execution_count": 85, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"total_counts.head()" | |
] | |
}, | |
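The empty DataFrame above suggests that the file is a single line of tab-separated records, each of the form `year,words,pages,volumes`, so `read_table` took the whole line as a header row. A minimal sketch of parsing that layout with the standard library follows; the function name `parse_total_counts` and the field labels are assumptions, and the sample string only mimics the layout using three records copied from the header shown above.

```python
def parse_total_counts(text):
    """Split tab-separated 'year,words,pages,volumes' records into int tuples."""
    rows = []
    for record in text.split("\t"):
        fields = record.strip().split(",")
        if len(fields) != 4:
            continue  # skip the empty pieces at the start and end of the line
        year, words, pages, volumes = (int(f) for f in fields)
        rows.append((year, words, pages, volumes))
    return rows


# Sample in the assumed layout: a blank first field, then one record per tab.
sample = (
    " \t2006,15310495914,77883896,148342"
    "\t2007,16206118071,82969746,155472"
    "\t2008,19482936409,108811006,206272\t"
)
rows = parse_total_counts(sample)
print(rows[-1])
```

The resulting rows can be passed to `pd.DataFrame(rows, columns=["year", "words", "pages", "volumes"])` if a DataFrame is preferred.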
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Evaluate the reliability" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"**Check the Absolute Count in 1988**" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 86, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"[' ',\n", | |
" '1505,32059,231,1',\n", | |
" '1507,49586,477,1',\n", | |
" '1515,289011,2197,1',\n", | |
" '1520,51783,223,1',\n", | |
" '1524,287177,1275,1',\n", | |
" '1525,3559,69,1',\n", | |
" '1527,4375,39,1',\n", | |
" '1541,5272,59,1',\n", | |
" '1563,213843,931,1',\n", | |
" '1564,70755,387,1',\n", | |
" '1568,153095,1124,2',\n", | |
" '1572,177484,797,1',\n", | |
" '1574,62235,689,1',\n", | |
" '1575,186706,1067,1',\n", | |
" '1579,203074,1143,3',\n", | |
" '1581,708458,2824,6',\n", | |
" '1582,151000,537,1',\n", | |
" '1584,151925,393,1',\n", | |
" '1587,248361,762,2',\n", | |
" '1588,41548,634,2',\n", | |
" '1589,36290,238,2',\n", | |
" '1590,564921,2260,2',\n", | |
" '1592,96955,814,4',\n", | |
" '1593,39997,328,2',\n", | |
" '1594,11106,67,1',\n", | |
" '1595,33664,347,3',\n", | |
" '1597,10923,101,1',\n", | |
" '1598,85051,768,2',\n", | |
" '1600,405205,985,1',\n", | |
" '1602,3292,47,1',\n", | |
" '1603,69050,561,1',\n", | |
" '1605,14493,131,1',\n", | |
" '1606,62921,601,3',\n", | |
" '1607,381763,1600,2',\n", | |
" '1610,6258,75,1',\n", | |
" '1611,49641,457,1',\n", | |
" '1612,52898,593,1',\n", | |
" '1614,8777,57,1',\n", | |
" '1618,20166,147,1',\n", | |
" '1619,55192,467,1',\n", | |
" '1620,229054,2371,3',\n", | |
" '1621,64197,679,3',\n", | |
" '1623,120443,896,2',\n", | |
" '1624,145470,899,3',\n", | |
" '1625,69296,551,1',\n", | |
" '1626,41890,259,1',\n", | |
" '1628,6425,43,1',\n", | |
" '1629,288773,1250,2',\n", | |
" '1630,152568,1463,3',\n", | |
" '1631,474458,1899,1',\n", | |
" '1632,43064,299,1',\n", | |
" '1634,141378,777,3',\n", | |
" '1635,244673,1385,3',\n", | |
" '1636,31714,252,2',\n", | |
" '1637,681719,2315,3',\n", | |
" '1638,243942,876,2',\n", | |
" '1640,60550,425,3',\n", | |
" '1641,45397,536,2',\n", | |
" '1642,137346,769,3',\n", | |
" '1643,177489,1238,6',\n", | |
" '1644,1018174,4031,5',\n", | |
" '1645,252714,1263,3',\n", | |
" '1646,55522,253,1',\n", | |
" '1647,312270,2015,5',\n", | |
" '1648,458975,2306,4',\n", | |
" '1649,260987,1796,4',\n", | |
" '1650,192820,1161,7',\n", | |
" '1651,540758,2221,3',\n", | |
" '1652,168692,1023,3',\n", | |
" '1653,379618,2677,7',\n", | |
" '1654,36496,256,2',\n", | |
" '1655,280899,1789,5',\n", | |
" '1656,688699,3142,4',\n", | |
" '1657,310453,2551,5',\n", | |
" '1658,834659,4509,9',\n", | |
" '1659,543657,2331,3',\n", | |
" '1660,130457,1085,5',\n", | |
" '1661,128825,931,3',\n", | |
" '1662,239762,1471,3',\n", | |
" '1663,208750,2021,5',\n", | |
" '1664,290743,2670,6',\n", | |
" '1665,269608,2689,11',\n", | |
" '1666,81564,843,3',\n", | |
" '1667,751217,3449,9',\n", | |
" '1668,1065563,3920,6',\n", | |
" '1669,342820,2276,4',\n", | |
" '1670,734354,3127,5',\n", | |
" '1671,149851,1276,4',\n", | |
" '1672,425998,2665,5',\n", | |
" '1673,935178,5517,11',\n", | |
" '1674,126602,643,3',\n", | |
" '1675,1644156,8918,14',\n", | |
" '1676,1801615,8433,15',\n", | |
" '1677,799238,5380,12',\n", | |
" '1678,1966870,8516,18',\n", | |
" '1679,1112022,6347,13',\n", | |
" '1680,1099854,6122,22',\n", | |
" '1681,2614565,11444,28',\n", | |
" '1682,3667945,15570,30',\n", | |
" '1683,4175428,15946,30',\n", | |
" '1684,1707625,8701,19',\n", | |
" '1685,2350253,14504,28',\n", | |
" '1686,1263478,8862,22',\n", | |
" '1687,1185730,4521,13',\n", | |
" '1688,2548272,10593,21',\n", | |
" '1689,982547,6474,20',\n", | |
" '1690,909320,5392,14',\n", | |
" '1691,321865,2262,8',\n", | |
" '1692,1674892,8524,14',\n", | |
" '1693,1038415,7426,16',\n", | |
" '1694,2020553,13199,25',\n", | |
" '1695,1223730,8829,13',\n", | |
" '1696,829773,7095,21',\n", | |
" '1697,947914,4401,9',\n", | |
" '1698,3115797,19918,38',\n", | |
" '1699,2830668,17088,36',\n", | |
" '1700,3724080,23837,37',\n", | |
" '1701,3969408,26769,49',\n", | |
" '1702,4981091,27197,65',\n", | |
" '1703,4160884,26829,47',\n", | |
" '1704,4896743,30972,68',\n", | |
" '1705,4908749,28840,60',\n", | |
" '1706,6717731,36302,70',\n", | |
" '1707,5350926,26228,52',\n", | |
" '1708,6481151,37416,70',\n", | |
" '1709,3354295,24260,56',\n", | |
" '1710,6947443,35889,99',\n", | |
" '1711,6737146,40069,85',\n", | |
" '1712,3822481,22378,58',\n", | |
" '1713,4720647,25961,77',\n", | |
" '1714,7764527,42791,95',\n", | |
" '1715,6381321,40575,91',\n", | |
" '1716,5059979,23970,70',\n", | |
" '1717,6932237,37712,90',\n", | |
" '1718,5184576,36292,92',\n", | |
" '1719,4957704,30204,98',\n", | |
" '1720,9307091,51148,102',\n", | |
" '1721,6991857,37936,84',\n", | |
" '1722,10462138,45518,96',\n", | |
" '1723,7650075,43642,106',\n", | |
" '1724,8504688,53163,91',\n", | |
" '1725,10634464,54579,99',\n", | |
" '1726,10049695,66514,106',\n", | |
" '1727,12961617,73073,133',\n", | |
" '1728,11203433,72304,142',\n", | |
" '1729,12290699,65192,122',\n", | |
" '1730,12141708,69124,140',\n", | |
" '1731,12939697,67794,128',\n", | |
" '1732,10191917,66456,144',\n", | |
" '1733,5729194,33674,98',\n", | |
" '1734,10069531,62738,120',\n", | |
" '1735,9078498,59822,130',\n", | |
" '1736,8049773,47332,112',\n", | |
" '1737,13254037,74519,133',\n", | |
" '1738,13711768,67208,132',\n", | |
" '1739,11506472,73091,169',\n", | |
" '1740,11351999,63577,123',\n", | |
" '1741,8036677,50136,122',\n", | |
" '1742,11481001,68262,142',\n", | |
" '1743,9480804,68475,165',\n", | |
" '1744,13999448,86587,171',\n", | |
" '1745,8964077,62566,184',\n", | |
" '1746,7178475,42780,142',\n", | |
" '1747,15862088,87459,177',\n", | |
" '1748,15326914,83841,153',\n", | |
" '1749,12651711,88764,196',\n", | |
" '1750,19252447,105214,218',\n", | |
" '1751,20150324,112498,218',\n", | |
" '1752,14340951,87244,182',\n", | |
" '1753,19100911,113065,229',\n", | |
" '1754,20408128,131704,220',\n", | |
" '1755,20284102,135065,257',\n", | |
" '1756,8734579,59545,165',\n", | |
" '1757,13717180,93794,194',\n", | |
" '1758,16974336,104794,196',\n", | |
" '1759,21275484,125399,205',\n", | |
" '1760,14620367,104986,216',\n", | |
" '1761,17721029,107990,212',\n", | |
" '1762,11334996,73704,158',\n", | |
" '1763,20103289,111617,195',\n", | |
" '1764,18680471,112259,201',\n", | |
" '1765,15656943,101540,196',\n", | |
" '1766,26832144,166327,279',\n", | |
" '1767,19968484,137147,239',\n", | |
" '1768,27116433,186755,307',\n", | |
" '1769,18548978,128275,237',\n", | |
" '1770,21906473,156785,287',\n", | |
" '1771,20026146,148156,242',\n", | |
" '1772,20087322,151573,259',\n", | |
" '1773,18809127,131107,233',\n", | |
" '1774,19376100,140530,286',\n", | |
" '1775,25217307,163753,297',\n", | |
" '1776,26766563,182397,333',\n", | |
" '1777,22531379,164511,291',\n", | |
" '1778,20822070,130713,211',\n", | |
" '1779,18344680,132503,247',\n", | |
" '1780,19284173,137264,262',\n", | |
" '1781,21534708,165102,272',\n", | |
" '1782,21505581,148858,256',\n", | |
" '1783,21001833,154110,278',\n", | |
" '1784,26735435,196374,310',\n", | |
" '1785,26424206,195551,333',\n", | |
" '1786,27701969,201168,328',\n", | |
" '1787,41147754,274736,406',\n", | |
" '1788,43010567,317558,476',\n", | |
" '1789,37991018,274486,408',\n", | |
" '1790,40363128,290254,448',\n", | |
" '1791,44446487,303140,450',\n", | |
" '1792,47305037,334531,525',\n", | |
" '1793,41628412,306038,536',\n", | |
" '1794,48633342,334985,503',\n", | |
" '1795,46129522,306795,481',\n", | |
" '1796,56007600,402660,600',\n", | |
" '1797,47048067,327575,527',\n", | |
" '1798,46311447,317118,520',\n", | |
" '1799,50259992,358621,543',\n", | |
" '1800,70784405,481221,669',\n", | |
" '1801,107290136,720762,976',\n", | |
" '1802,95731997,593319,843',\n", | |
" '1803,104173226,703119,941',\n", | |
" '1804,114051906,773467,1079',\n", | |
" '1805,115330195,768720,1054',\n", | |
" '1806,118229517,820253,1139',\n", | |
" '1807,128904931,843799,1139',\n", | |
" '1808,129988114,825924,1172',\n", | |
" '1809,137911980,849578,1188',\n", | |
" '1810,150961261,942002,1280',\n", | |
" '1811,177318465,1089707,1425',\n", | |
" '1812,172538907,966207,1285',\n", | |
" '1813,144660671,848854,1148',\n", | |
" '1814,168441689,1005881,1325',\n", | |
" '1815,156318674,940919,1281',\n", | |
" '1816,161561836,993399,1375',\n", | |
" '1817,182422107,1112404,1608',\n", | |
" '1818,204446854,1249575,1711',\n", | |
" '1819,174156635,1074883,1603',\n", | |
" '1820,231277724,1428596,1876',\n", | |
" '1821,181677006,1090084,1530',\n", | |
" '1822,271213007,1582135,2049',\n", | |
" '1823,254327070,1531352,2096',\n", | |
" '1824,309237910,1818566,2402',\n", | |
" '1825,318701311,1931153,2571',\n", | |
" '1826,243758959,1459702,2006',\n", | |
" '1827,253677933,1540742,2124',\n", | |
" '1828,273678947,1616864,2320',\n", | |
" '1829,293815859,1682580,2338',\n", | |
" '1830,342378710,1893561,2615',\n", | |
" '1831,313388047,1693686,2458',\n", | |
" '1832,314184783,1697641,2501',\n", | |
" '1833,310441320,1768777,2655',\n", | |
" '1834,301383644,1685631,2585',\n", | |
" '1835,355491202,2000520,2946',\n", | |
" '1836,365982104,2016239,2951',\n", | |
" '1837,337485292,1897476,2642',\n", | |
" '1838,358600155,1973223,2813',\n", | |
" '1839,413876708,2268357,3195',\n", | |
" '1840,423904296,2214894,3196',\n", | |
" '1841,387286321,2083152,3048',\n", | |
" '1842,348396317,1825805,2711',\n", | |
" '1843,404133447,2000337,2899',\n", | |
" '1844,419311001,2164514,3086',\n", | |
" '1845,456885448,2327894,3294',\n", | |
" '1846,459546575,2351443,3305',\n", | |
" '1847,443868440,2210955,3291',\n", | |
" '1848,466134080,2417716,3648',\n", | |
" '1849,472315353,2428935,3539',\n", | |
" '1850,504143257,2601734,3910',\n", | |
" '1851,537705793,2787491,4021',\n", | |
" '1852,558718364,2900999,4461',\n", | |
" '1853,625159477,3248278,4706',\n", | |
" '1854,683559348,3445720,4810',\n", | |
" '1855,605758582,3126226,4404',\n", | |
" '1856,652385453,3360386,4728',\n", | |
" '1857,568489706,2971641,4319',\n", | |
" '1858,541848821,2794762,4108',\n", | |
" '1859,588343315,3047548,4572',\n", | |
" '1860,607952196,3291751,4921',\n", | |
" '1861,463190641,2457516,3664',\n", | |
" '1862,396839451,2162284,3364',\n", | |
" '1863,418297294,2280211,3527',\n", | |
" '1864,493159851,2742669,4089',\n", | |
" '1865,503022451,2754685,4265',\n", | |
" '1866,548257863,2970231,4373',\n", | |
" '1867,518622969,2798144,4168',\n", | |
" '1868,547590187,3004671,4509',\n", | |
" '1869,558291347,3052571,4589',\n", | |
" '1870,548870828,3010658,4588',\n", | |
" '1871,560339562,3109850,4674',\n", | |
" '1872,566620105,3133978,4768',\n", | |
" '1873,583981485,3210707,4799',\n", | |
" '1874,636667506,3496138,5190',\n", | |
" '1875,643873731,3513955,5335',\n", | |
" '1876,676820039,3717671,5691',\n", | |
" '1877,667722549,3635691,5657',\n", | |
" '1878,629401874,3475917,5521',\n", | |
" '1879,654448581,3648960,5912',\n", | |
" '1880,784223075,4339293,6659',\n", | |
" '1881,789254798,4377740,6836',\n", | |
" '1882,828502461,4594461,7295',\n", | |
" '1883,930196929,5188267,8091',\n", | |
" '1884,881638914,4821278,7906',\n", | |
" '1885,857166435,4796652,7804',\n", | |
" '1886,727723136,3978980,6198',\n", | |
" '1887,801865869,4578817,7215',\n", | |
" '1888,795886071,4489400,7054',\n", | |
" '1889,763170247,4217872,6480',\n", | |
" '1890,787152479,4446336,7006',\n", | |
" '1891,849750639,4772590,7600',\n", | |
" '1892,936056142,5340906,8320',\n", | |
" '1893,915629979,5204954,8214',\n", | |
" '1894,899615494,5190068,8132',\n", | |
" '1895,984856075,5699486,9184',\n", | |
" '1896,1050921103,6149427,9663',\n", | |
" '1897,1031909734,6036650,9632',\n", | |
" '1898,1109257706,6474893,10193',\n", | |
" '1899,1232717908,7319283,11421',\n", | |
" '1900,1341057959,7880706,12204',\n", | |
" '1901,1285712637,7611053,11923',\n", | |
" '1902,1311315033,7850395,12325',\n", | |
" '1903,1266236889,7672684,12386',\n", | |
" '1904,1405505328,8505994,13406',\n", | |
" '1905,1351302005,7982387,12833',\n", | |
" '1906,1397090480,8324581,13309',\n", | |
" '1907,1409945274,8352873,13533',\n", | |
" '1908,1417130893,8455420,13826',\n", | |
" '1909,1283265090,7678880,12638',\n", | |
" '1910,1354824248,8082350,13278',\n", | |
" '1911,1350964981,8146435,13659',\n", | |
" '1912,1431385638,8498210,14314',\n", | |
" '1913,1356693322,8272376,14064',\n", | |
" '1914,1324894757,8031654,13964',\n", | |
" '1915,1211361619,7359683,13357',\n", | |
" '1916,1175413415,7285233,13449',\n", | |
" '1917,1183132092,7301665,13535',\n", | |
" '1918,1039343103,6427497,12225',\n", | |
" '1919,1136614538,6939246,12588',\n", | |
" '1920,1388696469,8320305,14671',\n", | |
" '1921,1216676110,7129055,12681',\n", | |
" '1922,1413237707,8295471,14781',\n", | |
" '1923,1151386048,6679296,11962',\n", | |
" '1924,1069007206,6285325,11221',\n", | |
" '1925,1113107246,6436655,11609',\n", | |
" '1926,1053565430,6180969,11513',\n", | |
" '1927,1216023821,6992594,12560',\n", | |
" '1928,1212716430,6940650,12610',\n", | |
" '1929,1153722574,6757530,12430',\n", | |
" '1930,1244889331,7172751,13131',\n", | |
" '1931,1183806248,6746535,12339',\n", | |
" '1932,1057602772,5908248,10940',\n", | |
" '1933,915956659,5193167,10129',\n", | |
" '1934,1053600093,5813581,10781',\n", | |
" '1935,1157109310,6383929,11543',\n", | |
" '1936,1199843463,6704700,12168',\n", | |
" '1937,1232280287,6867867,12393',\n", | |
" '1938,1261812592,7006038,12494',\n", | |
" '1939,1249209591,6860069,12255',\n", | |
" '1940,1179404138,6458613,11539',\n", | |
" '1941,1084154164,5943516,10956',\n", | |
" '1942,1045379066,5652409,10561',\n", | |
" '1943,890214397,4754157,9221',\n", | |
" '1944,812192380,4254836,8696',\n", | |
" '1945,926378706,4754610,9542',\n", | |
" '1946,1203221497,6293844,12452',\n", | |
" '1947,1385834769,7297313,14115',\n", | |
" '1948,1486005621,7719563,14721',\n", | |
" '1949,1641024100,8474538,15754',\n", | |
" '1950,1644401950,8581523,15761',\n", | |
" '1951,1603394676,8369856,15418',\n", | |
" '1952,1621780754,8271139,15307',\n", | |
" '1953,1590464886,8243557,15325',\n", | |
" '1954,1662160145,8642537,16201',\n", | |
" '1955,1751719755,9009566,16994',\n", | |
" '1956,1817491821,9289947,17453',\n", | |
" '1957,1952474329,10050283,18977',\n", | |
" '1958,1976098333,10184584,19292',\n", | |
" '1959,2064236476,10667039,20781',\n", | |
" '1960,2341981521,12110214,24048',\n", | |
" '1961,2567977722,13168876,25762',\n", | |
" '1962,2818694749,14534596,27762',\n", | |
" '1963,2955051696,15289261,29569',\n", | |
" '1964,2931038992,15327267,30661',\n", | |
" '1965,3300623502,16925833,32999',\n", | |
" '1966,3466842517,17885635,35243',\n", | |
" '1967,3658119990,18856794,37636',\n", | |
" '1968,3968752101,20713781,40613',\n", | |
" '1969,3942222509,20605052,40154',\n", | |
" '1970,4086393350,21493334,42050',\n", | |
" '1971,4058576649,21022316,41676',\n", | |
" '1972,4174172415,21723303,43701',\n", | |
" '1973,4058707895,20934291,42413',\n", | |
" '1974,4045487401,20870625,42423',\n", | |
" '1975,4104379941,21163884,43866',\n", | |
" '1976,4242326406,21741811,44785',\n", | |
" '1977,4314577619,22131803,45231',\n", | |
" '1978,4365839878,22337808,45652',\n", | |
" '1979,4528331460,23121674,47094',\n", | |
" '1980,4611609946,23399729,47197',\n", | |
" '1981,4627406112,23181513,46107',\n", | |
" '1982,4839530894,24286876,48446',\n", | |
" '1983,4982167985,24855807,49481',\n", | |
" '1984,5309222580,26493896,52068',\n", | |
" '1985,5475269397,27311038,53730',\n", | |
" '1986,5793946882,28860058,56268',\n", | |
" '1987,5936558026,29600208,57856',\n", | |
" '1988,6191886939,30977704,60672',\n", | |
" '1989,6549339038,32665219,64029',\n", | |
" '1990,7075013106,35252588,69220',\n", | |
" '1991,6895715366,34521903,68159',\n", | |
" '1992,7596808027,37580665,72393',\n", | |
" '1993,7492130348,37154768,71658',\n", | |
" '1994,8027353540,39575664,76662',\n", | |
" '1995,8276258599,40863936,77890',\n", | |
" '1996,8745049453,42919779,82091',\n", | |
" '1997,8979708108,43952838,84104',\n", | |
" '1998,9406708249,45989297,87421',\n", | |
" '1999,9997156197,48914071,91983',\n", | |
" '2000,11190986329,54799233,103405',\n", | |
" '2001,11349375656,55886251,104147',\n", | |
" '2002,12519922882,62335467,117207',\n", | |
" '2003,13632028136,68561620,127066',\n", | |
" '2004,14705541576,73346714,139616',\n", | |
" '2005,14425183957,72756812,138132',\n", | |
" '2006,15310495914,77883896,148342',\n", | |
" '2007,16206118071,82969746,155472',\n", | |
" '2008,19482936409,108811006,206272',\n", | |
" 'Unnamed: 426']" | |
] | |
}, | |
"execution_count": 86, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"total_counts.columns.values.tolist()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 89, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"('1988,6191886939,30977704,60672',)" | |
] | |
}, | |
"execution_count": 89, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"'1988,6191886939,30977704,60672'," | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"For 1988, the corpus contains 6191886939 words on 30977704 pages in 60672 books." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"With this, we can calculate how often \"Star Wars\" appeared when it spiked to its highest relative frequency in 1988 (the chart value is a percentage, hence the division by 100): " | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 182, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"9894" | |
] | |
}, | |
"execution_count": 182, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"round(0.0001597920 * 6191886939 / 100)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"In other words, in 1988 there were about 9894 appearances of \"Star Wars\"." | |
] | |
}, | |
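The same arithmetic can be wrapped in a small helper. This is a sketch under the assumption, taken from the calculation above, that the chart reports the relative frequency as a percentage of all words in that year; the function name `absolute_count` is made up for illustration.

```python
def absolute_count(percent, total_words):
    """Convert a relative frequency given in percent into an estimated count."""
    return round(percent / 100 * total_words)


total_words_1988 = 6191886939   # word total for 1988 from the totalcounts file
peak_percent = 0.0001597920     # chart value used in the calculation above
print(absolute_count(peak_percent, total_words_1988))  # 9894
```

The result is an estimate, because the chart value itself is rounded.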
{ | |
"cell_type": "code", | |
"execution_count": 163, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import requests\n", | |
"import csv\n", | |
"import matplotlib.pyplot as plt\n", | |
"%matplotlib inline" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 164, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"def plot_absolute_counts(token, corpus='english', smoothing=0, start_year=1800, end_year=2008, log_scale=True):\n", | |
" '''\n", | |
" Some language can use\n", | |
" 'english', 'american english', 'british english', 'english fiction'\n", | |
" 'russian'\n", | |
" '''\n", | |
" # Load absolute counts of the totken\n", | |
" absolute_counts = retrieve_absolute_counts(token, corpus, smoothing, start_year, end_year)\n", | |
"\n", | |
" years = range(start_year, start_year + len(absolute_counts))\n", | |
"\n", | |
" plt.rcParams['figure.figsize'] = (15,8)\n", | |
" plt.rcParams['font.size'] = 10\n", | |
" ax= plt.axes()\n", | |
" if log_scale:\n", | |
" ax.set_yscale('log')\n", | |
" plt.plot(years, absolute_counts, label = '{}'.format(token))\n", | |
" title = 'Absolute Counts of \"{}\" in the \"{}\" corpus with smoothing={}.'.format(token, corpus,smoothing)\n", | |
" if log_scale:\n", | |
" title += ' Log Scale.'\n", | |
" plt.title(title)\n", | |
"\n", | |
" handles, labels = ax.get_legend_handles_labels()\n", | |
" ax.legend(handles, labels)\n", | |
"\n", | |
" legend_title = ax.get_legend().get_title()\n", | |
" legend_title.set_fontsize(15)\n", | |
"\n", | |
" plt.show()\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 165, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"def print_absolute_counts(token, corpus='english', smoothing=0, start_year=1800, end_year=2000):\n", | |
" '''\n", | |
" Prints out the absolute counts (instead of plotting them)\n", | |
" '''\n", | |
" absolute_counts = retrieve_absolute_counts(token, corpus, smoothing, start_year, end_year)\n", | |
" print ('Absolute Counts for: {}'.format(token))\n", | |
" for i in range(len(absolute_counts)):\n", | |
" print ('{}: {}'.format(start_year + i, int(absolute_counts[i])))\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 166, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"def load_total_counts(corpus_id, start_year, end_year):\n", | |
"    '''\n", | |
"    This function loads the total counts for a given corpus from Google's source data.\n", | |
"    '''\n", | |
"\n", | |
"    # map from corpus id to the url of its total counts file\n", | |
"    id_to_url= {\n", | |
"    15: 'http://storage.googleapis.com/books/ngrams/books/googlebooks-eng-all-totalcounts-20120701.txt',\n", | |
"    17: 'http://storage.googleapis.com/books/ngrams/books/googlebooks-eng-us-all-totalcounts-20120701.txt',\n", | |
"    18: 'http://storage.googleapis.com/books/ngrams/books/googlebooks-eng-gb-all-totalcounts-20120701.txt',\n", | |
"    16: 'http://storage.googleapis.com/books/ngrams/books/googlebooks-eng-fiction-all-totalcounts-20120701.txt', \n", | |
"    25: 'http://storage.googleapis.com/books/ngrams/books/googlebooks-rus-all-totalcounts-20120701.txt',\n", | |
"    }\n", | |
"    \n", | |
"    response = requests.get(id_to_url[corpus_id]).text\n", | |
"    total_counts = []\n", | |
"    # the file is a single line of tab-separated 'year,words,pages,volumes' records\n", | |
"    data = response.split(\"\\t\")\n", | |
"    # the first and last elements are empty, so drop them\n", | |
"    data = data[1:len(data)-1]\n", | |
"    for row in data:\n", | |
"        # skip any record that does not have exactly four integer fields\n", | |
"        try:\n", | |
"            year, word_count, _, _ = row.split(',')\n", | |
"            if int(year) >= start_year and int(year) <= end_year:\n", | |
"                total_counts.append(int(word_count))\n", | |
"\n", | |
"        except ValueError:\n", | |
"            pass\n", | |
"\n", | |
"    return total_counts" | |
] | |
}, | |
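`load_total_counts` returns a plain list that contains only the years present in the file, and the early years have gaps (the column list above has 1505 and 1507 but no 1506). For a range such as 1800 to 2008 this does not matter, but for earlier start years the list would drift out of step with the per-year frequencies. A minimal sketch of pairing the two by year instead of by position follows; `align_by_year` and the sample frequencies are hypothetical.

```python
def align_by_year(frequencies, start_year, totals_by_year):
    """Pair each yearly frequency with the total for the same year.

    Years missing from the totals are skipped instead of shifting the series.
    """
    counts = {}
    for offset, freq in enumerate(frequencies):
        year = start_year + offset
        if year in totals_by_year:
            counts[year] = round(freq * totals_by_year[year])
    return counts


# Word totals for 1505 and 1507 as listed above; 1506 is absent from the file.
totals = {1505: 32059, 1507: 49586}
freqs = [0.001, 0.002, 0.003]  # hypothetical relative frequencies for 1505-1507
print(align_by_year(freqs, 1505, totals))
```

Keying the totals by year keeps each count attached to the right year even when the source data has holes.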
{ | |
"cell_type": "code", | |
"execution_count": 174, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"def retrieve_absolute_counts(token, corpus, smoothing, start_year, end_year):\n", | |
"    import ast  # local import so that this cell runs on its own\n", | |
"\n", | |
"    # dictionary maps from corpus name to corpus id\n", | |
"    corpora = {\n", | |
"        'english' : 15,\n", | |
"        'american english': 17,\n", | |
"        'british english': 18,\n", | |
"        'english fiction': 16, \n", | |
"        'russian': 25, \n", | |
"    }\n", | |
"    corpus_id = corpora[corpus]\n", | |
"    # load the frequency data\n", | |
"    token = token.replace(' ', '+')\n", | |
"    url = 'https://books.google.com/ngrams/interactive_chart?content={}&year_start={}&year_end={}' \\\n", | |
"    '&corpus={}&smoothing={}'.format(token, start_year, end_year, corpus_id, smoothing)\n", | |
"    # Load the data from the page\n", | |
"    page = requests.get(url).text\n", | |
"    start = page.find('var data = ')\n", | |
"    end = page.find('];\\n', start)\n", | |
"    \n", | |
"    # skip 'var data = [' (12 characters) and stop before '];' so only the series dict remains;\n", | |
"    # literal_eval parses the literal without executing code from the downloaded page\n", | |
"    data = ast.literal_eval(page[start+12:end])\n", | |
"    frequencies = data['timeseries']\n", | |
"    \n", | |
"    # load the total number of tokens per year\n", | |
"    total_counts = load_total_counts(corpus_id, start_year, end_year)\n", | |
"    \n", | |
"    # calculate the absolute number of appearances \n", | |
"    # by multiplying the frequencies with the total number of tokens\n", | |
"    absolute_counts = [round(frequencies[i] * total_counts[i]) for i in range(len(frequencies))]\n", | |
"    return absolute_counts" | |
] | |
}, | |
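The two steps in `retrieve_absolute_counts` (pulling the embedded `var data = [...]` block out of the page, then multiplying relative frequencies by yearly totals) can be illustrated without a network call. This is a minimal sketch: the page string, the helper names `extract_timeseries` and `to_absolute`, and all numbers are hypothetical, and the real page layout may differ or change.

```python
import json


def extract_timeseries(page):
    """Return the 'timeseries' list of the first entry in 'var data = [...]'."""
    marker = "var data = "
    start = page.find(marker)
    if start == -1:
        raise ValueError("no embedded data block found")
    # raw_decode parses one JSON value starting at the given index
    # and ignores whatever follows it (here: the trailing ';').
    data, _ = json.JSONDecoder().raw_decode(page, start + len(marker))
    return data[0]["timeseries"]


def to_absolute(frequencies, totals):
    """Multiply each relative frequency by that year's total token count."""
    return [round(f * t) for f, t in zip(frequencies, totals)]


# Hypothetical page fragment and yearly totals, invented for illustration.
page = 'x\nvar data = [{"ngram": "Star Wars", "timeseries": [0.0, 2e-06, 5e-06]}];\n'
freqs = extract_timeseries(page)
print(to_absolute(freqs, [1000000, 2000000, 3000000]))
```

Because the frequencies are relative, the absolute counts are only as accurate as the totals they are multiplied by, and the rounding hides small differences.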
{ | |
"cell_type": "code", | |
"execution_count": 175, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"image/png": "\n", | |
"text/plain": [ | |
"<Figure size 1080x576 with 1 Axes>" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"plot_absolute_counts('Star Wars', 'english', smoothing=0, start_year=1800, end_year=2008)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"**Let's see how many times the phrase \"Star Wars\" appeared in each year**" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 176, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Absolute Counts for: Star Wars\n", | |
"1800: 0\n", | |
"1801: 0\n", | |
"1802: 0\n", | |
"1803: 0\n", | |
"1804: 0\n", | |
"1805: 0\n", | |
"1806: 0\n", | |
"1807: 0\n", | |
"1808: 0\n", | |
"1809: 0\n", | |
"1810: 0\n", | |
"1811: 0\n", | |
"1812: 0\n", | |
"1813: 0\n", | |
"1814: 0\n", | |
"1815: 0\n", | |
"1816: 0\n", | |
"1817: 0\n", | |
"1818: 0\n", | |
"1819: 0\n", | |
"1820: 0\n", | |
"1821: 0\n", | |
"1822: 0\n", | |
"1823: 0\n", | |
"1824: 0\n", | |
"1825: 0\n", | |
"1826: 0\n", | |
"1827: 0\n", | |
"1828: 0\n", | |
"1829: 0\n", | |
"1830: 0\n", | |
"1831: 0\n", | |
"1832: 0\n", | |
"1833: 0\n", | |
"1834: 0\n", | |
"1835: 0\n", | |
"1836: 0\n", | |
"1837: 0\n", | |
"1838: 0\n", | |
"1839: 0\n", | |
"1840: 0\n", | |
"1841: 0\n", | |
"1842: 0\n", | |
"1843: 0\n", | |
"1844: 0\n", | |
"1845: 0\n", | |
"1846: 0\n", | |
"1847: 0\n", | |
"1848: 0\n", | |
"1849: 0\n", | |
"1850: 0\n", | |
"1851: 0\n", | |
"1852: 0\n", | |
"1853: 1\n", | |
"1854: 0\n", | |
"1855: 0\n", | |
"1856: 0\n", | |
"1857: 0\n", | |
"1858: 0\n", | |
"1859: 0\n", | |
"1860: 0\n", | |
"1861: 0\n", | |
"1862: 0\n", | |
"1863: 0\n", | |
"1864: 0\n", | |
"1865: 0\n", | |
"1866: 0\n", | |
"1867: 0\n", | |
"1868: 0\n", | |
"1869: 0\n", | |
"1870: 0\n", | |
"1871: 2\n", | |
"1872: 0\n", | |
"1873: 0\n", | |
"1874: 0\n", | |
"1875: 0\n", | |
"1876: 0\n", | |
"1877: 0\n", | |
"1878: 0\n", | |
"1879: 0\n", | |
"1880: 1\n", | |
"1881: 0\n", | |
"1882: 0\n", | |
"1883: 0\n", | |
"1884: 0\n", | |
"1885: 0\n", | |
"1886: 0\n", | |
"1887: 0\n", | |
"1888: 0\n", | |
"1889: 0\n", | |
"1890: 0\n", | |
"1891: 0\n", | |
"1892: 0\n", | |
"1893: 0\n", | |
"1894: 0\n", | |
"1895: 0\n", | |
"1896: 0\n", | |
"1897: 0\n", | |
"1898: 2\n", | |
"1899: 0\n", | |
"1900: 0\n", | |
"1901: 0\n", | |
"1902: 0\n", | |
"1903: 0\n", | |
"1904: 0\n", | |
"1905: 0\n", | |
"1906: 0\n", | |
"1907: 0\n", | |
"1908: 0\n", | |
"1909: 0\n", | |
"1910: 0\n", | |
"1911: 0\n", | |
"1912: 0\n", | |
"1913: 0\n", | |
"1914: 0\n", | |
"1915: 0\n", | |
"1916: 0\n", | |
"1917: 0\n", | |
"1918: 0\n", | |
"1919: 0\n", | |
"1920: 0\n", | |
"1921: 0\n", | |
"1922: 0\n", | |
"1923: 1\n", | |
"1924: 0\n", | |
"1925: 8\n", | |
"1926: 0\n", | |
"1927: 0\n", | |
"1928: 0\n", | |
"1929: 1\n", | |
"1930: 2\n", | |
"1931: 1\n", | |
"1932: 0\n", | |
"1933: 0\n", | |
"1934: 0\n", | |
"1935: 2\n", | |
"1936: 0\n", | |
"1937: 0\n", | |
"1938: 0\n", | |
"1939: 0\n", | |
"1940: 0\n", | |
"1941: 0\n", | |
"1942: 0\n", | |
"1943: 0\n", | |
"1944: 0\n", | |
"1945: 0\n", | |
"1946: 3\n", | |
"1947: 0\n", | |
"1948: 0\n", | |
"1949: 0\n", | |
"1950: 1\n", | |
"1951: 0\n", | |
"1952: 0\n", | |
"1953: 0\n", | |
"1954: 2\n", | |
"1955: 0\n", | |
"1956: 0\n", | |
"1957: 0\n", | |
"1958: 0\n", | |
"1959: 0\n", | |
"1960: 1\n", | |
"1961: 1\n", | |
"1962: 2\n", | |
"1963: 2\n", | |
"1964: 1\n", | |
"1965: 0\n", | |
"1966: 6\n", | |
"1967: 2\n", | |
"1968: 1\n", | |
"1969: 0\n", | |
"1970: 13\n", | |
"1971: 7\n", | |
"1972: 10\n", | |
"1973: 6\n", | |
"1974: 14\n", | |
"1975: 28\n", | |
"1976: 37\n", | |
"1977: 360\n", | |
"1978: 717\n", | |
"1979: 1308\n", | |
"1980: 1707\n", | |
"1981: 1706\n", | |
"1982: 2078\n", | |
"1983: 4300\n", | |
"1984: 3432\n", | |
"1985: 5937\n", | |
"1986: 9456\n", | |
"1987: 10851\n", | |
"1988: 9894\n", | |
"1989: 7881\n", | |
"1990: 7483\n", | |
"1991: 5880\n", | |
"1992: 6698\n", | |
"1993: 4961\n", | |
"1994: 4935\n", | |
"1995: 5756\n", | |
"1996: 6233\n", | |
"1997: 8019\n", | |
"1998: 8456\n", | |
"1999: 14481\n", | |
"2000: 13662\n", | |
"2001: 11094\n", | |
"2002: 14459\n", | |
"2003: 14025\n", | |
"2004: 15768\n", | |
"2005: 14908\n", | |
"2006: 17430\n", | |
"2007: 15723\n", | |
"2008: 15201\n" | |
] | |
} | |
], | |
"source": [ | |
"print_absolute_counts('Star Wars', 'english', smoothing=0, start_year=1800, end_year=2008)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"**Tested both ways, we get exactly the same result**" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### Some tests with the Russian language" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 172, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"image/png": "\n", | |
"text/plain": [ | |
"<Figure size 1080x576 with 1 Axes>" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"plot_absolute_counts('Война и Мир', 'russian', smoothing=0, start_year=1800, end_year=2008)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 173, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Absolute Counts for: Война и Мир\n", | |
"1800: 0\n", | |
"1801: 0\n", | |
"1802: 0\n", | |
"1803: 0\n", | |
"1804: 0\n", | |
"1805: 0\n", | |
"1806: 0\n", | |
"1807: 0\n", | |
"1808: 0\n", | |
"1809: 0\n", | |
"1810: 0\n", | |
"1811: 0\n", | |
"1812: 0\n", | |
"1813: 0\n", | |
"1814: 0\n", | |
"1815: 0\n", | |
"1816: 0\n", | |
"1817: 0\n", | |
"1818: 0\n", | |
"1819: 0\n", | |
"1820: 0\n", | |
"1821: 0\n", | |
"1822: 0\n", | |
"1823: 0\n", | |
"1824: 0\n", | |
"1825: 0\n", | |
"1826: 0\n", | |
"1827: 0\n", | |
"1828: 0\n", | |
"1829: 0\n", | |
"1830: 0\n", | |
"1831: 0\n", | |
"1832: 0\n", | |
"1833: 0\n", | |
"1834: 0\n", | |
"1835: 0\n", | |
"1836: 0\n", | |
"1837: 0\n", | |
"1838: 0\n", | |
"1839: 0\n", | |
"1840: 0\n", | |
"1841: 0\n", | |
"1842: 0\n", | |
"1843: 0\n", | |
"1844: 0\n", | |
"1845: 0\n", | |
"1846: 0\n", | |
"1847: 0\n", | |
"1848: 0\n", | |
"1849: 0\n", | |
"1850: 0\n", | |
"1851: 0\n", | |
"1852: 0\n", | |
"1853: 0\n", | |
"1854: 0\n", | |
"1855: 0\n", | |
"1856: 0\n", | |
"1857: 0\n", | |
"1858: 0\n", | |
"1859: 0\n", | |
"1860: 0\n", | |
"1861: 0\n", | |
"1862: 0\n", | |
"1863: 0\n", | |
"1864: 0\n", | |
"1865: 0\n", | |
"1866: 0\n", | |
"1867: 0\n", | |
"1868: 0\n", | |
"1869: 0\n", | |
"1870: 0\n", | |
"1871: 0\n", | |
"1872: 0\n", | |
"1873: 0\n", | |
"1874: 0\n", | |
"1875: 0\n", | |
"1876: 0\n", | |
"1877: 0\n", | |
"1878: 0\n", | |
"1879: 0\n", | |
"1880: 0\n", | |
"1881: 0\n", | |
"1882: 0\n", | |
"1883: 0\n", | |
"1884: 0\n", | |
"1885: 0\n", | |
"1886: 0\n", | |
"1887: 0\n", | |
"1888: 0\n", | |
"1889: 0\n", | |
"1890: 0\n", | |
"1891: 0\n", | |
"1892: 0\n", | |
"1893: 0\n", | |
"1894: 0\n", | |
"1895: 0\n", | |
"1896: 0\n", | |
"1897: 0\n", | |
"1898: 0\n", | |
"1899: 0\n", | |
"1900: 0\n", | |
"1901: 3\n", | |
"1902: 1\n", | |
"1903: 0\n", | |
"1904: 0\n", | |
"1905: 0\n", | |
"1906: 1\n", | |
"1907: 0\n", | |
"1908: 0\n", | |
"1909: 0\n", | |
"1910: 5\n", | |
"1911: 0\n", | |
"1912: 0\n", | |
"1913: 0\n", | |
"1914: 0\n", | |
"1915: 0\n", | |
"1916: 0\n", | |
"1917: 0\n", | |
"1918: 0\n", | |
"1919: 12\n", | |
"1920: 1\n", | |
"1921: 5\n", | |
"1922: 28\n", | |
"1923: 93\n", | |
"1924: 87\n", | |
"1925: 96\n", | |
"1926: 52\n", | |
"1927: 72\n", | |
"1928: 110\n", | |
"1929: 45\n", | |
"1930: 18\n", | |
"1931: 14\n", | |
"1932: 13\n", | |
"1933: 9\n", | |
"1934: 5\n", | |
"1935: 12\n", | |
"1936: 13\n", | |
"1937: 21\n", | |
"1938: 39\n", | |
"1939: 0\n", | |
"1940: 20\n", | |
"1941: 7\n", | |
"1942: 0\n", | |
"1943: 6\n", | |
"1944: 2\n", | |
"1945: 7\n", | |
"1946: 38\n", | |
"1947: 14\n", | |
"1948: 14\n", | |
"1949: 14\n", | |
"1950: 30\n", | |
"1951: 10\n", | |
"1952: 16\n", | |
"1953: 35\n", | |
"1954: 13\n", | |
"1955: 62\n", | |
"1956: 19\n", | |
"1957: 14\n", | |
"1958: 34\n", | |
"1959: 22\n", | |
"1960: 30\n", | |
"1961: 21\n", | |
"1962: 18\n", | |
"1963: 21\n", | |
"1964: 34\n", | |
"1965: 12\n", | |
"1966: 70\n", | |
"1967: 8\n", | |
"1968: 9\n", | |
"1969: 30\n", | |
"1970: 44\n", | |
"1971: 9\n", | |
"1972: 11\n", | |
"1973: 14\n", | |
"1974: 21\n", | |
"1975: 17\n", | |
"1976: 14\n", | |
"1977: 10\n", | |
"1978: 26\n", | |
"1979: 10\n", | |
"1980: 5\n", | |
"1981: 10\n", | |
"1982: 16\n", | |
"1983: 17\n", | |
"1984: 6\n", | |
"1985: 6\n", | |
"1986: 7\n", | |
"1987: 11\n", | |
"1988: 8\n", | |
"1989: 3\n", | |
"1990: 27\n", | |
"1991: 22\n", | |
"1992: 21\n", | |
"1993: 21\n", | |
"1994: 14\n", | |
"1995: 14\n", | |
"1996: 35\n", | |
"1997: 28\n", | |
"1998: 16\n", | |
"1999: 51\n", | |
"2000: 68\n", | |
"2001: 50\n", | |
"2002: 46\n", | |
"2003: 30\n", | |
"2004: 43\n", | |
"2005: 34\n", | |
"2006: 29\n", | |
"2007: 40\n", | |
"2008: 18\n" | |
] | |
} | |
], | |
"source": [ | |
"print_absolute_counts('Война и Мир', 'russian', smoothing=0, start_year=1800, end_year=2008)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Some Problems With Google Ngram" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"**Overabundance of Scientific Literature**" | |
] | |
}, | |
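{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"The Google Books English-language corpus is a mishmash of fiction, nonfiction, reports, proceedings, and, as Dodds’ paper seems to show, a whole lot of scientific literature. The chart below compares capitalized \"Figure\", which is typical of captions and references in scientific texts, with lowercase \"figure\"." | |
] | |
}, | |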
{ | |
"cell_type": "code", | |
"execution_count": 185, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"\n", | |
" <iframe\n", | |
" width=\"800\"\n", | |
" height=\"500\"\n", | |
" src=\"https://books.google.com/ngrams/interactive_chart?content=Figure%2C+figure&year_start=1880&year_end=2008&corpus=17&smoothing=3&share=&direct_url=t1%3B%2CFigure%3B%2Cc0%3B.t1%3B%2Cfigure%3B%2Cc0\"\n", | |
" frameborder=\"0\"\n", | |
" allowfullscreen\n", | |
" ></iframe>\n", | |
" " | |
], | |
"text/plain": [ | |
"<IPython.lib.display.IFrame at 0x20557e22748>" | |
] | |
}, | |
"execution_count": 185, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"IFrame(\"https://books.google.com/ngrams/interactive_chart?content=Figure%2C+figure&year_start=1880&year_end=2008&corpus=17&smoothing=3&share=&direct_url=t1%3B%2CFigure%3B%2Cc0%3B.t1%3B%2Cfigure%3B%2Cc0\", \n", | |
" width=800, \n", | |
" height=500)" | |
] | |
}, | |
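The chart URLs used in this notebook share the same query parameters (`content`, `year_start`, `year_end`, `corpus`, `smoothing`), so they can be assembled programmatically instead of being pasted by hand. A minimal sketch follows; the helper name `ngram_chart_url` is hypothetical, and it assumes that the `share` and `direct_url` parameters seen in the pasted URLs are optional, as the shorter URL in `retrieve_absolute_counts` suggests.

```python
from urllib.parse import urlencode

BASE_URL = "https://books.google.com/ngrams/interactive_chart?"


def ngram_chart_url(terms, corpus_id, start_year=1880, end_year=2008, smoothing=3):
    """Build an interactive-chart URL for a list of terms (hypothetical helper)."""
    params = {
        "content": ",".join(terms),  # terms are comma-separated in one parameter
        "year_start": start_year,
        "year_end": end_year,
        "corpus": corpus_id,         # e.g. 17 = American English in this notebook
        "smoothing": smoothing,
    }
    # urlencode percent-encodes the comma and any non-ASCII characters
    return BASE_URL + urlencode(params)


print(ngram_chart_url(["Figure", "figure"], corpus_id=17))
```

The resulting string can be passed to `IFrame(...)` in the same way as the hand-written URLs.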
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"**Old data, bad OCR**" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"In older books the medial s (ſ, also called the long s)\n", | |
"is often incorrectly recognized as an ‘f’ by the OCR software, so that, for example, \"best\" is read as \"beft\"." | |
] | |
}, | |
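One way to spot likely long-s misreadings is to turn each 'f' in a suspicious token back into 's' and check whether a known word results. The sketch below is only a heuristic under stated assumptions: the function name `long_s_candidates` and the tiny vocabulary are invented for illustration, and a real check would need a proper word list.

```python
def long_s_candidates(token, vocabulary):
    """Return known words obtained by replacing one or more 'f' with 's'."""
    positions = [i for i, ch in enumerate(token) if ch == "f"]
    candidates = set()
    # Try every non-empty subset of the 'f' positions.
    for mask in range(1, 2 ** len(positions)):
        chars = list(token)
        for bit, pos in enumerate(positions):
            if mask & (1 << bit):
                chars[pos] = "s"
        word = "".join(chars)
        if word in vocabulary:
            candidates.add(word)
    return sorted(candidates)


# Hypothetical vocabulary, for illustration only.
vocabulary = {"best", "fast", "success", "left"}
print(long_s_candidates("beft", vocabulary))
print(long_s_candidates("fuccefs", vocabulary))
print(long_s_candidates("left", vocabulary))
```

A token such as "left" is a real word with a genuine 'f', so an empty candidate list does not prove the token is correct, and a non-empty one does not prove it is an OCR error; the heuristic only flags tokens worth a closer look.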
{ | |
"cell_type": "code", | |
"execution_count": 187, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"\n", | |
" <iframe\n", | |
" width=\"800\"\n", | |
" height=\"500\"\n", | |
" src=\"https://books.google.com/ngrams/interactive_chart?content=beft&year_start=1880&year_end=2008&corpus=15&smoothing=3&share=&direct_url=t1%3B%2Cbeft%3B%2Cc0\"\n", | |
" frameborder=\"0\"\n", | |
" allowfullscreen\n", | |
" ></iframe>\n", | |
" " | |
], | |
"text/plain": [ | |
"<IPython.lib.display.IFrame at 0x2055616bdd8>" | |
] | |
}, | |
"execution_count": 187, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"IFrame(\"https://books.google.com/ngrams/interactive_chart?content=beft&year_start=1880&year_end=2008&corpus=15&smoothing=3&share=&direct_url=t1%3B%2Cbeft%3B%2Cc0\", \n", | |
" width=800, \n", | |
" height=500)" | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python [default]", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.6.5" | |
}, | |
"toc": { | |
"base_numbering": 1, | |
"nav_menu": {}, | |
"number_sections": true, | |
"sideBar": true, | |
"skip_h1_title": false, | |
"title_cell": "Table of Contents", | |
"title_sidebar": "Contents", | |
"toc_cell": false, | |
"toc_position": {}, | |
"toc_section_display": true, | |
"toc_window_display": false | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 2 | |
} |