@jeffThompson
Last active March 14, 2022 08:52
A Python function to test if a noun is countable. Too many requests will get you locked out, so use sparingly. EDIT: this probably no longer works after changes to the Ngram Viewer; see comments below.
# -*- coding: utf-8 -*-
# Python 2 (uses urllib2)
import re, urllib2, json

def countable_noun(thing):
    '''
    searches Google NGram to see if a word is a countable/mass noun
    returns True if countable, False if not
    ex: cats are countable (many cats)
        bread is not (much bread)
    '''
    # format into url (replace spaces with + for url)
    query = thing.replace(' ', '+')
    url = 'https://books.google.com/ngrams/graph?content=many+' + query + '%2C+much+' + query + '&year_start=1800&year_end=2000'
    response = urllib2.urlopen(url)
    html = response.read()

    # extract timeseries data from html source
    # if an error is thrown, it's likely there's no match for the term
    term = re.escape(thing)   # escape regex metacharacters in the search term
    try:
        many_data = json.loads(re.search(r'\{"ngram": "many ' + term + r'".*?\}', html, re.IGNORECASE).group(0))['timeseries']
        many = sum(many_data) / float(len(many_data))
    except Exception:
        many = 0.0
    try:
        much_data = json.loads(re.search(r'\{"ngram": "much ' + term + r'".*?\}', html, re.IGNORECASE).group(0))['timeseries']
        much = sum(much_data) / float(len(much_data))
    except Exception:
        much = 0.0

    # return True if countable; False if not
    return many > much
@jeffThompson

@pujaji – are you running Python 3? I think a few things here will probably need to be changed for it to work.
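
For reference, a minimal sketch of the fetching side under Python 3, assuming only the standard library. In Python 3, `urllib2` was split into `urllib.request` and related modules, and `read()` returns bytes that have to be decoded before string regexes can be applied. The helper names `build_ngram_url` and `fetch_ngram_page` are hypothetical and not part of the gist.

```python
from urllib.parse import quote_plus
from urllib.request import urlopen


def build_ngram_url(thing, year_start=1800, year_end=2000):
    """Build the same 'many X, much X' query URL the gist uses."""
    many = quote_plus('many ' + thing)   # spaces become '+'
    much = quote_plus('much ' + thing)
    return ('https://books.google.com/ngrams/graph?content='
            + many + '%2C+' + much
            + '&year_start=' + str(year_start)
            + '&year_end=' + str(year_end))


def fetch_ngram_page(thing):
    """Download the page source as text (this makes a network request)."""
    with urlopen(build_ngram_url(thing)) as response:
        # Python 3 returns bytes; decode before using str-based regexes
        return response.read().decode('utf-8')
```

This only covers the request; whether the rest of the function still works depends on what the page returns.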

@pujaji

pujaji commented Jun 28, 2021

Yes Jeff, I think the Google Ngram Viewer page has changed; it now shows only plots. Maybe I should work on getting the numerical data from the plot shown on the HTML page, but I am confused about how to do that. Please advise if you have time. Thanks.

@jeffThompson

Ah, you're right! They totally changed how it works, so this example will need changes. I took a quick look at the page source after doing a search, and luckily the info is in a super easy to grab format:

ngrams.data = [{"ngram": "machinist", "parent": "", "type": "NGRAM", "timeseries": [1.4739899700799697e-07, 1.348211903007268e-07, 1.1842876723733298e-07, 1.1379537058685076e-07, 5.182040295283384e-08, 6.510889757497093e-08, 7.12379727271387e-08, 7.572651838927413e-08, 7.118528902846524e-08, 9.003876806039963e-08, 9.300940513006545e-08, 9.808380596609823e-08, 9.377242855228879e-08, 1.0987565853984701e-07, 1.1320609719486388e-07, 1.1803548005673292e-07, 1.0500249812496674e-07, 1.02940623898056e-07, 9.539117889240905e-08, 9.215533276168181e-08, 9.655714500971563e-08, 1.1061403018938368e-07, 1.4117352518740388e-07, 1.6820891701740948e-07, 1.8103443874143261e-07, 1.9545032107056874e-07, 2.1221589463493729e-07, 2.141315990554306e-07, 2.222391291947393e-07, 2.5915020834028315e-07, 3.02379575600753e-07, 3.4788918412687053e-07, 3.7962420305705435e-07, 4.1261475912181466e-07, 4.229427759128157e-07, 4.3118236752499275e-07, 3.981889652290452e-07, 3.6521762630010404e-07, 3.5520728098552876e-07, 3.5216554903984486e-07, 3.4268149257903655e-07, 3.414977527427254e-07, 3.2718940141356665e-07, 3.43302768815842e-07, 3.600926135050031e-07, 3.5356681036903736e-07, 3.868270762755596e-07, 4.1533521206409727e-07, 4.6060549761607816e-07, 6.041838348989716e-07, 6.357448764252435e-07, 6.80853077028587e-07, 7.355283823276945e-07, 7.604222917377359e-07, 8.44948090161779e-07, 8.713043371244567e-07, 7.843251929833579e-07, 7.755396609354648e-07, 7.632441533392662e-07, 7.721446958644915e-07, 7.700916577440304e-07, 7.248112134610502e-07, 7.699344450884382e-07, 8.420516012717729e-07, 8.907340267277115e-07, 9.125293201707661e-07, 9.223441078834835e-07, 1.007712252690648e-06, 1.0959428859028516e-06, 1.1034169509522535e-06, 1.1550223462108988e-06, 1.2183400193472543e-06, 1.265537938317201e-06, 1.298767821416342e-06, 1.267481317362191e-06, 1.2285755214439372e-06, 1.2507779436938499e-06, 1.2193463589937892e-06, 1.277503559192285e-06, 1.3690164126143957e-06, 1.4778632281117358e-06, 1.4962817626188292e-06, 
1.5256110827976954e-06, 1.4975806997037061e-06, 1.570056351088819e-06, 1.4656154527464032e-06, 1.3693768810948574e-06, 1.2387723212928644e-06, 1.2006022741063913e-06, 1.1718832411133917e-06, 1.2111266057347945e-06, 1.100091269888383e-06, 1.1177015234612294e-06, 1.1607799836253565e-06, 1.2216061285081587e-06, 1.2590061585017663e-06, 1.3199066220944848e-06, 1.3489552332300394e-06, 1.5057829289487148e-06, 1.5837644338846855e-06, 1.6554606645513559e-06, 1.7229338839440191e-06, 1.8231381123509242e-06, 1.893546417315e-06, 1.96132389516736e-06, 1.9339303370153563e-06, 1.970586660198543e-06, 1.9648225263933583e-06, 1.974395721455429e-06, 1.979509712717637e-06, 2.001972153006396e-06, 1.9791127020393366e-06, 1.9880930527246425e-06, 2.033439516578385e-06, 2.0917057911771863e-06, 2.199455707341258e-06, 2.277113204789722e-06, 2.287562844165742e-06, 2.43619417363204e-06, 2.42775149672525e-06, 2.3384793946336558e-06, 2.2519496139596285e-06, 2.0859102960457676e-06, 1.9284370377395784e-06, 1.7891081012878982e-06, 1.5648698438651211e-06, 1.488634520033624e-06, 1.4408942336428611e-06, 1.3745447436771689e-06, 1.287856507912823e-06, 1.208424773072433e-06, 1.1526888589027555e-06, 1.0905553722295735e-06, 1.0887735584219627e-06, 1.1215165532770958e-06, 1.1348597987697367e-06, 1.243936708306137e-06, 1.3858857934272756e-06, 1.584449112864864e-06, 1.8271579652459228e-06, 2.040407025560853e-06, 2.1458366420899566e-06, 2.243341214125394e-06, 2.283818194622914e-06, 2.2802489476297135e-06, 2.174424123170736e-06, 2.0150623478392456e-06, 1.8263988944714323e-06, 1.709266663445825e-06, 1.659539975662483e-06, 1.588295390320127e-06, 1.5279037565310967e-06, 1.4833603115351122e-06, 1.4534114630545706e-06, 1.3881045138727392e-06, 1.3355913454558634e-06, 1.2504312962846598e-06, 1.1849686123761266e-06, 1.1212977629579006e-06, 1.065529683533636e-06, 1.00195150512913e-06, 9.7355304465704e-07, 9.238422973924441e-07, 8.896907875818474e-07, 8.807920478050489e-07, 8.444980851761232e-07, 8.062077085924102e-07, 
7.869779778957309e-07, 7.489803611601278e-07, 7.3388574557092e-07, 7.093890417309012e-07, 6.534367896295277e-07, 6.256180833068876e-07, 6.255982449537052e-07, 6.019500022245796e-07, 5.959854385244918e-07, 5.723729909082717e-07, 5.610688893804243e-07, 5.808613049599185e-07, 5.829821816405456e-07, 5.510321443580324e-07, 5.332716455086484e-07, 5.060041944813877e-07, 4.917715849777389e-07, 4.6871159448918273e-07, 4.2844018821597274e-07, 4.072306215415925e-07, 4.024473655915374e-07, 3.920373176567539e-07, 3.9740479808512775e-07, 3.899326657119673e-07, 3.7553687093350163e-07, 3.590554586350793e-07, 3.4783089323095086e-07, 3.345802537881225e-07, 3.2801866365973963e-07, 3.1442707739058407e-07, 3.123410254244975e-07, 3.118361990215947e-07, 3.0693519192936947e-07, 3.01494955985098e-07, 3.0081092664399224e-07, 2.965345475201632e-07, 2.882770786527544e-07, 2.791344278258683e-07, 2.767382259857446e-07, 2.76221263512915e-07, 2.739833908955396e-07, 2.632859081229201e-07, 2.5036409161722464e-07, 2.461818736167386e-07, 2.3902213034229393e-07, 2.288930492763289e-07, 2.212990852155729e-07, 2.172019476022667e-07, 2.1404075515160263e-07, 2.1204044945274552e-07, 2.112723578306941e-07, 2.0866827412646672e-07, 2.046360130236735e-07]}];

You should be able to grab the page's source, look for that variable name, and parse the list!
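
A minimal sketch of that approach in Python 3, assuming the page source still contains an `ngrams.data = [...];` assignment like the one above and that the list is valid JSON. The function names (`extract_ngram_data`, `mean_frequency`, `is_countable`) are hypothetical. Fetching the page is left out, so the parsing can be tried on a saved copy of the source.

```python
import json
import re


def extract_ngram_data(html):
    """Return the list assigned to ngrams.data in the page source, or []."""
    match = re.search(r'ngrams\.data\s*=\s*', html)
    if match is None:
        return []
    try:
        # raw_decode parses one JSON value starting at the given index
        # and ignores whatever follows it (here, the trailing ';')
        data, _ = json.JSONDecoder().raw_decode(html, match.end())
    except ValueError:
        return []
    return data


def mean_frequency(data, phrase):
    """Average the timeseries for one ngram (case-insensitive); 0.0 if absent."""
    for entry in data:
        if entry.get('ngram', '').lower() == phrase.lower():
            series = entry.get('timeseries', [])
            return sum(series) / len(series) if series else 0.0
    return 0.0


def is_countable(html, thing):
    """Apply the gist's heuristic: 'many X' is more frequent than 'much X'."""
    data = extract_ngram_data(html)
    return mean_frequency(data, 'many ' + thing) > mean_frequency(data, 'much ' + thing)
```

To try it, save the page source for a "many X, much X" search to a file, read it into a string, and pass it to `is_countable` along with the noun. If the variable name or layout changes again, `extract_ngram_data` returns an empty list and the result falls back to `False`.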

@pujaji

pujaji commented Jun 28, 2021

Yes, thanks a lot!
