""" | |
Example use of the spaCy NLP tools for data exploration. | |
Here we will look for reddit comments that describe Google doing something, | |
i.e. discuss the company's actions. This is difficult, because other senses of | |
"Google" now dominate usage of the word in conversation, particularly references to | |
using Google products. | |
The heuristics here are quick and dirty --- about 5 minutes work. A better approach | |
is to use the word vector of the verb. But, the demo here is just to show what's | |
possible to build up quickly, to start to understand some data. | |
""" | |
from __future__ import unicode_literals | |
from __future__ import print_function | |
import sys | |
import plac | |
import bz2 | |
import ujson | |
import spacy.en | |
def main(input_loc): | |
nlp = spacy.en.English() # Load the model takes 10-20 seconds. | |
for line in bz2.BZ2File(input_loc): # Iterate over the reddit comments from the dump. | |
comment_str = ujson.loads(line)['body'] # Parse the json object, and extract the 'body' attribute. | |
comment_parse = nlp(comment_str) # Apply the spaCy NLP pipeline. | |
for word in comment_parse: # Look for the cases we want | |
if google_doing_something(word): | |
# Print the clause | |
print(''.join(w.string for w in word.head.subtree).strip()) | |
def google_doing_something(w): | |
if w.lower_ != 'google': | |
return False | |
elif w.dep_ != 'nsubj': # Is it the subject of a verb? | |
return False | |
elif w.head.lemma_ == 'be' and w.head.dep_ != 'aux': # And not 'is' | |
return False | |
elif w.head.lemma_ in ('say', 'show'): # Exclude e.g. "Google says..." | |
return False | |
else: | |
return True | |
if __name__ == '__main__': | |
plac.call(main) |
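To make the extraction step concrete, here's a minimal sketch of what the subtree join in the inner loop pulls out. The sentence is made up, and it assumes the parser attaches 'Google' as nsubj of 'bought'; it uses the same spacy.en API as the script above:

import spacy.en

nlp = spacy.en.English()
doc = nlp("Apparently Google bought the flying cars patent last week.")
for word in doc:
    if word.lower_ == 'google' and word.dep_ == 'nsubj':
        # word.head is the verb; its subtree covers the whole clause,
        # so joining the token strings recovers the surface text.
        print(''.join(w.string for w in word.head.subtree).strip())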
Well, the idea is that words-before-and-after is really just a proxy measure for the syntactic structure, which is a "tree". But we don't have to rely on the string order: spaCy gives you that tree :).
Like, compare these sentences (trees provided by CMU's parser, since I don't have spaCy linked up to a visualiser yet):
a) "a quick Google would show you're wrong"
http://demo.ark.cs.cmu.edu/parse?sentence=A%20quick%20Google%20would%20show%20you%27re%20wrong.
b) "Google shows you're wrong"
http://demo.ark.cs.cmu.edu/parse?sentence=A%20quick%20Google%20shows%20you%27re%20wrong.
You see the arc labelled "nsubj" from "show" to Google? That's the sort of relationship we're checking for in the google_doing_something function. The "dep" property refers to the label of the arc (e.g. nsubj), and the "lemma" property ensures we get the uninflected form ("show", not "shows").
The idea is to give representations that abstract away a lot of the incidental variation, so that you can write more precise rules for what you're looking for. The CMU parser page has an example of a representation that's more abstract still, the semantic parse. But there the accuracy starts to go down, and we get too many parse errors. The syntactic parse is a sort of compromise: we can extract this "view" of the sentence reasonably reliably (about 92% of the arcs are correct), while it's still abstract enough to be helpful.
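If you want to inspect those arcs yourself, a minimal sketch along these lines should work (same spacy.en API as the script above; the token attributes are the ones the heuristic already uses, and the sentence is just an example):

import spacy.en

nlp = spacy.en.English()
doc = nlp("A quick Google shows you're wrong.")
for token in doc:
    # dep_ is the arc label (e.g. 'nsubj'), head is the governing word,
    # and lemma_ is the uninflected form ('show', not 'shows').
    print(token.orth_, token.dep_, token.head.orth_, token.head.lemma_)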
Awesome. Absolutely awesome work man.
The logic of catching some of those tricky adverbs/verbs (e.g. 'A quick Google') would be hard to generalize... Maybe this is too strict, but I assume it's possible to check the word falling directly before/after 'Google' and negate results that contain any verb/adverb on a given blacklist? (See the sketch below.)
Also, 1000 points just for trawling this:
"So, Google bought the flying cars patent and Apple acquired self lacing shoes."
haha
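For what it's worth, a hedged sketch of that blacklist idea might look like the following. The names (BLACKLIST, preceded_by_blacklisted) and the blacklist entries are hypothetical; it only checks the word before 'Google', and extending it to the word after is the same move:

import spacy.en

BLACKLIST = {'quick', 'fast', 'little'}  # illustrative entries only

def preceded_by_blacklisted(doc, i):
    # True if the token just before position i is on the blacklist.
    return i > 0 and doc[i - 1].lower_ in BLACKLIST

nlp = spacy.en.English()
doc = nlp("A quick Google would show you're wrong.")
for i, word in enumerate(doc):
    if word.lower_ == 'google' and preceded_by_blacklisted(doc, i):
        print("skipping:", word.orth_)  # drop 'a quick Google' style hits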