@jrjames83
Created March 20, 2018 12:43
@jsma commented Mar 21, 2018

Thanks! I had no idea gists would render a notebook like this.

For the unigrams, it looks like you're doing a raw count of how many rows had "youtube" in the search phrase. In my case I took the pageviews associated with the originating search phrase and summed them, which gives me a better sense of how often a word was actually used in search. Using your data set, I'd sum the pageviews for every row that contains 'youtube' to come up with the number.
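
Roughly what I mean (the DataFrame and column names here are just illustrative):

```python
import pandas as pd

# Illustrative frame -- the real data would come from an analytics export
df = pd.DataFrame({
    "search_phrase": ["youtube tutorial", "youtube api", "fast facts", "fast fact sheet"],
    "pageviews": [120, 45, 300, 25],
})

# Raw count: how many rows contain the unigram
raw_count = df["search_phrase"].str.contains("youtube").sum()

# Pageview-weighted count: sum pageviews for every row containing the unigram
weighted = df.loc[df["search_phrase"].str.contains("youtube"), "pageviews"].sum()

print(raw_count, weighted)  # 2 165
```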

I'll give your startswith stuff a try up through "Out[24]:". I'll have to dig into the clustering stuff some other day.

I'm using a custom stop words list and just filtering in Python (if word in stop_list: continue, etc.), but any thoughts on merging singular vs. plural forms without bringing in nltk? In my data set "fast facts" is the top search phrase (well, of the phrases that match exactly; it only represents a tiny fraction of actual search volume, the data is all long tail) and "fast fact" is #20 in the list. I may just do if word.endswith('s'): strip the trailing 's' and see if the result is a word that appears elsewhere, or some other hacky approximation.
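
Something like this is what I have in mind (the stop list and phrases are made up):

```python
from collections import Counter

stop_list = {"a", "the", "and", "of", "for"}  # custom stop words, made up here

phrases = ["fast facts", "fast fact", "the fast facts page"]

counts = Counter()
for phrase in phrases:
    for word in phrase.lower().split():
        if word in stop_list:
            continue
        counts[word] += 1

# Hacky singular/plural merge: if stripping a trailing 's' yields a word we
# already counted, fold the plural's count into the singular form.
for word in list(counts):
    if word.endswith("s") and word[:-1] in counts:
        counts[word[:-1]] += counts.pop(word)

print(counts)  # Counter({'fast': 3, 'fact': 3, 'page': 1})
```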

Thanks again!

@jrjames83 (Author)

My bad, just saw your comment. Regarding reducing terms to their base form (de-pluralization, etc.), there are two approaches: stemming, which applies a crude heuristic to chop terms down, and lemmatization, which is more complex and may change the word entirely depending on the word.

To use a plain old stemmer without NLTK, you can just google "porter stemmer python" and find a standalone implementation. NLTK is easy enough, though, unless you're trying to keep an environment clean of heavier packages or something.
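
With NLTK it's only a couple of lines (the lemmatizer needs the wordnet corpus downloaded once):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

# nltk.download('wordnet') is required once before using the lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("facts"))          # fact
print(stemmer.stem("running"))        # run
print(lemmatizer.lemmatize("facts"))  # fact
```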

Filtering stop words that way is definitely normal: [x for x in my_list if x not in ['a', 'the', 'and', 'in', 'at', ...]]

Yeah, the startswith or endswith approximations are usually good enough for grouping. And to your point about grouping first and then summing the metric: yes, agreed.

@jrjames83 (Author)

You could also group by a lexical sort: if you have a bunch of phrases, then for each phrase, split it into a list of words, sort them alphabetically, use that sorted result as the key into a defaultdict(list), and append the original, unmodified phrase as the value. That gives you a lexically ordered grouping.
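
A quick sketch of that idea (the phrases are made up):

```python
from collections import defaultdict

phrases = ["youtube api", "api youtube", "fast facts", "facts fast", "fast fact"]

groups = defaultdict(list)
for phrase in phrases:
    key = " ".join(sorted(phrase.split()))  # order-insensitive key
    groups[key].append(phrase)              # keep the original phrase as-is

print(dict(groups))
# {'api youtube': ['youtube api', 'api youtube'],
#  'facts fast': ['fast facts', 'facts fast'],
#  'fact fast': ['fast fact']}
```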

@jrjames83 (Author)

@jsma ^^
