@jrjames83
Created March 20, 2018 12:43
@jsma commented Mar 21, 2018

Thanks! I had no idea gists would render a notebook like this.

For the unigrams, it looks like you're doing a raw count of how many rows had "youtube" in the search phrase. In my case I took the pageviews associated with the originating search phrase and summed them, which gives me a better sense of how often a word was actually used in search. Using your data set, I'd sum the pageviews for every row that contains 'youtube' to come up with the number.
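
Roughly what I mean (the DataFrame and column names here are just illustrative):

```python
import pandas as pd

# Illustrative frame -- the real data would come from an analytics export
df = pd.DataFrame({
    "search_phrase": ["youtube tutorial", "youtube api", "fast facts", "fast fact sheet"],
    "pageviews": [120, 45, 300, 25],
})

# Raw count: how many rows contain the unigram
raw_count = df["search_phrase"].str.contains("youtube").sum()

# Pageview-weighted count: sum pageviews for every row containing the unigram
weighted = df.loc[df["search_phrase"].str.contains("youtube"), "pageviews"].sum()

print(raw_count, weighted)  # 2 165
```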

I'll give your startswith stuff a try up through "Out[24]:". I'll have to dig into the clustering stuff some other day.

I'm using a custom stop words list and just filtering in Python (if word in stop_list: continue, etc.), but any thoughts on merging singular vs. plural forms without bringing in nltk? In my data set "fast facts" is the top search phrase (well, of the phrases that match exactly; it only represents a tiny fraction of actual search volume, the data is all long tail) and "fast fact" is #20 in the list. I may just do if word.endswith('s'): strip the trailing 's' and see if the result is a word that appears elsewhere, or some other hacky approximation.
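
Something like this is what I have in mind (the stop list and phrases are made up):

```python
from collections import Counter

stop_list = {"a", "the", "and", "of", "for"}  # custom stop words, made up here

phrases = ["fast facts", "fast fact", "the fast facts page"]

counts = Counter()
for phrase in phrases:
    for word in phrase.lower().split():
        if word in stop_list:
            continue
        counts[word] += 1

# Hacky singular/plural merge: if stripping a trailing 's' yields a word we
# already counted, fold the plural's count into the singular form.
for word in list(counts):
    if word.endswith("s") and word[:-1] in counts:
        counts[word[:-1]] += counts.pop(word)

print(counts)  # Counter({'fast': 3, 'fact': 3, 'page': 1})
```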

Thanks again!

@jrjames83 (Author)

My bad, just saw your comment. Regarding reducing terms to their base form (de-pluralization, etc.), there are two approaches: stemming, which applies a crude heuristic to chop terms down, and lemmatization, which is more complex and may change the word entirely depending on the word.

To use a plain old stemmer without NLTK, you can just google "porter stemmer python" and find a standalone implementation. NLTK is easy enough, though, unless you're trying to keep an environment clean of heavier packages or something.
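
With NLTK it's only a couple of lines (the lemmatizer needs the wordnet corpus downloaded once):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

# nltk.download('wordnet') is required once before using the lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("facts"))          # fact
print(stemmer.stem("running"))        # run
print(lemmatizer.lemmatize("facts"))  # fact
```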

Filtering stop words that way is definitely normal: [x for x in my_list if x not in ['a', 'the', 'and', 'in', 'at', ...]]

Yeah, the startswith or endswith approximations are usually good enough for grouping. And to your point about grouping first and then summing the metric: yes, agreed.

@jrjames83 (Author)

You could also group by a lexical sort: if you have a bunch of phrases, then for each phrase, split it into a list of words, sort them alphabetically, use that sorted result as the key into a defaultdict(list), and append the original, unmodified phrase as the value. That gives you a lexically ordered grouping.
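
A quick sketch of that idea (the phrases are made up):

```python
from collections import defaultdict

phrases = ["youtube api", "api youtube", "fast facts", "facts fast", "fast fact"]

groups = defaultdict(list)
for phrase in phrases:
    key = " ".join(sorted(phrase.split()))  # order-insensitive key
    groups[key].append(phrase)              # keep the original phrase as-is

print(dict(groups))
# {'api youtube': ['youtube api', 'api youtube'],
#  'facts fast': ['fast facts', 'facts fast'],
#  'fact fast': ['fast fact']}
```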

@jrjames83 (Author)

@jsma ^^
