You could also group by a lexical sort: if you have a bunch of phrases, then for each phrase, split it into a list of terms, sort them alphabetically, and use that sorted result as the key in a defaultdict(list). Then append the original, unmodified phrase under that key, and you get a lexically ordered grouping. A sketch of this is below.
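A minimal sketch of that idea, assuming plain whitespace-separated phrases (the phrase list here is made up for illustration):

```python
from collections import defaultdict

phrases = ["red shoes cheap", "cheap red shoes", "blue shoes", "shoes blue"]

groups = defaultdict(list)
for phrase in phrases:
    # Split into terms and sort alphabetically; the sorted tuple is the key.
    key = tuple(sorted(phrase.split()))
    # Append the original, unmodified phrase under that key.
    groups[key].append(phrase)

for key, members in groups.items():
    print(key, members)
# ('cheap', 'red', 'shoes') ['red shoes cheap', 'cheap red shoes']
# ('blue', 'shoes') ['blue shoes', 'shoes blue']
```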
@jsma ^^
My bad, just saw your comment. Regarding reducing terms to their basic form (de-pluralization, etc.), there are two approaches: stemming, which applies a crude heuristic to chop a term down, and lemmatization, which is more complex and may change the word entirely depending on the word.
To use a plain old stemmer without NLTK, you can just google "porter stemmer python" and find a standalone implementation. NLTK is easy enough though, unless you're trying to keep an env clean of bloated packages or something.
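For reference, the NLTK route looks roughly like this (requires `pip install nltk`; the example words are arbitrary):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["shoes", "running", "cheaply"]:
    # Porter stemming is a crude heuristic: e.g. shoes -> shoe, running -> run
    print(word, "->", stemmer.stem(word))
```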
Filtering stop words is definitely normal: [x for x in my_list if x not in ['a', 'the', 'and', 'in', 'at', ...]]. A runnable version is below.
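Same idea made runnable (the stop-word list here is deliberately tiny; a set is also a bit faster than a list for membership tests):

```python
STOP_WORDS = {'a', 'the', 'and', 'in', 'at'}

terms = "the cheap red shoes in stock".split()
filtered = [t for t in terms if t not in STOP_WORDS]
print(filtered)  # ['cheap', 'red', 'shoes', 'stock']
```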
Yeah, the startswith or endswith approximations are usually good enough for grouping. To your point about grouping and then counting the metric: yes, agreed. A rough sketch of that follows.
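A hedged sketch of grouping by a crude prefix and then summing a metric; the (phrase, clicks) pairs are invented for illustration, and using the first term as the bucket is just one possible "starts with" approximation:

```python
from collections import defaultdict

rows = [("red shoes", 10), ("red shoes cheap", 4), ("blue shoes", 7)]

totals = defaultdict(int)
for phrase, clicks in rows:
    # Bucket by the first term as a crude "starts with" grouping.
    bucket = phrase.split()[0]
    totals[bucket] += clicks

print(dict(totals))  # {'red': 14, 'blue': 7}
```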