You could also group by a lexical sort: if you have a bunch of phrases, then for each phrase, split it into a list of terms, sort them alphabetically, and use that sorted result as the key in a defaultdict(list). Then append the original, unmodified phrase under that key, and you get a lexically ordered grouping. A sketch of this is below.
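A minimal sketch of that idea, assuming plain whitespace-separated phrases (the phrase list here is made up for illustration):

```python
from collections import defaultdict

phrases = ["red shoes cheap", "cheap red shoes", "blue shoes", "shoes blue"]

groups = defaultdict(list)
for phrase in phrases:
    # Split into terms and sort alphabetically; the sorted tuple is the key.
    key = tuple(sorted(phrase.split()))
    # Append the original, unmodified phrase under that key.
    groups[key].append(phrase)

for key, members in groups.items():
    print(key, members)
# ('cheap', 'red', 'shoes') ['red shoes cheap', 'cheap red shoes']
# ('blue', 'shoes') ['blue shoes', 'shoes blue']
```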
@jsma ^^
My bad, just saw your comment. Regarding reducing terms to their basic form (de-pluralization, etc.), there are two approaches: stemming, which applies a crude heuristic to chop a term down, and lemmatization, which is more complex and may change the word entirely depending on the word.
To use a plain old stemmer without NLTK, you can just google "porter stemmer python" and find a standalone implementation. NLTK is easy enough though, unless you're trying to keep an env clean of bloated packages or something.
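For reference, the NLTK route looks roughly like this (requires `pip install nltk`; the example words are arbitrary):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["shoes", "running", "cheaply"]:
    # Porter stemming is a crude heuristic: e.g. shoes -> shoe, running -> run
    print(word, "->", stemmer.stem(word))
```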
Filtering stop words is definitely normal: [x for x in my_list if x not in ['a', 'the', 'and', 'in', 'at', ...]]. A runnable version is below.
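Same idea made runnable (the stop-word list here is deliberately tiny; a set is also a bit faster than a list for membership tests):

```python
STOP_WORDS = {'a', 'the', 'and', 'in', 'at'}

terms = "the cheap red shoes in stock".split()
filtered = [t for t in terms if t not in STOP_WORDS]
print(filtered)  # ['cheap', 'red', 'shoes', 'stock']
```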
Yeah, the startswith or endswith approximations are usually good enough for grouping. To your point about grouping and then counting the metric: yes, agreed. A rough sketch of that follows.
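A hedged sketch of grouping by a crude prefix and then summing a metric; the (phrase, clicks) pairs are invented for illustration, and using the first term as the bucket is just one possible "starts with" approximation:

```python
from collections import defaultdict

rows = [("red shoes", 10), ("red shoes cheap", 4), ("blue shoes", 7)]

totals = defaultdict(int)
for phrase, clicks in rows:
    # Bucket by the first term as a crude "starts with" grouping.
    bucket = phrase.split()[0]
    totals[bucket] += clicks

print(dict(totals))  # {'red': 14, 'blue': 7}
```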