Some pointers for Natural Language Processing / Machine Learning
Here are the areas I've been researching, some things I've read and some open source packages...
Nearly all text processing starts by transforming text into vectors:
http://en.wikipedia.org/wiki/Vector_space_model
This is often followed by a transform such as TF-IDF to normalise the data and control for outliers (words that are too frequent or too rare confuse the algorithms):
http://en.wikipedia.org/wiki/Tf%E2%80%93idf
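
To make the vector and TF-IDF steps concrete, here is a minimal sketch in plain Python. The function name `tfidf_vectors`, the whitespace tokenisation and the unsmoothed `log(N / df)` weighting are my own assumptions for illustration; real libraries use smoothed or otherwise adjusted variants.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn a list of strings into TF-IDF vectors over a shared vocabulary."""
    # Naive tokenisation: lowercase and split on whitespace (an assumption).
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({w for toks in tokenised for w in toks})
    n_docs = len(tokenised)

    # Document frequency: in how many documents does each word appear?
    df = Counter(w for toks in tokenised for w in set(toks))
    # Words that appear in every document get idf = log(1) = 0.
    idf = {w: math.log(n_docs / df[w]) for w in vocab}

    vectors = []
    for toks in tokenised:
        counts = Counter(toks)
        total = len(toks) or 1  # avoid dividing by zero on empty documents
        vectors.append([counts[w] / total * idf[w] for w in vocab])
    return vocab, vectors


docs = ["the cat sat", "the dog sat", "the cat ran"]
vocab, vectors = tfidf_vectors(docs)
```

In this toy corpus "the" appears in every document, so its weight drops to zero, while "dog" (which appears in only one document) gets the highest weight in the second vector.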
Collocation detection finds cases where two or more words occur together more often than you'd expect from how often they occur separately (e.g. "wishy-washy" in English). I use this to group words into n-gram tokens, because many NLP techniques treat each word as if it's independent of all the others in a document, ignoring order:
http://matpalm.com/blog/2011/10/22/collocations_1/
http://matpalm.com/blog/2011/11/05/collocations_2/
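
As a rough illustration of the idea, the sketch below scores adjacent word pairs by pointwise mutual information (PMI) and then merges the chosen pairs into single tokens. PMI is just one common scoring choice and not necessarily what the posts above use; the names `bigram_pmi` and `merge_bigrams` and the `min_count` threshold are hypothetical.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=2):
    """Score adjacent word pairs by PMI, highest first."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    n_bigrams = max(n - 1, 1)

    scores = {}
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_ab = count / n_bigrams
        p_a = unigrams[a] / n
        p_b = unigrams[b] / n
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def merge_bigrams(tokens, pairs):
    """Replace each occurrence of a chosen pair with one joined token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in pairs:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


tokens = "new york is big and new york is old and the city is big".split()
ranked = bigram_pmi(tokens)
top_pair = ranked[0][0]
merged = merge_bigrams(tokens, {top_pair})
```

The merged token list can then go through the same vectorising step as before, so that a pair like "new york" is counted as one feature instead of two unrelated words.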
When you've got a lot of text and you don't know what the patterns in it are, you can run an "unsupervised" clustering using Latent Dirichlet allocation (LDA):
http://www.cs.princeton.edu/~blei/papers/Blei2012.pdf
http://www.youtube.com/watch?v=5mkJcxTK1sQ
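
To show the mechanics, here is a toy LDA fitted by collapsed Gibbs sampling in plain Python. This is a sketch under assumptions, not how any particular package does it: production tools may use other inference methods, and the function name `lda_gibbs` and the values for `alpha`, `beta` and `n_iter` are illustrative only.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Toy LDA. docs is a list of token lists; returns per-doc topic mixes."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]             # counts per doc
    topic_word = [[0] * n_words for _ in range(n_topics)]  # counts per topic
    topic_total = [0] * n_topics
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                # Remove this token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][wid] -= 1
                topic_total[z] -= 1
                # Resample its topic given all the other assignments.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wid] + beta)
                    / (topic_total[k] + n_words * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][wid] += 1
                topic_total[z] += 1

    # Smoothed topic proportions for each document.
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    return vocab, theta, topic_word


docs = [
    "cat dog pet cat".split(),
    "dog pet vet cat".split(),
    "stock market trade stock".split(),
    "market trade bank stock".split(),
]
vocab, theta, topic_word = lda_gibbs(docs, n_topics=2)
```

Each row of `theta` is one document's mix of topics, and each row of `topic_word` shows which words a topic has absorbed. On real data you would use one of the packages below instead of this loop, which is far too slow for anything large.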
Or, if you already know how your data is divided into topics (known as "labeled data"), you can run "supervised" techniques such as training a classifier to predict the labels of new, similar data. I can't find a really good page on this. I picked up a lot over IM with my friend Ben, who is writing a book coming out next year: http://blog.someben.com/2012/07/sequential-learning-book/
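
For a feel of what "training a classifier" means, here is a minimal multinomial Naive Bayes text classifier in plain Python. Naive Bayes is my choice of example, since the notes above don't name a specific classifier; the names `train_nb` and `predict_nb`, the add-one smoothing and the toy data are all assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


def train_nb(labelled_docs):
    """labelled_docs is a list of (text, label) pairs."""
    class_docs = Counter()             # number of documents per label
    word_counts = defaultdict(Counter)  # word counts per label
    vocab = set()
    for text, label in labelled_docs:
        tokens = text.lower().split()
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab


def predict_nb(model, text):
    """Return the label with the highest log-probability for text."""
    class_docs, word_counts, vocab = model
    n_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, count in class_docs.items():
        total = sum(word_counts[label].values())
        score = math.log(count / n_docs)  # class prior
        for w in text.lower().split():
            if w in vocab:  # ignore words never seen in training
                # Add-one smoothing so unseen word/label pairs aren't zero.
                score += math.log(
                    (word_counts[label][w] + 1) / (total + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


training = [
    ("cheap pills buy now", "spam"),
    ("buy cheap watches", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]
model = train_nb(training)
label_a = predict_nb(model, "cheap pills")
label_b = predict_nb(model, "monday meeting")
```

Here `label_a` comes out as "spam" and `label_b` as "ham", because the words in each test string were only seen under that label during training.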
Here are the tools I've mostly been using:
Vowpal Wabbit (classification and LDA, poor documentation, C++ high performance): https://github.com/JohnLangford/vowpal_wabbit/wiki
Gensim (LDA, vector similarity, text processing, Python): http://radimrehurek.com/gensim/index.html
Mallet (classification and LDA, Java): http://mallet.cs.umass.edu/
Lingpipe (text analysis, clustering, classification, linguistics, Java, commercial open-source): http://alias-i.com/lingpipe/demos/tutorial/read-me.html
Mahout (Hadoop, classification, clustering, LDA, collaborative filtering, Java): http://mahout.apache.org/
Langdetect (language detection, Java): http://code.google.com/p/language-detection/
Some blogs I like:
http://matpalm.com/blog/
http://blog.echen.me/
http://thedatachef.blogspot.co.uk/
http://www.machinedlearnings.com
MetaOptimize Q+A is the Stack Overflow of ML: http://metaoptimize.com/qa
The Mahout In Action book is quite good and practical: http://manning.com/owen/