fatum · October 18, 2018 09:43
diff --git a/nlp-help b/nlp-help
 Here are the areas I've been researching, some things I've read and some open source packages...

 Nearly all text processing starts by transforming text into vectors:
 http://en.wikipedia.org/wiki/Vector_space_model
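
 As a rough illustration of the vector space idea, here is a minimal sketch in plain Python. The function names (tokenize, to_vector, cosine) and the three toy documents are my own assumptions for the example, not taken from any of the packages listed here. Each document becomes a sparse word-count vector, and cosine similarity compares two of them:

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def to_vector(text):
    """Map a document to a sparse term-count vector (word -> count)."""
    return Counter(tokenize(text))


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(count * v.get(term, 0) for term, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy documents, made up for illustration.
docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock prices fell sharply",
]
vectors = [to_vector(d) for d in docs]
```

 With these toy documents, cosine(vectors[0], vectors[1]) is higher than cosine(vectors[0], vectors[2]), because the first two share words and the third shares none.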

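 The raw vectors are often reweighted with a transform such as TF-IDF, which normalises the data and damps the influence of very frequent words (words that are too frequent or too rare confuse the algorithms, and the rarest ones are typically pruned as well):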
 http://en.wikipedia.org/wiki/Tf%E2%80%93idf
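
 A minimal sketch of one common TF-IDF variant (relative term frequency times log(N / document frequency)), in plain Python. The tfidf function and the toy corpus are assumptions made for this example; real libraries differ in smoothing and normalisation details:

```python
import math
from collections import Counter


def tfidf(tokenized_docs):
    """Return one {term: weight} dict per document.

    Weight = (count / document length) * log(N / document frequency).
    This is one common variant among several.
    """
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    weighted = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        weighted.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weighted


# Toy corpus, made up for illustration.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the bird flew over the mat".split(),
]
weights = tfidf(corpus)
```

 In this corpus "the" appears in every document, so its weight is zero, while "cat" (one document) outweighs "sat" (two documents).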

 Collocation detection finds pairs or longer runs of words that occur together more often than you would expect if they were independent (e.g. "wishy-washy" in English). I use this to group words into n-gram tokens, because many NLP techniques treat each word as independent of all the others in a document and ignore word order:
 http://matpalm.com/blog/2011/10/22/collocations_1/
 http://matpalm.com/blog/2011/11/05/collocations_2/
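
 To make the idea concrete, here is a small sketch that scores adjacent word pairs by pointwise mutual information (PMI) and then merges the best pair into a single token. PMI is just one possible scoring statistic and the linked posts may use a different one. The function names, the min_count cutoff and the toy text are assumptions for the example:

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue  # PMI is unreliable for very rare pairs
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


def merge_collocations(tokens, pairs):
    """Rewrite the token stream so each chosen pair becomes one token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in pairs:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Toy text, made up for illustration.
text = ("his answer was wishy washy . her answer was clear . "
        "the plan was wishy washy too . a clear plan was needed")
tokens = text.split()
scores = bigram_pmi(tokens)
best = max(scores, key=scores.get)
merged = merge_collocations(tokens, {best})
```

 Here the highest-scoring pair is ("wishy", "washy"), so merged contains the single token "wishy_washy" in place of the two separate words.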

 When you've got a lot of text and you don't know what the patterns in it are, you can run an "unsupervised" clustering using Latent Dirichlet Allocation (LDA):
 http://www.cs.princeton.edu/~blei/papers/Blei2012.pdf
 http://www.youtube.com/watch?v=5mkJcxTK1sQ
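
 For a feel of what LDA does internally, here is a toy collapsed Gibbs sampler in plain Python. This is a sketch under assumptions, not how any of the packages listed in these notes implement it: the function name lda_gibbs, the hyperparameter defaults (alpha, beta, n_iters) and the toy documents are all made up for the example, and real implementations are far faster:

```python
import random


def lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA with collapsed Gibbs sampling on a list of token lists.

    Returns (doc_topic, topic_word, vocab): the count matrices after the
    final sweep, plus the sorted vocabulary that indexes topic_word columns.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]             # tokens per (doc, topic)
    topic_word = [[0] * n_words for _ in range(n_topics)]  # tokens per (topic, word)
    topic_total = [0] * n_topics                           # tokens per topic
    assignments = []

    # Give every token a random initial topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove this token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Unnormalised conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1
    return doc_topic, topic_word, vocab


# Toy documents, made up for illustration.
docs = [
    "cat dog pet vet cat dog".split(),
    "dog pet cat vet pet".split(),
    "stock bond market trade stock".split(),
    "market trade bond stock bond".split(),
]
doc_topic, topic_word, vocab = lda_gibbs(docs, n_topics=2)
```

 Each row of doc_topic says how many tokens of that document were assigned to each topic, and each row of topic_word says how often each vocabulary word was assigned to that topic. On a corpus this small the result depends on the random seed.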

 Or, if you already know how your data is divided into topics (this is known as "labeled data"), you can run "supervised" techniques, such as training a classifier to predict the labels of new, similar data. I can't find a really good page on this. I picked up a lot over IM from my friend Ben, who is writing a book that comes out next year: http://blog.someben.com/2012/07/sequential-learning-book/
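
 As one example of a supervised technique, here is a minimal multinomial naive Bayes text classifier in plain Python. Naive Bayes is my choice for the illustration, not something these notes prescribe, and the NaiveBayes class, the "spam"/"ham" labels and the training sentences are all hypothetical:

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes text classifier with add-one smoothing."""

    def fit(self, docs, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, tokens):
        n_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_label in self.label_counts.items():
            counts = self.word_counts[label]
            total = sum(counts.values()) + len(self.vocab)
            score = math.log(n_label / n_docs)  # log prior
            for w in tokens:
                if w in self.vocab:  # skip words never seen in training
                    score += math.log((counts[w] + 1) / total)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy labeled data, made up for illustration.
train_docs = [
    "cheap pills buy now".split(),
    "buy cheap watches now".split(),
    "meeting agenda for monday".split(),
    "notes from the monday meeting".split(),
]
train_labels = ["spam", "spam", "ham", "ham"]
model = NaiveBayes().fit(train_docs, train_labels)
```

 After training, model.predict("buy cheap pills".split()) picks the label whose training documents make those words most likely.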

 Here are the tools I've mostly been using:

 Vowpal Wabbit (classification and LDA, high-performance C++, poor documentation): https://github.com/JohnLangford/vowpal_wabbit/wiki

 Gensim (LDA, vector similarity, text processing, python): http://radimrehurek.com/gensim/index.html

 Mallet (classification and LDA, java): http://mallet.cs.umass.edu/

 Lingpipe (text analysis, clustering, classification, linguistics, java, commercial open-source): http://alias-i.com/lingpipe/demos/tutorial/read-me.html

 Mahout (Hadoop, classification, clustering, LDA, collaborative filtering, java): http://mahout.apache.org/

 Langdetect (language detection, java): http://code.google.com/p/language-detection/

 Some blogs I like:

 http://matpalm.com/blog/

 http://blog.echen.me/

 http://thedatachef.blogspot.co.uk/

 http://www.machinedlearnings.com

 MetaOptimize Q+A is the Stack Overflow of ML: http://metaoptimize.com/qa

 The Mahout In Action book is quite good and practical: http://manning.com/owen/