Elias Ponvert eponvert

Sublime Text 2 – Useful Shortcuts (Mac OS X)

General

⌘T	go to file
⌘⌃P	go to project
⌘R	go to methods
⌃G	go to line
⌘KB	toggle side bar
⌘⇧P	command palette

The INSTALL instructions that come with Vowpal Wabbit appear not to work on Mac OS X Lion. Here's what I did to get it to compile. You will need the developer tools that come with the Xcode installation.

The only dependency VW has is the Boost C++ library, so first download and install Boost.

To install Boost, do the following:

$ cp ~/Downloads/boost_1_48_0.tar.bz2 ./

This guide will get you started using Spark on Heroku/Cedar. Spark is basically a clone of Sinatra for Java. 'Nuff said.

Create your app

Create a single Java main class in src/main/java/HelloWorld.java:

import static spark.Spark.*;
import spark.*;

Here are the areas I've been researching, some things I've read and some open source packages...

Nearly all text processing starts by transforming text into vectors: http://en.wikipedia.org/wiki/Vector_space_model
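
As a rough illustration of the vector space idea, here is a minimal Scala sketch that turns texts into sparse term-count vectors and compares them with cosine similarity. The object name, the crude regex tokenizer and the choice of cosine similarity are assumptions made for this example, not something taken from the linked article.

```scala
object VectorSpaceSketch {
  // Assumed tokenizer: lowercase, split on anything that is not a letter or digit.
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("[^a-z0-9]+").filter(_.nonEmpty).toSeq

  // Sparse term-count vector: word -> number of occurrences in the text.
  def termCounts(text: String): Map[String, Double] =
    tokenize(text).groupBy(identity).map { case (w, ws) => w -> ws.size.toDouble }

  // Dot product of two sparse vectors; words missing from b count as 0.
  def dot(a: Map[String, Double], b: Map[String, Double]): Double =
    a.iterator.map { case (w, x) => x * b.getOrElse(w, 0.0) }.sum

  // Cosine similarity; defined as 0 when either vector is empty.
  def cosine(a: Map[String, Double], b: Map[String, Double]): Double = {
    val denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    if (denom == 0.0) 0.0 else dot(a, b) / denom
  }
}
```

Two texts with no words in common score 0, and a text compared with itself scores 1 (up to floating-point rounding).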

These vectors are often reweighted with a transform such as TF-IDF to normalise the data and control for outliers (words that are too frequent or too rare confuse the algorithms): http://en.wikipedia.org/wiki/Tf%E2%80%93idf
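
A minimal sketch of one common TF-IDF variant, assuming raw term counts for tf and ln(N / df) for idf. Many other weightings exist, and the names below are made up for the example.

```scala
object TfIdfSketch {
  type Doc = Seq[String]

  // Raw term frequency within one document.
  def tf(doc: Doc): Map[String, Double] =
    doc.groupBy(identity).map { case (w, ws) => w -> ws.size.toDouble }

  // idf(w) = ln(N / df(w)), where df(w) is the number of documents containing w.
  def idf(docs: Seq[Doc]): Map[String, Double] = {
    val n = docs.size.toDouble
    docs.flatMap(_.distinct).groupBy(identity).map { case (w, ws) => w -> math.log(n / ws.size) }
  }

  // One weighted sparse vector per document.
  def tfIdf(docs: Seq[Doc]): Seq[Map[String, Double]] = {
    val weights = idf(docs)
    docs.map(d => tf(d).map { case (w, f) => w -> f * weights(w) })
  }
}
```

With this weighting a word that appears in every document gets weight 0, which handles the "too frequent" case. Dropping words that are too rare would be a separate filtering step (for example a minimum document frequency) that is not shown here.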

Collocation detection finds two or more words that occur together more often than their individual frequencies would predict (e.g. "wishy-washy" in English). I use this to group words into n-gram tokens, because many NLP techniques treat each word as independent of all the others in a document and ignore order: http://matpalm.com/blog/2011/10/22/collocations_1/
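
One simple way to score candidate collocations is pointwise mutual information (PMI) over adjacent word pairs. The sketch below is an illustration under that assumption and is not necessarily the method used in the linked post. The object name and the minCount parameter are hypothetical.

```scala
object CollocationSketch {
  // PMI(x, y) = ln( p(x, y) / (p(x) * p(y)) ), estimated from adjacent-word counts.
  // Pairs seen fewer than minCount times are dropped.
  def pmi(tokens: Seq[String], minCount: Int = 2): Map[(String, String), Double] = {
    val bigrams = tokens.zip(tokens.drop(1))
    val nUni = tokens.size.toDouble
    val nBi = bigrams.size.toDouble
    // Relative frequency of each single word.
    val uni = tokens.groupBy(identity).map { case (w, ws) => w -> ws.size / nUni }
    bigrams.groupBy(identity).collect {
      case (bg @ (x, y), occ) if occ.size >= minCount =>
        bg -> math.log((occ.size / nBi) / (uni(x) * uni(y)))
    }
  }
}
```

The minimum count matters because PMI tends to give very high scores to pairs of rare words that happen to co-occur once. High-scoring pairs could then be merged into single tokens (e.g. "new_york") before building the vectors.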

	class A
	class A2 extends A
	class B

	trait M[X]

	//
	// Upper Type Bound
	//
	def upperTypeBound[AA <: A](x: AA): A = x
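
A short usage sketch for the upper type bound above. It repeats the class definitions so that it compiles on its own, and the wrapper object name is made up for the example.

```scala
object UpperBoundSketch {
  class A
  class A2 extends A
  class B

  // AA may be A or any subtype of A.
  def upperTypeBound[AA <: A](x: AA): A = x

  val fromA: A = upperTypeBound(new A)    // AA is inferred as A
  val fromA2: A = upperTypeBound(new A2)  // AA is inferred as A2, a subtype of A
  // upperTypeBound(new B)                // does not compile: B is not a subtype of A
}
```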

	/*
	* Copyright (c) 2012, Lawrence Livermore National Security, LLC. Produced at
	* the Lawrence Livermore National Laboratory. Written by Keith Stevens,
	* [email protected] OCEC-10-073 All rights reserved.
	*
	* This file is part of the S-Space package and is covered under the terms and
	* conditions therein.
	*
	* The S-Space package is free software: you can redistribute it and/or modify
	* it under the terms of the GNU General Public License version 2 as published

	// Just an ordinary function
	def sum(x: Int, y: Int, z: Int) = x + y + z

	// A tuple of arguments
	val args = (1, 2, 3)

	// Eta-expand the method into a Function3 value; FunctionN has a tupled
	// method (the older Function.tupled helper only handles arities 2 to 5)
	(sum _).tupled(args)
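
A self-contained sketch of the same idea, assuming Scala 2 syntax for eta-expansion (`sum _`). The object and value names are made up for the example.

```scala
object TupledSketch {
  def sum(x: Int, y: Int, z: Int): Int = x + y + z
  val args = (1, 2, 3)

  // Eta-expansion gives a Function3; tupled turns it into a function of one Tuple3.
  val viaMethod: Int = (sum _).tupled(args)

  // The Function.tupled helper does the same for small arities.
  val viaObject: Int = Function.tupled(sum _)(args)

  // Handy when mapping over a collection of tuples.
  val totals: List[Int] = List((1, 2, 3), (4, 5, 6)).map((sum _).tupled)
}
```

Here viaMethod and viaObject are both 6, and totals is List(6, 15).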

	import org.apache.spark.ml.feature.{CountVectorizer, RegexTokenizer, StopWordsRemover}
	import org.apache.spark.mllib.clustering.{LDA, OnlineLDAOptimizer}
	import org.apache.spark.mllib.linalg.Vector

	import sqlContext.implicits._

	val numTopics: Int = 100
	val maxIterations: Int = 100
	val vocabSize: Int = 10000