ceteri · July 25, 2012 21:28
diff --git a/HTML b/HTML
 <h1>Cascading Sample Recommender</h1>

 <p>The goal for this project is to create a sample application in <a href="http://www.cascading.org/">Cascading 2.0</a> which shows how to build a simple kind of <a href="http://en.wikipedia.org/wiki/Recommender_system">social recommender</a>.</p>

 <h2>Build</h2>

 <p>First, clone a copy of the source code from our GitHub repo at <a href="https://github.com/Cascading/SampleRecommender">https://github.com/Cascading/SampleRecommender</a></p>

 <pre><code>git clone https://github.com/Cascading/SampleRecommender.git
 </code></pre>

 <p>The code repo includes a <a href="http://gradle.org/">Gradle script</a> for command line builds. To generate an <a href="https://www.jetbrains.com/idea/">IntelliJ project</a> use:</p>

 <pre><code>gradle ideaModule
 </code></pre>

 <p>To build the sample app from the command line use:</p>

 <pre><code>gradle clean jar
 </code></pre>

 <h2>Run</h2>

 <p>To run on the <a href="http://aws.amazon.com/elasticmapreduce/">Elastic MapReduce</a> service, based on the <a href="http://aws.amazon.com/developertools/2264">command line interface</a>:</p>

 <pre><code>$ elastic-mapreduce --create --name &quot;Sample Recommender&quot; \
  --jar s3n://temp.cascading.org/sample/recommender.jar \
  --arg s3n://temp.cascading.org/sample/en.stop \
  --arg s3n://temp.cascading.org/sample/tweets/ \
  --arg s3n://temp.cascading.org/sample/out/token \
  --arg s3n://temp.cascading.org/sample/out/similarity
 </code></pre>

 <p>Then check the <code>token</code> and <code>similarity</code> output data.</p>

 <p>An example of log captured from a successful build+run is at <a href="https://gist.github.com/2949834">https://gist.github.com/2949834</a></p>

 <h2>Overview</h2>

 <p>The social recommender is based on using data from the <a href="https://dev.twitter.com/docs/twitter-data-providers">Twitter firehose</a>. The app recommends other Twitter users to follow who have similar interests in stocks/investing.</p>

 <p>Sample data includes <a href="http://archivist.visitmix.com/pacoid/12">500 tweets</a>, which is enough to demonstrate how this sample app works. A compiled JAR may run on a laptop with this sample data, or it can also scale-out horizontally to run on a cluster with thousands of nodes and handle much larger data.</p>

 <p>Execution steps for this implementation are:</p>

 <ol>
 <li>Take the sample Twitter data as the input <strong>source</strong> tap. The endpoint used for that tap (the command line argument <code>data/tweets</code>) could be replaced by a much larger data set.</li>
 <li>Clean-up and tokenize the text of tweets using a <a href="http://en.wikipedia.org/wiki/Regular_expression">regex pattern</a>. NB: if you use a different data source later, that regex pattern may need to be modified.</li>
 <li>Generate <em>(user, token)</em> pairs to construct a <a href="http://en.wikipedia.org/wiki/Bipartite_graph">bipartite graph</a></li>
 <li>Apply a <a href="http://en.wikipedia.org/wiki/Stop_words">stop word list</a> to filter out common words, offensive phrases, etc.</li>
 <li>Also create a <strong>sink</strong> tap to measure token frequency, which may be used for QA: improve the regex patterns, adjust stop words, etc.</li>
 <li>Find the pairs of users who have &#8220;interesting&#8221; tokens in common, and generate an inverted index as <em>((user1, user2), token)</em></li>
 <li>Calculate a similarity metric for each user pair which are known to be <a href="http://en.wikipedia.org/wiki/Nearest_neighbor_search">neighbors</a></li>
 <li>Apply thresholds on the similarity metric to filter out poor recommendations.</li>
 <li>Connect up all the <strong>pipes</strong> and <strong>taps</strong> into a <strong>cascade</strong>, then generate a flow diagram and run the app. Results for recommended users get stored in the <code>similarityPath</code> <strong>sink</strong> tap.</li>
 </ol>

 <p>The first part of the program illustrates use of a &#8220;stream assertion&#8221;, which is much like running a unit test on your data set. Later there&#8217;s use of a <strong>Debug</strong> which traces the tuple stream on stdout. Both of these testing features can be turned off in production. Both are important features of Cascading which are not found in other frameworks based on Apache Hadoop.</p>

 <p>When you run this app, it generates a Cascading flow diagram in the file <code>dot/similarity.dot</code> which can be read by <a href="http://www.omnigroup.com/products/omnigraffle/">OmniGraffle</a> or Visio. The diagram shows how the Cascading workflows will be run as job steps in Apache Hadoop. An annotated version is provided as <code>docs/similarity.graffle</code> which shows how the physical plan of the Cascading <strong>flows</strong> overlays onto &#8220;map&#8221; and &#8220;reduce&#8221; tasks.</p>

 <h2>Integration</h2>

 <p>Our intent is to show how such an app might integrate multiple data sources, and potentially where to integrate other systems outside of <a href="http://hadoop.apache.org/">Apache Hadoop</a>.</p>

 <p>Based on using the command line arguments below, the recommender results get stored in <code>output/similarity/</code> as Apache Hadoop part files in TSV format. In practice, those would most likely get loaded into <a href="http://redis.io/">Redis</a> or some another low latency key/value store for use in a production system.</p>

 <p>Also check the output results in <code>output/token/</code> for QA on how well the text clean-up is working, which additional stop words need to be filtered, etc.</p>

 <img src="https://github.com/Cascading/SampleRecommender/raw/master/docs/sample_app.png" alt="Sample App Diagram" />


 <p>To take this example a few steps further, one could use additional taps to include more use cases and potential integrations:
 * topic trending &#8211; e.g., via <a href="http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation">LDA analysis</a>
 * social graph &#8211; e.g., via <a href="http://neo4j.org/">Neo4J queries</a></p>

 <h2>Algorithms</h2>

 <p>One nuance to note about this code: recommender systems often use algorithms which require a <a href="http://en.wikipedia.org/wiki/Regular_expression">cartesian product</a>. Those don&#8217;t scale well when coded naively as MapReduce jobs. This example shows how to use the pattern of an inverted index + join. There is still a cartesian product calculated, but it is based on using a <strong>CoGroup</strong> so that it scales well for this kind of data. Also, the potential number of pairs gets filtered prior to where the cross-product gets performed, which reduces the required processing.</p>

 <p>For example, other recommenders might use a <a href="http://en.wikipedia.org/wiki/Cosine_similarity">Cosine distance</a> which implies use of a cartesian product. Instead this app calculates an <a href="http://www.mothur.org/wiki/Ochiai">Ochiai similarity metric</a>. This shows one case of how to &#8220;unwind&#8221; a cartesian product into a more parallel, more efficient algorithm for use with MapReduce.</p>

 <h2>Analytics</h2>

 <p>An example R script shows analysis of the similarity metrics. This analysis can be used to decide how to tune the metric thresholds <code>MAX_SIMILARITY</code> and <code>MIN_SIMILARITY</code>, and for QA of the recommender in general.</p>

 <p>The R script requires two libraries to be installed via the usual R mechanisms: <code>plyr</code> and <code>ggplot2</code>.</p>

 <p>After running the sample app in Apache Hadoop, run the R script to analyze the distribution similarity metrics in its results:</p>

 <pre><code>R --vanilla --args output/similarity/part-00000 &lt; src/main/r/metric.R
 </code></pre>

 <p>That will produce two charts in a PDF file <code>Rplots.pdf</code>, which show the distribution of similarity metrics across Twitter user IDs in the input data set.</p>

 <img src="https://github.com/Cascading/SampleRecommender/raw/master/docs/tweets.png" alt="R analysis" />
	<h1>Cascading Sample Recommender</h1>

	<p>The goal for this project is to create a sample application in <a href="http://www.cascading.org/">Cascading 2.0</a> which shows how to build a simple kind of <a href="http://en.wikipedia.org/wiki/Recommender_system">social recommender</a>.</p>

	<h2>Build</h2>

	<p>First, clone a copy of the source code from our GitHub repo at <a href="https://github.com/Cascading/SampleRecommender">https://github.com/Cascading/SampleRecommender</a></p>

	<pre><code>git clone https://github.com/Cascading/SampleRecommender.git
	</code></pre>

	<p>The code repo includes a <a href="http://gradle.org/">Gradle script</a> for command line builds. To generate an <a href="https://www.jetbrains.com/idea/">IntelliJ project</a> use:</p>

	<pre><code>gradle ideaModule
	</code></pre>

	<p>To build the sample app from the command line use:</p>

	<pre><code>gradle clean jar
	</code></pre>

	<h2>Run</h2>

	<p>To run on the <a href="http://aws.amazon.com/elasticmapreduce/">Elastic MapReduce</a> service, based on the <a href="http://aws.amazon.com/developertools/2264">command line interface</a>:</p>

	<pre><code>$ elastic-mapreduce --create --name "Sample Recommender" \
	--jar s3n://temp.cascading.org/sample/recommender.jar \
	--arg s3n://temp.cascading.org/sample/en.stop \
	--arg s3n://temp.cascading.org/sample/tweets/ \
	--arg s3n://temp.cascading.org/sample/out/token \
	--arg s3n://temp.cascading.org/sample/out/similarity
	</code></pre>

	<p>Then check the <code>token</code> and <code>similarity</code> output data.</p>

	<p>An example of log captured from a successful build+run is at <a href="https://gist.github.com/2949834">https://gist.github.com/2949834</a></p>

	<h2>Overview</h2>

	<p>The social recommender is based on using data from the <a href="https://dev.twitter.com/docs/twitter-data-providers">Twitter firehose</a>. The app recommends other Twitter users to follow who have similar interests in stocks/investing.</p>

	<p>Sample data includes <a href="http://archivist.visitmix.com/pacoid/12">500 tweets</a>, which is enough to demonstrate how this sample app works. A compiled JAR may run on a laptop with this sample data, or it can also scale-out horizontally to run on a cluster with thousands of nodes and handle much larger data.</p>

	<p>Execution steps for this implementation are:</p>

	<ol>
	<li>Take the sample Twitter data as the input <strong>source</strong> tap. The endpoint used for that tap (the command line argument <code>data/tweets</code>) could be replaced by a much larger data set.</li>
	<li>Clean-up and tokenize the text of tweets using a <a href="http://en.wikipedia.org/wiki/Regular_expression">regex pattern</a>. NB: if you use a different data source later, that regex pattern may need to be modified.</li>
	<li>Generate <em>(user, token)</em> pairs to construct a <a href="http://en.wikipedia.org/wiki/Bipartite_graph">bipartite graph</a></li>
	<li>Apply a <a href="http://en.wikipedia.org/wiki/Stop_words">stop word list</a> to filter out common words, offensive phrases, etc.</li>
	<li>Also create a <strong>sink</strong> tap to measure token frequency, which may be used for QA: improve the regex patterns, adjust stop words, etc.</li>
	<li>Find the pairs of users who have “interesting” tokens in common, and generate an inverted index as <em>((user1, user2), token)</em></li>
	<li>Calculate a similarity metric for each user pair which are known to be <a href="http://en.wikipedia.org/wiki/Nearest_neighbor_search">neighbors</a></li>
	<li>Apply thresholds on the similarity metric to filter out poor recommendations.</li>
	<li>Connect up all the <strong>pipes</strong> and <strong>taps</strong> into a <strong>cascade</strong>, then generate a flow diagram and run the app. Results for recommended users get stored in the <code>similarityPath</code> <strong>sink</strong> tap.</li>
	</ol>

	<p>The first part of the program illustrates use of a “stream assertion”, which is much like running a unit test on your data set. Later there’s use of a <strong>Debug</strong> which traces the tuple stream on stdout. Both of these testing features can be turned off in production. Both are important features of Cascading which are not found in other frameworks based on Apache Hadoop.</p>

	<p>When you run this app, it generates a Cascading flow diagram in the file <code>dot/similarity.dot</code> which can be read by <a href="http://www.omnigroup.com/products/omnigraffle/">OmniGraffle</a> or Visio. The diagram shows how the Cascading workflows will be run as job steps in Apache Hadoop. An annotated version is provided as <code>docs/similarity.graffle</code> which shows how the physical plan of the Cascading <strong>flows</strong> overlays onto “map” and “reduce” tasks.</p>

	<h2>Integration</h2>

	<p>Our intent is to show how such an app might integrate multiple data sources, and potentially where to integrate other systems outside of <a href="http://hadoop.apache.org/">Apache Hadoop</a>.</p>

	<p>Based on using the command line arguments below, the recommender results get stored in <code>output/similarity/</code> as Apache Hadoop part files in TSV format. In practice, those would most likely get loaded into <a href="http://redis.io/">Redis</a> or some another low latency key/value store for use in a production system.</p>

	<p>Also check the output results in <code>output/token/</code> for QA on how well the text clean-up is working, which additional stop words need to be filtered, etc.</p>

	<img src="https://github.com/Cascading/SampleRecommender/raw/master/docs/sample_app.png" alt="Sample App Diagram" />


	<p>To take this example a few steps further, one could use additional taps to include more use cases and potential integrations:
	* topic trending – e.g., via <a href="http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation">LDA analysis</a>
	* social graph – e.g., via <a href="http://neo4j.org/">Neo4J queries</a></p>

	<h2>Algorithms</h2>

	<p>One nuance to note about this code: recommender systems often use algorithms which require a <a href="http://en.wikipedia.org/wiki/Regular_expression">cartesian product</a>. Those don’t scale well when coded naively as MapReduce jobs. This example shows how to use the pattern of an inverted index + join. There is still a cartesian product calculated, but it is based on using a <strong>CoGroup</strong> so that it scales well for this kind of data. Also, the potential number of pairs gets filtered prior to where the cross-product gets performed, which reduces the required processing.</p>

	<p>For example, other recommenders might use a <a href="http://en.wikipedia.org/wiki/Cosine_similarity">Cosine distance</a> which implies use of a cartesian product. Instead this app calculates an <a href="http://www.mothur.org/wiki/Ochiai">Ochiai similarity metric</a>. This shows one case of how to “unwind” a cartesian product into a more parallel, more efficient algorithm for use with MapReduce.</p>

	<h2>Analytics</h2>

	<p>An example R script shows analysis of the similarity metrics. This analysis can be used to decide how to tune the metric thresholds <code>MAX_SIMILARITY</code> and <code>MIN_SIMILARITY</code>, and for QA of the recommender in general.</p>

	<p>The R script requires two libraries to be installed via the usual R mechanisms: <code>plyr</code> and <code>ggplot2</code>.</p>

	<p>After running the sample app in Apache Hadoop, run the R script to analyze the distribution similarity metrics in its results:</p>

	<pre><code>R --vanilla --args output/similarity/part-00000 < src/main/r/metric.R
	</code></pre>

	<p>That will produce two charts in a PDF file <code>Rplots.pdf</code>, which show the distribution of similarity metrics across Twitter user IDs in the input data set.</p>

	<img src="https://github.com/Cascading/SampleRecommender/raw/master/docs/tweets.png" alt="R analysis" />
No results found