Skip to content

Instantly share code, notes, and snippets.

  1. General Background and Overview
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
ADD JAR /opt/cloudera/parcels/CDH/lib/hive/lib/hive-contrib.jar;
DROP TABLE raw_log;
CREATE EXTERNAL TABLE raw_log(
IP STRING,
timestamp STRING,
URL STRING,
referrer STRING,
user_agent STRING)
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
package ner
import edu.stanford.nlp.ie.crf.CRFClassifier
import scala.collection.JavaConversions._
import scala.collection.JavaConverters._
import edu.stanford.nlp.ling.CoreAnnotations
import java.util.ArrayList
import java.util.HashMap
import java.util.Map
import scala.xml.XML
import com.twitter.algebird.{Aggregator, Semigroup}
import com.twitter.scalding._
import scala.util.Random
/**
* This job is a tutorial of sorts for scalding's Execution[T] abstraction.
* It is a simple implementation of Lloyd's algorithm for k-means on 2D data.
*
* http://en.wikipedia.org/wiki/K-means_clustering
import com.twitter.scalding._
import com.twitter.algebird.{ MinHasher, MinHasher32, MinHashSignature }
/**
* Computes similar items (with a string itemId), based on approximate
* Jaccard similarity, using LSH.
*
* Assumes an input data TSV file of the following format:
*
* itemId userId
import cascading.tuple.{Fields, TupleEntry}
import com.twitter.scalding._
import java.net.URLDecoder
import scala.util.matching.Regex
class BoomerangLogJob(args: Args) extends Job(args) {
val input = TextLine(args("input"))
val output = TextLine(args("output"))
val trap = Tsv(args("trap"))
Goal: Process the 12 million plus records
from: http://seer.cancer.gov/popdata/download.html
using: a Scala API atop Cascading, aka SCALDING ( Inventors: Avi Bryant, Oscar Boykin, Argyris )
to find:
THE FASTEST GROWING COUNTY IN THE UNITED STATES over the 1969-2011 timeframe.
-----------------------------------------------------------------------------
RESULTS: Scroll to the very bottom.
First, the scalding source...
---
import com.twitter.algebird.{Aggregator, Semigroup}
import com.twitter.scalding._
import scala.util.Random
/**
* This job is a tutorial of sorts for scalding's Execution[T] abstraction.
* It is a simple implementation of Lloyd's algorithm for k-means on 2D data.
*
* http://en.wikipedia.org/wiki/K-means_clustering