Install Kafka: https://archive.cloudera.com/kafka/kafka/2/kafka/quickstart.html
Start ZooKeeper if it is not already running (one comes bundled with Kafka):
bin/zookeeper-server-start.sh config/zookeeper.properties
Start the Kafka server:
bin/kafka-server-start.sh config/server.properties
Create a topic, then list topics to verify:
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic ftoKafka_topic
bin/kafka-topics.sh --list --zookeeper localhost:2181
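To smoke-test the topic end to end, the console producer and consumer that ship with Kafka can be used (addresses assume the local defaults above):
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic ftoKafka_topic
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic ftoKafka_topic --from-beginning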
--ftohdfs.conf--
ftohdfs.sources = logsource
ftohdfs.sinks = hdfssink
ftohdfs.channels = mchannel
# Describe/configure the source
ftohdfs.sources.logsource.type = exec
ftohdfs.sources.logsource.command = tail -F /opt/gen_logs/logs/access.log
# Describe the sink
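# (a hedged sketch of the remaining wiring - standard Flume HDFS-sink and
# memory-channel properties; the HDFS path below is a placeholder)
ftohdfs.sinks.hdfssink.type = hdfs
ftohdfs.sinks.hdfssink.hdfs.path = hdfs://localhost:8020/user/flume/logs
ftohdfs.sinks.hdfssink.hdfs.fileType = DataStream
# Use a channel that buffers events in memory
ftohdfs.channels.mchannel.type = memory
ftohdfs.channels.mchannel.capacity = 1000
ftohdfs.channels.mchannel.transactionCapacity = 100
# Bind the source and sink to the channel
ftohdfs.sources.logsource.channels = mchannel
ftohdfs.sinks.hdfssink.channel = mchannel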
Advantages of Dataset/DataFrame over RDDs (from Spark 2.0)
--------------------------------------------------------
1. Dataset's strong typing: type mismatches are caught at compile/analysis time.
The typing is 'virtual' - at the 'table' level the data keeps one underlying datatype,
which makes it portable across languages like SQL/Python/Scala.
e.g.: val ds = sqlContext.range(3)
ds.as[String]   // fails at analysis time: the bigint 'id' column cannot be up-cast to String
2. Existing business logic can be reused, e.g. as methods on a case class:
case class Zip(zip: String, city: String, loc: Array[Double], pop: Long, state: String) {
  val latChicago = 41.87
  // example business rule reusing the field above (assumes loc = (longitude, latitude))
  def isNorthOfChicago: Boolean = loc(1) > latChicago
}
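A minimal usage sketch of the reuse point, assuming spark-shell on Spark 2.x (so spark and its implicits are in scope) and a zips.json file whose fields match the case class - the file name and rule are illustrative:
val zips = spark.read.json("zips.json").as[Zip]   // typed Dataset[Zip]
zips.filter(_.isNorthOfChicago).show()            // the case-class logic is reused as-is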
Kafka vs Flume:
Kafka - general-purpose infrastructure with producer and consumer APIs, so it can be used with any system/sink.
It is an enterprise messaging system for connecting any systems, not just Hadoop - Flume is primarily meant for Hadoop.
Kafka is asynchronous - producers and consumers work at their own pace, so for example there is no data loss if a consumer is down for some time.
A consumer can come back up later and resume where it left off - it can even pull from a particular older offset if needed.
Overflow events sit in Kafka's "persistent buffer" (the partition log on disk),
so event spikes are easily accommodated - Kafka can handle 100K+ events per second.
High durability/fault tolerance: topic partitions are replicated across brokers, so events survive individual broker failures.
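To make the "pull from a particular older offset" point concrete, here is a minimal sketch using the standard kafka-clients consumer API (topic name, partition, and offset are illustrative):
import java.util.Properties
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import scala.collection.JavaConverters._

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("group.id", "replay-demo")
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
val tp = new TopicPartition("ftoKafka_topic", 0)
consumer.assign(java.util.Collections.singletonList(tp))
consumer.seek(tp, 42L)                        // rewind to an arbitrary older offset
val records = consumer.poll(1000L)            // old long-based poll, fine for a sketch
records.asScala.foreach(r => println(s"${r.offset}: ${r.value}"))
consumer.close()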
package org.apache.spark.examples.streaming
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}
/**
 * Use DataFrames and SQL to count words in UTF8 encoded, '\n' delimited text received from the
 * network every second (this is the shell of Spark's SqlNetworkWordCount example).
 */
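For completeness, the core of that pattern (turn each micro-batch RDD into a DataFrame, register a temp view, then query it with SQL) looks roughly like this - host, port, and batch interval are illustrative:
val conf = new SparkConf().setAppName("SqlNetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(2))
val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER)
val words = lines.flatMap(_.split(" "))

words.foreachRDD { (rdd: RDD[String], time: Time) =>
  // Get or create a singleton SparkSession, then turn the RDD into a DataFrame
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
  import spark.implicits._
  val wordsDataFrame = rdd.toDF("word")
  wordsDataFrame.createOrReplaceTempView("words")
  // Count words with plain SQL over the registered view
  val wordCounts = spark.sql("select word, count(*) as total from words group by word")
  println(s"========= $time =========")
  wordCounts.show()
}
ssc.start()
ssc.awaitTermination()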
The difference between map and flatMap is a little confusing for beginners - flatMap applies the function and then flattens the nested results. This example might help:
It can be tested in a Spark shell or the Scala REPL:
scala> val l = List(1,2,3,4,5)
scala> l.map(x => List(x-1, x, x+1))
res1: List[List[Int]] = List(List(0, 1, 2), List(1, 2, 3), List(2, 3, 4), List(3, 4, 5), List(4, 5, 6))
scala> l.flatMap(x => List(x-1, x, x+1))
res2: List[Int] = List(0, 1, 2, 1, 2, 3, 2, 3, 4, 3, 4, 5, 4, 5, 6)
Demystifying reduce and reduceByKey
val lrdd = sc.parallelize(List(1,2,3,4,5,3,5))
lrdd.reduce((a,b) => a+b)
//short version of the syntax below
lrdd.reduce( _+_)
// try changing reduce to reduceByKey, which will fail
// as reduceByKey is applicable to key-value pairs
lrdd.reduceByKey( _+_)
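For contrast, a minimal working reduceByKey: map the values into (key, value) pairs first (here counting occurrences of each number):
val pairRdd = lrdd.map(x => (x, 1))     // now a key-value RDD
pairRdd.reduceByKey(_ + _).collect()    // e.g. Array((4,1), (1,1), (3,2), (5,2), (2,1))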
Rdd --> DF --> Table --> SQL --> DS
val lrdd = sc.parallelize(List(1,2,3,4,5,3,5))
//without case class
val namedDF = sqlContext.createDataFrame(lrdd.map(Tuple1.apply)).toDF("Id")
//with case class
case class Dummy(Id: Int)
val namedDF2 = lrdd.map(x => Dummy(x)).toDF()   // renamed to avoid shadowing namedDF
//one-liner DF (relies on the spark-shell implicits, i.e. import spark.implicits._)
val ldf = List(1,2,3,4,5,3,5).toDS().toDF()
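To complete the chain from the title (DF --> Table --> SQL --> DS), a short sketch assuming Spark 2.x (view name and query are illustrative):
namedDF2.createOrReplaceTempView("dummies")                    // DF --> Table
val sqlDF = spark.sql("select Id from dummies where Id > 2")   // Table --> SQL (yields a DF)
val ds = sqlDF.as[Dummy]                                       // DF --> typed DS
ds.show()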
Spark Streaming - quick tips
//Analysing data from a Kafka topic:
//The script below can be saved as name.scala and run from spark-shell
//It can also be built as a Scala project: remove the comments, use the dependencies below in build.sbt and compile
import kafka.serializer.StringDecoder
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.sql._
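The imports above target the old spark-streaming-kafka (0.8) API; a minimal direct-stream word count using those imports might look like this (broker address, topic, and batch interval are illustrative):
val conf = new SparkConf().setAppName("KafkaStreamingDemo")
val ssc = new StreamingContext(conf, Seconds(5))

val kafkaParams = Map[String, String]("metadata.broker.list" -> "localhost:9092")
val topics = Set("ftoKafka_topic")

// Each record is a (key, message) pair; keep only the message payload
val stream: InputDStream[(String, String)] =
  KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)

stream.map(_._2)
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
  .print()

ssc.start()
ssc.awaitTermination()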
Transformation and Actions of RDD - Visual
Please check the URL:
http://training.databricks.com/visualapi.pdf