Install Kafka: https://archive.cloudera.com/kafka/kafka/2/kafka/quickstart.html
Start ZooKeeper if it is not already running (one comes bundled with Kafka):
bin/zookeeper-server-start.sh config/zookeeper.properties
Start the Kafka server:
bin/kafka-server-start.sh config/server.properties
Create a topic, then list topics to verify:
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic ftoKafka_topic
bin/kafka-topics.sh --list --zookeeper localhost:2181
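To smoke-test the topic end to end, the console producer and consumer that ship with Kafka can be used (addresses assume the local defaults above):
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic ftoKafka_topic
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic ftoKafka_topic --from-beginning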
--ftohdfs.conf--
ftohdfs.sources = logsource
ftohdfs.sinks = hdfssink
ftohdfs.channels = mchannel
# Describe/configure the source
ftohdfs.sources.logsource.type = exec
ftohdfs.sources.logsource.command = tail -F /opt/gen_logs/logs/access.log
# Describe the sink
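# (a hedged sketch of the remaining wiring - standard Flume HDFS-sink and
# memory-channel properties; the HDFS path below is a placeholder)
ftohdfs.sinks.hdfssink.type = hdfs
ftohdfs.sinks.hdfssink.hdfs.path = hdfs://localhost:8020/user/flume/logs
ftohdfs.sinks.hdfssink.hdfs.fileType = DataStream
# Use a channel that buffers events in memory
ftohdfs.channels.mchannel.type = memory
ftohdfs.channels.mchannel.capacity = 1000
ftohdfs.channels.mchannel.transactionCapacity = 100
# Bind the source and sink to the channel
ftohdfs.sources.logsource.channels = mchannel
ftohdfs.sinks.hdfssink.channel = mchannel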
Advantages of Dataset/DataFrame over RDDs (from Spark 2.0)
--------------------------------------------------------
1. Dataset's strong typing: type mismatches are caught at compile/analysis time.
The typing is 'virtual' - at the 'table' level the data keeps one underlying datatype,
which makes it portable across languages like SQL/Python/Scala.
e.g.: val ds = sqlContext.range(3)
ds.as[String]   // fails at analysis time: the bigint 'id' column cannot be up-cast to String
2. Existing business logic can be reused, e.g. as methods on a case class:
case class Zip(zip: String, city: String, loc: Array[Double], pop: Long, state: String) {
  val latChicago = 41.87
  // example business rule reusing the field above (assumes loc = (longitude, latitude))
  def isNorthOfChicago: Boolean = loc(1) > latChicago
}
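A minimal usage sketch of the reuse point, assuming spark-shell on Spark 2.x (so spark and its implicits are in scope) and a zips.json file whose fields match the case class - the file name and rule are illustrative:
val zips = spark.read.json("zips.json").as[Zip]   // typed Dataset[Zip]
zips.filter(_.isNorthOfChicago).show()            // the case-class logic is reused as-is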
Kafka vs Flume:
Kafka - general-purpose infrastructure with producer and consumer APIs, so it can be used with any system/sink.
It is an enterprise messaging system for connecting any systems, not just Hadoop - Flume is primarily meant for Hadoop.
Kafka is asynchronous - producers and consumers work at their own pace, so for example there is no data loss if a consumer is down for some time.
A consumer can come back up later and resume where it left off - it can even pull from a particular older offset if needed.
Overflow events sit in Kafka's "persistent buffer" (the partition log on disk),
so event spikes are easily accommodated - Kafka can handle 100K+ events per second.
High durability/fault tolerance: topic partitions are replicated across brokers, so events survive individual broker failures.
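To make the "pull from a particular older offset" point concrete, here is a minimal sketch using the standard kafka-clients consumer API (topic name, partition, and offset are illustrative):
import java.util.Properties
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import scala.collection.JavaConverters._

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("group.id", "replay-demo")
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
val tp = new TopicPartition("ftoKafka_topic", 0)
consumer.assign(java.util.Collections.singletonList(tp))
consumer.seek(tp, 42L)                        // rewind to an arbitrary older offset
val records = consumer.poll(1000L)            // old long-based poll, fine for a sketch
records.asScala.foreach(r => println(s"${r.offset}: ${r.value}"))
consumer.close()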
package org.apache.spark.examples.streaming
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}
/**
 * Use DataFrames and SQL to count words in UTF8 encoded, '\n' delimited text received from the
 * network every second (this is the shell of Spark's SqlNetworkWordCount example).
 */
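For completeness, the core of that pattern (turn each micro-batch RDD into a DataFrame, register a temp view, then query it with SQL) looks roughly like this - host, port, and batch interval are illustrative:
val conf = new SparkConf().setAppName("SqlNetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(2))
val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER)
val words = lines.flatMap(_.split(" "))

words.foreachRDD { (rdd: RDD[String], time: Time) =>
  // Get or create a singleton SparkSession, then turn the RDD into a DataFrame
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
  import spark.implicits._
  val wordsDataFrame = rdd.toDF("word")
  wordsDataFrame.createOrReplaceTempView("words")
  // Count words with plain SQL over the registered view
  val wordCounts = spark.sql("select word, count(*) as total from words group by word")
  println(s"========= $time =========")
  wordCounts.show()
}
ssc.start()
ssc.awaitTermination()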
The difference between map and flatMap is a little confusing for beginners - flatMap applies the function and then flattens the nested results. This example might help:
It can be tested in a Spark shell or the Scala REPL:
scala> val l = List(1,2,3,4,5)
scala> l.map(x => List(x-1, x, x+1))
res1: List[List[Int]] = List(List(0, 1, 2), List(1, 2, 3), List(2, 3, 4), List(3, 4, 5), List(4, 5, 6))
scala> l.flatMap(x => List(x-1, x, x+1))
res2: List[Int] = List(0, 1, 2, 1, 2, 3, 2, 3, 4, 3, 4, 5, 4, 5, 6)
Demystifying reduce and reduceByKey
val lrdd = sc.parallelize(List(1,2,3,4,5,3,5))
lrdd.reduce((a,b) => a+b)
//short version of the syntax below
lrdd.reduce( _+_)
// try changing reduce to reduceByKey, which will fail
// as reduceByKey is applicable to key-value pairs
lrdd.reduceByKey( _+_)
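For contrast, a minimal working reduceByKey: map the values into (key, value) pairs first (here counting occurrences of each number):
val pairRdd = lrdd.map(x => (x, 1))     // now a key-value RDD
pairRdd.reduceByKey(_ + _).collect()    // e.g. Array((4,1), (1,1), (3,2), (5,2), (2,1))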
Rdd --> DF --> Table --> SQL --> DS
val lrdd = sc.parallelize(List(1,2,3,4,5,3,5))
//without case class
val namedDF = sqlContext.createDataFrame(lrdd.map(Tuple1.apply)).toDF("Id")
//with case class
case class Dummy(Id: Int)
val namedDF2 = lrdd.map(x => Dummy(x)).toDF()   // renamed to avoid shadowing namedDF
//one-liner DF (relies on the spark-shell implicits, i.e. import spark.implicits._)
val ldf = List(1,2,3,4,5,3,5).toDS().toDF()
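To complete the chain from the title (DF --> Table --> SQL --> DS), a short sketch assuming Spark 2.x (view name and query are illustrative):
namedDF2.createOrReplaceTempView("dummies")                    // DF --> Table
val sqlDF = spark.sql("select Id from dummies where Id > 2")   // Table --> SQL (yields a DF)
val ds = sqlDF.as[Dummy]                                       // DF --> typed DS
ds.show()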
Spark Streaming - quick tips
//Analysing data from a Kafka topic:
//The script below can be saved as name.scala and run from spark-shell
//It can also be built as a Scala project: remove the comments, use the dependencies below in build.sbt and compile
import kafka.serializer.StringDecoder
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.sql._
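The imports above target the old spark-streaming-kafka (0.8) API; a minimal direct-stream word count using those imports might look like this (broker address, topic, and batch interval are illustrative):
val conf = new SparkConf().setAppName("KafkaStreamingDemo")
val ssc = new StreamingContext(conf, Seconds(5))

val kafkaParams = Map[String, String]("metadata.broker.list" -> "localhost:9092")
val topics = Set("ftoKafka_topic")

// Each record is a (key, message) pair; keep only the message payload
val stream: InputDStream[(String, String)] =
  KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)

stream.map(_._2)
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
  .print()

ssc.start()
ssc.awaitTermination()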
Transformation and Actions of RDD - Visual
Please check the URL:
http://training.databricks.com/visualapi.pdf