Install Kafka: https://archive.cloudera.com/kafka/kafka/2/kafka/quickstart.html
Start ZooKeeper if it is not already running (one comes bundled with Kafka):
bin/zookeeper-server-start.sh config/zookeeper.properties
Start the Kafka server:
bin/kafka-server-start.sh config/server.properties
Create a topic:
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic ftoKafka_topic
List topics to verify it was created:
bin/kafka-topics.sh --list --zookeeper localhost:2181
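To sanity-check the topic end to end, you can start a console producer and consumer in two terminals (a quick test with Kafka's bundled scripts; broker port 9092 is the default and an assumption here):
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic ftoKafka_topic
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic ftoKafka_topic --from-beginning
Anything typed into the producer terminal should appear in the consumer terminal.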
--ftohdfs.conf--
ftohdfs.sources = logsource
ftohdfs.sinks = hdfssink
ftohdfs.channels = mchannel

# Describe/configure the source
ftohdfs.sources.logsource.type = exec
ftohdfs.sources.logsource.command = tail -F /opt/gen_logs/logs/access.log

# Describe the sink
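The config is cut off after this comment; a minimal sketch of the remaining sink, channel, and binding sections is below (the HDFS path and channel capacities are assumptions, not from the original):
ftohdfs.sinks.hdfssink.type = hdfs
ftohdfs.sinks.hdfssink.hdfs.path = hdfs://localhost:8020/user/flume/weblogs
ftohdfs.sinks.hdfssink.hdfs.fileType = DataStream

# Use a channel that buffers events in memory
ftohdfs.channels.mchannel.type = memory
ftohdfs.channels.mchannel.capacity = 1000
ftohdfs.channels.mchannel.transactionCapacity = 100

# Bind the source and sink to the channel
ftohdfs.sources.logsource.channels = mchannel
ftohdfs.sinks.hdfssink.channel = mchannel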
Advantages of Dataset/DataFrame over RDDs from Spark 2.0
--------------------------------------------------------
1. Dataset's strong typing: compile-time checks.
   The strong typing is virtual - at the 'table' level every row shares the same datatypes,
   which makes it portable across languages like SQL/Python/Scala.
   e.g.: val ds = sqlContext.range(3)
         ds.as[String]
2. Existing business logic can be reused (the class name Zip is assumed; the original is unnamed):
   case class Zip(zip: String, city: String, loc: Array[Double], pop: Long, state: String) {
     val latChicago = 41.87
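   The snippet is cut off here; a hedged completion of the idea - plain Scala logic defined once in the case class and reused on a typed Dataset (the nearChicago method, the loc(1)-is-latitude convention, and the zips.json path are all assumptions):
     def nearChicago: Boolean = math.abs(loc(1) - latChicago) < 1.0
   }
   import spark.implicits._
   val zips = spark.read.json("zips.json").as[Zip]
   zips.filter(_.nearChicago).show()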
Kafka vs Flume:
Kafka - general-purpose infrastructure with producer and consumer APIs, so it can be used with any system/sink.
It is an enterprise messaging system that connects any systems, not just Hadoop - Flume is primarily meant for Hadoop.
Kafka is asynchronous - producers and consumers work at their own pace, so, for example, there is no data loss if the consumer is down for some time.
The consumer can come up later and resume where it left off - it can even pull from a particular older offset if needed.
Overflow events sit in Kafka's "persistent buffer" (its on-disk log),
so event spikes are easily accommodated - Kafka can handle on the order of 100K events per second.
High durability/fault tolerance: messages are persisted to disk and replicated across brokers.
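To make the "resume from a particular offset" point concrete, here is a minimal sketch using the Kafka consumer API from Scala (the topic name comes from the quickstart above; the group id and the offset 42 are arbitrary examples):
import java.util.Properties
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import scala.collection.JavaConverters._

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("group.id", "demo-group")
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
val partition = new TopicPartition("ftoKafka_topic", 0)
consumer.assign(java.util.Arrays.asList(partition))
consumer.seek(partition, 42L)            // jump back to an older offset
val records = consumer.poll(1000L)       // fetch from offset 42 onwards
records.asScala.foreach(r => println(s"offset=${r.offset} value=${r.value}"))
consumer.close()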
package org.apache.spark.examples.streaming

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}

/**
 * Use DataFrames and SQL to count words in UTF8 encoded, '\n' delimited text received from the
 * network every second.
 */
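The snippet is truncated after the imports. Judging from them, this is Spark's SqlNetworkWordCount example; a sketch of how the body proceeds is below (host, port, and the 2-second batch interval are illustrative):
object SqlNetworkWordCount {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("SqlNetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(2))

    // Socket stream fed by e.g. `nc -lk 9999`
    val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER)
    val words = lines.flatMap(_.split(" "))

    // Convert each batch's RDD of words to a DataFrame and count with SQL
    words.foreachRDD { (rdd: RDD[String], time: Time) =>
      val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
      import spark.implicits._
      val wordsDF = rdd.toDF("word")
      wordsDF.createOrReplaceTempView("words")
      val wordCountsDF = spark.sql("select word, count(*) as total from words group by word")
      println(s"========= $time =========")
      wordCountsDF.show()
    }

    ssc.start()
    ssc.awaitTermination()
  }
}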
The difference between map and flatMap is a little confusing for beginners - this example might help.
It can be tested in a Spark shell or the Scala REPL:
scala> val l = List(1,2,3,4,5)
scala> l.map(x => List(x-1, x, x+1))
res1: List[List[Int]] = List(List(0, 1, 2), List(1, 2, 3), List(2, 3, 4), List(3, 4, 5), List(4, 5, 6))
scala> l.flatMap(x => List(x-1, x, x+1))
res2: List[Int] = List(0, 1, 2, 1, 2, 3, 2, 3, 4, 3, 4, 5, 4, 5, 6)
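The same distinction carries over to Spark RDDs - map keeps one output element per input element, while flatMap flattens the results (a quick sketch; run in spark-shell):
scala> val rdd = sc.parallelize(l)
scala> rdd.map(x => List(x-1, x, x+1)).collect()       // Array of Lists, one per element
scala> rdd.flatMap(x => List(x-1, x, x+1)).collect()   // one flat Array: 0, 1, 2, 1, 2, 3, ...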
val lrdd = sc.parallelize(List(1,2,3,4,5,3,5))
lrdd.reduce((a, b) => a + b)
// shorthand for the syntax above
lrdd.reduce(_ + _)
// try changing reduce to reduceByKey - this fails to compile,
// as reduceByKey is only available on RDDs of key-value pairs
lrdd.reduceByKey(_ + _)
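To use reduceByKey, first map the elements into (key, value) pairs - here counting occurrences of each number (output ordering may vary across partitions):
lrdd.map(x => (x, 1)).reduceByKey(_ + _).collect()
// e.g. Array((1,1), (2,1), (3,2), (4,1), (5,2))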
val lrdd = sc.parallelize(List(1,2,3,4,5,3,5))
// the conversions below need: import sqlContext.implicits._ (Spark 1.x) or import spark.implicits._ (Spark 2.x)
// without a case class
val namedDF = sqlContext.createDataFrame(lrdd.map(Tuple1.apply)).toDF("Id")
// with a case class
case class Dummy(Id: Int)
val namedDF2 = lrdd.map(x => Dummy(x)).toDF()
// one-liner DF straight from a local List
val ldf = List(1,2,3,4,5,3,5).toDS().toDF()
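Any of these can be checked quickly (a usage note, not from the original):
namedDF.printSchema()   // root |-- Id: integer
namedDF.show()          // a single Id column with rows 1, 2, 3, 4, 5, 3, 5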
// Analysing data from a Kafka topic.
// The script below can be saved as name.scala and run from spark-shell,
// or built as a Scala project: use the build.sbt dependencies sketched below and compile.
import kafka.serializer.StringDecoder
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.sql._
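The dependency list referenced above is cut off in the original. For the 0.8-style Kafka DStream API imported here, a build.sbt along these lines should work (artifact versions are assumptions - match them to your cluster):
name := "kafka-streaming-analysis"
scalaVersion := "2.11.8"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.1.0" % "provided",
  "org.apache.spark" %% "spark-sql" % "2.1.0" % "provided",
  "org.apache.spark" %% "spark-streaming" % "2.1.0" % "provided",
  "org.apache.spark" %% "spark-streaming-kafka-0-8" % "2.1.0"
)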
Please check the URL:
http://training.databricks.com/visualapi.pdf