The difference between map and flatMap is a little confusing for beginners; this example might help. It can be tested in a spark-shell or the Scala REPL:

scala> val l = List(1,2,3,4,5)
scala> l.map(x => List(x-1, x, x+1))
res1: List[List[Int]] = List(List(0, 1, 2), List(1, 2, 3), List(2, 3, 4), List(3, 4, 5), List(4, 5, 6))
scala> l.flatMap(x => List(x-1, x, x+1))
res2: List[Int] = List(0, 1, 2, 1, 2, 3, 2, 3, 4, 3, 4, 5, 4, 5, 6)
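The relationship can be stated as: flatMap(f) is equivalent to map(f) followed by flatten. A minimal plain-Scala check (no Spark needed):

```scala
object FlatMapDemo extends App {
  val l = List(1, 2, 3, 4, 5)

  // map produces a nested list: one inner List per element
  val nested: List[List[Int]] = l.map(x => List(x - 1, x, x + 1))

  // flatMap produces a single flattened list
  val flat: List[Int] = l.flatMap(x => List(x - 1, x, x + 1))

  // flatMap(f) == map(f).flatten
  assert(flat == nested.flatten)
  println(flat.take(6)) // List(0, 1, 2, 1, 2, 3)
}
```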
val lrdd = sc.parallelize(List(1, 2, 3, 4, 5, 3, 5))
// without a case class: wrap each Int in a Tuple1 so createDataFrame
// sees a Product type, then name the column
val namedDF = sqlContext.createDataFrame(lrdd.map(Tuple1.apply)).toDF("Id")
// with a case class
case class Dummy(Id: Int)
val namedDF2 = lrdd.map(x => Dummy(x)).toDF()
// one-liner DataFrame (spark-shell imports sqlContext.implicits._ automatically;
// a standalone app needs that import for toDS/toDF)
val ldf = List(1, 2, 3, 4, 5, 3, 5).toDS().toDF()
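The Tuple1.apply step matters because createDataFrame needs rows of a Product type (a tuple or case class), not bare Ints. What that mapping does can be seen in plain Scala, without Spark:

```scala
object TupleWrapDemo extends App {
  val xs = List(1, 2, 3)

  // Each Int is wrapped in a one-element tuple, giving the
  // single-column "row" shape that createDataFrame expects.
  val rows: List[Tuple1[Int]] = xs.map(Tuple1.apply)

  println(rows) // List((1,), (2,), (3,))
}
```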
val lrdd = sc.parallelize(List(1, 2, 3, 4, 5, 3, 5))
lrdd.reduce((a, b) => a + b)
// shorthand for the line above
lrdd.reduce(_ + _)
// try changing reduce to reduceByKey, which will fail to compile:
// reduceByKey is only available on RDDs of key-value pairs
lrdd.reduceByKey(_ + _)
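The underscore form is just shorthand for the explicit two-argument function; both can be checked on a plain List, where reduce behaves the same way for an associative operation like addition:

```scala
object ReduceDemo extends App {
  val l = List(1, 2, 3, 4, 5, 3, 5)

  val explicit = l.reduce((a, b) => a + b)
  val short    = l.reduce(_ + _)

  // the two forms are identical after eta-expansion
  assert(explicit == short)
  println(explicit) // 23
}
```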
# Analysing from a Kafka topic:
# The script below can be put in a Scala file (name.scala) and run from spark-shell.
# It can also be built as a standalone Scala project: remove the comments, add the required dependencies to build.sbt, and compile.
import kafka.serializer.StringDecoder
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.sql._
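For the standalone-project route, a build.sbt along these lines would pull in the old direct-stream Kafka API that the imports above come from (the 1.6.x versions and Scala 2.10 are assumptions; match them to your cluster):

```scala
// build.sbt sketch -- versions are illustrative, align with your cluster
name := "kafka-streaming-example"

scalaVersion := "2.10.6"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"            % "1.6.3" % "provided",
  "org.apache.spark" %% "spark-sql"             % "1.6.3" % "provided",
  "org.apache.spark" %% "spark-streaming"       % "1.6.3" % "provided",
  "org.apache.spark" %% "spark-streaming-kafka" % "1.6.3"
)
```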
1. MapJoin:
   Small tables can be loaded in memory and joined with bigger tables.
   1. Use the hint /*+ MAPJOIN(table_name) */
   2. Better option: let Hive do it automatically by setting these properties:
      hive.auto.convert.join = true
      hive.mapjoin.smalltable.filesize = <size in bytes> (default is 25MB)
2. Partition Design
   Choose a low-cardinality column, e.g. region, year
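Both routes might look like this in practice (the table and column names are hypothetical):

```sql
-- Option 1: explicit hint, forcing the small dimension table into memory
SELECT /*+ MAPJOIN(dim_region) */ f.id, d.region_name
FROM fact_sales f JOIN dim_region d ON f.region_id = d.region_id;

-- Option 2: let Hive convert eligible joins automatically
SET hive.auto.convert.join=true;
SET hive.mapjoin.smalltable.filesize=25000000;  -- bytes; default is ~25MB
```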
-------------kafka notes-----------
Why Kafka?
- Better throughput
- Replication
- Built-in partitioning
- Fault tolerance
Topic names are unique within a cluster.
Location of a message -> topic -> partition -> offset
1. Producer
   1. request.required.acks=[0,1,all/-1]: 0 = no acknowledgement but very fast, 1 = acknowledged after the leader commits, all/-1 = acknowledged after full replication
   2. Use the async producer, with a callback for the acknowledgement, via the property producer.type=async
   3. Batching data: send multiple messages together.
      batch.num.messages
      queue.buffering.max.ms
   4. Compression for large payloads: gzip and snappy are supported.
      Very large files can be stored in a shared location, and just the file path can be logged by the Kafka producer.
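Pulled together, the 0.8-era Scala producer settings above would sit in a producer config roughly like this (values are illustrative, not recommendations):

```properties
# producer.properties sketch (old Scala producer; values illustrative)
request.required.acks=1        # 0 = fire-and-forget, 1 = leader ack, -1/all = full replication
producer.type=async            # async batching with callbacks instead of blocking sends
batch.num.messages=200         # messages per batch in async mode
queue.buffering.max.ms=5000    # max time to buffer before forcing a send
compression.codec=snappy       # gzip and snappy are supported
```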
env # to get all env variables
# Find class files in the jars under /usr/hdp/current
all_hdp_classes () {
  find -L /usr/hdp/current -maxdepth 20 -name "*.jar" -print | while read line; do
    for i in `jar tf $line | grep .class`
    do
      echo $line : $i
    done
  done
}
1. mapPartitions() instead of map(): use it when expensive initialization (e.g. a DB connection) is needed; the setup runs once per partition rather than once per element.
2. RDD parallelism: for RDDs with no parent, e.g. sc.parallelize(..., 4). Unless specified, YARN will try to use as many CPU cores as are available.
   This can be tuned using the spark.default.parallelism property.
   - To find the default parallelism, use sc.defaultParallelism
     rdd.getNumPartitions
     val rdd = sc.parallelize(<value>, 4)
     rdd.getNumPartitions // returns 4
env # to get all env variables
# --------- work as root ---------
su -
# --------- ifconfig synonyms ---------
ip address show    # or: ip a s, or: ip a s eth0
# --------- date-stamped file name ---------
cp a.txt a_$(date +%F).txt