GitHub gists of thanooj kalathuru (thanoojgithub)
@thanoojgithub
thanoojgithub / 2_SCD_Type_2_Data_model_using_PySpark.py
Last active March 27, 2022 16:44
Sample code 2 - Implementing SCD Type 2 Data model using PySpark
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

# Local SparkSession for the SCD Type 2 example; keep log output quiet.
spark = SparkSession \
    .builder \
    .master('local') \
    .appName('pyspark-test-run') \
    .getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
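The preview stops after the session setup. As a rough sketch of the SCD Type 2 merge this gist builds toward (the dimension schema, the column names id/name/start_date/end_date, and the sample rows are assumptions, not the gist's actual data):

from pyspark.sql.functions import col, lit, current_date

# Current dimension table: one open row (end_date NULL) per business key.
dim = spark.createDataFrame(
    [(101, "ram", "2021-01-01", None), (102, "sam", "2021-01-01", None)],
    "id INT, name STRING, start_date STRING, end_date STRING")

# Incoming snapshot: id 101 changed its name, id 103 is new.
src = spark.createDataFrame([(101, "rama"), (102, "sam"), (103, "jam")],
                            "id INT, name STRING")

open_rows = dim.filter(col("end_date").isNull())

# Keys whose tracked attribute changed, plus brand-new keys.
joined = src.alias("s").join(open_rows.alias("d"), "id", "left")
changed_or_new = joined.filter(col("d.name").isNull() | (col("s.name") != col("d.name")))

# 1) Close the superseded open rows.
closed = (open_rows.join(changed_or_new.select("id"), "id", "left_semi")
          .withColumn("end_date", current_date().cast("string")))

# 2) Open a new row for each changed or new key.
opened = changed_or_new.select(
    "id", col("s.name").alias("name"),
    current_date().cast("string").alias("start_date"),
    lit(None).cast("string").alias("end_date"))

# 3) Closed history and still-valid open rows pass through unchanged.
history = dim.filter(col("end_date").isNotNull())
untouched = open_rows.join(changed_or_new.select("id"), "id", "left_anti")

new_dim = history.unionByName(untouched).unionByName(closed).unionByName(opened)
new_dim.orderBy("id", "start_date").show()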
@thanoojgithub
thanoojgithub / Sample_code_1_SCD_Type_2_Data_model_using_PySpark.py
Last active January 31, 2022 08:59
Sample code - Implementing SCD Type 2 Data model using PySpark
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

spark = SparkSession \
    .builder \
    .master('local') \
    .appName('pyspark-test-run') \
    .getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
@thanoojgithub
thanoojgithub / PySparkOne.py
Last active January 21, 2022 05:01
PySpark Example One
from pyspark.sql import SparkSession
from pyspark.sql import Window
from pyspark.sql.functions import *
spark = SparkSession \
    .builder \
    .master('local') \
    .appName('pyspark-test-run') \
    .getOrCreate()
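The preview ends before the imported Window is used; something along these lines is the typical pattern (the sample data and the latest-transaction-per-id query are assumptions):

from pyspark.sql.functions import col, row_number

df = spark.createDataFrame(
    [(101, "ram", 10001, 120.0), (101, "ram", 10003, 140.0), (102, "sam", 10002, 130.0)],
    ["id", "name", "transid", "amount"])

# Number each id's rows from newest transid to oldest, then keep only the newest.
w = Window.partitionBy("id").orderBy(col("transid").desc())
df.withColumn("row_number", row_number().over(w)) \
  .filter(col("row_number") == 1) \
  .show()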
@thanoojgithub
thanoojgithub / JavaByExamples
Last active February 24, 2021 16:19
Java By Examples
1. How to find the second highest value in a Map<String, Integer>
// requires java.util.{HashMap, Map, Optional, Comparator}
Map<String, Integer> books = new HashMap<>();
books.put("one", 1);
books.put("two", 22);
books.put("three", 333);
books.put("four", 4444);
books.put("five", 55555);
books.put("six", 666666);
// Distinct values sorted descending; skip the maximum to land on the second highest.
Optional<Integer> secondHighest = books.values().stream()
        .distinct()
        .sorted(Comparator.reverseOrder())
        .skip(1)
        .findFirst();
System.out.println(secondHighest.orElse(null)); // 55555
@thanoojgithub
thanoojgithub / SparkByExamples.scala
Created January 27, 2021 07:29
SparkByExamples - Spark By Examples
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val data = sc.parallelize(Seq((101,"ram","12-01-2021",10001,120.00),(102,"sam","12-01-2021",10002,130.00),(101,"ram","12-01-2021",10003,140.00),(103,"jam","12-01-2021",10004,150.00),(101,"ram","12-01-2021",10005,130.00),(103,"jam","12-01-2021",10006,120.00),(102,"sam","12-01-2021",10007,130.00)))
val dataDF = data.toDF("id","name","date","transid","amount")
val windowSpec = Window.partitionBy("id").orderBy(col("transid").desc)
val dataDF1 = dataDF.withColumn("row_number", row_number().over(windowSpec))
dataDF.printSchema
dataDF.show()
dataDF1.printSchema
dataDF1.show()
@thanoojgithub
thanoojgithub / hiveQueryOptimizationTechniques.txt
Last active October 28, 2023 11:52
Hive query optimization techniques
https://github.com/Thomas-George-T/Movies-Analytics-in-Spark-and-Scala
Change the execution engine to Tez or Spark (add the Tez/Spark client jars to HADOOP_CLASSPATH)
Partitioning - the PARTITIONED BY clause divides the table into partitions based on column values (see the DDL sketch below)
Bucketing - the CLUSTERED BY clause divides the table (or each partition) into buckets
Map-side join, bucket map-side join, sorted-bucket map-side join
Use a suitable file format, e.g. ORC (Optimized Row Columnar)
Indexing
Vectorization, along with ORC
Cost-Based Optimization (CBO)
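For example, partitioning, bucketing, and ORC can be combined in one table definition. A sketch in Spark SQL's datasource syntax (table and column names are made up; plain Hive DDL would instead use PARTITIONED BY (txn_date STRING) ... CLUSTERED BY ... STORED AS ORC):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("hive-opt-sketch").getOrCreate()

# ORC gives a columnar, vectorization-friendly layout; partitioning enables
# pruning on txn_date; bucketing on transid enables bucket map-side joins.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales (
        transid  INT,
        amount   DOUBLE,
        txn_date STRING
    )
    USING ORC
    PARTITIONED BY (txn_date)
    CLUSTERED BY (transid) INTO 8 BUCKETS
""")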
@thanoojgithub
thanoojgithub / SparkWithHiveUsingPython.py
Created December 12, 2020 17:37
Spark with Hive using Python
import subprocess
from pyspark.sql import functions as f
from operator import add
from pyspark.sql import Row, SparkSession
from pyspark.sql.types import StructField, StringType, StructType
def sparkwithhiveone():
    sparkwithhive = getsparkwithhive()
    try:
        assert (sparkwithhive.conf.get("spark.sql.catalogImplementation") == "hive")
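The preview is cut off before getsparkwithhive() is defined; judging by the assertion on spark.sql.catalogImplementation, it presumably looks something like this sketch (not the gist's actual code):

def getsparkwithhive():
    # enableHiveSupport() sets spark.sql.catalogImplementation to "hive",
    # which is what sparkwithhiveone() asserts above.
    return (SparkSession.builder
            .master("local")
            .appName("spark-with-hive")
            .enableHiveSupport()
            .getOrCreate())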
@thanoojgithub
thanoojgithub / WhySparkSQLOverHiveQL.txt
Created December 4, 2020 18:04
Why Spark SQL over HiveQL
By default, Hive uses the MapReduce engine, but we can switch to Tez or even Spark (in-memory computation).
But:
Hive offers HiveQL (HQL), a SQL-like language that is a natural fit if you are a SQL developer,
yet even with UDFs there is little extra room for core/complex business logic,
while Spark has Spark SQL and lets us move from DataFrame to RDD and back to implement core/complex business logic (see the sketch below).
Hive has no resume capability.
Hive cannot drop encrypted databases.
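To illustrate the DataFrame-to-RDD round trip mentioned above (a minimal sketch; the sample data and the tax-rule transformation are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("df-rdd-df").getOrCreate()

df = spark.createDataFrame([(101, 120.0), (102, 130.0)], ["id", "amount"])

# Drop down to the RDD API for arbitrary Python logic with no SQL equivalent ...
rdd = df.rdd.map(lambda row: (row.id, row.amount * 1.18))

# ... then return to a DataFrame to continue with Spark SQL.
rdd.toDF(["id", "amount_with_tax"]).show()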
@thanoojgithub
thanoojgithub / KafkaSampleProducer.java
Last active November 30, 2020 16:23
Kafka sample producer in Java
package com.kafkaconnectone;
import java.util.Map.Entry;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.KafkaException;
import org.apache.kafka.common.errors.AuthorizationException;
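// The gist preview truncates after the imports. A minimal, hypothetical
// completion sketch (class name, topic, and broker address are assumptions,
// not the gist's actual code):
public class KafkaSampleProducer {
	public static void main(String[] args) {
		Properties props = new Properties();
		props.put("bootstrap.servers", "localhost:9092"); // assumed broker
		props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
		props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
		// try-with-resources flushes and closes the producer on exit.
		try (Producer<String, String> producer = new KafkaProducer<>(props)) {
			producer.send(new ProducerRecord<>("sample-topic", "key-1", "value-1"));
		} catch (KafkaException e) { // AuthorizationException is a subtype
			e.printStackTrace();
		}
	}
}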