GitHub gists of thanooj kalathuru (thanoojgithub)
@thanoojgithub
thanoojgithub / 2_SCD_Type_2_Data_model_using_PySpark.py
Last active March 27, 2022 16:44
Sample code 2 - Implementing SCD Type 2 Data model using PySpark
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

# Local SparkSession for the SCD Type 2 example; keep log output quiet.
spark = SparkSession \
    .builder \
    .master('local') \
    .appName('pyspark-test-run') \
    .getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
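The preview stops after the session setup. As a rough sketch of the SCD Type 2 merge this gist builds toward (the dimension schema, the column names id/name/start_date/end_date, and the sample rows are assumptions, not the gist's actual data):

from pyspark.sql.functions import col, lit, current_date

# Current dimension table: one open row (end_date NULL) per business key.
dim = spark.createDataFrame(
    [(101, "ram", "2021-01-01", None), (102, "sam", "2021-01-01", None)],
    "id INT, name STRING, start_date STRING, end_date STRING")

# Incoming snapshot: id 101 changed its name, id 103 is new.
src = spark.createDataFrame([(101, "rama"), (102, "sam"), (103, "jam")],
                            "id INT, name STRING")

open_rows = dim.filter(col("end_date").isNull())

# Keys whose tracked attribute changed, plus brand-new keys.
joined = src.alias("s").join(open_rows.alias("d"), "id", "left")
changed_or_new = joined.filter(col("d.name").isNull() | (col("s.name") != col("d.name")))

# 1) Close the superseded open rows.
closed = (open_rows.join(changed_or_new.select("id"), "id", "left_semi")
          .withColumn("end_date", current_date().cast("string")))

# 2) Open a new row for each changed or new key.
opened = changed_or_new.select(
    "id", col("s.name").alias("name"),
    current_date().cast("string").alias("start_date"),
    lit(None).cast("string").alias("end_date"))

# 3) Closed history and still-valid open rows pass through unchanged.
history = dim.filter(col("end_date").isNotNull())
untouched = open_rows.join(changed_or_new.select("id"), "id", "left_anti")

new_dim = history.unionByName(untouched).unionByName(closed).unionByName(opened)
new_dim.orderBy("id", "start_date").show()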
@thanoojgithub
thanoojgithub / Sample_code_1_SCD_Type_2_Data_model_using_PySpark.py
Last active January 31, 2022 08:59
Sample code - Implementing SCD Type 2 Data model using PySpark
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

spark = SparkSession \
    .builder \
    .master('local') \
    .appName('pyspark-test-run') \
    .getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
@thanoojgithub
thanoojgithub / PySparkOne.py
Last active January 21, 2022 05:01
PySpark Example One
from pyspark.sql import SparkSession
from pyspark.sql import Window
from pyspark.sql.functions import *
spark = SparkSession \
    .builder \
    .master('local') \
    .appName('pyspark-test-run') \
    .getOrCreate()
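The preview ends before the imported Window is used; something along these lines is the typical pattern (the sample data and the latest-transaction-per-id query are assumptions):

from pyspark.sql.functions import col, row_number

df = spark.createDataFrame(
    [(101, "ram", 10001, 120.0), (101, "ram", 10003, 140.0), (102, "sam", 10002, 130.0)],
    ["id", "name", "transid", "amount"])

# Number each id's rows from newest transid to oldest, then keep only the newest.
w = Window.partitionBy("id").orderBy(col("transid").desc())
df.withColumn("row_number", row_number().over(w)) \
  .filter(col("row_number") == 1) \
  .show()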
@thanoojgithub
thanoojgithub / JavaByExamples
Last active February 24, 2021 16:19
Java By Examples
1. How to find the second highest value in a Map<String, Integer>
// requires java.util.{HashMap, Map, Optional, Comparator}
Map<String, Integer> books = new HashMap<>();
books.put("one", 1);
books.put("two", 22);
books.put("three", 333);
books.put("four", 4444);
books.put("five", 55555);
books.put("six", 666666);
// Distinct values sorted descending; skip the maximum to land on the second highest.
Optional<Integer> secondHighest = books.values().stream()
        .distinct()
        .sorted(Comparator.reverseOrder())
        .skip(1)
        .findFirst();
System.out.println(secondHighest.orElse(null)); // 55555
@thanoojgithub
thanoojgithub / SparkByExamples.scala
Created January 27, 2021 07:29
SparkByExamples - Spark By Examples
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val data = sc.parallelize(Seq((101,"ram","12-01-2021",10001,120.00),(102,"sam","12-01-2021",10002,130.00),(101,"ram","12-01-2021",10003,140.00),(103,"jam","12-01-2021",10004,150.00),(101,"ram","12-01-2021",10005,130.00),(103,"jam","12-01-2021",10006,120.00),(102,"sam","12-01-2021",10007,130.00)))
val dataDF = data.toDF("id","name","date","transid","amount")
val windowSpec = Window.partitionBy("id").orderBy(col("transid").desc)
val dataDF1 = dataDF.withColumn("row_number", row_number().over(windowSpec))
dataDF.printSchema
dataDF.show()
dataDF1.printSchema
dataDF1.show()
@thanoojgithub
thanoojgithub / hiveQueryOptimizationTechniques.txt
Last active October 28, 2023 11:52
Hive query optimization techniques
https://github.com/Thomas-George-T/Movies-Analytics-in-Spark-and-Scala
Change the execution engine to Tez or Spark (add the Tez/Spark client jars to HADOOP_CLASSPATH)
Partitioning - the PARTITIONED BY clause divides the table into partitions based on column values (see the DDL sketch below)
Bucketing - the CLUSTERED BY clause divides the table (or each partition) into buckets
Map-side join, bucket map-side join, sorted-bucket map-side join
Use a suitable file format, e.g. ORC (Optimized Row Columnar)
Indexing
Vectorization, along with ORC
Cost-Based Optimization (CBO)
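For example, partitioning, bucketing, and ORC can be combined in one table definition. A sketch in Spark SQL's datasource syntax (table and column names are made up; plain Hive DDL would instead use PARTITIONED BY (txn_date STRING) ... CLUSTERED BY ... STORED AS ORC):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("hive-opt-sketch").getOrCreate()

# ORC gives a columnar, vectorization-friendly layout; partitioning enables
# pruning on txn_date; bucketing on transid enables bucket map-side joins.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales (
        transid  INT,
        amount   DOUBLE,
        txn_date STRING
    )
    USING ORC
    PARTITIONED BY (txn_date)
    CLUSTERED BY (transid) INTO 8 BUCKETS
""")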
@thanoojgithub
thanoojgithub / SparkWithHiveUsingPython.py
Created December 12, 2020 17:37
Spark with Hive using Python
import subprocess
from pyspark.sql import functions as f
from operator import add
from pyspark.sql import Row, SparkSession
from pyspark.sql.types import StructField, StringType, StructType
def sparkwithhiveone():
    sparkwithhive = getsparkwithhive()
    try:
        assert (sparkwithhive.conf.get("spark.sql.catalogImplementation") == "hive")
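The preview is cut off before getsparkwithhive() is defined; judging by the assertion on spark.sql.catalogImplementation, it presumably looks something like this sketch (not the gist's actual code):

def getsparkwithhive():
    # enableHiveSupport() sets spark.sql.catalogImplementation to "hive",
    # which is what sparkwithhiveone() asserts above.
    return (SparkSession.builder
            .master("local")
            .appName("spark-with-hive")
            .enableHiveSupport()
            .getOrCreate())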
@thanoojgithub
thanoojgithub / WhySparkSQLOverHiveQL.txt
Created December 4, 2020 18:04
Why Spark SQL over HiveQL
By default, Hive uses the MapReduce engine, but we can switch to Tez or even Spark (in-memory computation).
But:
Hive offers HiveQL (HQL), a SQL-like language that is a natural fit if you are a SQL developer,
yet even with UDFs there is little extra room for core/complex business logic,
while Spark has Spark SQL and lets us move from DataFrame to RDD and back to implement core/complex business logic (see the sketch below).
Hive has no resume capability.
Hive cannot drop encrypted databases.
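To illustrate the DataFrame-to-RDD round trip mentioned above (a minimal sketch; the sample data and the tax-rule transformation are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("df-rdd-df").getOrCreate()

df = spark.createDataFrame([(101, 120.0), (102, 130.0)], ["id", "amount"])

# Drop down to the RDD API for arbitrary Python logic with no SQL equivalent ...
rdd = df.rdd.map(lambda row: (row.id, row.amount * 1.18))

# ... then return to a DataFrame to continue with Spark SQL.
rdd.toDF(["id", "amount_with_tax"]).show()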
@thanoojgithub
thanoojgithub / KafkaSampleProducer.java
Last active November 30, 2020 16:23
Kafka sample producer in Java
package com.kafkaconnectone;
import java.util.Map.Entry;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.KafkaException;
import org.apache.kafka.common.errors.AuthorizationException;
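// The gist preview truncates after the imports. A minimal, hypothetical
// completion sketch (class name, topic, and broker address are assumptions,
// not the gist's actual code):
public class KafkaSampleProducer {
	public static void main(String[] args) {
		Properties props = new Properties();
		props.put("bootstrap.servers", "localhost:9092"); // assumed broker
		props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
		props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
		// try-with-resources flushes and closes the producer on exit.
		try (Producer<String, String> producer = new KafkaProducer<>(props)) {
			producer.send(new ProducerRecord<>("sample-topic", "key-1", "value-1"));
		} catch (KafkaException e) { // AuthorizationException is a subtype
			e.printStackTrace();
		}
	}
}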