Reference: https://github.com/Thomas-George-T/Movies-Analytics-in-Spark-and-Scala

Hive query optimization techniques:
- Change the execution engine to Tez or Spark (put the Tez/Spark client jars on HADOOP_CLASSPATH).
- Partitioning: the PARTITIONED BY clause divides the table into partitions based on the partition column values.
- Bucketing: the CLUSTERED BY clause divides the table into a fixed number of buckets.
- Map-side join, bucket map-side join, sorted bucket map-side join.
- Use a suitable file format, e.g. ORC (Optimized Row Columnar).
- Indexing.
- Vectorization (most effective together with ORC).
- CBO (Cost-Based Optimizer).
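As a sketch of how partitioning, bucketing, and ORC from the list above combine in one table definition (the table and column names here are hypothetical, not from the source):

```sql
-- Hypothetical table: partitioned by dt, bucketed by id into 8 buckets, stored as ORC
CREATE TABLE sales (
  id     INT,
  name   STRING,
  amount DOUBLE
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (id) INTO 8 BUCKETS
STORED AS ORC;

-- Vectorization and CBO are session settings rather than DDL
SET hive.vectorized.execution.enabled = true;
SET hive.cbo.enable = true;
```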
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._   // needed for toDF on an RDD (available automatically in spark-shell)

val data = sc.parallelize(Seq((101,"ram","12-01-2021",10001,120.00),(102,"sam","12-01-2021",10002,130.00),(101,"ram","12-01-2021",10003,140.00),(103,"jam","12-01-2021",10004,150.00),(101,"ram","12-01-2021",10005,130.00),(103,"jam","12-01-2021",10006,120.00),(102,"sam","12-01-2021",10007,130.00)))
val dataDF = data.toDF("id","name","date","transid","amount")
// Partition by id, newest transid first; use row_number() so the function matches the column name
val windowSpec = Window.partitionBy("id").orderBy(col("transid").desc)
val dataDF1 = dataDF.withColumn("row_number", row_number().over(windowSpec))
dataDF.printSchema
dataDF.show()
dataDF1.printSchema
dataDF1.show()
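The difference between `rank()` and `row_number()` only shows up on ties in the ordering column. A plain-Python sketch of the two window functions (the data mirrors the example above, but here we order by amount descending, where id 102 has a tie; this is an illustration of the semantics, not PySpark code):

```python
from itertools import groupby

# (id, name, transid, amount) -- same rows as the Spark example
rows = [
    (101, "ram", 10001, 120.00),
    (102, "sam", 10002, 130.00),
    (101, "ram", 10003, 140.00),
    (103, "jam", 10004, 150.00),
    (101, "ram", 10005, 130.00),
    (103, "jam", 10006, 120.00),
    (102, "sam", 10007, 130.00),
]

def window_numbers(rows):
    """Per id partition, assign (row_number, rank) ordered by amount descending."""
    out = []
    ordered = sorted(rows, key=lambda r: (r[0], -r[3]))
    for _, part in groupby(ordered, key=lambda r: r[0]):
        prev_amount, rank = None, 0
        for i, row in enumerate(part, start=1):
            if row[3] != prev_amount:  # ties keep the same rank; row_number never ties
                rank, prev_amount = i, row[3]
            out.append((row, i, rank))
    return out

for row, rn, rk in window_numbers(rows):
    print(row, "row_number =", rn, "rank =", rk)
```

For id 102 the two equal amounts get row_numbers 1 and 2 but both get rank 1, which is exactly why the Scala snippet should use `row_number()` when naming the column "row_number".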
1. How to find the second highest value from a Map<String, Integer>

import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

Map<String, Integer> books = new HashMap<>();
books.put("one", 1);
books.put("two", 22);
books.put("three", 333);
books.put("four", 4444);
books.put("five", 55555);
books.put("six", 666666);
// Sort the distinct values descending, skip the highest, take the next one
Optional<Integer> secondHighest = books.values().stream()
        .distinct()
        .sorted(Comparator.reverseOrder())
        .skip(1)
        .findFirst();
System.out.println(secondHighest.orElse(null)); // 55555
from pyspark.sql import SparkSession
from pyspark.sql import Window
from pyspark.sql.functions import *

spark = SparkSession \
    .builder \
    .master('local') \
    .appName('pyspark-test-run') \
    .getOrCreate()
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

spark = SparkSession \
    .builder \
    .master('local') \
    .appName('pyspark-test-run') \
    .getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .master('local') \
        .appName('pyspark-test-run') \
        .getOrCreate()
MySQL Shell 8.0.28
Copyright (c) 2016, 2022, Oracle and/or its affiliates.
Oracle is a registered trademark of Oracle Corporation and/or its affiliates.
Other names may be trademarks of their respective owners.
Type '\help' or '\?' for help; '\quit' to exit.
MySQL JS > \connect thanooj@localhost
Creating a session to 'thanooj@localhost'
Please provide the password for 'thanooj@localhost': ********
Hadoop 3.2.2
Start-up commands (fresh format):
--------------------------------
1. Stop DFS and YARN first (stop-dfs.sh, stop-yarn.sh).
2. Remove the namenode and datanode directories, as specified by dfs.namenode.name.dir and dfs.datanode.data.dir in hdfs-site.xml.
3. Re-create the directories.
4. Format the namenode: hdfs namenode -format
5. Then restart DFS and YARN as follows:
   start-dfs.sh
   start-yarn.sh
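The steps above can be sketched as one script. The data-directory paths here are hypothetical placeholders; substitute the actual values from your hdfs-site.xml, and note that this wipes all HDFS data:

```shell
#!/usr/bin/env bash
# Reset a local Hadoop cluster to a freshly formatted state.
# WARNING: this destroys all data stored in HDFS.
set -euo pipefail

# Hypothetical paths -- use the dfs.namenode.name.dir / dfs.datanode.data.dir
# values from your own hdfs-site.xml
NAMENODE_DIR=/tmp/hadoop/dfs/name
DATANODE_DIR=/tmp/hadoop/dfs/data

stop-yarn.sh
stop-dfs.sh

rm -rf "$NAMENODE_DIR" "$DATANODE_DIR"
mkdir -p "$NAMENODE_DIR" "$DATANODE_DIR"

hdfs namenode -format -force

start-dfs.sh
start-yarn.sh
```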