Incoming/Sources/Inputs: S3 files (CSV, Parquet) and RDBMS (Oracle, MySQL) data
Languages: Python, shell script
Cluster: AWS EMR (it is like Hortonworks)
Linux: Linux server (it is EC2)
Processing:
1. Hive SQL
2. MapReduce Java code
3. Spark (Spark SQL, i.e. Spark DataFrames)
Target: S3 and AWS Athena (like Hive external tables)
IDE: PyCharm
Source repo: github.com
Tools: PuTTY, WinSCP (for development: copying files from Windows to Linux)
For UAT and PROD: use CI/CD to deploy the files to the PROD servers
AWS: Roles
--> IAM (Identity and Access Management): Mammatha has only the Data Engineer role, not other roles
## Every Time
User/Control-M ---> shell scripts ---> PySpark jobs (abc.py)
                                  ---> Hive jobs (abc.sql)
                                  ---> HDFS commands (Linux commands, e.g. hadoop fs -put $SRC $TARGET)
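A minimal sketch of what one of the PySpark jobs (abc.py) launched by the wrapper shell script could look like; the app name, source path and target table are hypothetical placeholders, not the project's real ones.

# abc.py -- minimal PySpark job skeleton (hypothetical names and paths)
import sys

from pyspark.sql import SparkSession

def main(src_path, target_table):
    # Hive support is needed so the job can read/write Hive tables on EMR
    spark = (SparkSession.builder
             .appName("abc_job")
             .enableHiveSupport()
             .getOrCreate())

    # Read the incoming data (CSV in this sketch) from HDFS/S3
    df = spark.read.option("header", "true").csv(src_path)

    # ... transformations would go here ...

    # Write the result back as a Hive table stored as Parquet
    df.write.mode("overwrite").format("parquet").saveAsTable(target_table)

    spark.stop()

if __name__ == "__main__":
    # e.g. spark-submit abc.py /data/incoming/abc db_stage.abc_table
    main(sys.argv[1], sys.argv[2])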
Sources
Team1:
Source: CSV (.dat file with the data and a .done marker file)
Processing/Transformation 1: MapReduce (Java code) ---> do the processing ---> create the output file
Processing/Transformation 2: previous output files ---> Hive external table ---> create the lake system (storage) --> Parquet (a sketch of this step follows)
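One hedged way to picture Transformation 2, run here through spark.sql for illustration (the real DDL may be plain HiveQL via hive/beeline). Database names, table names, columns, delimiters and HDFS paths are all assumptions.

# Sketch: expose MapReduce output as a Hive external table, then build a Parquet lake table.
# All names, columns and locations are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# External table over the raw MapReduce output (comma-delimited text assumed)
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS stage_db.team1_output (
        id     STRING,
        name   STRING,
        amount DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/data/team1/mr_output'
""")

# Lake table stored as Parquet, partitioned by a load date column
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS lake_db.team1_data (
        id     STRING,
        name   STRING,
        amount DOUBLE
    )
    PARTITIONED BY (process_date STRING)
    STORED AS PARQUET
    LOCATION '/lake/team1_data'
""")

# If partition directories are written straight to HDFS, register them with the metastore
spark.sql("MSCK REPAIR TABLE lake_db.team1_data")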
Team2:
Source: Hive tables (Parquet files)
Processing: Spark read with Hive support enabled (enableHiveSupport()) --> then do some transformations -->
    CASE WHEN ID = '012' THEN 'TV'
         WHEN ID = '013' THEN 'Tablet'
df1: other transformations as well (withColumn, concatenating columns, adding a date column, converting dates to the correct format)
df2: checking the NULL values of specific columns
df3: checking string columns with a WHERE condition
df4: checking filter conditions
Finally the result DataFrame is written back to HDFS (a PySpark sketch follows below)
Target/Output: Hive table as Parquet
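A hedged PySpark sketch of the Team2 flow; the table names and columns (id, first_name, last_name, event_date, amount) are assumptions for illustration, not the project's real schema.

# Sketch of the Team2 transformations (hypothetical table and column names)
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Source: Hive table backed by Parquet files
df = spark.table("lake_db.source_table")

# CASE ID='012' THEN 'TV', ID='013' THEN 'Tablet'
df1 = df.withColumn(
    "device_type",
    F.when(F.col("id") == "012", "TV")
     .when(F.col("id") == "013", "Tablet")
     .otherwise("Unknown"))

# Other transformations: add/concatenate columns, add a date, fix the date format
df1 = (df1
       .withColumn("full_name", F.concat_ws(" ", "first_name", "last_name"))
       .withColumn("load_date", F.current_date())
       .withColumn("event_date", F.to_date("event_date", "dd-MM-yyyy")))

# Check the NULL values of specific columns
df2 = df1.filter(F.col("id").isNotNull() & F.col("event_date").isNotNull())

# Check string columns with a WHERE condition
df3 = df2.where(F.col("device_type") != "Unknown")

# Check filter conditions
df4 = df3.filter(F.col("amount") > 0)

# Finally, write the result DataFrame back to HDFS as a Parquet-backed Hive table
(df4.write
    .mode("overwrite")
    .format("parquet")
    .saveAsTable("lake_db.team2_result"))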
Team3:
Source/Input: read a Hive table
Processing: Spark with Python ===> PySpark
Output: write back to an RDBMS (Oracle): df.write.mode("overwrite" / "append" / "error").jdbc(...)
    Of the available save modes we used only overwrite (see the sketch below)
Output: S3 (access key and secret key)
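A hedged sketch of the Team3 outputs; the JDBC URL, credentials, bucket, paths and table names are placeholders, and in practice the keys would come from a credential store or instance role rather than being hard-coded.

# Sketch of the Team3 writes (connection details, bucket and table names are placeholders)
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("team3_output")
         # S3 access key / secret key for the s3a connector (placeholders only)
         .config("spark.hadoop.fs.s3a.access.key", "AKIA...")
         .config("spark.hadoop.fs.s3a.secret.key", "****")
         .enableHiveSupport()
         .getOrCreate())

# Source/Input: read a Hive table (name assumed)
df = spark.table("lake_db.team2_result")

# Output 1: write back to Oracle over JDBC; only "overwrite" mode was used
(df.write
   .format("jdbc")
   .option("url", "jdbc:oracle:thin:@//oracle-host:1521/ORCLPDB1")   # placeholder URL
   .option("dbtable", "TARGET_SCHEMA.TEAM3_RESULT")                  # placeholder table
   .option("user", "app_user")
   .option("password", "app_password")
   .option("driver", "oracle.jdbc.OracleDriver")
   .mode("overwrite")
   .save())

# Output 2: write the same data to S3 as Parquet
df.write.mode("overwrite").parquet("s3a://my-bucket/team3/output/")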
Sources: files (CSV) ---> incoming ---> location
Here the incoming location is a Linux server (the edge node)
Step 2: these incoming files are pushed to HDFS using the hadoop put command
Before going to Step 3: do the file validation in the shell script (count check): source file vs DONE file (FileName|Number Of Records) (see the Python sketch after this walkthrough)
Step 3: on top of the HDFS data we create the Hive external table
Note: if the external table is partitioned, make sure to run msck repair table $TABLE_NAME so the metastore picks up the partitions
Step 4: do some transformations on the Hive side and write to the HDFS target/final layer using insert into table ... partition($COLUMN) select * from STAGE_TABLE, for example:
-- dynamic partition insert (nonstrict mode is needed when no static partition value is given)
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE some_table
PARTITION (process_date)
SELECT a, b, c, process_date
FROM stage_table;
Target/Output: Hive Tables (HDFS Files as Parquet)
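The notes say the pre-Step-3 count validation lives in a shell script; the Python sketch below just illustrates the same check, assuming the .done file holds a single "FileName|Number Of Records" line. File names, paths and the exact ordering of validation vs. the HDFS push are assumptions.

# Sketch: validate the incoming .dat file against its .done control file,
# then push it to HDFS with "hadoop fs -put" (paths and names are hypothetical)
import subprocess
import sys

DAT_FILE = "/data/incoming/sales_20220914.dat"
DONE_FILE = "/data/incoming/sales_20220914.done"   # format assumed: FileName|Number Of Records
HDFS_TARGET = "/data/raw/sales/"

def record_count(path):
    # Count the data records in the source file
    with open(path) as f:
        return sum(1 for _ in f)

def expected_count(done_path):
    # The .done file carries the original file name and its record count
    with open(done_path) as f:
        file_name, count = f.readline().strip().split("|")
        return file_name, int(count)

file_name, expected = expected_count(DONE_FILE)
actual = record_count(DAT_FILE)

if actual != expected:
    print(f"Validation failed for {file_name}: expected {expected}, got {actual}")
    sys.exit(1)

# Count matches, so push the file to HDFS (Step 2)
subprocess.run(["hadoop", "fs", "-put", "-f", DAT_FILE, HDFS_TARGET], check=True)
print(f"{file_name} validated ({actual} records) and pushed to {HDFS_TARGET}")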
Jobs are submitted to YARN on the EMR cluster with spark-submit, for example:

$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn \
    --deploy-mode cluster \
    --num-executors 10 \
    --driver-memory 4g \
    --executor-memory 2g \
    --executor-cores 4 \
    --queue thequeue \
    lib/spark-examples*.jar \
    10
1. Data model (tables, table columns)
2. Star schema
3. Fact and dimension tables
4. OLAP vs OLTP
5. Which SQL is used in the project (the answer should be MySQL)
6. Why MySQL: it is similar to the Hive data warehouse system
7. SQL functions: aggregate and row functions
8. SQL joins: inner join, left outer join, right outer join (the answer should use left outer join)
9. SQL with subqueries (MySQL)
10. ACID properties
11. Data warehouse systems (Hive vs Teradata vs Redshift vs AWS Athena)
12. SCD tables: Type 1 vs Type 2
13. Why Hive does not support DML operations
14. Hive does not support ACID (by default)
15. Hive storage file formats (Text, Avro, JSON (single line), Parquet, ORC)