Incoming/Sources/Inputs: S3 files (CSV, Parquet) and RDBMS (Oracle, MySQL) data
Languages: Python, shell script
Cluster: AWS EMR (it is like Hortonworks)
Linux: Linux server (it is EC2)
Processing:
1. Hive SQL
2. MapReduce Java code
3. Spark (Spark SQL, i.e. Spark DataFrames)
Target: S3 and AWS Athena (like Hive external tables)
IDE: PyCharm
Source repo: github.com
Tools: PuTTY, WinSCP (for development: copying files from Windows to Linux)
For UAT and PROD: use CI/CD to deploy the files to the PROD servers
AWS: Roles
--> IAM (Identity and Access Management): Mammatha has only the Data Engineer role, not other roles
## Every Time
User/Control-M ---> shell scripts ---> PySpark jobs (abc.py)
                                  ---> Hive jobs (abc.sql)
                                  ---> HDFS commands (Linux commands, e.g. hadoop fs -put $SRC $TARGET)
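A minimal sketch of what one of the PySpark jobs (abc.py) launched by the wrapper shell script could look like; the app name, source path and target table are hypothetical placeholders, not the project's real ones.

# abc.py -- minimal PySpark job skeleton (hypothetical names and paths)
import sys

from pyspark.sql import SparkSession

def main(src_path, target_table):
    # Hive support is needed so the job can read/write Hive tables on EMR
    spark = (SparkSession.builder
             .appName("abc_job")
             .enableHiveSupport()
             .getOrCreate())

    # Read the incoming data (CSV in this sketch) from HDFS/S3
    df = spark.read.option("header", "true").csv(src_path)

    # ... transformations would go here ...

    # Write the result back as a Hive table stored as Parquet
    df.write.mode("overwrite").format("parquet").saveAsTable(target_table)

    spark.stop()

if __name__ == "__main__":
    # e.g. spark-submit abc.py /data/incoming/abc db_stage.abc_table
    main(sys.argv[1], sys.argv[2])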
Sources
Team1:
Source: CSV (.dat file with the data and a .done marker file)
Processing/Transformation 1: MapReduce (Java code) ---> do the processing ---> create the output file
Processing/Transformation 2: previous output files ---> Hive external table ---> create the lake system (storage) --> Parquet (a sketch of this step follows)
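One hedged way to picture Transformation 2, run here through spark.sql for illustration (the real DDL may be plain HiveQL via hive/beeline). Database names, table names, columns, delimiters and HDFS paths are all assumptions.

# Sketch: expose MapReduce output as a Hive external table, then build a Parquet lake table.
# All names, columns and locations are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# External table over the raw MapReduce output (comma-delimited text assumed)
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS stage_db.team1_output (
        id     STRING,
        name   STRING,
        amount DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/data/team1/mr_output'
""")

# Lake table stored as Parquet, partitioned by a load date column
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS lake_db.team1_data (
        id     STRING,
        name   STRING,
        amount DOUBLE
    )
    PARTITIONED BY (process_date STRING)
    STORED AS PARQUET
    LOCATION '/lake/team1_data'
""")

# If partition directories are written straight to HDFS, register them with the metastore
spark.sql("MSCK REPAIR TABLE lake_db.team1_data")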
Team2:
Source: Hive tables (Parquet files)
Processing: Spark read with Hive support enabled (enableHiveSupport()) --> then do some transformations -->
    CASE WHEN ID = '012' THEN 'TV'
         WHEN ID = '013' THEN 'Tablet'
df1: other transformations as well (withColumn, concatenating columns, adding a date column, converting dates to the correct format)
df2: checking the NULL values of specific columns
df3: checking string columns with a WHERE condition
df4: checking filter conditions
Finally the result DataFrame is written back to HDFS (a PySpark sketch follows below)
Target/Output: Hive table as Parquet
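A hedged PySpark sketch of the Team2 flow; the table names and columns (id, first_name, last_name, event_date, amount) are assumptions for illustration, not the project's real schema.

# Sketch of the Team2 transformations (hypothetical table and column names)
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Source: Hive table backed by Parquet files
df = spark.table("lake_db.source_table")

# CASE ID='012' THEN 'TV', ID='013' THEN 'Tablet'
df1 = df.withColumn(
    "device_type",
    F.when(F.col("id") == "012", "TV")
     .when(F.col("id") == "013", "Tablet")
     .otherwise("Unknown"))

# Other transformations: add/concatenate columns, add a date, fix the date format
df1 = (df1
       .withColumn("full_name", F.concat_ws(" ", "first_name", "last_name"))
       .withColumn("load_date", F.current_date())
       .withColumn("event_date", F.to_date("event_date", "dd-MM-yyyy")))

# Check the NULL values of specific columns
df2 = df1.filter(F.col("id").isNotNull() & F.col("event_date").isNotNull())

# Check string columns with a WHERE condition
df3 = df2.where(F.col("device_type") != "Unknown")

# Check filter conditions
df4 = df3.filter(F.col("amount") > 0)

# Finally, write the result DataFrame back to HDFS as a Parquet-backed Hive table
(df4.write
    .mode("overwrite")
    .format("parquet")
    .saveAsTable("lake_db.team2_result"))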
Team3:
Source/Input: read a Hive table
Processing: Spark with Python ===> PySpark
Output: write back to an RDBMS (Oracle): df.write.mode("overwrite" / "append" / "error").jdbc(...)
    Of the available save modes we used only overwrite (see the sketch below)
Output: S3 (access key and secret key)
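A hedged sketch of the Team3 outputs; the JDBC URL, credentials, bucket, paths and table names are placeholders, and in practice the keys would come from a credential store or instance role rather than being hard-coded.

# Sketch of the Team3 writes (connection details, bucket and table names are placeholders)
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("team3_output")
         # S3 access key / secret key for the s3a connector (placeholders only)
         .config("spark.hadoop.fs.s3a.access.key", "AKIA...")
         .config("spark.hadoop.fs.s3a.secret.key", "****")
         .enableHiveSupport()
         .getOrCreate())

# Source/Input: read a Hive table (name assumed)
df = spark.table("lake_db.team2_result")

# Output 1: write back to Oracle over JDBC; only "overwrite" mode was used
(df.write
   .format("jdbc")
   .option("url", "jdbc:oracle:thin:@//oracle-host:1521/ORCLPDB1")   # placeholder URL
   .option("dbtable", "TARGET_SCHEMA.TEAM3_RESULT")                  # placeholder table
   .option("user", "app_user")
   .option("password", "app_password")
   .option("driver", "oracle.jdbc.OracleDriver")
   .mode("overwrite")
   .save())

# Output 2: write the same data to S3 as Parquet
df.write.mode("overwrite").parquet("s3a://my-bucket/team3/output/")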
Sources: files (CSV) ---> incoming ---> location
Here the incoming location is a Linux server (the edge node)
Step 2: these incoming files are pushed to HDFS using the hadoop put command
Before going to Step 3: do the file validation in the shell script (count check): source file vs DONE file (FileName|Number Of Records) (see the Python sketch after this walkthrough)
Step 3: on top of the HDFS data we create the Hive external table
Note: if the external table is partitioned, make sure to run msck repair table $TABLE_NAME so the metastore picks up the partitions
Step 4: do some transformations on the Hive side and write to the HDFS target/final layer using insert into table ... partition($COLUMN) select * from STAGE_TABLE, for example:
-- dynamic partition insert (nonstrict mode is needed when no static partition value is given)
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE some_table
PARTITION (process_date)
SELECT a, b, c, process_date
FROM stage_table;
Target/Output: Hive Tables (HDFS Files as Parquet)
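The notes say the pre-Step-3 count validation lives in a shell script; the Python sketch below just illustrates the same check, assuming the .done file holds a single "FileName|Number Of Records" line. File names, paths and the exact ordering of validation vs. the HDFS push are assumptions.

# Sketch: validate the incoming .dat file against its .done control file,
# then push it to HDFS with "hadoop fs -put" (paths and names are hypothetical)
import subprocess
import sys

DAT_FILE = "/data/incoming/sales_20220914.dat"
DONE_FILE = "/data/incoming/sales_20220914.done"   # format assumed: FileName|Number Of Records
HDFS_TARGET = "/data/raw/sales/"

def record_count(path):
    # Count the data records in the source file
    with open(path) as f:
        return sum(1 for _ in f)

def expected_count(done_path):
    # The .done file carries the original file name and its record count
    with open(done_path) as f:
        file_name, count = f.readline().strip().split("|")
        return file_name, int(count)

file_name, expected = expected_count(DONE_FILE)
actual = record_count(DAT_FILE)

if actual != expected:
    print(f"Validation failed for {file_name}: expected {expected}, got {actual}")
    sys.exit(1)

# Count matches, so push the file to HDFS (Step 2)
subprocess.run(["hadoop", "fs", "-put", "-f", DAT_FILE, HDFS_TARGET], check=True)
print(f"{file_name} validated ({actual} records) and pushed to {HDFS_TARGET}")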
Jobs are submitted to YARN on the EMR cluster with spark-submit, for example:

$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn \
    --deploy-mode cluster \
    --num-executors 10 \
    --driver-memory 4g \
    --executor-memory 2g \
    --executor-cores 4 \
    --queue thequeue \
    lib/spark-examples*.jar \
    10
1. Data model (tables, table columns)
2. Star schema
3. Fact and dimension tables
4. OLAP vs OLTP
5. Which SQL is used in the project (the answer should be MySQL)
6. Why MySQL: it is similar to the Hive data warehouse system
7. SQL functions: aggregate and row functions
8. SQL joins: inner join, left outer join, right outer join (the answer should use left outer join)
9. SQL with subqueries (MySQL)
10. ACID properties
11. Data warehouse systems (Hive vs Teradata vs Redshift vs AWS Athena)
12. SCD tables: Type 1 vs Type 2
13. Why Hive does not support DML operations
14. Hive does not support ACID (by default)
15. Hive storage file formats (Text, Avro, JSON (single line), Parquet, ORC)