Incoming/Sources/Inputs: S3 files (CSV, Parquet) and RDBMS (Oracle, MySQL) data
Languages: Python, Shell Script
Cluster: AWS EMR (similar to Hortonworks)
Linux: Linux server (an EC2 instance)
Processing:
1. Hive SQL
2. MapReduce Java code
3. Spark (Spark SQL, i.e. Spark DataFrames)
Target: S3 and AWS Athena (similar to Hive external tables)
IDE: PyCharm
Source Repo: github.com
Tools: PuTTY, WinSCP (for development: copying files from Windows to Linux)
For UAT and PROD: deploy the files to the PROD servers using CI/CD
AWS: Roles
--> IAM (Identity and Access Management): Mammatha has only Data Engineer roles, not other roles

## Every Time
User/Control-M ---> Shell scripts ---> PySpark jobs (abc.py)
                                  ---> Hive jobs (abc.sql)
                                  ---> HDFS commands (e.g. hadoop fs -put $SRC $TARGET)
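As a rough illustration of the shell-script ---> PySpark step, here is a minimal sketch of what an abc.py job could look like; the paths and table name are placeholders, not the project's actual values.

```python
# abc.py -- minimal PySpark job skeleton (paths and table name are placeholders)
import sys
from pyspark.sql import SparkSession

def main(src_path, target_table):
    spark = (SparkSession.builder
             .appName("abc")
             .enableHiveSupport()  # needed to read/write Hive tables
             .getOrCreate())

    # read the incoming CSV that the shell script staged into HDFS
    df = spark.read.option("header", "true").csv(src_path)

    # ...transformations go here...

    # write the result as a Parquet-backed Hive table
    df.write.mode("overwrite").format("parquet").saveAsTable(target_table)
    spark.stop()

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])
```

A wrapper shell script would then call spark-submit abc.py $SRC $TARGET, matching the flow above.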
Sources
Team1:
Source: CSV (.dat file (data) and .done file (marker))
Processing/Transformation 1: MapReduce (Java code) ---> do the processing ---> create output files
Processing/Transformation 2: previous output files ---> Hive external table ---> create the lake system (storage) ---> Parquet
Team2:
Source: Hive tables (Parquet files)
Processing: Spark reads with Hive support enabled (enableHiveSupport()) --> then some transformations -->
  CASE WHEN ID='012' THEN 'TV'
  CASE WHEN ID='013' THEN 'Tablet'
df1 = + other transformations: withColumn, concatenating columns, adding a date, converting dates to the correct format
df2 = + checking the NULL values of specific columns
df3 = + checking string columns with a WHERE condition
df4 = + checking filter conditions
Finally the result DataFrame is written back to HDFS (see the sketch below)
Target/Output: Hive table as Parquet
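A hedged sketch of the Team2 chain in PySpark: the CASE logic and the df1-df4 checks come from the notes above, while the table names, extra column names, and date format are assumptions for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("team2").enableHiveSupport().getOrCreate()

df = spark.table("lake_db.team1_output")  # assumed source table name

# CASE WHEN ID='012' THEN 'TV' WHEN ID='013' THEN 'Tablet'
df1 = (df.withColumn("device",
                     F.when(F.col("ID") == "012", "TV")
                      .when(F.col("ID") == "013", "Tablet")
                      .otherwise("Unknown"))
         # concatenate columns, add a load date, normalize a date column
         .withColumn("full_name", F.concat_ws(" ", "first_name", "last_name"))
         .withColumn("load_date", F.current_date())
         .withColumn("event_date", F.to_date("event_date", "MM/dd/yyyy")))  # assumed input format

df2 = df1.filter(F.col("ID").isNotNull())               # NULL check on specific columns
df3 = df2.where(F.col("device") != "Unknown")           # string column with a WHERE condition
df4 = df3.filter(F.col("event_date") >= "2022-01-01")   # an extra filter condition

# write the final result back to HDFS as a Parquet-backed Hive table
df4.write.mode("overwrite").format("parquet").saveAsTable("lake_db.team2_output")
```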
Team3:
Source/Input: read a Hive table
Processing: Spark with Python ===> PySpark
Output: write back to RDBMS (Oracle): df.write.jdbc(...) with mode "overwrite"/"append"/"error"
  There are several save modes here, but we used only overwrite (a runnable sketch follows)
Output: S3 (access key and secret key)
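The df.write.jdbc(...) line above is shorthand; a runnable version of the Oracle write might look like the following, with the JDBC URL, credentials, table names, and S3 bucket all placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("team3").enableHiveSupport().getOrCreate()

df = spark.table("lake_db.team2_output")  # assumed source Hive table

# write back to Oracle over JDBC; only "overwrite" mode was used in practice
(df.write
   .mode("overwrite")
   .jdbc(url="jdbc:oracle:thin:@//db-host:1521/ORCLPDB1",  # placeholder URL
         table="TARGET_SCHEMA.TARGET_TABLE",               # placeholder table
         properties={"user": "etl_user",
                     "password": "****",
                     "driver": "oracle.jdbc.OracleDriver"}))

# alternative output: S3, authenticated with an access key and secret key
df.write.mode("overwrite").parquet("s3a://my-bucket/output/")  # placeholder bucket
```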
Sources: files (CSV) ---> incoming location
Here the incoming location is a Linux server (the edge node)
Step 2: push these incoming files to HDFS using the hadoop put command
Before going to Step 3: do the file validation in the shell script -- compare the record count of the source file against the .done file (FileName|NumberOfRecords); a Python sketch of this check follows this section
Step 3: on top of the HDFS data, create the Hive external table
Note: when creating an external table over existing partitioned data, make sure to run msck repair table $TABLE_NAME
Step 4: do some transformations on the Hive side and write to the HDFS target/final layer using insert overwrite table ... partition($COLUMN) select ... from STAGE_TABLE, for example:
INSERT OVERWRITE TABLE FINAL_TABLE
PARTITION
(
  process_date
)
SELECT A, B, C, process_date
FROM STAGE_TABLE;
Note: with dynamic partitioning, the partition column (process_date) must come last in the SELECT list.
Target/Output: Hive tables (HDFS files stored as Parquet)
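A minimal sketch, in Python, of the count validation described before Step 3; it assumes the .done file holds a single FileName|NumberOfRecords line, and that the wrapper script passes the paths in.

```python
import sys

def validate(data_path, done_path):
    # count the data records in the .dat file
    with open(data_path) as f:
        actual = sum(1 for _ in f)

    # the .done file carries "FileName|NumberOfRecords"
    with open(done_path) as f:
        _, expected = f.readline().strip().split("|")

    if actual != int(expected):
        sys.exit(f"Validation failed: {actual} records, expected {expected}")
    print("Validation OK")

if __name__ == "__main__":
    validate(sys.argv[1], sys.argv[2])
```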
Typical spark-submit for a job on YARN (newer Spark versions replace --master yarn-cluster with --master yarn --deploy-mode cluster):

$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn \
    --deploy-mode cluster \
    --num-executors 10 \
    --driver-memory 4g \
    --executor-memory 2g \
    --executor-cores 4 \
    --queue thequeue \
    lib/spark-examples*.jar \
    10
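With these flags the job asks YARN's thequeue queue for 10 executors of 2 GB and 4 cores each (40 cores and about 20 GB of executor memory in total, plus the 4 GB driver); the trailing 10 is the application's own argument (the number of slices for SparkPi).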
1. Data model (tables, table columns)
2. Star schema
3. Fact and dimension tables
4. OLAP vs OLTP
5. Which SQL was used in the project (answer must be MySQL)
6. Why MySQL: it is similar to the Hive data warehouse system
7. SQL functions: aggregation, row functions
8. SQL joins: inner join, left outer join, right outer join (answer should be left outer join; see the sketch after this list)
9. SQL with subqueries (MySQL)
10. ACID properties
11. Data warehouse systems (Hive vs Teradata vs Redshift vs AWS Athena)
12. SCD tables: Type 1 vs Type 2
13. Why Hive does not support row-level DML operations (by default)
14. Hive does not support ACID (except with transactional ORC tables)
15. Hive storage file formats (Text, Avro, JSON (single line), Parquet, ORC)
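For item 8, a minimal PySpark illustration of why the left outer join is the usual answer: it keeps every row from the left side even when there is no match on the right (the sample data here is made up).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-demo").getOrCreate()

orders = spark.createDataFrame(
    [(1, "012"), (2, "013"), (3, "099")], ["order_id", "device_id"])
devices = spark.createDataFrame(
    [("012", "TV"), ("013", "Tablet")], ["device_id", "device_name"])

# left outer join: order 3 survives with a NULL device_name,
# whereas an inner join would drop it
result = orders.join(devices, on="device_id", how="left")
result.show()
```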