DE Workshop
Session 1
What is Big Data? Why Big Data?
Some examples of where Big Data is necessary
Thinking in terms of MapReduce
HDFS vs S3 vs Local File System
Resources/Containers/Nodes
WordCount - Exercise
Extras-:
Joins in MapReduce
Input Splits
HDFS Internals
Data Skews
Session 2
ETL Pipelines and How they work (Ingestion/Cleaning/Processing/Output Churn)
Tying things together with Sqoop and Oozie
- Demo
Resource Management Basics - YARN vs Mesos
Enter Hive - WordCount
Sample ETL Oozie + Sqoop + MapR + Hive
Session 3
Different Databases - Pig vs HBase vs Hive
Analyzing Queries - Optimizations and why they make a huge difference in Big Data
Row vs Columnar stores + Compression - How to use and store your data effectively
Sample Application - Analyze and Optimize
Afterthoughts - Where does SQL fail? Why is the MapReduce API so complicated? Which basic optimizations should be done?
Alternatives - Flume/Crunch/Scoobi
Extras-:
BucketJoin
MapJoin
Other Optimizations - How does Hive do it?
CBO
Session 4
Flume/Crunch/Scoobi
Spark!
Running Programs through Spark Shell (Word Count)
Spark RDD Basics
Basic Optimizations - Watching for jobs - Checkpointing/Preventing Spills/Parallelism
Extras-:
Data Locality
Optimizing Jobs - Executor Cores / Memory Configuration / Optimizing Spills
Session 5 (Optional - Discuss?)
Scala Primer
How to think of Data Functionally and why it is important for ETL
Extras-:
Course on Scala
Session 6 (Data Analytics)
Notebooks Intro - IPython vs Zeppelin and others
Spark SQL Basics
Spark for basic Data Analytics
Intro to charts and visualization - How to represent data and make charts effectively
Basic Twitter Sentiment Analysis using Zeppelin/Spark Shell + SQL
Demos + Discussion
Extras-:
Discuss - R vs Spark vs R on Spark - When to use which?
Session 7 (Streaming)
Spark Streaming Demo
Spark vs Storm vs Kafka vs Flink (Opinions and Thoughts)
Spark Streaming basics
Exercise - Spark Streaming with Kafka (Live Twitter Sentiment Analysis)
Discussion
Extras-:
Real Time vs Near Real Time (Pros and Cons?)
Session 8 (Beyond DE - Beginning basic Data Science?) - Optional
Intro to ML - Where to begin and how - Resources?
Spark ML/TensorFlow
Demos + Discussion
--FIN--
Code Here -:
https://github.com/syedatifakhtar/DEWorkshop