@syedatifakhtar
Last active January 6, 2018 09:13
DE Workshop
Session 1
What is Big Data? Why Big Data?
Some Examples of where Big Data is necessary
Thinking in terms of MapReduce
HDFS vs S3 vs Local File System
Resources/Containers/Nodes
WordCount - Exercise
Extras-:
JOINS in MapReduce
Input Splits
HDFS Internals
Data Skews
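
For the WordCount exercise, the map/shuffle/reduce way of thinking can be sketched in plain Python before touching Hadoop. This is only an in-memory illustration under stated assumptions: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and a real MapReduce job would distribute each phase across nodes rather than run them in one process.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all values by key, the step a framework performs
    between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reducer: collapse all counts for one word into a single total."""
    return word, sum(counts)


def word_count(lines):
    """Run the three phases in sequence over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big", "data pipelines"]))
```

Keeping the mapper and reducer free of shared state is what lets a framework run many copies of each in parallel; the shuffle is the only step that needs to see all keys.
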
Session 2
ETL Pipelines and How they work (Ingestion/Cleaning/Processing/Output Churn)
Tying things together with Sqoop and Oozie
-Demo
Resource Management Basics - YARN vs Mesos
Enter Hive - WordCount
Sample ETL Oozie + Sqoop + MapR + Hive
Session 3
Different Databases - Pig vs HBase vs Hive
Analyzing Queries - Optimizations and why they make a huge difference in Big Data
Row vs Columnar stores + Compression - How to use and store your data effectively
Sample Application - Analyze and Optimize
Afterthoughts - Where does SQL fail? Why is the MapReduce API so complicated? Which basic optimizations should be done?
Alternatives - Flume/Crunch/Scoobi
Extras-:
BucketJoin
MapJoin
Other Optimizations - How does Hive do it?
CBO
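
The MapJoin item in the Session 3 extras can be illustrated with a small sketch of the general idea behind a map-side join: the small table is loaded into an in-memory hash table, and the large table is streamed past it, so no shuffle is needed. This is an assumption-laden illustration in plain Python, not how Hive implements it; the function name map_join and the sample column names are hypothetical.

```python
def map_join(big_rows, small_rows, key):
    """Inner-join two iterables of dict rows on `key`.

    The small side is fully materialized as a hash table (it must fit in
    memory); the big side is streamed row by row.
    """
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    for row in big_rows:
        # Rows of the big side with no match are dropped (inner join).
        for match in lookup.get(row[key], []):
            yield {**match, **row}


if __name__ == "__main__":
    # Hypothetical sample data: a small dimension table and a larger fact table.
    countries = [{"code": "IN", "name": "India"}, {"code": "US", "name": "USA"}]
    visits = [
        {"code": "IN", "hits": 3},
        {"code": "FR", "hits": 1},
        {"code": "US", "hits": 7},
    ]
    for joined in map_join(visits, countries, "code"):
        print(joined)
```

The trade-off is memory: the approach only works while the small side fits on every worker, which is why it is usually applied to dimension-style tables.
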
Session 4
Flume/Crunch/Scoobi
Spark!
Running Programs through Spark Shell (Word Count)
Spark RDD Basics
Basic Optimizations - Watching jobs - Checkpointing/Preventing Spills/Parallelism
Extras-:
Data Locality
Optimizing Jobs - Executor Cores / Memory Configuration / Optimizing Spills
Session 5 (Optional - Discuss?)
Scala Primer
How to think of Data Functionally and Why is it important for ETL?
Extras-:
Course on Scala
Session 6 (Data Analytics)
Intro to Notebooks - IPython vs Zeppelin and others
Spark SQL Basics
Spark for basic Data Analytics
Intro to charts and visualization - How to represent data and make charts effectively
Basic Twitter Sentiment Analysis using Zeppelin/Spark Shell + SQL
Demos + Discussion
Extras-:
Discuss - R vs Spark vs R on Spark - When to use which?
Session 7 (Streaming)
Spark Streaming Demo
Spark vs Storm vs Kafka vs Flink (Opinions and Thoughts)
Spark Streaming basics
Exercise - Spark Streaming with Kafka (Live Twitter Sentiment Analysis)
Discussion
Extras-:
Real Time vs Near Real Time (Pros vs Cons?)
Session 8 (Beyond DE - Beginning basic Data Science?) - Optional
Intro to ML - Where to begin and how - Resources?
Spark ML/TensorFlow
Demos + Discussion
--FIN--