Last active January 6, 2018 09:13
DE Workshop
Session 1
What is Big Data? Why Big Data?
Some Examples of where Big Data is necessary
Thinking in terms of MapReduce
HDFS vs S3 vs Local File System
WordCount - Exercise
JOINS in MapReduce
Input Splits
HDFS Internals
Data Skews
Session 2
ETL Pipelines and How they work (Ingestion/Cleaning/Processing/Output Churn)
Tying things together with Sqoop and Oozie
Resource Management Basics - YARN vs Mesos
Enter Hive - WordCount
Sample ETL Oozie + Sqoop + MapR + Hive
Session 3
Different Databases - Pig vs HBase vs Hive
Analyzing Queries - Optimizations and why they make a huge difference in Big Data
Row vs Columnar stores + Compression - How to use and store your data effectively
Sample Application - Analyze and Optimize
Afterthoughts - Where does SQL fail? Why is the MapReduce API so complicated? Which basic optimizations should be done?
Alternatives - Flume/Crunch/Scoobi
Other Optimizations - How does Hive do it?
Session 4
Running Programs through Spark Shell (Word Count)
Spark RDD Basics
Basic Optimizations - Watching for jobs - Checkpointing/Preventing Spills/Parallelism
Data Locality
Optimizing Jobs - Executor Cores / Memory Configuration / Optimizing Spills
Session 5 (Optional - Discuss?)
Scala Primer
How to think of Data Functionally and Why is it important for ETL?
Course on Scala
Session 6 (Data Analytics)
Notebooks Intro - IPython vs Zeppelin and others
Spark SQL Basics
Spark for basic Data Analytics
Intro to charts and visualization - How to represent data and make charts effectively
Basic Twitter Sentiment Analysis using Zeppelin/Spark Shell + SQL
Demos + Discussion
Discuss - R vs Spark vs R on Spark - When to use which?
Session 7 (Streaming)
Spark Streaming Demo
Spark vs Storm vs Kafka vs Flink (Opinions and Thoughts)
Spark Streaming basics
Exercise - Spark Streaming with Kafka (Live Twitter Sentiment Analysis)
Real Time vs Near Real Time (Pros vs Cons?)
Session 8 (Beyond DE - Beginning basic Data Science?) - Optional
Intro to ML - Where to begin and how - Resources?
Spark ML/TensorFlow
Demos + Discussion