syedatifakhtar · January 6, 2018 09:13 · syedatifakhtar · Aug 24, 2017
diff --git a/BEGINNERS_GUIDE_TO_DE.txt b/BEGINNERS_GUIDE_TO_DE.txt
 Session 1
 What is Big Data? Why Big Data?
 Some Examples of where Big Data is necessary
 Thinking in terms of MapReduce
 HDFS vs S3 vs Local File System
 Resources/Containers/Nodes
 WordCount - Exercise

 Extras-:
 JOINS in MapReduce
 Input Splits
 HDFS Internals
 Data Skews

 Session 2
 ETL Pipelines and How they work (Ingestion/Cleaning/Processing/Output Churn)
 Tying things together with Sqoop and Oozie
    -Demo
 Resource Management Basics - YARN vs Mesos 
 Enter Hive - WordCount
 Sample ETL Oozie + Sqoop + MapR + Hive

 Session 3
 Different Databases - Pig vs HBase vs Hive
 Analyzing Queries - Optimizations and why they make a huge difference in Big Data
 ROW vs Columnar stores + Compression - How to use and store your data effectively
 Sample Application - Analyze and Optimize
 Afterthoughts - Where does SQL fail? Why is the MapReduce API so complicated? Which basic optimizations should be done?
 Alternatives - Flume/Crunch/Scoobi

 Extras-:
 BucketJoin
 MapJoin
 Other Optimizations - How does Hive do it?
 CBO

 Session 4
 Flume/Crunch/Scoobi
 Spark!
 Running Programs through Spark Shell (Word Count)
 Spark RDD Basics
 Basic Optimizations - Watching for jobs - Checkpointing/Preventing Spills/Parallelism

 Extras-:
 Data Locality
 Optimizing Jobs - Executor Cores / Memory Configuration / Optimizing Spills 



 Session 5 (Optional - Discuss?)
 Scala Primer
 How to think of Data Functionally and Why it is important for ETL

 Extras-:
 Course on Scala

 Session 6 (Data Analytics)
 Notebooks Intro - IPython vs Zeppelin and others
 Spark SQL Basics
 Spark for basic Data Analytics
 Intro to charts and visualization - How to represent data and make charts effectively
 Basic Twitter Sentiment Analysis using Zeppelin/Spark Shell + SQL
 Demos + Discussion

 Extras-:
 Discuss - R vs Spark vs R on Spark - When to use which?

 Session 7 (Streaming)

 Spark Streaming Demo
 Spark vs Storm vs Kafka vs Flink (Opinions and Thoughts)
 Spark Streaming basics
 Exercise - Spark Streaming with Kafka (Live Twitter Sentiment Analysis)
 Discussion

 Extras-:
 Real Time vs Near Real Time (Pros vs Cons?)

 Session 8 (Beyond DE - Beginning basic Data Science?) - Optional
 Intro to ML -  Where to begin and how - Resources?
 Spark ML/TensorFlow
 Demos + Discussion

 --FIN--
	Session 1
	What is Big Data? Why Big Data?
	Some Examples of where Big Data is necessary
	Thinking in terms of MapReduce
	HDFS vs S3 vs Local File System
	Resources/Containers/Nodes
	WordCount - Exercise

	Extras-:
	JOINS in MapReduce
	Input Splits
	HDFS Internals
	Data Skews

	Session 2
	ETL Pipelines and How they work (Ingestion/Cleaning/Processing/Output Churn)
	Tying things together with Sqoop and Oozie
	-Demo
	Resource Management Basics - YARN vs Mesos
	Enter Hive - WordCount
	Sample ETL Oozie + Sqoop + MapR + Hive

	Session 3
	Different Databases - Pig vs HBase vs Hive
	Analyzing Queries - Optimizations and why they make a huge difference in Big Data
	ROW vs Columnar stores + Compression - How to use and store your data effectively
	Sample Application - Analyze and Optimize
	Afterthoughts - Where does SQL fail? Why is the MapReduce API so complicated? Which basic optimizations should be done?
	Alternatives - Flume/Crunch/Scoobi

	Extras-:
	BucketJoin
	MapJoin
	Other Optimizations - How does Hive do it?
	CBO

	Session 4
	Flume/Crunch/Scoobi
	Spark!
	Running Programs through Spark Shell (Word Count)
	Spark RDD Basics
	Basic Optimizations - Watching for jobs - Checkpointing/Preventing Spills/Parallelism

	Extras-:
	Data Locality
	Optimizing Jobs - Executor Cores / Memory Configuration / Optimizing Spills



	Session 5 (Optional - Discuss?)
	Scala Primer
	How to think of Data Functionally and Why it is important for ETL

	Extras-:
	Course on Scala

	Session 6 (Data Analytics)
	Notebooks Intro - IPython vs Zeppelin and others
	Spark SQL Basics
	Spark for basic Data Analytics
	Intro to charts and visualization - How to represent data and make charts effectively
	Basic Twitter Sentiment Analysis using Zeppelin/Spark Shell + SQL
	Demos + Discussion

	Extras-:
	Discuss - R vs Spark vs R on Spark - When to use which?

	Session 7 (Streaming)

	Spark Streaming Demo
	Spark vs Storm vs Kafka vs Flink (Opinions and Thoughts)
	Spark Streaming basics
	Exercise - Spark Streaming with Kafka (Live Twitter Sentiment Analysis)
	Discussion

	Extras-:
	Real Time vs Near Real Time (Pros vs Cons?)

	Session 8 (Beyond DE - Beginning basic Data Science?) - Optional
	Intro to ML - Where to begin and how - Resources?
	Spark ML/TensorFlow
	Demos + Discussion

	--FIN--
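
Note for the Session 1 WordCount exercise: a minimal sketch of the map / shuffle / reduce flow in plain Python, meant only to illustrate "Thinking in terms of MapReduce". Nothing here uses Hadoop; the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample input are made up for illustration, and a real job would run these phases in parallel across nodes rather than in one process.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group emitted values by key (the framework does this in a real job)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reducer: collapse all values seen for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in sequence in a single process (hypothetical driver)."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Illustrative input only; the course exercise would read files from HDFS/S3/local disk.
    sample = ["big data is big", "data is everywhere"]
    print(word_count(sample))
```

The mapper and reducer never share state, which is the property that lets a framework spread them over many containers; the shuffle step is where data skew (one key with far more values than the others) would show up.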