Is an open-source software framework for the storage and large-scale processing of datasets on clusters of commodity hardware.
- Scalability
- Reliability
Keeps all the data in raw format and uses a schema-on-read style.
- Hadoop Common: libraries and utilities used by the other Hadoop modules
- Hadoop Distributed File System (HDFS): distributed file system that stores data.
- Hadoop MapReduce: programming model for large-scale data processing (see the WordCount sketch after this list)
- Hadoop YARN: resource management platform responsible for managing compute resources in the cluster
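
To make the MapReduce programming model concrete, here is the canonical WordCount job in Java: the mapper emits (word, 1) pairs and the reducer sums them per word. Input and output paths come from the command line; this is a minimal sketch, not a tuned production job.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in the input line
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts emitted for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```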
Distributed, scalable, and portable file system written in Java for the Hadoop framework.
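
Applications usually reach HDFS through the Java FileSystem API. A minimal sketch, assuming a core-site.xml on the classpath points fs.defaultFS at the cluster (the path /tmp/hello.txt is hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    // Reads fs.defaultFS from core-site.xml on the classpath
    Configuration conf = new Configuration();
    try (FileSystem fs = FileSystem.get(conf)) {
      Path path = new Path("/tmp/hello.txt"); // hypothetical path

      // Write a small file (overwrite if it already exists)
      try (FSDataOutputStream out = fs.create(path, true)) {
        out.writeUTF("hello from HDFS");
      }

      // Read it back
      try (FSDataInputStream in = fs.open(path)) {
        System.out.println(in.readUTF());
      }
    }
  }
}
```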
Tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
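
Sqoop is driven from the command line; a typical import pulls a table into HDFS over JDBC. A hedged example (the JDBC URL, credentials, table, and target directory are all hypothetical):

```sh
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username analyst -P \
  --table orders \
  --target-dir /user/analyst/orders \
  --num-mappers 4
```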
Is a key component of the Hadoop stack
- Column-oriented database management system
- Key-value store (see the client sketch after this list)
- Based on Google Big Table
- Can hold extremely large data
- Dynamic data model
- Not a relational DBMS
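
A minimal sketch of the key-value style using the standard HBase Java client; the users table and its info column family are hypothetical and assumed to exist already:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("users"))) { // hypothetical table

      // Put a cell: row key "row1", column family "info", qualifier "name"
      Put put = new Put(Bytes.toBytes("row1"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
      table.put(put);

      // Get the row back by key
      Result result = table.get(new Get(Bytes.toBytes("row1")));
      byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
      System.out.println(Bytes.toString(name));
    }
  }
}
```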
It's a scripting language
- High-level programming on top of Hadoop MapReduce (see the sketch after this section)
- Pig Latin
- Data analysis problems as data flows
- Originally developed at Yahoo in 2006
UDF: user-defined functions
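
Pig Latin scripts can also be driven from Java through PigServer. A minimal word-count sketch, assuming local mode and a hypothetical input.txt (each registerQuery line is plain Pig Latin, expressing the analysis as a data flow):

```java
import org.apache.pig.PigServer;

public class PigExample {
  public static void main(String[] args) throws Exception {
    // "local" runs Pig against the local filesystem; use "mapreduce" on a cluster
    PigServer pig = new PigServer("local");

    // Pig Latin: each statement defines one step in the data flow
    pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
    pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
    pig.registerQuery("grouped = GROUP words BY word;");
    pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");

    // Materialize the flow and write the results out
    pig.store("counts", "word_counts");
  }
}
```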
Data warehouse software that facilitates querying and managing large datasets residing in distributed storage.
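
Hive exposes its SQL-like interface through HiveServer2, which plain JDBC can reach. A minimal sketch, assuming a hypothetical server on localhost:10000, a hypothetical visits table, and the Hive JDBC driver on the classpath:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
  public static void main(String[] args) throws Exception {
    // Register the driver explicitly (older JDKs/driver versions need this)
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    // HiveServer2 JDBC endpoint; host, port, and credentials are hypothetical
    String url = "jdbc:hive2://localhost:10000/default";
    try (Connection conn = DriverManager.getConnection(url, "hive", "");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT page, COUNT(*) AS hits FROM visits GROUP BY page")) {
      while (rs.next()) {
        System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
      }
    }
  }
}
```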
- Workflow scheduler system to manage Apache Hadoop jobs
- Oozie Coordinator jobs: recurrent workflows triggered by time (frequency) and data availability
- Supports several types of Hadoop jobs: MapReduce, Pig, Hive, and Sqoop (see the client sketch after this list)
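
Workflows can be submitted programmatically with the Oozie Java client. A minimal sketch; the server URL and the HDFS application path are hypothetical, and a workflow.xml is assumed to be deployed there already:

```java
import java.util.Properties;
import org.apache.oozie.client.OozieClient;

public class OozieExample {
  public static void main(String[] args) throws Exception {
    // URL of the Oozie server (hypothetical host)
    OozieClient client = new OozieClient("http://oozie.example.com:11000/oozie");

    // Point the job at a deployed workflow application on HDFS (hypothetical path)
    Properties conf = client.createConfiguration();
    conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/alice/wordcount-wf");

    // Submit and start the workflow, then poll its status
    String jobId = client.run(conf);
    System.out.println(client.getJobInfo(jobId).getStatus());
  }
}
```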
- Provides operational services for a Hadoop cluster
- Centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services
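
A minimal sketch with the ZooKeeper Java client: store one piece of configuration under a znode and read it back. The ensemble address and znode name are hypothetical, and production code would wait for the connection event before issuing requests:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkExample {
  public static void main(String[] args) throws Exception {
    // Connect to a ZooKeeper ensemble (address is hypothetical); the lambda
    // is a no-op Watcher for connection events
    ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 3000, event -> {});

    // Store a piece of configuration under a znode
    zk.create("/app-config", "v1".getBytes(),
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

    // Any client in the cluster can now read (and watch) it
    byte[] data = zk.getData("/app-config", false, null);
    System.out.println(new String(data));

    zk.close();
  }
}
```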
- Distributed, reliable and available service for efficiently collecting, aggregating and moving large amounts of log data.
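
A Flume agent is wired together in a properties file. A hedged sketch that tails an application log into HDFS through an in-memory channel (the agent, source, channel, and sink names and all paths are hypothetical):

```properties
# Name the components of this agent
agent.sources = logsrc
agent.channels = memch
agent.sinks = hdfssink

# Source: tail a local log file
agent.sources.logsrc.type = exec
agent.sources.logsrc.command = tail -F /var/log/app.log
agent.sources.logsrc.channels = memch

# Channel: buffer events in memory
agent.channels.memch.type = memory

# Sink: write events into date-partitioned HDFS directories
agent.sinks.hdfssink.type = hdfs
agent.sinks.hdfssink.hdfs.path = /flume/events/%Y-%m-%d
agent.sinks.hdfssink.hdfs.useLocalTimeStamp = true
agent.sinks.hdfssink.channel = memch
```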
Is Cloudera's open-source massively parallel processing (MPP) SQL query engine for data stored in a Hadoop cluster.
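
Impala speaks the HiveServer2 protocol, so the same Hive JDBC driver can query it, by default on port 21050. A hedged sketch (host, database, and table are hypothetical; auth=noSasl assumes an unsecured cluster):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ImpalaJdbcExample {
  public static void main(String[] args) throws Exception {
    // Impala's default HiveServer2-compatible port is 21050
    String url = "jdbc:hive2://impala.example.com:21050/default;auth=noSasl";
    try (Connection conn = DriverManager.getConnection(url);
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM orders")) {
      rs.next();
      System.out.println(rs.getLong(1));
    }
  }
}
```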
Is a fast and general engine for large-scale data processing
- Multi-stage in-memory primitives provide performance up to 100 times faster than MapReduce for certain applications
- Allows user programs to load data into a cluster's memory and query it repeatedly (see the sketch after this list)
- Well-suited to machine learning
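
A minimal Java sketch of the in-memory style: cache() pins the loaded data, so the second action reuses memory instead of re-reading the file. Here data.txt is a hypothetical input, and local[*] is just for a quick local run:

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.SparkSession;

public class SparkCacheExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("cache-example")
        .master("local[*]") // for a quick local run; omit on a cluster
        .getOrCreate();

    // Load a text file and pin it in memory
    JavaRDD<String> lines = spark.read().textFile("data.txt").javaRDD().cache();

    // The first action reads from disk and populates the cache...
    System.out.println("total lines: " + lines.count());

    // ...subsequent actions reuse the in-memory data, which is why
    // iterative workloads such as machine learning benefit so much
    System.out.println("errors: " + lines.filter(l -> l.contains("ERROR")).count());

    spark.stop();
  }
}
```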