
@mamun67
Created January 13, 2019 20:28
Hadoop Basics
Pre-Hadoop 2.2
Two main components
-Distributed file system (HDFS)
-MapReduce engine
HDFS runs on top of the existing file system
Designed to handle very large files
HDFS file blocks
-Not the same as OS file blocks
-Default for Hadoop is 64MB
-Blocks for a single file are spread across multiple nodes in the cluster
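The block layout of a file can be inspected with hdfs fsck; a minimal sketch against a running cluster (the path is a hypothetical example):

```shell
# Show which blocks make up a file and which DataNodes hold
# each replica (path is a hypothetical example):
hdfs fsck /user/hadoop/big.dat -files -blocks -locations
```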
Name Node
Only one per Hadoop cluster
Manages the filesystem namespace and metadata
-data does not go through the NameNode
-data is not stored on the NameNode
Single point of failure
-good idea to mirror the NameNode
-Do not use inexpensive, commodity hardware
Data Nodes
Many per Hadoop cluster
Blocks from different files can be stored on the same DataNode
Manages blocks of data and serves them to clients
Periodically reports to the NameNode the list of blocks it stores
Suitable for inexpensive, commodity hardware
JobTracker
Manages the MapReduce jobs in the cluster
One per Hadoop cluster
Receives job requests submitted by clients
Schedules and monitors MapReduce jobs on TaskTrackers
TaskTracker
Executes MapReduce operations
-Runs the MapReduce tasks in JVMs
-Has a set number of slots used to run tasks
-Communicates with the JobTracker via heartbeat messages
-Reads blocks from DataNodes
Hadoop 2.2 architecture
Provides YARN
-Referred to as MapReduce v2
-Resource manager and scheduler external to any framework
-DataNodes still exist
-JobTracker and TaskTracker no longer exist
Two main ideas
Provide generic scheduling and resource management
More efficient scheduling and workload management
HDFS command line interface
The FileSystem (FS) shell is invoked by
hdfs dfs <args>
Example to list the current directory in HDFS
hdfs dfs -ls
All FS shell commands take URIs as arguments
scheme://authority/path
Scheme
-Scheme for HDFS is hdfs
-Scheme for the local filesystem is file
hdfs dfs -cp file:///sampleData/spark.myfile.txt hdfs://
Scheme and authority are optional
Defaults are taken from the core-site.xml configuration file
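Because scheme and authority default from core-site.xml, a fully qualified URI and a bare path can name the same file. A sketch, assuming fs.defaultFS points at a NameNode reachable as namenode-host (hostname and port are assumptions):

```shell
# Equivalent when fs.defaultFS is hdfs://namenode-host:8020
# (hostname and port are assumptions):
hdfs dfs -ls /user/hadoop
hdfs dfs -ls hdfs://namenode-host:8020/user/hadoop
```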
Most of the FS shell commands behave like corresponding UNIX commands
A number of POSIX-like commands
-cat, chgrp, chmod, chown, cp , du, ls, mkdir, mv, rm , stat, tail
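A short usage sketch of the POSIX-like commands (all paths are hypothetical examples):

```shell
hdfs dfs -mkdir -p /user/hadoop/data   # create a directory tree
hdfs dfs -ls /user/hadoop              # list its contents
hdfs dfs -du -h /user/hadoop/data      # per-entry sizes, human-readable
hdfs dfs -cat /user/hadoop/data/a.txt  # print a file to stdout
hdfs dfs -rm /user/hadoop/data/a.txt   # delete a file
```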
Some HDFS-specific commands
-copyFromLocal, copyToLocal, get, getmerge, put, setrep
copyFromLocal/put
copies files from the local filesystem to HDFS
copyToLocal/get
copies files from HDFS to the local filesystem
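A usage sketch for the copy commands (filenames are hypothetical):

```shell
# local -> HDFS (put and copyFromLocal are equivalent here)
hdfs dfs -put access.log /user/hadoop/logs/
hdfs dfs -copyFromLocal access.log /user/hadoop/logs/
# HDFS -> local (get and copyToLocal are equivalent here)
hdfs dfs -get /user/hadoop/logs/access.log ./access.copy.log
```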
getmerge
gets all files in the directories that match the source pattern
-Merges and sorts them into a single file on the local filesystem
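A typical use of getmerge is collecting the part files of a job's output directory into one local file (paths are hypothetical):

```shell
# Concatenate every file under the HDFS output directory
# into a single local file (paths are hypothetical):
hdfs dfs -getmerge /user/hadoop/output ./merged.txt
```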
setrep
Sets the replication factor of a file
Can be executed recursively to change an entire tree
can specify to wait until the replication level is achieved
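A setrep sketch showing the recursive and wait options (paths are hypothetical):

```shell
# Set replication to 3 and wait (-w) until each block reaches it:
hdfs dfs -setrep -w 3 /user/hadoop/important.dat
# Recursively (-R) set replication to 2 for an entire tree:
hdfs dfs -setrep -R 2 /user/hadoop/archive
```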
Adding or removing nodes from the cluster
-Can be performed from the Ambari web console
-Need the IP address or hostname of the node to add
-Node must be reachable
-BigInsights must NOT already be installed on the node to add
/etc/hosts on both master and child nodes should be updated prior to adding child nodes
hadoop-env.sh Environment variables that are used in the scripts to run Hadoop
core-site.xml Configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce
hdfs-site.xml Configuration settings for the HDFS daemons: the NameNode, the secondary NameNode, and the DataNodes
mapred-site.xml Configuration settings for the MapReduce daemons: the JobTracker and the TaskTrackers
masters A list of machines (one per line) that each run a secondary NameNode
slaves A list of machines (one per line) that each run a DataNode and a TaskTracker
hadoop-metrics.properties Properties for controlling how metrics are published in Hadoop
log4j.properties Properties for system logfiles, the NameNode audit log, and the task log for the TaskTracker child process
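As one concrete example, the default filesystem used by the FS shell is set in core-site.xml; a minimal fragment (hostname and port are assumptions):

```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode-host:8020</value>
  </property>
</configuration>
```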
MapReduce
Processes huge datasets for certain kinds of
distributable problems using a large number of nodes
Map
Master node partitions the input into smaller sub-problems
Distributes the sub-problems to worker nodes
Reduce
Master node then takes the answers to all the sub-problems
Combines them in some way to get the output
Allows for distributed processing of the map and reduce operations
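A classic way to see map and reduce in action is the wordcount example that ships with Hadoop; a sketch, assuming a Hadoop 2.x layout ($HADOOP_HOME and the input/output paths are assumptions):

```shell
# Stage input, run the bundled wordcount job, read the result:
hdfs dfs -mkdir -p /user/hadoop/wc-in
hdfs dfs -put input.txt /user/hadoop/wc-in/
hadoop jar "$HADOOP_HOME"/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    wordcount /user/hadoop/wc-in /user/hadoop/wc-out
hdfs dfs -cat /user/hadoop/wc-out/part-r-00000
```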
Pig
Developed at Yahoo
Data flow language
Can operate on complex, nested data structures
Schema optional
Relationally complete
Turing complete when extended with Java UDFs
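A small Pig Latin sketch of the data-flow style, run in local mode (the access.log file and its column layout are assumptions):

```shell
# Count hits per IP from a space-separated log (file and
# layout are assumptions), using Pig's local execution mode:
pig -x local <<'PIGEOF'
logs   = LOAD 'access.log' USING PigStorage(' ')
         AS (ip:chararray, url:chararray);
by_ip  = GROUP logs BY ip;
counts = FOREACH by_ip GENERATE group AS ip, COUNT(logs) AS hits;
DUMP counts;
PIGEOF
```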
Hive
Developed at Facebook
Declarative language (SQL dialect)
Schema required, but data can have many schemas
Relationally complete
Turing complete when extended with Java UDFs
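A small HiveQL sketch of the declarative style (the table name and columns are assumptions):

```shell
# Define a table over delimited text and run a grouped count
# (table name and columns are assumptions):
hive -e "
CREATE TABLE IF NOT EXISTS pageviews (ip STRING, url STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
SELECT ip, COUNT(*) AS hits FROM pageviews GROUP BY ip;
"
```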
Flume
A service for moving large amounts of data around a cluster soon after the data is produced