
@mamun67
Created January 13, 2019 20:28
Hadoop Basics
Pre-Hadoop 2.2
Two main components
-Distributed file system (HDFS)
-MapReduce engine
HDFS runs on top of the existing file system
Designed to handle very large files
HDFS file blocks
-Not the same as OS file blocks
-Default for Hadoop is 64MB
-Blocks for a single file are spread across multiple nodes in the cluster
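The block layout of a file can be inspected with hdfs fsck; a minimal sketch against a running cluster (the path is a hypothetical example):

```shell
# Show which blocks make up a file and which DataNodes hold
# each replica (path is a hypothetical example):
hdfs fsck /user/hadoop/big.dat -files -blocks -locations
```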
Name Node
Only one per Hadoop cluster
Manages the filesystem namespace and metadata
-data does not go through the NameNode
-data is not stored on the NameNode
Single point of failure
-good idea to mirror the NameNode
-Do not use inexpensive, commodity hardware
Data Nodes
Many per Hadoop cluster
Blocks from different files can be stored on the same DataNode
Manages blocks of data and serves them to clients
Periodically reports to the NameNode the list of blocks it stores
Suitable for inexpensive, commodity hardware
JobTracker
Manages the MapReduce jobs in the cluster
One per Hadoop cluster
Receives job requests submitted by clients
Schedules and monitors MapReduce jobs on TaskTrackers
TaskTracker
Executes MapReduce operations
-Runs the MapReduce tasks in JVMs
-Has a set number of slots used to run tasks
-Communicates with the JobTracker via heartbeat messages
-Reads blocks from DataNodes
Hadoop 2.2 architecture
Provides YARN
-Referred to as MapReduce v2
-Resource manager and scheduler external to any framework
-DataNodes still exist
-JobTracker and TaskTracker no longer exist
Two main ideas
Provide generic scheduling and resource management
More efficient scheduling and workload management
HDFS command line interface
The FileSystem (FS) shell is invoked by
hdfs dfs <args>
Example to list the current directory in HDFS
hdfs dfs -ls
All FS shell commands take URIs as arguments
scheme://authority/path
Scheme
-Scheme for HDFS is hdfs
-Scheme for the local filesystem is file
hdfs dfs -cp file:///sampleData/spark.myfile.txt hdfs://
Scheme and authority are optional
Defaults are taken from the core-site.xml configuration file
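Because scheme and authority default from core-site.xml, a fully qualified URI and a bare path can name the same file. A sketch, assuming fs.defaultFS points at a NameNode reachable as namenode-host (hostname and port are assumptions):

```shell
# Equivalent when fs.defaultFS is hdfs://namenode-host:8020
# (hostname and port are assumptions):
hdfs dfs -ls /user/hadoop
hdfs dfs -ls hdfs://namenode-host:8020/user/hadoop
```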
Most of the FS shell commands behave like corresponding UNIX commands
A number of POSIX-like commands
-cat, chgrp, chmod, chown, cp , du, ls, mkdir, mv, rm , stat, tail
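A short usage sketch of the POSIX-like commands (all paths are hypothetical examples):

```shell
hdfs dfs -mkdir -p /user/hadoop/data   # create a directory tree
hdfs dfs -ls /user/hadoop              # list its contents
hdfs dfs -du -h /user/hadoop/data      # per-entry sizes, human-readable
hdfs dfs -cat /user/hadoop/data/a.txt  # print a file to stdout
hdfs dfs -rm /user/hadoop/data/a.txt   # delete a file
```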
Some HDFS-specific commands
-copyFromLocal, copyToLocal, get, getmerge, put, setrep
copyFromLocal/put
copies files from the local filesystem to HDFS
copyToLocal/get
copies files from HDFS to the local filesystem
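A usage sketch for the copy commands (filenames are hypothetical):

```shell
# local -> HDFS (put and copyFromLocal are equivalent here)
hdfs dfs -put access.log /user/hadoop/logs/
hdfs dfs -copyFromLocal access.log /user/hadoop/logs/
# HDFS -> local (get and copyToLocal are equivalent here)
hdfs dfs -get /user/hadoop/logs/access.log ./access.copy.log
```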
getmerge
gets all files in the directories that match the source pattern
-Merges and sorts them into a single file on the local filesystem
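A typical use of getmerge is collecting the part files of a job's output directory into one local file (paths are hypothetical):

```shell
# Concatenate every file under the HDFS output directory
# into a single local file (paths are hypothetical):
hdfs dfs -getmerge /user/hadoop/output ./merged.txt
```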
setrep
Sets the replication factor of a file
Can be executed recursively to change an entire tree
can specify to wait until the replication level is achieved
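A setrep sketch showing the recursive and wait options (paths are hypothetical):

```shell
# Set replication to 3 and wait (-w) until each block reaches it:
hdfs dfs -setrep -w 3 /user/hadoop/important.dat
# Recursively (-R) set replication to 2 for an entire tree:
hdfs dfs -setrep -R 2 /user/hadoop/archive
```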
Adding or removing nodes from the cluster
-Can be performed from the Ambari web console
-Need the IP address or hostname of the node to add
-Node must be reachable
-BigInsights must NOT already be installed on the node to add
/etc/hosts on both master and child nodes should be updated prior to adding child nodes
hadoop-env.sh Environment variables that are used in the scripts to run Hadoop
core-site.xml Configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce
hdfs-site.xml Configuration settings for the HDFS daemons: the NameNode, the secondary NameNode, and the DataNodes
mapred-site.xml Configuration settings for the MapReduce daemons: the JobTracker and the TaskTrackers
masters A list of machines (one per line) that each run a secondary NameNode
slaves A list of machines (one per line) that each run a DataNode and a TaskTracker
hadoop-metrics.properties Properties for controlling how metrics are published in Hadoop
log4j.properties Properties for system logfiles, the NameNode audit log, and the task log for the TaskTracker child process
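As one concrete example, the default filesystem used by the FS shell is set in core-site.xml; a minimal fragment (hostname and port are assumptions):

```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode-host:8020</value>
  </property>
</configuration>
```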
MapReduce
Processes huge datasets for certain kinds of
distributable problems using a large number of nodes
Map
Master node partitions the input into smaller sub-problems
Distributes the sub-problems to worker nodes
Reduce
Master node then takes the answers to all the sub-problems
Combines them in some way to get the output
Allows for distributed processing of the map and reduce operations
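A classic way to see map and reduce in action is the wordcount example that ships with Hadoop; a sketch, assuming a Hadoop 2.x layout ($HADOOP_HOME and the input/output paths are assumptions):

```shell
# Stage input, run the bundled wordcount job, read the result:
hdfs dfs -mkdir -p /user/hadoop/wc-in
hdfs dfs -put input.txt /user/hadoop/wc-in/
hadoop jar "$HADOOP_HOME"/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    wordcount /user/hadoop/wc-in /user/hadoop/wc-out
hdfs dfs -cat /user/hadoop/wc-out/part-r-00000
```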
Pig
Developed at Yahoo
Data flow language
Can operate on complex, nested data structures
Schema optional
Relationally complete
Turing complete when extended with Java UDFs
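A small Pig Latin sketch of the data-flow style, run in local mode (the access.log file and its column layout are assumptions):

```shell
# Count hits per IP from a space-separated log (file and
# layout are assumptions), using Pig's local execution mode:
pig -x local <<'PIGEOF'
logs   = LOAD 'access.log' USING PigStorage(' ')
         AS (ip:chararray, url:chararray);
by_ip  = GROUP logs BY ip;
counts = FOREACH by_ip GENERATE group AS ip, COUNT(logs) AS hits;
DUMP counts;
PIGEOF
```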
Hive
Developed at Facebook
Declarative language (SQL dialect)
Schema required, but data can have many schemas
Relationally complete
Turing complete when extended with Java UDFs
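A small HiveQL sketch of the declarative style (the table name and columns are assumptions):

```shell
# Define a table over delimited text and run a grouped count
# (table name and columns are assumptions):
hive -e "
CREATE TABLE IF NOT EXISTS pageviews (ip STRING, url STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
SELECT ip, COUNT(*) AS hits FROM pageviews GROUP BY ip;
"
```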
Flume
A service for moving large amounts of data around a cluster soon after the data is produced