Created January 13, 2019 20:28
Hadoop Basics
Pre-Hadoop 2.2
Two main components
-Distributed file system (HDFS)
-MapReduce engine
HDFS runs on top of the existing file system
Designed to handle very large files
HDFS file blocks
-Not the same as the OS file blocks
-Default for Hadoop is 64 MB
-Blocks for a single file are spread across multiple nodes in the cluster
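The block-splitting rule above can be sketched in a few lines. This is a minimal illustration, not Hadoop code; the 64 MB block size matches the default named above.

```python
import math

# HDFS stores a file as fixed-size blocks (64 MB by default here).
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB

def block_count(file_size_bytes):
    """Number of HDFS blocks needed to store a file of the given size."""
    return max(1, math.ceil(file_size_bytes / BLOCK_SIZE))

# A 200 MB file spans 4 blocks; the last block holds only the remainder.
print(block_count(200 * 1024 * 1024))  # 4
```

These blocks, not whole files, are what get replicated and spread across DataNodes.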
NameNode
Only one per Hadoop cluster
Manages the filesystem namespace and metadata
-Data does not go through the NameNode
-Data is not stored on the NameNode
Single point of failure
-Good idea to mirror the NameNode
-Do not use inexpensive, commodity hardware
DataNodes
Many per Hadoop cluster
Blocks from different files can be stored on the same DataNode
Manages blocks of data and serves them to clients
Periodically reports to the NameNode the list of blocks it stores
Suitable for inexpensive, commodity hardware
JobTracker
Manages the MapReduce jobs in the cluster
One per Hadoop cluster
Receives job requests submitted by clients
Schedules and monitors MapReduce jobs on TaskTrackers
TaskTrackers
Execute MapReduce operations
-Run the MapReduce tasks in JVMs
-Have a set number of slots used to run tasks
-Communicate with the JobTracker via heartbeat messages
-Read blocks from DataNodes
Hadoop 2.2 architecture
Provides YARN
-Referred to as MapReduce v2
-Resource manager and scheduler external to any framework
-DataNodes still exist
-JobTracker and TaskTracker no longer exist
Two main ideas
Provide generic scheduling and resource management
More efficient scheduling and workload management
HDFS command line interface
The FileSystem (FS) shell is invoked by
hdfs dfs <args>
Example, to list the current directory in HDFS
hdfs dfs -ls
All FS shell commands take URIs as arguments
scheme://authority/path
Scheme
-Scheme for HDFS is hdfs
-Scheme for the local filesystem is file
Example
hdfs dfs -cp file://sampleData/spark.myfile.txt hdfs://
Scheme and authority are optional
Defaults are taken from the core-site.xml configuration file
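The scheme://authority/path split above is ordinary URI structure, which can be demonstrated with Python's standard library; the NameNode hostname and port in this example are hypothetical.

```python
from urllib.parse import urlparse

# FS shell arguments are URIs of the form scheme://authority/path.
# urlparse splits a (hypothetical) HDFS URI into those three parts.
uri = urlparse("hdfs://namenode-host:9000/user/sample/spark.myfile.txt")
print(uri.scheme)  # hdfs
print(uri.netloc)  # namenode-host:9000  (the "authority")
print(uri.path)    # /user/sample/spark.myfile.txt
```

When the scheme and authority are omitted, the FS shell fills them in from core-site.xml rather than from the URI itself.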
Most of the FS shell commands behave like the corresponding UNIX commands
A number of POSIX-like commands
-cat, chgrp, chmod, chown, cp, du, ls, mkdir, mv, rm, stat, tail
Some HDFS-specific commands
-copyFromLocal, copyToLocal, get, getmerge, put, setrep
copyFromLocal/put
Copies files from the local filesystem to HDFS
copyToLocal/get
Copies files from HDFS to the local filesystem
getmerge
Gets all files in the directories that match the source pattern
-Merges and sorts them into a single file on the local filesystem
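The getmerge behavior described above can be sketched against the local filesystem: gather the part files, order them by name, and concatenate them into one destination file. This is an illustrative simulation, not the HDFS implementation, and the part-file names are hypothetical.

```python
import os
import tempfile

def getmerge(src_dir, dst_path):
    """Concatenate the files in src_dir, in sorted name order, into dst_path."""
    with open(dst_path, "wb") as dst:
        for name in sorted(os.listdir(src_dir)):
            with open(os.path.join(src_dir, name), "rb") as part:
                dst.write(part.read())

# Usage: merge two part files into one local file.
src = tempfile.mkdtemp()
for name, text in [("part-00000", b"alpha\n"), ("part-00001", b"beta\n")]:
    with open(os.path.join(src, name), "wb") as f:
        f.write(text)

out = os.path.join(tempfile.mkdtemp(), "merged.txt")
getmerge(src, out)
with open(out, "rb") as f:
    print(f.read())  # b'alpha\nbeta\n'
```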
setrep
Sets the replication factor of a file
Can be executed recursively to change an entire tree
Can specify to wait until the replication level is achieved
Adding or removing nodes from the cluster
-Can be performed from the Ambari web console
-Need the IP address or hostname of the node to add
-Node must be reachable
-BigInsights must NOT already be installed on the node to add
-/etc/hosts on both master and child nodes should be updated prior to adding child nodes
hadoop-env.sh - Environment variables that are used in the scripts to run Hadoop
core-site.xml - Configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce
hdfs-site.xml - Configuration settings for the HDFS daemons: the NameNode, the secondary NameNode, and the DataNodes
mapred-site.xml - Configuration settings for the MapReduce daemons: the JobTracker and the TaskTrackers
masters - A list of machines (one per line) that each run a secondary NameNode
slaves - A list of machines (one per line) that each run a DataNode and a TaskTracker
hadoop-metrics.properties - Properties for controlling how metrics are published in Hadoop
log4j.properties - Properties for system log files, the NameNode audit log, and the task log for the TaskTracker child process
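As an example of the kind of setting core-site.xml carries, here is a minimal sketch declaring the default filesystem, which supplies the scheme and authority the FS shell falls back on; the hostname and port are illustrative.

```xml
<!-- Minimal core-site.xml sketch; namenode-host:9000 is illustrative. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode-host:9000</value>
  </property>
</configuration>
```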
MapReduce
Processes huge datasets for certain kinds of distributable problems using a large number of nodes
Map
The master node partitions the input into smaller sub-problems
Distributes the sub-problems to worker nodes
Reduce
The master node then takes the answers to all the sub-problems
Combines them in some way to get the output
Allows for distributed processing of the map and reduce operations
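The map/shuffle/reduce flow above can be sketched in memory with the classic word-count example. This single-process simulation only illustrates the data flow; a real job distributes the map and reduce phases across the cluster.

```python
from collections import defaultdict

def map_fn(line):
    # Map phase: emit a (key, value) pair per word in the input record.
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    # Reduce phase: combine all values that share a key.
    return word, sum(counts)

def map_reduce(lines):
    groups = defaultdict(list)
    for line in lines:                  # map each input record...
        for key, value in map_fn(line):
            groups[key].append(value)   # ...and shuffle: group by key
    return dict(reduce_fn(k, v) for k, v in groups.items())

print(map_reduce(["big data", "big cluster"]))
# {'big': 2, 'data': 1, 'cluster': 1}
```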
Pig
Developed at Yahoo
Data flow language
Can operate on complex, nested data structures
Schema optional
Relationally complete
Turing complete when extended with Java UDFs
Hive
Developed at Facebook
Declarative language (SQL dialect)
Schema non-optional, but data can have many schemas
Relationally complete
Turing complete when extended with Java UDFs
Flume
A service for moving large amounts of data around a cluster soon after the data is produced