# Apache Hadoop Quick Start Guide
This content was originally prepared for the course CMPUT 681 - Parallel and Distributed Systems at the University of Alberta (Fall 2016):
https://www.ualberta.ca/computing-science/graduate-studies/course-directory/courses/parallel-and-distributed-systems
Obtain the latest stable Hadoop release: https://hadoop.apache.org/releases.html
A good article for cluster setup: http://dasunhegoda.com/hadoop-2-deployment-installation-ubuntu/1085/
Cloudera QuickStart VM: http://www.cloudera.com/downloads/quickstart_vms/5-8.html
Hortonworks Sandbox VM: http://hortonworks.com/products/sandbox/
MapR Sandbox: https://www.mapr.com/products/mapr-sandbox-hadoop
- Assign the Floating-IP to your master node
- To be able to access the web interfaces, you should create rules to allow incoming traffic on ports 8088 and 50070, in addition to the SSH port (22):
- RAC (OpenStack) dashboard > Access & Security > Security Groups > Manage Rules
- Use the following command on all nodes to install Java:
$ sudo apt-get install openjdk-8-jdk-headless
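You can confirm the installation afterwards:
$ java -version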
- Use the following path for JAVA_HOME instead of the one in the article:
/usr/lib/jvm/java-8-openjdk-amd64
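Hadoop also reads JAVA_HOME from its own environment file, so the same path goes there (the file sits under the Hadoop installation directory):
# in etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64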
- Add the following line at the end of ~/.bashrc:
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar:$HADOOP_CLASSPATH
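Reload the file so the variable takes effect in the current shell:
$ source ~/.bashrc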
- The above article is written for a cluster with 5 nodes. If you have only 4 nodes in your RAC (OpenStack) account, follow the steps with just 2 slave nodes, or assign master2 (the Secondary NameNode) to one of the slave nodes.
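Whichever layout you choose, the slave hostnames are listed in the slaves file of the Hadoop configuration; a minimal sketch, assuming two slaves named slave1 and slave2:
# etc/hadoop/slaves -- one worker hostname per line
slave1
slave2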
- After starting the Hadoop cluster, you need to create the main user folder in HDFS before putting files into it:
$ hdfs dfs -mkdir -p /user/ubuntu
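You can verify that the folder was created:
$ hdfs dfs -ls /user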
- HDFS (NameNode) web interface:
http://<floating-IP>:50070
- YARN (ResourceManager) web interface:
http://<floating-IP>:8088
$ yarn node -list                # list the nodes in the cluster
$ yarn node -status <node-id>    # show the status of one node
$ hdfs dfsadmin -report          # report HDFS capacity and DataNode status
$ mapred job -status <job-id>    # check the status of a submitted job
$ mapred job -kill <job-id>      # kill a running job
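For illustration, node IDs have the form host:port (as printed by yarn node -list) and job IDs have the form job_<timestamp>_<sequence>; the values below are made up:
$ yarn node -status slave1:45454
$ mapred job -status job_1478912345678_0001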
$ hdfs dfs -put <data-file/directory> <path in HDFS>   # copy local data into HDFS
$ hdfs dfs -ls                                         # list the current user's HDFS folder
$ hdfs dfs -rm -r <HDFS path>                          # delete a file/folder recursively
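As a quick illustration, a typical put/list/inspect session might look like this (the file and folder names are only examples):
$ hdfs dfs -mkdir -p /user/ubuntu/input
$ hdfs dfs -put books.txt /user/ubuntu/input/
$ hdfs dfs -ls /user/ubuntu/input
$ hdfs dfs -cat /user/ubuntu/input/books.txt | head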
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in the input line.
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer (also used as the combiner): sums the counts for each word.
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
YARN may fail to run the JAR built from the above code, even though the code is taken from the official Hadoop website. In that case, adding the following line to the main() function makes it work:
job.setJar("wc.jar");
(Credit goes to Saeed Sarabchi)
$ hadoop com.sun.tools.javac.Main WordCount.java   # compile against the Hadoop classpath
$ jar cf wc.jar WordCount*.class                   # package the classes into a JAR
$ hadoop jar wc.jar WordCount <path to input file/folders in HDFS> <path to output folder in HDFS>
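Once the job completes, the word counts are written as part files inside the output folder (whatever path you passed as the second argument) and can be inspected directly from HDFS:
$ hdfs dfs -ls <path to output folder in HDFS>
$ hdfs dfs -cat <path to output folder in HDFS>/part-r-00000 | head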
#!/usr/bin/env python
# Streaming mapper: emit "word<TAB>1" for every word read from stdin.
import sys

for line in sys.stdin:
    line = line.strip()
    keys = line.split()
    for key in keys:
        value = 1
        print("%s\t%d" % (key, value))
#!/usr/bin/env python
# Streaming reducer: stdin is sorted by key, so the counts for the same
# word arrive consecutively and can be summed with a running total.
import sys

last_key = None
running_total = 0

for input_line in sys.stdin:
    input_line = input_line.strip()
    this_key, value = input_line.split("\t", 1)
    value = int(value)
    if last_key == this_key:
        running_total += value
    else:
        if last_key:
            print("%s\t%d" % (last_key, running_total))
        running_total = value
        last_key = this_key

# Flush the final key (the check also guards against empty input).
if last_key is not None:
    print("%s\t%d" % (last_key, running_total))
$ head -n100 <local-data-file> | ./mapper.py | sort | ./reducer.py
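Both scripts must be executable for the pipe (and for Hadoop Streaming) to run them. As a tiny sanity check with made-up input:
$ chmod +x mapper.py reducer.py
$ echo "foo foo bar" | ./mapper.py | sort | ./reducer.py
bar	1
foo	2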
$ hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar \
-file /home/ubuntu/workspace/mapper.py -file /home/ubuntu/workspace/reducer.py \
-mapper /home/ubuntu/workspace/mapper.py -reducer /home/ubuntu/workspace/reducer.py \
-input <HDFS path to file/folder> -output <HDFS path>
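When the streaming job finishes, its output also lands in part files under the output path:
$ hdfs dfs -cat <HDFS path>/part-00000 | head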
MapReduce Pipes is used to run programs developed with third-party libraries, such as Pydoop. The library must be installed on every node before this method can be used (a pip sketch follows the run command below). The following word-count example is developed using the Pydoop library:
#!/usr/bin/env python
import pydoop.pipes as pa

class Mapper(pa.Mapper):
    def map(self, context):
        # Emit (word, "1") for every word in the input value.
        words = context.getInputValue().split()
        for w in words:
            context.emit(w, "1")

class Reducer(pa.Reducer):
    def reduce(self, context):
        # Sum the values collected for the current key.
        s = 0
        while context.nextValue():
            s += int(context.getInputValue())
        context.emit(context.getInputKey(), str(s))

if __name__ == "__main__":
    pa.runTask(pa.Factory(Mapper, Reducer))
$ mapred pipes -D hadoop.pipes.java.recordreader=true -D hadoop.pipes.java.recordwriter=true \
-program mapreduce.py -input <HDFS path to file/folder> -output <HDFS path>
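As noted above, Pydoop must be present on every node. It is distributed on PyPI, so a minimal install sketch is pip, assuming Java, Hadoop, and a C compiler are already available on each node (Pydoop builds native extensions):
$ sudo pip install pydoop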
Download Apache Hadoop:
- https://hadoop.apache.org/releases.html
Hadoop 3 New Features:
- https://blog.cloudera.com/blog/2016/09/getting-to-know-the-apache-hadoop-3-alpha/
- https://dzone.com/articles/getting-to-know-the-apache-hadoop-3-alpha
Articles on Cluster Configuration:
- http://dasunhegoda.com/hadoop-2-deployment-installation-ubuntu/1085/
- http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html
- http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/ClusterSetup.html
MapReduce concepts and programming:
- http://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
- http://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapredCommands.html
Cloudera QuickStart VM:
- http://www.cloudera.com/downloads/quickstart_vms/5-8.html
Hadoop Libraries for Streaming/Piping:
Article on Python MapReduce with Pydoop:
Articles on Hadoop Concepts:
- http://www.glennklockwood.com/data-intensive/hadoop/overview.html
- https://developer.yahoo.com/hadoop/tutorial/
Articles on Hadoop Streaming: