# Apache Hadoop Quick Start Guide

This content was originally prepared for the course CMPUT 681 - Parallel and Distributed Systems at the University of Alberta (Fall 2016):
https://www.ualberta.ca/computing-science/graduate-studies/course-directory/courses/parallel-and-distributed-systems

## Installing Hadoop on a cluster

Manual Installation on RAC (OpenStack) Cluster:

Obtain the latest stable Hadoop release: https://hadoop.apache.org/releases.html

A good article for cluster setup: http://dasunhegoda.com/hadoop-2-deployment-installation-ubuntu/1085/
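
For orientation, here is a minimal sketch of the download/unpack step described in that article, run on each node. The release version (2.7.3, matching the streaming jar used later in this guide) and the /usr/local/hadoop install path are assumptions; adjust them to your setup.

```bash
# Assumed version and install path -- adjust as needed.
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
tar -xzf hadoop-2.7.3.tar.gz
sudo mv hadoop-2.7.3 /usr/local/hadoop

# Make the Hadoop binaries available for the current user.
echo 'export HADOOP_HOME=/usr/local/hadoop' >> ~/.bashrc
echo 'export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin' >> ~/.bashrc
source ~/.bashrc
```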

## Pre-cooked VMs as a local playground

  • Cloudera QuickStart VM: http://www.cloudera.com/downloads/quickstart_vms/5-8.html
  • Hortonworks Sandbox VM: http://hortonworks.com/products/sandbox/
  • MapR Sandbox: https://www.mapr.com/products/mapr-sandbox-hadoop

## Notes

  • Assign the Floating-IP to your master node
  • To be able to access the web interfaces, you should create rules to allow incoming traffic on ports 8088 and 50070, in addition to the SSH port (22):
    • RAC (OpenStack) dashboard > Access & Security > Security Groups > Manage Rules
  • Use the following command on nodes to install Java:
    • $ sudo apt-get install openjdk-8-jdk-headless
  • Use the following path for JAVA_HOME instead of the one in the article:
    • /usr/lib/jvm/java-8-openjdk-amd64
  • Add the following line at the end of ~/.bashrc :
    • export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar:$HADOOP_CLASSPATH
  • The above article is written for a cluster with 5 nodes. If you have only 4 nodes in your RAC (OpenStack) account, do the steps with only 2 slave nodes, or assign master2 (the Secondary NameNode) to one of the slave nodes.
  • After starting the Hadoop cluster, you need to create the main user folder in HDFS before putting files into it (see the combined sketch after this list):
    • $ hdfs dfs -mkdir -p /user/ubuntu
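
Putting the notes above together, a minimal sketch of the per-node setup plus the first HDFS command on the master (the Java package and paths are the ones given above; run the HDFS command only after the cluster is started):

```bash
# On every node: install Java (as noted above).
sudo apt-get update
sudo apt-get install -y openjdk-8-jdk-headless

# On every node: append the environment variables to ~/.bashrc
# (if the article sets JAVA_HOME elsewhere, e.g. in hadoop-env.sh, use this same path there).
echo 'export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64' >> ~/.bashrc
echo 'export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar:$HADOOP_CLASSPATH' >> ~/.bashrc
source ~/.bashrc

# On the master, after starting the cluster: create the main user folder in HDFS.
hdfs dfs -mkdir -p /user/ubuntu
```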

## Accessing web interfaces

  • HDFS http://<floating-IP>:50070
  • YARN http://<floating-IP>:8088
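
A quick sanity check that the two interfaces above are reachable from outside the cluster (replace <floating-IP> with the floating IP assigned to your master node):

```bash
# Each command should print the beginning of an HTML page if the
# security-group rules from the Notes section are in place.
curl -s http://<floating-IP>:50070 | head -n 5   # HDFS NameNode UI
curl -s http://<floating-IP>:8088  | head -n 5   # YARN ResourceManager UI
```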

## Some useful commands

```
$ yarn node -list
$ yarn node -status <node-id>
$ hdfs dfsadmin -report

$ mapred job -status <job-id>
$ mapred job -kill <job-id>

$ hdfs dfs -put <data-file/directory> <path in HDFS>
$ hdfs dfs -ls
$ hdfs dfs -rm -r <HDFS path>
```
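
As a concrete (hypothetical) example of the HDFS commands above, copying a local file named access.log into the user folder created earlier and verifying it arrived:

```bash
# "access.log" and the input folder are made-up names for illustration.
hdfs dfs -mkdir -p /user/ubuntu/input
hdfs dfs -put access.log /user/ubuntu/input/
hdfs dfs -ls /user/ubuntu/input
hdfs dfsadmin -report        # per-DataNode capacity and usage summary
```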

## Java MapReduce programs

### WordCount.java

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

### Update 2016-11-30

Hadoop YARN may fail to run the JAR built from the code above, even though the code is taken from the official Hadoop website. Adding the following line to the main() function makes it work:

```java
job.setJar("wc.jar");
```

(Credit goes to Saeed Sarabchi)

### Compile and Run

```
$ hadoop com.sun.tools.javac.Main WordCount.java
$ jar cf wc.jar WordCount*.class
$ hadoop jar wc.jar WordCount <path to input file/folders in HDFS> <path to output folder in HDFS>
```
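
An end-to-end run with hypothetical file names might look like the following (HADOOP_CLASSPATH must already include ${JAVA_HOME}/lib/tools.jar, as set in the Notes section; the output folder must not already exist in HDFS):

```bash
# Compile the class and package it into wc.jar.
hadoop com.sun.tools.javac.Main WordCount.java
jar cf wc.jar WordCount*.class

# "books.txt" and the wc-output folder are made-up names for illustration.
hdfs dfs -mkdir -p /user/ubuntu/input
hdfs dfs -put books.txt /user/ubuntu/input/
hadoop jar wc.jar WordCount /user/ubuntu/input /user/ubuntu/wc-output

# Reducer output from the new MapReduce API is written as part-r-00000, part-r-00001, ...
hdfs dfs -cat /user/ubuntu/wc-output/part-r-00000 | head
```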

## Hadoop Streaming

### mapper.py

```python
#!/usr/bin/env python
# Emit a "<word>\t1" pair for every whitespace-separated token read from stdin.

import sys

for line in sys.stdin:
    line = line.strip()
    keys = line.split()
    for key in keys:
        value = 1
        print("%s\t%d" % (key, value))
```

### reducer.py

```python
#!/usr/bin/env python
# Sum the counts for each key. This relies on the shuffle phase (or `sort`
# when testing locally) delivering the mapper output grouped by key.

import sys

last_key = None
running_total = 0

for input_line in sys.stdin:
    input_line = input_line.strip()
    this_key, value = input_line.split("\t", 1)
    value = int(value)

    if last_key == this_key:
        running_total += value
    else:
        # Key changed: emit the previous key's total (if any) and reset.
        if last_key is not None:
            print("%s\t%d" % (last_key, running_total))
        running_total = value
        last_key = this_key

# Emit the total for the final key (guarding against empty input).
if last_key is not None:
    print("%s\t%d" % (last_key, running_total))
```

### Testing your Mapper & Reducer scripts

```
$ head -n100 <local-data-file> | ./mapper.py | sort | ./reducer.py
```
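
Both scripts must be executable for the ./ invocation above to work:

```bash
chmod +x mapper.py reducer.py
```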

### Run on Hadoop

```
$ hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar \
 -file /home/ubuntu/workspace/mapper.py -file /home/ubuntu/workspace/reducer.py \
 -mapper /home/ubuntu/workspace/mapper.py -reducer /home/ubuntu/workspace/reducer.py \
 -input <HDFS path to file/folder> -output <HDFS path>
```
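
Once the job completes, the results can be read straight from HDFS. The commands below assume <HDFS path> is the same path that was passed to -output; streaming reducers write their output as part-00000, part-00001, and so on:

```bash
hdfs dfs -ls <HDFS path>
hdfs dfs -cat <HDFS path>/part-00000 | head
# Or copy the whole output folder back to the local filesystem:
hdfs dfs -get <HDFS path> ./streaming-output
```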

## MapReduce Pipes

MapReduce Pipes is used to run programs developed with third-party libraries such as Pydoop. The library must be installed on every node before you can use this method (see the installation sketch below).
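
A minimal sketch of installing Pydoop on every node, assuming pip is available and a file called nodes.txt lists the node hostnames (both are assumptions; any other provisioning method works just as well):

```bash
# nodes.txt is a hypothetical file listing one hostname/IP per line.
for host in $(cat nodes.txt); do
    ssh ubuntu@"$host" "sudo apt-get install -y python-pip && sudo pip install pydoop"
done
```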

### mapreduce.py

Developed using the Pydoop library.

```python
#!/usr/bin/env python
import pydoop.pipes as pa

class Mapper(pa.Mapper):

  def map(self, context):
    words = context.getInputValue().split()
    for w in words:
      context.emit(w, "1")

class Reducer(pa.Reducer):

  def reduce(self, context):
    s = 0
    while context.nextValue():
      s += int(context.getInputValue())
    context.emit(context.getInputKey(), str(s))

if __name__ == "__main__":
  pa.runTask(pa.Factory(Mapper, Reducer))
```

### Command to Run

```
$ mapred pipes -D hadoop.pipes.java.recordreader=true -D hadoop.pipes.java.recordwriter=true \
-program mapreduce.py -input <HDFS path to file/folder> -output <HDFS path>
```
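
A concrete invocation with hypothetical input and output paths (everything else mirrors the command above):

```bash
mapred pipes -D hadoop.pipes.java.recordreader=true -D hadoop.pipes.java.recordwriter=true \
    -program mapreduce.py -input /user/ubuntu/input -output /user/ubuntu/pipes-output

# Inspect the results.
hdfs dfs -cat /user/ubuntu/pipes-output/part-* | head
```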
