DSBDA Python Hadoop practical

Required files

  • mapper.py
  • reducer.py

Put the files on the Hadoop file system (HDFS)

  1. run hdfs dfs -mkdir /sid
  2. Create a file data.txt with the following contents
WELCOME TO PVGCOET  WELCOME TO PVGCOET  WELCOME TO PVGCOET
WELCOME TO PVGCOET  WELCOME TO PVGCOET  WELCOME TO PVGCOET 
WELCOME TO PVGCOET  ...... // keep it big so that the examiner won't count
  3. run hdfs dfs -put data.txt /sid
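Optionally, confirm the upload before moving on (this check is not part of the original steps):

hdfs dfs -ls /sid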
  4. Create file mapper.py
import sys
import io

# Hadoop streaming feeds the mapper raw bytes on stdin; wrap it as a text stream
input_stream = io.TextIOWrapper(sys.stdin.buffer)

for line in input_stream:
    line = line.lower()
    words = line.split()
    # Emit one "word<TAB>1" pair per word
    for word in words:
        print("%s\t%s" % (word, 1))
  5. Create file reducer.py
import sys

current_word = None
current_count = 0
word = None

for line in sys.stdin:
    # Each mapper line looks like "word<TAB>1"
    word, count = line.strip().split('\t', 1)
    count = int(count)
    # Hadoop sorts mapper output by key, so equal words arrive consecutively
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print("%s\t%s" % (current_word, current_count))
        current_count = count
        current_word = word

# Flush the count for the last word
if current_word == word:
    print("%s\t%s" % (current_word, current_count))
  6. Run the streaming job
hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.9.0.jar \
-input /sid/data.txt \
-output /sid/myoutput \
-file mapper.py \
-file reducer.py \
-mapper "python3 mapper.py" \
-reducer "python3 reducer.py"
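Note: MapReduce refuses to run if the output directory already exists, so when re-running the job first remove it:

hdfs dfs -rm -r /sid/myoutput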

Output: (screenshot of the streaming job run, omitted here)

  7. See the file contents: run hdfs dfs -cat /sid/myoutput/* (output screenshot omitted)
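To keep a local copy of the result (optional), pull the whole output directory down from HDFS:

hdfs dfs -get /sid/myoutput .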