DSBDA Python Hadoop practical

Required files

  • mapper.py
  • reducer.py

Put the files on the Hadoop file system (HDFS)

  1. run hdfs dfs -mkdir /sid
  2. Create a file data.txt with the following contents
WELCOME TO PVGCOET  WELCOME TO PVGCOET  WELCOME TO PVGCOET
WELCOME TO PVGCOET  WELCOME TO PVGCOET  WELCOME TO PVGCOET 
WELCOME TO PVGCOET  ...... // keep it big so that the examiner won't count
  3. run hdfs dfs -put data.txt /sid
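Optionally, confirm the upload before moving on (this check is not part of the original steps):

hdfs dfs -ls /sid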
  4. Create file mapper.py
import sys
import io

# Hadoop streaming feeds the mapper raw bytes on stdin; wrap it as a text stream
input_stream = io.TextIOWrapper(sys.stdin.buffer)

for line in input_stream:
    line = line.lower()
    words = line.split()
    # Emit one "word<TAB>1" pair per word
    for word in words:
        print("%s\t%s" % (word, 1))
  5. Create file reducer.py
import sys

current_word = None
current_count = 0
word = None

for line in sys.stdin:
    # Each mapper line looks like "word<TAB>1"
    word, count = line.strip().split('\t', 1)
    count = int(count)
    # Hadoop sorts mapper output by key, so equal words arrive consecutively
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print("%s\t%s" % (current_word, current_count))
        current_count = count
        current_word = word

# Flush the count for the last word
if current_word == word:
    print("%s\t%s" % (current_word, current_count))
  6. Run the streaming job
hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.9.0.jar \
-input /sid/data.txt \
-output /sid/myoutput \
-file mapper.py \
-file reducer.py \
-mapper "python3 mapper.py" \
-reducer "python3 reducer.py"
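Note: MapReduce refuses to run if the output directory already exists, so when re-running the job first remove it:

hdfs dfs -rm -r /sid/myoutput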

Output: (screenshot of the streaming job run, omitted here)

  7. See the file contents: run hdfs dfs -cat /sid/myoutput/* (output screenshot omitted)
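To keep a local copy of the result (optional), pull the whole output directory down from HDFS:

hdfs dfs -get /sid/myoutput .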