@kagesenshi
Last active December 2, 2015 12:17
Template for writing a Python Hive UDF transform that supports multiprocessing
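from __future__ import print_function
import sys
from multiprocessing import Pool, cpu_count

NULL = '\\N'  # Hive's text representation of NULL (escaped so it also parses on Python 3)

def transform(*args):
    # -- do something here --
    # Receives one row's columns (None for NULL) and returns the output columns.
    return []

def process_line(line):
    # Strip only the newline so empty trailing columns are preserved.
    fields = line.rstrip('\n').split('\t')
    fields = [None if x == NULL else x for x in fields]
    result = transform(*fields)
    return '\t'.join([NULL if x is None else str(x) for x in result])

if __name__ == '__main__':
    pool = Pool(cpu_count())
    # imap yields results in input order as they complete; map would wait for all of stdin first.
    for r in pool.imap(process_line, sys.stdin, chunksize=1000):
        print(r)
    pool.close()
    pool.join()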
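
As a rough illustration of how the template might be filled in, here is a minimal sketch with a concrete `transform`. The two-column layout (`name`, `amount`) and the logic itself are hypothetical and not part of the original gist. `process_line` is repeated so the example runs on its own, without the pool.

```python
def transform(name, amount):
    """Hypothetical example: normalise a name and double a numeric amount."""
    # Columns arrive as strings, or as None when Hive sent NULL.
    clean_name = name.strip().upper() if name is not None else None
    doubled = float(amount) * 2 if amount is not None else None
    return [clean_name, doubled]


def process_line(line):
    # Same row handling as the template: tab-separated columns, \N for NULL.
    fields = line.rstrip('\n').split('\t')
    fields = [None if x == '\\N' else x for x in fields]
    result = transform(*fields)
    return '\t'.join('\\N' if x is None else str(x) for x in result)


if __name__ == '__main__':
    # Feed two sample rows by hand instead of reading from sys.stdin.
    print(process_line('alice\t2.5\n'))
    print(process_line('\\N\t4\n'))
```

Because `transform` takes exactly two positional arguments here, a row with a different number of columns raises a `TypeError`. Keeping the `*args` signature from the template avoids that if the input width can vary.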