@ewencp
Created October 16, 2013 16:13
Simple example of reduce-side join in mrjob
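from mrjob.job import MRJob
from mrjob.protocol import JSONProtocol


def is_customer_record(record):
    # Placeholder test (an assumption, not part of mrjob): treat any
    # record without an 'orderId' field as a customer record.
    return 'orderId' not in record


class JoinExample(MRJob):

    # Assumes each input line is "key<TAB>value" with both parts
    # JSON-encoded, so the mapper receives (id, record) pairs.
    INPUT_PROTOCOL = JSONProtocol

    def mapper(self, id, record):
        # Use both large files as input. If you have orders and
        # customers, each input pair is either
        #   order_id, order_data
        # or
        #   customer_id, customer_data
        # Both record types are assumed to have a customerId field to
        # join on, and to be distinguishable in the reducer.
        yield record['customerId'], record

    def reducer(self, customerID, records):
        customer = None
        orders = []
        for record in records:
            if is_customer_record(record):
                customer = record  # keep the customer info
            else:
                orders.append(record)  # buffer the order info
        # One possible join output: a row per order with its customer.
        for order in orders:
            yield customerID, {'customer': customer, 'order': order}


if __name__ == '__main__':
    JoinExample.run()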
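
The job above keys every record by customer ID and lets the reducer see all records for one customer together. As a rough illustration of that flow without a Hadoop or mrjob install, the sketch below simulates the map, shuffle, and reduce steps in plain Python. The sample records, the `orderId` field, and the helper names (`map_record`, `reduce_customer`, `run_join`) are all hypothetical, and the joined output shape is just one reasonable choice.

```python
from itertools import groupby
from operator import itemgetter


def is_customer_record(record):
    # Hypothetical test: in this sample data only orders carry an 'orderId'.
    return 'orderId' not in record


def map_record(record):
    # Map step: key every record by the join field.
    return record['customerId'], record


def reduce_customer(customer_id, records):
    # Reduce step: split the group into the customer and its orders,
    # then emit one joined row per order.
    customer = None
    orders = []
    for record in records:
        if is_customer_record(record):
            customer = record
        else:
            orders.append(record)
    for order in orders:
        yield customer_id, {'customer': customer, 'order': order}


def run_join(records):
    # Shuffle step: sort mapped pairs by key so groupby sees each key once.
    mapped = sorted((map_record(r) for r in records), key=itemgetter(0))
    joined = []
    for customer_id, group in groupby(mapped, key=itemgetter(0)):
        joined.extend(reduce_customer(customer_id, (rec for _, rec in group)))
    return joined


sample = [
    {'customerId': 1, 'name': 'Ada'},
    {'customerId': 2, 'name': 'Grace'},
    {'customerId': 1, 'orderId': 'A-100', 'total': 25},
    {'customerId': 1, 'orderId': 'A-101', 'total': 40},
]

for key, row in run_join(sample):
    print(key, row['customer']['name'], row['order']['orderId'])
```

Because the reducer only emits a row per order, a customer with no orders produces no output, which makes this an inner join. Buffering the orders in a list keeps the sketch short but assumes that one customer's orders fit in memory.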