@dgadiraju
Last active June 28, 2019 06:52
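# Assumes `sc` is an existing SparkContext (e.g. the one the pyspark shell provides).
path = "/public/yelp-dataset/yelp_review.csv"

# Split records on '\r' rather than on the default line terminators.
# Each element of the RDD is a (byte offset, record text) pair.
yelpReview = sc.newAPIHadoopFile(
    path,
    'org.apache.hadoop.mapreduce.lib.input.TextInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'org.apache.hadoop.io.Text',
    conf={'textinputformat.record.delimiter': '\r'})

yelpReview.count()

# Preview the first 10 records.
for i in yelpReview.map(lambda r: str(r[1])).take(10):
    print(i)

# Count records by number of fields (split on '","') to spot malformed rows.
for i in yelpReview. \
        map(lambda r: (len(str(r[1]).split('","')), 1)). \
        reduceByKey(lambda x, y: x + y). \
        collect():
    print(i)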
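
The field-count check in the last loop can be tried locally without a cluster. The following is a minimal sketch in plain Python, not part of the original gist: `field_count_histogram` and the `sample` records are hypothetical names and made-up data used only for illustration. It applies the same `split('","')` logic and tallies how many records have each field count.

```python
from collections import Counter


def field_count_histogram(records):
    """Map each field count (split on '","') to the number of records with it."""
    return Counter(len(rec.split('","')) for rec in records)


# Hypothetical sample records, not taken from the Yelp dataset.
sample = [
    '"r1","u1","b1","5","2017-01-01","Great food"',
    '"r2","u2","b2","4","2017-01-02","Nice, friendly staff"',
    '"r3","u3","b3","1"',  # truncated record with fewer fields
]

histogram = field_count_histogram(sample)
for pair in sorted(histogram.items()):
    print(pair)
```

A field count that appears only for a handful of records usually points to rows that were split in the wrong place or contain the `","` sequence inside a field.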