@david30907d
Last active August 30, 2017 13:15
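import json

# Assumes a PySpark shell, where the SparkContext is already bound to `sc`.
t = sc.textFile('Rides_0310.csv')
header = t.map(lambda x: x.split(',')).first()
# Drop the header row (relies on it having a column named exactly 'Id').
data = t.map(lambda x: x.split(',')).filter(lambda x: 'Id' not in x)
dataDict = data.map(lambda x: dict(zip(header, x)))
# Emit one (stop, time) pair for boarding and one for alighting per ride.
stop_and_time = dataDict.flatMap(lambda x: ((x['BoardStop'], x['BoardTime']), (x['AlightStop'], x['AlightTime'])))


def groupDate(x):
    # x iterates over (stop, times) pairs; times look like '2015-05-06 07:00:00.000'
    result = []
    for i in x:
        tmp = {}
        # Cut the seconds and the last minute digit, e.g. '2015-05-06 07:0',
        # so each key covers a 10-minute window.
        for j in map(lambda s: s[:s.rfind(':')][:-1], i[1]):
            tmp[j] = tmp.get(j, 0) + 1
        result.append((i[0], tmp))
    return result


stop = stop_and_time.groupByKey().mapPartitions(groupDate).collect()
# Write {stop: {time_bucket: count}} and close the file when done.
with open('zhou.json', 'w') as f:
    json.dump(dict(stop), f)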
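
The bucketing and counting done in groupDate can be tried without a Spark cluster. The following is a minimal pure-Python sketch of the same idea, not part of the original gist: the helper names `ten_minute_bucket` and `count_by_stop` and the sample rows are hypothetical, and it assumes timestamps in the 'YYYY-MM-DD HH:MM:SS.mmm' form shown in the comment above.

```python
from collections import Counter, defaultdict


def ten_minute_bucket(timestamp):
    """Truncate 'YYYY-MM-DD HH:MM:SS.mmm' to 'YYYY-MM-DD HH:M'."""
    # Drop everything from the last ':' onward (the seconds), then drop the
    # final minute digit so that 07:00 through 07:09 share one key.
    return timestamp[:timestamp.rfind(':')][:-1]


def count_by_stop(pairs):
    """Count events per stop and 10-minute bucket from (stop, timestamp) pairs."""
    counts = defaultdict(Counter)
    for stop, timestamp in pairs:
        counts[stop][ten_minute_bucket(timestamp)] += 1
    # Convert to plain dicts so the result can be passed to json.dump.
    return {stop: dict(c) for stop, c in counts.items()}


# Hypothetical sample rows, not taken from Rides_0310.csv.
sample = [
    ('A', '2015-05-06 07:00:00.000'),
    ('A', '2015-05-06 07:09:59.000'),
    ('A', '2015-05-06 07:10:00.000'),
    ('B', '2015-05-06 08:31:12.000'),
]
result = count_by_stop(sample)
print(result)
```

The first two rows for stop 'A' fall into the same window because only the tens digit of the minute is kept; the third row starts a new window.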