What follows is a technical test for this job offer at CARTO: https://boards.greenhouse.io/cartodb/jobs/705852#.WSvORxOGPUI
Build the following and make it run as fast as you possibly can using Python 3 (vanilla). The faster it runs, the more you will impress us!
Your code should:
- Download this ~2GB file: https://s3.amazonaws.com/carto-1000x/data/yellow_tripdata_2016-01.csv
- Count the lines in the file
- Calculate the average value of the tip_amount field.
All of that in the most efficient way you can come up with.
That's it. Make it fly!
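The download step itself can be sketched with the standard library alone, streaming in chunks so the ~2GB file never has to fit in memory; the chunk size and destination filename here are illustrative choices, not part of the original solution:

```python
import shutil
import urllib.request

def download(url, dest, chunk_size=1 << 20):
    """Stream url to dest in fixed-size chunks (here 1 MiB)."""
    with urllib.request.urlopen(url) as response, open(dest, 'wb') as out:
        shutil.copyfileobj(response, out, chunk_size)

# Example:
# download('https://s3.amazonaws.com/carto-1000x/data/yellow_tripdata_2016-01.csv',
#          'yellow_tripdata_2016-01.csv')
```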
This vanilla Python solution takes more or less the same processing time as the previous Pandas solutions:
import time

file = '/home/manolinux/yellow_tripdata_2016-01.csv'
field = 'tip_amount'
separator = ','

#https://stackoverflow.com/questions/1883980/find-the-nth-occurrence-of-substring-in-a-string
def findnth(string, substring, n):
    parts = string.split(substring, n + 1)
    if len(parts) <= n + 1:
        return -1
    return len(string) - len(parts[-1]) - len(substring)

t0 = time.time()

#Get the column index of 'tip_amount' from the header line
with open(file) as f:
    first_line = f.readline().strip()
fieldNumber = first_line.split(separator).index(field)

#Processing
countLines = 0
accumulatedTips = 0.0
#Buffering option of open seems not to do much
with open(file, 'r', 163840) as f:
    #skip header line
    f.readline()
    for line in f:
        #Comma just before the field (-1 acts as a virtual comma for column 0)
        start = findnth(line, separator, fieldNumber - 1)
        end = line.find(separator, start + 1)
        if end == -1:  #last column: take the rest of the line
            end = len(line)
        countLines += 1
        accumulatedTips += float(line[start + 1:end])

print("Number of lines:", countLines)
print("Average tip:", accumulatedTips / countLines)
print('Elapsed time : ', time.time() - t0)
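For comparison, a shorter variant that simply splits each line on the separator; whether it beats the findnth approach depends on the row width, so treat this as a sketch to benchmark rather than a guaranteed speedup (the path is the same local file assumed above):

```python
def average_tip(path, field='tip_amount', separator=','):
    """Count data rows and average one numeric column by splitting each line."""
    with open(path) as f:
        # Column index taken from the header line
        idx = f.readline().strip().split(separator).index(field)
        count = 0
        total = 0.0
        for line in f:
            total += float(line.split(separator)[idx])
            count += 1
    return count, total / count

# Example (path as in the solution above):
# count, avg = average_tip('/home/manolinux/yellow_tripdata_2016-01.csv')
```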