What follows is a technical test for this job offer at CARTO: https://boards.greenhouse.io/cartodb/jobs/705852#.WSvORxOGPUI
Build the following and make it run as fast as you possibly can using Python 3 (vanilla). The faster it runs, the more you will impress us!
Your code should:
- Download this ~2GB file: https://s3.amazonaws.com/carto-1000x/data/yellow_tripdata_2016-01.csv
- Count the lines in the file
- Calculate the average value of the tip_amount field.
All of that in the most efficient way you can come up with.
That's it. Make it fly!
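The download step itself can be sketched with the standard library alone, streaming in chunks so the ~2GB file never has to fit in memory; the chunk size and destination filename here are illustrative choices, not part of the original solution:

```python
import shutil
import urllib.request

def download(url, dest, chunk_size=1 << 20):
    """Stream url to dest in fixed-size chunks (here 1 MiB)."""
    with urllib.request.urlopen(url) as response, open(dest, 'wb') as out:
        shutil.copyfileobj(response, out, chunk_size)

# Example:
# download('https://s3.amazonaws.com/carto-1000x/data/yellow_tripdata_2016-01.csv',
#          'yellow_tripdata_2016-01.csv')
```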
This vanilla Python solution takes more or less the same processing time as the previous Pandas solutions:
import time

file = '/home/manolinux/yellow_tripdata_2016-01.csv'
field = 'tip_amount'
separator = ','

#https://stackoverflow.com/questions/1883980/find-the-nth-occurrence-of-substring-in-a-string
def findnth(string, substring, n):
    parts = string.split(substring, n + 1)
    if len(parts) <= n + 1:
        return -1
    return len(string) - len(parts[-1]) - len(substring)

t0 = time.time()

#Get the column index of 'tip_amount' from the header line
with open(file) as f:
    first_line = f.readline().strip()
fieldNumber = first_line.split(separator).index(field)

#Processing
countLines = 0
accumulatedTips = 0.0
#Buffering option of open seems not to do much
with open(file, 'r', 163840) as f:
    #skip header line
    f.readline()
    for line in f:
        #Comma just before the field (-1 acts as a virtual comma for column 0)
        start = findnth(line, separator, fieldNumber - 1)
        end = line.find(separator, start + 1)
        if end == -1:  #last column: take the rest of the line
            end = len(line)
        countLines += 1
        accumulatedTips += float(line[start + 1:end])

print("Number of lines:", countLines)
print("Average tip:", accumulatedTips / countLines)
print('Elapsed time : ', time.time() - t0)
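For comparison, a shorter variant that simply splits each line on the separator; whether it beats the findnth approach depends on the row width, so treat this as a sketch to benchmark rather than a guaranteed speedup (the path is the same local file assumed above):

```python
def average_tip(path, field='tip_amount', separator=','):
    """Count data rows and average one numeric column by splitting each line."""
    with open(path) as f:
        # Column index taken from the header line
        idx = f.readline().strip().split(separator).index(field)
        count = 0
        total = 0.0
        for line in f:
            total += float(line.split(separator)[idx])
            count += 1
    return count, total / count

# Example (path as in the solution above):
# count, avg = average_tip('/home/manolinux/yellow_tripdata_2016-01.csv')
```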