I first worked with the file downloaded locally to make sure I got the result right. Those drafts are not included here.
After that, there are four solutions:
- A simple one that fetches the entire payload in one request.
- A streamed one, same idea, but processing the response in chunks.
- A parallel approach that speeds up the download using HTTP range requests (see the sketch below).
- A variant of the previous one that parallelizes the computation as well.
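The solutions themselves are not reproduced here, but as a rough sketch of the range-request idea, using only the standard library (the URL, part count, and buffer size are made up for illustration):

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import Request, urlopen

URL = "https://example-bucket.s3.amazonaws.com/data.txt"  # hypothetical
PARTS = 8
BUFSIZE = 4096

def content_length(url):
    # A HEAD request tells us the total size, so we can split the ranges.
    request = Request(url, method="HEAD")
    with urlopen(request) as response:
        return int(response.headers["Content-Length"])

def fetch_range(url, start, end):
    # Ask for just the bytes in [start, end], streaming by chunks.
    request = Request(url, headers={"Range": f"bytes={start}-{end}"})
    chunks = []
    with urlopen(request) as response:
        while chunk := response.read(BUFSIZE):
            chunks.append(chunk)
    return b"".join(chunks)

size = content_length(URL)
bounds = [(i * size // PARTS, (i + 1) * size // PARTS - 1) for i in range(PARTS)]
with ThreadPoolExecutor(max_workers=PARTS) as pool:
    parts = list(pool.map(lambda b: fetch_range(URL, *b), bounds))
payload = b"".join(parts)
```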
On my MacBook Pro with fiber over Wi-Fi (~280 Mbps) I get these times:
| Solution          | Runtime    | Memory |
|-------------------|------------|--------|
| simple            | ~5 min     | High   |
| streamed          | 2 min 55 s | Low    |
| parallel          | 1 min 20 s | High   |
| parallel streamed | 53 s       | Low    |
Since this is dominated by I/O, times vary, but they stay in that ballpark.
- S3 does not support compression, so there is no need to send `Accept-Encoding`.
- The average of a collection of numbers can be computed on the fly, incrementally; there is no need to collect them all, sum them, and divide. The first three solutions compute it that way. This is relevant since there are 10,906,858 data points. The fourth solution does that per process, and then computes the grand total from the partials (see the incremental-mean sketch after this list).
- Those formulas have divisions. Division is not exact even with arbitrary-precision decimals, but it is with rational numbers. I compared the performance of rationals and floats in this problem: the average with rationals needed about 3 minutes, while floats needed 20 seconds. The exact result computed with rationals is `1.7506631158120882`, and floats yield `1.7506631158119936`, which is not that bad (`1.75066311581206` in the 4th solution). A real use case should decide whether those decimals are important; the solutions use floats given this tiny error (see the rationals vs. floats sketch after this list).
- In general, it is better to reuse buffers where possible.
- As a rule of thumb, it is better to delegate as much as possible to the built-in functions, since they are often written in C. This is particularly important in buffered solutions (see the built-ins sketch after this list).
- I played a bit with buffer sizes: 4096 is a typical one that works well, 8192 made no measurable difference, and 1 MB slowed things down.
- Concatenation of bytearrays has terrible performance. The simple parallel downloader used it in some initial versions, and I could not understand why downloads were so slow. Lists are faster for that use case (see the last sketch after this list).
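A minimal sketch of the incremental mean mentioned above (the function names and the `(count, average)` pairs are my own illustration, not the repository's API):

```python
def mean(numbers):
    # Running mean: avg_n = avg_{n-1} + (x_n - avg_{n-1}) / n.
    # No need to keep the 10,906,858 data points around.
    average = 0.0
    for n, x in enumerate(numbers, start=1):
        average += (x - average) / n
    return average

def combine(partials):
    # Grand mean from per-process partials, as in the fourth solution:
    # the count-weighted mean of the partial means.
    total = sum(count for count, _ in partials)
    return sum(count * average for count, average in partials) / total
```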
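The rationals vs. floats comparison can be reproduced with the standard library's `fractions` module. This is just the gist, with made-up sample data, not the code I timed:

```python
from fractions import Fraction

floats = [0.5, 1.25, 2.0]  # made-up sample; the real data comes from the payload

exact = sum(Fraction(x) for x in floats) / len(floats)  # exact rational mean
approx = sum(floats) / len(floats)                      # float mean, tiny error
print(float(exact), approx)
```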
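As an example of leaning on built-ins in the buffered solutions, record boundaries can be located with `bytes.find`, which runs in C, instead of scanning byte by byte at the Python level. The buffer contents here are made up:

```python
buffer = b"1.5\n2.25\n0.75\n"  # made-up chunk of the payload

# Let find() (written in C) locate each newline instead of a Python loop.
pos = 0
while (end := buffer.find(b"\n", pos)) != -1:
    number = float(buffer[pos:end])  # float() parsing also happens in C
    pos = end + 1
```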
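For the concatenation issue, the fix was the usual accumulate-then-join pattern: repeated concatenation into a new object copies everything received so far on every step, which is quadratic, while appending chunks to a list and joining once at the end is linear. A sketch with hypothetical chunks:

```python
chunks = [b"a" * 4096 for _ in range(1000)]  # stand-in for downloaded chunks

# Quadratic: each concatenation allocates and copies a fresh buffer.
slow = bytearray()
for chunk in chunks:
    slow = slow + chunk

# Linear: collect references in a list, join once at the end.
fast = b"".join(chunks)
```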
It's been about 14 years since I did any Python. If the dog that flies the helicopter had a Twitter account, he would post a picture of me captioned "I have no idea what I am doing".