Microbenchmarks to send large data via a REST API

Trying to use duckdb to perform some filters/groupbys on a parquet file and send a lot of data to a browser. Approximately 100 MB of data was being sent to the browser. Found that the API was quite slow, and went looking for options to make it faster.

Frameworks being used:

  1. Chrome 133.x
  2. Angular 18.x
  3. Python 3.11.x
  4. Flask 3.0.x
  5. duckdb 1.1.x

Snippets of code to optimize:

@flask_app.route("/read_data")
def read_data():
    data = get_duck_db_filtered_data()
    result = data.df().to_dict(orient="split")
    count = len(data)
    return {"result": result, "total_count": count}

On the Browser:

this.http.get<any>('/read_data').subscribe(payload => {
  // The route above returns {result: {columns, data, ...}, total_count}
  return payload.result.data.map((row: string[]) => {
    const rowObject: Record<string, string> = {};

    payload.result.columns.forEach((column: string, colIndex: number) => {
        rowObject[column] = row[colIndex];
    });

    return rowObject;
  });
});

Approach 1: Optimize the backend code that converts to a dict

In [1]: len(data)
Out[1]: 500000

pandas trials:

# Try pandas returning {col1: [v1, v2, ...], col2: [v1, v2, ...], ...}
In [1]: %time  _ = data.df().to_dict(orient='list')
CPU times: user 3.67 s, sys: 1.57 s, total: 5.24 s
Wall time: 7.09 s

# Try pandas returning {col1: pd.Series<v1, v2, ...>, col2: pd.Series<v1, v2, ...>, ...}
In [2]: %time  _ = data.df().to_dict(orient='series')
CPU times: user 1.98 s, sys: 1.19 s, total: 3.17 s
Wall time: 2.66 s

# Try pandas returning {columns: [col1, col2, ...], data: [[v1, v1, ...], [v2, v2, ...], ...], ...}
In [3]: %time _ = data.df().to_dict(orient='split')
CPU times: user 4.1 s, sys: 551 ms, total: 4.65 s
Wall time: 4.91 s

polars trials:

# Try polars returning {col1: pl.Series<v1, v2, ...>, col2: pl.Series<v1, v2, ...>, ...}
In [1]: %time _= data.pl().to_dict()
CPU times: user 1.14 s, sys: 180 ms, total: 1.32 s
Wall time: 766 ms

# Try polars returning {col1: [v1, v2, ...], col2: [v1, v2, ...], ...}
In [2]: %time _= data.pl().to_dict(as_series=False)
CPU times: user 2.69 s, sys: 480 ms, total: 3.16 s
Wall time: 3.1 s

# Try polars + dict returning {col1: [v1, v2, ...], col2: [v1, v2, ...], ...}
In [3]: %time _ = {k: v.to_list() for k,v in data.pl().to_dict().items()}
CPU times: user 2.82 s, sys: 391 ms, total: 3.21 s
Wall time: 2.75 s
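
Note: of the polars outputs above, only the list-based ones are directly JSON-serializable - a dict of pl.Series is not. A quick standalone check (not part of the original benchmark):

import json
import polars as pl

df = pl.DataFrame({"a": [1, 2], "b": ["x", "y"]})
json.dumps(df.to_dict(as_series=False))   # works: plain dict of lists
# json.dumps(df.to_dict())                # TypeError: Series is not JSON serializable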

duckdb trials:

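# Try duckdb returning rows as a list of tuples: [(v1, v1, ...), (v2, v2, ...), ...]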
In [3]: %time _ = data.fetchall()
CPU times: user 1.81 s, sys: 284 ms, total: 2.09 s
Wall time: 2.18 s

Conclusion: Converting duckdb to polars is blazing fast, while converting duckdb to pandas is quite slow. Because the duckdb->pandas conversion dominates, every variation of to_dict() on the resulting pd.DataFrame is also slow.

Using polars limits us to fewer output formats, but it is much faster.
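
Based on that, a sketch of the route rewritten around the fast duckdb->polars path, producing the same {columns, data} shape the frontend expects (a sketch, assuming get_duck_db_filtered_data() returns a duckdb relation as above):

@flask_app.route("/read_data")
def read_data():
    df = get_duck_db_filtered_data().pl()  # duckdb -> polars: the fast conversion
    # df.rows() is a list of tuples, which Flask serializes as JSON arrays,
    # matching the {columns, data} shape of pandas' orient="split"
    result = {"columns": df.columns, "data": df.rows()}
    return {"result": result, "total_count": df.height}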

Approach 2: Optimize the frontend code that reads the data

Benchmarking a few ways in the browser to convert the {columns, data} payload into row objects:

const { columns, data: rows } = data;

console.time('foreach');
const val = rows.map((row: string[]) => {
    const rowObject: Record<string, string> = {};
    columns.forEach((column: string, colIndex: number) => {
        rowObject[column] = row[colIndex];
    });
    return rowObject;
});
console.timeEnd('foreach');
// foreach: 207.812 ms

console.time('fromEntries');
rows.map((row: string[]) =>
    Object.fromEntries(columns.map((column: string, colIndex: number) => [column, row[colIndex]])),
);
console.timeEnd('fromEntries');
// fromEntries: 735.25 ms

console.time('lodash');
rows.map(row => zipObject(columns, row)); // import { zipObject } from 'lodash'
console.timeEnd('lodash');
// lodash: 382.171 ms

console.time('for-loop');
const numRows = rows.length;
const numCols = columns.length;
const result = new Array(numRows); // Pre-allocate array for speed

for (let i = 0; i < numRows; i++) {
    const rowObject: Record<string, string> = {};
    for (let j = 0; j < numCols; j++) {
        rowObject[columns[j]] = rows[i][j];
    }
    result[i] = rowObject; // Assign directly to pre-allocated array
}
console.timeEnd('for-loop');
// for-loop: 200.572 ms

Conclusion: "map + forEach" and the plain for-loop gave nearly the same performance; lodash was a bit slower, and Object.fromEntries was by far the slowest.
