I'm using duckdb to perform some filters/group-bys on a parquet file and send a lot of data (approximately 100 MB) to a browser. The API turned out to be quite slow, and I was looking for options to make it faster.
Environment:
- Chrome 133.x
- Angular 18.x
- Python 3.11.x
- Flask 3.0.x
- duckdb 1.1.x
Snippets of code to optimize:
@flask_app.route("/read_data")
def read_data():
    data = get_duck_db_filtered_data()  # DuckDB relation over the parquet file
    # duckdb -> pandas -> {"index": [...], "columns": [...], "data": [...]}
    result = data.df().to_dict(orient="split")
    count = len(data)
    return {"result": result, "total_count": count}
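For reference, orient="split" produces an index/columns/data dict; a tiny illustration with toy data (not the real payload):

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
print(df.to_dict(orient="split"))
# {'index': [0, 1], 'columns': ['a', 'b'], 'data': [[1, 'x'], [2, 'y']]}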
On the browser:
this.http.get('/read_data').subscribe((payload: any) => {
  // payload.result is the pandas "split" dict: { index, columns, data }
  const rows = payload.result.data.map((row: string[]) => {
    const rowObject: Record<string, string> = {};
    payload.result.columns.forEach((column: string, colIndex: number) => {
      rowObject[column] = row[colIndex];
    });
    return rowObject;
  });
  // ... rows is then rendered
});
All trials below run against the same 500,000-row result set:
In [1]: len(data)
Out[1]: 500000
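In the timings, data is a DuckDB relation along these lines; the file name and filter are hypothetical stand-ins for get_duck_db_filtered_data():

import duckdb

# Hypothetical stand-in for get_duck_db_filtered_data(): a lazy relation
# over a parquet file. .df(), .pl(), and .fetchall() each execute it.
data = duckdb.sql("""
    SELECT *
    FROM 'events.parquet'   -- hypothetical file
    WHERE category = 'a'    -- hypothetical filter
""")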
pandas trials:
# Try pandas returning {col1: [v1, v2, ...], col2: [v1, v2, ...], ...}
In [1]: %time _ = data.df().to_dict(orient='list')
CPU times: user 3.67 s, sys: 1.57 s, total: 5.24 s
Wall time: 7.09 s
# Try pandas returning {col1: pd.Series<v1, v2, ...>, col2: pd.Series<v1, v2, ...>, ...}
In [2]: %time _ = data.df().to_dict(orient='series')
CPU times: user 1.98 s, sys: 1.19 s, total: 3.17 s
Wall time: 2.66 s
# Try pandas returning {index: [...], columns: [col1, col2, ...], data: [[v1, v1, ...], [v2, v2, ...], ...]}
In [3]: %time _ = data.df().to_dict(orient='split')
CPU times: user 4.1 s, sys: 551 ms, total: 4.65 s
Wall time: 4.91 s
polars trials:
# Try polars returning {col1: pl.Series<v1, v2, ...>, col2: pl.Series<v1, v2, ...>, ...}
In [1]: %time _ = data.pl().to_dict()
CPU times: user 1.14 s, sys: 180 ms, total: 1.32 s
Wall time: 766 ms
# Try polars returning {col1: [v1, v2, ...], col2: [v1, v2, ...], ...}
In [2]: %time _ = data.pl().to_dict(as_series=False)
CPU times: user 2.69 s, sys: 480 ms, total: 3.16 s
Wall time: 3.1 s
# Try polars + dict returning {col1: [v1, v2, ...], col2: [v1, v2, ...], ...}
In [3]: %time _ = {k: v.to_list() for k,v in data.pl().to_dict().items()}
CPU times: user 2.82 s, sys: 391 ms, total: 3.21 s
Wall time: 2.75 s
duckdb trials:
# Try duckdb returning [(v1, v1, ...), (v2, v2, ...), ...] (rows as tuples)
In [3]: %time _ = data.fetchall()
CPU times: user 1.81 s, sys: 284 ms, total: 2.09 s
Wall time: 2.18 s
Conclusion: converting duckdb to polars is blazing fast, while converting duckdb to pandas is quite slow. And because the duckdb -> pandas step dominates, every to_dict() variant on the resulting pd.DataFrame is slow as well. Going through polars restricts us to a few output formats, but it is clearly the faster route.
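Based on that, here is a minimal sketch of the endpoint rewritten on the polars path. It keeps the row-major {columns, data} shape the browser code expects; df.rows() (plain row tuples) and df.height are standard polars, but note this exact variant was not timed above:

@flask_app.route("/read_data")
def read_data():
    data = get_duck_db_filtered_data()
    df = data.pl()  # duckdb -> polars: the fast conversion from the trials
    # Row tuples serialize as JSON arrays, matching the old "split" data layout.
    return {
        "result": {"columns": df.columns, "data": df.rows()},
        "total_count": df.height,
    }

That leaves the browser-side cost: turning {columns, data} into row objects. Four approaches, timed on the same 500,000 rows: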
const { columns, data: rows } = data;

console.time('foreach');
const val = rows.map((row: string[]) => {
  const rowObject: Record<string, string> = {};
  columns.forEach((column: string, colIndex: number) => {
    rowObject[column] = row[colIndex];
  });
  return rowObject;
});
console.timeEnd('foreach');
// foreach: 207.812 ms

console.time('fromEntries');
rows.map((row: string[]) =>
  Object.fromEntries(columns.map((column: string, colIndex: number) => [column, row[colIndex]])),
);
console.timeEnd('fromEntries');
// fromEntries: 735.25 ms
// zipObject comes from lodash: import { zipObject } from 'lodash';
console.time('lodash');
rows.map((row: string[]) => zipObject(columns, row));
console.timeEnd('lodash');
// lodash: 382.171 ms
console.time('for-loop');
const numRows = rows.length;
const numCols = columns.length;
const result = new Array(numRows); // pre-allocate the output array
for (let i = 0; i < numRows; i++) {
  const rowObject: Record<string, string> = {};
  for (let j = 0; j < numCols; j++) {
    rowObject[columns[j]] = rows[i][j];
  }
  result[i] = rowObject; // write directly into the pre-allocated slot
}
console.timeEnd('for-loop');
// for-loop: 200.572 ms
Conclusion: "map + forEach" and the plain for-loop are nearly tied; lodash's zipObject is a bit slower, and Object.fromEntries is by far the slowest.