
```python
# This file, if written with pyarrow==2.0.0, can't be read by pyarrow==8.0.0
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([
    ("col1", pa.int8()),
    ("col2", pa.string()),
    ("col3", pa.float64()),
    ("col4", pa.dictionary(pa.int32(), pa.string(), ordered=False)),
])
```
bivald / batch-insert-duckdb.md (created May 22, 2023 11:41)

# Batch insert DuckDB

How I batch-read a Parquet file into DuckDB.

On a machine with plenty of RAM, the following works:

```python
con.sql("CREATE TABLE new_tbl AS SELECT * FROM read_parquet('file.parq')")
```

It uses about 20 GB of RAM or more, takes around 130 seconds, and produces a 3.42 GB DuckDB file.

To reduce memory usage, try instead to read the Parquet file in batches and insert each batch, keeping nothing more than needed (i.e. one row group at a time) in RAM: