
```python
# This file, if written with pyarrow==2.0.0, can't be read by pyarrow==8.0.0
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([
    ("col1", pa.int8()),
    ("col2", pa.string()),
    ("col3", pa.float64()),
    ("col4", pa.dictionary(pa.int32(), pa.string(), ordered=False)),
])
```
bivald / batch-insert-duckdb.md (created May 22, 2023 11:41)

# Batch insert DuckDB

How I batch-read a Parquet file into DuckDB.

On a machine with plenty of RAM, the following works:

```python
con.sql("CREATE TABLE new_tbl AS SELECT * FROM read_parquet('file.parq')")
```

It uses about 20 GB of RAM or more, takes around 130 seconds, and produces a 3.42 GB DuckDB file.

To reduce memory usage, try instead to read the Parquet file in batches and insert each batch, keeping nothing more than needed (i.e. one row group at a time) in RAM: