@iangow
Created December 30, 2023 12:26
Notebook for photo-package for RAM-friendly DB-to-parquet conversion

iangow commented Dec 30, 2023

Almost every option I have tried for converting large database tables to parquet ends up loading the full data set into memory as part of the process. After much dabbling, I have found a way to avoid this. (Dabbling was required because these features are not well documented.)

In essence, the pg_to_pq() function above uses Ibis's to_pyarrow_batches() method to create a pyarrow.RecordBatchReader, a stream of pyarrow record batches that can be written to a parquet file one batch at a time.

The complication with using pyarrow.parquet.ParquetWriter is the need to give it a schema before writing any batches. I can inspect the batches, but these are in pyarrow format and some conversion is needed to produce the parquet schema. The trick I use is to take a small number of rows, write them to a temporary parquet file, and then feed that file's schema to pyarrow.parquet.ParquetWriter. This way I let pyarrow figure out how to map pyarrow types to parquet.
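For the record, the core of the approach is roughly the following minimal sketch. The connection URI, table name, output file, and the 100-row sample used for schema discovery are placeholders, not what the notebook actually uses:

```python
import os
import tempfile

import ibis
import pyarrow.parquet as pq

con = ibis.duckdb.connect()
tbl = con.read_postgres("postgres://user@host/dbname", table_name="dsf")

# Step 1: write a small sample to a temporary parquet file and read back its
# schema, so pyarrow works out the pyarrow-to-parquet type mapping for us.
with tempfile.TemporaryDirectory() as tmpdir:
    tmp_file = os.path.join(tmpdir, "schema.parquet")
    tbl.limit(100).to_parquet(tmp_file)
    schema = pq.read_schema(tmp_file)

# Step 2: stream the full table as pyarrow record batches, writing each batch
# as it arrives so the full data set is never held in memory.
with pq.ParquetWriter("dsf.parquet", schema) as writer:
    for batch in tbl.to_pyarrow_batches():
        writer.write_batch(batch)
```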

Note that the pg_to_pq() function above could be generalised beyond PostgreSQL to any source that the DuckDB backend for Ibis can read; I see read_postgres(), read_csv(), and read_sqlite() there.
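Swapping the source would then be roughly a one-line change (paths and table names below are made up, and I haven't tested the CSV or SQLite variants):

```python
# Same streaming pipeline, different source (paths are placeholders).
tbl = con.read_csv("dsf.csv")
tbl = con.read_sqlite("crsp.db", table_name="dsf")
```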

The pg_to_pq() function can also use the PostgreSQL backend for Ibis, but for some reason that doesn't work with batches. Also, DuckDB can process PostgreSQL data much faster than the PostgreSQL backend itself can. So I might drop the use_duckdb=False option.

For me, it was important to include the col_types argument. I can't stand the idea of permno as double precision, as matching across tables on a floating-point key just feels wrong. Such matching also used to be very slow, though it seems this has been fixed in more recent versions of PostgreSQL.
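For what it's worth, all col_types really needs to do is a cast along these lines before any batches are written (the column names and types here are just illustrative):

```python
# Sketch: cast identifier columns up front so, e.g., permno is stored as an
# integer in the parquet file rather than as double precision.
col_types = {"permno": "int32", "permco": "int32"}
tbl = tbl.mutate(**{col: tbl[col].cast(typ) for col, typ in col_types.items()})
```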

iangow commented Dec 30, 2023

I probably should do the initial write to a temporary file and then move it over to pq_file at the last moment, as this would avoid clobbering the existing file right away. As it stands, the work-in-progress version of the file is visible to the user but not usable, and if something goes wrong, the old file is gone.
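Something along these lines would do it (a sketch only; write_fn stands in for the batched writing loop above):

```python
import os
import tempfile

def write_atomically(write_fn, pq_file):
    # Write to a temporary file in the same directory, then atomically replace
    # the target, so a failed or partial write never clobbers the existing file.
    target_dir = os.path.dirname(os.path.abspath(pq_file))
    fd, tmp_path = tempfile.mkstemp(suffix=".parquet", dir=target_dir)
    os.close(fd)
    try:
        write_fn(tmp_path)             # e.g. the ParquetWriter loop above
        os.replace(tmp_path, pq_file)  # atomic when on the same filesystem
    except BaseException:
        os.remove(tmp_path)
        raise
```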

iangow commented Dec 30, 2023

The above probably belongs in a separate package from wrds2pg, which I view as a "WRDS SAS to stuff" package.

It seems that all this needs is pip install 'ibis-framework[duckdb]' and pip install pyarrow, which seems OK. As with wrds2pg, these could be declared as dependencies of the package.

iangow commented Dec 30, 2023

Using batched=True slows things down a bit relative to batched=False, but it seems like the right default because the latter can exhaust memory. (Even my MacBook with 32GB of RAM, which in practice behaves like much more, had trouble a couple of times with crsp.dsf using variants of the above. In general, though, it's fine and can get crsp.dsf from WRDS PostgreSQL to local parquet in as little as seven minutes, or about two minutes from a local PostgreSQL instance.)

iangow commented Dec 30, 2023

It would be good to add keep and drop arguments to pg_to_pq(). These wouldn't be snippets of SAS code; I think it would be better to support a regular expression or a list of strings for each. I guess these would be built on Ibis selectors, roughly as sketched below.
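Roughly what I have in mind (the regex and column names are made up):

```python
import ibis.selectors as s

# keep as a regular expression: retain only matching columns
tbl = tbl.select(s.matches(r"^(permno|date|ret|prc)$"))

# keep as a list of strings: plain column names work directly in select()
tbl = tbl.select("permno", "date", "ret")

# drop: select the complement of a selector
tbl = tbl.select(~s.matches(r"^(vol|shrout)$"))
```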

iangow commented Dec 30, 2023

The row_group_size argument is not used when batched=True, yet somehow I end up with the same row-group size in the output. Perhaps it's picked up from the schema passed into pyarrow.parquet.ParquetWriter(), or perhaps it just reflects the default batch size.

Amendment: it must be the latter, because row_group_size is completely unused when batched=True.
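If I ever want batched=True to honour row_group_size, one option (an untested sketch) would be to buffer batches and flush them through write_table(), which does take a row_group_size:

```python
import pyarrow as pa
import pyarrow.parquet as pq

def write_batches(batches, schema, pq_file, row_group_size=1_048_576):
    # Each write_batch() call appears to emit its own row group, so with
    # batched=True the row-group size ends up being the batch size. Buffering
    # batches and flushing through write_table() restores control over it.
    buffer, buffered_rows = [], 0
    with pq.ParquetWriter(pq_file, schema) as writer:
        for batch in batches:
            buffer.append(batch)
            buffered_rows += batch.num_rows
            if buffered_rows >= row_group_size:
                writer.write_table(
                    pa.Table.from_batches(buffer, schema=schema),
                    row_group_size=row_group_size)
                buffer, buffered_rows = [], 0
        if buffer:
            writer.write_table(
                pa.Table.from_batches(buffer, schema=schema),
                row_group_size=row_group_size)
```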

iangow commented Dec 30, 2023

@blucap The above may be of interest. Suggestions welcomed.

iangow commented Dec 30, 2023

Is it necessary to specify Python 3.11 (not 3.12) for now?

iangow commented Dec 30, 2023

See here.
