Using `batched=True` slows things down a bit relative to `batched=False`, but it seems like the right default because the latter can lead to exhausting memory. (Even my MacBook with 32GB of RAM, effectively much more, had trouble a couple of times with `crsp.dsf` with variants of the above. Though in general it's fine and can get `crsp.dsf` from WRDS PG to local parquet in as little as seven minutes; it takes about two minutes from a local PG instance.)
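
For concreteness, here is a minimal sketch of what the two modes amount to. The connection details are the standard WRDS ones, but the variable names are illustrative and this is not the actual `pg_to_pq()` code:

```python
# Minimal sketch of the two modes (illustrative names, not the actual
# pg_to_pq() implementation).
import ibis
import pyarrow.parquet as pq

con = ibis.postgres.connect(
    host="wrds-pgdata.wharton.upenn.edu",  # WRDS PG host
    port=9737,
    database="wrds",
    user="my_wrds_id",  # placeholder credentials
)
dsf = con.table("dsf", database="crsp")  # schema argument varies by Ibis version

# batched=False analogue: materialize everything, then write once.
# Fast, but the whole table sits in memory.
# pq.write_table(dsf.to_pyarrow(), "dsf.parquet")

# batched=True analogue: stream record batches, holding only one
# batch in memory at a time.
reader = dsf.to_pyarrow_batches()
with pq.ParquetWriter("dsf.parquet", reader.schema) as writer:
    for batch in reader:
        writer.write_batch(batch)
```

The batched path holds only one record batch in memory at a time, which is why it is slower but much gentler on RAM.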
It would be good to add `keep` and `drop` arguments to `pg_to_pq()`. These wouldn't be snippets of SAS code; rather, I think it would be good to support a regular expression or a list of strings for each. I guess anything here would be built on Ibis selectors, as sketched below.
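
Something like the following could sit behind those arguments. The helper `apply_keep_drop()` is an invented name, not part of `wrds2pg`:

```python
# Hypothetical implementation of keep/drop on top of Ibis selectors.
import ibis.selectors as s

def apply_keep_drop(t, keep=None, drop=None):
    """Restrict columns of an Ibis table by regex (str) or list of names."""
    if isinstance(keep, str):
        t = t.select(s.matches(keep))   # keep columns matching a regex
    elif keep is not None:
        t = t.select(*keep)             # keep an explicit list of columns
    if isinstance(drop, str):
        t = t.select(~s.matches(drop))  # drop columns matching a regex
    elif drop is not None:
        t = t.drop(*drop)               # drop an explicit list of columns
    return t

# e.g., keep the identifiers and all return-like columns:
# dsf = apply_keep_drop(dsf, keep=r"^(permno|date|ret)")
```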
The `row_group_size` argument is not being used if `batched=True`. Yet somehow I end up with the same row-group size. Perhaps it's picked up from the schema passed into `pyarrow.parquet.ParquetWriter()`. Perhaps it reflects some default batch size.

Amendment: It must be the latter, because `row_group_size` seems completely unused with `batched=True`.
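
A hedged guess consistent with the amendment: in current Ibis, `to_pyarrow_batches()` takes a `chunk_size` argument defaulting to 1,000,000 rows, and each batch lands as (roughly) its own row group, so the row-group size observed is really the batch size. If so, the batched path could honor `row_group_size` by passing it through as `chunk_size`, roughly:

```python
# Sketch of honoring row_group_size in the batched path, assuming
# to_pyarrow_batches() accepts chunk_size (it does in current Ibis).
import ibis
import pyarrow.parquet as pq

row_group_size = 1_048_576  # illustrative target

con = ibis.duckdb.connect()          # any backend with to_pyarrow_batches()
t = con.read_parquet("dsf.parquet")  # stand-in for the WRDS table

# Each batch is written as its own row group (pyarrow may split very
# large batches), so sizing the batches sets the row-group size.
reader = t.to_pyarrow_batches(chunk_size=row_group_size)
with pq.ParquetWriter("dsf_regrouped.parquet", reader.schema) as writer:
    for batch in reader:
        writer.write_batch(batch)
```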
@blucap The above may be of interest. Suggestions welcomed.
Is it necessary to specify Python 3.11 (not 3.12) for now?
See here.
The above probably belongs in a separate package from `wrds2pg`, which I view as a "WRDS SAS to stuff" package. It seems that all this needs is `pip install 'ibis-framework[duckdb]'` and `pip install pyarrow`, which seems OK. As with `wrds2pg`, these dependencies could be set in the package.
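
For instance, a hypothetical `pyproject.toml` for the new package might declare (the package name is a placeholder):

```toml
[project]
name = "pg2pq"  # placeholder name for the hypothetical new package
dependencies = [
    "ibis-framework[duckdb]",
    "pyarrow",
]
```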