@mhconradt
Last active December 7, 2021 22:37
Attempting to write pl.Categorical to Parquet
from string import ascii_uppercase
import numpy as np
import polars as pl
N = 5  # N > 4 is needed to break the to_parquet/read_parquet round trip;
# for smaller N, the categories need to be multiple characters long
cats = [a + b for a in ascii_uppercase for b in ascii_uppercase][:N]
df = pl.DataFrame(
    {
        "cats": pl.Series(cats).cast(pl.datatypes.Categorical),
        "numbers": np.arange(N),
    }
)
df.to_parquet("df.parquet", use_pyarrow=True)  # this works (pyarrow writer)
_ = pl.read_parquet("df.parquet")
df.drop("cats").to_parquet("df.parquet")  # this does too (no Categorical column)
_ = pl.read_parquet("df.parquet")
(
    df.with_column(df.cats.cast(pl.datatypes.Utf8))
    .to_parquet("df.parquet")  # so does this (Categorical cast to Utf8)
)
_ = pl.read_parquet("df.parquet")
df.to_parquet("df.parquet", use_pyarrow=False)  # this doesn't (Categorical column, native writer)
_ = pl.read_parquet("df.parquet")