Created
November 20, 2024 19:59
Testing writing huge amounts of virtual references into icechunk
Author
A single msgpack reference file is 186 MB. If I reshape its entries like this:
import msgspec

with open("07534EYBEC0SJ5P0B700", "rb") as f:
    out = msgspec.msgpack.decode(f.read())

# `chunks` maps (node, indices) -> payload; how it is taken from `out` is not shown
out2 = [
    {
        "node": key[0],
        "indices": key[1],
        "vitual": {
            "absolute": val["Virtual"][0]["Absolute"],
            "offset": val["Virtual"][1],
            "size": val["Virtual"][2],
        },
        "Inline": None,
        "Ref": None,
    }
    for key, val in chunks.items()
]
and save this to Parquet with Zstd compression and a schema like
1000000 * {
    node: string,
    indices: (
        int64,
        int64,
        int64
    ),
    vitual: {
        absolute: string,
        offset: int64,
        size: int64
    },
    Inline: ?unknown,
    Ref: ?unknown
}
I get sizes:
178M 07534EYBEC0SJ5P0B700
5.9M 07534EYBEC0SJ5P0B700.parquet
I accept that the case of having exactly the same URL for every chunk is unlikely, but in general the URLs in a manifest will tend to be very similar to one another.
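To illustrate why similar URLs should still compress well, here is a rough sketch using only the standard library. It is not the Parquet/Zstd pipeline above: zlib stands in for Zstd, JSON stands in for the on-disk encoding, and the bucket name, file names, node name, offsets and sizes are all made up. It compares compressing row-oriented records against a column-oriented layout in which all URLs sit next to each other.

```python
import json
import zlib

# Hypothetical data: 10,000 virtual chunk references pointing at similar
# (but not identical) URLs. Bucket and file names are invented.
n_chunks = 10_000
refs = [
    {
        "node": "air_temperature",
        "indices": (i // 100, i % 100, 0),
        "absolute": f"s3://example-bucket/data/file_{i // 100:04d}.nc",
        "offset": 4096 + i * 512,
        "size": 512,
    }
    for i in range(n_chunks)
]


def compressed_size(obj) -> int:
    """Size in bytes of obj after JSON encoding and zlib compression."""
    return len(zlib.compress(json.dumps(obj).encode("utf-8"), level=9))


# Row-oriented: one record per chunk, with the fields interleaved.
row_oriented = refs

# Column-oriented: one list per field, so similar values sit together.
column_oriented = {key: [ref[key] for ref in refs] for key in refs[0]}

raw_size = len(json.dumps(row_oriented).encode("utf-8"))
row_size = compressed_size(row_oriented)
col_size = compressed_size(column_oriented)
print(f"raw: {raw_size} B, rows+zlib: {row_size} B, columns+zlib: {col_size} B")
```

The exact numbers depend on the data and the codec, so treat the output as indicative only. A columnar format like Parquet can additionally dictionary-encode repeated strings, which this sketch does not attempt.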
(btw: I got way higher memory use while running this, over 45GB allocated at peak, which would explain crashing Coiled)
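For anyone wanting to reproduce a peak-memory number like this on a smaller scale, one option is the standard library's tracemalloc. This is only a sketch: `build_refs` is a hypothetical stand-in for the decode-and-reshape step, and tracemalloc only tracks memory allocated through Python's allocators, so memory allocated independently by native libraries may not show up.

```python
import tracemalloc


def build_refs(n: int) -> list[dict]:
    """Hypothetical stand-in for decoding and reshaping the references."""
    return [
        {"node": "a", "indices": (i, 0, 0), "offset": i * 512, "size": 512}
        for i in range(n)
    ]


tracemalloc.start()
refs = build_refs(100_000)
# get_traced_memory() returns (current, peak) in bytes since start().
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"current: {current / 1e6:.1f} MB, peak: {peak / 1e6:.1f} MB")
```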
Now will look at the linked discussion...
I agree - I made those points in the issue I raised:
earth-mover/icechunk#401