Created
November 20, 2024 19:59
Testing writing huge amounts of virtual references into icechunk
Author
I agree - I made those points in the issue I raised:
A single msgpack reference file is 186MB. If I format it as such:
import msgspec

with open("07534EYBEC0SJ5P0B700", "rb") as f:
    out = msgspec.msgpack.decode(f.read())
# `chunks` is the chunk-reference mapping taken from `out` (that step is not shown here)
out2 = [
    {
        "node": key[0],
        "indices": key[1],
        "virtual": {
            "absolute": val["Virtual"][0]["Absolute"],
            "offset": val["Virtual"][1],
            "size": val["Virtual"][2],
        },
        "Inline": None,
        "Ref": None,
    }
    for key, val in chunks.items()
]
and save this to parquet with Zstd compression and a schema like
1000000 * {
    node: string,
    indices: (
        int64,
        int64,
        int64
    ),
    virtual: {
        absolute: string,
        offset: int64,
        size: int64
    },
    Inline: ?unknown,
    Ref: ?unknown
}
I get sizes:
178M 07534EYBEC0SJ5P0B700
5.9M 07534EYBEC0SJ5P0B700.parquet
I accept that the case of having exactly the same URL for every chunk is unlikely, but in general they will tend to be very similar.
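To illustrate why near-identical URLs matter so much, here is a minimal sketch using only the standard library. It is not the experiment above: it uses JSON as a stand-in for msgpack and zlib as a stand-in for Zstd, and the bucket prefix, node name, offsets and sizes are all made up for illustration.

```python
import json
import zlib

# Hypothetical URL prefix, purely for illustration.
PREFIX = "s3://example-bucket/path/to/data/file_"


def make_refs(n):
    """Build n synthetic virtual chunk references whose URLs share a long prefix."""
    return [
        {
            "node": "/group/array",
            "indices": [i // 100, i % 100, 0],
            "virtual": {
                "absolute": f"{PREFIX}{i // 1000:04d}.nc",
                "offset": 4096 + (i % 1000) * 1_000_000,
                "size": 1_000_000,
            },
        }
        for i in range(n)
    ]


refs = make_refs(100_000)
raw = json.dumps(refs).encode("utf-8")   # uncompressed serialization
packed = zlib.compress(raw, level=6)     # general-purpose compression
ratio = len(raw) / len(packed)
print(f"raw: {len(raw)} bytes, compressed: {len(packed)} bytes, ratio: {ratio:.1f}x")
```

The exact ratio depends on the data and the codec, so treat the number as indicative only. A columnar format like parquet can do better still, since each column (URLs, offsets, sizes) is encoded and compressed separately.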
(btw: I got way higher memory use while running this, over 45GB allocated at peak, which would explain crashing Coiled)
Now will look at the linked discussion...
So, the results: 10M references take 3.2GB in memory, 16GB on disk, and >128GB of memory at peak while writing to remote?
I realise this is at the extreme end, but the on-disk size really ought to be smaller than the in-memory size, not bigger, and the peak memory should be no more than double the in-memory size.
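For scale, the figures above work out as follows. This is a quick back-of-the-envelope calculation, assuming decimal units (1GB = 10^9 bytes) and treating the >128GB peak as a lower bound:

```python
N_REFS = 10_000_000
GB = 10**9  # assumption: decimal gigabytes

in_memory = 3_200_000_000   # 3.2GB of references in memory
on_disk = 16 * GB           # 16GB on disk
peak = 128 * GB             # lower bound on peak memory while writing

bytes_per_ref_memory = in_memory // N_REFS   # bytes per reference in memory
bytes_per_ref_disk = on_disk // N_REFS       # bytes per reference on disk
disk_vs_memory = on_disk / in_memory         # how much bigger on disk
peak_vs_memory = peak / in_memory            # peak memory relative to the data itself

print(bytes_per_ref_memory, bytes_per_ref_disk, disk_vs_memory, peak_vs_memory)
```

So each reference costs about 320 bytes in memory but 1600 bytes on disk (5x), and the peak during writing is at least 40x the size of the references themselves.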