Testing writing huge amounts of virtual references into icechunk

A single msgpack reference file is 186MB. If I reformat its contents like this:

import msgspec
out = msgspec.msgpack.decode(open("07534EYBEC0SJ5P0B700", "rb").read())
# `chunks` is the (node, indices) -> reference mapping taken from `out` (that step is not shown here)
out2 = [{"node": key[0], "indices": key[1], "vitual": {"absolute": val["Virtual"][0]['Absolute'], "offset": val["Virtual"][1], "size": val["Virtual"][2]}, "Inline": None, "Ref": None}
    for key, val in chunks.items()]

and save this to Parquet with Zstd compression, with a schema like

1000000 * {
    node: string,
    indices: (
        int64,
        int64,
        int64
    ),
    vitual: {
        absolute: string,
        offset: int64,
        size: int64
    },
    Inline: ?unknown,
    Ref: ?unknown
}

I get sizes:
178M 07534EYBEC0SJ5P0B700
5.9M 07534EYBEC0SJ5P0B700.parquet

I accept that the case of having exactly the same URL for every chunk is unlikely, but in general they will tend to be very similar.
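
A rough way to check the "very similar URLs" point, using only the standard library. Note that this swaps in zlib for Zstd, and the URLs are invented, so the ratios are only indicative of the trend, not of what Parquet with Zstd would give.

```python
import zlib


def compressed_ratio(urls):
    """Return raw size divided by zlib-compressed size for a column of URLs."""
    raw = "\n".join(urls).encode()
    return len(raw) / len(zlib.compress(raw, 6))


n = 100_000

# Case 1: every chunk points at exactly the same (hypothetical) URL.
identical = ["s3://example-bucket/data/file.nc"] * n

# Case 2: URLs share a long prefix and change only every 100 chunks.
similar = [f"s3://example-bucket/data/file_{i // 100:05d}.nc" for i in range(n)]

ratio_identical = compressed_ratio(identical)
ratio_similar = compressed_ratio(similar)
print(ratio_identical, ratio_similar)
```

Both cases should compress well, since a URL column that is stored contiguously is mostly repeated prefixes.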

(btw: I got way higher memory use while running this, over 45GB allocated at peak, which would explain crashing Coiled)

Now will look at the linked discussion...

TomNicholas/test_icechunk_refs_at_scale.ipynb

TomNicholas commented Dec 3, 2024

martindurant commented Dec 3, 2024
