TomNicholas/test_icechunk_refs_at_scale.ipynb

Created November 20, 2024 19:59


Testing writing huge amounts of virtual references into icechunk


martindurant commented Dec 3, 2024

So the results: 10M references, which are 3.2GB in memory, take 16GB of space on disk and peak at >128GB of memory while writing to remote?

I realise this is at the extreme, but the on-disk size really ought to be smaller than the in-memory size, not bigger, and the memory peak should be no more than double the in-memory size.
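
To put those figures on a per-reference basis, here is a small back-of-the-envelope sketch. It assumes decimal units (1 GB = 10**9 bytes) and treats the ">128GB" peak as a lower bound; the variable and function names are made up for illustration.

```python
# Figures quoted in the comment above; decimal gigabytes are an assumption.
N_REFS = 10_000_000
IN_MEMORY_BYTES = 3_200_000_000      # 3.2GB of references in memory
ON_DISK_BYTES = 16_000_000_000       # 16GB on disk
PEAK_BYTES = 128_000_000_000         # ">128GB", so this is only a lower bound


def per_ref(total_bytes: int, n_refs: int = N_REFS) -> float:
    """Average number of bytes attributable to one reference."""
    return total_bytes / n_refs


summary = {
    "in_memory_bytes_per_ref": per_ref(IN_MEMORY_BYTES),
    "on_disk_bytes_per_ref": per_ref(ON_DISK_BYTES),
    "disk_to_memory_ratio": ON_DISK_BYTES / IN_MEMORY_BYTES,
    "peak_to_memory_ratio_at_least": PEAK_BYTES / IN_MEMORY_BYTES,
}

for name, value in summary.items():
    print(f"{name}: {value:g}")
```

Under those assumptions, each reference costs about 320 bytes in memory and 1600 bytes on disk, and the write peak is at least 40 times the in-memory size, against the "no more than double" expectation.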

TomNicholas commented Dec 3, 2024

Author

I agree - I made those points in the issue I raised:

earth-mover/icechunk#401

martindurant commented Dec 3, 2024

A single msgpack reference file is 186MB. If I format it as such:

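import msgspec

with open("07534EYBEC0SJ5P0B700", "rb") as f:  # one msgpack manifest file
    out = msgspec.msgpack.decode(f.read())
# NOTE: `chunks` is not defined in the snippet as posted; presumably it is the chunk mapping taken from `out`.
out2 = [{"node": key[0], "indices": key[1], "vitual": {"absolute": val["Virtual"][0]['Absolute'], "offset": val["Virtual"][1], "size": val["Virtual"][2]}, "Inline": None, "Ref": None}
    for key, val in chunks.items()]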

and save this to parquet with Zstd and schema like

1000000 * {
    node: string,
    indices: (
        int64,
        int64,
        int64
    ),
    vitual: {
        absolute: string,
        offset: int64,
        size: int64
    },
    Inline: ?unknown,
    Ref: ?unknown
}

I get sizes:
178M 07534EYBEC0SJ5P0B700
5.9M 07534EYBEC0SJ5P0B700.parquet

I accept that the case of having exactly the same URL for every chunk is unlikely, but in general they will tend to be very similar.
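
As a rough illustration of why near-identical URLs shrink so much, here is a standard-library sketch. It is not the Parquet/Zstd experiment above: `json` stands in for msgpack, `zlib` stands in for Zstd, and the references, the URL and the `make_refs` helper are all synthetic and hypothetical.

```python
import json
import zlib


def make_refs(n, url="s3://example-bucket/data.nc", chunk_size=4096):
    """Build n synthetic virtual references that all point at one (made-up) URL."""
    return [
        {
            "node": "/air",
            "indices": [i, 0, 0],
            "virtual": {"absolute": url, "offset": i * chunk_size, "size": chunk_size},
        }
        for i in range(n)
    ]


def row_bytes(refs):
    """Row-oriented encoding: one record per reference (json as a msgpack stand-in)."""
    return json.dumps(refs).encode()


def columnar_bytes(refs):
    """Column-oriented encoding: one list per field, loosely like a Parquet layout."""
    columns = {
        "node": [r["node"] for r in refs],
        "indices": [r["indices"] for r in refs],
        "absolute": [r["virtual"]["absolute"] for r in refs],
        "offset": [r["virtual"]["offset"] for r in refs],
        "size": [r["virtual"]["size"] for r in refs],
    }
    return json.dumps(columns).encode()


refs = make_refs(20_000)
raw = row_bytes(refs)
compressed_rows = zlib.compress(raw, 9)
compressed_cols = zlib.compress(columnar_bytes(refs), 9)

print("uncompressed rows:", len(raw))
print("compressed rows:  ", len(compressed_rows))
print("compressed cols:  ", len(compressed_cols))
```

The exact sizes depend on the data and the compressor, so this only shows the direction of the effect: the repeated URL and field names compress away almost entirely, which is the same redundancy Parquet with Zstd exploits.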

(btw: I got way higher memory use while running this, over 45GB allocated at peak, which would explain crashing Coiled)

Now will look at the linked discussion...
