@TomNicholas
Created November 20, 2024 19:59
Testing writing huge amounts of virtual references into icechunk
@martindurant

So, the results: 10M references occupying 3.2GB in memory take 16GB of space on disk, and peak at >128GB while writing to remote?

I realise this is at the extreme, but the on-disk size really ought to be smaller than the in-memory size, not bigger, and the memory peak should be no more than double the in-memory size.
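
To put those figures on a per-reference basis, here is a rough back-of-the-envelope calculation. It assumes decimal gigabytes and treats the ">128GB" peak as a lower bound; the variable names are only illustrative.

```python
# Per-reference cost implied by the figures quoted above.
# Assumption: "GB" means decimal gigabytes (10**9 bytes).
GB = 10**9

n_refs = 10_000_000
in_memory_bytes = 3_200_000_000   # 3.2 GB of references in memory
on_disk_bytes = 16 * GB           # 16 GB on disk
peak_write_bytes = 128 * GB       # reported as ">128GB", so a lower bound

bytes_per_ref_memory = in_memory_bytes / n_refs
bytes_per_ref_disk = on_disk_bytes / n_refs
disk_to_memory_ratio = on_disk_bytes / in_memory_bytes
peak_to_memory_ratio = peak_write_bytes / in_memory_bytes

print(f"in memory: {bytes_per_ref_memory:.0f} bytes per reference")
print(f"on disk:   {bytes_per_ref_disk:.0f} bytes per reference")
print(f"disk / memory ratio: {disk_to_memory_ratio:.0f}x")
print(f"peak / memory ratio: at least {peak_to_memory_ratio:.0f}x")
```

So the on-disk form is about 5x the in-memory form, and the write peak is at least 40x, against the "no more than double" that one would hope for.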

@TomNicholas
Author

I agree - I made those points in the issue I raised:

earth-mover/icechunk#401

@martindurant

A single msgpack reference file is 186MB. If I reformat it like so:

import msgspec

out = msgspec.msgpack.decode(open("07534EYBEC0SJ5P0B700", "rb").read())
# NOTE: `chunks` is not defined in the snippet as posted; presumably it is the chunk mapping held in `out`.
out2 = [{"node": key[0], "indices": key[1], "vitual": {"absolute": val["Virtual"][0]['Absolute'], "offset": val["Virtual"][1], "size": val["Virtual"][2]}, "Inline": None, "Ref": None}
    for key, val in chunks.items()]

and save this to Parquet with Zstd compression and a schema like

1000000 * {
    node: string,
    indices: (
        int64,
        int64,
        int64
    ),
    vitual: {
        absolute: string,
        offset: int64,
        size: int64
    },
    Inline: ?unknown,
    Ref: ?unknown
}

I get sizes:
178M 07534EYBEC0SJ5P0B700
5.9M 07534EYBEC0SJ5P0B700.parquet

I accept that the case of having exactly the same URL for every chunk is unlikely, but in general they will tend to be very similar.
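
A small, self-contained sketch of why near-identical URLs shrink so much once the records are stored column-wise and compressed. This is not the experiment above: it uses `json` and `zlib` from the standard library as stand-ins for msgpack and Parquet/Zstd, and the URL, record count and field values are made up purely for illustration.

```python
import json
import zlib

# Hypothetical stand-in data: records loosely mirroring the layout above,
# all pointing at one shared (made-up) URL.
URL = "s3://example-bucket/path/to/file.nc"
N = 100_000

records = [
    {
        "node": "array",
        "indices": [i, 0, 0],
        "virtual": {"absolute": URL, "offset": i * 4096, "size": 4096},
    }
    for i in range(N)
]

# Row-oriented encoding: key names and the URL are repeated in every record.
row_bytes = json.dumps(records).encode()

# Column-oriented encoding: each field becomes one contiguous list, so
# repeated values sit next to each other (similar in spirit to Parquet).
columns = {
    "node": [r["node"] for r in records],
    "indices": [r["indices"] for r in records],
    "absolute": [r["virtual"]["absolute"] for r in records],
    "offset": [r["virtual"]["offset"] for r in records],
    "size": [r["virtual"]["size"] for r in records],
}
col_bytes = json.dumps(columns).encode()

# General-purpose compression of both layouts (zlib here, not Zstd).
row_compressed = zlib.compress(row_bytes, 6)
col_compressed = zlib.compress(col_bytes, 6)

for label, blob in [
    ("rows, raw", row_bytes),
    ("rows, compressed", row_compressed),
    ("columns, raw", col_bytes),
    ("columns, compressed", col_compressed),
]:
    print(f"{label:>20}: {len(blob):>12,d} bytes")
```

The exact numbers depend on the data and the codec, but the repeated URL and key names cost almost nothing once compressed, which is the effect behind the 178M to 5.9M drop reported above.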

(btw: I got way higher memory use while running this, over 45GB allocated at peak, which would explain crashing Coiled)

Now will look at the linked discussion...
