@tomcrane
Created October 15, 2019 17:17
StorageMap efficiency

(the answer to this might be completely different for DDS on AWS - e.g., a massive location-resolving key-value store.)

Given a relative file path in a METS file (e.g., a JP2 or ALTO file), the DDS needs to resolve the S3 location of that file.

The simple way to do this is to load the storage manifest, find the entry corresponding to the path in METS, and deduce the S3 URI. But doing that on every request would be very inefficient.

So the DDS caches a StorageMap for each storage manifest, which is this object, serialised:

    [Serializable]
    public class ArchiveStorageMap
    {
        public Dictionary<string, string> AwsKeys;
        public string BucketName;
        public string FullUriTemplate;
        public DateTime StorageManifestCreated;
        public DateTime Built;
    }

Here's an example of one:

{
    "AwsKeys": {
        "b3012802x.xml": "digitised/b3012802x/v1/data/b3012802x.xml",
        "alto/b3012802x_0001.xml": "digitised/b3012802x/v1/data/alto/b3012802x_0001.xml",
        "alto/b3012802x_0002.xml": "digitised/b3012802x/v1/data/alto/b3012802x_0002.xml",
        "alto/b3012802x_0003.xml": "digitised/b3012802x/v1/data/alto/b3012802x_0003.xml",
        "alto/b3012802x_0004.xml": "digitised/b3012802x/v1/data/alto/b3012802x_0004.xml",
        "alto/b3012802x_0005.xml": "digitised/b3012802x/v1/data/alto/b3012802x_0005.xml",
        "objects/b3012802x_0001.jp2": "digitised/b3012802x/v1/data/objects/b3012802x_0001.jp2",
        "objects/b3012802x_0002.jp2": "digitised/b3012802x/v1/data/objects/b3012802x_0002.jp2",
        "objects/b3012802x_0003.jp2": "digitised/b3012802x/v1/data/objects/b3012802x_0003.jp2",
        "objects/b3012802x_0004.jp2": "digitised/b3012802x/v1/data/objects/b3012802x_0004.jp2",
        "objects/b3012802x_0005.jp2": "digitised/b3012802x/v1/data/objects/b3012802x_0005.jp2"
    },
    "BucketName": "xxxxxx-bucket-name-xxxxxx",
    "FullUriTemplate": "https://s3-eu-west-1.amazonaws.com/xxxxxx-bucket-name-xxxxxx/{0}",
    "StorageManifestCreated": "2019-09-14T00:30:50.414464Z",
    "Built": "2019-10-15T17:51:27.9448392+01:00"
}

The DDS caches these in memory with a very short expiry, and will deserialise from disk if not in memory. Whenever Goobi notifies DDS, the cached storage map is rebuilt.
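The lookup itself is then just a dictionary hit plus string formatting. A minimal Python sketch of the resolution step (field names mirror the ArchiveStorageMap class above; the sample data is abbreviated from the example):

```python
# Sketch: resolve a METS-relative path to a full S3 URI using the cached map.
# Field names follow ArchiveStorageMap; data is abbreviated from the example above.
storage_map = {
    "AwsKeys": {
        "b3012802x.xml": "digitised/b3012802x/v1/data/b3012802x.xml",
        "alto/b3012802x_0001.xml": "digitised/b3012802x/v1/data/alto/b3012802x_0001.xml",
    },
    "BucketName": "xxxxxx-bucket-name-xxxxxx",
    "FullUriTemplate": "https://s3-eu-west-1.amazonaws.com/xxxxxx-bucket-name-xxxxxx/{0}",
}

def resolve(storage_map, relative_path):
    # Dictionary lookup gives the S3 key; the template turns it into a full URI.
    key = storage_map["AwsKeys"][relative_path]
    return storage_map["FullUriTemplate"].replace("{0}", key)

print(resolve(storage_map, "alto/b3012802x_0001.xml"))
```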

This is really fast, and works great for small-to-medium sized items. But for a huge item, the size of the storage map, the deserialisation time from disk, and the memory usage start to nag at me.

There is a lot of repetition in those keys and values. A lot of redundant information. Chemist and Druggist has two and a half million entries. This doesn't affect the end user, because the hard work has already been done in advance, to transform all that into two levels of IIIF Collections and then Manifests-per-volume.

You would notice the overhead in the dashboard, or if you built a more general purpose storage file explorer based on METS, and generally when the DDS is building things.

The job of this class is to resolve storage locations given a relative file path in METS. You could make use of special knowledge of how those paths work, and use templates for keys and values, and only actually store the strings that differ per key and per value, so the serialised object carries the same information in many fewer bytes.

You could do this based on special knowledge that it's 0001, 0002, etc. that change, but that would be fragile and hard to maintain. Better to solve it for the general case: at creation time, analyse all the keys and values and produce the optimum set of templates, so that only the minimum required keys and values need to be stored.

So that even for Chemist and Druggist there's a map with very short keys and very short values. Or a partitioned map that requires one or more further deserialisations to load the right partition to fulfil the lookup.
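One way into the general case (a hedged sketch, not the DDS implementation): in the example above, every value ends with its own key, so the only per-entry information is the prefix, e.g. `digitised/b3012802x/v1/data/`. Grouping keys by that prefix stores each distinct prefix once:

```python
# Sketch: build a compact map by exploiting the observation (from the example
# above) that each value embeds its key as a suffix. This is an assumption
# about the data shape, not guaranteed by the storage manifest format.
def build_compact_map(aws_keys):
    groups = {}
    for key, value in aws_keys.items():
        if not value.endswith(key):
            raise ValueError(f"value does not embed key: {value}")
        # Everything before the key is the per-entry prefix; store it once.
        prefix = value[: len(value) - len(key)]
        groups.setdefault(prefix, []).append(key)
    return groups

def lookup(groups, path):
    # Find which prefix group contains the path and rebuild the full key.
    for prefix, keys in groups.items():
        if path in keys:
            return prefix + path
    raise KeyError(path)

aws_keys = {
    "b3012802x.xml": "digitised/b3012802x/v1/data/b3012802x.xml",
    "alto/b3012802x_0003.xml": "digitised/b3012802x/v2/data/alto/b3012802x_0003.xml",
}
compact = build_compact_map(aws_keys)
print(lookup(compact, "alto/b3012802x_0003.xml"))
```

For Chemist and Druggist this collapses millions of long values to a handful of prefixes plus the keys themselves; the keys could be compressed further with the same trick (shared prefixes, or the `#` substitution used below).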

What's the general case solution for this?

It may be that you can't beat the redundancy-busting potential of just gzipping it, but it's still going to be big in memory once deserialised and unzipped. Is there a more sophisticated data structure that holds the location-resolving info required? And how do you build it?
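As a quick sanity check on the gzip point: JSON this repetitive compresses extremely well on disk, but the deserialised dictionary still materialises every full string in memory. A sketch with synthetic data in the same shape as the example above:

```python
import gzip
import json

# Synthetic map shaped like the example above: highly repetitive keys and
# values, as in a large digitised item.
aws_keys = {
    f"alto/b3012802x_{i:04d}.xml": f"digitised/b3012802x/v1/data/alto/b3012802x_{i:04d}.xml"
    for i in range(1, 10001)
}
raw = json.dumps(aws_keys).encode()
packed = gzip.compress(raw)
# gzip removes most of the redundancy on disk...
print(f"raw: {len(raw)} bytes, gzipped: {len(packed)} bytes")
# ...but json.loads(raw) still holds every full key and value in memory.
```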

@tomcrane (Author) commented:
And if one image changed (let's say the ALTO is re-OCRed, too):

{
    "VersionSets": [
    {
        "Key": "v1",
        "Value": [
            "#.xml",
            "alto/#_0001.xml",
            "alto/#_0002.xml",
            "alto/#_0004.xml",
            "alto/#_0005.xml",
            "objects/#_0001.jp2",
            "objects/#_0002.jp2",
            "objects/#_0004.jp2",
            "objects/#_0005.jp2"
        ]
    },
    {
        "Key": "v2",
        "Value": [
            "alto/#_0003.xml",
            "objects/#_0003.jp2"
        ]
    }
    ],
    "BucketName": "xxxxxx-bucket-name-xxxxxx",
    "StorageManifestCreated": "2019-09-14T00:30:50.414464Z",
    "Built": "2019-10-19T17:59:39.0192978+01:00"
}

The .NET object is:

    [Serializable]
    public class WellcomeBagAwareArchiveStorageMap
    {
        // a List of: "v1" => { "alto/#_0001.xml", ... }
        // in decreasing order of size of set
        public List<KeyValuePair<string, HashSet<string>>> VersionSets;
        public string BucketName;
        public DateTime StorageManifestCreated;
        public DateTime Built;
    }
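Resolution against this compact structure is then: mask the b-number in the incoming path as `#`, scan the version sets (largest first) for the set containing it, and rebuild the S3 key. A hedged Python sketch; the key layout `digitised/{bnumber}/{version}/data/{path}` is inferred from the v1 example earlier in the gist:

```python
# Sketch: resolve a METS-relative path against the compact VersionSets
# structure (WellcomeBagAwareArchiveStorageMap). The key layout is inferred
# from the earlier example, not taken from the DDS source.
def resolve_key(version_sets, b_number, relative_path):
    # Mask the b-number so the path matches the stored '#' templates.
    masked = relative_path.replace(b_number, "#")
    # Sets are stored largest first, so the common case exits early.
    for version, paths in version_sets:
        if masked in paths:
            return f"digitised/{b_number}/{version}/data/{relative_path}"
    raise KeyError(relative_path)

version_sets = [
    ("v1", {"#.xml", "alto/#_0001.xml", "objects/#_0001.jp2"}),
    ("v2", {"alto/#_0003.xml", "objects/#_0003.jp2"}),
]
print(resolve_key(version_sets, "b3012802x", "alto/b3012802x_0003.xml"))
```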
