de-dupe

The Problem

Persistent data structures use structural sharing to create an efficient in-memory representation, but these same structures can be a problem to serialise and deserialise.

First, the shared parts are duplicated in the serialised output, which can lead to a large and inefficient representation.

Second, when the process is reversed, deserialisation doesn't recreate the original, shared, efficient in-memory representation, because the sharing information was lost during serialisation.
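
To make this concrete, here is a small illustration:

(def shared [1 2 3])
(def pds {:a shared :b shared})

(identical? (:a pds) (:b pds))   ;; => true: one vector, shared in memory
(pr-str pds)                     ;; => "{:a [1 2 3], :b [1 2 3]}" - written out twice

;; reading it back loses the sharing (read-string is cljs.reader/read-string
;; in ClojureScript)
(let [pds2 (read-string (pr-str pds))]
  (identical? (:a pds2) (:b pds2)))  ;; => false: now two separate vectors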

The Solution Provided

This library "re-factors" persistent datastrcutures so they can be more efficiently serialised.

It provides two functions factor and expand.

factor takes a persistent data structure, pds, and returns two structures (in a vector):

  • dups: a dictionary which maps "tokens" (ids) to duplicated sub-nodes.
  • pds-prime: pds with the shared sub-nodes replaced by tokens (the keys of dups).
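
For example (a hypothetical illustration - the actual token representation is an implementation detail):

(def shared [1 2 3])
(def pds {:a shared :b shared})

(factor pds)
;; => [{token-1 [1 2 3]}           dups: token -> duplicated sub-node
;;     {:a token-1 :b token-1}]    pds-prime: sub-nodes replaced by tokens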

Having used factor, you are expected to serialise these two structures using whatever method makes sense for your use case - perhaps transit. So factor is a pre-processor for use before transit.

Later, expand can be used to reverse the process - give it the two data structures and it reforms the original, sharing and all.

How it works

So factor is doing dictionary compression. It walks pds identifying duplicated nodes, places them into dups, and tweaks pds-prime so that the duplicates it finds are replaced by tokens.

factor can use one of two ways of identifying structures as duplicates:

  1. via identical?
  2. via =
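
The difference matters because = compares values structurally while identical? compares object references:

(= [1 2 3] [1 2 3])           ;; => true:  structurally equal values
(identical? [1 2 3] [1 2 3])  ;; => false: two distinct objects

(def v [1 2 3])
(identical? v v)              ;; => true:  the very same reference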

XXX how do we supply one of these two alternatives to factor

Using

Add this dependency to your project.clj:

XXX

Add this require:

(require '[day8.re-de-dup :as re-de])

(def v (re-de/factor some-data))   ;; XXX add a way to use either = or identical?

(= some-data (re-de/expand v))

If you want to de-dupe items that are equal (=) but not identical (i.e. not the same object reference):

=>(require '[de-dupe.core :refer [create-eq-cache]])
=>(def compressed (create-eq-cache some-data))
=>(= (.decompress compressed) some-data)
true

State of play

This library's speed is acceptable at the moment, but it could probably benefit from further optimisation.

Hashing seems to take a big chunk of the time, but even using the ECMAScript 6 (Harmony) js/Map, which compares keys by identity, the time taken does not reduce. This may be because the number of elements in the map increases. Since this implementation uses js/Map, you will need to run it in a modern browser (Chrome, IE 10, Firefox); the browser only needs to implement the Map.set() and Map.get() methods.
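
For reference, the js/Map interop involved is minimal (a sketch):

(def some-form [1 2 3])
(def values-cache (js/Map.))            ;; ES6 Map: object keys compared by reference
(.set values-cache some-form 'cache-1)
(.get values-cache some-form)           ;; => cache-1
(.get values-cache [1 2 3])             ;; => nil: an equal but not identical key misses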

Limitations

  • This implementation can only cache things that can have metadata attached to them (lists, sets, vectors, maps) - see the sketch after this list.
  • This implementation caches everything it can; even a value used only once will be cached, which means the decompression phase will always take longer than it should.
  • By default, this implementation considers two objects different if (identical? x y) returns false. This saves the time that computing object hashes for equality checks would take. Use de-dupe/create-eq-cache to de-duplicate structures that are equal but not identical.
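
As an illustration of the first limitation, metadata can be attached to collections but not to primitives such as strings or numbers:

(with-meta [1 2 3] {:cached true})  ;; => [1 2 3], now carrying metadata
(with-meta "hello" {:cached true})  ;; throws - strings cannot carry metadata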

Implementation details

This implementation uses a special version of clojure.walk which keeps track of both the original form (or, more correctly, the one returned from the inner function) and the newly modified form from the outer function.
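
A minimal sketch of the idea, assuming a two-argument outer function (simplified: the real clojure.walk/walk also special-cases lists, map entries, seqs and records):

(defn walk-2
  "Like clojure.walk/walk, but outer receives both the pre-modification
   form (still carrying metadata added on the way down) and the rebuilt form."
  [inner outer form]
  (if (coll? form)
    (outer form (into (empty form) (map inner form)))
    (outer form form)))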

This makes it possible, in the prewalk phase (stepping down the tree from root to leaf), to cache all the forms in a js/Map (from now on referred to as the 'values' cache; it is mutable). In addition, the key generated for the values cache (itself just a cljs symbol) is added to the metadata of the form.
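
For illustration, recording a form in the values cache and tagging it might look like this (the key format and metadata key name are assumptions, not the library's actual names):

(let [k (gensym "cache-")]                ;; keys are just cljs symbols
  (.set values-cache form k)              ;; mutable js/Map: form -> key
  (vary-meta form assoc :de-dupe/key k))  ;; tag the form so the way back up can find k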

If there is a cache 'hit' on the values cache during this phase, the form is replaced by the cache key that was found, and the algorithm does not continue to walk down that branch of the structure.

On the way back up the tree, the algorithm begins to construct the 'compressed' cache - the cache in which the value for each cache key may itself contain cache keys. This compressed cache is built as the outer function returns the cache key for each value, found by examining the metadata attached to the object on the way down.

Finally the top level compressed value is returned and assigned to the cache as cache-0.

Testing

This project comes with unit tests and benchmark tests on random data.

To run the unit tests in a browser:

>lein cljsbuild once test-dev

Then open resources/index_dev.html in a browser and open the JavaScript console to see the results.

To run the benchmarks in a browser:

>lein cljsbuild once bench

Then open resources/index_bench.html in a browser and open the JavaScript console to see the results.

License

Copyright © 2015 Michael Thompson

Distributed under The MIT License (MIT) - See LICENSE.txt
