Persistent Data Strctures use structural sharing to create an efficient in-memory representation, but these same structures can be a problem to serialize/deserialize.
First, the shared parts are duplicated in the serialised output which can lead to a large and inefficient representation.
Second, when the process is reversed, deserialization doesn't recreate the original, shared, effcient in-memory representation because sharing information was lost in the original serialisation.
This library "re-factors" persistent datastrcutures so they can be more efficiently serialised.
It provides two functions factor
and expand
.
factor
takes a persistent data structure, pds
, and it returns two structures (in a vector):
dups
: a dictionary which maps "tokens" (ids) to duplicated sub-nodes.pds-prime
:pds
with the shared sub-nodes replaced by tokens (keys ofdups
)
Having used factor
, you are expected to then serialise these two strcutures using whatever method makes sense to your usecase - perhaps transit. So factor
is a pre-processor for use before transit.
Later, expand
can be used to reverse the process - you give it two data strcutures and it reforms the original, sharing and all.
So factor
is doing dictionary compression. It walks pds
identifing duplicated nodes, places them into dups
, and tweaks pds-prime
so the duplicates it finds are replaced by tokens.
factor
can use two ways of identifing structures as being duplicates:
- via
identical?
- via
=
XXX how do we supply one of these two alternatives to factor
Add this dependency to your project.clj
XXX
Add this require:
(require (:require [day8.re-de-dup :as re-de]
(def v (re-de/factor some-data)) ;; XXX add a way to use wither = or identical?
(= some-data (re-de/expand v))
If you want to de-dupe items that are not identical (i.e the same object reference)
=>(use 'de-dupe.core :refer [create-eq-cache])
=>(def compressed (create-eq-cache some-data))
=>(= (.decompress compressed) some-data)
true
This Library is ok for speed at the moment but can maybe can benefit from more optimisation.
Hash seems to take a big chunk of time but if we use the ECMA 6 (Harmony) (js/Map.) which tests by identity, the time taken does not reduce. This may be because the number of elements in the map increases. As this implementation uses js/Map you will need to run it on a modern browser (Chrome, IE 10, Firefox). The browser will only need to implement the Map.set() and Map.get() methods.
- This implementation can only cache things that can have meta-data attached to them (lists, sets, vectors, maps).
- This implementation caches everthing it can, even if the value is only used once it will be cached, which means that the decompression phase will always take longer than it should.
- This implementation by default will consider two objects as different if (identical? x y) returns false. This is to save time computing the hash of the objects to check for equality. Use de-dupe/create-eq-cache for de-duplication for non-identical structures.
This implementation uses a special version of clojure.walk which keeps track of both the original form (or more correctly that returned from the inner function), and the newly modified form from the outer function.
This makes it possible in the prewalk phase (stepping down the tree from root to leaf) to cache all the forms in a js/Map (from now on referred to as the 'values' cache, this is mutable). In addition the key generated for the values cache (itself just a cljs symbol) is added to the metadata of the form.
If there is a cache 'hit' on the values cache, in this phase the form will be replaced by the cache key that is found and the algorithm will not continue to walk down the structure.
On the way back up the tree the algorithm will begin to construct the 'compressed' cache. This is the cache where the value for each cache key itself contains cache keys. This compressed cache is constructed as the outer function will return the cache key for each value which is found by examining the metadata attached to the object on the way down.
Finally the top level compressed value is returned and assigned to the cache as cache-0.
This project comes with unit tests and benchmark tests on random data.
To run the unit tests with a browser
>lein cljsbuild once test-dev
Then open resources/index_dev.html in a browser and open the javascript console to see the results
To run the benchmarks with a browser
>lein cljsbuild once bench
Then open resources/index_bench.html in a browser and open the javascript console to see the results
Copyright © 2015 Michael Thompson
Distributed under The MIT License (MIT) - See LICENSE.txt