FilterMaps data structure explanation

FilterMaps is a search structure that maps the log events of an Ethereum blockchain in a way that allows the implementation of pattern matchers with a reasonable amount of data access. It is suitable both for accelerating local lookups and for generating reasonably sized trustless proofs of the complete set of results for a certain match pattern and block range. The latter assumes that the structure is tree hashed and the receiver of the proof knows the root hash of this tree. Note that the tree hashing scheme and the proof format are outside the scope of this document.

Linear log value index space

Log values are defined as the SHA2 hashes of log addresses and topics. FilterMaps maps all log values onto a linear index space as shown below:

Block number | Transaction index | Log index | Log event                            | Log value indices
0            | block delimiter   |           |                                      | 0
1            | 0                 | 0         | Addr, Topic1, Topic2, Topic3         | 1, 2, 3, 4
1            | 0                 | 1         | Addr, Topic1, Topic2, Topic3         | 5, 6, 7, 8
1            | 1                 | 0         | Addr, Topic1, Topic2                 | 9, 10, 11
1            | 1                 | 1         | Addr, Topic1                         | 12, 13
1            | 1                 | 2         | Addr, Topic1, Topic2                 | 14, 15, 16
1            | block delimiter   |           |                                      | 17
2            | 0                 | 0         | Addr, Topic1, Topic2, Topic3, Topic4 | 18, 19, 20, 21, 22

Each block may contain multiple transactions, each transaction may emit multiple logs, and each log consists of an address and 0 to 4 topics. All addresses and topics are assigned consecutive, unique log value indices. In addition, after assigning indices to each log event emitted in a block, an extra index called the block delimiter is reserved; it has no role in the search structure itself but simplifies proving result sets for block ranges in the proposed proof scheme.
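
The following minimal sketch illustrates the index assignment above; the log representation and helper names are hypothetical and only serve to make the mapping concrete (SHA2 is assumed to be SHA-256 here):

import hashlib

def logValue(addressOrTopic):
    # a log value is the SHA2 hash of the address or topic bytes
    return hashlib.sha256(addressOrTopic).digest()

def assignIndices(blockLogs, nextIndex):
    # blockLogs: list of (address, topics) pairs emitted in one block
    # returns the (logValue, logValueIndex) pairs of the block, the index
    # reserved as the block delimiter, and the next free index
    assigned = []
    for address, topics in blockLogs:
        for item in [address] + topics:
            assigned.append((logValue(item), nextIndex))
            nextIndex += 1
    blockDelimiter = nextIndex  # reserved after the block's log events
    return assigned, blockDelimiter, nextIndex + 1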

Filter maps

A filter map is generated for each VALUES_PER_MAP long section of the log value index space. These are fixed size MAP_WIDTH by MAP_HEIGHT sparse bit maps, and each log value is marked on its map by setting a single bit to one. Block delimiters are not marked on the maps, but otherwise the density of the maps is close to constant. Since the maps are sparse, rows should typically be encoded as lists of the column indices where marks are placed. The number of marks in a row is referred to as the "row length", not to be confused with the constant MAP_WIDTH.

These maps allow searching for any specific log value, and the search yields a set of potential matches, each in the form of a log value index. It is guaranteed that each actual occurrence of the searched log value yields a match, and the mapping scheme also ensures that the number of false positives is sufficiently low. With the right choice of parameters, the cost of looking up or proving extra log value indices due to false positives can be balanced against the cost savings coming from using a smaller map width. Lookups resulting from false positive matches are greatly reduced if sequence matching is used. For example, if the pattern is [Addr, Topic1, Topic2] then three log value searches are performed and an actual log lookup is only necessary if the first search yields N, the second N+1 and the third N+2 simultaneously. The chance of coinciding false positives decreases exponentially with the pattern length.
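
As a rough sketch of this sequence matching idea (not the exact matcher of any particular implementation), the potential matches of each pattern position can be shifted back by their position and intersected, so that only aligned candidates require an actual log lookup:

def patternCandidates(potentialMatches):
    # potentialMatches[i] is the set of log value indices where the i-th
    # pattern element (e.g. Addr, Topic1, Topic2) potentially occurs;
    # a full pattern match starting at index N requires element i to match
    # at N+i, so shift each set back by i and intersect
    candidates = None
    for i, matches in enumerate(potentialMatches):
        shifted = {index - i for index in matches}
        candidates = shifted if candidates is None else candidates & shifted
    return candidates  # indices where the whole pattern potentially starts

# example: only index 9 aligns as 9, 10, 11 (cf. the table above);
# 40, 25 and 33 are unaligned false positives
# patternCandidates([{9, 40}, {10, 25}, {11, 33}]) == {9}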

Note that in order to search within a block number range and to determine the blocks that the resulting matching indices belong to, the implementation also requires some kind of two-way mapping between block numbers and the log value index space.
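
This document does not prescribe that mapping; one possible minimal form (purely an illustrative assumption) is to store the first log value index of each block's section and binary search in that list:

import bisect

# blockFirstIndex[b] is the first log value index of the section belonging
# to block b (its log values followed by its block delimiter); with the
# example table above this would be [0, 1, 18]
def firstIndexOfBlock(blockFirstIndex, blockNumber):
    return blockFirstIndex[blockNumber]

def blockOfIndex(blockFirstIndex, logValueIndex):
    # the block of an index is the last block starting at or before it
    return bisect.bisect_right(blockFirstIndex, logValueIndex) - 1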

Design goals

The main design goal of the proposed structure was to improve log search efficiency as much as possible while ensuring that the additional storage requirement of the structure is significantly smaller than the size of the logs themselves. Since the search structure compresses information compared to the original data, the false positive rate is always inversely correlated with the amount of generated search data. In an ideal case, assuming a uniform quasi-random distribution of bits on the bit map, the false positive rate would be inversely proportional to the map height and width, while the search data size per log value would be roughly proportional to the logarithm of the map size, depending on the sparse encoding.

Using a quasi-random mapping function from log value and log value index to map row and column would be easy to implement, but it would not result in good search performance: the positions to look up would be scattered all over the map, and typically all or most of the search data would have to be accessed in order to search for a single log value. In order to allow efficient search of a certain log value over a long historical range, the log value to row mapping stays the same for longer periods, allowing the searcher to access only certain rows of each map. This presents another issue, though: since certain log values are much more frequent than others, some rows would become densely populated, lowering search performance and yielding more false positives for log values with colliding row mappings; this problem has to be mitigated.

Epochs and mapping layers

In order to allow efficient search of a certain log value over a long historical range, filter maps are organized into epochs, each consisting of a fixed MAPS_PER_EPOCH number of maps. In the most usual case (when row density is around average or below) the row mapping stays the same during an entire epoch. Both the database and the hash tree should be organized so that, instead of putting all rows of a single map close to each other, rows with the same row index throughout an entire epoch are stored close to each other and are therefore cheap to access and/or prove with a Merkle proof.
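
A purely illustrative way to achieve this locality in a key-value database is to key each stored row by epoch index, then row index, then the map's position within the epoch, so that an epoch's worth of a given row index forms a contiguous key range (the layout below is an assumption, not a prescribed format):

def rowDatabaseKey(mapIndex, rowIndex):
    # rows with the same rowIndex within the same epoch get adjacent keys,
    # so an entire epoch of a given row can be read as a single range scan
    epochIndex = mapIndex // MAPS_PER_EPOCH
    mapSubindex = mapIndex % MAPS_PER_EPOCH
    return (epochIndex, rowIndex, mapSubindex)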

In order to mitigate collisions in densely populated rows, the concept of mapping layers is introduced: if a certain row already has a certain number of entries then the row mapping is changed, so that very frequent log values are mapped onto multiple rows. Initially, when a map is empty, every log value is mapped using mapping layer 0, the "base layer". If a row reaches BASE_ROW_LENGTH entries then any further log values mapped onto that row by the base layer mapping will use layer 1 mapping instead. On layer 1 the row length limit is BASE_ROW_LENGTH * ROW_LENGTH_STEP. If this row also reaches its limit, layer 2 mapping is used, and so on. The row length limit increases exponentially until it reaches BASE_ROW_LENGTH * MAPS_PER_EPOCH, where it stops growing. Note that a row filled at a certain mapping layer can still be grown further on a higher layer. Different log values colliding in the same row on a certain layer will probably be mapped to different rows on the next layer, which means that a very popular value might populate multiple rows (including very long ones) while an unlucky, less popular value colliding with it on the base layer will probably just have to move one layer up.

The search process is similar: if the searcher finds that the row belonging to the searched value is full according to the base layer limit then it also has to check the next layer, and so on, until it finds a non-full row. If a row is longer than the limit of the layer the searcher is looking at, it can safely ignore the extra entries, assuming that they were added by another value on a higher layer.

Row mapping

While base layer row mapping stays the same for an entire epoch, higher layer mappings change more frequently. Each mapping change has a cost in terms of database access overhead and Merkle proof size overhead, and the epoch size is chosen so that these overheads stay sufficiently low compared to the cost of accessing the actual useful data. On higher layers, where the rows are longer, more frequent remapping is affordable because the useful data size per map is also higher. It is also desirable because a less frequent log value that collides with a long row will then only suffer from the collision for a shorter time.

The row index and maximum row length are calculated as follows (note that fromBinary32 and toBinary32 use little-endian encoding):

layerFactor(layerIndex) = MIN(ROW_LENGTH_STEP ** layerIndex, MAPS_PER_EPOCH)
maxRowLength(layerIndex) = BASE_ROW_LENGTH * layerFactor(layerIndex)
rowFrequency(layerIndex) = MAPS_PER_EPOCH // layerFactor(layerIndex)
rowIndex(logValue, mapIndex, layerIndex) = fromBinary32(SHA2(logValue + toBinary32(mapIndex - mapIndex % rowFrequency(layerIndex)) + toBinary32(layerIndex))[0:4]) % MAP_HEIGHT

The following figure shows how log values are mapped to rows on different mapping layers. Note that it might happen that a higher layer mapping coincides with a lower layer mapping for the same value; this does not cause any problem though as the row is simply grown further on the higher layer. The search algorithm can also simply revisit the same row in a higher layer iteration if necessary and process the rest of the row that it first ignored.

map index        111111 1111222222222233 3333333344444444 4455555555556666
       0123456789012345 6789012345678901 2345678901234567 8901234567890123
      +----------------+----------------+----------------+----------------+
row 0 |2........2......|2...............|...2............|........2.......|
row 1 |........1111.2..|.....2..1111....|1111.....2...2..|2111..2......2..|
row 2 |0000000000000000|.2..2....2..2...|....2111....2...|..2.....11112...|
row 3 |....2..22...1111|..........2.1111|2.......2.2.1111|.......2...2....|
row 4 |.2..1111..2....2|1112..2.2....2..|0000000011110020|...21111.2....2.|
row 5 |...2........2...|...............2|.2...2.........2|.2...2..........|
row 6 |1111.22....2..2.|0020000000020000|.......2...2....|..........2.1112|
row 7 |..2.............|....1112......2.|..2...2.........|0000200000000000|
      +----------------+----------------+----------------+----------------+
            epoch 0          epoch 1          epoch 2          epoch 3

Fig 1. Row mapping of a single log value on different mapping layers

MAP_HEIGHT = 8
MAPS_PER_EPOCH = 16
ROW_LENGTH_STEP = 4
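
To make the layering behaviour concrete, here is a minimal sketch of how a searcher could collect the candidate entries of a single map for one log value, iterating through the mapping layers until it reaches a non-full row (rowIndex and maxRowLength as defined above; getRow is a hypothetical accessor returning a row as a list of column indices):

def candidateEntries(getRow, logValue, mapIndex):
    entries = []
    layerIndex = 0
    while True:
        row = getRow(mapIndex, rowIndex(logValue, mapIndex, layerIndex))
        limit = maxRowLength(layerIndex)
        # entries beyond the current layer's limit were added on a higher
        # layer, possibly by another value; they are picked up again if the
        # same row is revisited on that higher layer (duplicates are harmless
        # in a set of potential matches)
        entries += row[:limit]
        if len(row) < limit:
            return entries  # non-full row, no higher layers to check
        layerIndex += 1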

Column mapping

Column mapping assumes that MAP_WIDTH is a multiple of VALUES_PER_MAP. Column index is calculated as follows (function quasiRandom defined later):

logValueWidth = MAP_WIDTH // VALUES_PER_MAP            # constant
mapIndex = logValueIndex // VALUES_PER_MAP
transformHash = SHA2(logValue + toBinary32(mapIndex))  # calculated once per map during search
logValueSubindex = logValueIndex % VALUES_PER_MAP
columnIndex(logValueSubindex, transformHash) = logValueSubindex * logValueWidth + quasiRandom(logValueSubindex, transformHash) % logValueWidth

As shown in the figure below, this mapping practically assigns a logValueWidth by MAP_HEIGHT rectangle to each log value index and ensures that each log value index places exactly one mark inside its own rectangle. This property also ensures that column indices never collide and preserve the original order of the log value indices, allowing efficient Merkle exclusion proofs of certain column indices in long rows.

column             11 1111 1111 2222 2222 2233
       0123 4567 8901 2345 6789 0123 4567 8901
      +---------------------------------------+
row 0 |.... .... .... .... .... .... .... ....|
row 1 |.... .... .... .... .... .C.. .... ....|
row 2 |.A.. .... ...A .... A... .... ...A ....|
row 3 |.... .... .... .... .... .... .... ....|
row 4 |.... .... .... .... .... .... .... ....|
row 5 |.... .B.. .... ...B .... .... .... ....|
row 6 |.... .... .... .... .... .... .... ..D.|
row 7 |.... .... .... .... .... .... .... ....|
      +---------------------------------------+

Fig 2. A single filter map with 8 entries of 4 different log values

MAP_WIDTH = 32
MAP_HEIGHT = 8
VALUES_PER_MAP = 8
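
Given this column mapping, a mark found in a row can be translated back into a candidate log value index and verified; a minimal sketch, using columnIndex as defined above (which in turn relies on the quasiRandom function given in the next subsection), could look like this:

def potentialMatches(rowEntries, logValue, mapIndex):
    # rowEntries: column indices collected from the rows mapped to logValue
    transformHash = SHA2(logValue + toBinary32(mapIndex))
    matches = []
    for column in rowEntries:
        # the rectangle structure makes the subindex recoverable by division
        logValueSubindex = column // logValueWidth
        # keep the candidate only if the searched value would have placed
        # its mark at exactly this column for this subindex
        if columnIndex(logValueSubindex, transformHash) == column:
            matches.append(mapIndex * VALUES_PER_MAP + logValueSubindex)
    return matches  # candidate log value indices (may contain false positives)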

The quasiRandom function is an ad hoc function that does not have to satisfy strong cryptographic requirements and is about 20 times cheaper to evaluate than a SHA2 hash, making search more efficient:

def quasiRandom(x, transformHash):
    # ad hoc 32-bit mixing using multiplications by odd factors derived from
    # transformHash, rotations, xors and an addition
    x = x * (fromBinary32(transformHash[4:8]) * 2 + 1) % 2**32
    x = rotateLeft32(x, transformHash[0] % 32)
    x = x ^ (fromBinary32(transformHash[8:12]) * 2 + 1)
    x = x * (fromBinary32(transformHash[12:16]) * 2 + 1) % 2**32
    x = rotateLeft32(x, transformHash[1] % 32)
    x = (x + fromBinary32(transformHash[16:20]) * 2 + 1) % 2**32
    x = x * (fromBinary32(transformHash[20:24]) * 2 + 1) % 2**32
    x = rotateLeft32(x, transformHash[2] % 32)
    x = x ^ (fromBinary32(transformHash[24:28]) * 2 + 1)
    x = x * (fromBinary32(transformHash[28:32]) * 2 + 1) % 2**32
    x = rotateLeft32(x, transformHash[3] % 32)
    return x
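
The helper functions used in the pseudocode above are not defined in this document; for reference, straightforward implementations consistent with the notes above (SHA2 assumed to be SHA-256, 32-bit values encoded little-endian) could look like this:

import hashlib

def SHA2(data):
    # assumed to be SHA-256 over the concatenated byte string
    return hashlib.sha256(data).digest()

def toBinary32(x):
    # 4-byte little-endian encoding, as noted in the row mapping section
    return x.to_bytes(4, 'little')

def fromBinary32(b):
    return int.from_bytes(b, 'little')

def rotateLeft32(x, n):
    # rotate a 32-bit value left by n bits
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF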

Proposed parameters

Name            | Value
MAP_WIDTH       | 2**24
MAP_HEIGHT      | 2**16
VALUES_PER_MAP  | 2**16
MAPS_PER_EPOCH  | 2**10
BASE_ROW_LENGTH | 2**3
ROW_LENGTH_STEP | 2**4
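
A few derived quantities follow directly from these parameters (a quick sanity check rather than part of the specification):

logValueWidth = MAP_WIDTH // VALUES_PER_MAP       # 2**8 = 256 columns per log value index
avgMarksPerRow = VALUES_PER_MAP // MAP_HEIGHT     # 1 mark per row per map on average
maxLayerLimit = BASE_ROW_LENGTH * MAPS_PER_EPOCH  # 8192, the highest row length limit
valuesPerEpoch = VALUES_PER_MAP * MAPS_PER_EPOCH  # 2**26 log value indices per epoch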