Log index search implementation overview

The purpose of the log index search mechanism is to return a set of logs that might potentially match a given pattern. The main search function GetPotentialMatches receives the following parameters:

block range
a set of addresses one of which has to match the log address (an empty set is interpreted as a wild card)
a set of topics for each of at most 4 topic positions, one of which has to match the topic at the given position (an empty set is interpreted as a wild card)

GetPotentialMatches constructs a pattern matcher based on the given addresses and topics and translates the block range into log value index range. Then it runs the pattern matcher on the filter maps covering the searched range, running multiple workers in parallel, each processing one epoch at a time. After evaluating the matcher it looks up the actual logs at the resulting potentially matching log value indices. Parallel worker results are collected and returned in chronological order.

Note that GetPotentialMatches may still return false positives and the final filtering for actual matches is done by the filterLogs function in eth/filters.

Pattern matcher construction

The pattern matcher is a modular construction built from building blocks that all implement the matcher interface (see below). For every individual log value a singleMatcher is constructed that returns log value positions where the given log value might be found. A matchAny is constructed from singleMatcher child instances for the set of matching addresses and the sets of matching topics for each position. matchAny returns log value positions where one of the children might have a match. Note that an empty matchAny acts as a wild card. Finally a matchSequence is constructed from matchAny child instances that returns a potential match for log value index N if the child matchers all returned a match for their position in the sequence (N for the address, N+1 for the first topic, N+2 for the second topic and so on). Note that sequence matchers are often constructed from child matchers that give matches with a very different frequency. In these cases efficiency is achieved by adaptively adjusting the evaluation order to first evaluate the child that is cheapest to evaluate and/or gives empty sets of results most often so that the more expensive matchers do not need to be evaluated at all for most of the maps.

Pattern matcher example:

addresses = [A1, A2]
topics = [[T1], [T2, T3]]

The table below shows the occurences of A1, A2, T1, T2 and T3 in a single map, expected pattern matches shown in bold:

log value index	topic or address
565	A1
566	T1
1381	A1
1382	T1
1383	T3
24578	A2
24580	T2
33402	T2
45293	A2
45294	T1
45295	T3
55322	T1

The next table shows the result sets of individual matchers, false positives shown in italics:

matcher	results
singleMatcher(A1)	565, 1381, 5567
singleMatcher(A2)	24578, 45293
singleMatcher(T1)	566, 1382, 45294, 55322
singleMatcher(T2)	24580, 33402
singleMatcher(T3)	1383, 12343, 45295
matchAny(A1, A2)	565, 1381, 5567, 24578, 45293
matchAny(T1)	566, 1382, 45294, 55322
matchAny(T2, T3)	1383, 12343, 24580, 33402, 45295
matchSequence	1381, 45293

The matcher backend

The matcher backend (represented by the MatcherBackend interface) provides access to the log index data. It is implemented by FilterMapsMatcherBackend that serves log index data directly from the log indexer's local database but it can also be implemented using a remote provider and local proof verification. Since the underlying data source is updated as the chain progresses, the interface does not assume that the returned data stays consistent with a single view of the chain throughout the search process. Instead it provides the SyncLogIndex function which lets the searcher know the block range that wasn't modified during the search and therefore the results in that range can be considered correct. If this guaranteed valid range is shorter than desired then the filter mechanism starts another search request for the invalidated range.

The matcher and matcherInstance interfaces

A matcher that implements the matcher interface is constructed for a specific search pattern. Its newInstance function creates a matcherInstance looking for matches in a given set of filter maps. The instance can run in multiple getMatchesForLayer iterations, in each iteration processing data on the next mapping layer and returning a list of matching log value indices for those maps where it has already found all matches. The set of maps processed by a single matcherInstance typically covers one epoch or the interesting subset of it, except for matchSequence which in some cases also requests the first map of the next epoch from its child matcher instances.

The dropIndices function of matcherInstance signals to the instance that the results for the given maps are no longer required, should not be processed further in subsequent layer iterations and the available partial results can be discarded. This per-layer evaluation method with optional abort is useful for matchSequence which does not always know in advance which one of its child matchers will evaluate in multiple expensive iterations and which one will quickly return an empty set of results, allowing the evaluation of the other, expensive children to be aborted after processing the first (still cheap) layer. Though this interface adds some extra complexity, in some cases it can improve search efficiency by an order of magnitude.

Results for finished maps are returned in a matcherResult which consists of a map index and a corresponding potentialMatches list. getMatchesForLayer should be called for more layer iterations until results for all requested maps that have not been dropped during the process are returned. Note that an empty set of results is represented as a zero-length potentialMatches list while nil value has a special purpose; it represents a "wild card" result meaning that the entire map range is a potential match. A zero length matchAny returns wild cards for all maps. This simplifies matchSequence construction with wild cards at certain positions. An all wild card matcher is not allowed (GetPotentialMatches fails if the top level matcher returns one) as it would require an extra special case to process and it would also be much slower to look up every log by index than to simply revert to unindexed log search.

Matcher implementation

singleMatcher

A singleMatcher creates singleMatcherInstance instances that can filter for a single log value. The instance maintains a set of requested maps and their partial results if available. In each mapping layer iteration getMatchesForLayer fetches the appropriate row of the requested maps from the matcher backend, filters out potential matches from the list of column indices and adds them to the partial results. If a row had more entries than the maximum row length for the given layer then the list of entries is truncated and extra entries are ignored. If the row had fewer entries than the maximum then the final results are returned for the corresponding map and the map is removed from the requested set. dropIndices also removes the specified maps from the requested set and removes partial results.

matchAny

matchAny is constructed from a list of child matchers and creates matchAnyInstance instances that contain a similar list of instances of the child matchers, each initialized with the same set of requested maps. getMatchesForLayer always requests results from all children. matchAnyInstance keeps track of any results returned by children until all children have returned results for a certain map. When this happens, it returns the merged results. dropIndices discards existing results of the specified maps and drops the same indices from all children.

matchSequence

matchSequence is constructed from two child matchers: a base and next matcher. matchSequenceInstance contains the corresponding child instances. getMatchesForLayer evaluates the two children in an order determined by a heuristic function that approximates the cost of evaluation and the chance of empty result sets from each child. After evaluating each child, map indices are dropped from the other child if the results returned from the recently evaluated child guarantee that no possible result from the other would yield a combined result from matchSequenceInstance. A sequence matched result N is returned if baseInstance returned potential match N and nextInstance returned N+offset where offset is a fixed parameter of matchSequence. Note that though matchSequence only handles two children in order to reduce code complexity, the newMatchSequence function can recursively construct a sequence pattern matcher from any number of child matchers.

baseInstance is instantiated with the same set of requested maps as matchSequenceInstance. On the other hand, for each map M requested from matchSequenceInstance, both maps M and M+1 are requested from nextInstance because baseInstance results for the last few log value indices of map M might have to be matched against nextInstance results for the first few log value indices of map M+1. Similarly, dropIndices always drops all specified map indices from baseInstance but only drops map M from nextInstance if map M-1 is also either dropped or the baseInstance results for map M-1 are known and do not contain any log value indices from the end of the map that could potentially match with anything from map M.

zsfelfoldi/log_search.md