@echeipesh
Created July 30, 2019 14:15
Stream Processing for Polygonal Summary over Rasters

Problem

We have a stream of requests for aggregation of polygons over a set of rasters (e.g. the sum of all values under a polygon (AOI) in rasters A, B and C). The set of rasters is potentially unbounded and the aggregation itself can be sophisticated, where the aggregate value is contingent on per-pixel values of A and B and not their individual summaries (e.g. producing a multi-dimensional histogram).
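To make the "contingent on per-pixel values" point concrete, here is a minimal sketch of a joint histogram over two rasters. The rasters are plain nested lists and the `mask` argument (marking which pixels fall under the AOI) is a hypothetical input; the point is that the aggregate is built from per-pixel `(a, b)` pairs and cannot be derived from each raster's own summary:

```python
from collections import Counter

def joint_histogram(raster_a, raster_b, mask):
    """Count (a, b) value pairs for pixels where the AOI mask is set.

    A toy stand-in for the multi-dimensional histogram described above:
    the result depends on per-pixel pairs, not per-raster summaries.
    """
    hist = Counter()
    for a_row, b_row, m_row in zip(raster_a, raster_b, mask):
        for a, b, m in zip(a_row, b_row, m_row):
            if m:
                hist[(a, b)] += 1
    return hist

a = [[1, 2], [1, 1]]
b = [[0, 0], [1, 0]]
m = [[1, 1], [1, 0]]  # bottom-right pixel falls outside the AOI
hist = joint_histogram(a, b, m)  # Counter({(1, 0): 1, (2, 0): 1, (1, 1): 1})
```

Because `Counter` values add, such partial histograms from different workers also combine cleanly in a later aggregation step.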

Challenges

  • The request stream does not have a uniform rate
  • The requests in the stream have relative priority
  • Stream latency needs to be optimized
  • Stream throughput needs to be optimized
  • The request AOI is unbounded (it may be arbitrarily small or large)
  • The rasters are not uniformly distributed
  • The set of rasters may evolve over time, sometimes during query execution

Analysis

Because latency needs to be optimized and the request AOI is unbounded, in some cases it will be necessary to parallelize the processing of requests: break up the request AOI so the pieces can be summarized in parallel and the partial results later aggregated into the final response.
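A minimal sketch of that split-and-aggregate shape, assuming a rectangular AOI for simplicity (`split_bbox` and `tile_summary` are illustrative helpers, not part of any proposed API):

```python
from concurrent.futures import ThreadPoolExecutor

def split_bbox(xmin, ymin, xmax, ymax, step):
    """Break an AOI bounding box into sub-extents no larger than step x step."""
    tiles = []
    x = xmin
    while x < xmax:
        y = ymin
        while y < ymax:
            tiles.append((x, y, min(x + step, xmax), min(y + step, ymax)))
            y += step
        x += step
    return tiles

def tile_summary(tile):
    # Stand-in for a per-tile summary (e.g. the sum of pixel values under
    # the AOI); here it just returns the tile's area.
    x0, y0, x1, y1 = tile
    return (x1 - x0) * (y1 - y0)

tiles = split_bbox(0, 0, 10, 10, 4)  # 9 sub-extents
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(tile_summary, tiles))
total = sum(partials)  # 100, the summary over the whole AOI
```

This works because the summary is associative: partial results per sub-extent combine into the same answer as a single pass over the full AOI.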

The summary is computationally simple, therefore most of the cost of the process is actually in query and IO: finding and fetching the relevant raster pixels before they can be summarized. Effective caching can therefore play a crucial role in both response latency and stream throughput. That is, we need to maximize the number of requests we can serve per read from the rasters.

The most intuitive way to achieve the above is to construct some kind of agent locality, where each agent responding to summary requests is responsible for a specific geographic area and can keep portions of the existing rasters either in memory or locally on disk. However, we know that raster density is not uniform. For instance, no rasters cover the ocean, and coverage per country may vary dramatically as well. Therefore the locality of each agent (the area it is responsible for) must be variable in order to keep the amount of data it is responsible for querying constant.

Design

Let's explore the likely design by describing the agent roles.

Worker Agent

This agent should own a scoped geographic area, keeping the underlying rasters in hot cache. It responds with either a full summary when the request AOI fits fully within its geographic area, or with a partial summary when the AOI only intersects its geographic area.
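The full-vs-partial decision reduces to containment and intersection tests between the agent's area and the AOI. A sketch with axis-aligned bounding boxes (the message names `RESULT`/`PARTIAL_RESULT` anticipate the queue design below; the function itself is illustrative):

```python
def respond(agent_bbox, aoi_bbox):
    """Decide whether a worker returns a full summary, a partial one, or nothing.

    Both arguments are (xmin, ymin, xmax, ymax) tuples.
    """
    ax0, ay0, ax1, ay1 = agent_bbox
    x0, y0, x1, y1 = aoi_bbox
    contains = ax0 <= x0 and ay0 <= y0 and x1 <= ax1 and y1 <= ay1
    intersects = x0 < ax1 and ax0 < x1 and y0 < ay1 and ay0 < y1
    if contains:
        return "RESULT"          # AOI fully inside the agent's area
    if intersects:
        return "PARTIAL_RESULT"  # AOI only overlaps the agent's area
    return None                  # not this agent's concern

respond((0, 0, 10, 10), (2, 2, 5, 5))     # "RESULT"
respond((0, 0, 10, 10), (5, 5, 15, 15))   # "PARTIAL_RESULT"
```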

Agent areas should not overlap, to prevent double counting. A convenient way to designate owned areas is through a variable-length geohash. Say an agent owns the area under a geohash key: if the geohash has more bits, the owned area shrinks accordingly. Let's say an agent is responsible for either a single raster dataset or a set of rasters that are always or often queried together.
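A sketch of the idea using bit-level geohashes (interleaved longitude/latitude bisection bits, rather than the usual base32 characters, to make the variable length explicit). Ownership keys form a prefix-free set, so each pixel key routes to exactly one agent; the agent keys below are made up for illustration:

```python
def geohash_bits(lon, lat, depth):
    """Interleave lon/lat bisection bits into a variable-length binary key."""
    bits = []
    lo = [-180.0, -90.0]
    hi = [180.0, 90.0]
    coord = [lon, lat]
    for i in range(depth):
        d = i % 2  # alternate longitude and latitude
        mid = (lo[d] + hi[d]) / 2
        if coord[d] >= mid:
            bits.append("1")
            lo[d] = mid
        else:
            bits.append("0")
            hi[d] = mid
    return "".join(bits)

def owner(key, agents):
    """Route a key to the agent whose prefix matches.

    With non-overlapping ownership (a prefix-free set of agent keys),
    at most one agent can match.
    """
    matches = [a for a in agents if key.startswith(a)]
    return max(matches, key=len) if matches else None

key = geohash_bits(10, 40, 4)       # "1100"
owner(key, ["0", "10", "11"])       # "11" — short prefix, large sparse area
```

Longer prefixes carve out smaller areas, which is exactly how an agent over a data-dense region keeps its cache size comparable to an agent owning a large sparse region.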

Supervision Agent

Since the area owned by agents is determined at runtime and may change as new rasters are added, there has to be a mechanism to ensure that sufficient agents exist to cover pending requests. Additionally, the number of worker agents is expected to fluctuate to match the request backlog. This is ambiguous in terms of implementation and can be done two ways:

  1. The supervision agent actually spins up and assigns agents based on load
  2. The worker agent picks up backlogged queues and builds up a cache for them

Query Planner

Each request may go to one or more agents. The exact plan depends on the metadata of the available rasters and the structure of the stream/number of agents. The query planner is an agent that keeps this metadata hot so it can break down each user summary request into a balanced number of work requests over the available agents.
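A sketch of the planner's core step under simplifying assumptions: the AOI has already been covered by a set of geohash-style keys upstream (how that covering is computed is out of scope here), and `plan` groups those keys by owning agent prefix. It also returns the expected partial count, which the aggregator needs later:

```python
def plan(request_id, aoi_keys, agents):
    """Split a request into per-agent WORK messages.

    aoi_keys: geohash-style keys covering the request AOI (assumed given).
    agents:   prefix-free set of agent ownership keys.
    Returns the work messages and the number of partial results to expect.
    """
    by_agent = {}
    for key in aoi_keys:
        agent = next((a for a in agents if key.startswith(a)), None)
        by_agent.setdefault(agent, []).append(key)
    work = [{"request": request_id, "agent": agent, "keys": keys}
            for agent, keys in by_agent.items()]
    return work, len(work)

work, expected = plan("r1", ["000", "001", "10"], ["00", "01", "10", "11"])
# two WORK messages: one for agent "00" (two keys), one for agent "10"
```

Recording `expected` at planning time is what lets the aggregator know when a request is complete without any further coordination.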

Query Aggregator

Each worker agent may emit a partial result; something needs to be responsible for aggregating the results so they can be returned to the client.

Implementation

The proposed communication channel for this system is a durable queue like Kafka.

  • Incoming requests are queued on the INBOX queue
  • The query planner dequeues an INBOX request and places n WORK requests on the analysis queue, split by variable geohash
  • A worker dequeues a WORK request and places either a RESULT message on the OUTBOUND queue or a PARTIAL_RESULT message on the PARTIAL queue
  • The aggregator dequeues messages from the PARTIAL queue and maintains an up-to-date accumulation. When the final partial arrives (the number of expected results is known at query planning time) it places a RESULT message on the OUTBOUND queue
  • The OUTBOUND queue is served to the client, possibly through SSE, or is further broken up into per-request or per-job queues
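The flow above can be simulated end to end with in-memory queues standing in for the durable Kafka topics. The per-area summary here is a trivial stand-in (key length), and each stage runs once sequentially; the point is the message shapes and the expected-count handshake between planner and aggregator:

```python
from queue import Queue

# In-memory stand-ins for the durable topics.
inbox, work_q, partial_q, outbound = Queue(), Queue(), Queue(), Queue()
expected = {}  # request id -> number of partials the aggregator waits for

def plan_one():
    # Planner: dequeue an INBOX request, emit n WORK items split by geohash.
    req = inbox.get()
    expected[req["id"]] = len(req["splits"])
    for prefix in req["splits"]:
        work_q.put({"id": req["id"], "prefix": prefix})

def work_one():
    # Worker: dequeue WORK, emit a PARTIAL_RESULT (toy summary: key length).
    w = work_q.get()
    partial_q.put({"id": w["id"], "value": len(w["prefix"])})

def aggregate_all():
    # Aggregator: accumulate partials; once the expected count for a request
    # has arrived, emit the final RESULT on the OUTBOUND queue.
    acc, seen = {}, {}
    while not partial_q.empty():
        p = partial_q.get()
        acc[p["id"]] = acc.get(p["id"], 0) + p["value"]
        seen[p["id"]] = seen.get(p["id"], 0) + 1
        if seen[p["id"]] == expected[p["id"]]:
            outbound.put({"id": p["id"], "result": acc[p["id"]]})

inbox.put({"id": "r1", "splits": ["00", "010", "011"]})
plan_one()
for _ in range(3):
    work_one()
aggregate_all()
result = outbound.get()  # {'id': 'r1', 'result': 8}
```

In the real system each stage would be an independent consumer group on its topic, so stages scale out by adding consumers rather than by calling functions in order.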

Performance

The dynamic overhead of this system is in message processing time: there is a delay between the request being known and the request being processed. This overhead is expected to be outweighed by the IO savings over naively reading raster data for each request, since data is kept locally and not re-fetched from blob storage on every request.

Concerns

The mechanism for scaling the workers is actually a little unclear. What is the preferred method here: to supervise the workers, which introduces a point of failure and additional latency, or to let workers scale themselves, in which case how do they de-dupe requests and pick up new ownership?
