There are multiple options for storing the static analysis warnings that WatchDog tracks. Since warnings are generated at a high frequency, we cannot send every single event to the server: the network load and storage requirements would quickly become overwhelming.
The following requirements are imposed:
- The network load imposed by WatchDog may not increase by more than 20% compared to the current network usage.
- All static analysis warning events must be preserved.
- All static analysis warnings must be anonymized on the client.
- Warnings can be clustered based on warning type.
- Warnings that existed for less than X seconds can be grouped into one event. (X to be determined)
All solutions should adhere to the above requirements. Moreover, they must be compatible with and non-intrusive to the existing event-based WatchDog networking infrastructure.
The warnings can be grouped by time interval: any warning generated in an interval X from timestamp T1 to T2 is grouped and sent at once. The warnings are sent as a map, with the warning type as key and, as value, the maximum, minimum and latest number of warnings present in the document. To gather this data, every interval needs to maintain these three values for each category. Data is only sent if one of the values differs from the previously sent interval, which means we have to store the data of the previous interval we sent.
class TimeStamped {
int max,
int min,
int latest
}
class TimeStamps {
Map<Category, TimeStamped> previousTimestamps;
Map<Category, TimeStamped> currentTimestamps;
increaseForCategory(category) {
stamped = currentTimestamps[category] || new TimeStamped(0, 0, 0);
currentTimestamps[category] = stamped;
stamped.latest++;
stamped.max = max(stamped.max, stamped.latest);
}
decreaseForCategory(category) {
stamped = currentTimestamps[category] || new TimeStamped(0, 0, 0);
currentTimestamps[category] = stamped;
stamped.latest--;
stamped.min = min(stamped.min, stamped.latest);
}
class TimerTask {
run() {
for (category in currentTimestamps) {
if (categoryIsDifferentInMaps(category)) {
trackEventManager.addEvent(currentTimestamps[category]);
if (currentTimestamps[category].latest == 0) {
delete previousTimestamps[category];
delete currentTimestamps[category];
} else {
// start the next interval from the latest value, then store a copy;
// copying (rather than aliasing) keeps the diff check meaningful
currentTimestamps[category].max = currentTimestamps[category].latest;
currentTimestamps[category].min = currentTimestamps[category].latest;
previousTimestamps[category] = copy(currentTimestamps[category]);
}
}
}
}
categoryIsDifferentInMaps(category) {
if (!previousTimestamps[category]) {
return true;
}
for (field in {max, min, latest}) {
if (previousTimestamps[category][field] != currentTimestamps[category][field]) {
return true;
}
}
return false;
}
}
}
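The pseudocode above can be turned into a small runnable sketch. Python is used here purely for illustration; the class names, the `sent_events` list (standing in for `trackEventManager`), and the category strings are assumptions, not the actual WatchDog API:

```python
from dataclasses import dataclass


@dataclass
class Stamped:
    max: int
    min: int
    latest: int


class IntervalTracker:
    """Tracks max/min/latest warning counts per category for one interval."""

    def __init__(self):
        self.previous = {}     # values of the last interval that was sent
        self.current = {}      # values of the interval in progress
        self.sent_events = []  # stands in for trackEventManager

    def _get(self, category):
        # the count is 0 at interval start, so (0, 0, 0) is the neutral state
        return self.current.setdefault(category, Stamped(0, 0, 0))

    def increase(self, category):
        s = self._get(category)
        s.latest += 1
        s.max = max(s.max, s.latest)

    def decrease(self, category):
        s = self._get(category)
        s.latest -= 1
        s.min = min(s.min, s.latest)

    def flush(self):
        """Called by the interval timer: send only categories that changed."""
        for category in list(self.current):
            cur = self.current[category]
            if self.previous.get(category) != cur:
                # send a snapshot, not the mutable object itself
                self.sent_events.append((category, Stamped(cur.max, cur.min, cur.latest)))
                if cur.latest == 0:
                    self.previous.pop(category, None)
                    del self.current[category]
                else:
                    # start the next interval from the latest value
                    cur.max = cur.min = cur.latest
                    self.previous[category] = Stamped(cur.max, cur.min, cur.latest)
```

An interval in which nothing changed produces no event, which is what keeps the network load bounded.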
- Grouping per time interval significantly reduces the network load
- Small memory footprint, only storing current warnings
- Easy algorithm for checking updates
- Significantly less granular data, since only one data point per interval is sent
- Only maximum, minimum and latest values, no in-between data, so trend lines are harder to determine
Instead of generating an event for every single warning, we can group them per interval. This means that we still store every single change, but avoid the overhead of one event per warning. The effect on the network load and storage is reduced, but could still be severe depending on the frequency of warning generation. A timestamp for each change could be made optional, since we can assume the changes are uniformly distributed within the interval.
class Change {
Direction direction,
optional DateTime time,
enum Direction {
INCREASE, DECREASE
}
}
class CategoryChanges {
Map<Category, List<Change>> currentChanges;
increaseForCategory(category) {
currentChanges[category] = currentChanges[category] || new List()
currentChanges[category].push(new Change(INCREASE, now()))
}
decreaseForCategory(category) {
currentChanges[category] = currentChanges[category] || new List()
currentChanges[category].push(new Change(DECREASE, now()))
}
class TimerTask {
run() {
for (category in currentChanges) {
changes = currentChanges[category];
if (changes.length == 0) {
delete currentChanges[category]
} else {
// send a copy, so clearing the list does not empty the queued event
trackEventManager.addEvent(copy(currentChanges[category]));
currentChanges[category].clear();
}
}
}
}
}
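This variant can likewise be sketched as a runnable example (again Python for illustration; `sent_events` stands in for `trackEventManager`, and the category strings are made up):

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional


class Direction(Enum):
    INCREASE = 1
    DECREASE = 2


@dataclass
class Change:
    direction: Direction
    time: Optional[datetime] = None  # optional if uniform distribution is assumed


class ChangeTracker:
    """Records every warning-count change, flushed once per interval."""

    def __init__(self):
        self.current_changes = {}  # category -> list of Change
        self.sent_events = []      # stands in for trackEventManager

    def increase(self, category):
        self.current_changes.setdefault(category, []).append(
            Change(Direction.INCREASE, datetime.now()))

    def decrease(self, category):
        self.current_changes.setdefault(category, []).append(
            Change(Direction.DECREASE, datetime.now()))

    def flush(self):
        """Called by the interval timer: send the batched changes per category."""
        for category in list(self.current_changes):
            changes = self.current_changes[category]
            if not changes:
                # no activity since the last flush: drop the empty entry
                del self.current_changes[category]
            else:
                # send a copy, so clearing does not empty the queued event
                self.sent_events.append((category, list(changes)))
                changes.clear()
```

Note that every change is preserved in the event, so granularity is not lost; only the per-event overhead is amortized over the interval.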
- Extremely granular data, storing every change
- Less overhead than sending every event
- Potentially still very data-intensive, as no filtering is in place