Often it is useful to have access to the median value for fields of a data stream since they are more robust with respect to outliers. The median is defined as the value of a dataset such that, when sorted, 50% of the data is smaller than the value and 50% of the data is larger then the value. Ordinarily this is difficult to calculate on a stream because it requires the collection and sorting of all data.
The median of a data stream can be approximated with a technique called stochastic averaging. To approximate the median value of a data stream one could use the following approach:
Given the current estimate of the median M. If the next observed value in the stream is larger than M, increase the current estimate by r (= the learning rate). If it is smaller, decrease the estimate by r. When M is close to the median, it increases as often as it decreases, and therefore it stabilizes.
This approach was taken from the book "Real-time Analytics - Techniques to Analyze and Visualize Streaming Data" P. 296 / Byron Ellis / Wiley
We store the data we need for the necessary stats book-keeping in a redis hash.
hmset some_stats_value
current NaN -- the current / last seen value, defaults to NaN
median NaN -- the current median, defaults to NaN
learning_rate 0.7 -- the learning rate - how fast to move towards the read median
Paste into redis:
hmset some_stats_value current NaN median NaN learning_rate 0.7
-- script load "
local key, value = KEYS[1], tonumber(ARGV[1]);
local values = redis.call('HMGET', key, 'current', 'median', 'learning_rate');
local current, median, learning_rate = tonumber(values[1]), tonumber(values[2]), tonumber(values[3]);
if (value ~= median) then -- only update of median is different than current value
if (median ~= median) then -- check for NaN
median = value; -- initialize median
else
median = median + (median < value and learning_rate or -learning_rate); -- update median
end;
redis.call('HMSET', key, 'current', value, 'median', median);
end;
if (ARGV[2] == 'return_median') then
return {'current', value, 'median', median};
end;
--"
Paste into redis: Note that I removed the comments from the script above.
script load "local key, value = KEYS[1], tonumber(ARGV[1]); local values = redis.call('HMGET', key, 'current', 'median', 'learning_rate'); local current, median, learning_rate = tonumber(values[1]), tonumber(values[2]), tonumber(values[3]); if (value ~= median) then if (median ~= median) then median = value; else median = median + (median < value and learning_rate or -learning_rate); end; redis.call('HMSET', key, 'current', value, 'median', median); end; if(ARGV[2]=='return_median') then return {'current', value, 'median', median}; end;"
Note that the script load
command should return the sha hash f572d4de2b58fd955e2b29ae3c55f6c0c0972553
.
- Variant 1 Just updates the median.
evalsha SCRIPT_SHA 1 THE_STATS_KEY THE_VALUE
- Variant 2 Update median and return stored values.
evalsha SCRIPT_SHA 1 THE_STATS_KEY THE_VALUE return_median
Initialize stats container some_stats_value
:
hmset some_stats_value current NaN median NaN learning_rate 0.7
Send some data to the script:
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 100
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 150
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 200
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 10
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 20
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 30
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 40
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 50
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 90
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 500
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 299
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 344
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 999
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 73
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 89
Print the values:
127.0.0.1:6379> hgetall some_stats_value
1) "current"
2) "89"
3) "median"
4) "98.599999999999994"
5) "learning_rate"
6) "0.7"
Note that the actual median in this example is 90.
Idea: Compute a "stabilization factor" sf -> count how many times we are above / below "the current median". If both counts are roughly equal - then the current median value is near the actual median - otherwise we know whether we are below or above.
sf = (count we are to small + 1) / (count we are to big + 1) -> we know that we are near true median once sf approaches 1.0 +- epsilon.