Given three options a,b,c with a certain weight each which controls the probability this option will be selected with
let $options := (
[1, "a"],
[2, "b"],
[1, "c"]
)
The probability that b
will be selected is twice as high
as for a
and c
.
Each weight has a probability of max probability (1.0) divided by the sum of all weights.
The sum of all weights can be obtained by the following expression.
sum($options?1)
In the above example we get 4 as the sum. The factor to divide all given weights is therefore 0.25
let $factor := 1.0 div sum($options?1)
Multiply each weight by this factor
let $prepared := $options ! [.?1 * $factor, .?2]
For the example this yields
(
[0.25, "a"],
[0.5, "b"],
[0.25, "c"]
)
This way the weight table is fit to be used in a function that selects by probability using a random number between 0 and 1.0
declare
function prob:select-by-p($options as array(*)+, $p as xs:double) as xs:string+ {
fold-left(
$options,
map {
"result": (),
"sum-p": 0,
"p": $p
},
prob:select#2
)
?result
};
declare %private
function prob:select($acc as map(*), $next as array(*)) as map(*) {
if (exists($acc?result))
then $acc (: pick first match, skip the rest :)
else
(: carry over sum of all previous probabilities :)
let $current-sum := $acc?sum-p + $next?1
let $match := if ($current-sum > $acc?p) then $next?2 else ()
return [ $match, $current-sum, $acc?p ]
};
The selection process can be simplified dramatically by calculating the sum of probabilities beforehand.
declare
function prob:prepare-options ($options as array(*)+) as array(*)+ {
fold-left($options?1, map{ "sum": 0, "adjusted": () },
prob:adjust-weight(?, ?, 1.0 div sum($options?1)))
=> map:get("adjusted")
=> for-each-pair($options, prob:adjust-option#2)
};
declare
function prob:adjust-weight($acc as map(*), $next-weight as xs:double, $factor as xs:double) as map(*) {
let $sum := $acc?sum + $next-weight
return map{
"sum": $sum,
"adjusted": ($acc?adjusted, $sum * $factor)
}
};
declare
function prob:adjust-option ($adjusted-weight as xs:double, $option as array(*)) as array(*) {
[$adjusted-weight, $option?2]
};
So that our options become
(
[0.25, "a"],
[0.75, "b"],
[1.0, "c"]
)
With these we can select with
declare
function prob:select-by-p($options as array(*)+, $p as xs:double) as item()* {
fold-left($options, (), prob:select(?, ?, $p))
};
(:~
: either there is a result or the p is too small: move on to the next
:)
declare
function prob:select($result as item()*, $next as array(*), $p as xs:double) as item()* {
if (empty($result) and $next?1 >= $p)
then $next?2
else $result
};
I chose to provide reducers with all constant values as function parameters to avoid having complex accumulators.
The lossy nature of floating points can lead to weight tables where the sum of all weights is
not exactly 1.
This happens for sums that are base 3, 6 or 7 and probably there are more.
The last probability is then one of 0.999999999999999999
, 1.000000000000000002
.
Here are a few examples:
([1, "a"], [1, "b"], [1, "c"])
([1, "a"], [2, "b"], [3, "c"])
([1, "a"], [2, "b"], [4, "c"])
Luckily, XQuery has a built-in function that solves that problem entirely, as if someone else had this issue in the past.
let $sane-floating-point-value := round-half-to-even($any-floating-point-value, 17))
The precision, the second argument, might have to be adjusted to your needs. While testing this code on eXist-db (7.0.0-SNAPSHOT) it produced both precise and clean values.
Would it be more performant to sort the options by weight?
Considering an unsorted weight table heavily skewed towards selection of the last option.
(
[1, "a"],
[1, "b"],
[99999, "c"],
)
In this case the fold-left in prob:select
will have to go through the entire list of options before
being able to skip for the majority of calls.
On the other hand the one additional comparison ($next?1 >= $p
) is so light-
weight, that the effect is negligible. For very long lists of options
it might be a little more interesting.