Skip to content

Instantly share code, notes, and snippets.

@line-o
Last active September 22, 2023 19:17
Show Gist options
  • Save line-o/4166755e5307f16bb0ae0c26d921d7e2 to your computer and use it in GitHub Desktop.
Save line-o/4166755e5307f16bb0ae0c26d921d7e2 to your computer and use it in GitHub Desktop.
Selecting items skewed by weights was an interesting problem to solve

Selection by weight

Given three options a,b,c with a certain weight each which controls the probability this option will be selected with

let $options := (
    [1, "a"],
    [2, "b"],
    [1, "c"]
)

The probability that b will be selected is twice as high as for a and c.

Each weight has a probability of max probability (1.0) divided by the sum of all weights.

The sum of all weights can be obtained by the following expression.

sum($options?1)

In the above example we get 4 as the sum. The factor to divide all given weights is therefore 0.25

let $factor := 1.0 div sum($options?1)

Multiply each weight by this factor

let $prepared := $options ! [.?1 * $factor, .?2]

For the example this yields

(
    [0.25, "a"],
    [0.5,  "b"],
    [0.25, "c"]
)

This way the weight table is fit to be used in a function that selects by probability using a random number between 0 and 1.0

declare
function prob:select-by-p($options as array(*)+, $p as xs:double) as xs:string+ {
    fold-left(
        $options,
        map {
            "result": (),
            "sum-p": 0,
            "p": $p
        },
        prob:select#2
    )
    ?result
};

declare %private
function prob:select($acc as map(*), $next as array(*)) as map(*) {
    if (exists($acc?result))
    then $acc (: pick first match, skip the rest :)
    else
        (: carry over sum of all previous probabilities :)
        let $current-sum := $acc?sum-p + $next?1
        let $match := if ($current-sum > $acc?p) then $next?2 else ()
        return [ $match, $current-sum, $acc?p ]
};

Further optimisations

The selection process can be simplified dramatically by calculating the sum of probabilities beforehand.

declare
function prob:prepare-options ($options as array(*)+) as array(*)+ {
    fold-left($options?1, map{ "sum": 0, "adjusted": () },
        prob:adjust-weight(?, ?, 1.0 div sum($options?1)))
    => map:get("adjusted")
    => for-each-pair($options, prob:adjust-option#2)
};

declare
function prob:adjust-weight($acc as map(*), $next-weight as xs:double, $factor as xs:double) as map(*) {
    let $sum := $acc?sum + $next-weight
    return map{
        "sum": $sum,
        "adjusted": ($acc?adjusted, $sum * $factor)
    }
};

declare
function prob:adjust-option ($adjusted-weight as xs:double, $option as array(*)) as array(*) {
    [$adjusted-weight, $option?2]
};

So that our options become

(
    [0.25, "a"],
    [0.75, "b"],
    [1.0,  "c"]
)

With these we can select with

declare
function prob:select-by-p($options as array(*)+, $p as xs:double) as item()* {
    fold-left($options, (), prob:select(?, ?, $p))
};

(:~
 : either there is a result or the p is too small: move on to the next
 :)
declare
function prob:select($result as item()*, $next as array(*), $p as xs:double) as item()* {
    if (empty($result) and $next?1 >= $p)
    then $next?2
    else $result 
};

I chose to provide reducers with all constant values as function parameters to avoid having complex accumulators.

Floating points are "fun"

The lossy nature of floating points can lead to weight tables where the sum of all weights is not exactly 1. This happens for sums that are base 3, 6 or 7 and probably there are more. The last probability is then one of 0.999999999999999999, 1.000000000000000002.

Here are a few examples:

([1, "a"], [1, "b"], [1, "c"])
([1, "a"], [2, "b"], [3, "c"])
([1, "a"], [2, "b"], [4, "c"])

Luckily, XQuery has a built-in function that solves that problem entirely, as if someone else had this issue in the past.

let $sane-floating-point-value := round-half-to-even($any-floating-point-value, 17))

The precision, the second argument, might have to be adjusted to your needs. While testing this code on eXist-db (7.0.0-SNAPSHOT) it produced both precise and clean values.

Performance Considerations

Would it be more performant to sort the options by weight?

Considering an unsorted weight table heavily skewed towards selection of the last option.

(
    [1, "a"],
    [1, "b"],
    [99999, "c"],
)

In this case the fold-left in prob:select will have to go through the entire list of options before being able to skip for the majority of calls.

On the other hand the one additional comparison ($next?1 >= $p) is so light- weight, that the effect is negligible. For very long lists of options it might be a little more interesting.

module namespace prob = "//line-o.de/ns/prob";
declare namespace map = "http://www.w3.org/2005/xpath-functions/map";
declare namespace array = "http://www.w3.org/2005/xpath-functions/array";
(: select by p from prepared sequence of options :)
declare
function prob:select-by-p($options as array(*)+, $p as xs:double) as item()* {
fold-left($options, (),
prob:select(?, ?, $p)
)?2
};
(: This is just a convenience function :)
declare
function prob:prepare-and-select($options as array(*)+, $p as xs:double) as item()* {
prob:prepare-options($options)
=> prob:select-by-p($p)
};
(: prepare a sequence of options :)
declare
function prob:prepare-options ($options as array(*)+) as array(*)+ {
prob:prepare-options($options, prob:adjust-option#2)
};
(: if you want to tweak option adjustment :)
declare
function prob:prepare-options (
$options as array(*)+,
$adjust as function(xs:double, array(*)) as array(*)
) as array(*)+ {
prob:adjust-weights($options?1)
=> for-each-pair($options, $adjust)
};
(: ---- helper functions ---- :)
declare %private
function prob:adjust-weights ($weights as xs:numeric*) as xs:numeric* {
fold-left($weights,
map{ "sum": 0, "adjusted": () },
prob:adjust-weight(?, ?, 1.0 div sum($weights))
)
?adjusted
};
(:~
: either there is a result or the p is too small: move on to the next
:)
declare %private
function prob:select($result as array(*)?, $next as array(*), $p as xs:double) as array(*)? {
if (empty($result) and $next?1 >= $p)
then $next
else $result
};
declare %private
function prob:adjust-weight($acc as map(*), $next-weight as xs:double, $factor as xs:double) as map(*) {
let $sum := $acc?sum + $next-weight
return map{
"sum": $sum,
"adjusted": ($acc?adjusted,
round-half-to-even($sum * $factor, 17)) (: get rid of floating point problems :)
}
};
declare %private
function prob:adjust-option ($adjusted-weight as xs:double, $option as array(*)) as array(*) {
array:put($option, 1, $adjusted-weight)
};
import module namespace prob="//line-o.de/ns/prob" at "./prob.xqm";
(: get a random number to select with :)
declare variable $p := random-number-generator()?number;
declare variable $data :=
<options>
<option />
<option><a/></option>
<option weight="3"><b/></option>
</options>
;
(: let's see this in action with different applications :)
map{
(: show the value of p to check results :)
"p": serialize($p)
,
(: convenient way of selecting from unprepared sets of options :)
"select from 3":
([1, "a"], [2, "b"], [3, "c"]) => prob:prepare-and-select($p)
,
"select from 4":
([1, "a"], [2, "b"], [4, "c"], [10, "d"]) => prob:prepare-and-select($p)
,
(: one advantage of using an array is the ability to select any sequence :)
"select arbitrary": (
[1, map{"complex": "thing"}],
[1, ()],
[1, ("c", 1, <node />)]
)
=> prob:prepare-options()
=> prob:select-by-p($p)
,
(: can also be used with pre-prepared sets :)
"pre-prepared":
([0.5, 0], [1.0, 1])
=> prob:select-by-p($p)
,
(: applicable to different styles of declaring options :)
"map-style":
map { "a": 1, "b": 30, "c": 23 }
=> map:for-each(function ($k, $v) { [$v, $k] })
=> prob:prepare-and-select($p)
,
(: gather options from any data source :)
"from data":
prob:prepare-options(
for $option in $data//option
return [
($option/@weight/number(), 1)[1], (: default weight is 1 :)
$option/node()
]
)
=> prob:select-by-p($p)
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment