A GNU C / Asm implementation is here: https://gist.github.com/vsivsi/8511aca1bac528f49fbb45a636afa4b5
NOTE: This must be run on an Intel processor that supports both AVX512F and AVX512DQ.
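Before running the test, it may help to confirm that the machine actually reports these extensions. The following is a minimal sketch, not part of the gist, and it assumes a Linux host where `/proc/cpuinfo` lists feature flags under the names `avx512f` and `avx512dq`. The helper name `hasCPUFlags` is hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// hasCPUFlags is a hypothetical helper. It reports whether every name in
// want appears in the first "flags" line of cpuinfo, which is expected to
// be text in the format of Linux /proc/cpuinfo.
func hasCPUFlags(cpuinfo string, want ...string) bool {
	for _, line := range strings.Split(cpuinfo, "\n") {
		key, val, ok := strings.Cut(line, ":")
		if !ok || strings.TrimSpace(key) != "flags" {
			continue
		}
		// Collect the flags as whole words so that "avx512f" does not
		// accidentally match a longer name such as "avx512fp16".
		have := make(map[string]bool)
		for _, f := range strings.Fields(val) {
			have[f] = true
		}
		for _, w := range want {
			if !have[w] {
				return false
			}
		}
		return true
	}
	return false // no "flags" line found
}

func main() {
	// Assumption: Linux only; other systems do not provide this file.
	data, err := os.ReadFile("/proc/cpuinfo")
	if err != nil {
		fmt.Println("cannot read /proc/cpuinfo:", err)
		return
	}
	if hasCPUFlags(string(data), "avx512f", "avx512dq") {
		fmt.Println("AVX512F and AVX512DQ are reported")
	} else {
		fmt.Println("AVX512F/DQ not reported; the test is not meaningful here")
	}
}
```

On systems other than Linux, a different detection method would be needed.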
To test: go test -count 1 -timeout 15m -run '^TestMask$' gist.github.com/vsivsi/fff8618ace4b02eb410dd8792779bf32
This should fail with something like: