Input data is stored as
input = [r[0], g[0], b[0], a[0], r[1], g[1], b[1], a[1], r[2], g[2], b[2], a[2], ...]
Weights are float values computed for each output pixel and rescaled to int16:
weights[i] = [w[i, 0], w[i, 1], ..., w[i, K - 1]]
We want to compute the output as follows:
output = [oR[0], oG[0], oB[0], oA[0], oR[1], oG[1], oB[1], oA[1], ...]
where
oR[i] = r[xmin[i]] * w[i, 0] + r[xmin[i] + 1] * w[i, 1] + ... + r[xmin[i] + K - 1] * w[i, K - 1]
oG[i] = g[xmin[i]] * w[i, 0] + g[xmin[i] + 1] * w[i, 1] + ... + g[xmin[i] + K - 1] * w[i, K - 1]
oB[i] = b[xmin[i]] * w[i, 0] + b[xmin[i] + 1] * w[i, 1] + ... + b[xmin[i] + K - 1] * w[i, K - 1]
oA[i] = a[xmin[i]] * w[i, 0] + a[xmin[i] + 1] * w[i, 1] + ... + a[xmin[i] + K - 1] * w[i, K - 1]
where r, g, b and a are uint8 and w is float.
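As a point of reference, the formulas above can be sketched in plain float arithmetic. This is only an illustrative sketch: the function name horizontal_pixel_f32 and its parameters are hypothetical, and we assume one interleaved RGBA line and a non-negative xmin.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical float reference for one output pixel of one line.
// `line` is an interleaved RGBA row, `w` holds the K weights of this output
// pixel and `xmin` is the index of the first contributing input pixel.
inline void horizontal_pixel_f32(const std::vector<uint8_t>& line,
                                 const std::vector<float>& w,
                                 int xmin,
                                 float out[4]) {
  const int K = static_cast<int>(w.size());
  for (int c = 0; c < 4; ++c) {  // c = 0..3 selects r, g, b, a
    float acc = 0.0f;
    for (int k = 0; k < K; ++k) {
      acc += line[4 * (xmin + k) + c] * w[k];
    }
    out[c] = acc;
  }
}
```

The integer scheme described next is meant to reproduce these sums, rounded to the nearest uint8 value.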
Here is a way to perform the computations with integers with minimal precision loss.
- Rescale float weights into int16
- Find the max float weight w_max to estimate weights_precision:
unsigned int weights_precision = 0;
for (weights_precision = 0; weights_precision < 22; weights_precision += 1) {
int next_value = (int) (0.5 + w_max * (1 << (weights_precision + 1)));
if (next_value >= (1 << 15))
break;
}
- transform float value into int16 value:
w_i16[i] = (int16) (sign(w_f32) * 0.5 + w_f32 * (1 << weights_precision));
- Compute output value using int dtype:
uint8 dst = ...
uint8 src = ...
int16 wts = ...
int output = 1 << (weights_precision - 1);
output += src[0] * wts[0];
output += src[1] * wts[1];
...
output += src[K - 1] * wts[K - 1];
output = (output >> weights_precision);
dst[o] = (uint8) clamp(output, 0, 255);
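The steps above can be put together in a small scalar sketch. The function names are hypothetical, the sign(...) * 0.5 term is written as a conditional, and we assume weights_precision >= 1 and that src already holds the K values of a single channel.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Same loop as in the text: stop at the first precision whose next step
// would push the rescaled w_max out of the int16 range.
inline unsigned int compute_weights_precision(float w_max) {
  unsigned int weights_precision = 0;
  for (weights_precision = 0; weights_precision < 22; weights_precision += 1) {
    int next_value = (int)(0.5 + w_max * (1 << (weights_precision + 1)));
    if (next_value >= (1 << 15)) break;
  }
  return weights_precision;
}

// Round to nearest: add +/-0.5 before the cast truncates toward zero.
inline std::vector<int16_t> to_int16_weights(const std::vector<float>& w,
                                             unsigned int weights_precision) {
  std::vector<int16_t> out;
  out.reserve(w.size());
  for (float w_f32 : w) {
    const float half = (w_f32 < 0.0f) ? -0.5f : 0.5f;
    out.push_back((int16_t)(half + w_f32 * (1 << weights_precision)));
  }
  return out;
}

// One channel of one output pixel. src[k] is the channel value of the k-th
// contributing input pixel (assumed already gathered, so the stride is 1).
inline uint8_t interpolate_channel(const uint8_t* src, const int16_t* wts,
                                   int K, unsigned int weights_precision) {
  int output = 1 << (weights_precision - 1);  // rounding offset, i.e. 0.5
  for (int k = 0; k < K; ++k) output += src[k] * wts[k];
  output = output >> weights_precision;
  return (uint8_t)std::min(std::max(output, 0), 255);
}
```

For example, with weights (0.5, 0.25, 0.25) the precision comes out as 15, the int16 weights are (16384, 8192, 8192), and the channel values (100, 200, 40) give 110, which matches the float result.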
As data format is RGBA with R,G,B,A being uint8, we can encode 4 values as a single uint32 value.
Working register, avx2 = 32 uint8 places
reg = [0 0 0 0 0 0 0 0 | 0 0 0 0 0 0 0 0 | 0 0 0 0 0 0 0 0 | 0 0 0 0 0 0 0 0]
We can split K (the size of the weight vector for a given output index) as a sum: K = n * 4 + m * 2 + k.
We load and process 4 weight values in a loop ("block 4"), then 2 weight values in another loop ("block 2"), and finally 1 weight value in the final loop ("block 1").
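A minimal sketch of this decomposition, assuming the blocks are consumed greedily (as many "block 4" iterations as possible first). BlockSplit and split_blocks are hypothetical names used only for illustration.

```cpp
// Number of iterations of each loop for a weight vector of size K.
struct BlockSplit {
  int n4;  // iterations of "block 4"
  int m2;  // iterations of "block 2"
  int k1;  // iterations of "block 1"
};

inline BlockSplit split_blocks(int K) {
  BlockSplit s;
  s.n4 = K / 4;
  s.m2 = (K % 4) / 2;
  s.k1 = K % 2;
  return s;  // K == s.n4 * 4 + s.m2 * 2 + s.k1
}
```

With this greedy choice, K = 7 is handled as one "block 4", one "block 2" and one "block 1" iteration.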
- As we are doing computations in integer dtype, we add the offset (= 1 << (weights_precision - 1)):
reg = [
0 128 0 0 0 128 0 0 | 0 128 0 0 0 128 0 0 | 0 128 0 0 0 128 0 0 | 0 128 0 0 0 128 0 0
]
- Load weights. For "block 4" we load 4 int16 values (w0, w1) and (w2, w3). Each int16 value w_i is then represented in the register by two uint8 values wl_i and wh_i (its low and high bytes):
w01 = [
wl_0 wh_0 wl_1 wh_1 wl_0 wh_0 wl_1 wh_1 | ... | ... | wl_0 wh_0 wl_1 wh_1 wl_0 wh_0 wl_1 wh_1
]
For example,
w01 = [
183 45 0 64 183 45 0 64 | 183 45 0 64 183 45 0 64 | 183 45 0 64 183 45 0 64 | 183 45 0 64 183 45 0 64
]
w23 = [
wl_2 wh_2 wl_3 wh_3 wl_2 wh_2 wl_3 wh_3 | ... | ... | wl_2 wh_2 wl_3 wh_3 wl_2 wh_2 wl_3 wh_3
]
On the next iteration we will load the next pairs of weights, (w4, w5) as w45 and (w6, w7) as w67, in case of "block 4".
In case of "block 2" we will load 2 int16 values (w0, w1):
w01 = [
wl_0 wh_0 wl_1 wh_1 wl_0 wh_0 wl_1 wh_1 | ... | ... | wl_0 wh_0 wl_1 wh_1 wl_0 wh_0 wl_1 wh_1
]
And in case of "block 1" we will load only 1 int16 value w0:
w0 = [
wl_0 wh_0 0 0 wl_0 wh_0 0 0 | ... | ... | wl_0 wh_0 0 0 wl_0 wh_0 0 0
]
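The weight layouts above can be emulated byte by byte. The sketch below is not the vectorized code itself: broadcast_weight_pair is a hypothetical helper, and we assume a little-endian byte order, so the low byte wl comes before the high byte wh.

```cpp
#include <array>
#include <cstdint>

// Fill a 32-byte "register" with the pair (w0, w1) repeated in each of the
// eight 32-bit lanes. For "block 1", pass w1 = 0.
inline std::array<uint8_t, 32> broadcast_weight_pair(int16_t w0, int16_t w1) {
  std::array<uint8_t, 32> reg{};
  const uint16_t u0 = static_cast<uint16_t>(w0);
  const uint16_t u1 = static_cast<uint16_t>(w1);
  for (int lane = 0; lane < 8; ++lane) {
    reg[4 * lane + 0] = static_cast<uint8_t>(u0 & 0xFF);  // wl_0
    reg[4 * lane + 1] = static_cast<uint8_t>(u0 >> 8);    // wh_0
    reg[4 * lane + 2] = static_cast<uint8_t>(u1 & 0xFF);  // wl_1
    reg[4 * lane + 3] = static_cast<uint8_t>(u1 >> 8);    // wh_1
  }
  return reg;
}
```

Under these assumptions, the example register "183 45 0 64 ..." corresponds to w0 = 45 * 256 + 183 = 11703 and w1 = 64 * 256 = 16384.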
- Load source data. Each RGBA pixel takes 4 uint8 places, so half of a 256-bit register (= 16 uint8 places) can be filled with 4 pixels. To fill 32 uint8 places (= 256 bits) we can load 4 pixels from each of two lines, e.g. r0-r3 and rr0-rr3, where ri is a red value from line0 and rri is a red value from line1.
Thus, we can process 2 lines in parallel. The number of loaded pixels determines the block option. For "block 4" we load pixels 0-3:
data = [
r0 g0 b0 a0 r1 g1 b1 a1 | r2 g2 b2 a2 r3 g3 b3 a3 | rr0 gg0 bb0 aa0 rr1 gg1 bb1 aa1 | rr2 gg2 bb2 aa2 rr3 gg3 bb3 aa3
]
For example,
data = [
0 1 2 255 3 4 5 255 | 6 7 8 255 9 10 11 255 | 27 28 29 255 30 31 32 255 | 33 34 35 255 36 37 38 255
]
In case of "block 2", we load
data = [
r0 g0 b0 a0 r1 g1 b1 a1 | 0 0 0 0 0 0 0 0 | rr0 gg0 bb0 aa0 rr1 gg1 bb1 aa1 | 0 0 0 0 0 0 0 0
]
and in case of "block 1", we load
data = [
r0 g0 b0 a0 0 0 0 0 | 0 0 0 0 0 0 0 0 | rr0 gg0 bb0 aa0 0 0 0 0 | 0 0 0 0 0 0 0 0
]
- As each weights register holds only 2 weight values, we have to split and shuffle the source data so that we can correctly multiply r0 * w0 + r1 * w1 and r2 * w2 + r3 * w3. For "block 4" we obtain:
data_01 = [
r0 0 r1 0 g0 0 g1 0 | b0 0 b1 0 a0 0 a1 0 | rr0 0 rr1 0 gg0 0 gg1 0 | bb0 0 bb1 0 aa0 0 aa1 0
]
data_23 = [
r2 0 r3 0 g2 0 g3 0 | b2 0 b3 0 a2 0 a3 0 | rr2 0 rr3 0 gg2 0 gg3 0 | bb2 0 bb3 0 aa2 0 aa3 0
]
For "block 2" we will have
data_01 = [
r0 0 r1 0 g0 0 g1 0 | b0 0 b1 0 a0 0 a1 0 | rr0 0 rr1 0 gg0 0 gg1 0 | bb0 0 bb1 0 aa0 0 aa1 0
]
and for "block 1" we will have
data_0 = [
r0 0 0 0 g0 0 0 0 | b0 0 0 0 a0 0 0 0 | rr0 0 0 0 gg0 0 0 0 | bb0 0 0 0 aa0 0 0 0
]
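The split-and-shuffle step can be emulated on plain bytes as follows. shuffle_pair is a hypothetical helper written for illustration; it places each uint8 channel value in the low byte of a 16-bit slot and leaves the high byte at zero.

```cpp
#include <array>
#include <cstdint>

// Build data_01 (first_pixel = 0) or data_23 (first_pixel = 2) from `data`.
// Each 16-byte half of `data` holds 4 RGBA pixels of one line.
inline std::array<uint8_t, 32> shuffle_pair(const std::array<uint8_t, 32>& data,
                                            int first_pixel) {
  std::array<uint8_t, 32> out{};  // zero-initialized: high bytes stay 0
  for (int half = 0; half < 2; ++half) {  // half 0: line0, half 1: line1
    const int base = 16 * half;
    for (int c = 0; c < 4; ++c) {  // c selects r, g, b, a
      out[base + 4 * c + 0] = data[base + 4 * first_pixel + c];
      out[base + 4 * c + 2] = data[base + 4 * (first_pixel + 1) + c];
    }
  }
  return out;
}
```

For "block 2" and "block 1" the unused pixels in data are already zero, so the same mapping with first_pixel = 0 yields the data_01 and data_0 layouts shown above.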
- Multiply and add weights and source data using 32-bit integer precision, which means each output value takes 4 placeholders (a b c d).
# w01 = [
# wl_0 wh_0 wl_1 wh_1 wl_0 wh_0 wl_1 wh_1 | ... | ... | wl_0 wh_0 wl_1 wh_1 wl_0 wh_0 wl_1 wh_1
# ]
out01 = data_01 * w01
out01 = [
(r0 0) * (wl_0 wh_0) + (r1 0) * (wl_1 wh_1), (g0 0) * (wl_0 wh_0) + (g1 0) * (wl_1 wh_1) |
(b0 0) * (wl_0 wh_0) + (b1 0) * (wl_1 wh_1), (a0 0) * (wl_0 wh_0) + (a1 0) * (wl_1 wh_1) |
(rr0 0) * (wl_0 wh_0) + (rr1 0) * (wl_1 wh_1), (gg0 0) * (wl_0 wh_0) + (gg1 0) * (wl_1 wh_1) |
(bb0 0) * (wl_0 wh_0) + (bb1 0) * (wl_1 wh_1), (aa0 0) * (wl_0 wh_0) + (aa1 0) * (wl_1 wh_1)
]
where (pi 0) * (wl_j wh_j) + (pk 0) * (wl_n wh_n) = (out_0, out_1, out_2, out_3).
out23 = data_23 * w23
out23 = [
(r2 0) * (wl_2 wh_2) + (r3 0) * (wl_3 wh_3), (g2 0) * (wl_2 wh_2) + (g3 0) * (wl_3 wh_3) |
(b2 0) * (wl_2 wh_2) + (b3 0) * (wl_3 wh_3), (a2 0) * (wl_2 wh_2) + (a3 0) * (wl_3 wh_3) |
(rr2 0) * (wl_2 wh_2) + (rr3 0) * (wl_3 wh_3), (gg2 0) * (wl_2 wh_2) + (gg3 0) * (wl_3 wh_3) |
(bb2 0) * (wl_2 wh_2) + (bb3 0) * (wl_3 wh_3), (aa2 0) * (wl_2 wh_2) + (aa3 0) * (wl_3 wh_3)
]
For "block 1" we will have
out0 = [
(r0 0) * (wl_0 wh_0), (g0 0) * (wl_0 wh_0) |
(b0 0) * (wl_0 wh_0), (a0 0) * (wl_0 wh_0) |
(rr0 0) * (wl_0 wh_0), (gg0 0) * (wl_0 wh_0) |
(bb0 0) * (wl_0 wh_0), (aa0 0) * (wl_0 wh_0)
]
Here each element like (r0 0) * (wl_0 wh_0) represents an int32 and takes 4 placeholders.
Output is accumulated with the results from previous iterations.
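The multiply-and-add step can be sketched in scalar form. The helper madd_i16 below is hypothetical; it mirrors the pairwise behaviour described above (the same semantics as the AVX2 intrinsic _mm256_madd_epi16): adjacent signed 16-bit products are summed into one 32-bit value.

```cpp
#include <array>
#include <cstdint>

// a: shuffled source data viewed as 16 int16 values, e.g. (r0, r1, g0, g1, ...)
// b: weights viewed as 16 int16 values, e.g. (w0, w1, w0, w1, ...)
// out[i] = a[2i] * b[2i] + a[2i + 1] * b[2i + 1], computed in int32.
inline std::array<int32_t, 8> madd_i16(const std::array<int16_t, 16>& a,
                                       const std::array<int16_t, 16>& b) {
  std::array<int32_t, 8> out{};
  for (int i = 0; i < 8; ++i) {
    out[i] = static_cast<int32_t>(a[2 * i]) * b[2 * i] +
             static_cast<int32_t>(a[2 * i + 1]) * b[2 * i + 1];
  }
  return out;
}
```

Adding the returned values to the accumulator register (which starts at the rounding offset) gives the running sums of the previous steps.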
- Add registers out01 and out23 together in case of "block 4":
out1234 = [
(r0 0) * (wl_0 wh_0) + (r1 0) * (wl_1 wh_1) + (r2 0) * (wl_2 wh_2) + (r3 0) * (wl_3 wh_3),
(g0 0) * (wl_0 wh_0) + (g1 0) * (wl_1 wh_1) + (g2 0) * (wl_2 wh_2) + (g3 0) * (wl_3 wh_3) |
(b0 0) * (wl_0 wh_0) + (b1 0) * (wl_1 wh_1) + (b2 0) * (wl_2 wh_2) + (b3 0) * (wl_3 wh_3),
(a0 0) * (wl_0 wh_0) + (a1 0) * (wl_1 wh_1) + (a2 0) * (wl_2 wh_2) + (a3 0) * (wl_3 wh_3) |
(rr0 0) * (wl_0 wh_0) + (rr1 0) * (wl_1 wh_1) + (rr2 0) * (wl_2 wh_2) + (rr3 0) * (wl_3 wh_3),
(gg0 0) * (wl_0 wh_0) + (gg1 0) * (wl_1 wh_1) + (gg2 0) * (wl_2 wh_2) + (gg3 0) * (wl_3 wh_3) |
(bb0 0) * (wl_0 wh_0) + (bb1 0) * (wl_1 wh_1) + (bb2 0) * (wl_2 wh_2) + (bb3 0) * (wl_3 wh_3),
(aa0 0) * (wl_0 wh_0) + (aa1 0) * (wl_1 wh_1) + (aa2 0) * (wl_2 wh_2) + (aa3 0) * (wl_3 wh_3)
]
- Shift back the output integer values (output = (output >> weights_precision)):
out01 = out01 >> weights_precision
# or
out1234 = out1234 >> weights_precision
- Convert packed signed 32-bit integers to packed 16-bit integers using signed saturation
(a a a a b b b b | c c c c d d d d) -> (a a b b c c d d | 0 0 0 0 0 0 0 0)
- Convert packed signed 16-bit integers to packed 8-bit integers using unsigned saturation
(a a b b c c d d) -> (a b c d 0 0 0 0)
- Write the output into single uint32
(a b c d) -> x_uint32
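The two saturating conversions and the final write can be sketched for a single pixel as below. The helper names are hypothetical, and we assume a little-endian layout in which R ends up in the lowest byte of the uint32.

```cpp
#include <cstdint>

// int32 -> int16 with signed saturation.
inline int16_t saturate_i16(int32_t v) {
  if (v > 32767) return 32767;
  if (v < -32768) return -32768;
  return static_cast<int16_t>(v);
}

// int16 -> uint8 with unsigned saturation.
inline uint8_t saturate_u8(int16_t v) {
  if (v > 255) return 255;
  if (v < 0) return 0;
  return static_cast<uint8_t>(v);
}

// Pack the four shifted channel sums (R, G, B, A) into one uint32.
inline uint32_t pack_rgba(const int32_t px[4]) {
  uint32_t x = 0;
  for (int c = 0; c < 4; ++c) {
    const uint8_t byte = saturate_u8(saturate_i16(px[c]));
    x |= static_cast<uint32_t>(byte) << (8 * c);
  }
  return x;
}
```

The two saturation steps together play the role of clamp(output, 0, 255) from the scalar description.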
For the vertical pass, input data is stored as
input = [
r[0, 0], g[0, 0], b[0, 0], a[0, 0], r[0, 1], g[0, 1], b[0, 1], a[0, 1], r[0, 2], g[0, 2], b[0, 2], a[0, 2], ...
r[1, 0], g[1, 0], b[1, 0], a[1, 0], r[1, 1], g[1, 1], b[1, 1], a[1, 1], r[1, 2], g[1, 2], b[1, 2], a[1, 2], ...
...
r[K - 1, 0], g[K - 1, 0], b[K - 1, 0], a[K - 1, 0], r[K - 1, 1], g[K - 1, 1], b[K - 1, 1], a[K - 1, 1], r[K - 1, 2], g[K - 1, 2], b[K - 1, 2], a[K - 1, 2], ...
...
]
Weights are float values computed for each output pixel and rescaled to int16:
weights[i] = [w[i, 0], w[i, 1], ..., w[i, K - 1]]
We want to compute the output as follows:
output = [
oR[0, 0], oG[0, 0], oB[0, 0], oA[0, 0], oR[0, 1], oG[0, 1], oB[0, 1], oA[0, 1], ...
]
where
oR[j, i] = r[ymin[j], i] * w[j, 0] + r[ymin[j] + 1, i] * w[j, 1] + ... + r[ymin[j] + K - 1, i] * w[j, K - 1]
oG[j, i] = g[ymin[j], i] * w[j, 0] + g[ymin[j] + 1, i] * w[j, 1] + ... + g[ymin[j] + K - 1, i] * w[j, K - 1]
oB[j, i] = b[ymin[j], i] * w[j, 0] + b[ymin[j] + 1, i] * w[j, 1] + ... + b[ymin[j] + K - 1, i] * w[j, K - 1]
oA[j, i] = a[ymin[j], i] * w[j, 0] + a[ymin[j] + 1, i] * w[j, 1] + ... + a[ymin[j] + K - 1, i] * w[j, K - 1]
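For comparison, here is a float sketch of these sums for one channel of one output pixel. The function vertical_channel_f32 and its parameters are hypothetical; we assume a row-major interleaved RGBA image with `width` pixels per row.

```cpp
#include <cstdint>
#include <vector>

// Weighted sum over K consecutive rows starting at ymin, for column i and
// channel c (0..3 for r, g, b, a). `w` holds the K weights of output row j.
inline float vertical_channel_f32(const std::vector<uint8_t>& img, int width,
                                  const std::vector<float>& w,
                                  int ymin, int i, int c) {
  const int K = static_cast<int>(w.size());
  float acc = 0.0f;
  for (int k = 0; k < K; ++k) {
    acc += img[((ymin + k) * width + i) * 4 + c] * w[k];
  }
  return acc;
}
```

Unlike the horizontal case, neighbouring output pixels in a row share the same weights, which is why the vectorized version can process several pixels of a line at once.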
As data format is RGBA with R,G,B,A being uint8, we can encode 4 values as a single uint32 value.
Working accumulating register, avx2 = 32 uint8 places
reg = [0 0 0 0 0 0 0 0 | 0 0 0 0 0 0 0 0 | 0 0 0 0 0 0 0 0 | 0 0 0 0 0 0 0 0]
We can split K (the size of the weight vector for a given output index) as a sum: K = m * 2 + k.
We load and process 2 weight values in a loop ("block 2") and finally we process 1 weight value in the final loop ("block 1").
- As we are doing computations in integer dtype, we add the offset (= 1 << (weights_precision - 1)):
reg = [
0 128 0 0 0 128 0 0 | 0 128 0 0 0 128 0 0 | 0 128 0 0 0 128 0 0 | 0 128 0 0 0 128 0 0
]
- Load weights. For "block 2" we load 2 int16 values (w0, w1). Each int16 value w_i is then represented in the register by two uint8 values wl_i and wh_i (its low and high bytes):
w01 = [
wl_0 wh_0 wl_1 wh_1 wl_0 wh_0 wl_1 wh_1 | ... | ... | wl_0 wh_0 wl_1 wh_1 wl_0 wh_0 wl_1 wh_1
]
And in case of "block 1" we will load only 1 int16 value w0:
w0 = [
wl_0 wh_0 0 0 wl_0 wh_0 0 0 | ... | ... | wl_0 wh_0 0 0 wl_0 wh_0 0 0
]
- Load source data. Each RGBA pixel takes 4 uint8 places, so half of a 256-bit register (= 16 uint8 places) can be filled with 4 pixels. To fill 32 uint8 places (= 256 bits) we can load 8 pixels from each line, e.g. r0-r7 and rr0-rr7, where ri is a red value from line0 and rri is a red value from line1.
For vertical pass we need to compute together values from different lines.
line0 = [
r0 g0 b0 a0 r1 g1 b1 a1 | r2 g2 b2 a2 r3 g3 b3 a3 | r4 g4 b4 a4 r5 g5 b5 a5 | r6 g6 b6 a6 r7 g7 b7 a7
]
line1 = [
rr0 gg0 bb0 aa0 rr1 gg1 bb1 aa1 | rr2 gg2 bb2 aa2 rr3 gg3 bb3 aa3 | rr4 gg4 bb4 aa4 rr5 gg5 bb5 aa5 | rr6 gg6 bb6 aa6 rr7 gg7 bb7 aa7
]
We process 8 pixels within each line in parallel, and two lines contribute to the output: r0 * w0 + rr0 * w1.
When fewer than 8 pixels remain, we can process 2 pixels within each line in parallel and finally just 1 pixel.
- We loaded 2 weight values as (wl_0 wh_0 wl_1 wh_1), thus we have to split and shuffle the source data so that we can correctly multiply r0 * w0 + rr0 * w1 and g0 * w0 + gg0 * w1:
data_01_ll = [
r0 0 rr0 0 g0 0 gg0 0 | b0 0 bb0 0 a0 0 aa0 0 | r1 0 rr1 0 g1 0 gg1 0 | b1 0 bb1 0 a1 0 aa1 0
]
data_01_lh = [
r2 0 rr2 0 g2 0 gg2 0 | b2 0 bb2 0 a2 0 aa2 0 | r3 0 rr3 0 g3 0 gg3 0 | b3 0 bb3 0 a3 0 aa3 0
]
data_01_hl = [
r4 0 rr4 0 g4 0 gg4 0 | b4 0 bb4 0 a4 0 aa4 0 | r5 0 rr5 0 g5 0 gg5 0 | b5 0 bb5 0 a5 0 aa5 0
]
data_01_hh = [
r6 0 rr6 0 g6 0 gg6 0 | b6 0 bb6 0 a6 0 aa6 0 | r7 0 rr7 0 g7 0 gg7 0 | b7 0 bb7 0 a7 0 aa7 0
]
For "block 1" we will have
data_0_ll = [
r0 0 0 0 g0 0 0 0 | b0 0 0 0 a0 0 0 0 | r1 0 0 0 g1 0 0 0 | b1 0 0 0 a1 0 0 0
]
data_0_lh = [
r2 0 0 0 g2 0 0 0 | b2 0 0 0 a2 0 0 0 | r3 0 0 0 g3 0 0 0 | b3 0 0 0 a3 0 0 0
]
data_0_hl = [
r4 0 0 0 g4 0 0 0 | b4 0 0 0 a4 0 0 0 | r5 0 0 0 g5 0 0 0 | b5 0 0 0 a5 0 0 0
]
data_0_hh = [
r6 0 0 0 g6 0 0 0 | b6 0 0 0 a6 0 0 0 | r7 0 0 0 g7 0 0 0 | b7 0 0 0 a7 0 0 0
]
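The interleaving of two lines can be emulated on bytes as follows. interleave_lines is a hypothetical helper; the index q selects the group of two pixels (0 for _ll, 1 for _lh, 2 for _hl, 3 for _hh).

```cpp
#include <array>
#include <cstdint>

// Build data_01_ll / _lh / _hl / _hh from 8 pixels of line0 and line1.
// Each uint8 value lands in the low byte of a 16-bit slot, paired with the
// value of the same channel and pixel from the other line.
inline std::array<uint8_t, 32> interleave_lines(
    const std::array<uint8_t, 32>& line0,
    const std::array<uint8_t, 32>& line1,
    int q) {
  std::array<uint8_t, 32> out{};  // zero-initialized: high bytes stay 0
  for (int j = 0; j < 2; ++j) {   // two pixels per output register
    const int p = 2 * q + j;      // pixel index within the line
    for (int c = 0; c < 4; ++c) { // c selects r, g, b, a
      out[16 * j + 4 * c + 0] = line0[4 * p + c];
      out[16 * j + 4 * c + 2] = line1[4 * p + c];
    }
  }
  return out;
}
```

Passing an all-zero line1 reproduces the "block 1" layouts data_0_ll to data_0_hh.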
- Multiply and add weights and source data using integer 32-bits precision. Integer 32-bits precision means the output will take 4 placeholders (a b c d).
# w01 = [
# wl_0 wh_0 wl_1 wh_1 wl_0 wh_0 wl_1 wh_1 | ... | ... | wl_0 wh_0 wl_1 wh_1 wl_0 wh_0 wl_1 wh_1
# ]
out01_ll = data_01_ll * w01
out01_ll = [
(r0 0) * (wl_0 wh_0) + (rr0 0) * (wl_1 wh_1), (g0 0) * (wl_0 wh_0) + (gg0 0) * (wl_1 wh_1) |
(b0 0) * (wl_0 wh_0) + (bb0 0) * (wl_1 wh_1), (a0 0) * (wl_0 wh_0) + (aa0 0) * (wl_1 wh_1) |
(r1 0) * (wl_0 wh_0) + (rr1 0) * (wl_1 wh_1), (g1 0) * (wl_0 wh_0) + (gg1 0) * (wl_1 wh_1) |
(b1 0) * (wl_0 wh_0) + (bb1 0) * (wl_1 wh_1), (a1 0) * (wl_0 wh_0) + (aa1 0) * (wl_1 wh_1)
]
where (pi 0) * (wl_j wh_j) + (ppi 0) * (wl_n wh_n) = (out_0, out_1, out_2, out_3).
out01_lh = data_01_lh * w01
out01_lh = [
(r2 0) * (wl_0 wh_0) + (rr2 0) * (wl_1 wh_1), (g2 0) * (wl_0 wh_0) + (gg2 0) * (wl_1 wh_1) |
(b2 0) * (wl_0 wh_0) + (bb2 0) * (wl_1 wh_1), (a2 0) * (wl_0 wh_0) + (aa2 0) * (wl_1 wh_1) |
(r3 0) * (wl_0 wh_0) + (rr3 0) * (wl_1 wh_1), (g3 0) * (wl_0 wh_0) + (gg3 0) * (wl_1 wh_1) |
(b3 0) * (wl_0 wh_0) + (bb3 0) * (wl_1 wh_1), (a3 0) * (wl_0 wh_0) + (aa3 0) * (wl_1 wh_1)
]
out01_hl = ...
out01_hh = ...
For "block 1" we will have
out0_ll = [
(r0 0) * (wl_0 wh_0), (g0 0) * (wl_0 wh_0) |
(b0 0) * (wl_0 wh_0), (a0 0) * (wl_0 wh_0) |
(r1 0) * (wl_0 wh_0), (g1 0) * (wl_0 wh_0) |
(b1 0) * (wl_0 wh_0), (a1 0) * (wl_0 wh_0)
]
out0_lh = ...
out0_hl = ...
out0_hh = ...
Here each element like (r0 0) * (wl_0 wh_0) represents an int32 and takes 4 placeholders.
- Shift back the output integer values (output = (output >> weights_precision)):
out01_ll = out01_ll >> weights_precision
out01_lh = out01_lh >> weights_precision
out01_hl = out01_hl >> weights_precision
out01_hh = out01_hh >> weights_precision
- Convert packed signed 32-bit integers to packed 16-bit integers using signed saturation
(a a a a b b b b | c c c c d d d d) -> (a' a' b' b' c' c' d' d')
(out01_ll, out01_lh) -> out01_l
(out01_hl, out01_hh) -> out01_h
- Convert packed signed 16-bit integers to packed 8-bit integers using unsigned saturation
(a a b b | c c d d) -> (a' b' c' d')
(out01_l, out01_h) -> out01
- Write the output into single uint32
(a b c d) -> x_uint32