Skip to content

Instantly share code, notes, and snippets.

@JimBobSquarePants
Created January 24, 2017 07:42
Show Gist options
  • Save JimBobSquarePants/1c7a56db5b6b68dedf2852da78f4afbe to your computer and use it in GitHub Desktop.
Save JimBobSquarePants/1c7a56db5b6b68dedf2852da78f4afbe to your computer and use it in GitHub Desktop.
Benchmarking integer clamping methods using BenchMarkDotNet
BenchmarkDotNet=v0.10.1, OS=Microsoft Windows NT 6.2.9200.0
Processor=Intel(R) Core(TM) i7-6600U CPU 2.60GHz, ProcessorCount=4
Frequency=2742193 Hz, Resolution=364.6716 ns, Timer=TSC
  [Host]     : Clr 4.0.30319.42000, 64bit RyuJIT-v4.6.1586.0
  DefaultJob : Clr 4.0.30319.42000, 64bit RyuJIT-v4.6.1586.0

Allocated=0 B
Method Value Mean StdErr StdDev Median Scaled Scaled-StdDev
'Maths Clamp' -1 1.2373 ns 0.0154 ns 0.0595 ns 1.2511 ns 1.00 0.00
'No Maths Clamp' -1 1.2183 ns 0.0147 ns 0.0570 ns 1.2248 ns 0.99 0.07
'No Maths No Equals Clamp' -1 2.0156 ns 0.0107 ns 0.0399 ns 2.0256 ns 1.63 0.08
'No Maths Clamp No Ternary' -1 1.1519 ns 0.0069 ns 0.0247 ns 1.1617 ns 0.93 0.05
'No Maths No Equals Clamp No Ternary' -1 1.1801 ns 0.0145 ns 0.0562 ns 1.1920 ns 0.96 0.06
'Clamp using Bitwise Abs' -1 0.9129 ns 0.0090 ns 0.0326 ns 0.9092 ns 0.74 0.04
'Maths Clamp' 0 1.2245 ns 0.0103 ns 0.0385 ns 1.2348 ns 1.00 0.00
'No Maths Clamp' 0 1.2442 ns 0.0116 ns 0.0450 ns 1.2560 ns 1.02 0.05
'No Maths No Equals Clamp' 0 1.9620 ns 0.0140 ns 0.0541 ns 1.9488 ns 1.60 0.07
'No Maths Clamp No Ternary' 0 1.1913 ns 0.0280 ns 0.1047 ns 1.1699 ns 0.97 0.09
'No Maths No Equals Clamp No Ternary' 0 1.9778 ns 0.0112 ns 0.0432 ns 1.9723 ns 1.62 0.06
'Clamp using Bitwise Abs' 0 0.9206 ns 0.0074 ns 0.0287 ns 0.9145 ns 0.75 0.03
'Maths Clamp' 255 1.2562 ns 0.0187 ns 0.0725 ns 1.2320 ns 1.00 0.00
'No Maths Clamp' 255 0.9064 ns 0.0299 ns 0.1157 ns 0.8482 ns 0.72 0.10
'No Maths No Equals Clamp' 255 1.9107 ns 0.0031 ns 0.0118 ns 1.9119 ns 1.53 0.08
'No Maths Clamp No Ternary' 255 0.9426 ns 0.0081 ns 0.0314 ns 0.9389 ns 0.75 0.05
'No Maths No Equals Clamp No Ternary' 255 1.9511 ns 0.0047 ns 0.0182 ns 1.9515 ns 1.56 0.08
'Clamp using Bitwise Abs' 255 0.9079 ns 0.0052 ns 0.0195 ns 0.9053 ns 0.72 0.04
'Maths Clamp' 256 1.1499 ns 0.0038 ns 0.0137 ns 1.1484 ns 1.00 0.00
'No Maths Clamp' 256 1.2231 ns 0.0114 ns 0.0410 ns 1.2134 ns 1.06 0.04
'No Maths No Equals Clamp' 256 2.0982 ns 0.0285 ns 0.1067 ns 2.0862 ns 1.82 0.09
'No Maths Clamp No Ternary' 256 0.9000 ns 0.0036 ns 0.0139 ns 0.8975 ns 0.78 0.01
'No Maths No Equals Clamp No Ternary' 256 0.7765 ns 0.0433 ns 0.1785 ns 0.6871 ns 0.68 0.15
'Clamp using Bitwise Abs' 256 1.1191 ns 0.0479 ns 0.2582 ns 0.9897 ns 0.97 0.22
@JimBobSquarePants
Copy link
Author

namespace ImageSharp.Benchmarks.General
{
    using System;
    using System.Runtime.CompilerServices;

    using BenchmarkDotNet.Attributes;

    public class Clamp
    {
        [Params(-1, 0, 255, 256)]
        public int Value { get; set; }

        [Benchmark(Baseline = true, Description = "Maths Clamp")]
        public byte ClampMaths()
        {
            int value = this.Value;
            return (byte)Math.Min(Math.Max(byte.MinValue, value), byte.MaxValue);
        }

        [Benchmark(Description = "No Maths Clamp")]
        public byte ClampNoMaths()
        {
            int value = this.Value;
            value = value >= byte.MaxValue ? byte.MaxValue : value;
            return (byte)(value <= byte.MinValue ? byte.MinValue : value);
        }

        [Benchmark(Description = "No Maths No Equals Clamp")]
        public byte ClampNoMathsNoEquals()
        {
            int value = this.Value;
            value = value > byte.MaxValue ? byte.MaxValue : value;
            return (byte)(value < byte.MinValue ? byte.MinValue : value);
        }

        [Benchmark(Description = "No Maths Clamp No Ternary")]
        public byte ClampNoMathsNoTernary()
        {
            int value = this.Value;

            if (value >= byte.MaxValue)
            {
                return byte.MaxValue;
            }

            if (value <= byte.MinValue)
            {
                return byte.MinValue;
            }

            return (byte)value;
        }

        [Benchmark(Description = "No Maths No Equals Clamp No Ternary")]
        public byte ClampNoMathsEqualsNoTernary()
        {
            int value = this.Value;

            if (value > byte.MaxValue)
            {
                return byte.MaxValue;
            }

            if (value < byte.MinValue)
            {
                return byte.MinValue;
            }

            return (byte)value;
        }

        [Benchmark(Description = "Clamp using Bitwise Abs")]
        public byte ClampBitwise()
        {
            int x = this.Value;
            int absmax = byte.MaxValue - x;
            x = (x + byte.MaxValue - AbsBitwiseVer(ref absmax)) >> 1;
            x = (x + byte.MinValue + AbsBitwiseVer(ref x)) >> 1;

            return (byte)x;
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        private static int AbsBitwiseVer(ref int x)
        {
            int y = x >> 31;
            return (x ^ y) - y;
        }
    }
}

@AndreyAkinshin
Copy link

Ok, here you have the following situation. Consider the ClampNoMathsNoTernary() method

        [Benchmark(Description = "No Maths No Equals Clamp No Ternary")]
        public byte ClampNoMathsEqualsNoTernary()
        {
            int value = this.Value;

            if (value > byte.MaxValue)
            {
                return byte.MaxValue;
            }

            if (value < byte.MinValue)
            {
                return byte.MinValue;
            }

            return (byte)value;
        }

When the value is 255 or 256, the first branch will be taken; a const will be returned. The branch predictor works perfectly here, so you have about 0ns (the same performance as in the return 0; benchmark). I suppose you used it as a baseline in the last run (where you have only three different methods in the summary table). Unfortunately, it's impossible to calculate scaled values when your baseline is zero. The current solution is just print ?. Probably, we should also print a warning in such case (feel free to file an issue, if you know how it should look).

In BenchmarkDotNet v0.10.1, we had some troubles with benchmarks which take nanoseconds and return a value (you could have a 1ns-noise sometimes). It's really hard to measure performance with precision < 1ns, but I try to improve accuracy in each version.

@antonfirsov
Copy link

@JimBobSquarePants One Vector4 to rule them all!

Don't know what's this for, but consider moving to floating point representation! :)

@antonfirsov
Copy link

Two remakrs:

  1. I think using [Param] makes the results of these microbenchmarks noisy. I'd rather have multiple Abs() calls per method. Special inputs are rare here.

  2. Don't use byte for arithmetic operations in your algorithms! Convert your inputs to int-s or Vector4-s in a batched way instead. (I did this in my latest refactor in HuffmanTree with a positive perf outcome!)

@antonfirsov
Copy link

If the input byte-s are members of color space data, I'd rather have a benchmark that operates on Color[] or Color*, and also compare different color spaces, including Vector4-based ones.

If you have specific goals it's better to emulate your whole algorithmic scenario. There's simply too much context dependent stuff. Check out RoundSinglePrecisionBlocks for an example of this approach.

@JimBobSquarePants
Copy link
Author

@anton it's for a very specific purpose. Clamping int YCbCr values from the Jpeg decoder to a byte value. We call Clamp currently.

https://github.com/JimBobSquarePants/ImageSharp/blob/051423cfec096a6c86ae19054554ff7783a8c68a/src/ImageSharp.Formats.Jpeg/JpegDecoderCore.cs#L644

We may however, be able to use a lookup table instead.

https://github.com/google/guetzli/blob/master/guetzli/color_transform.h

@antonfirsov
Copy link

antonfirsov commented Jan 24, 2017

@JimBobSquarePants
It means the context is extremely useful here! I'd rather move the loop cores of ConvertFromYCbCr()-kind operations to utility methods and benchmark (+unit test!) different approaches as a whole. Clamping is just one of the many suboptimal stuff, I don't think the tactical microoptimization approach applied on small pieces is worth the efforts here.

Can't prove this right now, but I'm pretty 80% sure that the winner algorithm for the whole JpegChannels-->TColor conversion is:

  • Store YCbCr images (and other JpegPixelArea data) as a float[] instead of byte[]. It will turn Block8x8.CopyColorsTo() into a zero cost method and entirely eliminate back-conversion!
  • Inside ConvertFromYCbCr() Pack y[], cb[] and cr[] float arrays into a single Vect4[] YCbCr array in a batched way
  • Color transform and clamp all the YCbCr Vect4 values with SIMD operations
  • Pack colors from Vector4-s
  • Bonus: Refactor IPackedVector.PackFromVector4(v) to a batching solution (large scale, library-wide refactor)

It's also important to note, that after merging #90 , the JpegDecoder bottleck would be the Huffman decoder:

JpegDecoderProfile
I have optimization ideas for this too. If only I had a clone to implement them all :D

@JimBobSquarePants
Copy link
Author

@antonfirsov I'd love to be able to refactor to allow batch conversion and pack from Vector4, though we'd have to figure out a way to handle the different scaling values for the different PackedPixel types since they don't operate on the same ranges. (0-255, -1-1, -32767-32767 etc)

@JimBobSquarePants
Copy link
Author

@AndreyAkinshin I think my computer must have been having a meltdown or something the other day. Everything seems to be working perfectly now. Thanks!

Method Value Mean StdErr StdDev Median Scaled Scaled-StdDev
'Maths Clamp' -1 1.0185 ns 0.0161 ns 0.0579 ns 0.9970 ns 1.00 0.00
'No Maths Clamp' -1 0.6535 ns 0.0139 ns 0.0539 ns 0.6643 ns 0.64 0.06
'No Maths No Equals Clamp' -1 0.7720 ns 0.0193 ns 0.0721 ns 0.7542 ns 0.76 0.08
'No Maths Clamp No Ternary' -1 0.4456 ns 0.0253 ns 0.1695 ns 0.3517 ns 0.44 0.17
'No Maths No Equals Clamp No Ternary' -1 0.4274 ns 0.0317 ns 0.1267 ns 0.3737 ns 0.42 0.12
'Clamp using Bitwise Abs' -1 0.6840 ns 0.0232 ns 0.0837 ns 0.6593 ns 0.67 0.09
'Maths Clamp' 0 1.0146 ns 0.0328 ns 0.3007 ns 0.8503 ns 1.00 0.00
'No Maths Clamp' 0 0.6696 ns 0.0293 ns 0.2290 ns 0.5613 ns 0.71 0.30
'No Maths No Equals Clamp' 0 0.5152 ns 0.0304 ns 0.1994 ns 0.4151 ns 0.54 0.25
'No Maths Clamp No Ternary' 0 0.2416 ns 0.0080 ns 0.0288 ns 0.2361 ns 0.26 0.07
'No Maths No Equals Clamp No Ternary' 0 0.7601 ns 0.0288 ns 0.1705 ns 0.7235 ns 0.80 0.26
'Clamp using Bitwise Abs' 0 0.8561 ns 0.0319 ns 0.2299 ns 0.7471 ns 0.90 0.32
'Maths Clamp' 255 1.2078 ns 0.0323 ns 0.3028 ns 1.0666 ns 1.00 0.00
'No Maths Clamp' 255 1.0327 ns 0.0052 ns 0.0193 ns 1.0347 ns 0.90 0.17
'No Maths No Equals Clamp' 255 0.6788 ns 0.0080 ns 0.0311 ns 0.6773 ns 0.59 0.11
'No Maths Clamp No Ternary' 255 0.2715 ns 0.0059 ns 0.0212 ns 0.2680 ns 0.24 0.05
'No Maths No Equals Clamp No Ternary' 255 0.6427 ns 0.0103 ns 0.0371 ns 0.6434 ns 0.56 0.11
'Clamp using Bitwise Abs' 255 0.9125 ns 0.0299 ns 0.1118 ns 0.8660 ns 0.79 0.18
'Maths Clamp' 256 1.0398 ns 0.0151 ns 0.0584 ns 1.0320 ns 1.00 0.00
'No Maths Clamp' 256 1.0572 ns 0.0095 ns 0.0369 ns 1.0552 ns 1.02 0.06
'No Maths No Equals Clamp' 256 0.9951 ns 0.0130 ns 0.0487 ns 0.9902 ns 0.96 0.07
'No Maths Clamp No Ternary' 256 0.3176 ns 0.0243 ns 0.1479 ns 0.2750 ns 0.31 0.14
'No Maths No Equals Clamp No Ternary' 256 0.0812 ns 0.0173 ns 0.0625 ns 0.0713 ns 0.08 0.06
'Clamp using Bitwise Abs' 256 0.8001 ns 0.0094 ns 0.0327 ns 0.7955 ns 0.77 0.05

@antonfirsov
Copy link

@JimBobSquarePants as a quick solution PackFromBytes() is also OK. My main point is to perform the float->byte conversion in the end.

In long term we have to review the .PackFrom...() mechanism. It's a major, library-wide performance bottleneck in it's current form.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment