.Net Core 3.0
(currently in preview) adds support for hardware intrinsics.
So now we can write simd instructions in c# 🎉
Intel has a nice website for browsing through the available instructions: Intel intrinsics guide
Everything from Avx2
down seems supported. (Given that your cpu supports it ofcourse)
For double checking what c# call maps to what instruction you can check the summary of the methods: Avx2.cs
As as test i created a utility that searches a collection of ints 8 entries at a time. Its a bit silly implementation because it only works if the target exists in the collection only once, but it shows you can get pretty big speed ups.
public static class GetIndex
{
/// <summary> Implemented with a simple loop </summary>
public static int GetIndexSimple(ReadOnlySpan<int> span, int item)
{
for (int i = 0; i < span.Length; i++)
{
if (span[i] == item)
return i;
}
return -1;
}
/// <summary> Implemented using build-in library 'IndexOf'</summary>
public static int GetIndexLibrary(ReadOnlySpan<int> span, int item)
{
return span.IndexOf(item);
}
/// <summary> Implemented using 'Avx2' intrinsics</summary>
/// <remarks> Without 'Avx2' support will behave as a simple loop</remarks>
public static unsafe int GetIndexIntrinsics(ReadOnlySpan<int> span, int item)
{
// Get a fixed pointer so the garbage-collector doesn't move the collection
fixed (int* startPointer = span)
{
int* endPointer = startPointer + span.Length;
int* pointer = startPointer;
// Query if the cpu actually supports 'Avx2' instructions
if (Avx2.IsSupported)
{
// Load '1' item into a 128 bit vector
Vector128<int> itemScaler = Avx2.LoadScalarVector128(&item);
// Copy that first item into the other 7 slots of a 256 bit vector, this means
// we now have a vector that is holding 8 times the value 'item'
Vector256<int> itemVector = Avx2.BroadcastScalarToVector256(itemScaler);
// Loop through the span 8 elements at a time (256 bit / 32 bit = 8)
for (; (pointer + 8) < endPointer; pointer += 8)
{
// Load 8 elements from the span
Vector256<int> elements = Avx2.LoadVector256(pointer);
// Compare those 8 elements with our item. This will give us 8 values of
// 'FFFF' or '0000' (32 bits of either 1 or 0) in a 256 bit vector
Vector256<int> elementEquals = Avx2.CompareEqual(elements, itemVector);
/*
Because 256 bit is a too big type to work with we combine it into a single
integer by taking 4 bits from each 32 bit value (bit 7, 15, 23 and 31).
eq 32: 0 0 1 0 0 0 0 0 0
MoveMask: 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Hex: 0 0 F 0 0 0 0 0 0
*/
int mask = Avx2.MoveMask(elementEquals.AsByte());
// If we make the assumption that the item only exists in the span once then
// we can construct a jump table for it.
switch (mask)
{
case (int)0x0000000F: // At element 0
return (int)(pointer - startPointer);
case (int)0x000000F0: // At element 1
return (int)(pointer + 1 - startPointer);
case (int)0x00000F00: // At element 2
return (int)(pointer + 2 - startPointer);
case (int)0x0000F000: // At element 3
return (int)(pointer + 3 - startPointer);
case (int)0x000F0000: // At element 4
return (int)(pointer + 4 - startPointer);
case (int)0x00F00000: // At element 5
return (int)(pointer + 5 - startPointer);
case (int)0x0F000000: // At element 6
return (int)(pointer + 6 - startPointer);
case unchecked((int)0xF0000000): // At element 7
return (int)(pointer + 7 - startPointer);
case (int)0x00000000: // Not found
continue;
default:
throw new Exception("Item found in span multiple times");
}
}
}
// Handle the remaiming items with a simple loop
for (; pointer < endPointer; pointer++)
{
if (*pointer == item)
return (int)(pointer - startPointer);
}
}
return -1;
}
}
Test proving that this actually works: (Using XUnit)
[Fact]
public static void GetIndex_BehavesConsistentlyAtEveryIndex()
{
const int TestKey = 1337;
int[] testElements = new int[100];
for (int i = 0; i < testElements.Length; i++)
{
// Set test key at this element
testElements[i] = TestKey;
// Verify that it can be found
Assert.Equal(i, GetIndex.GetIndexSimple(testElements, TestKey));
Assert.Equal(i, GetIndex.GetIndexLibrary(testElements, TestKey));
Assert.Equal(i, GetIndex.GetIndexIntrinsics(testElements, TestKey));
// Unset the test key
testElements[i] = 0;
}
}
Benchmark to test the performance: (Using BenchmarkDotNet)
public class GetIndexBenchmark
{
private const int ElementCount = 100_000;
private readonly int[] testElements = new int[ElementCount];
public GetIndexBenchmark()
{
// Set the element we want to find
testElements[50_000] = 1337;
}
[Benchmark(Baseline = true)]
public int GetIndexSimple() => GetIndex.GetIndexSimple(testElements, 1337);
[Benchmark]
public int GetIndexLibrary() => GetIndex.GetIndexLibrary(testElements, 1337);
[Benchmark]
public int GetIndexIntrinsics() => GetIndex.GetIndexIntrinsics(testElements, 1337);
}
Benchmark results:
BenchmarkDotNet=v0.11.3, OS=macOS Mojave 10.14 (18A391) [Darwin 18.0.0]
Intel Core i9-8950HK CPU 2.90GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=3.0.100-preview-009812
[Host] : .NET Core 3.0.0-preview-27122-01 (CoreCLR 4.6.27121.03, CoreFX 4.7.18.57103), 64bit RyuJIT
DefaultJob : .NET Core 3.0.0-preview-27122-01 (CoreCLR 4.6.27121.03, CoreFX 4.7.18.57103), 64bit RyuJIT
Method | Mean | Error | StdDev | Ratio |
---|---|---|---|---|
GetIndexSimple | 23.118 us | 0.3721 us | 0.3299 us | 1.00 |
GetIndexLibrary | 9.985 us | 0.1479 us | 0.1384 us | 0.43 |
GetIndexIntrinsics | 5.419 us | 0.0847 us | 0.0792 us | 0.23 |
// * Legends *
Mean : Arithmetic mean of all measurements
Error : Half of 99.9% confidence interval
StdDev : Standard deviation of all measurements
Ratio : Mean of the ratio distribution ([Current]/[Baseline])
1 us : 1 Microsecond (0.000001 sec)