A Bit of Vectorization

Speeding up a function using SIMD vector operations.

I needed a function to output a byte in the format 0b01010101. This can be done quite easily with a bit of string manipulation:

[Pure]
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static string ToBinaryString(this byte value) => $"0b{System.Convert.ToString(value, 2).PadLeft(8, '0')}";

It could be a lot faster though. We're converting the number to a string, padding it and adding a prefix. There is a lot going on. We could make it faster by converting each bit to a char and then using the string.Create overload that let's you write directly to a Span<char>. And we could make it even faster still if we could convert all the bits simultaneously. This sounds like a job for SIMD instructions. Single Instruction, Multiple Data - we perform the same operation simultaneously for multiple input values. Luckily SIMD intrinsics are supported by .NET, and we can even do it in a processor agnostic way using the Vector* classes in the System.Runtime.Intrinsics namespace.

We will need to generate 8 chars for our string. A char in .NET is two bytes, meaning 16 bytes in total, or 16 x 8 = 128 bits. The perfect size for a Vector128<ushort>. To convert our bits to chars we're going to follow a similar procedure to the previous post whereby we converted the bit to the number 0 or 1 and then added it to the char value '0' to get either a '0' or a '1'.

We will start by converting our byte to a ushort and filling all 8 positions in the vector with it:

var vector = Vector128.Create((ushort)value);

We can now isolate a different bit in each of the 8 positions using a bit mask. This will leave us with just the first bit in the first ushort, just the second bit in the second ushort and so on:

var masks = Vector128.Create((ushort)0b10000000, 0b01000000, 0b00100000, 0b00010000, 0b00001000, 0b00000100, 0b00000010, 0b00000001);

var isolatedBits = Vector128.BitwiseAnd(vector, masks);

If we then perform an Equals between isolatedBits and masks then for each of the 8 ushorts we will get all 1s (a truthy value in the world of vectors) if the isolated bit was set and all 0s (falsy) if the bit was not set:

var compared = Vector128.Equals(isolatedBits, masks);

To convert all 1s to a single 1 we could use masks again but instead I'm going to shift all the bits right by 15 (because a ushort is 16 bits) which will move the leftmost bit to the rightmost position and set all the other positions to 0. This avoids us loading another vector of masks so should be quicker:

var shifted = Vector128.ShiftRightLogical(compared, 15);

We can now create a vector of '0's and add our values to it to get the final result:

var zeroes = Vector128.Create((ushort)'0');

var result = Vector128.Add(zeroes, shifted);

Done. Then we just need to get them into a string. The full method is:

[Pure]
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static string ToBinaryString(this byte value) =>
    string.Create(10, value, (chars, @byte) =>
    {
        chars[0] = '0';
        chars[1] = 'b';

        // Copy the value to 8 ushort positions in the vector. Using ushorts because chars are 16-bit values.
        var vector = Vector128.Create((ushort)@byte);

        // Isolate one bit in each position.
        var masks = Vector128.Create((ushort)0b10000000, 0b01000000, 0b00100000, 0b00010000, 0b00001000, 0b00000100, 0b00000010, 0b00000001);
        var isolatedBits = Vector128.BitwiseAnd(vector, masks);

        // Now compare each position to its mask. This will give all ones if the original bit was a 1, and all zeroes if it was a 0.
        var compared = Vector128.Equals(isolatedBits, masks);

        // Shift everything right by 15. All ones will become just 1, and 0 will stay as 0.
        var shifted = Vector128.ShiftRightLogical(compared, 15);

        // Create a vector filled with ASCII 0s.
        var zeroes = Vector128.Create((ushort)'0');

        // Add our 1s or 0s from the ASCII 1s, giving us '0' for a 0 bit and '1' for a 1 bit.
        var result = Vector128.Add(zeroes, shifted);

        // Cast the chars to ushorts and copy the vector in.
        var castedChars = MemoryMarshal.Cast<char, ushort>(chars[2..]);
        result.CopyTo(castedChars);
    });

How much quicker is it though? Lets run a benchmark that converts all 256 bytes:

| Method  | Mean        | Error     | StdDev    | Ratio |
|-------- |------------:|----------:|----------:|------:|
| Convert | 10,354.7 ns | 200.21 ns | 238.33 ns |  1.00 |
| Vector  |  2,833.3 ns |  55.55 ns |  70.26 ns |  0.27 |

Not bad at all! Can we get it even faster though?

Well of course. Just use a lookup. It's only 256 values, hardly a huge memory footprint:

private static readonly string[] lookup = Enumerable.Range(0, 256).Select(b => $"0b{System.Convert.ToString(b, 2).PadLeft(8, '0')}").ToArray();

[Pure]
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static string ToBinaryString(this byte value) => lookup[value];

The compiler can even skip the bounds check on the array if it's clever enough as a byte can never go out of bounds. Should be superfast:

| Method  | Mean       | Error    | StdDev   | Ratio |
|-------- |-----------:|---------:|---------:|------:|
| Convert | 9,744.3 ns | 87.59 ns | 77.65 ns |  1.00 |
| Vector  | 2,590.9 ns | 46.18 ns | 43.19 ns |  0.27 |
| Lookup  |   125.5 ns |  0.25 ns |  0.20 ns |  0.01 |

Yup. The simplest solutions are usually the best after all.