In this article, let’s explore the SIMD-accelerated (Single Instruction, Multiple Data) numeric types in .NET.

We’ll start by examining what SIMD stands for and how it works in .NET. Then, we’ll look at the different SIMD accelerated numeric types available in .NET. To conclude, we’ll implement a benchmark to compare the performance of a SIMD-enhanced algorithm and a non-SIMD-enhanced algorithm.

To download the source code for this article, you can visit our GitHub repository.

Without further ado, let’s begin.


What Is SIMD and How Do We Implement It in .NET?

SIMD, short for Single Instruction, Multiple Data, is a form of data-level parallelism in which a single instruction operates on several data elements at once. Instead of processing values one by one, the CPU packs a batch of them into a wide register and applies the same operation to every element of that batch in one step. Modern processors provide dedicated instruction sets for this, such as SSE and AVX on x86/x64 and NEON on ARM.

Processing data using the SIMD approach allows us to efficiently handle very large datasets that require extensive computation. For instance, during the training of machine learning models or processing of images, SIMD provides a way to quickly and efficiently perform the necessary matrix-related calculations.

Now, to use this SIMD processing technique, .NET provides a set of numeric types in the System.Numerics and System.Runtime.Intrinsics namespaces. These types allow us to represent our input data as vectors and matrices, which the runtime can then process several elements at a time with a single instruction.
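To make the idea concrete, here is a minimal sketch that adds four pairs of floats twice: once element by element, and once with a single Vector4 addition. The variable names are ours and purely illustrative, and whether the vector addition really becomes one hardware instruction depends on the CPU and the runtime:

```csharp
using System;
using System.Numerics;

// Scalar approach: four separate additions, one per element.
float[] left = [1f, 2f, 3f, 4f];
float[] right = [10f, 20f, 30f, 40f];
float[] scalarSum = new float[4];
for (int i = 0; i < 4; i++)
{
    scalarSum[i] = left[i] + right[i];
}

// SIMD approach: one vector addition covers all four components.
// Where the hardware supports it, the JIT can emit a single instruction for this.
Vector4 simdSum = new Vector4(1f, 2f, 3f, 4f) + new Vector4(10f, 20f, 30f, 40f);

Console.WriteLine(string.Join(", ", scalarSum));
Console.WriteLine(simdSum);
```

Both approaches produce the same four sums. The difference is only in how many instructions the CPU needs to get there.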

In the next section, let’s briefly discuss what these vectors and matrices are and how they are implemented.

What Are Vectors and Matrices in .NET?

In computer programming, a vector is a one-dimensional array of numeric values, while a matrix is a two-dimensional (rectangular) array of numeric values. To represent these numeric data structures in .NET, we have a set of vector and matrix types in the System.Numerics namespace.

Some of these SIMD-accelerated vector and matrix types include Vector2, Vector3, Vector4, Vector<T>, Matrix3x2, and Matrix4x4. With these types, we can easily store a set of numbers as a single collection, similar to how arrays and lists work.

However, what sets these types apart is their use of the SIMD (Single Instruction, Multiple Data) data processing approach. With this approach we can perform numeric operations on multiple data points simultaneously, using a single instruction. This works because the JIT compiler translates operations on these types into the CPU’s SIMD instructions where the hardware supports them, and falls back to ordinary scalar code where it doesn’t.

Additionally, using these vector and matrix types brings two key benefits. First, they can improve our application’s performance, especially when dealing with large datasets. Second, they simplify data-parallel code, because we express an operation once on a whole vector instead of writing it out element by element.
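Whether these types actually run on SIMD hardware depends on the machine and the runtime. As a small sketch, we can ask the runtime directly. Note that the Vector128 and Vector256 properties used here assume .NET 7 or later:

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;

// True when operations on Vector<T> are mapped to SIMD instructions on this machine.
Console.WriteLine($"Vector<T> accelerated: {Vector.IsHardwareAccelerated}");

// The fixed-width types report their own support separately (.NET 7+).
Console.WriteLine($"Vector128 accelerated: {Vector128.IsHardwareAccelerated}");
Console.WriteLine($"Vector256 accelerated: {Vector256.IsHardwareAccelerated}");
```

If any of these print False, the code using those types still runs correctly. It just falls back to a slower, non-SIMD implementation.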

Now that we have a basic understanding of what vectors and matrices are in .NET, let’s explore some of the specific types in the System.Numerics namespace.

Different SIMD-Accelerated Numeric Types in C#

First, let’s consider the vector types.

Simple Vectors

We refer to the Vector2, Vector3, and Vector4 types as simple vectors. This is because each of them has a fixed size: they represent vectors of exactly 2, 3, or 4 Single values (single-precision floating-point numbers, float in C#):

var vector2 = new Vector2(1f, 2f);
var vector3 = new Vector3(4f, 5f, 6f);
var vector4 = new Vector4(7f, 8f, 9f, 10f);

With these types, we can also perform various vector operations, such as finding the dot product of two vectors, transforming a vector, and so on.

Representing and working with our data in this format can increase our app’s performance, because operations on these types can be vectorized. By “vectorized”, we mean that instead of working on a single value at a time, the CPU works on all the components of a vector with a single instruction, provided the hardware supports it.

As an example, let’s find the dot product of two vectors:

public static float GetDotProductOfTwoVectors()
{
    var vector1 = new Vector3(1f, 2f, 3f);
    var vector2 = new Vector3(4f, 5f, 6f);

    return Vector3.Dot(vector1, vector2);
}

Here, we get the dot product of our input Vector3 objects by invoking the Dot() method. Note that we can also perform this and other vector operations on instances of the Vector2 and Vector4 types.
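As a brief sketch of those other operations, here are a few more calls on the simple vector types, including the vector transformation mentioned earlier. The values are chosen by us so that the results are easy to check by hand:

```csharp
using System;
using System.Numerics;

// Length of a 2D vector: sqrt(3^2 + 4^2) = 5.
float length = new Vector2(3f, 4f).Length();

// Distance between two 2D points (another 3-4-5 triangle).
float distance = Vector2.Distance(new Vector2(1f, 1f), new Vector2(4f, 5f));

// The cross product of the X and Y unit vectors is the Z unit vector.
Vector3 cross = Vector3.Cross(Vector3.UnitX, Vector3.UnitY);

// Scaling a Vector4 and taking a dot product with (1, 1, 1, 1).
Vector4 scaled = new Vector4(1f, 2f, 3f, 4f) * 2f;
float dot4 = Vector4.Dot(new Vector4(1f, 2f, 3f, 4f), Vector4.One);

// Transforming a position by a translation matrix moves it 10 units along X.
Vector3 moved = Vector3.Transform(
    new Vector3(1f, 2f, 3f),
    Matrix4x4.CreateTranslation(10f, 0f, 0f));

Console.WriteLine($"{length}, {distance}, {cross}, {scaled}, {dot4}, {moved}");
```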

Next up, let’s explore the generic Vector<T> type.

Vector<T>

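We use the Vector<T> type to represent a vector of a single numeric type whose length is not fixed in our source code. Unlike the simple vector types, its size is determined by the hardware, so it can hold more elements on CPUs with wider SIMD registers.

This vector type provides a static Count property that returns the number of elements of type T that fit in one vector. This value stays fixed while our program runs, but it depends on both the element type and the CPU of the computer on which we are executing our code. For example, Vector<int>.Count is 8 on a CPU with 256-bit SIMD registers and 4 on one with 128-bit registers.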
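To see how this property behaves on a given machine, we can print it for a few element types. This is only a quick sketch, and the numbers it prints will differ between CPUs:

```csharp
using System;
using System.Numerics;

// Count = (vector width in bytes) / sizeof(T), so smaller element types give more elements.
Console.WriteLine($"Vector<byte>.Count   = {Vector<byte>.Count}");
Console.WriteLine($"Vector<int>.Count    = {Vector<int>.Count}");
Console.WriteLine($"Vector<float>.Count  = {Vector<float>.Count}");
Console.WriteLine($"Vector<double>.Count = {Vector<double>.Count}");
```

Whatever the absolute values are, the ratios hold everywhere: a Vector<byte> has four times as many elements as a Vector<int>, and a Vector<double> has half as many as a Vector<float>.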

Also, since it is generic, it allows us to specify the particular primitive numeric type we want to work with. As an illustration, let’s see how to create a Vector<T> instance:

var intVector = new Vector<int>(new Span<int>([1, 2, 3, 4, 5, 6, 7, 8]));

Here, we create an integer vector from a span over an array of numbers.

It’s important to note that creating a vector via this approach may not always return the expected result, because the constructor reads exactly Vector<T>.Count elements from the start of the span. If our system allows us to create vectors with 8 elements (that is, if Vector<int>.Count equals 8 on our computer), we will obtain the expected vector. However, if we use a system where the vector count is 4, the resulting vector will contain only the first four elements of our array. And if the span holds fewer than Vector<int>.Count elements, the constructor throws an exception.

To avoid such issues when working with vectors in a vectorized algorithm, we should always load our vector from the input array or span using a constructor that takes an array and an offset, or by using a span slice. This is typically done in a loop that increments the offset by the Vector<T>.Count value.
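A minimal sketch of such a loop is shown below. The Sum helper is hypothetical (it is our own illustration, not part of .NET or of this article’s repository), and it assumes .NET 6 or later for the Vector.Sum() method. It loads Vector<int>.Count elements per iteration and then handles any leftover elements one at a time:

```csharp
using System;
using System.Numerics;

// Hypothetical helper: sums an int array using Vector<int>.
static int Sum(int[] values)
{
    var accumulator = Vector<int>.Zero;
    int width = Vector<int>.Count;
    int i = 0;

    // Main loop: load Count elements starting at offset i, then advance by Count.
    for (; i <= values.Length - width; i += width)
    {
        accumulator += new Vector<int>(values, i);
    }

    // Add the elements of the accumulator vector together (.NET 6+).
    int total = Vector.Sum(accumulator);

    // Scalar tail: the remaining elements that don't fill a whole vector.
    for (; i < values.Length; i++)
    {
        total += values[i];
    }

    return total;
}

int[] numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10];
Console.WriteLine(Sum(numbers)); // 55
```

Because the loop steps by Vector<int>.Count instead of a hard-coded number, the same code returns the same result whether a vector holds 4, 8, or more elements.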

Now, alongside this Vector<T> type, we also have the Vector128<T> and Vector256<T> types in the System.Runtime.Intrinsics namespace. These represent vectors with a fixed width of 128 and 256 bits respectively, regardless of the CPU.
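As a short sketch of the fixed-width types, here is an addition of two 128-bit integer vectors. We assume .NET 7 or later here, since that is where the arithmetic operators and the Vector128.Sum() method for these types were introduced:

```csharp
using System;
using System.Runtime.Intrinsics;

// A Vector128<int> always holds exactly four 32-bit integers.
Vector128<int> a = Vector128.Create(1, 2, 3, 4);
Vector128<int> b = Vector128.Create(10, 20, 30, 40);

// Element-wise addition, then the sum of all four elements (.NET 7+).
Vector128<int> sum = a + b;
int total = Vector128.Sum(sum);

Console.WriteLine(sum.GetElement(0)); // 11
Console.WriteLine(total);             // 110
```

Since the width is fixed, we know the element count when we write the code, which is the main trade-off against the hardware-sized Vector<T>.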

Moving on, let’s now see how to create and use SIMD-accelerated matrices.

Matrix

The SIMD-accelerated matrix types in .NET include the Matrix3x2 and Matrix4x4 structs. These types enable us to represent our data in rows and columns:

var matrix1 = new Matrix3x2(
        1f, 2f,
        5f, 6f,
        9f, 10f);

var matrix2 = new Matrix4x4(
        1f, 2f, 3f, 4f,
        5f, 6f, 7f, 8f,
        9f, 10f, 11f, 12f,
        13f, 14f, 15f, 16f);

With these types, we can perform a wide range of matrix-related operations like multiplying two matrices, or finding the transpose of a matrix:

var matrix3 = Matrix4x4.Transpose(matrix2);
var matrixMultResult = Matrix4x4.Multiply(matrix2, matrix3);

First, we invoke the Transpose() method to get the transpose of our matrix2 object. Then, we multiply the original matrix and its transpose, using the Multiply() method.

Other operations that we can perform with these matrix types include inverting our matrices, calculating their determinants, etc.
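Here is a small sketch of those two operations. Note that the 1-to-16 matrix from the example above is singular (its rows are linearly dependent, so its determinant is zero) and cannot be inverted. For that reason, this sketch uses a simple scale matrix of our own choosing instead:

```csharp
using System;
using System.Numerics;

// A uniform scale by 2 is invertible; its determinant is 2 * 2 * 2 * 1 = 8.
Matrix4x4 scale = Matrix4x4.CreateScale(2f);
float determinant = scale.GetDeterminant();
Console.WriteLine(determinant);

// Invert() returns false when the matrix has no inverse.
bool inverted = Matrix4x4.Invert(scale, out Matrix4x4 inverse);
if (inverted)
{
    // A matrix multiplied by its inverse should be (approximately) the identity.
    Matrix4x4 product = scale * inverse;
    Console.WriteLine(product);
}
```

Checking the return value of Invert() matters in practice, because the output matrix is not meaningful when the inversion fails.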

With that, we’ve covered the most common SIMD-accelerated numeric types in .NET.

Next up, let’s perform a test to quantify the performance gains that we can realize when we utilize SIMD-accelerated types.

Benchmarking the Performance of SIMD-Accelerated Numeric Types

For this benchmark, we will utilize the BenchmarkDotNet library to compare the performance of creating and multiplying two 4×4 matrices with SIMD and without SIMD.

To do this, let’s first implement our SIMD-accelerated multiplication method:

public static Matrix4x4 CreateAndMultiplyTwoMatricesWithSIMD()
{
    var matrix1 = new Matrix4x4(
            1f, 2f, 3f, 4f,
            5f, 6f, 7f, 8f,
            9f, 10f, 11f, 12f,
            13f, 14f, 15f, 16f);

    var matrix2 = matrix1;

    return Matrix4x4.Multiply(matrix1, matrix2);
}

Here, we create a Matrix4x4 instance, copy it into a second variable (Matrix4x4 is a struct, so the assignment copies the values), and multiply the two.

Next, let’s create a manual non-SIMD matrix multiplier:

public static float[,] CreateAndMultiplyTwoMatricesWithoutSIMD()
{
    float[,] matrix1
        = {
        { 1f, 2f, 3f, 4f },
        { 5f, 6f, 7f, 8f },
        { 9f, 10f, 11f, 12f },
        { 13f, 14f, 15f, 16f }
    };

    float[,] matrix2 = matrix1;

    float[,] result = new float[4, 4];
    for (int i = 0; i < 4; i++)
    {
        for (int j = 0; j < 4; j++)
        {
            result[i, j] = 0;
            for (int k = 0; k < 4; k++)
            {
                result[i, j] += matrix1[i, k] * matrix2[k, j];
            }
        }
    }

    return result;
}

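In this method, we multiply a 4×4 multidimensional float array by itself. Note that, because arrays are reference types, matrix2 refers to the same array as matrix1 rather than to a copy. The three nested loops perform 64 scalar multiply-and-add steps, one at a time, so we can already expect this method to be slower than the SIMD version.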

However, let’s not conclude hastily. Let’s first run our benchmark before we make our deductions:

| Method                                  | Mean      | Error    | StdDev   | Ratio         | Allocated |
|---------------------------------------- |----------:|---------:|---------:|--------------:|----------:|
| CreateAndMultiplyTwoMatricesWithSIMD    |  19.52 ns | 0.068 ns | 0.064 ns | 12.90x faster |         - |
| CreateAndMultiplyTwoMatricesWithoutSIMD | 251.79 ns | 3.207 ns | 3.000 ns |      baseline |     208 B |

As we can see, the SIMD-accelerated matrix multiplication is almost 13 times faster than the manual, non-SIMD one. Also, we see that the SIMD version doesn’t allocate any managed memory, because Matrix4x4 is a struct, whereas the manual method allocates 208 bytes for its two arrays. This shows that processing our data in parallel via the SIMD approach can drastically improve the performance of our applications. Keep in mind that the manual version also pays for those allocations and for multidimensional array indexing, so not all of the difference is due to SIMD alone.

However, it’s important to note that this benchmark on the matrix type doesn’t represent all SIMD-accelerated numeric types. We should always conduct benchmarks to determine if implementing a particular SIMD-enhanced algorithm will lead to any performance improvements in our application.
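A timing comparison is also only meaningful if both implementations compute the same result. As a quick sanity check, the following sketch repeats both multiplications and compares them element by element. It is our own addition, and it assumes .NET 7 or later for the Matrix4x4 row and column indexer:

```csharp
using System;
using System.Numerics;

var simdMatrix = new Matrix4x4(
    1f, 2f, 3f, 4f,
    5f, 6f, 7f, 8f,
    9f, 10f, 11f, 12f,
    13f, 14f, 15f, 16f);
Matrix4x4 simdProduct = Matrix4x4.Multiply(simdMatrix, simdMatrix);

float[,] plainMatrix =
{
    { 1f, 2f, 3f, 4f },
    { 5f, 6f, 7f, 8f },
    { 9f, 10f, 11f, 12f },
    { 13f, 14f, 15f, 16f }
};
var plainProduct = new float[4, 4];
for (int i = 0; i < 4; i++)
{
    for (int j = 0; j < 4; j++)
    {
        for (int k = 0; k < 4; k++)
        {
            plainProduct[i, j] += plainMatrix[i, k] * plainMatrix[k, j];
        }
    }
}

// Compare every element; Matrix4x4 exposes a [row, column] indexer in .NET 7+.
bool same = true;
for (int i = 0; i < 4; i++)
{
    for (int j = 0; j < 4; j++)
    {
        same &= simdProduct[i, j] == plainProduct[i, j];
    }
}

Console.WriteLine(same); // True
```

Exact equality is safe here only because every intermediate value is a small whole number. For general floating-point data, we would compare with a tolerance instead.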

Conclusion

In this article, we’ve explored what Single Instruction, Multiple Data (SIMD) is all about. We also examined the various types available in .NET that we can use to take advantage of this powerful parallel data processing technique.

Furthermore, we observed the performance improvements achievable by utilizing SIMD-accelerated numeric types, particularly the Matrix4x4 struct.

However, we should remember that in some scenarios, using these types may not lead to substantial performance improvements. Therefore, we should always run benchmarks to measure the actual performance gains before committing to any SIMD-enhanced algorithm.
