In software development, it is often necessary to divide data into manageable batches for more efficient processing and improved application performance. In this article, we will look at some approaches to dividing data into batches using LINQ methods, with the C# programming language.
Let’s take a closer look.
What Is Batch Processing
To better understand the techniques available in C# for dividing data into batches, we first need to understand the concept of batch processing. Batch processing is a data processing technique that handles a set of data items, known as a batch, as a group.
Instead of processing individual pieces of data one at a time, batch processing handles multiple data items together. This approach is particularly useful when dealing with large datasets that would be impractical or inefficient to process all at once.
Moreover, using batch processing has several advantages, including:
- Efficiency: Processing data in batches can be more efficient than handling individual elements, especially when dealing with extensive datasets
- Parallelism: Independent batches can be processed concurrently, which can improve performance in multi-core or distributed computing environments
- Resource Utilization: Batch processing enables better utilization of system resources by limiting how much data is held in memory and worked on at any one time
Therefore, it is important to consider using batch processing when dealing with large datasets to ensure efficient and optimal processing. Let’s now see how we can extract the data in batches.
Generate Batches Using Foreach Loop
Before we turn to LINQ methods, it’s useful to understand how to achieve the same result using a traditional loop.
Let’s take a look at an example:
public static List<List<T>> BatchByTraditional<T>(this IEnumerable<T> source, int batchSize)
{
    var batches = new List<List<T>>();
    var batch = new List<T>(batchSize);

    foreach (var item in source)
    {
        batch.Add(item);

        // Keep filling the current batch until it reaches batchSize
        if (batch.Count != batchSize)
        {
            continue;
        }

        batches.Add(batch);
        batch = new List<T>(batchSize);
    }

    // Add the last, partially filled batch (if any)
    if (batch.Count > 0)
    {
        batches.Add(batch);
    }

    return batches;
}
First, we create an empty list of batches. Then, we iterate through the source and add elements to a temporary batch. When the batch reaches the desired size, we add it to the list and create a new batch to continue the process. We repeat this until we have processed all the elements, and finally add any remaining, partially filled batch.
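To make the handling of the last batch concrete, here is a minimal sketch that runs the same loop-based approach on ten numbers with a batch size of 4. The class name TraditionalBatchDemo and the sample values are assumptions chosen for illustration, and the method is repeated only so that the sketch compiles on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical demo class, not part of the article's project.
public static class TraditionalBatchDemo
{
    // Same loop-based approach as BatchByTraditional(), repeated for self-containment.
    public static List<List<T>> BatchByTraditional<T>(this IEnumerable<T> source, int batchSize)
    {
        var batches = new List<List<T>>();
        var batch = new List<T>(batchSize);

        foreach (var item in source)
        {
            batch.Add(item);

            // A full batch is stored and a fresh one is started
            if (batch.Count == batchSize)
            {
                batches.Add(batch);
                batch = new List<T>(batchSize);
            }
        }

        // The leftover elements form a final, smaller batch
        if (batch.Count > 0)
        {
            batches.Add(batch);
        }

        return batches;
    }

    public static void Main()
    {
        var batches = Enumerable.Range(1, 10).BatchByTraditional(4);

        foreach (var batch in batches)
        {
            Console.WriteLine(string.Join(", ", batch));
        }
        // prints:
        // 1, 2, 3, 4
        // 5, 6, 7, 8
        // 9, 10
    }
}
```

Because 10 is not a multiple of 4, the final batch holds only the two remaining elements.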
Generate Batches via LINQ Methods
Let’s explore how we can obtain an identical outcome by employing LINQ methods:
public static List<List<T>> BatchByLinq<T>(this IEnumerable<T> source, int batchSize)
{
    return source
        .Select((x, i) => new { Index = i, Value = x })   // pair each element with its index
        .GroupBy(x => x.Index / batchSize)                // integer division yields the batch number
        .Select(x => x.Select(v => v.Value).ToList())     // unwrap the original values
        .ToList();
}
In our BatchByLinq() method, we use LINQ to partition the source enumerable into batches of the specified batchSize. We begin by projecting each element of the source into an anonymous object containing its Index and Value. We achieve this using the Select() overload that also supplies the element’s index.
Next, we group these objects by the integer division of the index by the batch size, employing the GroupBy() method. Because integer division discards the remainder, all elements that belong to the same batch get the same key and end up in the same group.
Then, we use the Select() method again to extract the original values from each group and convert each group into an individual batch using ToList(). Finally, we convert the entire result into a list of batches.
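The grouping step is easier to follow with concrete keys. The following is a small sketch, with a hypothetical GroupKeyDemo class and made-up sample data, that prints the key computed by index / batchSize for seven elements and a batch size of 3.

```csharp
using System;
using System.Linq;

// Hypothetical demo class, not part of the article's project.
public static class GroupKeyDemo
{
    // Hypothetical helper: the group key (index / batchSize) for each position 0..count-1.
    public static int[] GroupKeys(int count, int batchSize) =>
        Enumerable.Range(0, count).Select(i => i / batchSize).ToArray();

    public static void Main()
    {
        var letters = new[] { "a", "b", "c", "d", "e", "f", "g" };
        const int batchSize = 3;

        var groups = letters
            .Select((x, i) => new { Index = i, Value = x })
            .GroupBy(x => x.Index / batchSize);

        foreach (var group in groups)
        {
            Console.WriteLine($"key {group.Key}: {string.Join(", ", group.Select(v => v.Value))}");
        }
        // prints:
        // key 0: a, b, c
        // key 1: d, e, f
        // key 2: g
    }
}
```

Indexes 0 to 2 divide to 0, indexes 3 to 5 divide to 1, and index 6 divides to 2, which is why the last group contains a single element.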
Let’s now see how we can use this method:
var data = Enumerable.Range(0, 10000).ToList();
var linqBatches = data.BatchByLinq(100);

foreach (var batch in linqBatches)
{
    Console.WriteLine(string.Join(", ", batch));
}

Console.WriteLine($"LINQ method created {linqBatches.Count} batches");
We create a list of integers called data and use the BatchByLinq() method to split it into 100-element batches. Then, we iterate through each batch and display its elements as a comma-separated string. Finally, we print the total number of batches created by our LINQ method.
Although this works, the code for creating the batches is fairly involved. So, let’s explore how we can simplify it to a single line of code.
LINQ Chunk Method to Generate Batches
With the release of .NET 6, a new LINQ extension method, Chunk(), was introduced, which can divide data into batches. Let’s take a look at its usage:
public static IEnumerable<IEnumerable<T>> BatchByChunk<T>(this IEnumerable<T> source, int batchSize)
{
    return source.Chunk(batchSize);
}
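We use the built-in Chunk() method, passing batchSize as a parameter, to split the data into batches. Chunk() returns an IEnumerable<T[]>, so each batch is an array, and the last one may contain fewer than batchSize elements.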
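As a quick illustration of what Chunk() produces, the sketch below splits ten numbers into chunks of 4 and prints each one. It assumes .NET 6 or later, and the ChunkDemo class and ChunkRange() helper are hypothetical names used only for this example.

```csharp
using System;
using System.Linq;

// Hypothetical demo class, not part of the article's project. Requires .NET 6 or later.
public static class ChunkDemo
{
    // Hypothetical helper: materializes the chunks of 1..count so they can be inspected.
    public static int[][] ChunkRange(int count, int size) =>
        Enumerable.Range(1, count).Chunk(size).ToArray();

    public static void Main()
    {
        // Each chunk produced by Chunk() is a plain array
        foreach (int[] chunk in ChunkRange(10, 4))
        {
            Console.WriteLine($"{chunk.Length} items: {string.Join(", ", chunk)}");
        }
        // prints:
        // 4 items: 1, 2, 3, 4
        // 4 items: 5, 6, 7, 8
        // 2 items: 9, 10
    }
}
```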
Now that we have a good understanding of various approaches to dividing large data into batches, let’s examine which approach is the best.
Performance Benchmarks
We will be using the BenchmarkDotNet library. Once we have installed its NuGet package, we can proceed with implementing the benchmarks:
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Order;

[Orderer(SummaryOrderPolicy.FastestToSlowest)]
public class DataBatchingBenchmarks
{
    private const int BatchSize = 100;
    private readonly List<int> _data = Enumerable.Range(0, 10000).ToList();

    // code removed for brevity
}
To compare the performance of the different approaches, we wrap each one in a separate method decorated with the [Benchmark] attribute. We divide a list of 10000 integer values into batches, using a batch size of 100. Let’s now review the results:
| Method                      |           Mean |       Error |      StdDev |
|---------------------------- |---------------:|------------:|------------:|
| BatchByChunkBenchmark       |       5.901 ns |   0.0194 ns |   0.0172 ns |
| BatchByTraditionalBenchmark |  24,072.385 ns | 127.2362 ns | 119.0168 ns |
| BatchByLinqBenchmark        | 204,392.100 ns | 452.0273 ns | 400.7104 ns |
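In these results, the Chunk() approach is far ahead of the other two. The size of the gap deserves a caveat, though: Chunk() uses deferred execution, so calling it only creates an iterator, and the chunks are produced on the fly as the result is enumerated. A mean of about 6 ns is too short for reading 10,000 elements, which suggests that the benchmark measured creating the iterator rather than producing the batches, while the other two methods build all of their batches eagerly. Each chunk is a T[] array allocated on the heap, not a value type. Because arrays implement the IEnumerable<T> and IList<T> interfaces, the chunks work with foreach loops and LINQ queries.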
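One thing worth keeping in mind when reading such numbers is that Chunk() is lazily evaluated. The sketch below is an illustration only, using a hypothetical CountingSource() iterator that counts how many items have been pulled from it; it shows that no element is read until the chunks are actually enumerated.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical demo class, not part of the article's benchmark. Requires .NET 6 or later.
public static class DeferredChunkDemo
{
    // Number of items pulled from CountingSource() so far
    public static int ItemsRead;

    // Hypothetical source that records every item it hands out
    public static IEnumerable<int> CountingSource(int count)
    {
        for (var i = 0; i < count; i++)
        {
            ItemsRead++;
            yield return i;
        }
    }

    public static void Main()
    {
        // Calling Chunk() only builds the query; nothing is read yet
        var chunks = CountingSource(10_000).Chunk(100);
        Console.WriteLine(ItemsRead);          // prints 0

        // Enumerating the result is what reads the source and allocates the arrays
        var materialized = chunks.ToList();
        Console.WriteLine(ItemsRead);          // prints 10000
        Console.WriteLine(materialized.Count); // prints 100
    }
}
```

For a like-for-like comparison with methods that return fully built lists, a benchmark of Chunk() would need to enumerate the result as well, for example by calling ToList() on it.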
Conclusion
For developers, understanding the concept of batching is important, particularly when working with large datasets. In C#, there are several ways to divide data into batches, and the built-in Chunk() method is the most concise of them and the fastest in our benchmark.