In the vast landscape of C# programming, handling strings stands out as a fundamental and frequently encountered task. However, ensuring a consistent byte representation of strings without explicitly specifying an encoding can be a bit perplexing for developers navigating the intricacies of character encoding in the C# ecosystem.

To download the source code for this article, you can visit our GitHub repository.

Let’s get started!

The Inner Workings of C# Strings

In C#, a string is a sequence of char values, each of which is a 16-bit UTF-16 code unit. While this in-memory format supports the full Unicode range, it is not necessarily the format that other systems expect. The challenge is that different scenarios, such as file operations, network communication, or data storage, demand a reliable byte representation, and we may not want to deal with encoding specifics to get one.

Benefits of Encoding Independence in Consistent Byte Representation

By adopting an encoding-independent approach, we unlock several advantages. Most notably, the method accommodates invalid characters, such as unpaired surrogate code units, without complicating the encoding/decoding process. This is particularly beneficial in scenarios where strings might contain data that is not well-formed Unicode text, which an encoder such as UTF-8 would replace with a substitute character.

The simplicity of the approach lets us focus on the task at hand. Because the bytes are copied verbatim, we can always reconstruct the original string from them, whatever it contains.

Consistent Byte Representation of Strings With an Encoding-Free Solution

To address this challenge, we propose a straightforward approach that sidesteps encoding concerns, emphasizing the simplicity and efficiency of obtaining byte arrays without delving into the complexities of character encodings. Contrary to conventional wisdom, achieving a consistent byte representation doesn’t necessarily hinge on encoding specifications.

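This uncomplicated yet effective approach ensures that we get “the bytes the string has been stored in” and facilitates the reconstruction of the original string from those bytes. Because the bytes mirror the in-memory layout, their order follows the endianness of the machine, so the result is consistent for round trips on the same architecture rather than a portable interchange format.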

Implementation

Let’s dive into the practical implementation of this encoding-free solution. We can seamlessly convert between strings and byte arrays. First, we need a function for converting strings into byte arrays:

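public static class StringByteConverter
{
    public static byte[] GetBytes(string str)
    {
        // Each char is a 16-bit UTF-16 code unit, so we need two bytes per char.
        var bytes = new byte[str.Length * sizeof(char)];

        // Copy the raw memory of the characters; the count is given in bytes.
        Buffer.BlockCopy(str.ToCharArray(), 0, bytes, 0, bytes.Length);

        return bytes;
    }
}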

The GetBytes() method takes a string input and transforms it into a byte array. It achieves this by calculating the necessary size for the byte array, two bytes per character, and then efficiently copying the characters of the string into the byte array using the Buffer.BlockCopy() method.
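To make the resulting layout concrete, here is a minimal sketch that prints the bytes produced for a short string. It repeats GetBytes() so that it can run on its own, and it assumes C# top-level statements (C# 9 or later). The byte order in the comment holds on little-endian hardware, which is why the sketch also prints BitConverter.IsLittleEndian:

```csharp
using System;

// Each char is one 16-bit code unit, so "Hi" occupies four bytes.
byte[] bytes = StringByteConverter.GetBytes("Hi");

Console.WriteLine(bytes.Length);                 // 4
Console.WriteLine(BitConverter.ToString(bytes)); // 48-00-69-00 on little-endian hardware
Console.WriteLine(BitConverter.IsLittleEndian);  // tells us which byte order applies here

public static class StringByteConverter
{
    public static byte[] GetBytes(string str)
    {
        var bytes = new byte[str.Length * sizeof(char)];
        Buffer.BlockCopy(str.ToCharArray(), 0, bytes, 0, bytes.Length);

        return bytes;
    }
}
```

On a big-endian machine the two bytes of each character would appear in the opposite order, so the printed layout should be read as an illustration rather than a guarantee.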

Then we need a second method for converting the byte array back to a string:

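public static class StringByteConverter
{
    public static string GetString(byte[] bytes)
    {
        // Each char needs two bytes, so an odd-length array cannot hold whole characters.
        if (bytes.Length % sizeof(char) != 0)
            throw new ArgumentException("Byte count must be a multiple of sizeof(char).", nameof(bytes));

        var chars = new char[bytes.Length / sizeof(char)];
        Buffer.BlockCopy(bytes, 0, chars, 0, bytes.Length);

        return new string(chars);
    }
}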

The GetString() method reverses the process by taking a byte array and converting it back into a string. It calculates the number of characters in the resulting string, then efficiently copies the bytes from the byte array to the character array using Buffer.BlockCopy(). Finally, it creates a new string from the character array.

We can now call these two methods to convert the string into a byte array and back:

const string sampleString = "𝓒𝓸𝓭𝓮𝓜𝓪𝔃𝓮";
byte[] byteArray = StringByteConverter.GetBytes(sampleString);
string reconstructedString = StringByteConverter.GetString(byteArray);

In this example, sampleString is first converted to a byte array using our StringByteConverter.GetBytes() method. Then, the byte array is converted back to a string using our StringByteConverter.GetString() method.

The reconstructedString should now be equal to the original sampleString.
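The claim about invalid characters can be checked with a small experiment. The sketch below repeats both converter methods so that it runs on its own (assuming C# top-level statements), and compares the raw copy with a UTF-8 round trip for a string that contains a lone high surrogate. The test string is a made-up example, not one from the article's repository:

```csharp
using System;
using System.Text;

// A lone high surrogate is not well-formed Unicode, but it is a legal char value.
string invalid = "A\uD800B";

// Raw copy: every 16-bit code unit survives the round trip.
string viaRawBytes = StringByteConverter.GetString(StringByteConverter.GetBytes(invalid));
Console.WriteLine(viaRawBytes == invalid);  // True

// UTF-8 with its default replacement fallback turns the lone surrogate into U+FFFD.
string viaUtf8 = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(invalid));
Console.WriteLine(viaUtf8 == invalid);      // False
Console.WriteLine(viaUtf8 == "A\uFFFDB");   // True

public static class StringByteConverter
{
    public static byte[] GetBytes(string str)
    {
        var bytes = new byte[str.Length * sizeof(char)];
        Buffer.BlockCopy(str.ToCharArray(), 0, bytes, 0, bytes.Length);

        return bytes;
    }

    public static string GetString(byte[] bytes)
    {
        var chars = new char[bytes.Length / sizeof(char)];
        Buffer.BlockCopy(bytes, 0, chars, 0, chars.Length * sizeof(char));

        return new string(chars);
    }
}
```

The raw copy never inspects the characters, which is why it cannot lose any of them.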

Modern Implementation Using Span

There is also a more modern version using Span:

public static class StringByteConverter
{
    public static byte[] GetBytesWithSpan(ReadOnlySpan<char> charSpan)
    {
        ReadOnlySpan<byte> byteSpan = MemoryMarshal.AsBytes(charSpan);
        var bytes = new byte[byteSpan.Length];
        byteSpan.CopyTo(bytes);

        return bytes;
    }
}

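The GetBytesWithSpan() method uses the modern Span approach to convert a sequence of characters into a byte array. Because a string converts implicitly to ReadOnlySpan<char>, we can pass a string to it directly.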

We begin by converting the ReadOnlySpan<char> to a ReadOnlySpan<byte> using MemoryMarshal.AsBytes(). Next, we allocate our destination byte array. Finally, we copy the contents of the ReadOnlySpan<byte> into the destination array before returning it.

Using the same approach with Span<>, we can also create a string from a byte array:

public static class StringByteConverter
{
    public static string GetStringWithSpan(byte[] bytes)
    {
        Span<char> charsSpan = MemoryMarshal.Cast<byte, char>(bytes);

        return new string(charsSpan);
    }
}

The GetStringWithSpan() method employs Span to convert a byte array back into a string. We start by treating the byte array as a Span<char> using MemoryMarshal.Cast(). Then, we create a new string, our reconstructed string, from the character span and return it.
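One practical benefit of the span-based signature is that it accepts slices of a string without allocating a substring first. The following sketch illustrates this under a few assumptions: it repeats both span methods so that it runs on its own, it uses C# top-level statements, and the sample text and the slice are arbitrary choices:

```csharp
using System;
using System.Runtime.InteropServices;

const string text = "CodeMaze";

// A string converts implicitly to ReadOnlySpan<char>.
byte[] all = StringByteConverter.GetBytesWithSpan(text);

// AsSpan() takes a slice of the string without creating a new string.
byte[] firstFour = StringByteConverter.GetBytesWithSpan(text.AsSpan(0, 4));

Console.WriteLine(all.Length);       // 16
Console.WriteLine(firstFour.Length); // 8
Console.WriteLine(StringByteConverter.GetStringWithSpan(firstFour)); // Code

public static class StringByteConverter
{
    public static byte[] GetBytesWithSpan(ReadOnlySpan<char> charSpan)
    {
        // Reinterpret the characters as bytes, then copy them into a new array.
        return MemoryMarshal.AsBytes(charSpan).ToArray();
    }

    public static string GetStringWithSpan(byte[] bytes)
    {
        // Reinterpret the bytes as characters without copying them first.
        ReadOnlySpan<byte> source = bytes;
        ReadOnlySpan<char> chars = MemoryMarshal.Cast<byte, char>(source);

        return new string(chars);
    }
}
```

Here ToArray() stands in for the explicit allocate-and-copy steps; both variants produce the same array.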

To learn more about the details and inner workings of Span<>, check out our dedicated article.

Consistent Byte Representation With a Look at Encoding

In some scenarios, encoding does matter, e.g. when working with specific character sets or interoperating with systems expecting a particular encoding.

Here the Encoding class from the System.Text namespace comes into play:

string cafe = "Café";
byte[] asciiBytes = Encoding.ASCII.GetBytes(cafe);
byte[] utf8Bytes = Encoding.UTF8.GetBytes(cafe);
Console.WriteLine("ASCII Output: " + Encoding.ASCII.GetString(asciiBytes));
Console.WriteLine("UTF-8 Output: " + Encoding.UTF8.GetString(utf8Bytes));

The Encoding.ASCII.GetBytes() method exemplifies this, allowing us to explicitly specify the encoding and obtain the corresponding byte representation of a string. When we convert the ASCII-encoded bytes back to a string and display it on the console, we get the following output:

ASCII Output: Caf?

In this example, the letter é cannot be converted properly. When characters like é fall outside the 7-bit ASCII range (0-127), the Encoding.ASCII.GetBytes() method substitutes a default fallback character, the question mark ?. Because é has the code point U+00E9 (decimal 233), it falls outside this range and is encoded as a question mark. ASCII’s 7-bit constraint ensures a consistent single-byte representation, but only for characters within its range.

In contrast, UTF-8 uses two to four bytes for every character with a value higher than 127, thus resulting in a potentially larger byte array, but preserving all characters. For example, é takes two bytes in UTF-8. We employ Encoding.UTF8.GetBytes() to encode our string as a UTF-8 byte array, ensuring all characters are preserved:

UTF-8 Output: Café
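Looking at the raw bytes makes the difference between the encodings visible. The sketch below is an illustration and not part of the original sample: it writes é as the escape \u00E9 to make the code point explicit, and it adds Encoding.Unicode (UTF-16, little-endian) for comparison:

```csharp
using System;
using System.Text;

// "\u00E9" is é as a single precomposed code point (decimal 233).
string cafe = "Caf\u00E9";

// ASCII replaces é with '?' (0x3F).
Console.WriteLine(BitConverter.ToString(Encoding.ASCII.GetBytes(cafe)));   // 43-61-66-3F

// UTF-8 encodes é as the two bytes 0xC3 0xA9.
Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(cafe)));    // 43-61-66-C3-A9

// UTF-16 little-endian uses two bytes for each of these characters.
Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetBytes(cafe))); // 43-00-61-00-66-00-E9-00
```

The ASCII result is one byte per character but has lost the é, while the UTF-8 result is one byte longer and keeps it.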

Conclusion

In wrapping up our exploration, we reaffirm the simplicity and efficacy of a no-encoding approach for obtaining a consistent byte representation of strings in C#. With this encoding-free solution, we pave the way for a smoother and more straightforward coding experience.

However, we acknowledge that in specific situations where encoding nuances are crucial, the Encoding class offers the flexibility needed to handle diverse character sets and encoding requirements. The key is to strike a balance, leveraging simplicity when appropriate and diving into encoding specifics when necessary. This nuanced approach ensures versatility in our coding endeavors while maintaining the elegance of straightforward solutions.
