Description
Today, when converting UTF-16 text to UTF-8 text (or vice versa), we have two options:
- Use higher-level methods that are inefficient
- Use lower-level methods that are very efficient but harder to use
Things are mostly great when the source text is in a single string or char[], but less great when the text is a stream of data, or broken into chunks. One of the problems is that we have multiple ways to represent these chunks of data and not enough APIs that support converting from one to the other.

Utf16 |
---|
string |
char[] |
ReadOnly/Span/Memory<char> |
StringBuilder |
StreamReader/StreamWriter |
ReadOnlySequence<char> |

Utf8 |
---|
byte[] |
ReadOnly/Span/Memory<byte> |
ReadOnlySequence<byte> |
IBufferWriter<byte> |
Stream |
PipeWriter/PipeReader |

Rune |
---|

APIs with encoding operations |
---|
Encoding.UTF8 |
Rune.* |
Utf8.* |

What further complicates things is that there sometimes aren't good conversions between some of these types (and none that are allocation-free).
Here's an example that came up recently:
```csharp
public async Task ExecuteAsync(HttpContext httpContext)
{
    httpContext.Response.ContentType = $"{MediaTypeNames.Text.Csv}; charset=utf-8";
    if (!string.IsNullOrWhiteSpace(filename))
    {
        httpContext.Response.Headers.Append("Content-Disposition", $"attachment; filename=\"{filename}.csv\"");
    }

    // Don't dispose of writer as that will result in the synchronous version of
    // httpContext.Response.Body.Write to be called which is not supported.
    var writer = new StreamWriter(httpContext.Response.Body, Encoding.UTF8, bufferSize: 4096, leaveOpen: true);
    var stringBuilder = new StringBuilder();

    if (headerRowAction is not null)
    {
        headerRowAction.Invoke(stringBuilder);
        await writer.WriteLineAsync(stringBuilder, cancellationToken);
    }

    await foreach (var item in items.WithCancellation(cancellationToken))
    {
        stringBuilder.Clear();
        itemToRowAction.Invoke(item, stringBuilder);
        await writer.WriteLineAsync(stringBuilder, cancellationToken);
    }

    await writer.FlushAsync(cancellationToken);
}
```
There are tons of copies here: StringBuilder ->(copy) StreamWriter (char[] ->(transcode) byte[]) -> HttpResponseStream (copy).
This could be improved by writing the StringBuilder (assuming that's the right public API here) directly to the underlying HttpResponse buffer (PipeWriter/IBufferWriter<byte>). The problem is that I'm now stuck writing complex encoding logic to translate UTF-16 to UTF-8.
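For illustration, here is roughly what that hand-written logic looks like today. This is a minimal sketch, not a proposed API shape: `WriteUtf8` and `StringBuilderUtf8Extensions` are hypothetical names. It walks the builder with `StringBuilder.GetChunks()` and pushes each chunk through a stateful `Encoder` using the existing `EncodingExtensions.Convert` overload that targets `IBufferWriter<byte>`. The stateful encoder matters because a surrogate pair can be split across two chunks.

```csharp
using System;
using System.Buffers;
using System.Text;

// Hypothetical helper for illustration only; not an existing BCL API.
static class StringBuilderUtf8Extensions
{
    // Transcodes the builder's UTF-16 contents to UTF-8 directly into the writer,
    // without first materializing a string or an intermediate char[].
    public static long WriteUtf8(this IBufferWriter<byte> writer, StringBuilder builder)
    {
        // A stateful encoder remembers a trailing high surrogate between chunks.
        Encoder encoder = Encoding.UTF8.GetEncoder();
        long totalBytes = 0;

        foreach (ReadOnlyMemory<char> chunk in builder.GetChunks())
        {
            // EncodingExtensions.Convert requests buffer space from the writer and advances it.
            encoder.Convert(chunk.Span, writer, flush: false, out long bytesUsed, out _);
            totalBytes += bytesUsed;
        }

        // Final flush so any dangling state (e.g. an unpaired surrogate) is emitted.
        encoder.Convert(ReadOnlySpan<char>.Empty, writer, flush: true, out long flushedBytes, out _);
        return totalBytes + flushedBytes;
    }
}
```

In the ASP.NET Core example above, the target would presumably be `httpContext.Response.BodyWriter` (a `PipeWriter`, which implements `IBufferWriter<byte>`), followed by the usual `FlushAsync`. Every caller having to know about chunk enumeration and encoder state is the kind of friction this issue is about.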
Another example: https://www.reddit.com/r/dotnet/comments/1gx11ex/reading_streams_efficiently/.
PS: We did some of this work in System.Memory for ReadOnlySequence<char> a while back: https://learn.microsoft.com/en-us/dotnet/api/system.text.encodingextensions?view=net-8.0. We'd need to do similar work to expand the set of types that can participate in this conversion (OR we can push IBufferWriter lower into the stack 😄).
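As a reference point, this is a small sketch of what the existing `EncodingExtensions` support looks like for `ReadOnlySequence<char>`. `CharSegment` and `SequenceTranscodeDemo` are hypothetical names that exist only to build a two-segment sequence for the demo. The only library call doing the transcoding is `Encoding.UTF8.GetBytes(in ReadOnlySequence<char>, IBufferWriter<byte>)`.

```csharp
using System;
using System.Buffers;
using System.Text;

// Hypothetical segment type, used only to build a multi-segment sequence for the demo.
sealed class CharSegment : ReadOnlySequenceSegment<char>
{
    public CharSegment(ReadOnlyMemory<char> memory) => Memory = memory;

    // Links a new segment after this one and returns it.
    public CharSegment Append(ReadOnlyMemory<char> memory)
    {
        var next = new CharSegment(memory) { RunningIndex = RunningIndex + Memory.Length };
        Next = next;
        return next;
    }
}

static class SequenceTranscodeDemo
{
    // Encodes two UTF-16 chunks as one logical text, straight into a buffer writer.
    public static byte[] EncodeTwoChunks(string firstChunk, string secondChunk)
    {
        var first = new CharSegment(firstChunk.AsMemory());
        var last = first.Append(secondChunk.AsMemory());
        var sequence = new ReadOnlySequence<char>(first, 0, last, last.Memory.Length);

        var writer = new ArrayBufferWriter<byte>();
        // Existing EncodingExtensions API: sequence in, IBufferWriter<byte> out.
        Encoding.UTF8.GetBytes(sequence, writer);
        return writer.WrittenSpan.ToArray();
    }
}
```

The sequence is treated as one continuous text, so a surrogate pair split across the two chunks should still encode as a single code point. The other UTF-16 sources listed above (StringBuilder, StreamReader, and so on) don't have an equivalent one-call path.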
I'd love to turn this into an API proposal once we get a handle on the problem.