Description
Updated by @MihaZupan on 2024-02-27
Proposed API
namespace System.Buffers.Text;
public static class Base64Url
{
// Encode bytes => utf8
public static OperationStatus EncodeToUtf8(ReadOnlySpan<byte> bytes, Span<byte> utf8, out int bytesConsumed, out int bytesWritten, bool isFinalBlock = true);
public static OperationStatus EncodeToUtf8InPlace(Span<byte> buffer, int dataLength, out int bytesWritten);
// Decode from utf8 => bytes
public static OperationStatus DecodeFromUtf8(ReadOnlySpan<byte> utf8, Span<byte> bytes, out int bytesConsumed, out int bytesWritten, bool isFinalBlock = true);
public static OperationStatus DecodeFromUtf8InPlace(Span<byte> buffer, out int bytesWritten);
// Max length APIs
public static int GetMaxDecodedFromUtf8Length(int length);
public static int GetMaxEncodedToUtf8Length(int length);
// IsValid
public static bool IsValid(ReadOnlySpan<char> base64UrlText);
public static bool IsValid(ReadOnlySpan<char> base64UrlText, out int decodedLength);
public static bool IsValid(ReadOnlySpan<byte> base64UrlTextUtf8);
public static bool IsValid(ReadOnlySpan<byte> base64UrlTextUtf8, out int decodedLength);
// Up to this point, this is a mirror of System.Buffers.Text.Base64
// Below are more helpers that bring over functionality similar to Convert.*Base64*
// Encode to / decode from chars
public static bool TryEncodeToChars(ReadOnlySpan<byte> bytes, Span<char> chars, out int charsWritten) { }
public static bool TryDecodeFromChars(ReadOnlySpan<char> chars, Span<byte> bytes, out int bytesWritten) { }
// These are just accelerator methods.
// Should be efficiently implementable on top of the other ones in just a few lines.
// Encode to string
public static string EncodeToString(ReadOnlySpan<char> chars, Encoding encoding) { }
public static string EncodeToString(ReadOnlySpan<byte> bytes) { }
// Decode from chars => string
// Decode from chars => byte[]
// The names could also just be "Decode" without naming the return type
public static string DecodeToString(ReadOnlySpan<char> chars, Encoding encoding) { }
public static byte[] DecodeToByteArray(ReadOnlySpan<char> chars) { }
}
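Since the first half of the proposal mirrors System.Buffers.Text.Base64, the call pattern can be previewed with the Base64 type that exists today. The following is a minimal sketch: the `Base64MirrorExample` class and `EncodeStandard` method are names made up for illustration, and it assumes `Base64Url.EncodeToUtf8` would be called the same way while emitting `-` and `_` instead of `+` and `/`.

```csharp
using System;
using System.Buffers;
using System.Buffers.Text;
using System.Text;

// Hypothetical helper, not part of the proposal.
public static class Base64MirrorExample
{
    // Encodes with the existing Base64 API. The proposed Base64Url.EncodeToUtf8
    // has the same parameter list, so the calling code would look the same.
    public static string EncodeStandard(ReadOnlySpan<byte> data)
    {
        // Size the destination using the existing max-length helper.
        Span<byte> utf8 = new byte[Base64.GetMaxEncodedToUtf8Length(data.Length)];

        OperationStatus status = Base64.EncodeToUtf8(
            data, utf8, out int bytesConsumed, out int bytesWritten);

        if (status != OperationStatus.Done || bytesConsumed != data.Length)
        {
            throw new InvalidOperationException($"Unexpected status: {status}");
        }

        // Base64 output is pure ASCII, so an ASCII decode is sufficient.
        return Encoding.ASCII.GetString(utf8.Slice(0, bytesWritten));
    }
}
```

For the input bytes `0xFB, 0xFF` this produces `+/8=`, which contains both characters that differ in the URL-safe alphabet.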
Original issue
The Base64 implementation in System.Memory provides excellent low-level optimizations for RFC 4648, but it currently uses a fixed alphabet. A minor refactoring would let the existing implementation cover additional use cases.
Rationale and Usage
The most obvious use case for this change is to support the encoding variant known as Base 64 URL, which is also described in RFC 4648 (§5).
This excerpt from the RFC describes the difference between the standard Base 64 alphabet and the Base 64 URL alphabet.
This encoding is technically identical to the previous one, except
for the 62:nd and 63:rd alphabet character...
I have already been able to verify the encoding and decoding in a copy of the current implementation by merely changing the 2 relevant characters in the alphabet mappings at:
Today, this logic is suboptimally implemented in at least:
- Base64UrlTextEncoder.cs (ASP.NET Core)
- WebEncoders.cs (ASP.NET Core)
Furthermore, the encoding is generic and has use cases outside of ASP.NET (e.g. you shouldn't have to reference ASP.NET to use it).
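To make the two-character difference from the RFC excerpt concrete, here is a minimal sketch built only on the existing `Convert` APIs. The `Base64UrlSketch` class and its method names are hypothetical. It allocates intermediate strings, which is exactly the kind of overhead this proposal aims to avoid, and it keeps the `=` padding.

```csharp
using System;

// Hypothetical illustration only; not the proposed implementation.
public static class Base64UrlSketch
{
    // Encode with the standard alphabet, then swap the 62nd and 63rd characters.
    public static string ToBase64Url(byte[] data) =>
        Convert.ToBase64String(data).Replace('+', '-').Replace('/', '_');

    // Swap the two characters back, then decode with the standard alphabet.
    public static byte[] FromBase64Url(string text) =>
        Convert.FromBase64String(text.Replace('-', '+').Replace('_', '/'));
}
```

For the bytes `0xFB, 0xFF` the standard encoding is `+/8=` and this sketch returns `-_8=`.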
Proposed API Change
The existing Base64.EncodeToUtf8 and Base64.DecodeFromUtf8 should each add a new method overload that allows one of the following:
1. A new enumeration of allowed alphabets (ex: Base64Alphabet) which internally maps to well-known alphabet spans
2. Allow any custom alphabet to be supplied as ReadOnlySpan<sbyte> that must have an exact length of 64 for encoding and 256 for decoding
   a. One or more new types could be provided with static properties for the well-known alphabet spans
Details
Option 1 requires less validation, but has less flexibility. Option 2 has more flexibility, including scenarios not described here, but requires more validation before using the alphabet.
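The validation trade-off between the two options can be sketched as follows. This is an illustration under assumptions: the `Base64Alphabet` members, the `AlphabetSketch` class, and both method names are hypothetical, and the uniqueness and padding checks are one plausible set of rules for a custom encoding alphabet rather than anything the proposal specifies.

```csharp
using System;
using System.Text;

// Hypothetical enum for Option 1; member names are assumptions.
public enum Base64Alphabet { Standard, Url }

public static class AlphabetSketch
{
    private const string Shared =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";

    private static readonly byte[] s_standard = Encoding.ASCII.GetBytes(Shared + "+/");
    private static readonly byte[] s_url = Encoding.ASCII.GetBytes(Shared + "-_");

    // Option 1: the enum maps to a known-good table, so only the enum value
    // itself has to be checked.
    public static ReadOnlySpan<byte> GetEncodingMap(Base64Alphabet alphabet) => alphabet switch
    {
        Base64Alphabet.Standard => s_standard,
        Base64Alphabet.Url => s_url,
        _ => throw new ArgumentOutOfRangeException(nameof(alphabet)),
    };

    // Option 2: a caller-supplied encoding alphabet has to be validated on
    // every call (assumed rules: exactly 64 entries, no duplicates, no '=').
    public static bool IsValidEncodingAlphabet(ReadOnlySpan<byte> alphabet)
    {
        if (alphabet.Length != 64)
        {
            return false;
        }

        Span<bool> seen = stackalloc bool[256];
        seen.Clear();

        foreach (byte b in alphabet)
        {
            if (b == (byte)'=' || seen[b])
            {
                return false; // padding character or duplicate entry
            }
            seen[b] = true;
        }

        return true;
    }
}
```

With Option 1 the per-call cost is a single switch; with Option 2 every call pays for a pass over the alphabet unless the validated result is cached somewhere.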
Although the standard Base 64 and Base 64 URL are technically the same encoding with different alphabets, Base 64 URL typically does not include padding. RFC 4648 §3.2 allows padding to be omitted when the referring specification explicitly says so, but there is no need to make that concession in this API. Including the = character for padding in Base 64 URL encoding is still correct.
Both approaches would have a similar looking API:
public static unsafe OperationStatus EncodeToUtf8(
ReadOnlySpan<byte> bytes,
Span<byte> utf8,
Base64Alphabet alphabet,
out int bytesConsumed,
out int bytesWritten,
bool isFinalBlock = true);
public static unsafe OperationStatus DecodeFromUtf8(
ReadOnlySpan<byte> utf8,
Span<byte> bytes,
Base64Alphabet alphabet,
out int bytesConsumed,
out int bytesWritten,
bool isFinalBlock = true);
Figure 1: Supply an alphabet enumeration
public static unsafe OperationStatus EncodeToUtf8(
ReadOnlySpan<byte> bytes,
Span<byte> utf8,
ReadOnlySpan<byte> alphabet,
out int bytesConsumed,
out int bytesWritten,
bool isFinalBlock = true);
public static unsafe OperationStatus DecodeFromUtf8(
ReadOnlySpan<byte> utf8,
Span<byte> bytes,
ReadOnlySpan<byte> alphabet,
out int bytesConsumed,
out int bytesWritten,
bool isFinalBlock = true);
Figure 2: Supply a custom alphabet
The support for trimming off and re-adding padding should be implemented separately. This could be in a separate Base64Url class or, perhaps more appropriately, added as a new type in System.Text.Encodings.Web.
Padding is generally very cheap to deal with. Trimming walks back from the tail end of the span while there are padding characters and then slices them off. Re-padding fills the end of the span buffer with padding characters before decoding. Neither operation requires additional allocations.
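The trimming and re-padding steps described above can be sketched as below. The `PaddingSketch` class and its method names are hypothetical, and the sketch assumes the caller's buffer has room for up to two extra `=` bytes when re-padding.

```csharp
using System;

// Hypothetical helper illustrating allocation-free padding handling.
public static class PaddingSketch
{
    // Walks back over trailing '=' bytes and returns a slice without them.
    // Slicing a span does not allocate.
    public static ReadOnlySpan<byte> TrimPadding(ReadOnlySpan<byte> utf8)
    {
        int end = utf8.Length;
        while (end > 0 && utf8[end - 1] == (byte)'=')
        {
            end--;
        }
        return utf8.Slice(0, end);
    }

    // Writes '=' after the first unpaddedLength bytes until the length is a
    // multiple of 4. Returns the padded length, or -1 if the input length can
    // never be valid or the buffer is too small.
    public static int AddPadding(Span<byte> buffer, int unpaddedLength)
    {
        if ((uint)unpaddedLength > (uint)buffer.Length)
        {
            throw new ArgumentOutOfRangeException(nameof(unpaddedLength));
        }

        int remainder = unpaddedLength % 4;
        if (remainder == 1)
        {
            return -1; // a single leftover character cannot encode a full byte
        }

        int paddedLength = remainder == 0 ? unpaddedLength : unpaddedLength + (4 - remainder);
        if (paddedLength > buffer.Length)
        {
            return -1;
        }

        buffer.Slice(unpaddedLength, paddedLength - unpaddedLength).Fill((byte)'=');
        return paddedLength;
    }
}
```

A decoder wrapper could call `AddPadding` on a scratch buffer and then hand the padded slice to the existing decode routine; an encoder wrapper could call `TrimPadding` on the encoded output.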
Open Questions
- Does Option 1 or Option 2 make more sense?
- To complete the cycle, it feels like there should be a type that fully handles the encoding and decoding with trimmed padding. Should that type live in System.Memory, System.Text.Encodings.Web, or perhaps some other place?