-
Notifications
You must be signed in to change notification settings - Fork 4.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Emit opted-in string literals into data section as UTF8 #76036
base: main
Are you sure you want to change the base?
Conversation
16f5293
to
43aa343
Compare
43aa343
to
e9e701d
Compare
error CS8103: Combined length of user strings used by the program exceeds allowed limit. Try to decrease use of string literals. | ||
``` | ||
|
||
By turning on the feature flag `utf8-string-encoding`, string literals (where possible) are instead emitted as UTF-8 data into a different section of the PE file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the spec, I've renamed it to data-section-string-literals
which IMO best reflects what the feature is about.
|
||
namespace Microsoft.CodeAnalysis.CodeGen; | ||
|
||
internal sealed class DataStringHolder : DefaultTypeDef, Cci.INamespaceTypeDefinition |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
public DataStringHolder( | ||
CommonPEModuleBuilder moduleBuilder, | ||
string dataHash, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// </param> | ||
/// <returns>The field. This may have been newly created or may be an existing field previously created for the same data and alignment.</returns> | ||
internal Cci.IFieldReference CreateDataField(ImmutableArray<byte> data, ushort alignment) | ||
internal Cci.ITypeReference GetOrAddProxyType(int length, ushort alignment) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider renaming _proxyTypes
as well
|
||
Debug.Assert(_mappedFields.IsEmpty, "We need to generate unique field names if we are synthesizing more than one."); | ||
|
||
(_, StringField stringField) = _mappedFields.GetOrAdd((data, Alignment: 1), key => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could have an alternative implementation strategy that has more than one field in one class. But I can simplify the code for now.
@jjonescz Please add details how do we achieve the goal: what artifacts are getting generated, what APIs are needed and how they are used, what compilation phases/layers are involved and how, etc. Basically, I think we need something like an implementation speclet for the compiler feature. #Closed |
|
||
fieldsBuilder.Sort(PrivateImplementationDetails.FieldComparer.Instance); | ||
|
||
_orderedSynthesizedFields = fieldsBuilder.ToImmutableAndFree(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Call `<PrivateImplementationDetails>.BytesToString(byte*, int)`. | ||
bytesToStringHelper ??= GetOrSynthesizeBytesToStringHelper(diagnostics); | ||
ilBuilder.EmitOpCode(ILOpCode.Call, -1); | ||
ilBuilder.EmitToken(bytesToStringHelper, null, diagnostics); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
namespace Microsoft.CodeAnalysis.CodeGen; | ||
|
||
internal sealed class DataStringHolder : DefaultTypeDef, Cci.INamespaceTypeDefinition |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you elaborate?
I see there is PrivateImplementationDetails.TryAddSynthesizedType(Cci.INamespaceTypeDefinition type)
which we could definitely use, hopefully simplifying some thread safety / determinism code.
But we still need a Cci.INamespaceTypeDefinition
representing each generated <S>
class. That's the DataStringHolder
class.
I stopped reviewing the PR because I think first we need to agree on the implementation strategy and have a speclet to review against. #Closed |
public override IEnumerable<DataStringHolder> GetFrozenDataStringHolders() | ||
{ | ||
Debug.Assert(_lazyDataStringHolders is null || _dataStringHoldersFrozen == 1); | ||
return _lazyDataStringHolders?.Values ?? SpecializedCollections.EmptyEnumerable<DataStringHolder>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@AlekseyTs thanks for the review. Sounds like a good idea to get a spec reviewed first. Opened a separate PR for that: #76139 |
@jjonescz The current PR title: "Add feature flag for emitting string literals into data section as UTF8" sounds like the only thing this PR is doing is adding a flag. I suggest to adjust the title to reflect the main goal/purpose of the PR, which I think is to store values of string literals as Utf-8 encoded bytes in a blob, etc. By comparison to that, the feature flag is a minor detail, which, I think, can be dropped from the title if its length becomes too big. #Closed |
this); | ||
} | ||
|
||
internal DataSectionStringType? GetOrCreateDataSectionStringLiteralType( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -21,6 +21,8 @@ internal interface ITokenDeferral | |||
|
|||
Cci.IFieldReference GetArrayCachingFieldForConstants(ImmutableArray<ConstantValue> constants, Cci.IArrayTypeReference arrayType, SyntaxNode syntaxNode, DiagnosticBag diagnostics); | |||
|
|||
Cci.IFieldReference? GetFieldForDataString(string text, ImmutableArray<byte> data, SyntaxNode syntaxNode, DiagnosticBag diagnostics); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels like the "GetFieldForDataString" name is somewhat confusing. What is a "data string" for which we are requesting a field? It looks like we are requesting a static field that is supposed to store the value of the text
parameter during at runtime. Therefore, "GetFieldForStringValue" feels better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, If we expect this function to fail to produce a field under some conditions, consider adding "Try" prefix to the name of the function.
@@ -21,6 +21,8 @@ internal interface ITokenDeferral | |||
|
|||
Cci.IFieldReference GetArrayCachingFieldForConstants(ImmutableArray<ConstantValue> constants, Cci.IArrayTypeReference arrayType, SyntaxNode syntaxNode, DiagnosticBag diagnostics); | |||
|
|||
Cci.IFieldReference? GetFieldForDataString(string text, ImmutableArray<byte> data, SyntaxNode syntaxNode, DiagnosticBag diagnostics); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like the data
parameter is redundant. Its value is defined by content of the text
parameter. Passing redundant data is a possible source of bugs because it becomes responsibility of each consumer to keep data in sync. Not to mention the fact that each consumer is responsible to do more work.
@@ -21,6 +21,8 @@ internal interface ITokenDeferral | |||
|
|||
Cci.IFieldReference GetArrayCachingFieldForConstants(ImmutableArray<ConstantValue> constants, Cci.IArrayTypeReference arrayType, SyntaxNode syntaxNode, DiagnosticBag diagnostics); | |||
|
|||
Cci.IFieldReference? GetFieldForDataString(string text, ImmutableArray<byte> data, SyntaxNode syntaxNode, DiagnosticBag diagnostics); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Sort proxy types. | ||
_orderedProxyTypes = _proxyTypes.OrderBy(kvp => kvp.Key.Size).ThenBy(kvp => kvp.Key.Alignment).Select(kvp => kvp.Value).AsImmutable(); | ||
// Sort nested types. | ||
_orderedNestedTypes = _dataFieldTypes.OrderBy(kvp => kvp.Key.Size).ThenBy(kvp => kvp.Key.Alignment).Select(kvp => kvp.Value).OfType<ExplicitSizeStruct>() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to strengthen the type of values stored in _dataFieldTypes
to ExplicitSizeStruct
or to Cci.INestedTypeDefinition
?
/// stored and what alignment might be required. | ||
/// </param> | ||
/// <returns>The field. This may have been newly created or may be an existing field previously created for the same data and alignment.</returns> | ||
internal MappedField CreateDataField(ImmutableArray<byte> data, ushort alignment) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{ | ||
_name = name; | ||
_systemObject = systemObject; | ||
_containingType = containingType; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (!tryGetWellKnownTypeMember(compilation, syntaxNode, diagnostics, WellKnownMember.System_Text_Encoding__get_UTF8, out var encodingUtf8) | | ||
!tryGetWellKnownTypeMember(compilation, syntaxNode, diagnostics, WellKnownMember.System_Text_Encoding__GetString, out var encodingGetString)) | ||
{ | ||
return new DefaultMethodDef( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can make it a responsibility of the only consumer of this functionality, the TryEmitStringLiteralAsUtf8Encoded
, to check the presence/goodness of required APIs and to report appropriate diagnostics. In this file, we could assume that relevant APIs won't be called unless the helpers are good. I think if we do that, the logic in this file can be simplified, changes to CommonMessageProvider
can be reverted, etc.
public override bool IsSpecialName => true; | ||
} | ||
|
||
file class DefaultMethodDef( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like we already have RootModuleStaticConstructor
that implements Cci.IMethodDefinition, Cci.IMethodBody
. Let's extract a common abstract base "MethodDefinition", rename RootModuleStaticConstructor
to StaticConstructor
and reuse the new base across StaticConstructor
and BytesToStringHelper
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While extracting common bases, I recommend extract/copy/paste code from pre-existing types, without changing relative order of members. That is going to make code review much easier.
|
||
Debug.Assert(encodingUtf8 is not null && encodingGetString is not null); | ||
|
||
var ilBuilder = new ILBuilder( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WellKnownMember member, | ||
[NotNullWhen(returnValue: true)] out Cci.IMethodReference? result) | ||
{ | ||
if (compilation.CommonGetWellKnownTypeMember(member) is { } symbol) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -449,32 +599,40 @@ public Cci.IUnitReference GetUnit(EmitContext context) | |||
|
|||
public string NamespaceName => string.Empty; | |||
|
|||
private static string DataToHex(ImmutableArray<byte> data) | |||
internal static string DataToHex(ImmutableArray<byte> data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
internal static string DataToHexViaXxHash128(ImmutableArray<byte> data) | ||
{ | ||
Span<byte> hash = stackalloc byte[sizeof(ulong) * 2]; | ||
var bytesWritten = XxHash128.Hash(data.AsSpan(), hash); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -484,7 +642,7 @@ static void toHex(ImmutableArray<byte> source, Span<char> destination) | |||
static char hexchar(int x) => (char)((x <= 9) ? (x + '0') : (x + ('A' - 10))); | |||
} | |||
|
|||
private sealed class FieldComparer : IComparer<SynthesizedStaticField> | |||
internal sealed class FieldComparer : IComparer<SynthesizedStaticField> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -502,7 +660,7 @@ public int Compare(SynthesizedStaticField? x, SynthesizedStaticField? y) | |||
} | |||
} | |||
|
|||
private sealed class DataAndUShortEqualityComparer : EqualityComparer<(ImmutableArray<byte> Data, ushort Value)> | |||
internal sealed class DataAndUShortEqualityComparer : EqualityComparer<(ImmutableArray<byte> Data, ushort Value)> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// | ||
/// https://github.com/dotnet/roslyn/blob/main/docs/features/string-literals-data-section.md | ||
/// </summary> | ||
internal sealed class DataSectionStringType : DefaultTypeDef, Cci.INestedTypeDefinition |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
var staticConstructor = synthesizeStaticConstructor(module, containingType, dataField, stringField, bytesToStringHelper, diagnostics); | ||
|
||
_field = stringField; | ||
_fields = [_field]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this); | ||
} | ||
|
||
internal DataSectionStringType? GetOrCreateDataSectionStringLiteralType( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -911,4 +1158,192 @@ public sealed override int GetHashCode() | |||
throw Roslyn.Utilities.ExceptionUtilities.Unreachable(); | |||
} | |||
} | |||
|
|||
/// <summary> | |||
/// A helper method used in static constructor of <see cref="DataSectionStringType"/>. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public override Cci.ITypeReference GetType(EmitContext context) => context.Module.GetPlatformType(_type, context); | ||
} | ||
|
||
file abstract class ParameterDefinitionBase( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -1088,6 +1095,7 @@ internal PrivateImplementationDetails GetPrivateImplClass(TSyntaxNode syntaxNode | |||
this.GetSpecialType(SpecialType.System_Int16, syntaxNodeOpt, diagnostics), | |||
this.GetSpecialType(SpecialType.System_Int32, syntaxNodeOpt, diagnostics), | |||
this.GetSpecialType(SpecialType.System_Int64, syntaxNodeOpt, diagnostics), | |||
this.GetSpecialType(SpecialType.System_String, syntaxNodeOpt, diagnostics), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -415,7 +415,7 @@ ImmutableArray<ExceptionHandlerRegion> ExceptionRegions | |||
/// <summary> | |||
/// Debugging information associated with a MoveNext method of a state machine. | |||
/// </summary> | |||
StateMachineMoveNextBodyDebugInfo MoveNextBodyInfo { get; } | |||
StateMachineMoveNextBodyDebugInfo? MoveNextBodyInfo { get; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -101,6 +104,7 @@ public CodeGenerator( | |||
_builder = builder; | |||
_module = moduleBuilder; | |||
_diagnostics = diagnostics; | |||
_dataSectionStringLiteralThreshold = compilation.DataSectionStringLiteralThreshold; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -87,7 +89,8 @@ public CodeGenerator( | |||
PEModuleBuilder moduleBuilder, | |||
BindingDiagnosticBag diagnostics, | |||
OptimizationLevel optimizations, | |||
bool emittingPdb) | |||
bool emittingPdb, | |||
Compilation compilation) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{ | ||
var utf8 = new System.Text.UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done with review pass (commit 2), tests are not looked at. |
Spec: #76139
When this opt-in feature flag is enabled by users, string literals with length over the configured threshold are emitted into data section of the PE file, hence allowing users to overcome the "error CS8103: Combined length of user strings used by the program exceeds allowed limit."
VB support is not implemented in this PR, could be added in the future if there's demand.
I've manually tried this on our unit tests (e.g., Emit2 compiled standalone resulted in significantly smaller file size due to the use of UTF8 instead of UTF16; run time was the same; combining Emit2 and Emit3 into one assembly worked without emit errors). It was also tested on aspnetcore benchmarks and that didn't reveal any significant runtime/build perf changes.
Main motivation for this feature are Razor projects which can accumulate lots of string literals over time (due to how
.razor
file compilation works) and then customers are one day hit with this error without any prior warnings (imagine they are fixing a small bug and suddenly they cannot compile anymore without refactoring their codebase into multiple projects).