Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Emit opted-in string literals into data section as UTF8 #76036

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

jjonescz
Copy link
Member

@jjonescz jjonescz commented Nov 22, 2024

Spec: #76139

When this opt-in feature flag is enabled by users, string literals with length over the configured threshold are emitted into data section of the PE file, hence allowing users to overcome the "error CS8103: Combined length of user strings used by the program exceeds allowed limit."

VB support is not implemented in this PR, could be added in the future if there's demand.

I've manually tried this on our unit tests (e.g., Emit2 compiled standalone resulted in significantly smaller file size due to the use of UTF8 instead of UTF16; run time was the same; combining Emit2 and Emit3 into one assembly worked without emit errors). It was also tested on aspnetcore benchmarks and that didn't reveal any significant runtime/build perf changes.

Main motivation for this feature are Razor projects which can accumulate lots of string literals over time (due to how .razor file compilation works) and then customers are one day hit with this error without any prior warnings (imagine they are fixing a small bug and suddenly they cannot compile anymore without refactoring their codebase into multiple projects).

@dotnet-issue-labeler dotnet-issue-labeler bot added Area-Compilers untriaged Issues and PRs which have not yet been triaged by a lead labels Nov 22, 2024
@jjonescz jjonescz force-pushed the DataSectionStringLiterals branch from 16f5293 to 43aa343 Compare November 22, 2024 16:42
@jjonescz jjonescz force-pushed the DataSectionStringLiterals branch from 43aa343 to e9e701d Compare November 25, 2024 13:45
@jjonescz jjonescz marked this pull request as ready for review November 25, 2024 15:25
@jjonescz jjonescz requested a review from a team as a code owner November 25, 2024 15:25
error CS8103: Combined length of user strings used by the program exceeds allowed limit. Try to decrease use of string literals.
```

By turning on the feature flag `utf8-string-encoding`, string literals (where possible) are instead emitted as UTF-8 data into a different section of the PE file
Copy link
Contributor

@AlekseyTs AlekseyTs Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

utf8-string-encoding

"utf8-string-literal-encoding"? #Closed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the spec, I've renamed it to data-section-string-literals which IMO best reflects what the feature is about.


namespace Microsoft.CodeAnalysis.CodeGen;

internal sealed class DataStringHolder : DefaultTypeDef, Cci.INamespaceTypeDefinition
Copy link
Contributor

@AlekseyTs AlekseyTs Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DataStringHolder

I think this type deserves a comment #Closed


public DataStringHolder(
CommonPEModuleBuilder moduleBuilder,
string dataHash,
Copy link
Contributor

@AlekseyTs AlekseyTs Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dataHash

Consider adding a comment #Closed

/// </param>
/// <returns>The field. This may have been newly created or may be an existing field previously created for the same data and alignment.</returns>
internal Cci.IFieldReference CreateDataField(ImmutableArray<byte> data, ushort alignment)
internal Cci.ITypeReference GetOrAddProxyType(int length, ushort alignment)
Copy link
Contributor

@AlekseyTs AlekseyTs Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetOrAddProxyType

The name looks meaningless to me. "GetOrAddDataFieldType"? #Closed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider renaming _proxyTypes as well


Debug.Assert(_mappedFields.IsEmpty, "We need to generate unique field names if we are synthesizing more than one.");

(_, StringField stringField) = _mappedFields.GetOrAdd((data, Alignment: 1), key =>
Copy link
Contributor

@AlekseyTs AlekseyTs Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_mappedFields

Why do we need a ConcurrentDictionary if we create at most one field? #Closed

Copy link
Member Author

@jjonescz jjonescz Nov 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could have an alternative implementation strategy that has more than one field in one class. But I can simplify the code for now.

@AlekseyTs
Copy link
Contributor

AlekseyTs commented Nov 27, 2024

@jjonescz Please add details how do we achieve the goal: what artifacts are getting generated, what APIs are needed and how they are used, what compilation phases/layers are involved and how, etc. Basically, I think we need something like an implementation speclet for the compiler feature. #Closed


fieldsBuilder.Sort(PrivateImplementationDetails.FieldComparer.Instance);

_orderedSynthesizedFields = fieldsBuilder.ToImmutableAndFree();
Copy link
Contributor

@AlekseyTs AlekseyTs Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_orderedSynthesizedFields = fieldsBuilder.ToImmutableAndFree();

It looks like we always synthesize two fields. It feels like all the code around fieldsBuilder is unnecessary complicated. #Closed

// Call `<PrivateImplementationDetails>.BytesToString(byte*, int)`.
bytesToStringHelper ??= GetOrSynthesizeBytesToStringHelper(diagnostics);
ilBuilder.EmitOpCode(ILOpCode.Call, -1);
ilBuilder.EmitToken(bytesToStringHelper, null, diagnostics);
Copy link
Contributor

@AlekseyTs AlekseyTs Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bytesToStringHelper

Given that we only ever have one field to initialize, it doesn't feel like we need this helper method. We can simply inline what it is doing #Closed


namespace Microsoft.CodeAnalysis.CodeGen;

internal sealed class DataStringHolder : DefaultTypeDef, Cci.INamespaceTypeDefinition
Copy link
Contributor

@AlekseyTs AlekseyTs Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DataStringHolder

I could be missing something, but it feels like we don't really need this type. Everything what it is doing can be done by PrivateImplementationDetail #Closed

Copy link
Member Author

@jjonescz jjonescz Nov 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate?

I see there is PrivateImplementationDetails.TryAddSynthesizedType(Cci.INamespaceTypeDefinition type) which we could definitely use, hopefully simplifying some thread safety / determinism code.

But we still need a Cci.INamespaceTypeDefinition representing each generated <S> class. That's the DataStringHolder class.

@AlekseyTs
Copy link
Contributor

AlekseyTs commented Nov 27, 2024

I stopped reviewing the PR because I think first we need to agree on the implementation strategy and have a speclet to review against. #Closed

public override IEnumerable<DataStringHolder> GetFrozenDataStringHolders()
{
Debug.Assert(_lazyDataStringHolders is null || _dataStringHoldersFrozen == 1);
return _lazyDataStringHolders?.Values ?? SpecializedCollections.EmptyEnumerable<DataStringHolder>();
Copy link
Contributor

@AlekseyTs AlekseyTs Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_lazyDataStringHolders?.Values

Is this going to produce items in a deterministic order? #Closed

@jjonescz
Copy link
Member Author

@AlekseyTs thanks for the review. Sounds like a good idea to get a spec reviewed first. Opened a separate PR for that: #76139

@AlekseyTs
Copy link
Contributor

AlekseyTs commented Dec 5, 2024

@jjonescz The current PR title: "Add feature flag for emitting string literals into data section as UTF8" sounds like the only thing this PR is doing is adding a flag. I suggest to adjust the title to reflect the main goal/purpose of the PR, which I think is to store values of string literals as Utf-8 encoded bytes in a blob, etc. By comparison to that, the feature flag is a minor detail, which, I think, can be dropped from the title if its length becomes too big. #Closed

@jjonescz jjonescz changed the title Add feature flag for emitting string literals into data section as UTF8 Emit opted-in string literals into data section as UTF8 Dec 5, 2024
@jjonescz jjonescz marked this pull request as ready for review December 25, 2024 10:24
this);
}

internal DataSectionStringType? GetOrCreateDataSectionStringLiteralType(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DataSectionStringType?

Does this function ever return null value? The name of the function implies that it doesn't.

@@ -21,6 +21,8 @@ internal interface ITokenDeferral

Cci.IFieldReference GetArrayCachingFieldForConstants(ImmutableArray<ConstantValue> constants, Cci.IArrayTypeReference arrayType, SyntaxNode syntaxNode, DiagnosticBag diagnostics);

Cci.IFieldReference? GetFieldForDataString(string text, ImmutableArray<byte> data, SyntaxNode syntaxNode, DiagnosticBag diagnostics);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetFieldForDataString

It feels like the "GetFieldForDataString" name is somewhat confusing. What is a "data string" for which we are requesting a field? It looks like we are requesting a static field that is supposed to store the value of the text parameter during at runtime. Therefore, "GetFieldForStringValue" feels better.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, If we expect this function to fail to produce a field under some conditions, consider adding "Try" prefix to the name of the function.

@@ -21,6 +21,8 @@ internal interface ITokenDeferral

Cci.IFieldReference GetArrayCachingFieldForConstants(ImmutableArray<ConstantValue> constants, Cci.IArrayTypeReference arrayType, SyntaxNode syntaxNode, DiagnosticBag diagnostics);

Cci.IFieldReference? GetFieldForDataString(string text, ImmutableArray<byte> data, SyntaxNode syntaxNode, DiagnosticBag diagnostics);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ImmutableArray data

It looks like the data parameter is redundant. Its value is defined by content of the text parameter. Passing redundant data is a possible source of bugs because it becomes responsibility of each consumer to keep data in sync. Not to mention the fact that each consumer is responsible to do more work.

@@ -21,6 +21,8 @@ internal interface ITokenDeferral

Cci.IFieldReference GetArrayCachingFieldForConstants(ImmutableArray<ConstantValue> constants, Cci.IArrayTypeReference arrayType, SyntaxNode syntaxNode, DiagnosticBag diagnostics);

Cci.IFieldReference? GetFieldForDataString(string text, ImmutableArray<byte> data, SyntaxNode syntaxNode, DiagnosticBag diagnostics);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetFieldForDataString

It looks like this API is not consumed from any shared code. Therefore, it looks unnecessary to have it in this interface.

// Sort proxy types.
_orderedProxyTypes = _proxyTypes.OrderBy(kvp => kvp.Key.Size).ThenBy(kvp => kvp.Key.Alignment).Select(kvp => kvp.Value).AsImmutable();
// Sort nested types.
_orderedNestedTypes = _dataFieldTypes.OrderBy(kvp => kvp.Key.Size).ThenBy(kvp => kvp.Key.Alignment).Select(kvp => kvp.Value).OfType<ExplicitSizeStruct>()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.OfType()

The purpose of OfType API is to filter the input by type. I do not think that we intend to do that here. If our goal is to do a simple type cast, there is a Cast helper, but, I think it would be more efficient to perform the cast in the lambda passed to the preceding Select call.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to strengthen the type of values stored in _dataFieldTypes to ExplicitSizeStruct or to Cci.INestedTypeDefinition?

/// stored and what alignment might be required.
/// </param>
/// <returns>The field. This may have been newly created or may be an existing field previously created for the same data and alignment.</returns>
internal MappedField CreateDataField(ImmutableArray<byte> data, ushort alignment)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MappedField

Is there a specific reason for changing return type of this function?

{
_name = name;
_systemObject = systemObject;
_containingType = containingType;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_containingType

Would it make sense to use PrivateImplementationDetails for this field and get rid of the _systemObject field? Also we could avoid the need to pass systemString and bytesToStringHelper explicitly.

if (!tryGetWellKnownTypeMember(compilation, syntaxNode, diagnostics, WellKnownMember.System_Text_Encoding__get_UTF8, out var encodingUtf8) |
!tryGetWellKnownTypeMember(compilation, syntaxNode, diagnostics, WellKnownMember.System_Text_Encoding__GetString, out var encodingGetString))
{
return new DefaultMethodDef(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return new DefaultMethodDef(

I am not sure what is going on here. Are we creating a "bogus" method definition in order to recover from an error? I think it would be preferred to not do that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can make it a responsibility of the only consumer of this functionality, the TryEmitStringLiteralAsUtf8Encoded, to check the presence/goodness of required APIs and to report appropriate diagnostics. In this file, we could assume that relevant APIs won't be called unless the helpers are good. I think if we do that, the logic in this file can be simplified, changes to CommonMessageProvider can be reverted, etc.

public override bool IsSpecialName => true;
}

file class DefaultMethodDef(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DefaultMethodDef

I do not think we have a concept of "default method definition" and I think it would be better to not introduce one

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like we already have RootModuleStaticConstructor that implements Cci.IMethodDefinition, Cci.IMethodBody. Let's extract a common abstract base "MethodDefinition", rename RootModuleStaticConstructor to StaticConstructor and reuse the new base across StaticConstructor and BytesToStringHelper.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While extracting common bases, I recommend extract/copy/paste code from pre-existing types, without changing relative order of members. That is going to make code review much easier.


Debug.Assert(encodingUtf8 is not null && encodingGetString is not null);

var ilBuilder = new ILBuilder(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

var ilBuilder = new ILBuilder(

I think it would be better to encapsulate the IL emit logic in BytesToStringHelper type.

WellKnownMember member,
[NotNullWhen(returnValue: true)] out Cci.IMethodReference? result)
{
if (compilation.CommonGetWellKnownTypeMember(member) is { } symbol)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (compilation.CommonGetWellKnownTypeMember(member) is { } symbol)

In general, I think there is no guarantee that the symbol returned by CommonGetWellKnownTypeMember is good. We should find a way to do what Binder.GetWellKnownTypeMember is doing.

@@ -449,32 +599,40 @@ public Cci.IUnitReference GetUnit(EmitContext context)

public string NamespaceName => string.Empty;

private static string DataToHex(ImmutableArray<byte> data)
internal static string DataToHex(ImmutableArray<byte> data)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

internal

Is the accessibility change necessary?

internal static string DataToHexViaXxHash128(ImmutableArray<byte> data)
{
Span<byte> hash = stackalloc byte[sizeof(ulong) * 2];
var bytesWritten = XxHash128.Hash(data.AsSpan(), hash);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

var

Consider using an explicit type

@@ -484,7 +642,7 @@ static void toHex(ImmutableArray<byte> source, Span<char> destination)
static char hexchar(int x) => (char)((x <= 9) ? (x + '0') : (x + ('A' - 10)));
}

private sealed class FieldComparer : IComparer<SynthesizedStaticField>
internal sealed class FieldComparer : IComparer<SynthesizedStaticField>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

internal

Is the accessibility change necessary?

@@ -502,7 +660,7 @@ public int Compare(SynthesizedStaticField? x, SynthesizedStaticField? y)
}
}

private sealed class DataAndUShortEqualityComparer : EqualityComparer<(ImmutableArray<byte> Data, ushort Value)>
internal sealed class DataAndUShortEqualityComparer : EqualityComparer<(ImmutableArray<byte> Data, ushort Value)>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

internal

Is the accessibility change necessary?

///
/// https://github.com/dotnet/roslyn/blob/main/docs/features/string-literals-data-section.md
/// </summary>
internal sealed class DataSectionStringType : DefaultTypeDef, Cci.INestedTypeDefinition
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DefaultTypeDef

I think it would be good to extract a common abstract base "NestedTypeDefinition" to share across ExplicitSizeStruct and DataSectionStringType

var staticConstructor = synthesizeStaticConstructor(module, containingType, dataField, stringField, bytesToStringHelper, diagnostics);

_field = stringField;
_fields = [_field];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_field

It looks like the _field is redundant, we can get its value from _fields

this);
}

internal DataSectionStringType? GetOrCreateDataSectionStringLiteralType(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetOrCreateDataSectionStringLiteralType

It doesn't look like anyone is interested in the type. The only consumer is interested in a field. I think we need to adjust this API accordingly.

@@ -911,4 +1158,192 @@ public sealed override int GetHashCode()
throw Roslyn.Utilities.ExceptionUtilities.Unreachable();
}
}

/// <summary>
/// A helper method used in static constructor of <see cref="DataSectionStringType"/>.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A helper method used in static constructor of .

I think it would be good to include information what signature this method is supposed to have.

public override Cci.ITypeReference GetType(EmitContext context) => context.Module.GetPlatformType(_type, context);
}

file abstract class ParameterDefinitionBase(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ParameterDefinitionBase

Consider whether there is a benefit in extracting and sharing a common base with ReturnValueParameter.

@@ -1088,6 +1095,7 @@ internal PrivateImplementationDetails GetPrivateImplClass(TSyntaxNode syntaxNode
this.GetSpecialType(SpecialType.System_Int16, syntaxNodeOpt, diagnostics),
this.GetSpecialType(SpecialType.System_Int32, syntaxNodeOpt, diagnostics),
this.GetSpecialType(SpecialType.System_Int64, syntaxNodeOpt, diagnostics),
this.GetSpecialType(SpecialType.System_String, syntaxNodeOpt, diagnostics),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this.GetSpecialType(SpecialType.System_String, syntaxNodeOpt, diagnostics),

Let's say we have a program that utilizes a private implementation detail class, but doesn't use string type, and the string type is not defined. Is this code going to break such a program?

@@ -415,7 +415,7 @@ ImmutableArray<ExceptionHandlerRegion> ExceptionRegions
/// <summary>
/// Debugging information associated with a MoveNext method of a state machine.
/// </summary>
StateMachineMoveNextBodyDebugInfo MoveNextBodyInfo { get; }
StateMachineMoveNextBodyDebugInfo? MoveNextBodyInfo { get; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StateMachineMoveNextBodyDebugInfo? MoveNextBodyInfo { get; }

Are changes in this file primarily related to the goal of the PR?

@@ -101,6 +104,7 @@ public CodeGenerator(
_builder = builder;
_module = moduleBuilder;
_diagnostics = diagnostics;
_dataSectionStringLiteralThreshold = compilation.DataSectionStringLiteralThreshold;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_dataSectionStringLiteralThreshold = compilation.DataSectionStringLiteralThreshold;

Would it make sense to cache this information on the Compilation instead?

@@ -87,7 +89,8 @@ public CodeGenerator(
PEModuleBuilder moduleBuilder,
BindingDiagnosticBag diagnostics,
OptimizationLevel optimizations,
bool emittingPdb)
bool emittingPdb,
Compilation compilation)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Compilation compilation

It looks like Compilation is available from PEModuleBuilder

{
var utf8 = new System.Text.UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

var utf8 = new System.Text.UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

Is there a specific reason behind caching this value? Would it make sense to initialize the cache on demand rather than using a field initializer?

@AlekseyTs
Copy link
Contributor

Done with review pass (commit 2), tests are not looked at.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area-Compilers Feature - String Literals in Data Section as UTF8 untriaged Issues and PRs which have not yet been triaged by a lead
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants