Fixes #7271 AOT for ML.Tokenizers #7272

Merged 3 commits on Oct 17, 2024


euju-ms (Contributor) commented Oct 16, 2024

Fixes #7271

This PR makes the Microsoft.ML.Tokenizers project AOT compatible.

Microsoft.ML.Tokenizers now uses a SourceGenerationContext to deserialize JSON.

I had to create a helper class, Vocabulary, in order to register a JsonConverter for it.
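
For readers unfamiliar with the pattern, here is a minimal sketch of System.Text.Json source generation. It is not the code from this PR: the SketchJsonContext name, the Dictionary<string, int> payload, and the sample JSON are assumptions chosen for illustration, and the PR's actual SourceGenerationContext and Vocabulary types are not reproduced here.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical vocabulary payload standing in for a tokenizer's vocab file.
using var stream = new MemoryStream(
    Encoding.UTF8.GetBytes("{\"hello\": 0, \"world\": 1}"));

// Passing the generated JsonTypeInfo selects the overload that is not
// annotated with RequiresDynamicCode / RequiresUnreferencedCode, so the
// call does not trigger IL3050 or IL2026.
var vocab = JsonSerializer.Deserialize(
    stream, SketchJsonContext.Default.DictionaryStringInt32);

Console.WriteLine(vocab is null ? "no data" : $"entries: {vocab.Count}");

// Hypothetical context (name is an assumption). The source generator emits
// serialization metadata at compile time for each type listed in a
// [JsonSerializable] attribute.
[JsonSerializable(typeof(Dictionary<string, int>))]
internal partial class SketchJsonContext : JsonSerializerContext
{
}
```

The key difference from the reflection-based overloads is that the type metadata is produced at compile time, which is what allows the trimmer and the AOT compiler to analyze the call statically.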

Before the change, the calls to JsonSerializer.Deserialize and JsonSerializer.DeserializeAsync produced the following AOT and trimming diagnostics:

  Microsoft.ML.Tokenizers failed with 8 error(s) (2.6s)
    C:\Users\euju\RiderProjects\ml\src\Microsoft.ML.Tokenizers\Model\EnglishRobertaTokenizer.cs(173,25): error IL3050: Using member 'System.Text.Json.JsonSerializer.Deserialize<TValue>(Stream, JsonSerializerOptions)' which has 'RequiresDynamicCodeAttribute' can break functionality when AOT compiling. JSON serialization and deserialization might require types that cannot be statically analyzed and might need runtime code generation. Use System.Text.Json source generation for native AOT applications.
    C:\Users\euju\RiderProjects\ml\src\Microsoft.ML.Tokenizers\Model\EnglishRobertaTokenizer.cs(173,25): error IL2026: Using member 'System.Text.Json.JsonSerializer.Deserialize<TValue>(Stream, JsonSerializerOptions)' which has 'RequiresUnreferencedCodeAttribute' can break functionality when trimming application code. JSON serialization and deserialization might require types that cannot be statically analyzed. Use the overload that takes a JsonTypeInfo or JsonSerializerContext, or make sure all of the required types are preserved.
    C:\Users\euju\RiderProjects\ml\src\Microsoft.ML.Tokenizers\Model\BPETokenizer.cs(763,59): error IL3050: Using member 'System.Text.Json.JsonSerializer.DeserializeAsync<TValue>(Stream, JsonSerializerOptions, CancellationToken)' which has 'RequiresDynamicCodeAttribute' can break functionality when AOT compiling. JSON serialization and deserialization might require types that cannot be statically analyzed and might need runtime code generation. Use System.Text.Json source generation for native AOT applications.
    C:\Users\euju\RiderProjects\ml\src\Microsoft.ML.Tokenizers\Model\BPETokenizer.cs(764,53): error IL3050: Using member 'System.Text.Json.JsonSerializer.Deserialize<TValue>(Stream, JsonSerializerOptions)' which has 'RequiresDynamicCodeAttribute' can break functionality when AOT compiling. JSON serialization and deserialization might require types that cannot be statically analyzed and might need runtime code generation. Use System.Text.Json source generation for native AOT applications.
    C:\Users\euju\RiderProjects\ml\src\Microsoft.ML.Tokenizers\Model\BPETokenizer.cs(763,59): error IL2026: Using member 'System.Text.Json.JsonSerializer.DeserializeAsync<TValue>(Stream, JsonSerializerOptions, CancellationToken)' which has 'RequiresUnreferencedCodeAttribute' can break functionality when trimming application code. JSON serialization and deserialization might require types that cannot be statically analyzed. Use the overload that takes a JsonTypeInfo or JsonSerializerContext, or make sure all of the required types are preserved.
    C:\Users\euju\RiderProjects\ml\src\Microsoft.ML.Tokenizers\Model\BPETokenizer.cs(764,53): error IL2026: Using member 'System.Text.Json.JsonSerializer.Deserialize<TValue>(Stream, JsonSerializerOptions)' which has 'RequiresUnreferencedCodeAttribute' can break functionality when trimming application code. JSON serialization and deserialization might require types that cannot be statically analyzed. Use the overload that takes a JsonTypeInfo or JsonSerializerContext, or make sure all of the required types are preserved.
    C:\Users\euju\RiderProjects\ml\src\Microsoft.ML.Tokenizers\Model\CodeGenTokenizer.cs(1771,25): error IL3050: Using member 'System.Text.Json.JsonSerializer.Deserialize<TValue>(Stream, JsonSerializerOptions)' which has 'RequiresDynamicCodeAttribute' can break functionality when AOT compiling. JSON serialization and deserialization might require types that cannot be statically analyzed and might need runtime code generation. Use System.Text.Json source generation for native AOT applications.
    C:\Users\euju\RiderProjects\ml\src\Microsoft.ML.Tokenizers\Model\CodeGenTokenizer.cs(1771,25): error IL2026: Using member 'System.Text.Json.JsonSerializer.Deserialize<TValue>(Stream, JsonSerializerOptions)' which has 'RequiresUnreferencedCodeAttribute' can break functionality when trimming application code. JSON serialization and deserialization might require types that cannot be statically analyzed. Use the overload that takes a JsonTypeInfo or JsonSerializerContext, or make sure all of the required types are preserved.

After the change, these warnings no longer appear :)

Note that the netstandard2.0 target framework is not AOT compatible, as trimming is only supported for .NET 6 and later. To test the AOT compatibility of the Microsoft.ML.Tokenizers project, you therefore need to set the TargetFramework to net8.0.

Then you can add <PublishAot>true</PublishAot>.

For example, in Microsoft.ML.Tokenizers.csproj:

  <PropertyGroup>
    <TargetFramework>net8.0</TargetFramework>
    <Nullable>enable</Nullable>
    <IsPackable>true</IsPackable>
    <PackageDescription>Microsoft.ML.Tokenizers contains the implmentation of the tokenization used in the NLP transforms.</PackageDescription>
    <AllowUnsafeBlocks>true</AllowUnsafeBlocks>
    <PublishAot>true</PublishAot>
  </PropertyGroup>

Building the Microsoft.ML.Tokenizers project on its own should then produce warnings if the project is not AOT compatible.


euju-ms (Contributor, Author) commented Oct 16, 2024

@dotnet-policy-service agree company="Microsoft"

tarekgh (Member) commented Oct 16, 2024

@eiriktsarpalis could you please have a quick look at this change? It just switches the JSON deserialization code to use the source generator.

codecov bot commented Oct 16, 2024

Codecov Report

Attention: Patch coverage is 85.71429% with 1 line in your changes missing coverage. Please review.

Project coverage is 68.81%. Comparing base (823fc17) to head (5483e4f).
Report is 1 commit behind head on main.

Files with missing lines                                Patch %   Lines
...rosoft.ML.Tokenizers/Utils/StringSpanOrdinalKey.cs   50.00%    1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #7272   +/-   ##
=======================================
  Coverage   68.81%   68.81%           
=======================================
  Files        1461     1461           
  Lines      272405   272400    -5     
  Branches    28176    28176           
=======================================
  Hits       187442   187442           
+ Misses      77727    77725    -2     
+ Partials     7236     7233    -3     
Flag         Coverage Δ
Debug        68.81% <85.71%> (+<0.01%) ⬆️
production   63.30% <85.71%> (+<0.01%) ⬆️
test         89.07% <ø> (ø)


Files with missing lines                                  Coverage Δ
src/Microsoft.ML.Tokenizers/Model/BPETokenizer.cs         76.75% <100.00%> (-0.04%) ⬇️
.../Microsoft.ML.Tokenizers/Model/CodeGenTokenizer.cs     72.76% <100.00%> (-0.03%) ⬇️
...oft.ML.Tokenizers/Model/EnglishRobertaTokenizer.cs     74.30% <100.00%> (-0.04%) ⬇️
...rosoft.ML.Tokenizers/Utils/StringSpanOrdinalKey.cs     76.08% <50.00%> (-0.51%) ⬇️

... and 3 files with indirect coverage changes

eiriktsarpalis (Member) left a comment

Nice 👍

Co-authored-by: Eirik Tsarpalis <eirik.tsarpalis@gmail.com>
euju-ms (Contributor, Author) commented Oct 17, 2024

Thanks for the review :)

tarekgh (Member) commented Oct 17, 2024

/ba-g unrelated infrastructure failure.

@tarekgh tarekgh merged commit f385b06 into dotnet:main Oct 17, 2024
22 of 25 checks passed
@euju-ms euju-ms deleted the Tokenizers/aot branch October 17, 2024 21:18
Development

Successfully merging this pull request may close these issues.

Aot compatibility for ML.Tokenizers
3 participants