Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added in standard conversions from types to ReadOnlyMemory<char> #5106

Merged
merged 4 commits into from
May 28, 2020

Conversation

michaelgsharp
Copy link
Member

Currently in the standard transformations we don't support converting types to a string representation. Since all C# types support ToString() functionality, this PR adds in standard transformations to type ReadOnlyMemory<char> into the TypeConvertingTransformer.

@michaelgsharp michaelgsharp requested review from codemzs, harishsk, eerhardt and a team May 7, 2020 20:02
@michaelgsharp michaelgsharp self-assigned this May 7, 2020
@michaelgsharp
Copy link
Member Author

After syncing with @harishsk, we thought that since all C# types natively support a ToString(), that just adding it in to the standard conversions made sense. @codemzs and @eerhardt we wanted your thoughts on that. If you think its ok we will leave it as is. If not, we can move it into the aux conversions and add extra helper methods as needed to match what we have for the standard conversions.

@harishsk
Copy link
Contributor

harishsk commented May 7, 2020

@michaelgsharp To clarify: Once this PR goes in, we can remove the ToStringTransformer from Microsoft.ML.Featurizers and use the existing TypeConvertingTransformer. Is that correct?

@harishsk
Copy link
Contributor

harishsk commented May 7, 2020

@michaelgsharp Please look at the test failures. You may need to update the baselines as a result of this change.

@michaelgsharp
Copy link
Member Author

You are correct Harish, once this goes in we can remove the ToStringTransformer.

@@ -121,7 +121,7 @@ private sealed class TestStringClass
public string A;
}

[Fact]
[Fact, TestCategory("RunSpecificTest")]
Copy link
Contributor

@harishsk harishsk May 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be reverted? #Resolved

Copy link
Member Author

@michaelgsharp michaelgsharp May 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it should. Reverting now. #Resolved

@codecov
Copy link

codecov bot commented May 11, 2020

Codecov Report

Merging #5106 into master will decrease coverage by 0.04%.
The diff coverage is 77.77%.

@@            Coverage Diff             @@
##           master    #5106      +/-   ##
==========================================
- Coverage   75.61%   75.56%   -0.05%     
==========================================
  Files         993      995       +2     
  Lines      178514   179337     +823     
  Branches    19191    19294     +103     
==========================================
+ Hits       134979   135523     +544     
- Misses      38302    38554     +252     
- Partials     5233     5260      +27     
Flag Coverage Δ
#Debug 75.56% <77.77%> (-0.05%) ⬇️
#production 71.50% <66.66%> (-0.07%) ⬇️
#test 88.68% <100.00%> (+0.03%) ⬆️
Impacted Files Coverage Δ
src/Microsoft.ML.Data/Data/Conversion.cs 76.62% <66.66%> (-0.29%) ⬇️
...st/Microsoft.ML.Tests/Transformers/ConvertTests.cs 100.00% <100.00%> (ø)
...rosoft.ML.AutoML/ColumnInference/TextFileSample.cs 59.60% <0.00%> (-2.65%) ⬇️
src/Microsoft.ML.OnnxConverter/SaveOnnxCommand.cs 71.42% <0.00%> (-2.11%) ⬇️
...L.AutoML/TrainerExtensions/TrainerExtensionUtil.cs 85.15% <0.00%> (-1.75%) ⬇️
...rosoft.ML.Data/DataView/RowToRowMapperTransform.cs 92.89% <0.00%> (-0.48%) ⬇️
...cenariosWithDirectInstantiation/TensorflowTests.cs 91.56% <0.00%> (-0.32%) ⬇️
...est/Microsoft.ML.Predictor.Tests/TestPredictors.cs 70.11% <0.00%> (-0.04%) ⬇️
src/Microsoft.ML.TimeSeries/RootCauseAnalyzer.cs 56.44% <0.00%> (ø)
...crosoft.ML.TimeSeries/RootCauseLocalizationType.cs 51.19% <0.00%> (ø)
... and 14 more

@harishsk
Copy link
Contributor

Can you also please add a note in docs\code\IDataViewTypeSystem.md that we conversions to string are now supported?

public void Convert(in BL src, ref TX dst) => dst = src.ToString().AsMemory();
public void Convert(in TS src, ref TX dst) => dst = string.Format("{0:c}", src).AsMemory();
public void Convert(in DT src, ref TX dst) => string.Format("{0:o}", src).AsMemory();
public void Convert(in DZ src, ref TX dst) => string.Format("{0:o}", src).AsMemory();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the motivation behind this change?

My concern here is this could potentially increase garbage collector pressure, however, in the case of a transformer you could control that pressure by re-using the buffer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Microsoft.ML.Featurizers has a few transformers that have overlapping functionality with transformers that are already present in ML.NET. The motivation for this change is to bring in the those featurizers and merge their functionality with existing transformers. The ToStringTransformer only implements conversion from basic data types to string. So we are looking to merge it with the more general TypeConvertingTransformer by adding string support for it.

In a transformer, can we truly reuse the buffer? Wouldn't we hand over the reference to the string back to the caller after the conversion?

Do you see this conversion function being in the hot path to cause garbage collector pressure?

Or do you think this garbage collection concern is serious enough that we should have string conversion being done in a separate transformer with a separate API?


In reply to: 423448756 [](ancestors = 423448756)

Copy link
Member

@codemzs codemzs May 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The caller hands over a reference to the transfomer which is why the getters of the transformers take output as ref type, for example look at the getter for ImageLoader Transform. We reuse buffer and keep the GC pressure low, this is a fundamental principle in ML .NET that gives all the performance benefits that have been outline in KDD'2019 publication. With strings you will try to reuse the buffer when you can and only expand when needed using ReadOnlyMemory.

I have a reason to believe this can be a hot path, may be if you can share benchmarks to show me this isn't by actually running performance tests by using several string conversions using this PR vs using transformer where you can reuse the buffer then it would be great. The last time we overhauled the type system we did extensive benchmarks and testing that you can see here. I expect the same anytime type system is changed or updated.

CC: @eerhardt

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Zeeshan,

In this case the caller hands over a ReadOnlyMemory<char> reference. From my understanding, since its readonly, we cannot modify it or the underlying buffer, and must allocate a new one. I also looked at the documentation for ReadOnlyMemory and couldn't see any way to modify it at all. Do you know of a way to modify it? Or do you know if we have any examples in code I can use as a reference? I was unable to find anything.

Assuming we can't modify it and need to return a new instance, I did some testing converting an int to a string (not in ML.Net) and then returning a ReadOnlyMemory<char> pointing to it.

The first way took a ref ReadOnlyMemory<char> and assigned it to value.ToString().AsMemory().

for (int x = 0; x < NumberOfIterations; x++)
{
    // readOnlyMemory is passed in by ref
    readOnlyMemory = x.ToString().AsMemory();
}

The second way was to also take a ref ReadOnlyMemory<char> but also a ref StringBuilder and use the StringBuilder to create the string and memory.

for (int x = 0; x < NumberOfIterations; x++)
{
    // stringBuilder and readOnlyMemory are passed in by ref.
    // the stringBuilder is using Clear, which should keep the underlying buffer
    stringBuilder.Clear();
    stringBuilder.Append(x);
    readOnlyMemory = stringBuilder.ToString().AsMemory();
}

In Release mode with 214 million allocations, the first approach of ToString().AsMemory() took about 6 seconds on average where the second approach took about 8 seconds on average. Unless there is another faster way, which I couldn't find, the fastest way to convert an int to ReadOnlyMemory<char> is the first approach. Each approach also used about 13 MB of memory total.

I have also created a benchmark in ML.Net that uses the first approach to create the ReadOnlyMemory<char>. It ran 10 million rows in 1.25 seconds, and the entire test only used 30 MB of memory (I am assuming the majority of this memory is from ML.Net itself as the first tests I did had more rows and used less memory). I have not done any other benchmarks tests as of yet.

@codemzs does this address your concern? Or is there something else I can do to address it? Or do you know a better way to deal with the ReadOnlyMemory<char>?

@eerhardt do you know of a better way to handle the `ReadOnlyMemory?

Copy link
Member

@eerhardt eerhardt May 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With ML.NET's current infrastructure, this is probably as good as you can make it. But in later versions of .NET Core (if we wanted to start targeting them), you could reshare the buffers in some cases by using the following methods:

https://docs.microsoft.com/en-us/dotnet/api/system.runtime.interopservices.memorymarshal.trygetarray
and
https://docs.microsoft.com/en-us/dotnet/api/system.int32.tryformat

The idea being something like this:

public void Convert(in I4 src, ref TX dst)
{
    if (MemoryMarshal.TryGetArray(dst, out ArraySegment<char> segment))
    {
        if (src.TryFormat(segment.Array, out int charsWritten))
        {
             dst = segment.Array.AsMemory(0, charsWritten);
             return;
        }
    }

    char[] newBuffer = new char[11]; // 11 is the most characters needed to represent all 32-bit ints
    src.TryFormat(newBuffer, out int charsWritten);
    dst = newBuffer.AsMemory(0, charsWritten);
}

What's dangerous about this approach is that dst might be someone else's array that we are clobbering over in order to reuse it. To prevent that, we could write a little sentinel value at the beginning of the array to mark it as ours. Say something like 2 char.MaxValue elements. And then we could check if the array we got from MemoryMarshal.TryGetArray was "ours" by checking the sentinel.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we want to use the StringBuilder overloads directly above this block?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this particular case the conversion being sought is from some data type like int to string and not to StringBuilder. If we wanted to use the above conversions, we would have to create a local StringBuilder convert the given data type to StringBuilder and then convert it back to string. @michaelgsharp has discovered this method is slower.


In reply to: 428265503 [](ancestors = 428265503)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears that those APIs are supported on .NET Core 3.1 and 2.1 which ML.NET supports but not supported on netfx461 which ML.NET needs.
In that case, is it okay to go ahead with this PR until those APIs are fully supported on all ML.NET configs?


In reply to: 428262800 [](ancestors = 428262800)

Copy link
Contributor

@harishsk harishsk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@harishsk harishsk merged commit 8b54a7b into dotnet:master May 28, 2020
@michaelgsharp michaelgsharp deleted the string-type-conversion branch November 4, 2020 20:36
@ghost ghost locked as resolved and limited conversation to collaborators Mar 18, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants