-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
added in standard conversions from types to ReadOnlyMemory<char> #5106
Merged
Merged
Changes from all commits
Commits
Show all changes
4 commits
Select commit
Hold shift + click to select a range
1a669ad
added in standard conversions from types to ReadOnlyMemory<char>
michaelgsharp 66c21a6
fixed issues with differences in tostring of .netcore 3
michaelgsharp decb5ff
removing RunSpecificTest test attribute
michaelgsharp efcdab0
added comments into documentation about type changes
michaelgsharp File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the motivation behind this change?
My concern here is this could potentially increase garbage collector pressure, however, in the case of a transformer you could control that pressure by re-using the buffer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Microsoft.ML.Featurizers has a few transformers that have overlapping functionality with transformers that are already present in ML.NET. The motivation for this change is to bring in the those featurizers and merge their functionality with existing transformers. The ToStringTransformer only implements conversion from basic data types to string. So we are looking to merge it with the more general TypeConvertingTransformer by adding string support for it.
In a transformer, can we truly reuse the buffer? Wouldn't we hand over the reference to the string back to the caller after the conversion?
Do you see this conversion function being in the hot path to cause garbage collector pressure?
Or do you think this garbage collection concern is serious enough that we should have string conversion being done in a separate transformer with a separate API?
In reply to: 423448756 [](ancestors = 423448756)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The caller hands over a reference to the transfomer which is why the getters of the transformers take output as
ref
type, for example look at the getter for ImageLoader Transform. We reuse buffer and keep the GC pressure low, this is a fundamental principle in ML .NET that gives all the performance benefits that have been outline in KDD'2019 publication. With strings you will try to reuse the buffer when you can and only expand when needed using ReadOnlyMemory.I have a reason to believe this can be a hot path, may be if you can share benchmarks to show me this isn't by actually running performance tests by using several string conversions using this PR vs using transformer where you can reuse the buffer then it would be great. The last time we overhauled the type system we did extensive benchmarks and testing that you can see here. I expect the same anytime type system is changed or updated.
CC: @eerhardt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey Zeeshan,
In this case the caller hands over a
ReadOnlyMemory<char>
reference. From my understanding, since its readonly, we cannot modify it or the underlying buffer, and must allocate a new one. I also looked at the documentation for ReadOnlyMemory and couldn't see any way to modify it at all. Do you know of a way to modify it? Or do you know if we have any examples in code I can use as a reference? I was unable to find anything.Assuming we can't modify it and need to return a new instance, I did some testing converting an int to a string (not in ML.Net) and then returning a
ReadOnlyMemory<char>
pointing to it.The first way took a
ref ReadOnlyMemory<char>
and assigned it tovalue.ToString().AsMemory()
.The second way was to also take a
ref ReadOnlyMemory<char>
but also a refStringBuilder
and use the StringBuilder to create the string and memory.In Release mode with 214 million allocations, the first approach of
ToString().AsMemory()
took about 6 seconds on average where the second approach took about 8 seconds on average. Unless there is another faster way, which I couldn't find, the fastest way to convert an int toReadOnlyMemory<char>
is the first approach. Each approach also used about 13 MB of memory total.I have also created a benchmark in ML.Net that uses the first approach to create the
ReadOnlyMemory<char>
. It ran 10 million rows in 1.25 seconds, and the entire test only used 30 MB of memory (I am assuming the majority of this memory is from ML.Net itself as the first tests I did had more rows and used less memory). I have not done any other benchmarks tests as of yet.@codemzs does this address your concern? Or is there something else I can do to address it? Or do you know a better way to deal with the
ReadOnlyMemory<char>
?@eerhardt do you know of a better way to handle the `ReadOnlyMemory?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With ML.NET's current infrastructure, this is probably as good as you can make it. But in later versions of .NET Core (if we wanted to start targeting them), you could reshare the buffers in some cases by using the following methods:
https://docs.microsoft.com/en-us/dotnet/api/system.runtime.interopservices.memorymarshal.trygetarray
and
https://docs.microsoft.com/en-us/dotnet/api/system.int32.tryformat
The idea being something like this:
What's dangerous about this approach is that
dst
might be someone else's array that we are clobbering over in order to reuse it. To prevent that, we could write a little sentinel value at the beginning of the array to mark it as ours. Say something like 2char.MaxValue
elements. And then we could check if the array we got fromMemoryMarshal.TryGetArray
was "ours" by checking the sentinel.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we want to use the StringBuilder overloads directly above this block?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this particular case the conversion being sought is from some data type like
int
tostring
and not toStringBuilder
. If we wanted to use the above conversions, we would have to create a localStringBuilder
convert the given data type toStringBuilder
and then convert it back tostring
. @michaelgsharp has discovered this method is slower.In reply to: 428265503 [](ancestors = 428265503)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It appears that those APIs are supported on .NET Core 3.1 and 2.1 which ML.NET supports but not supported on netfx461 which ML.NET needs.
In that case, is it okay to go ahead with this PR until those APIs are fully supported on all ML.NET configs?
In reply to: 428262800 [](ancestors = 428262800)