-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace DV data type system with .NET standard type system. #863
Conversation
…nto typesystem # Conflicts: # src/Microsoft.ML.Api/ApiUtils.cs # src/Microsoft.ML.Data/Data/Conversion.cs # test/BaselineOutput/SingleDebug/Command/Datatypes-datatypes.txt # test/BaselineOutput/SingleRelease/Command/Datatypes-datatypes.txt # test/Microsoft.ML.Core.Tests/UnitTests/DvTypes.cs # test/Microsoft.ML.Tests/CollectionDataSourceTests.cs
…into typesystem # Conflicts: # src/Microsoft.ML.Api/ApiUtils.cs # src/Microsoft.ML.Api/DataViewConstructionUtils.cs # src/Microsoft.ML.Api/TypedCursor.cs # src/Microsoft.ML.Core/Data/DataKind.cs # src/Microsoft.ML.Core/Data/DvInt1.cs # src/Microsoft.ML.Core/Data/DvInt2.cs # src/Microsoft.ML.Core/Data/DvInt4.cs # src/Microsoft.ML.Core/Data/DvInt8.cs # src/Microsoft.ML.Core/Data/DvText.cs # src/Microsoft.ML.Core/Data/TypeUtils.cs # src/Microsoft.ML.Data/Data/Conversion.cs # src/Microsoft.ML.Data/Evaluators/BinaryClassifierEvaluator.cs # src/Microsoft.ML.Data/Evaluators/ClusteringEvaluator.cs # src/Microsoft.ML.Data/Evaluators/MultiOutputRegressionEvaluator.cs # src/Microsoft.ML.Data/Evaluators/MulticlassClassifierEvaluator.cs # src/Microsoft.ML.Data/Evaluators/RankerEvaluator.cs # src/Microsoft.ML.Data/Evaluators/RegressionEvaluatorBase.cs # test/BaselineOutput/SingleDebug/Command/Datatypes-datatypes.txt # test/BaselineOutput/SingleRelease/Command/Datatypes-datatypes.txt # test/Microsoft.ML.Core.Tests/UnitTests/DvTypes.cs # test/Microsoft.ML.Core.Tests/UnitTests/TestCSharpApi.cs # test/Microsoft.ML.Core.Tests/UnitTests/TestEntryPoints.cs # test/Microsoft.ML.StaticPipelineTesting/StaticPipeTests.cs # test/Microsoft.ML.TestFramework/TestSparseDataView.cs # test/Microsoft.ML.Tests/CollectionDataSourceTests.cs
into typesystem # Conflicts: # src/Microsoft.ML.Data/Data/Conversion.cs # src/Microsoft.ML.Data/DataLoadSave/Binary/CodecFactory.cs # src/Microsoft.ML.Data/DataLoadSave/Binary/Codecs.cs # src/Microsoft.ML.Data/DataLoadSave/Binary/UnsafeTypeOps.cs # src/Microsoft.ML.Transforms/NAReplaceUtils.cs # test/BaselineOutput/SingleDebug/Command/Datatypes-datatypes.txt # test/BaselineOutput/SingleRelease/Command/Datatypes-datatypes.txt
@@ -120,47 +120,38 @@ public bool IsBool | |||
} | |||
|
|||
/// <summary> | |||
/// Whether this type is the standard timespan type. | |||
/// Whether this type is the standard <see cref="TimeSpan"/> type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TimeSpan [](start = 57, length = 8)
TimeSpanType
, not TimeSpan
right? Certainly any instance of a ColumnType
could not be a System.TimeSpan
. #Resolved
Contracts.Assert(this == TimeSpanType.Instance); | ||
return true; | ||
Contracts.Assert((this == TimeSpanType.Instance) == (this is TimeSpanType)); | ||
return this is TimeSpanType; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a very nice simplification of some, frankly, previously ridiculous code. Thank you for adding it. #Resolved
@@ -11,4 +11,8 @@ | |||
<Folder Include="Properties\" /> | |||
</ItemGroup> | |||
|
|||
<ItemGroup> | |||
<PackageReference Include="System.Memory" Version="4.5.1" /> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we're referencing this package in many places, the version should be a variable SystemMemoryVersion
in build/Dependencies.props
. #Resolved
@@ -42,12 +42,12 @@ public static class Kinds | |||
public const string ScoreColumnSetId = "ScoreColumnSetId"; | |||
|
|||
/// <summary> | |||
/// Metadata kind that indicates the prediction kind as a string. E.g. "BinaryClassification". The value is typically a DvText. | |||
/// Metadata kind that indicates the prediction kind as a string. E.g. "BinaryClassification". The value is typically a ReadOnlyMemory. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ReadOnlyMemory [](start = 132, length = 14)
Not just any, but specifically ReadOnlyMemory<char>
. #Resolved
@@ -295,7 +295,7 @@ public static IEnumerable<int> GetColumnSet(this ISchema schema, string metadata | |||
/// Returns <c>true</c> if the specified column: | |||
/// * is a vector of length N (including 0) | |||
/// * has a SlotNames metadata | |||
/// * metadata type is VBuffer<DvText> of length N | |||
/// * metadata type is VBuffer<ReadOnlyMemory> of length N |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ReadOnlyMemory [](start = 43, length = 14)
If you wanted to read it out it would be not ReadOnlyMemory
but ReadOnlyMemory<char>
, but this is why we have <see
tags in the XML docs. #Resolved
{ | ||
|
||
/// <summary> | ||
/// This implements IEquatable's Equals method. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This implements IEquatable's Equals method. [](start = 12, length = 43)
It couldn't possibly be implementing an interface method since this is a static method on a static class, and even if it wasn't you couldn't make ReadOnlyMemory<char>
implement any interface anyway, since it is not a type under your control.
I think you meant to say it was a utility function out of which one could make an IEquatityComparer<ReadOnlyMemory<char>>
, but it's difficult to be certain. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// <summary> | ||
/// Compare equality with the given system string value. | ||
/// </summary> | ||
public static bool EqualsStr(string s, ReadOnlyMemory<char> memory) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
EqualsStr [](start = 27, length = 9)
Is this different method necessary due to perf reasons of s.AsMemory
? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
for (int i = 0; i < memory.Length; i++) | ||
{ | ||
if (memory.Span[i] != b.Span[i]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Span [](start = 27, length = 4)
Please store the spans as a variable upfront, then operate on them. These calls to .Span
property are not free. Here and everywhere.
Consider this code.
class Program
{
private static void TimeTest(string str)
{
long temp = 0;
var sw = new Stopwatch();
const int trials = 1_000_000_000;
sw.Restart();
for (int i = 0; i < trials; ++i)
temp += str[0];
sw.Stop();
Console.WriteLine($"{sw.Elapsed} on string");
temp = 0;
sw.Restart();
var mem = str.AsMemory();
for (int i = 0; i < trials; ++i)
temp += mem.Span[0];
sw.Stop();
Console.WriteLine($"{sw.Elapsed} on memory.Span");
temp = 0;
sw.Restart();
var span = mem.Span;
for (int i = 0; i < trials; ++i)
temp += span[0];
sw.Stop();
Console.WriteLine($"{sw.Elapsed} on a stored span");
}
private static void Main(string[] args)
{
TimeTest("hello");
}
}
When compiled and run in .NET Core 2.1, in a release build, the timings are this:
00:00:00.4667924 on string
00:00:01.8801733 on memory.Span
00:00:00.2327934 on a stored span
``` #Resolved
yield break; | ||
} | ||
|
||
int ichMin = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ichMin [](start = 16, length = 6)
As far as I see both ichMin
and ichLim
could, here and most methods in this file, be replaced with direct usage of the 0
literal and memory.Length
(or, more precisely, span.Length
since we should be storing the span upfront, see above) literal. This would both simplify the code and make it more clear.
Look at it this way. If we had code over an array a
that looked like this:
for (int i=0; i<a.Length; ++i)
DoSomething(a[i]);
we certainly would not welcome any movement to change the code to this.
int min = 0;
int lim = a.Length;
for (int i=min; i<lim; ++i)
DoSomething(a[i]);
That's less clear and needlessly verbose. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ichLim might be worth keeping since we don't want to call memory.Length multiple times. There are places where ichMin is incremented and even there we should keep it.
In reply to: 216126883 [](ancestors = 216126883)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ichLim might be worth keeping since we don't want to call memory.Length multiple times.
Why? Are you under the impression that calling this on a ReadOnlySpan<char>
is a perf hit? Have you checked?
Just reactivating, since it reads like you aren't going to do this.
In reply to: 216189476 [](ancestors = 216189476,216126883)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I did not check but based on your comment for Span(and other places in the codebase I have seen such fields cached locally) I assumed it might also be the case for Length but may be I'm wrong. I will replace ichLim as well.
In reply to: 216192013 [](ancestors = 216192013,216189476,216126883)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I did not check but based on your comment for Span(and other places in the codebase I have seen such fields cached locally) I assumed it might also be the case for Length but may be I'm wrong. I will replace ichLim as well.
Why would you? Are you under the impression that an array .Length
involves some heavy computation as well?
The point of caching is to avoid computation or memory allocations. In this case .Length
is an immutable property of an immutable struct. By caching its value you're storing more on the stack.
The point is that a call to Span
involves some actual work whereas Length
does not.
In reply to: 216192140 [](ancestors = 216192140,216192013,216189476,216126883)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Span doing additional computation is also concerning to me. We need to take this up with the .NET team and provide this feedback.
In reply to: 216194473 [](ancestors = 216194473,216192140,216192013,216189476,216126883)
} | ||
|
||
// Note that we don't use any fields of "this" here in case one | ||
// of the out parameters is the same as "this". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copy pasta detected. #Resolved
} | ||
|
||
/// <summary> | ||
/// Splits this instance on the left-most occurrence of an element of separators character array and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this instance [](start = 19, length = 13)
This is not an instance method, so saying "this instance" is unclear. A <paramref
however would not go amiss. #Resolved
|
||
/// <summary> | ||
/// Splits this instance on the left-most occurrence of an element of separators character array and | ||
/// produces the left and right ReadOnlyMemory values. If this instance does not contain any of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ReadOnlyMemory [](start = 40, length = 14)
Another textual reference to a non-generic ReadOnlyMemory
structure, which does not exist. I think these can be replaced more or less well with <see cref="ReadOnlyMemory{char}"/>
. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// REVIEW: Can this be faster? | ||
private static bool ContainsChar(char ch, char[] rgch) | ||
{ | ||
Contracts.CheckNonEmpty(rgch, nameof(rgch)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Contracts.CheckNonEmpty(rgch, nameof(rgch)); [](start = 12, length = 44)
Since there's that review comment (though I suspect it was just copy/pasted over), yes, this can be faster.
First this is a private method that you're calling repeatedly in tight loops in the split code. An assert on the non-emptyness of the input is appropriate for sure, and you ought to check that the code is unreachable in any situation where the separators fed in should not be null, but other than that it seems fine. A check is code run all the time even in a release build, which is not needed here. #Resolved
…to typesystem # Conflicts: # src/Microsoft.ML.Transforms/Text/WordEmbeddingsTransform.cs # test/Microsoft.ML.Tests/Scenarios/Api/Estimators/Visibility.cs # test/Microsoft.ML.Tests/Scenarios/Api/Visibility.cs
…to typesystem # Conflicts: # src/Microsoft.ML.Transforms/NAReplaceTransform.cs # src/Microsoft.ML.Transforms/NAReplaceUtils.cs
That's not really the same thing. When you test for exceptions you should verify the exception type and optionally you can verify its message. The code could start throwing In reply to: 422599427 [](ancestors = 422599427,420677248) Refers to: test/Microsoft.ML.TestFramework/DataPipe/Parquet.cs:36 in d45bc2c. [](commit_id = d45bc2c, deletion_comment = False) |
By that logic the code could also throw another exception of the same type and even with the same message..... :) In reply to: 422847966 [](ancestors = 422847966,422599427,420677248) Refers to: test/Microsoft.ML.TestFramework/DataPipe/Parquet.cs:36 in d45bc2c. [](commit_id = d45bc2c, deletion_comment = False) |
/// <summary> | ||
/// Returns a <see cref="ReadOnlyMemory{T}"/> of <see cref="char"/> with trailing whitespace trimmed. | ||
/// </summary> | ||
public static ReadOnlyMemory<char> TrimEndWhiteSpace(ReadOnlyMemory<char> memory, ReadOnlySpan<char> span) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this method is a little awkward and should be removed. It is assuming that the memory
and span
are pointing to the same location. I see it is only called in 1 spot, which can just be moved to the above overload that only takes a ROM<char>
. #Resolved
{ | ||
Contracts.Assert(scan.IchMinBuf <= scan.IchMinNext && scan.IchMinNext <= scan.IchLimBuf); | ||
|
||
var text = scan.TextBuf; | ||
int ichLim = scan.IchLimBuf; | ||
int ichCur = scan.IchMinNext; | ||
|
||
//var span = text.Span; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit) can be removed. #Resolved
return pool.Get(memory); | ||
} | ||
|
||
public static void AddToStringBuilder(ReadOnlyMemory<char> memory, StringBuilder sb) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we can get rid of this method and just have all the callers call sb.Append(memory)
themselves. That way we don't have yet another public util method that does the same thing. #Resolved
|
||
for (; ; ichMin++) | ||
public static Result Parse(out Single value, ReadOnlySpan<char> span) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While we are in here fixing this, can we switch the parameters around so the out
variable is at the end? #Resolved
I think this is looking really good. Just a couple last comments to address and I think this will be ready to merge. #Resolved |
Thanks for reviewing! In reply to: 422861081 [](ancestors = 422861081) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change replaces DvType system with .NET standard data type system and fixes #673
sbyte
short
int
long
bool
DateTime
DateTimeOffset
TimeSpan
ReadOnlyMemory<char>