.NET data type system instead of DvTypes #673

Closed
codemzs opened this issue Aug 13, 2018 · 8 comments · Fixed by #863

@codemzs
Member

codemzs commented Aug 13, 2018

.NET data type system instead of DvTypes

Motivation

Machine learning datasets often have missing values, and the DvType system was created to
accommodate them alongside the C# native types without increasing the memory footprint. If we
were to use Nullable<T>, we would pay for an additional HasValue boolean field plus another 3
bytes of padding for 4-byte alignment. The C# native types replaced by DvTypes are: bool by
DvBool, sbyte by DvInt1, short by DvInt2, int by DvInt4, long by DvInt8, System.DateTime by
DvDateTime, System.TimeSpan by DvTimeSpan, and string by DvText; DvDateTimeZone is a combination
of a DvDateTime and a DvInt2 offset. The float and double types already have a special value,
NaN, that can be used for missing values. The DvType system achieves a smaller memory footprint
by reserving a special value as the missing-value indicator, usually the smallest number
representable by the native type encapsulated by the DvType. For example, DvInt1's missing-value
indicator is SByte.MinValue, and for the date/time types it is a value that represents the
maximum number of ticks.
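To make the footprint difference concrete, here is a minimal sketch of the sentinel approach;
the struct name and members are illustrative, not the actual DvInt1 source:

```csharp
// A 1-byte struct: sbyte.MinValue is reserved as the missing-value sentinel,
// whereas Nullable<sbyte> needs a separate HasValue bool plus padding.
public readonly struct DvInt1Sketch
{
    private readonly sbyte _value;

    public static readonly DvInt1Sketch NA = new DvInt1Sketch(sbyte.MinValue);

    public DvInt1Sketch(sbyte value) => _value = value;

    public bool IsNA => _value == sbyte.MinValue;

    // Raw value; only meaningful when IsNA is false.
    public sbyte RawValue => _value;
}
```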

We plan to remove DvTypes to make IDataView a general commodity that can be used in other
products, and for that it would be nice if it did not have a dependency on a special type system.
If in the future we find that DvTypes were useful, we can consider exposing them natively from
the .NET platform. Once we remove DvTypes, ML.NET will use native, non-nullable C# types;
float and double can represent missing values via NaN.

Column Types

Columns in ML.NET make up the dataset, and ColumnType defines a column. At a high level there are
two kinds of column type: the first is PrimitiveType, which comprises types such as NumberType,
BoolType, TextType, DateTimeType, DateTimeZoneType, and KeyType; the second is the structured
kind, which comprises VectorType. A ColumnType is primarily made up of a Type and a DataKind.
The Type could refer to any type, but it is instantiated with a type referred to by DataKind,
an identifier for data types that comprises DvTypes, native C# types such as float and double,
and the custom big integer UInt128.
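A minimal, hypothetical model of the relationship described above; the names mirror ML.NET's,
but the bodies are illustrative only, not the actual classes:

```csharp
using System;

// An identifier for the data types a column can hold (illustrative values).
public enum DataKindSketch : byte { I1, I2, I4, I8, R4, R8, TX, BL, DT, DZ, TS, UG }

// A column type pairs a raw .NET Type with a DataKind identifier.
public abstract class ColumnTypeSketch
{
    public Type RawType { get; }
    public DataKindSketch Kind { get; }

    protected ColumnTypeSketch(Type rawType, DataKindSketch kind)
    {
        RawType = rawType;
        Kind = kind;
    }
}

// Primitive column types: numbers, booleans, text, date/time, keys.
public class PrimitiveTypeSketch : ColumnTypeSketch
{
    public PrimitiveTypeSketch(Type rawType, DataKindSketch kind) : base(rawType, kind) { }
}

// The structured kind: a vector of some primitive item type.
public sealed class VectorTypeSketch : ColumnTypeSketch
{
    public PrimitiveTypeSketch ItemType { get; }
    public int Size { get; } // 0 for variable-length vectors

    public VectorTypeSketch(PrimitiveTypeSketch itemType, int size = 0)
        : base(itemType.RawType, itemType.Kind)
    {
        ItemType = itemType;
        Size = size;
    }
}
```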

Type conversion

DvTypes define implicit and explicit conversion operators that handle type conversion.
Let's consider DvInt1 as an example:

| To | From | Current behavior |
| --- | --- | --- |
| DvInt1 | sbyte | Copy the value as-is |
| DvInt1 | sbyte? | Assign the missing value if null, otherwise copy the value as-is |
| sbyte | DvInt1 | Copy if not a missing value, otherwise throw an exception |
| sbyte? | DvInt1 | Assign null for missing values, otherwise copy over |
| DvInt1 | DvBool | Assign the missing value for a missing value, otherwise copy the value over |
| DvInt1 | DvInt2 | Cast the raw value from short to sbyte and compare it with the original value; if they are not the same, assign the missing value, otherwise the casted value |
| DvInt1 | DvInt4 | Same as above |
| DvInt1 | DvInt8 | Same as above |
| DvInt1 | float | |
| DvInt1 | double | Same as above |
| float | DvInt1 | Assign NaN for a missing value |
| double | DvInt1 | Same as above |

Similar conversion rules exist for DvInt2, DvInt4, DvInt8 and DvBool.
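A hedged sketch of the narrowing rule from the table (the names and constants are illustrative):
cast the wider raw value down, and if the round trip changes it, produce the missing value
instead:

```csharp
public static class DvConvertSketch
{
    private const sbyte Na1 = sbyte.MinValue; // DvInt1 missing-value indicator
    private const short Na2 = short.MinValue; // DvInt2 missing-value indicator

    // DvInt2 -> DvInt1: short raw value to sbyte raw value.
    public static sbyte FromDvInt2(short src)
    {
        if (src == Na2)
            return Na1;                    // missing propagates as missing
        sbyte res = unchecked((sbyte)src); // narrow the raw value
        return res == src ? res : Na1;     // changed by the cast -> missing
    }
}
```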

Logical, bitwise and numerical operators

Operations such as ==, !=, !, >, >=, <, <=, +, -, *, pow, |, and & take place only between
operands of the same DvType. They also handle missing values, and in the case of the arithmetic
operators, overflow is handled as well. Most of these overloads are implemented, but only a few
are actively used. Whenever there is an overflow, the result is the missing value, and the same
goes when one of the operands is a missing value.
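A compact sketch of the arithmetic rule just described, reusing the sentinel layout from the
earlier sketch (illustrative, not the actual DvInt1 operator): a missing operand or an overflow
yields the missing value:

```csharp
public readonly struct DvInt1OpSketch
{
    private const sbyte Na = sbyte.MinValue; // missing-value sentinel
    private readonly sbyte _v;

    public DvInt1OpSketch(sbyte v) => _v = v;
    public bool IsNA => _v == Na;

    public static DvInt1OpSketch operator +(DvInt1OpSketch a, DvInt1OpSketch b)
    {
        if (a.IsNA || b.IsNA)
            return new DvInt1OpSketch(Na);   // missing in -> missing out
        int sum = a._v + b._v;               // add in a wider type
        if (sum <= sbyte.MinValue || sum > sbyte.MaxValue)
            return new DvInt1OpSketch(Na);   // overflow (or the sentinel) -> missing
        return new DvInt1OpSketch((sbyte)sum);
    }
}
```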

Serialization

DvTypes have their own codecs for efficiently compressing data and writing it to disk. For
example, to write a DvBool to disk, two bits represent each boolean value: 0b00 is false, 0b01 is
true, and 0b10 is the missing-value indicator. Boolean values are written at the granularity of
an int32, whose 32 bits accommodate 32/2 = 16 boolean values in 4 bytes, as opposed to 1 byte per
boolean value with the naive approach, which does not even handle missing values. We can reuse
this approach to serialize bool using one bit instead of two. The DvInt* codecs need not be
changed at all; the DateTime and DvText codecs will require some changes.
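A minimal sketch of the two-bit packing described above (not the actual codec); it packs 16
tri-state values into one 32-bit word:

```csharp
using System;

public static class TwoBitBoolCodecSketch
{
    // Encoded states: 0b00 = false, 0b01 = true, 0b10 = missing.
    public static uint Pack(ReadOnlySpan<byte> states) // up to 16 entries
    {
        uint packed = 0;
        for (int i = 0; i < states.Length && i < 16; i++)
            packed |= (uint)(states[i] & 0b11) << (2 * i);
        return packed;
    }

    public static byte Unpack(uint packed, int index) // returns 0, 1, or 2
        => (byte)((packed >> (2 * index)) & 0b11);
}
```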

Intermediate Language (IL) code generation

ML.NET contains a mini compiler that generates IL code at runtime for peek and poke functions,
which get and set values on objects more performantly than reflection. Here we can use
OpCodes.Stobj to emit IL code for the DvTimeSpan, DvDateTime, DvDateTimeZone, and
ReadOnlyMemory<char> types.
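A hedged sketch of emitting such a "poke" (setter) with OpCodes.Stobj via System.Reflection.Emit;
the delegate shape and helper are illustrative, not ML.NET's actual code generator:

```csharp
using System;
using System.Reflection;
using System.Reflection.Emit;

public static class PokeEmitSketch
{
    public delegate void Poke<TObj, TVal>(TObj obj, TVal value);

    // Builds a delegate that writes a value-type field (e.g. DateTime,
    // ReadOnlyMemory<char>) on a reference-type row object.
    public static Poke<TObj, TVal> CreatePoke<TObj, TVal>(FieldInfo field)
        where TObj : class
        where TVal : struct
    {
        var dm = new DynamicMethod("poke_" + field.Name, null,
            new[] { typeof(TObj), typeof(TVal) }, typeof(PokeEmitSketch).Module, true);

        ILGenerator il = dm.GetILGenerator();
        il.Emit(OpCodes.Ldarg_0);             // load the target object
        il.Emit(OpCodes.Ldflda, field);       // push the field's address
        il.Emit(OpCodes.Ldarg_1);             // load the value to store
        il.Emit(OpCodes.Stobj, typeof(TVal)); // store the value type at the address
        il.Emit(OpCodes.Ret);

        return (Poke<TObj, TVal>)dm.CreateDelegate(typeof(Poke<TObj, TVal>));
    }
}
```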

New Behavior

  • DvInt1, DvInt2, DvInt4, DvInt8 will be replaced with sbyte, short, int and long
    respectively.

    • Conversions will conform to standard .NET conversions.
    • Types will be converted using casting; this can cause underflow and overflow, so
      behavior here is undefined. For example, casting long to sbyte assigns the low 8
      bits of the long to the sbyte. ML.NET projects are unchecked by default because
      checked is expensive, so checked is used only in the code blocks that need it
      (see the sketch after this list).
      > unchecked((sbyte)long.MaxValue)
      -1
      
    • Conversion from Text to an integer type is done by first converting the Text to a
      ulong in the case of a positive number and a long in the case of a negative number,
      and then validating that this value is within the legal bounds of the type the Text
      is being converted to. For example, the legal bounds for sbyte are -128 to 127, so
      converting "-129" or "128" will result in an exception; converting a value that is
      out of the legal bounds of long will also result in an exception.
      var c = Convert.ToSByte("129");
      Value was either too large or too small for a signed byte.
      sbyte.Parse(string, System.Globalization.NumberStyles, System.Globalization.NumberFormatInfo)
      System.Convert.ToSByte(string)
      
  • DvTimeSpan, DvDateTime and DvDateTimeZone will be replaced with TimeSpan, DateTime and
    DateTimeOffset respectively.

    • The offset in DateTimeOffset is represented as a long because it records ticks.
      Previously it was represented as a DvInt2 (short) in DvDateTimeZone because it was
      recorded in minutes, which gave it a smaller footprint on disk. With the offset
      being a long, the footprint will increase; one workaround is to convert it to
      minutes before writing and convert the minutes back to ticks when reading, but this
      might lose precision. Since DateTime is rarely used in machine learning, I'm not
      sure it is worth optimizing here.
  • DvText will be replaced with ReadOnlyMemory<char>.

    • ReadOnlyMemory<char> does not implement IEquatable<T>, and because of this it
      cannot be used as a type in GroupKeyColumnChecker in the Cursor in GroupTransform.
      The workaround is to remove the IEquatable<T> constraint on the type and instead
      branch: if the type implements IEquatable<T>, cast and call Equals; otherwise, if
      the type is ReadOnlyMemory<char>, use its utility method for equality; otherwise
      throw an exception.
    • ReadOnlyMemory<char> does not implement GetHashCode(), and because of this it
      cannot be used as a key in a dictionary in ReconcileSlotNames<T> in
      EvaluatorUtils.cs. The workaround is to use the string representation of the
      ReadOnlyMemory<char> as the key. While this wastes memory, it is not too bad: it
      happens only at the end of the evaluation phase, and the number of strings
      allocated is roughly proportional to the number of classes.
  • DvBool will be replaced with bool.

    • GetPredictedLabel and GetPredictedLabelCore would have undefined behavior when the
      score contains a missing value represented as NaN. Here we will default to false.
  • Backward compatibility when reading IDV files written with DvTypes.

    • Integers are read as they were written to disk, i.e. the minimum value of the
      corresponding data type in the case of a missing value.
    • Booleans are read using the old codec, where two bits are used per value, and
      missing values are converted to false to fit in the bool type.
    • DateTime, TimeSpan, and DateTimeZone use long and short underneath to represent
      ticks and offset, and these are converted using the integer scheme defined above.
      If the ticks or offset read from disk contain a missing value, represented as the
      minimum of the underlying type, it is converted to the default value of that type
      to prevent an exception from the DateTime, TimeSpan, or DateTimeOffset class,
      since such minimum values indicate an invalid date.
    • DvText is read as-is. Missing values converted to integer types become the minimum
      value of that integer type, and the empty string becomes the default value of that
      integer type.
  • TextLoader

    • Will throw an exception if it encounters a missing value.
    • Will convert empty strings to the default value of the type being converted to.
  • Parquet Loader

    • Will throw an exception for nullables or overflow.
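A small demo of the two conversion behaviors described in the list above (unchecked narrowing
vs. bounds-validated parsing); the output comments reflect standard .NET behavior:

```csharp
using System;

class CastBehaviorDemo
{
    static void Main()
    {
        // Unchecked narrowing keeps only the low 8 bits of the long.
        Console.WriteLine(unchecked((sbyte)long.MaxValue)); // prints -1

        // Convert.ToSByte validates bounds (-128..127) and throws instead.
        try
        {
            sbyte c = Convert.ToSByte("129");
        }
        catch (OverflowException ex)
        {
            // "Value was either too large or too small for a signed byte."
            Console.WriteLine(ex.Message);
        }
    }
}
```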

Future consideration

Introduce an option in the loaders to either throw an exception on missing values or replace them
with default values. With the current design we throw an exception on missing values for the
TextLoader and the Parquet loader, but not for IDV (the binary loader).

CC: @eerhardt @Zruty0 @Ivanidzo4ka @TomFinley @shauheen @najeeb-kazmi @markusweimer

@codemzs codemzs added the API Issues pertaining the friendly API label Aug 13, 2018
@codemzs codemzs added this to the 0818 milestone Aug 13, 2018
@codemzs codemzs self-assigned this Aug 13, 2018
@casperOne

> we can use XML serialization though it might increase the footprint on the disk.

You may want to make the serialization part pluggable (discoverable through DI) or based on a provider model.

The default should probably be JSON (currently the prevalent serialization mechanism), and if people are concerned about performance, they can implement a provider for their preferred serialization mechanism (Protobuf comes to mind, but there are other options).

Also, you may want to make sure that the new Memory/Span APIs are used in this area.

Small nitpick: the title should say ".NET types" and not "C# types", as these types are not specific to C# but are available throughout the .NET ecosystem.

@TomFinley
Contributor

This seems fine on the whole, but what is going to be done about sparsity, implicit values for sparse values, and types like int?, for example? We want sparse vectors of numeric types to have implicit values of 0, for various reasons. We've previously relied on the fact that the default of numeric types is 0. Now that the default of int? is not 0 but null, we can no longer rely on that mechanism.

@casperOne

@TomFinley

> but what is going to be done about sparsity, implicit values for sparse values, and types like int?, for example?

I could be misreading, but using int? is an option for implicit values and sparse values; it's not mandatory.

The point is to use the .NET type system (and all that it offers) instead of DvTypes, since using DvTypes means transformations in many other places (the whole point of a type system is to unify data across operations, not fragment it with other sub-type systems).

You could continue to use int with an implicit 0/default/sparse mapping if you wish, or use int? if that suits your needs better.

.NET has this out of the box, and other serialization mechanisms map easily to the .NET type system (JSON.NET, Protobuf.NET, etc).

IOW, it's a layer that doesn't need to exist, as it doesn't afford anything that doesn't already exist in the .NET type system.

@TomFinley
Contributor

TomFinley commented Aug 17, 2018

Hello @casperOne, thanks for your response and clarifications. I think perhaps I was not clear -- I'm not actually confused about the proposal, I'm pointing out a serious architectural morass this issue as written engenders. But I'll clarify what I mean a bit more.

Imagine we get rid of this DvInt* and still want NA values by using things like int?. Sparse vectors are a very important part of our architecture for reasons that are probably obvious, and in order to be useful they must have a well defined value for implicit entries. Therefore, logically we must accept one of the two following terrible options for vectors:

  1. We continue to have implicit values of default in our sparse vector. The sparse vector of length 5 {2:2, 4:5} would then logically be the dense vector {0,0,2,0,5} if it is of int, and {?,?,2,?,5} if it is of int?. That, in addition to making sparse vectors more or less practically useless for int?, makes every conversion from int to int? a densifying operation, thereby introducing perf booby traps into the code, and, most seriously, is pretty confusing.

  2. We change the implicit sparse value to no longer be default for these numeric types, but continue to have it be 0, thereby maintaining the general intuitive expectation people have that implicit values in sparse vectors are 0s. This is perhaps somewhat easier to understand from a "users" perspective, but all general code for VBuffers that might deal with these types will have to find out what the implicit value is, and adapt its code accordingly, inviting considerable code complexity.

Both of these options are awful. Our code and user code in lots of places benefits from the assumption that numeric vectors have a 0 for their implicit values. On the other hand, we also in plenty of places assume that the implicit sparse value for VBuffer<T> is default(T). Breaking either of those now formerly solid assumptions incurs a dizzying amount of engineering cost. This is both in the initial cost of the necessary transition (assuming that it is even possible to reliably do that), and I'd argue going forward makes our code unmaintainable since the issues at play are clearly so non-obvious that I have no faith whatsoever that subtle bugs won't be constantly introduced by misunderstandings about what is correct.

So if we get rid of DvInt* (because, obviously, it serves no useful purpose and exists for no reason, right? 😉), I'd rather simply not allow NA values for our built in integer types at all, and tell people if they want NA values that utility only occurs in float or double (which actually sensibly have a reserved values for NaN, unlike int). Which is probably fine. And if they really, really want it for who knows what reason, since IDV has an extensible type system they are free to do so, just far away from this codebase. It will technically break backcompat here and there in subtle ways, but since people use ints in pipelines sparingly and NA values for them even more sparingly, it's probably practically fine.

Incidentally let me make a secondary point while I'm here. As you say, .NET has a concept for NA values that's close and almost useful, except for one major problem: default(T?) == null, instead of default(T?) == default(T). I'll trust that the situation as it stands is a good choice for most .NET applications, but unfortunately that choice compromises its usability for anything dealing with numerics. (By analogy: floats have an "NA" (kinda) value with NaN, but I certainly doubt many people would consider default(float) becoming NaN a useful innovation.) It is certainly good to use .NET types where possible, but we have to use sound judgment about the logical implications of using them, even if those implications are not obvious from casual observation. And sometimes that means not using what already exists in .NET, since the implication, as here, is that it is unfit for the purpose.
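A tiny illustration of the default(T?) point above:

```csharp
using System;

class NullableDefaultDemo
{
    static void Main()
    {
        Console.WriteLine(default(int));          // 0
        Console.WriteLine(default(int?) == null); // True

        // With implicit values of default(T), the sparse vector {2:2, 4:5}
        // of length 5 densifies differently per type:
        //   int  -> {0, 0, 2, 0, 5}
        //   int? -> {null, null, 2, null, 5}
    }
}
```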

@Ivanidzo4ka
Contributor

> This seems fine on the whole, but what is going to be done about sparsity, implicit values for sparse values, and types like int?, for example? We want sparse vectors of numeric types to have implicit values of 0, for various reasons. We've previously relied on the fact that the default of numeric types is 0. Now that the default of int? is not 0 but null, we can no longer rely on that mechanism.

From what I see in the implementation of this issue and @TomFinley's comment, we completely remove nullable support for fields and properties. Which is fine if you use TextLoader, but in the case of IEnumerable -> DataView conversion it looks like a really bad decision. Imagine I, as a user, want to train a model on top of a SQL table. I can fetch data through LINQ2SQL or EF (which gives me a drag-and-drop option to generate classes and methods to get the data) as IEnumerable, wrap it in CollectionDataSource, and train on it. But only if I don't have any nullable fields in my table; as soon as I have at least one nullable field, I have no option other than to create a new class and write a conversion from the old class to the new class, which can be an extremely painful process, especially if you are in some relationship with SQL, where people can have hundreds of columns (fields).

If the only problem preventing nullable support is VBuffer and sparsity, can we change the VBuffer code to check the incoming type and, if it's nullable, set missing values to the default of the inner type?

@TomFinley
Contributor

TomFinley commented Aug 22, 2018

Hi @Ivanidzo4ka. What you are saying, I think, is that before an SQL user injects their table into our system they will have to be explicit about what null actually means in their case. This strikes me as something good, not bad -- what is meant in a database system by a null column is more often than not incredibly application specific (for evidence of this, please see the discussions over the years just among ourselves about how to interpret a null -- if we ourselves could not agree at once, what hope do people designing bespoke systems have?). Therefore the fact that we'd appear (in the prior system) to handle that case seamlessly is more misleading than helpful, frankly.

We are writing an API, and that means people are free to (and will) write their own code around us, rather than having our own mechanisms be the only things at people's disposal. Though I understand this requires a shift in perspective, in this new world sometimes the right answer is, we not only don't have to handle this case, but we absolutely should not. I think this is one of those times.

@codemzs codemzs changed the title C# native type system instead of DvTypes .NET data type system instead of DvTypes Aug 26, 2018
@shauheen shauheen modified the milestones: 0818, 0918 Aug 30, 2018
@najeeb-kazmi
Member

Benchmarking the type system changes

ReadOnlyMemory<char> is a recently introduced data type that allows managing strings without unnecessary memory allocation. Strings in C# are immutable, so a string operation such as substring copies the result to a new memory location. To prevent unnecessary allocation, ReadOnlyMemory keeps track of the substring via start and end offsets relative to the original string; hence, every substring operation allocates a constant amount of memory. With ReadOnlyMemory, if one needs to access individual elements, one does so by calling the Span property, which returns a ReadOnlySpan, a stack-only type. It turns out that the Span property is an expensive operation, and our initial benchmarks showed that pipeline runtimes regressed by ~100%. Upon further performance analysis, we decided to cache the returned ReadOnlySpan as much as we could, which brought the runtimes on par with DvText.
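A small sketch of the two behaviors discussed above: allocation-free slicing, and hoisting the
.Span access out of a loop (the workload is illustrative):

```csharp
using System;

class SpanCachingDemo
{
    static void Main()
    {
        ReadOnlyMemory<char> mem = "the quick brown fox".AsMemory();

        // Slicing allocates no new string: it only adjusts offset and length.
        ReadOnlyMemory<char> word = mem.Slice(4, 5); // "quick"

        // The Span property has a per-access cost, so cache it in a local
        // instead of calling word.Span inside the loop.
        ReadOnlySpan<char> span = word.Span;
        int vowels = 0;
        for (int i = 0; i < span.Length; i++)
        {
            if ("aeiou".IndexOf(span[i]) >= 0)
                vowels++;
        }
        Console.WriteLine(vowels); // 2
    }
}
```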

These benchmarks compare performance after these Span optimizations were done, to investigate whether we reached parity with DvText.

Datasets and pipelines

We chose datasets and pipelines to test to cover a variety of scenarios, including:

  • numeric data only
  • numeric + categorical data with categorical transform
  • numeric + categorical data with categorical and categorical hash transforms
  • categorical + text data with categorical and text transforms
  • text transform only on a very large text dataset

The table below shows the datasets and their characteristics, as well as the pipeline we executed on each dataset. All datasets were ingested in text format, which makes heavy use of DvText / ReadOnlyMemory<char>. Other data types are also involved in the pipelines, although the performance of the pipelines is dominated by DvText / ReadOnlyMemory<char>.

| Dataset | Size | Rows | Features | Pipeline | Comments |
| --- | --- | --- | --- | --- | --- |
| Criteo | 230 MB | 1M | 13 numeric, 26 categorical | `Train data={\ct01\data\Criteo\Kaggle\train-1M.txt} loader=TextLoader{ col=Label:R4:0 col=NumFeatures:R4:1-13 col=LowCardCat:TX:19,22,30,33 col=HighCardCat:TX:~ } xf=CategoricalTransform{col=LowCardCat} xf=CategoricalHashTransform{col=HighCardCat bits=16} xf=MissingValueIndicatorTransform{col=NumFeatures} xf=Concat{ col=Features:NumFeatures,LowCardCat,HighCardCat } tr=ap{iter=10} seed=1 cache=-` | Numeric + categorical features with categorical and categorical hash transforms |
| Bing Click Prediction | 3 GB | 500k | 3076 numeric | `Train data={\ct01\data\TeamOnly\NumericalDatasets\Ranking\BingClickPrediction\train-500K} loader=TextLoader{col=Label:R4:0 col=Features:R4:8-3083 header=+ quote=-} xf=NAHandleTransform{col=Features ind=-} tr=SDCA seed=1 cache=-` | Numeric features only |
| Flight Delay | 227 MB | 7M | 5 numeric, 3 categorical | `Train data={\ct01\data\PerformanceAnalysis\Data\Flight\New\FD2007train.csv} loader=TextLoader{ sep=, col=Month:R4:0 col=DayofMonth:R4:1 col=DayofWeek:R4:2 col=DepTime:R4:3 col=Distance:R4:4 col=UniqueCarrier:TX:5 col=Origin:TX:6 col=Dest:TX:7 col=Label:R4:9 header=+ } xf=CategoricalTransform{ col=UniqueCarrier col=Origin col=Dest } xf=Concat{ col=Features:Month,DayofMonth,DayofWeek,DepTime,Distance,UniqueCarrier,Origin,Dest } tr=SDCA seed=1 cache=-` | Numeric + categorical features with categorical transform |
| Wikipedia Detox | 74 MB | 160k | 1 categorical, 1 text column | `Train data={\ct01\data\SCRATCH_TO_MOVE\BinaryClassification\WikipediaDetox\toxicity_annotated_comments.merged.shuf-75MB,_160k-rows.tsv} loader=TextLoader{ quote=- sparse=- col=Label:R4:0 col=rev_id:TX:1 col=text:TX:2 col=year:TX:3 col=logged_in:BL:4 col=ns:TX:5 col=sample:TX:6 col=split:TX:7 header=+ } xf=Convert{col=logged_in type=R4} xf=CategoricalTransform{col=ns} xf=NAFilter{col=Label} xf=Term{col=Label:Label} xf=TextTransform{ col=FeaturesText:text wordExtractor=NgramExtractorTransform{ngram=2} charExtractor=NgramExtractorTransform{ngram=3} } xf=Concat{col=Features:logged_in,ns,FeaturesText} tr=OVA {p=AveragedPerceptron{iter=10}} seed=1 cache=-` | Categorical transform + text featurization |
| Amazon Reviews | 9 GB | 18M | 1 text column | `Train data={\ct01\users\prroy\dataset\cleandata_VW\Amazon_reviews_cleaned.tsv} loader=TextLoader{col=Label:TX:0 col=text:TX:1 header=+ sparse=-} xf=NAFilter{col=Label} xf=Term{col=Label:Label} xf=TextTransform{ col=Features:text wordExtractor=NgramExtractorTransform{ngram=2} charExtractor=NgramExtractorTransform{ngram=3} } tr=OVA {p=AveragedPerceptron{iter=10}} seed=1 cache=-` | Text featurization on a very large dataset |

Methodology and experimental setup

  • The two builds of ML.NET (one using DvTypes and the other using .NET data types) were built to target .NET Core 2.1.
  • Pipelines were executed from the Microsoft.ML.Console project: `dotnet MML.dll <pipeline>`
  • All pipelines were executed on Azure Standard F72s_v2 VMs running Windows Server 2016, which offer an instance isolated to dedicated hardware (Intel Xeon Platinum 8168).
  • We killed background processes that were not needed for the experiments, closed Visual Studio, and ensured that only one console window was open on the VM.
  • For each pipeline, we discarded the results of the first two runs to control for runtime variability due to cold start, keeping only the subsequent runs for analysis.

Results

We present the results of the benchmarks here. The deltas indicate the performance gap of .NET data types relative to DvTypes: negative values mean .NET data types were slower than DvTypes, and percentage deltas are relative to the mean runtime for DvTypes. Finally, we ran an independent-samples t-test with unequal variances (Welch's t-test) for the two builds and present the p-value for each test. We chose a significance threshold of 0.05; p-values below this threshold indicate significant differences.

For all pipelines except Amazon Reviews, the deltas were within 1% of the DvTypes runtime and not statistically significant. For Amazon Reviews, the delta was 1.85% of the DvTypes runtime and significant. The statistical significance is not particularly concerning here: with runtimes this long, even a small percentage difference was bound to test as significant. More importantly, the performance gap has been reduced from ~100% to within 2%. We expect performance to only improve with further optimizations in future .NET Core runtimes.

Criteo 1M

| Run # | .NET data types | DvTypes |
| --- | --- | --- |
| 1 | 12.907 | 12.634 |
| 2 | 12.635 | 12.847 |
| 3 | 12.989 | 12.546 |
| 4 | 12.708 | 12.713 |
| 5 | 12.789 | 12.463 |
| 6 | 12.565 | 12.751 |
| 7 | 12.828 | 12.73 |
| 8 | 12.688 | 12.425 |
| 9 | 12.791 | 13.009 |
| 10 | 12.858 | 12.584 |
| Mean | 12.7758 | 12.6702 |
| S.D. | 0.128720887 | 0.178014232 |
| Delta | -0.1056 | -0.83% |
| p-value | 0.073767344 | Not significant |

Flight Delay 7M

| Run # | .NET data types | DvTypes |
| --- | --- | --- |
| 1 | 52.536 | 51.562 |
| 2 | 52.667 | 52.501 |
| 3 | 52.175 | 52.475 |
| 4 | 52.076 | 51.773 |
| 5 | 54.19 | 51.786 |
| 6 | 51.678 | 52.698 |
| 7 | 52.647 | 52.338 |
| 8 | 52.426 | 52.704 |
| 9 | 51.703 | 51.214 |
| 10 | 51.742 | 52.407 |
| Mean | 52.384 | 52.1458 |
| S.D. | 0.74152 | 0.520013632 |
| Delta | -0.2382 | -0.46% |
| p-value | 0.208863 | Not significant |

Bing Click Prediction 500K

| Run # | .NET data types | DvTypes |
| --- | --- | --- |
| 1 | 222 | 221 |
| 2 | 222 | 222 |
| 3 | 220 | 223 |
| 4 | 221 | 223 |
| 5 | 220 | 220 |
| 6 | 223 | 219 |
| 7 | 222 | 222 |
| 8 | 223 | 220 |
| 9 | 223 | 223 |
| 10 | 222 | 222 |
| Mean | 221.8 | 221.5 |
| S.D. | 1.135292 | 1.433721 |
| Delta | -0.3 | -0.14% |
| p-value | 0.305291 | Not significant |

Wikipedia Detox

| Run # | .NET data types | DvTypes |
| --- | --- | --- |
| 1 | 65.992 | 65.265 |
| 2 | 66.042 | 65.308 |
| 3 | 65.6 | 67.457 |
| 4 | 65.146 | 66.011 |
| 5 | 66.196 | 65.788 |
| 6 | 65.683 | 67.611 |
| 7 | 65.498 | 65.191 |
| 8 | 65.819 | 66.636 |
| 9 | 65.896 | 65.412 |
| 10 | 66.564 | 66.381 |
| 11 | 66.392 | 66.074 |
| 12 | 65.862 | 65.155 |
| 13 | 65.958 | 64.808 |
| 14 | 66.085 | 65.157 |
| 15 | 66.085 | 66.116 |
| 16 | 66.116 | 66.189 |
| 17 | 66.086 | 65.748 |
| 18 | 66.822 | 66.066 |
| 19 | 66.227 | 65.009 |
| 20 | 65.278 | 65.911 |
| Mean | 65.96735 | 65.86465 |
| S.D. | 0.402667 | 0.758248 |
| Delta | -0.1027 | -0.16% |
| p-value | 0.29838 | Not significant |

Amazon Reviews

| Run # | .NET data types | DvTypes |
| --- | --- | --- |
| 1 | 5121 | 4992 |
| 2 | 5121 | 5016 |
| 3 | 5090 | 5036 |
| 4 | 5163 | 4981 |
| 5 | 5112 | 5003 |
| 6 | 5075 | 5008 |
| 7 | 5097 | 5022 |
| 8 | 5093 | 4991 |
| 9 | 5071 | 5040 |
| 10 | 5090 | 5019 |
| Mean | 5103.3 | 5010.8 |
| S.D. | 27.10084 | 19.46393 |
| Delta | -92.5 | -1.85% |
| p-value | 7.05E-08 | Significant |

cc: @codemzs @eerhardt @TomFinley @shauheen @markusweimer @justinormont @Zruty0 @GalOshri

@justinormont
Contributor

Thanks @najeeb-kazmi for the great benchmarks.

From a user perspective, I doubt any user would notice a runtime change this small (within 2%). And, @najeeb-kazmi, as you state, "We expect the performance to only improve with further optimizations in future .NET Core runtimes."

Do we have guesses where the main perf impact is located? This might help us create a focused benchmark which will let the DotNet team have a direct measure to optimize.


On a higher-level note: do we have any datasets with NA values for a type which no longer has NA values (within either the Features or Label)? It would be interesting to see the change in the accuracy a user would get on their first run. If the meaning of the NA is truly a missing value, I expect the NAHandleTransform would add measurable accuracy vs. auto-filling with the default for the type.
