.NET data type system instead of DvTypes

# .NET data type system instead of DvTypes
## Motivation
Machine learning datasets often have missing values. The DvType system was created to accommodate
them alongside C# native types without increasing the memory footprint. If we were to use
`Nullable<T>`, we would pay for an additional `HasValue` boolean field plus another 3 bytes for
4-byte alignment. The DvTypes and the native types they wrap are: DvBool for bool, DvInt1 for sbyte,
DvInt2 for Int16, DvInt4 for Int32, DvInt8 for Int64, DvDateTime for System.DateTime,
DvDateTimeZone as a combination of DvDateTime and a DvInt2 offset, DvTimeSpan for System.TimeSpan,
and DvText for string. Float and double types already have a special value, NaN, that can be used
for missing values. The DvType system achieves a smaller memory footprint by reserving a special
value to denote a missing value. This is usually the smallest number that can be represented by the
native type encapsulated by the DvType; for example, DvInt1's missing value indicator is
SByte.MinValue. For the date/time types, it is a value that represents the maximum ticks.
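
The sentinel idea can be sketched as follows. This is a minimal illustration, not the actual DvInt1 source: the `SentinelInt1` name and its members are hypothetical, and only the choice of `sbyte.MinValue` as the marker comes from the description above.

```csharp
using System;

// Hypothetical stand-in for DvInt1: one byte of storage, with
// sbyte.MinValue reserved as the missing-value marker.
public readonly struct SentinelInt1
{
    private const sbyte RawMissing = sbyte.MinValue;
    private readonly sbyte _raw;

    private SentinelInt1(sbyte raw) => _raw = raw;

    public static SentinelInt1 Missing => new SentinelInt1(RawMissing);

    public bool IsMissing => _raw == RawMissing;

    // A null input maps to the sentinel; no separate HasValue flag is stored.
    public static SentinelInt1 FromNullable(sbyte? value) =>
        new SentinelInt1(value ?? RawMissing);

    // The sentinel maps back to null.
    public sbyte? ToNullable() => IsMissing ? (sbyte?)null : _raw;
}
```

The trade-off in this sketch is that -128 can no longer be stored as an ordinary value, because it is indistinguishable from the missing marker.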

We plan to remove DvTypes to make IDataView a general commodity that can be used in other products,
and for this to happen it should not depend on a special type system. If in the future we find that
DvTypes were useful, we can consider exposing them natively from the .NET platform. Once DvTypes are
removed, the ML.NET platform will use native non-nullable C# types. Float or double types can be
used to represent missing values.

## Column Types 
Columns in ML.NET make up the dataset, and a `ColumnType` defines a column. At a high level there
are two kinds of column types. The first is `PrimitiveType`, which comprises types such as
`NumberType`, `BoolType`, `TextType`, `DateTimeType`, `DateTimeZoneType` and `KeyType`. The second
is the structured type, which comprises `VectorType`. A `ColumnType` is primarily made up of a
`Type` and a `DataKind`. `Type` could refer to any type, but it is instantiated with the type
referred to by `DataKind`, which is an identifier for data types that comprises DvTypes, native C#
types such as float and double, and the custom big integer UInt128.

## Type conversion
DvTypes define implicit and explicit conversion operators that handle type conversion.
Let's consider DvInt1 as an example:

| To  | From | Current behavior
|:-:|:-:|:-:
| DvInt1 | sbyte | Copy the value as it is 
| DvInt1 | sbyte? | Assign missing value if null otherwise copy the value as it is 
| sbyte | DvInt1 | Copy if not a missing value otherwise throw exception 
| sbyte? | DvInt1 | Assign null for missing values otherwise copy over 
| DvInt1 | DvBool | Assign missing value for a missing value otherwise copy value over | sbyte = bool?
| DvInt1 | DvInt2 | Cast the raw value from short to sbyte and compare it with the original value; if they differ assign the missing value, otherwise assign the cast value 
| DvInt1 | DvInt4 | Same as above
| DvInt1 | DvInt8 | Same as above 
| DvInt1 | Float |
| DvInt1 | Double | Same as above 
| Float | DvInt1 | Assign NaN for missing value 
| Double | DvInt1 | Same as above

Similar conversion rules exist for DvInt2, DvInt4, DvInt8 and DvBool. 
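
A few rows of the table can be sketched in code. This is an illustration under assumptions: the `DvStyleConversions` class and its method names are hypothetical, and it works on raw `sbyte`/`short` values with the minimum value as the missing marker rather than on the real DvTypes.

```csharp
using System;

public static class DvStyleConversions
{
    // Hypothetical sentinel mirroring the DvInt1 convention.
    public const sbyte MissingI1 = sbyte.MinValue;

    // DvInt2 -> DvInt1: cast, compare with the original, and fall back to
    // the missing value when the cast did not preserve the number.
    public static sbyte NarrowToI1(short raw)
    {
        sbyte narrowed = unchecked((sbyte)raw);
        return narrowed == raw ? narrowed : MissingI1;
    }

    // DvInt1 -> float: the missing value becomes NaN.
    public static float ToSingle(sbyte raw) =>
        raw == MissingI1 ? float.NaN : raw;

    // DvInt1 -> sbyte?: the missing value becomes null.
    public static sbyte? ToNullable(sbyte raw) =>
        raw == MissingI1 ? (sbyte?)null : raw;
}
```

In this sketch `short.MinValue` narrows to 0, which differs from the original, so a missing DvInt2-style value also comes out as missing.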

## Logical, bitwise and numerical operators
Operations such as `==`, `!=`, `!`, `>`, `>=`, `<`, `<=`, `+`,`-`,`*`,`pow`,`|`,`&` are defined
only between operands of the same DvType. They also handle missing values, and the arithmetic
operators also handle overflow. Most of these operator overloads are implemented but only a few are
actively used. Whenever there is an overflow, the result is the missing value, and the same goes
when one of the operands is a missing value.
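
As a rough sketch of the missing-value and overflow rule for addition, assuming a DvInt4-style type that reserves `int.MinValue` as its missing marker (the `DvStyleArithmetic` class is hypothetical and is not the ML.NET implementation):

```csharp
using System;

public static class DvStyleArithmetic
{
    // Hypothetical DvInt4-style sentinel.
    public const int Missing = int.MinValue;

    public static int Add(int a, int b)
    {
        // A missing operand makes the result missing.
        if (a == Missing || b == Missing)
            return Missing;

        // Do the arithmetic in a wider type so overflow can be detected.
        long wide = (long)a + b;

        // int.MinValue is reserved, so a sum that lands on it is also
        // treated as out of range.
        if (wide > int.MaxValue || wide <= int.MinValue)
            return Missing;

        return (int)wide;
    }
}
```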

## Serialization
DvTypes have their own codecs for efficiently compressing data and writing it to disk. For example,
to write DvBool to disk, two bits are used to represent a boolean value: binary 00 is false, 01 is
true and 10 is the missing value indicator. Boolean values are written at the level of an int32,
whose 32 bits can accommodate 32/2 or 16 boolean values in 4 bytes, as opposed to 1 byte per boolean
value with the naive approach, which does not even handle missing values. We can reuse this approach
to serialize bool by using one bit instead of two. The DvInt* codecs need not be changed at all. The
DateTime and DvText codecs will require some changes.
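
The two-bit packing can be sketched as below. The `TwoBitBoolPacking` class is hypothetical, and the bit order (first value in the lowest bits) is an assumption made for illustration; the actual codec may lay the bits out differently.

```csharp
using System;

public static class TwoBitBoolPacking
{
    // Two-bit codes: 00 = false, 01 = true, 10 = missing.
    public const uint False = 0, True = 1, Missing = 2;

    // Packs up to 16 two-bit codes into one 32-bit word.
    // Assumption: the first value goes into the lowest two bits.
    public static uint Pack(uint[] codes)
    {
        if (codes.Length > 16)
            throw new ArgumentException("At most 16 values fit in 32 bits.");

        uint word = 0;
        for (int i = 0; i < codes.Length; i++)
            word |= (codes[i] & 0x3u) << (2 * i);
        return word;
    }

    // Extracts the two-bit code stored at the given position.
    public static uint Unpack(uint word, int index) =>
        (word >> (2 * index)) & 0x3u;

    // Reading into a plain bool: the missing code collapses to false.
    public static bool ReadAsBool(uint word, int index) =>
        Unpack(word, index) == True;
}
```

`ReadAsBool` mirrors the backward-compatible read described later in this proposal, where a missing value has to fit into `bool`.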

## Intermediate Language(IL) code generation
ML.NET contains a mini compiler that generates IL code at runtime for peek and poke functions, which
get and set values on objects the way reflection would but in a more performant manner. Here we can
use OpCodes.Stobj to emit IL code for the `DvTimeSpan`, `DvDateTime`, `DvDateTimeZone` and
`ReadOnlyMemory<char>` types.
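
To show what emitting `Stobj` for a struct looks like, here is a much simplified sketch. The `PokeEmitter` class and the `Poke<T>` delegate are hypothetical; the real peek/poke generators in ML.NET work against fields of user objects, whereas this one only stores a value through a `ref` parameter.

```csharp
using System;
using System.Reflection.Emit;

public static class PokeEmitter
{
    // Hypothetical delegate shape: store 'value' at 'destination'.
    public delegate void Poke<T>(ref T destination, T value);

    public static Poke<T> Create<T>() where T : struct
    {
        var method = new DynamicMethod(
            "Poke_" + typeof(T).Name,
            typeof(void),
            new[] { typeof(T).MakeByRefType(), typeof(T) });

        ILGenerator il = method.GetILGenerator();
        il.Emit(OpCodes.Ldarg_0);          // address of the destination
        il.Emit(OpCodes.Ldarg_1);          // value to store
        il.Emit(OpCodes.Stobj, typeof(T)); // copy the struct to that address
        il.Emit(OpCodes.Ret);

        return (Poke<T>)method.CreateDelegate(typeof(Poke<T>));
    }
}
```

A caller would create the delegate once, for example `PokeEmitter.Create<TimeSpan>()`, and then invoke it with `poke(ref target, value)` for each row.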

# New Behavior 
* `DvInt1`, `DvInt2`, `DvInt4`, `DvInt8` will be replaced with `sbyte`, `short`, `int` and `long`
  respectively.
  * Conversions will conform to .NET standard conversions.
  * Types will be converted using casts. A cast might overflow or underflow, and the result is not
    guarded against this; for example, casting `long` to `sbyte` assigns the low 8 bits of the long
    to the sbyte. ML.NET projects are unchecked by default because checked arithmetic is expensive,
    so `checked` is used only in the code blocks where it is needed.
    ```
    > unchecked((sbyte)long.MaxValue)
    -1
    ```
  * Conversion from `Text` to an `Integer` type is done by first converting `Text` to a `long` value
    in the case of a positive number and `ulong` in the case of a negative number, and then
    validating that this value is within the legal bounds of the target type. For example, the legal
    bounds for `sbyte` are -128 to 127, so converting "-129" or "128" will result in an exception.
    Converting a value that is out of the legal bounds of the `long` type will also result in an
    exception.
    ``` 
    var c = Convert.ToSByte("129");
    Value was either too large or too small for a signed byte.
    sbyte.Parse(string, System.Globalization.NumberStyles, System.Globalization.NumberFormatInfo)
    System.Convert.ToSByte(string)
    ```

* `DvTimeSpan`, `DvDateTime` and `DvDateTimeZone` will be replaced with `TimeSpan`, `DateTime` and
  `DateTimeOffset` respectively.
  * The offset in `DateTimeOffset` is represented as a long because it records ticks. Previously it
    was represented as DvInt2 (short) in `DvDateTimeZone` because it was recorded in minutes, which
    gave it a smaller footprint on disk. With the offset being a long the footprint will increase.
    One workaround is to convert it to minutes before writing and convert the minutes back to ticks
    when reading, but this might lead to a loss in precision. Since DateTime is very rarely used in
    machine learning, I'm not sure it is worth making an optimization here.

* `DvText` will be replaced with `ReadOnlyMemory<char>`.
  * `ReadOnlyMemory<char>` does not implement `IEquatable<T>`, and due to this it cannot be used as a
    type in `GroupKeyColumnChecker` in `Cursor` in GroupTransform. The workaround is to remove the
    `IEquatable<T>` constraint on the type and branch instead: if the type implements
    `IEquatable<T>`, cast and call its `Equals` method; otherwise, if the type is
    `ReadOnlyMemory<char>`, use its utility method for equality; otherwise throw an exception.
  * `ReadOnlyMemory<char>` does not implement `GetHashCode()`, and due to this it cannot be used as a
    key in a dictionary in `ReconcileSlotNames<T>` in EvaluatorUtils.cs. The workaround is to use
    the string representation of `ReadOnlyMemory<char>` as the key. While this wastes memory, it is
    not too bad because it is only used at the end of the evaluation phase, and the number of
    strings allocated will be roughly proportional to the number of classes.

* `DvBool` will be replaced with bool.
  * `GetPredictedLabel` and `GetPredictedLabelCore` have no defined result in the case where the
    score contains a missing value represented as NaN. Here we will default to false.

* Backward compatibility when reading `IDV` files written with `DvTypes`.
  * `Integers` are read as they were written to disk, i.e minimum value of the corresponding data
      type in the case of missing value.
  * `Boolean` is read using the old codec, where two bits are used per value and missing values are
      converted to `false` to fit in `bool` type.
  * `DateTime`, `TimeSpan` and `DateTimeZone` use `long` and `short` types underneath to represent
      ticks and offset, and these are read using the `Integers` scheme defined above. When ticks or
      an offset is read and found to contain a missing value, represented as the minimum of the
      underlying type, it is converted to the default value of that type. This prevents an
      exception from the `DateTime`, `TimeSpan` or `DateTimeOffset` class, because such minimum
      values indicate an invalid date.
  * `DvText` is read as it is. Missing values when being converted to Integer types are converted to
    minimum value of that `integer` type and empty string is converted to `default` value of that
    `integer` type.

* TextLoader
  * Will throw an exception if it encounters missing value.
  * Will convert empty string to `default` values of type it is being converted to.

* Parquet Loader
  * Will throw an exception for nullables or overflow.
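
The two `ReadOnlyMemory<char>` workarounds described in the list above can be sketched as follows. The `GroupKeyEquality` class and its method names are hypothetical. Unlike the order given in the text, this sketch checks for `ReadOnlyMemory<char>` first, so that the comparison is always by characters regardless of what equality the runtime's own type provides.

```csharp
using System;

public static class GroupKeyEquality
{
    // Equality without an IEquatable<T> constraint on T.
    public static bool AreEqual<T>(T left, T right)
    {
        // Text keys: compare the characters, not the backing object.
        if (left is ReadOnlyMemory<char> l && right is ReadOnlyMemory<char> r)
            return l.Span.SequenceEqual(r.Span);

        // Other keys: use the type's own IEquatable<T> implementation.
        if (left is IEquatable<T> equatable)
            return equatable.Equals(right);

        throw new NotSupportedException(
            "Type " + typeof(T) + " cannot be used as a group key.");
    }

    // Dictionary key workaround: materialize the text as a string, which
    // hashes and compares by content at the cost of one allocation.
    public static string ToKey(ReadOnlyMemory<char> text) => text.ToString();
}
```

For example, `"hello world".AsMemory(0, 5)` and `"say hello".AsMemory(4, 5)` point into different strings but compare equal here, and both produce the key `"hello"`.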

# Future consideration
Introduce an option in the loader to choose whether to throw an exception in the case of a missing
value or just replace it with the `default` value. With the current design we will throw an
exception in the case of missing values for the Text Loader and Parquet Loader but not for IDV
(Binary Loader).
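
One possible shape for such an option, sketched for text-to-`sbyte` conversion. Everything named here is hypothetical: the `MissingValuePolicy` enum, the `TextToInteger` class, and the `"?"` token used to stand for a missing value are assumptions for illustration, not existing loader settings. The bounds check follows the parse-wide-then-validate approach described under New Behavior.

```csharp
using System;
using System.Globalization;

// Hypothetical loader option.
public enum MissingValuePolicy { Throw, UseDefault }

public static class TextToInteger
{
    public static sbyte ParseSByte(
        string text, MissingValuePolicy policy, string missingToken = "?")
    {
        // Empty text converts to the default value of the target type.
        if (text.Length == 0)
            return default;

        // Assumption: a dedicated token marks a missing value.
        if (text == missingToken)
        {
            if (policy == MissingValuePolicy.Throw)
                throw new FormatException("Missing value encountered.");
            return default;
        }

        // Parse into a wider type first, then validate the target bounds.
        long wide = long.Parse(text, NumberStyles.Integer, CultureInfo.InvariantCulture);
        if (wide < sbyte.MinValue || wide > sbyte.MaxValue)
            throw new OverflowException("Value is out of range for sbyte.");

        return (sbyte)wide;
    }
}
```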

# Benchmarking the type system changes 
### (this section was written by @najeeb-kazmi )

`ReadOnlyMemory<char>` is a recently introduced data type that allows management of strings without unnecessary memory allocation. Strings in C# are immutable, so a string operation such as `substring` copies the resulting string to a new memory location. To prevent this allocation, `ReadOnlyMemory` keeps track of the substring via start and end offsets relative to the original string, so the memory allocated for every `substring` operation is constant. To access individual elements of a `ReadOnlyMemory`, one calls the `Span` property, which returns a `ReadOnlySpan` object, a stack-only type. It turns out that this `Span` property is an expensive operation, and our initial benchmarks showed that the runtimes of the pipelines regressed by 100%. Upon further performance analysis, we decided to cache the returned `ReadOnlySpan` as much as we could, and that brought the runtimes on par with `DvText`.

These benchmarks are intended to compare performance after these optimizations on `Span` were done, in order to investigate whether we hit parity with `DvText` or not.
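
The caching pattern mentioned above can be illustrated with a small sketch. The `SpanCaching` class is hypothetical and is not taken from the ML.NET change; it only contrasts reading `Span` on every iteration with reading it once.

```csharp
using System;

public static class SpanCaching
{
    // Reads the Span property on every iteration of the loop.
    public static int CountSpacesUncached(ReadOnlyMemory<char> text)
    {
        int count = 0;
        for (int i = 0; i < text.Length; i++)
        {
            if (text.Span[i] == ' ')
                count++;
        }
        return count;
    }

    // Reads the Span property once and reuses the local.
    public static int CountSpacesCached(ReadOnlyMemory<char> text)
    {
        ReadOnlySpan<char> span = text.Span;
        int count = 0;
        for (int i = 0; i < span.Length; i++)
        {
            if (span[i] == ' ')
                count++;
        }
        return count;
    }
}
```

Both methods accept a slice such as `"the quick brown fox".AsMemory().Slice(4, 11)`, which refers to the characters of the original string without copying them.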

## Datasets and pipelines
We chose datasets and pipelines to test to cover a variety of scenarios, including:
- numeric data only
- numeric + categorical data with categorical transform
- numeric + categorical data with categorical and categorical hash transforms
- categorical + text data with categorical and text transforms
- text transform only on a very large text dataset

The table below shows the datasets and their characteristics, as well as the pipeline that we executed on each dataset. All datasets were ingested in text format, which makes heavy use of `DvText` / `ReadOnlyMemory<char>`. Other data types are also involved in the pipelines, although the performance of the pipelines is dominated by `DvText` / `ReadOnlyMemory<char>`.

| Dataset                     | Size         | Rows       | Features                            | Pipeline                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 | Comments                                                                        |
|-----------------------------|--------------|------------|-------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------|
|    Criteo                   |    230 MB    |    1M      |    13 numeric   26 categorical      |    Train    data={\\ct01\data\Criteo\Kaggle\train-1M.txt}    loader=TextLoader{   col=Label:R4:0    col=NumFeatures:R4:1-13    col=LowCardCat:TX:19,22,30,33    col=HighCardCat:TX:~   }    xf=CategoricalTransform{col=LowCardCat}    xf=CategoricalHashTransform{col=HighCardCat bits=16}   xf=MissingValueIndicatorTransform{col=NumFeatures}   xf=Concat{ col=Features:NumFeatures,LowCardCat,HighCardCat   }    tr=ap{iter=10}    seed=1    cache=-                                                                                                                                                                                                                                                                                                                 | Numeric + categorical features with categorical and categorical hash transforms |
|    Bing Click Prediction    |    3 GB      |    500k    |    3076 numeric                     |    Train    data={\\ct01\data\TeamOnly\NumericalDatasets\Ranking\BingClickPrediction\train-500K}   loader=TextLoader{col=Label:R4:0 col=Features:R4:8-3083   header=+ quote=-}   xf=NAHandleTransform{col=Features ind=-}    tr=SDCA    seed=1    cache=-                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                | Numeric features only                                                           |
|    Flight Delay             |    227 MB    |    7M      |    5 numeric   3 categorical        |    Train    data={\\ct01\data\PerformanceAnalysis\Data\Flight\New\FD2007train.csv}       loader=TextLoader{    sep=,    col=Month:R4:0    col=DayofMonth:R4:1    col=DayofWeek:R4:2    col=DepTime:R4:3    col=Distance:R4:4    col=UniqueCarrier:TX:5    col=Origin:TX:6    col=Dest:TX:7    col=Label:R4:9    header=+    }    xf=CategoricalTransform{ col=UniqueCarrier col=Origin   col=Dest }   xf=Concat{ col=Features:Month,DayofMonth,DayofWeek,DepTime,Distance,UniqueCarrier,Origin,Dest   }   tr=SDCA    seed=1    cache=-                                                                                                                                                                                                                                   | Numeric + categorical features with categorical transform                       |
|    Wikipedia Detox          |    74 MB     |    160k    |    1 categorical   1 text column    |    Train    data={\\ct01\data\SCRATCH_TO_MOVE\BinaryClassification\WikipediaDetox\toxicity_annotated_comments.merged.shuf-75MB,_160k-rows.tsv}      loader=TextLoader{   quote=-    sparse=-    col=Label:R4:0    col=rev_id:TX:1    col=text:TX:2    col=year:TX:3    col=logged_in:BL:4    col=ns:TX:5    col=sample:TX:6    col=split:TX:7    header=+   }    xf=Convert{col=logged_in type=R4}    xf=CategoricalTransform{col=ns}    xf=NAFilter{col=Label}    xf=Term{col=Label:Label}    xf=TextTransform{   col=FeaturesText:text    wordExtractor=NgramExtractorTransform{ngram=2}      charExtractor=NgramExtractorTransform{ngram=3}   }    xf=Concat{col=Features:logged_in,ns,FeaturesText}   tr=OVA {p=AveragedPerceptron{iter=10}}    seed=1    cache=-    | Categorical transform + text featurization                                      |
|    Amazon Reviews           |    9 GB      |    18M     |    1 text column                    |    Train    data={\\ct01\users\prroy\dataset\cleandata_VW\Amazon_reviews_cleaned.tsv}    loader=TextLoader{col=Label:TX:0 col=text:TX:1 header=+   sparse=-}    xf=NAFilter{col=Label}    xf=Term{col=Label:Label}    xf=TextTransform{   col=Features:text    wordExtractor=NgramExtractorTransform{ngram=2}      charExtractor=NgramExtractorTransform{ngram=3}   }    tr=OVA {p=AveragedPerceptron{iter=10}}    seed=1   cache=-                                                                                                                                                                                                                                                                                                                                      | Text featurization on a very large dataset                                      |

## Methodology and experimental setup

- The two builds of ML.NET (one using DvTypes and the other using .NET data types) were built to target .NET Core 2.1. 
- Pipelines were executed from the Microsoft.ML.Console project: `dotnet MML.dll <pipeline>`
- All pipelines were executed on Azure Standard F72s_v2 VMs running Windows Server 2016, which offer an instance isolated to dedicated hardware (Intel Xeon Platinum 8168).
- We killed background processes that were not needed to run the experiments, including closing Visual Studio, ensuring that only one console window was open on the VM.
- For each pipeline, we discarded the results of the first two runs to control for runtime variability due to a cold start, keeping only the subsequent runs for analysis.

## Results
We present the results of the benchmarks here. The deltas indicate the performance gap of .NET data types relative to DvTypes: negative values indicate slower performance of .NET data types compared to DvTypes, and percentage deltas are relative to the mean runtime for DvTypes. Finally, we did an independent samples t-test with unequal variances for the two builds, and present the p-values for each test. We chose a significance threshold of 0.05, with a smaller p-value indicating a significant difference.

We can see that for all the pipelines except the one with the Amazon Reviews dataset, the deltas were within 1% of the speed of DvTypes and were not significant. For Amazon Reviews, the delta was 1.85% of the speed of DvTypes and was significant. The statistical significance is not particularly concerning here, because with such long runtimes on this dataset even a small percentage difference was bound to come out as significant. The more important point is that the performance gap was reduced from ~100% to within 2%. We expect the performance to only improve with further optimizations in future .NET Core runtimes.
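
The summary rows in the tables below can be reproduced with a few lines of code. This is a sketch under assumptions: the `BenchmarkStats` class is hypothetical, the standard deviation is the sample (n - 1) form, and only the t statistic of the unequal-variance test is computed, since the p-value additionally needs the t-distribution.

```csharp
using System;
using System.Linq;

public static class BenchmarkStats
{
    public static double Mean(double[] xs) => xs.Average();

    // Sample standard deviation (divides by n - 1).
    public static double SampleStdDev(double[] xs)
    {
        double mean = Mean(xs);
        double sumSq = xs.Sum(x => (x - mean) * (x - mean));
        return Math.Sqrt(sumSq / (xs.Length - 1));
    }

    // Delta as used in the tables: DvTypes mean minus .NET types mean,
    // so a negative value means the .NET types build was slower.
    public static double Delta(double[] dotnetTypes, double[] dvTypes) =>
        Mean(dvTypes) - Mean(dotnetTypes);

    // Delta as a percentage of the DvTypes mean runtime.
    public static double DeltaPercent(double[] dotnetTypes, double[] dvTypes) =>
        100.0 * Delta(dotnetTypes, dvTypes) / Mean(dvTypes);

    // t statistic of the unequal-variance (Welch) two-sample test.
    public static double WelchT(double[] a, double[] b)
    {
        double sa = SampleStdDev(a), sb = SampleStdDev(b);
        double standardError = Math.Sqrt(sa * sa / a.Length + sb * sb / b.Length);
        return (Mean(a) - Mean(b)) / standardError;
    }
}
```

Feeding in the ten Criteo 1M runs for each build gives the Mean, S.D. and Delta rows of that table.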

### Criteo 1M

| Run #   | .NET data types    | DvTypes         |
|---------|--------------------|-----------------|
| 1       | 12.907             | 12.634          |
| 2       | 12.635             | 12.847          |
| 3       | 12.989             | 12.546          |
| 4       | 12.708             | 12.713          |
| 5       | 12.789             | 12.463          |
| 6       | 12.565             | 12.751          |
| 7       | 12.828             | 12.73           |
| 8       | 12.688             | 12.425          |
| 9       | 12.791             | 13.009          |
| 10      | 12.858             | 12.584          |
|         |                    |                 |
| Mean    | 12.7758            | 12.6702         |
| S.D.    | 0.128720887        | 0.178014232     |
|         |                    |                 |
| Delta   | -0.1056            | -0.83%          |
| p-value | 0.073767344        | Not significant |

### Flight Delay 7M 

| Run #   | .NET data types    | DvTypes         |
|---------|--------------------|-----------------|
| 1       | 52.536             | 51.562          |
| 2       | 52.667             | 52.501          |
| 3       | 52.175             | 52.475          |
| 4       | 52.076             | 51.773          |
| 5       | 54.19              | 51.786          |
| 6       | 51.678             | 52.698          |
| 7       | 52.647             | 52.338          |
| 8       | 52.426             | 52.704          |
| 9       | 51.703             | 51.214          |
| 10      | 51.742             | 52.407          |
|         |                    |                 |
| Mean    | 52.384             | 52.1458         |
| S.D.    | 0.74152            | 0.520013632     |
|         |                    |                 |
| Delta   | -0.2382            | -0.46%          |
| p-value | 0.208863           | Not significant |


### Bing Click Prediction 500K
| Run #   | .NET data types            | DvTypes         |
|---------|----------------------------|-----------------|
| 1       | 222                        | 221             |
| 2       | 222                        | 222             |
| 3       | 220                        | 223             |
| 4       | 221                        | 223             |
| 5       | 220                        | 220             |
| 6       | 223                        | 219             |
| 7       | 222                        | 222             |
| 8       | 223                        | 220             |
| 9       | 223                        | 223             |
| 10      | 222                        | 222             |
|         |                            |                 |
| Mean    | 221.8                      | 221.5           |
| S.D.    | 1.135292                   | 1.433721        |
|         |                            |                 |
| Delta   | -0.3                       | -0.14%          |
| p-value | 0.305291                   | Not significant |

### Wikipedia Detox
| Run #   | .NET data types | DvTypes         |
|---------|-----------------|-----------------|
| 1       | 65.992          | 65.265          |
| 2       | 66.042          | 65.308          |
| 3       | 65.6            | 67.457          |
| 4       | 65.146          | 66.011          |
| 5       | 66.196          | 65.788          |
| 6       | 65.683          | 67.611          |
| 7       | 65.498          | 65.191          |
| 8       | 65.819          | 66.636          |
| 9       | 65.896          | 65.412          |
| 10      | 66.564          | 66.381          |
| 11      | 66.392          | 66.074          |
| 12      | 65.862          | 65.155          |
| 13      | 65.958          | 64.808          |
| 14      | 66.085          | 65.157          |
| 15      | 66.085          | 66.116          |
| 16      | 66.116          | 66.189          |
| 17      | 66.086          | 65.748          |
| 18      | 66.822          | 66.066          |
| 19      | 66.227          | 65.009          |
| 20      | 65.278          | 65.911          |
|         |                 |                 |
| Mean    | 65.96735        | 65.86465        |
| S.D.    | 0.402667        | 0.758248        |
|         |                 |                 |
| Delta   | -0.1027         | -0.16%          |
| p-value | 0.29838         | Not significant |

### Amazon Reviews
| Run #   | .NET data types | DvTypes     |
|---------|-----------------|-------------|
| 1       | 5121            | 4992        |
| 2       | 5121            | 5016        |
| 3       | 5090            | 5036        |
| 4       | 5163            | 4981        |
| 5       | 5112            | 5003        |
| 6       | 5075            | 5008        |
| 7       | 5097            | 5022        |
| 8       | 5093            | 4991        |
| 9       | 5071            | 5040        |
| 10      | 5090            | 5019        |
|         |                 |             |
| Mean    | 5103.3          | 5010.8      |
| S.D.    | 27.10084        | 19.46393    |
|         |                 |             |
| Delta   | -92.5           | -1.85%      |
| p-value | 7.05E-08        | Significant |

CC: @eerhardt @Zruty0 @Ivanidzo4ka @TomFinley @shauheen @najeeb-kazmi @markusweimer 

