Added standard conversions from types to `ReadOnlyMemory<char>` #5106

Merged 4 commits · May 28, 2020
20 changes: 14 additions & 6 deletions docs/code/IDataViewTypeSystem.md
@@ -288,7 +288,7 @@ true/false values. The `BooleanDataViewType` class derives from

The default value of `BL` is `false`, and it has no `NA` value.

-There is a standard conversion from `TX` to `BL`. There are standard
+There are standard conversions from `TX` to `BL`, and from `BL` to `TX`. There are standard
conversions from `BL` to all signed integer and floating point numeric types,
with `false` mapping to zero and `true` mapping to one.

@@ -332,7 +332,8 @@ values being the canonical `NA` values.

There are standard conversions from each floating-point type to the other
floating-point type. There are also standard conversions from text to each
-floating-point type and from each integer type to each floating-point type.
+floating-point type, from each floating-point type to text, and from each
+integer type to each floating-point type.

### Signed Integer Types

@@ -342,8 +343,8 @@ default value of each of these is zero.

There are standard conversions from each signed integer type to every other
signed integer type. There are also standard conversions from text to each
-signed integer type and from each signed integer type to each floating-point
-type.
+signed integer type, from each signed integer type to text, and from each
+signed integer type to each floating-point type.

Note that we have not defined standard conversions from floating-point types
to signed integer types.
@@ -357,8 +358,8 @@ have an `NA` value.

There are standard conversions from each unsigned integer type to every other
unsigned integer type. There are also standard conversions from text to each
-unsigned integer type and from each unsigned integer type to each floating-
-point type.
+unsigned integer type, from each unsigned integer type to text, and from each
+unsigned integer type to each floating-point type.

Note that we have not defined standard conversions from floating-point types
to unsigned integer types, or between signed integer types and unsigned
@@ -541,6 +542,13 @@ case, it is simple to map implicit items (suppressed due to sparsity) to zero.
In the former case, these items are first mapped to the empty text value. To
get the same result, we need empty text to map to zero.

### To Text

There are standard conversions to `TX` from the standard primitive types,
`R4`, `R8`, `I1`, `I2`, `I4`, `I8`, `U1`, `U2`, `U4`, `U8`, `BL`, `TS`, `DT`, and `DZ`.
`R4` uses the "G7" format and `R8` uses the "G17" format. `BL` converts to "True" or "False".
`TS` uses the "c" (constant) format. `DT` and `DZ` use the "o" (round-trip) format.

### Floating Point

There are standard conversions from `R4` to `R8` and from `R8` to `R4`. These
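To make the formats documented in the new "To Text" section concrete, here is a small standalone sketch using plain .NET formatting calls (this is not ML.NET code, and the sample values are arbitrary):

```csharp
using System;
using System.Globalization;

class ToTextFormats
{
    static void Main()
    {
        // R4 -> TX uses "G7" (7 significant digits); R8 -> TX uses "G17" (round-trippable).
        Console.WriteLine(3.14159265f.ToString("G7", CultureInfo.InvariantCulture));      // 3.141593
        Console.WriteLine(double.MaxValue.ToString("G17", CultureInfo.InvariantCulture)); // 1.7976931348623157E+308

        // BL -> TX yields "True" or "False".
        Console.WriteLine(true.ToString()); // True

        // TS uses the constant ("c") format; DT and DZ use the round-trip ("o") format.
        Console.WriteLine(new TimeSpan(1, 2, 3, 4).ToString("c"));                         // 1.02:03:04
        Console.WriteLine(new DateTime(2020, 5, 28).ToString("o", CultureInfo.InvariantCulture)); // 2020-05-28T00:00:00.0000000
    }
}
```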
33 changes: 33 additions & 0 deletions src/Microsoft.ML.Data/Data/Conversion.cs
@@ -114,6 +114,7 @@ private Conversions()
AddStd<I1, R8>(Convert);
AddAux<I1, SB>(Convert);
AddStd<I1, BL>(Convert);
AddStd<I1, TX>(Convert);

AddStd<I2, I1>(Convert);
AddStd<I2, I2>(Convert);
@@ -123,6 +124,7 @@ private Conversions()
AddStd<I2, R8>(Convert);
AddAux<I2, SB>(Convert);
AddStd<I2, BL>(Convert);
AddStd<I2, TX>(Convert);

AddStd<I4, I1>(Convert);
AddStd<I4, I2>(Convert);
@@ -132,6 +134,7 @@ private Conversions()
AddStd<I4, R8>(Convert);
AddAux<I4, SB>(Convert);
AddStd<I4, BL>(Convert);
AddStd<I4, TX>(Convert);

AddStd<I8, I1>(Convert);
AddStd<I8, I2>(Convert);
@@ -141,6 +144,7 @@ private Conversions()
AddStd<I8, R8>(Convert);
AddAux<I8, SB>(Convert);
AddStd<I8, BL>(Convert);
AddStd<I8, TX>(Convert);

AddStd<U1, U1>(Convert);
AddStd<U1, U2>(Convert);
@@ -151,6 +155,7 @@ private Conversions()
AddStd<U1, R8>(Convert);
AddAux<U1, SB>(Convert);
AddStd<U1, BL>(Convert);
AddStd<U1, TX>(Convert);

AddStd<U2, U1>(Convert);
AddStd<U2, U2>(Convert);
@@ -161,6 +166,7 @@ private Conversions()
AddStd<U2, R8>(Convert);
AddAux<U2, SB>(Convert);
AddStd<U2, BL>(Convert);
AddStd<U2, TX>(Convert);

AddStd<U4, U1>(Convert);
AddStd<U4, U2>(Convert);
@@ -171,6 +177,7 @@ private Conversions()
AddStd<U4, R8>(Convert);
AddAux<U4, SB>(Convert);
AddStd<U4, BL>(Convert);
AddStd<U4, TX>(Convert);

AddStd<U8, U1>(Convert);
AddStd<U8, U2>(Convert);
@@ -181,23 +188,27 @@ private Conversions()
AddStd<U8, R8>(Convert);
AddAux<U8, SB>(Convert);
AddStd<U8, BL>(Convert);
AddStd<U8, TX>(Convert);

AddStd<UG, U1>(Convert);
AddStd<UG, U2>(Convert);
AddStd<UG, U4>(Convert);
AddStd<UG, U8>(Convert);
// REVIEW: Conversion from UG to R4/R8, should we?
AddAux<UG, SB>(Convert);
AddStd<UG, TX>(Convert);

AddStd<R4, R4>(Convert);
AddStd<R4, BL>(Convert);
AddStd<R4, R8>(Convert);
AddAux<R4, SB>(Convert);
AddStd<R4, TX>(Convert);

AddStd<R8, R4>(Convert);
AddStd<R8, R8>(Convert);
AddStd<R8, BL>(Convert);
AddAux<R8, SB>(Convert);
AddStd<R8, TX>(Convert);

AddStd<TX, I1>(Convert);
AddStd<TX, U1>(Convert);
@@ -225,22 +236,26 @@ private Conversions()
AddStd<BL, R8>(Convert);
AddStd<BL, BL>(Convert);
AddAux<BL, SB>(Convert);
AddStd<BL, TX>(Convert);

AddStd<TS, I8>(Convert);
AddStd<TS, R4>(Convert);
AddStd<TS, R8>(Convert);
AddAux<TS, SB>(Convert);
AddStd<TS, TX>(Convert);

AddStd<DT, I8>(Convert);
AddStd<DT, R4>(Convert);
AddStd<DT, R8>(Convert);
AddStd<DT, DT>(Convert);
AddAux<DT, SB>(Convert);
AddStd<DT, TX>(Convert);

AddStd<DZ, I8>(Convert);
AddStd<DZ, R4>(Convert);
AddStd<DZ, R8>(Convert);
AddAux<DZ, SB>(Convert);
AddStd<DZ, TX>(Convert);

AddIsNA<R4>(IsNA);
AddIsNA<R8>(IsNA);
@@ -912,6 +927,24 @@ public void Convert(in BL src, ref SB dst)
public void Convert(in DZ src, ref SB dst) { ClearDst(ref dst); dst.AppendFormat("{0:o}", src); }
#endregion ToStringBuilder

#region ToTX
public void Convert(in I1 src, ref TX dst) => dst = src.ToString().AsMemory();
public void Convert(in I2 src, ref TX dst) => dst = src.ToString().AsMemory();
public void Convert(in I4 src, ref TX dst) => dst = src.ToString().AsMemory();
public void Convert(in I8 src, ref TX dst) => dst = src.ToString().AsMemory();
public void Convert(in U1 src, ref TX dst) => dst = src.ToString().AsMemory();
public void Convert(in U2 src, ref TX dst) => dst = src.ToString().AsMemory();
public void Convert(in U4 src, ref TX dst) => dst = src.ToString().AsMemory();
public void Convert(in U8 src, ref TX dst) => dst = src.ToString().AsMemory();
public void Convert(in UG src, ref TX dst) => dst = string.Format("0x{0:x16}{1:x16}", src.High, src.Low).AsMemory();
public void Convert(in R4 src, ref TX dst) => dst = src.ToString("G7", CultureInfo.InvariantCulture).AsMemory();
public void Convert(in R8 src, ref TX dst) => dst = src.ToString("G17", CultureInfo.InvariantCulture).AsMemory();
public void Convert(in BL src, ref TX dst) => dst = src.ToString().AsMemory();
public void Convert(in TS src, ref TX dst) => dst = string.Format("{0:c}", src).AsMemory();
public void Convert(in DT src, ref TX dst) => dst = string.Format("{0:o}", src).AsMemory();
public void Convert(in DZ src, ref TX dst) => dst = string.Format("{0:o}", src).AsMemory();
Member:

What is the motivation behind this change?

My concern here is that this could potentially increase garbage-collector pressure; however, in the case of a transformer you could control that pressure by re-using the buffer.

Contributor:

Microsoft.ML.Featurizers has a few transformers whose functionality overlaps with transformers already present in ML.NET. The motivation for this change is to bring in those featurizers and merge their functionality with the existing transformers. The ToStringTransformer only implements conversion from basic data types to string, so we are looking to merge it into the more general TypeConvertingTransformer by adding string support to it.

In a transformer, can we truly reuse the buffer? Wouldn't we hand over the reference to the string back to the caller after the conversion?

Do you see this conversion function being in the hot path to cause garbage collector pressure?

Or do you think this garbage collection concern is serious enough that we should have string conversion being done in a separate transformer with a separate API?



Member (@codemzs, May 14, 2020):

The caller hands a reference over to the transformer, which is why the getters of the transformers take the output as a ref type; for example, look at the getter for the ImageLoader transform. We reuse buffers and keep GC pressure low. This is a fundamental principle in ML.NET that gives all the performance benefits outlined in the KDD 2019 publication. With strings, you will try to reuse the buffer when you can and only expand when needed using ReadOnlyMemory.

I have reason to believe this can be a hot path. It would be great if you could share benchmarks showing it isn't, by running performance tests that compare several string conversions using this PR against a transformer that reuses the buffer. The last time we overhauled the type system we did extensive benchmarks and testing, which you can see here. I expect the same any time the type system is changed or updated.

CC: @eerhardt
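The buffer-reuse pattern being described can be sketched roughly as follows. This is a hypothetical, self-contained example modeled on ML.NET's `ValueGetter<TValue>` delegate shape; the class and method names are made up for illustration:

```csharp
using System;

static class BufferReuseSketch
{
    // Modeled on ML.NET's ValueGetter<TValue> delegate: the caller passes the
    // output by ref, so the producer can reuse the backing buffer across rows.
    public delegate void ValueGetter<T>(ref T dst);

    public static ValueGetter<float[]> MakeGetter() =>
        (ref float[] dst) =>
        {
            // Allocate only when the caller's buffer is missing or too small.
            if (dst == null || dst.Length < 3)
                dst = new float[3];
            dst[0] = 1f; dst[1] = 2f; dst[2] = 3f;
        };

    static void Main()
    {
        var getter = MakeGetter();
        float[] buffer = null;
        getter(ref buffer);            // first call allocates
        float[] first = buffer;
        getter(ref buffer);            // second call reuses the same array
        Console.WriteLine(ReferenceEquals(first, buffer)); // True
    }
}
```

This is why GC pressure stays low for array-backed types: only the first row (or a row needing a larger buffer) allocates. The open question in this thread is whether the same trick is possible when the output type is an immutable `ReadOnlyMemory<char>`.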

Member (Author):

Hey Zeeshan,

In this case the caller hands over a ReadOnlyMemory<char> reference. From my understanding, since it's read-only, we cannot modify it or the underlying buffer, and must allocate a new one. I also looked at the documentation for ReadOnlyMemory and couldn't see any way to modify it at all. Do you know of a way to modify it? Or do you know of any examples in the code I can use as a reference? I was unable to find anything.

Assuming we can't modify it and need to return a new instance, I did some testing converting an int to a string (not in ML.Net) and then returning a ReadOnlyMemory<char> pointing to it.

The first way took a ref ReadOnlyMemory<char> and assigned it to value.ToString().AsMemory().

for (int x = 0; x < NumberOfIterations; x++)
{
    // readOnlyMemory is passed in by ref
    readOnlyMemory = x.ToString().AsMemory();
}

The second way was to also take a ref ReadOnlyMemory<char> but also a ref StringBuilder and use the StringBuilder to create the string and memory.

for (int x = 0; x < NumberOfIterations; x++)
{
    // stringBuilder and readOnlyMemory are passed in by ref.
    // the stringBuilder is using Clear, which should keep the underlying buffer
    stringBuilder.Clear();
    stringBuilder.Append(x);
    readOnlyMemory = stringBuilder.ToString().AsMemory();
}

In Release mode with 214 million allocations, the first approach of ToString().AsMemory() took about 6 seconds on average where the second approach took about 8 seconds on average. Unless there is another faster way, which I couldn't find, the fastest way to convert an int to ReadOnlyMemory<char> is the first approach. Each approach also used about 13 MB of memory total.

I have also created a benchmark in ML.NET that uses the first approach to create the ReadOnlyMemory<char>. It ran 10 million rows in 1.25 seconds, and the entire test used only 30 MB of memory (I am assuming the majority of this memory comes from ML.NET itself, as the first tests I ran had more rows and used less memory). I have not done any other benchmark tests as of yet.

@codemzs does this address your concern? Or is there something else I can do to address it? Or do you know a better way to deal with the ReadOnlyMemory<char>?

@eerhardt do you know of a better way to handle the `ReadOnlyMemory<char>`?

Member (@eerhardt, May 20, 2020):

With ML.NET's current infrastructure, this is probably as good as you can make it. But in later versions of .NET Core (if we wanted to start targeting them), you could reuse the buffers in some cases by using the following methods:

https://docs.microsoft.com/en-us/dotnet/api/system.runtime.interopservices.memorymarshal.trygetarray
and
https://docs.microsoft.com/en-us/dotnet/api/system.int32.tryformat

The idea being something like this:

public void Convert(in I4 src, ref TX dst)
{
    if (MemoryMarshal.TryGetArray(dst, out ArraySegment<char> segment))
    {
        if (src.TryFormat(segment.Array, out int charsWritten))
        {
             dst = segment.Array.AsMemory(0, charsWritten);
             return;
        }
    }

    char[] newBuffer = new char[11]; // 11 is the most characters needed to represent all 32-bit ints
    src.TryFormat(newBuffer, out int charsWritten);
    dst = newBuffer.AsMemory(0, charsWritten);
}

What's dangerous about this approach is that dst might be someone else's array that we are clobbering over in order to reuse it. To prevent that, we could write a little sentinel value at the beginning of the array to mark it as ours. Say something like 2 char.MaxValue elements. And then we could check if the array we got from MemoryMarshal.TryGetArray was "ours" by checking the sentinel.
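The sentinel idea above could look roughly like this. It is an untested sketch combining the earlier `TryGetArray`/`TryFormat` snippet with a two-`char.MaxValue` marker; `SentinelSketch` and `SentinelLength` are made-up names, and `int.TryFormat`/`MemoryMarshal.TryGetArray` are the real .NET Core 2.1+ APIs linked earlier:

```csharp
using System;
using System.Runtime.InteropServices;

static class SentinelSketch
{
    // Two char.MaxValue elements at the start mark a backing array as "ours" to reuse.
    const int SentinelLength = 2;

    public static void Convert(int src, ref ReadOnlyMemory<char> dst)
    {
        // Reuse the buffer only if we can see its backing array AND it carries our sentinel.
        if (MemoryMarshal.TryGetArray(dst, out ArraySegment<char> seg)
            && seg.Array != null
            && seg.Array.Length >= SentinelLength + 11   // 11 chars covers any 32-bit int
            && seg.Array[0] == char.MaxValue
            && seg.Array[1] == char.MaxValue
            && src.TryFormat(seg.Array.AsSpan(SentinelLength), out int written))
        {
            dst = seg.Array.AsMemory(SentinelLength, written);
            return;
        }

        // Otherwise allocate a fresh marked buffer.
        var buffer = new char[SentinelLength + 11];
        buffer[0] = buffer[1] = char.MaxValue;
        src.TryFormat(buffer.AsSpan(SentinelLength), out int charsWritten);
        dst = buffer.AsMemory(SentinelLength, charsWritten);
    }

    static void Main()
    {
        ReadOnlyMemory<char> mem = default;
        Convert(123, ref mem);   // allocates a marked buffer
        Console.WriteLine(mem);  // 123
        Convert(-456, ref mem);  // reuses the same backing array
        Console.WriteLine(mem);  // -456
    }
}
```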

Member:

Why don't we want to use the StringBuilder overloads directly above this block?

Contributor:

In this particular case the conversion being sought is from some data type like int to string, not to StringBuilder. If we wanted to use the above conversions, we would have to create a local StringBuilder, convert the given data type to StringBuilder, and then convert that back to a string. @michaelgsharp has found this method to be slower.



Contributor:

It appears that those APIs are supported on .NET Core 2.1 and 3.1, which ML.NET supports, but not on .NET Framework 4.6.1, which ML.NET also targets.
In that case, is it okay to go ahead with this PR until those APIs are supported on all ML.NET configurations?



#endregion ToTX

#region ToBL
public void Convert(in R8 src, ref BL dst) => dst = System.Convert.ToBoolean(src);
public void Convert(in R4 src, ref BL dst) => dst = System.Convert.ToBoolean(src);
20 changes: 20 additions & 0 deletions test/Microsoft.ML.Tests/Transformers/ConvertTests.cs
@@ -248,6 +248,26 @@ public void TestConvertWorkout()
var expectedConvertedValues = ML.Data.LoadFromEnumerable(allTypesDataConverted);

CheckSameValues(expectedConvertedValues, actualConvertedValues);

var allInputTypesData = new[] { new { A = (sbyte)sbyte.MinValue, B = (byte)byte.MinValue, C = double.MaxValue, D = float.MinValue, E = "already a string", F = false } };
var allInputTypesDataView = ML.Data.LoadFromEnumerable(allInputTypesData);
var allInputTypesDataPipe = ML.Transforms.Conversion.ConvertType(columns: new[] {new TypeConvertingEstimator.ColumnOptions("A1", DataKind.String, "A"),
new TypeConvertingEstimator.ColumnOptions("B1", DataKind.String, "B"),
new TypeConvertingEstimator.ColumnOptions("C1", DataKind.String, "C"),
new TypeConvertingEstimator.ColumnOptions("D1", DataKind.String, "D"),
new TypeConvertingEstimator.ColumnOptions("E1", DataKind.String, "E"),
new TypeConvertingEstimator.ColumnOptions("F1", DataKind.String, "F"),
});

var convertedValues = allInputTypesDataPipe.Fit(allInputTypesDataView).Transform(allInputTypesDataView);

var expectedValuesData = new[] { new { A = (sbyte)sbyte.MinValue, B = (byte)byte.MinValue, C = double.MaxValue, D = float.MinValue, E = "already a string", F = false,
A1 = "-128", B1 = "0", C1 = "1.7976931348623157E+308", D1 = "-3.402823E+38", E1 = "already a string", F1 = "False" } };
var expectedValuesDataView = ML.Data.LoadFromEnumerable(expectedValuesData);

CheckSameValues(expectedValuesDataView, convertedValues);
TestEstimatorCore(allInputTypesDataPipe, allInputTypesDataView);

Done();
}
