Update documentation to stop mentioning interfaces that no longer exist #4673

Merged
merged 30 commits into from
Jan 27, 2020
Commits
a095df9
ISchema to DataViewShema
antoniovs1029 Jan 17, 2020
12f0c5f
IRandom -> System.Random
antoniovs1029 Jan 17, 2020
bae0204
IRow -> DataViewRow
antoniovs1029 Jan 17, 2020
1cb8049
Moved DataViewRowCursor.md into docs/code
antoniovs1029 Jan 17, 2020
2e8d9e1
Moved DataViewRowCursor.md
antoniovs1029 Jan 17, 2020
505dcb6
Updated SynchronizedCursorBase
antoniovs1029 Jan 18, 2020
a8ea502
Updated ISchemaBindableMapper
antoniovs1029 Jan 18, 2020
43993bb
Updated ISweepers
antoniovs1029 Jan 18, 2020
eb7f5db
Updated RowCursorUtils
antoniovs1029 Jan 18, 2020
84bd5ce
Updated ClusteringEvaluator.Aggregator
antoniovs1029 Jan 18, 2020
e4cdafe
Updated EvaluatorBase
antoniovs1029 Jan 18, 2020
975aef4
Updated TrainerCursorBase
antoniovs1029 Jan 18, 2020
2c9b1d0
Updated OneToOneTransformBase
antoniovs1029 Jan 18, 2020
c2bc6b3
Updated KMeansUtils
antoniovs1029 Jan 18, 2020
c3b0ea2
Change in IDataReader code
antoniovs1029 Jan 18, 2020
45b3fc8
Modified ISweeper
antoniovs1029 Jan 21, 2020
b89a053
Revert "Updated SynchronizedCursorBase"
antoniovs1029 Jan 21, 2020
f8d549e
Use XML "see" for DataViewRow
antoniovs1029 Jan 21, 2020
e627874
Editing ClusteringEvaluator
antoniovs1029 Jan 21, 2020
5cf207e
Use XML "see" for DataViewRowCursor
antoniovs1029 Jan 21, 2020
925f52c
Updated RoleMappedSchema
antoniovs1029 Jan 21, 2020
cc9e55a
Updated ISchemaBindableMapper
antoniovs1029 Jan 21, 2020
2872752
Updated ISchemaBindableMapper
antoniovs1029 Jan 21, 2020
7229760
Remove MoveMany() comment from TrainerUtils
antoniovs1029 Jan 21, 2020
d15f0e6
Rephrase usage of "interface" on "Comparison with LINQ" Appendix
antoniovs1029 Jan 21, 2020
9286035
Typo
antoniovs1029 Jan 21, 2020
0f7a78d
Added explanatory comment to OneToOneTransformBase
antoniovs1029 Jan 21, 2020
77573fb
Updated IDataReader to IDataLoader in MlNetHighLevelConcepts.md
antoniovs1029 Jan 21, 2020
60f5891
Updated OneToOnetransformBase
antoniovs1029 Jan 23, 2020
fa6a364
Make it explicit that the Database loader spec refer to System.Data.I…
antoniovs1029 Jan 23, 2020
12 changes: 6 additions & 6 deletions docs/code/IDataViewDesignPrinciples.md
@@ -459,15 +459,15 @@ the IDataView system is similar to the LINQ eco-system. The comparisons below
refer to the `IDataView` and `IEnumerable<T>` interfaces as the core
interfaces of their respective worlds.

In both worlds, there is a cursoring interface associated with the core
In both worlds, there is a cursoring mechanism associated with the core
interface. In the IEnumerable world, the cursoring interface is
`IEnumerator<T>`. In the IDataView world, the cursoring interface is
`IRowCursor`.
`IEnumerator<T>`. In the IDataView world, the cursoring mechanism is accomplished through a
`DataViewRowCursor`.

Both cursoring interfaces have `MoveNext()` methods for forward-only iteration
Both cursoring mechanisms have `MoveNext()` methods for forward-only iteration
through the elements.

Both cursoring interfaces provide access to information about the current
Both cursoring mechanisms provide access to information about the current
item. For the IEnumerable world, the access is through the `Current` property
of the enumerator. Note that when `T` is a class type, this suggests that each
item served requires memory allocation. In the IDataView world, there is no
@@ -476,7 +476,7 @@ current row are directly accessible via methods on the cursor. This avoids
memory allocation for each row.

In both worlds, the item type information is carried by both the core
interface and the cursoring interface. In the IEnumerable world, this type
interface and the cursoring mechanism. In the IEnumerable world, this type
information is part of the .Net type, while in the IDataView world, the type
information is much richer and contained in the schema, rather than in the
.Net type.
50 changes: 17 additions & 33 deletions docs/code/IDataViewImplementation.md
@@ -144,12 +144,13 @@ to make it a loader or a transform. If not, it probably does not make sense.
Let us address something fairly conspicuous. The question almost everyone
asks, when they first start using `IDataView`: what is up with these getters?

One does not fetch values directly from an `IRow` implementation (including
`IRowCursor`). Rather, one retains a delegate that can be used to fetch
objects, through the `GetGetter` method on `IRow`. This delegate is:
One does not fetch values directly from a `DataViewRow` implementation (including
`DataViewRowCursor`). Rather, one retains a delegate that can be used to fetch
objects, through the `GetGetter` method on `DataViewRow`. This delegate is:

```csharp
public delegate void ValueGetter<TValue>(ref TValue value);

```

If you are unfamiliar with delegates, [read
@@ -159,7 +160,7 @@ method, and you use this delegate multiple times to fetch the actual column
values as you `MoveNext` through the cursor.

Some history to motivate this: In the first version of `IDataView` the
`IRowCursor` implementation did not actually have these "getters" but rather
`DataViewRowCursor` implementation (formerly known as `IRowCursor`) did not actually have these "getters" but rather
Contributor

@justinormont justinormont Jan 17, 2020

I see one usage of and various references to IRowCursor -- https://github.com/dotnet/machinelearning/search?utf8=%E2%9C%93&q=IRowCursor&type=

Example:

var cursors = new IRowCursor[prList.Count];
#Resolved

Member Author

@antoniovs1029 antoniovs1029 Jan 18, 2020

Yes, it seems it is the only place in "actual code" where IRowCursor appears.

It is wrapped inside a #if !CORECLR (link). So when building ML.NET nothing inside that block is compiled. If I remove the CORECLR if directive, then I get some errors, not only corresponding to the IRowCursor (which doesn't exist currently in ML.NET).

I don't know if I should change it, then. #Resolved

The code in BinaryClassifierEvaluator will not compile even if we change IRowCursor to DataViewRowCursor, since it uses a class that does not exist in our code: XYPlot. We had that class in an internal repo, and it used System.Windows.Forms which I believe is only for .Net Framework, so we didn't include it in ML.NET. I think that code can safely be deleted.


In reply to: 368201681 [](ancestors = 368201681)

Member Author

I will open another PR to remove that piece of code then, since I prefer that this PR remains only as changes in the documentation. Thanks for the clarification, @yaeldekel !


In reply to: 368281931 [](ancestors = 368281931,368201681)

Member Author

I've opened the PR in here: #4694


In reply to: 369266609 [](ancestors = 369266609,368281931,368201681)

had a method, `GetColumnValue<TValue>(int col, ref TValue val)`. However, this
has the following problems:

@@ -191,7 +192,7 @@ values for the same columns, it will apparently be a "consistent" view. It is
probably obvious what this mean, but specifically:

The cursor as returned through `GetRowCursor` (with perhaps an identically
constructed `IRandom` instance) in any iteration should return the same number
constructed `System.Random` instance) in any iteration should return the same number
of rows on all calls, and with the same values at each row.

Why is this important? Many machine learning algorithms require multiple
@@ -203,7 +204,7 @@ are computed were not consistent? How could a dual algorithm like SDCA
function with any accuracy, if the examples associated with any given dual
variable were to change? Consider even a relatively simple transform, like a
forward looking windowed averager, or anything relating to time series. The
implementation of those `ICursor` interfaces often open *two* cursors on the
implementation of those `DataViewRowCursor` interfaces often open *two* cursors on the
underlying `IDataView`, one "look ahead" cursor used to gather and calculate
necessary statistics, and another cursor for any data: how could the column
constructed out of that transform be meaningful of the look ahead cursor was
@@ -249,7 +250,7 @@ data in a consistent way.
Let us formalize this somewhat. We consider two data views to be functionally
identical if there is absolutely no way to distinguish them: they return the
same values, have the same types, same number of rows, they shuffle
identically given identically constructed `IRandom` when row cursors are
identically given identically constructed `System.Random` when row cursors are
constructed, return the same ID for rows from the ID getter, etc. Obviously
this concept is transitive. (Of course, `Batch` in a cursor might be different
between the two, but that is the case even with two cursors constructed on the
@@ -348,7 +349,7 @@ feature names are, etc.) when all we have is the data model. (For example, the

# Getters Must Fail for Invalid Types

For a given `IRow`, we must expect that `GetGetter<TValue>(col)` will throw if
For a given `DataViewRow`, we must expect that `GetGetter<TValue>(col)` will throw if
either `IsColumnActive(col)` is `false`, or `typeof(TValue) !=
Schema.GetColumnType(col).RawType`, as indicated in the code documentation.
But why? It might seem reasonable to add seemingly "harmless" flexibility to
@@ -383,15 +384,15 @@ inconsistency, surprises and bugs for users and developers.

# Thread Safety

Any `IDataView` implementation, as well as the `ISchema`, *must* be thread
Any `IDataView` implementation, as well as the `DataViewSchema`, *must* be thread
safe. There is a lot of code that depends on this. For example, cross
validation works by operating over the same dataset (just, of course, filtered
to different subsets of the data). That amounts to multiple cursors being
opened, simultaneously, over the same data.

So: `IDataView` and `ISchema` must be thread safe. However, `IRowCursor`,
So: `IDataView` and `DataViewSchema` must be thread safe. However, `DataViewRowCursor`,
being a stateful object, we assume is accessed from exactly one thread at a
time. The `IRowCursor`s returned through a `GetRowCursorSet`, however, which
time. The `DataViewRowCursor`s returned through a `GetRowCursorSet`, however, which
each single one must be accessed by a single thread at a time, multiple
threads can access this set of cursors simultaneously: that's why we have that
method in the first place.
@@ -431,10 +432,10 @@ not have been obvious immediately.

# `GetGetter` Returning the Same Delegate

On a single instance of `IRowCursor`, since each `IRowCursor` instance has no
On a single instance of `DataViewRowCursor`, since each `DataViewRowCursor` instance has no
requirement to be thread safe, it is entirely legal for a call to `GetGetter`
on a single column to just return the same getting delegate. It has come to
pass that the majority of implementations of `IRowCursor` actually do that,
pass that the majority of implementations of `DataViewRowCursor` actually do that,
since it is in some ways easier to write the code that way.

This practice has inadvertently enabled a fairly attractive tool for analysis
@@ -447,29 +448,12 @@ do not, but the vast majority do.
# Class Structuring

The essential attendant classes of an `IDataView` are its schema, as returned
through the `Schema` property, as well as the `IRowCursor` implementation(s),
through the `Schema` property, as well as the `DataViewRowCursor` implementation(s),
as returned through the `GetRowCursor` and `GetRowCursorSet` methods. The
implementations for those two interfaces are typically nested within the
`IDataView` implementation itself. The cursor implementation is almost always
at the bottom of the data view class.

# `IRow` and `ICursor` vs. `IRowCursor`

We have `IRowCursor` which descends from both `IRow` and `ICursor`. Why do
these other interfaces exist?

Firstly, there are implementations of `IRow` or `ICursor` that are not
`IRowCursor`s. We have occasionally found it useful to have something
resembling a key-value store, but that is strongly, dynamically typed in some
fashion. Why not simply represent this using the same idioms of `IDataView`?
So we put them in an `IRow`. Similarly: we have several things that behave
*like* cursors, but that are in no way *row* cursors.

However, more than that, there are a number of utility functions where we want
to operate over something like an `IRowCursor`, but we want to have some
indication that this function will not move the cursor (in which case `IRow`
is helpful), or that will not access any values (in which case `ICursor` is
helpful).

# Schema

@@ -485,8 +469,8 @@ schema's `TryGetColumnIndex`.

Regarding name hiding, the principles mention that when multiple columns have
the same name, other columns are "hidden." The convention all implementations
of `ISchema` obey is that the column with the *largest* index. Note however
that this is merely convention, not part of the definition of `ISchema`.
of `DataViewSchema` obey is that the column with the *largest* index. Note however
that this is merely convention, not part of the definition of `DataViewSchema`.

Implementations of `TryGetColumnIndex` should be O(1), that is, practically,
this mapping ought to be backed with a dictionary in most cases. (There are
30 changes: 15 additions & 15 deletions docs/code/MlNetHighLevelConcepts.md
@@ -13,9 +13,9 @@ This document is going to cover the following ML.NET concepts:
- In one sentence, a transformer is a component that takes data, does some work on it, and returns new 'transformed' data.
- For example, you can think of a machine learning model as a transformer that takes features and returns predictions.
- Another example, 'text tokenizer' would take a single text column and output a vector column with individual 'words' extracted out of the texts.
- [*Data reader*](#data-reader), represented as an `IDataReader<T>` interface.
- The data reader is ML.NET component to 'create' data: it takes an instance of `T` and returns data out of it.
- For example, a *TextLoader* is an `IDataReader<FileSource>`: it takes the file source and produces data.
- [*Data loader*](#data-loader), represented as an `IDataLoader<TSource>` interface.
- The data loader is ML.NET component to 'create' data: it takes an instance of `TSource` and returns data out of it.
- For example, a *TextLoader* is an `IDataLoader<IMultiStreamSource>`: it takes the file source and produces data.
- [*Estimator*](#estimator), represented as an `IEstimator<T>` interface.
- This is an object that learns from data. The result of the learning is a *transformer*.
- You can think of a machine learning *algorithm* as an estimator that learns on data and produces a machine learning *model* (which is a transformer).
@@ -28,7 +28,7 @@ This document is going to cover the following ML.NET concepts:

In ML.NET, data is very similar to a SQL view: it's a lazily-evaluated, cursorable, heterogenous, schematized dataset.

- It has *Schema* (an instance of an `ISchema` interface), that contains the information about the data view's columns.
- It has *Schema* (an instance of a `DataViewSchema` class), that contains the information about the data view's columns.
- Each column has a *Name*, a *Type*, and an arbitrary set of *annotations* associated with it.
- It is important to note that one of the types is the `vector<T, N>` type, which means that the column's values are *vectors of items of type T, with the size of N*. This is a recommended way to represent multi-dimensional data associated with every row, like pixels in an image, or tokens in a text.
- The column's *annotations* contains information like 'slot names' of a vector column and suchlike. The annotations itself are actually represented as another one-row *data*, that is unique to each column.
@@ -40,12 +40,12 @@ In ML.NET, data is very similar to a SQL view: it's a lazily-evaluated, cursorab

A transformer is a component that takes data, does some work on it, and return new 'transformed' data.

Here's the interface of `ITransformer`:
Here's part of the `ITransformer` interface:
```c#
public interface ITransformer
{
IDataView Transform(IDataView input);
ISchema GetOutputSchema(ISchema inputSchema);
DataViewSchema GetOutputSchema(DataViewSchema inputSchema);
}
```

@@ -73,26 +73,26 @@ var fullTransformer = transformer1.Append(transformer2).Append(transformer3);

We utilize this property a lot in ML.NET: typically, the trained ML.NET model is a 'chain of transformers', which is, for all intents and purposes, a *transformer*.

## Data reader
## Data loader

The data reader is ML.NET component to 'create' data: it takes an instance of `T` and returns data out of it.
The data loader is ML.NET component to 'create' data: it takes an instance of `TSource` and returns data out of it.

Here's the exact interface of `IDataReader<T>`:
Here's the interface of `IDataLoader<TSource>`:
```c#
public interface IDataReader<in TSource>
{
IDataView Read(TSource input);
ISchema GetOutputSchema();
IDataView Load(TSource input);
DataViewSchema GetOutputSchema();
}
```
As you can see, the reader is capable of reading data (potentially multiple times, and from different 'inputs'), but the resulting data will always have the same schema, denoted by `GetOutputSchema`.
As you can see, the loader is capable of loading data (potentially multiple times, and from different 'inputs'), but the resulting data will always have the same schema, denoted by `GetOutputSchema`.

An interesting property to note is that you can create a new data reader by 'attaching' a transformer to an existing data reader. This way you can have 'reader' with transformation behavior baked in:
An interesting property to note is that you can create a new data loader by 'attaching' a transformer to an existing data loader. This way you can have a 'loader' with transformation behavior baked in:
```c#
var newReader = reader.Append(transformer1).Append(transformer2)
var newLoader = loader.Append(transformer1).Append(transformer2)
```

Another similarity to transformers is that, since data is lazily evaluated, *readers are lazy*: no (or minimal) actual 'reading' happens when you call `dataReader.Read()`: only when a cursor is requested on the resulting data does the reader begin to work.
Another similarity to transformers is that, since data is lazily evaluated, *loaders are lazy*: no (or minimal) actual 'loading' happens when you call `dataLoader.Load()`: only when a cursor is requested on the resulting data does the loader begin to work.

## Estimator

4 changes: 2 additions & 2 deletions docs/code/SchemaComprehension.md
@@ -8,9 +8,9 @@ For a better understanding of `IDataView` principles and type system please refe

## Introduction

Every dataset in ML.NET is represented as an `IDataView`, which is, for the purposes of this document, a collection of rows that share the same columns. The set of columns, their names, types and other annotations is known as the *schema* of the `IDataView`, and it's represented as an `ISchema` object.
Every dataset in ML.NET is represented as an `IDataView`, which is, for the purposes of this document, a collection of rows that share the same columns. The set of columns, their names, types and other annotations is known as the *schema* of the `IDataView`, and it's represented as an `DataViewSchema` object.

In this document, we will be using the terms *data view* and `IDataView` interchangeably, same for *schema* and `ISchema`.
In this document, we will be using the terms *data view* and `IDataView` interchangeably, same for *schema* and `DataViewSchema`.

Before any new data enters ML.NET, the user needs to somehow define how the schema of the data will look like.
To do this, the following questions need to be answered:
@@ -183,7 +183,7 @@ MLContext mlContext = new MLContext();
IDataView trainingDataView = mlContext.Data.LoadFromDbSqlQuery<ModelInputData, SqlConnection>(connString: myConnString, sqlQuerySentence: "Select * from InputMLModelDataset where InputMLModelDataset.CompanyName = 'MSFT'");
```

**2. (Foundational method) Data loading from a database with an IDataReader object:**
**2. (Foundational method) Data loading from a database with a System.Data.IDataReader object:**

This is the foundational or pillar method which will be used by the rest of the higher level or convenient methods:

2 changes: 1 addition & 1 deletion src/Microsoft.ML.AutoML/Sweepers/ISweeper.cs
@@ -230,7 +230,7 @@ IComparable IRunResult.MetricValue

/// <summary>
/// The metric class, used by smart sweeping algorithms.
/// Ideally we would like to move towards the new IDataView/ISchematized, this is
/// Ideally we would like to move towards a IDataView, this is
/// just a simple view instead, and it is decoupled from RunResult so we can move
/// in that direction in the future.
/// </summary>
6 changes: 3 additions & 3 deletions src/Microsoft.ML.Core/Data/ISchemaBindableMapper.cs
@@ -9,11 +9,11 @@
namespace Microsoft.ML.Data
{
/// <summary>
/// A mapper that can be bound to a <see cref="RoleMappedSchema"/> (which is an ISchema, with mappings from column kinds
/// to columns). Binding an <see cref="ISchemaBindableMapper"/> to a <see cref="RoleMappedSchema"/> produces an
/// A mapper that can be bound to a <see cref="RoleMappedSchema"/> (which encapsulates a <see cref="DataViewSchema"/> and has mappings from column kinds
/// to columns of that schema). Binding an <see cref="ISchemaBindableMapper"/> to a <see cref="RoleMappedSchema"/> produces an
/// <see cref="ISchemaBoundMapper"/>, which is an interface that has methods to return the names and indices of the input columns
/// needed by the mapper to compute its output. The <see cref="ISchemaBoundRowMapper"/> is an extention to this interface, that
/// can also produce an output IRow given an input IRow. The IRow produced generally contains only the output columns of the mapper, and not
/// can also produce an output <see cref="DataViewRow"/> given an input <see cref="DataViewRow"/>. The <see cref="DataViewRow"/> produced generally contains only the output columns of the mapper, and not
/// the input columns (but there is nothing preventing an <see cref="ISchemaBoundRowMapper"/> from mapping input columns directly to outputs).
/// This interface is implemented by wrappers of IValueMapper based predictors, which are predictors that take a single
/// features column. New predictors can implement <see cref="ISchemaBindableMapper"/> directly. Implementing <see cref="ISchemaBindableMapper"/>
4 changes: 2 additions & 2 deletions src/Microsoft.ML.Core/Data/RoleMappedSchema.cs
@@ -9,7 +9,7 @@
namespace Microsoft.ML.Data
{
/// <summary>
/// Encapsulates an <see cref="Schema"/> plus column role mapping information. The purpose of role mappings is to
/// Encapsulates a <see cref="DataViewSchema"/> plus column role mapping information. The purpose of role mappings is to
/// provide information on what the intended usage is for. That is: while a given data view may have a column named
/// "Features", by itself that is insufficient: the trainer must be fed a role mapping that says that the role
/// mapping for features is filled by that "Features" column. This allows things like columns not named "Features"
@@ -25,7 +25,7 @@ namespace Microsoft.ML.Data
/// in this schema.
/// </summary>
/// <remarks>
/// Note that instances of this class are, like instances of <see cref="Schema"/>, immutable.
/// Note that instances of this class are, like instances of <see cref="DataViewSchema"/>, immutable.
///
/// It is often the case that one wishes to bundle the actual data with the role mappings, not just the schema. For
/// that case, please use the <see cref="RoleMappedData"/> class.