@@ -73,7 +73,7 @@ result that if a pipeline was composed in some other fashion, there would be
some error.

The only thing you can really assume is that an `IDataView` behaves "sanely"
- according to the contracts of the `IDataView` interface, so that future TLC
+ according to the contracts of the `IDataView` interface, so that future ML.NET
developers can form some reasonable expectations of how your code behaves, and
also have a prayer of knowing how to maintain the code. It is hard enough to
write software correctly even when the code you're working with actually does
@@ -166,8 +166,8 @@ has the following problems:
* **Every** call had to verify that the column was active,
* **Every** call had to verify that `TValue` was of the right type,
* When these were part of, say, a transform in a chain (as they often are,
- considering how common transforms are used by TLC's users) each access would
- be accompanied by a virtual method call to the upstream cursor's
+ considering how common transforms are used by ML.NET's users) each access
+ would be accompanied by a virtual method call to the upstream cursor's
`GetColumnValue`.

In contrast, consider the situation with these getter delegates. The
@@ -211,14 +211,14 @@ consuming different data from the contemporaneous cursor? There are many
examples of this throughout the codebase.

Nevertheless: in very specific circumstances we have relaxed this. For
- example, the TLC API serves up corrupt `IDataView` implementations that have
- their underlying data change, since reconstituting a data pipeline on fresh
- data is at the present moment too resource intensive. Nonetheless, this is
- wrong: for example, the `TrainingCursorBase` and related subclasses rely upon
- the data not changing. Since, however, that is used for *training* and the
- prediction engines of the API as used for *scoring*, we accept these. However
- this is not, strictly speaking, correct, and this sort of corruption of
- `IDataView` should only be considered as a last resort, and only when some
+ example, some ML.NET API code serves up corrupt `IDataView` implementations
+ that have their underlying data change, since reconstituting a data pipeline
+ on fresh data is at the present moment too resource intensive. Nonetheless,
+ this is wrong: for example, the `TrainingCursorBase` and related subclasses
+ rely upon the data not changing. Since, however, that is used for *training*
+ and the prediction engines of the API as used for *scoring*, we accept these.
+ However this is not, strictly speaking, correct, and this sort of corruption
+ of `IDataView` should only be considered as a last resort, and only when some
great good can be accomplished through this. We certainly did not accept this
corruption lightly!
@@ -265,19 +265,19 @@ same data view.) So some rules:
## Versioning

This requirement for consistency of a data model often has implications across
- versions of TLC, and our requirements for data model backwards compatibility.
- As time has passed, we often feel like it would make sense if a transform
- behaved *differently*, that is, if it organized or calculated its output in a
- different way than it currently does. For example, suppose we wanted to switch
- the hash transform to something a bit more efficient than murmur hashes, for
- example. If we did so, presumably the same input values would map to different
- outputs. We are free to do so, of course, yet: when we deserialize a hash
- transform from before we made this change, that hash transform should continue
- to output values as it did, before we made that change. (This, of course,
- assuming that the transform was released as part of a "blessed" non-preview
- point release of TLC. We can, and have, broken backwards compatibility for
- something that has not yet been incorporated in any sort of blessed release,
- though we prefer to not.)
+ versions of ML.NET, and our requirements for data model backwards
+ compatibility. As time has passed, we often feel like it would make sense if a
+ transform behaved *differently*, that is, if it organized or calculated its
+ output in a different way than it currently does. For example, suppose we
+ wanted to switch the hash transform to something a bit more efficient than
+ murmur hashes, for example. If we did so, presumably the same input values
+ would map to different outputs. We are free to do so, of course, yet: when we
+ deserialize a hash transform from before we made this change, that hash
+ transform should continue to output values as it did, before we made that
+ change. (This, of course, assuming that the transform was released as part of
+ a "blessed" non-preview point release of ML.NET. We can, and have, broken
+ backwards compatibility for something that has not yet been incorporated in
+ any sort of blessed release, though we prefer to not.)

## What is Not Functionally Identical
@@ -334,10 +334,9 @@ aside (which we can hardly help), we expect the models to be the same.
# On Loaders, Data Models, and Empty `IMultiStreamSource`s

- When you run TLC you have the option of specifying not only *one* data input,
- but any number of data input files, including zero. :) This is how [the
- examples here](../public/command/DataCommands.md#look-ma-no-files) work. But
- there's also a more general principle at work here: when deserializing a data
+ When you create a loader you have the option of specifying not only *one* data
+ input, but any number of data input files, including zero. But there's also a
+ more general principle at work here with zero files: when deserializing a data
loader from a data model with an `IMultiStreamSource` with `Count == 0` (e.g.,
as would be constructed with `new MultiFileSource(null)`), we have a protocol
that *every* `IDataLoader` should work in that circumstance, and merely be a
@@ -472,7 +471,34 @@ indication that this function will not move the cursor (in which case `IRow`
is helpful), or that will not access any values (in which case `ICursor` is
helpful).

- # Metadata
+ # Schema
+
+ The schema contains information about the columns. As we see in [the design
+ principles](IDataViewDesignPrinciples.md), it has index, data type, and
+ optional metadata.
+
+ While *programmatically* accesses to an `IDataView` are by index, from a
+ user's perspective the indices are by name; most training algorithms
+ conceptually train on the `Features` column (under default settings). For this
+ reason nearly all usages of an `IDataView` will be prefixed with a call to the
+ schema's `TryGetColumnIndex`.
+
+ Regarding name hiding, the principles mention that when multiple columns have
+ the same name, other columns are "hidden." The convention all implementations
+ of `ISchema` obey is that the column with the *largest* index. Note however
+ that this is merely convention, not part of the definition of `ISchema`.
+
+ Implementations of `TryGetColumnIndex` should be O(1), that is, practically,
+ this mapping ought to be backed with a dictionary in most cases. (There are
+ obvious exceptions like, say, things like `LineLoader` which produce exactly
+ one column. There, a simple equality test suffices.)
+
+ It is best if `GetColumnType` returns the *same* object every time. That is,
+ things like key-types and vector-types, when returned, should not be created
+ in the function itself (thereby creating a new object every time), but rather
+ stored somewhere and returned.
+
+ ## Metadata

Since metadata is *optional*, one is not obligated to necessarily produce it,
or conform to any particular schemas for any particular kinds (beyond, say,