Commit 8f1da5c

Justin semantic comments
1 parent 9ace94b commit 8f1da5c

File tree: 3 files changed, +73 -56 lines

- docs/code/IDataViewDesignPrinciples.md
- docs/code/IDataViewImplementation.md
- docs/code/IDataViewTypeSystem.md

docs/code/IDataViewDesignPrinciples.md

Lines changed: 17 additions & 25 deletions
@@ -64,10 +64,9 @@ The IDataView design fulfills the following design requirements:
   kinds, and supports composing multiple primitive components to achieve
   higher-level semantics. See [here](#components).

-* **Open component system**: While the AzureML Algorithms team has developed,
-  and continues to develop, a large library of IDataView components,
-  additional components that interoperate with these may be implemented in
-  other code bases. See [here](#components).
+* **Open component system**: While the ML.NET code has a large and growing
+  library of IDataView components, additional components that interoperate
+  with these may be implemented in other code bases. See [here](#components).

 * **Cursoring**: The rows of a view are accessed sequentially via a row
   cursor. Multiple cursors can be active on the same view, both sequentially
@@ -136,11 +135,8 @@ The IDataView system design does *not* include the following:

 * **Data file formats**: The IDataView system does not dictate storage or
   transport formats. It *does* include interfaces for loader and saver
-  components. The AzureML Algorithms team has implemented loaders and savers
-  for some binary and text file formats, but additional loaders and savers can
-  (and will) be implemented. In particular, implementing a loader from XDF
-  will be straightforward. Implementing a saver to XDF will likely require the
-  XDF format to be extended to support vector-valued columns.
+  components. The ML.NET code has implementations of loaders and savers for
+  some binary and text file formats.

 * **Multi-node computation over multiple data partitions**: The IDataView
   design is focused on single node computation. We expect that in multi-node
@@ -197,16 +193,16 @@ experience and performance.

 Machine learning and advanced analytics applications often involve high-
 dimensional data. For example, a common technique for learning from text,
-known as bag-of-words, represents each word in the text as a numeric feature
-containing the number of occurrences of that word. Another technique is
-indicator or one-hot encoding of categorical values, where, for example, a
-text-valued column containing a person's last name is expanded to a set of
-features, one for each possible name (Tesla, Lincoln, Gandhi, Zhang, etc.),
-with a value of one for the feature corresponding to the name, and the
-remaining features having value zero. Variations of these techniques use
-hashing in place of dictionary lookup. With hashing, it is common to use 20
-bits or more for the hash value, producing $2^20$ (about a million) features
-or more.
+known as [bag-of-words](https://en.wikipedia.org/wiki/Bag-of-words_model),
+represents each word in the text as a numeric feature containing the number of
+occurrences of that word. Another technique is indicator or one-hot encoding
+of categorical values, where, for example, a text-valued column containing a
+person's last name is expanded to a set of features, one for each possible
+name (Tesla, Lincoln, Gandhi, Zhang, etc.), with a value of one for the
+feature corresponding to the name, and the remaining features having value
+zero. Variations of these techniques use hashing in place of dictionary
+lookup. With hashing, it is common to use 20 bits or more for the hash value,
+producing `2^20` (about a million) features or more.

 These techniques typically generate an enormous number of features.
 Representing each feature as an individual column is far from ideal, both from
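For concreteness, here is a minimal C# sketch of the hashing idea described in the hunk above (not ML.NET's actual hashing transform), showing how a `k`-bit hash maps a variable-length token sequence into counts over a conceptual vector of length `2^k`. The hash function and the sparse dictionary representation are illustrative choices only.

```csharp
using System;
using System.Collections.Generic;

static class HashedBagOfWords
{
    // Map a variable-length token sequence to counts indexed by a k-bit hash.
    // With bits = 20 the conceptual output vector has 2^20 (about a million) slots.
    static Dictionary<int, float> HashTokens(IEnumerable<string> tokens, int bits)
    {
        int mask = (1 << bits) - 1;                // keep only the low `bits` bits of the hash
        var counts = new Dictionary<int, float>(); // sparse view of the 2^bits-length vector
        foreach (var token in tokens)
        {
            int slot = token.GetHashCode() & mask; // placeholder hash; a real transform would use a seeded murmur-style hash
            counts.TryGetValue(slot, out float c);
            counts[slot] = c + 1;                  // bag-of-words: count occurrences per slot
        }
        return counts;
    }

    static void Main()
    {
        var slots = HashTokens(new[] { "the", "quick", "brown", "fox", "the" }, bits: 20);
        Console.WriteLine($"{slots.Count} non-zero slots out of {1 << 20}");
    }
}
```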
@@ -225,8 +221,8 @@ corresponding vector values may have any length. A tokenization transform,
 that maps a text value to the sequence of individual terms in that text,
 naturally produces variable-length vectors of text. Then, a hashing ngram
 transform may map the variable-length vectors of text to a bag-of-ngrams
-representation, which naturally produces numeric vectors of length $2^k$, where
-$k$ is the number of bits used in the hash function.
+representation, which naturally produces numeric vectors of length `2^k`,
+where `k` is the number of bits used in the hash function.

 ### Key Types

@@ -409,10 +405,6 @@ needed, the operating system disk cache transparently enhances performance.
 Further, when the data is known to fit in memory, caching, as described above,
 provides even better performance.

-Note: Implementing a loader for XDF files should be straightforward. To
-implement a saver, the XDF format will likely need to be extended to support
-vector-valued columns, and perhaps metadata encoding.
-
 ### Randomization

 Some training algorithms benefit from randomizing the order of rows produced

docs/code/IDataViewImplementation.md

Lines changed: 55 additions & 29 deletions
@@ -73,7 +73,7 @@ result that if a pipeline was composed in some other fashion, there would be
 some error.

 The only thing you can really assume is that an `IDataView` behaves "sanely"
-according to the contracts of the `IDataView` interface, so that future TLC
+according to the contracts of the `IDataView` interface, so that future ML.NET
 developers can form some reasonable expectations of how your code behaves, and
 also have a prayer of knowing how to maintain the code. It is hard enough to
 write software correctly even when the code you're working with actually does
@@ -166,8 +166,8 @@ has the following problems:
 * **Every** call had to verify that the column was active,
 * **Every** call had to verify that `TValue` was of the right type,
 * When these were part of, say, a transform in a chain (as they often are,
-  considering how common transforms are used by TLC's users) each access would
-  be accompanied by a virtual method call to the upstream cursor's
+  considering how commonly transforms are used by ML.NET's users) each access
+  would be accompanied by a virtual method call to the upstream cursor's
   `GetColumnValue`.

 In contrast, consider the situation with these getter delegates. The
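To make the contrast concrete, below is a hedged sketch of the getter-delegate consumption pattern this hunk discusses. It assumes signatures of this era of the codebase (`ISchema.TryGetColumnIndex`, `IDataView.GetRowCursor`, `IRow.GetGetter<TValue>`, `VBuffer<T>`) and the `Microsoft.ML.Runtime.Data` namespace; treat the exact names as approximate rather than authoritative.

```csharp
using System;
using Microsoft.ML.Runtime.Data; // namespace assumed for this era of the codebase

static class GetterPatternSketch
{
    // Count the stored (non-default) entries of the "Features" column across all rows.
    // The active-column and TValue checks happen once, when GetGetter is called,
    // rather than on every row as with a per-row GetColumnValue-style API.
    static long CountFeatureEntries(IDataView data)
    {
        int col;
        if (!data.Schema.TryGetColumnIndex("Features", out col))
            throw new InvalidOperationException("Expected a Features column.");

        long total = 0;
        // Activate only the one column we need.
        using (var cursor = data.GetRowCursor(c => c == col))
        {
            var getter = cursor.GetGetter<VBuffer<float>>(col); // type/activity verified here, once
            var value = default(VBuffer<float>);
            while (cursor.MoveNext())
            {
                getter(ref value); // cheap delegate call per row, no upstream virtual dispatch
                total += value.Count;
            }
        }
        return total;
    }
}
```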
@@ -211,14 +211,14 @@ consuming different data from the contemporaneous cursor? There are many
 examples of this throughout the codebase.

 Nevertheless: in very specific circumstances we have relaxed this. For
-example, the TLC API serves up corrupt `IDataView` implementations that have
-their underlying data change, since reconstituting a data pipeline on fresh
-data is at the present moment too resource intensive. Nonetheless, this is
-wrong: for example, the `TrainingCursorBase` and related subclasses rely upon
-the data not changing. Since, however, that is used for *training* and the
-prediction engines of the API as used for *scoring*, we accept these. However
-this is not, strictly speaking, correct, and this sort of corruption of
-`IDataView` should only be considered as a last resort, and only when some
+example, some ML.NET API code serves up corrupt `IDataView` implementations
+that have their underlying data change, since reconstituting a data pipeline
+on fresh data is at the present moment too resource intensive. Nonetheless,
+this is wrong: for example, the `TrainingCursorBase` and related subclasses
+rely upon the data not changing. Since, however, that is used for *training*
+and the prediction engines of the API are used for *scoring*, we accept these.
+However this is not, strictly speaking, correct, and this sort of corruption
+of `IDataView` should only be considered as a last resort, and only when some
 great good can be accomplished through this. We certainly did not accept this
 corruption lightly!

@@ -265,19 +265,19 @@ same data view.) So some rules:
 ## Versioning

 This requirement for consistency of a data model often has implications across
-versions of TLC, and our requirements for data model backwards compatibility.
-As time has passed, we often feel like it would make sense if a transform
-behaved *differently*, that is, if it organized or calculated its output in a
-different way than it currently does. For example, suppose we wanted to switch
-the hash transform to something a bit more efficient than murmur hashes, for
-example. If we did so, presumably the same input values would map to different
-outputs. We are free to do so, of course, yet: when we deserialize a hash
-transform from before we made this change, that hash transform should continue
-to output values as it did, before we made that change. (This, of course,
-assuming that the transform was released as part of a "blessed" non-preview
-point release of TLC. We can, and have, broken backwards compatibility for
-something that has not yet been incorporated in any sort of blessed release,
-though we prefer to not.)
+versions of ML.NET, and our requirements for data model backwards
+compatibility. As time has passed, we often feel like it would make sense if a
+transform behaved *differently*, that is, if it organized or calculated its
+output in a different way than it currently does. For example, suppose we
+wanted to switch the hash transform to something a bit more efficient than
+murmur hashes. If we did so, presumably the same input values would map to
+different outputs. We are free to do so, of course, yet: when we deserialize a
+hash transform from before we made this change, that hash transform should
+continue to output values as it did before we made that change. (This, of
+course, assumes that the transform was released as part of a "blessed"
+non-preview point release of ML.NET. We can, and have, broken backwards
+compatibility for something that has not yet been incorporated in any sort of
+blessed release, though we prefer not to.)

 ## What is Not Functionally Identical

@@ -334,10 +334,9 @@ aside (which we can hardly help), we expect the models to be the same.

 # On Loaders, Data Models, and Empty `IMultiStreamSource`s

-When you run TLC you have the option of specifying not only *one* data input,
-but any number of data input files, including zero. :) This is how [the
-examples here](../public/command/DataCommands.md#look-ma-no-files) work. But
-there's also a more general principle at work here: when deserializing a data
+When you create a loader you have the option of specifying not only *one* data
+input, but any number of data input files, including zero. But there's also a
+more general principle at work here with zero files: when deserializing a data
 loader from a data model with an `IMultiStreamSource` with `Count == 0` (e.g.,
 as would be constructed with `new MultiFileSource(null)`), we have a protocol
 that *every* `IDataLoader` should work in that circumstance, and merely be a
@@ -472,7 +471,34 @@ indication that this function will not move the cursor (in which case `IRow`
 is helpful), or that will not access any values (in which case `ICursor` is
 helpful).

-# Metadata
+# Schema
+
+The schema contains information about the columns. As we see in [the design
+principles](IDataViewDesignPrinciples.md), it has index, data type, and
+optional metadata.
+
+While *programmatic* accesses to an `IDataView` are by index, from a user's
+perspective the indices are by name; most training algorithms conceptually
+train on the `Features` column (under default settings). For this reason
+nearly all usages of an `IDataView` will be prefixed with a call to the
+schema's `TryGetColumnIndex`.
+
+Regarding name hiding, the principles mention that when multiple columns have
+the same name, other columns are "hidden." The convention all implementations
+of `ISchema` obey is that the column with the *largest* index wins. Note however
+that this is merely convention, not part of the definition of `ISchema`.
+
+Implementations of `TryGetColumnIndex` should be O(1), that is, practically,
+this mapping ought to be backed with a dictionary in most cases. (There are
+obvious exceptions like, say, things like `LineLoader` which produce exactly
+one column. There, a simple equality test suffices.)
+
+It is best if `GetColumnType` returns the *same* object every time. That is,
+things like key-types and vector-types, when returned, should not be created
+in the function itself (thereby creating a new object every time), but rather
+stored somewhere and returned.
+
+## Metadata

 Since metadata is *optional*, one is not obligated to necessarily produce it,
 or conform to any particular schemas for any particular kinds (beyond, say,
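As an illustration of the last two points in the new Schema section above, here is a toy sketch (not a real component, and not the full `ISchema` member list) of the lookup state such an implementation typically keeps: a name-to-index dictionary built once so `TryGetColumnIndex` is O(1), and type objects created once so `GetColumnType` returns the same instance on every call. `object` stands in for the real column-type class, which is a simplification.

```csharp
using System.Collections.Generic;

// A sketch of the two habits above, divorced from the full ISchema member list:
// the name-to-index map is built once so TryGetColumnIndex is O(1), and the
// column-type objects are created once and reused so GetColumnType returns the
// same instance on every call.
sealed class SchemaLookupSketch
{
    private readonly object[] _types; // built once in the constructor, never recreated
    private readonly Dictionary<string, int> _nameToIndex;

    public SchemaLookupSketch(string[] names, object[] types)
    {
        _types = types;
        _nameToIndex = new Dictionary<string, int>();
        // Later columns overwrite earlier ones, so a duplicated name maps to the
        // largest index, matching the name-hiding convention described above.
        for (int i = 0; i < names.Length; i++)
            _nameToIndex[names[i]] = i;
    }

    public bool TryGetColumnIndex(string name, out int col)
        => _nameToIndex.TryGetValue(name, out col);

    public object GetColumnType(int col) => _types[col]; // same object every time
}
```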

docs/code/IDataViewTypeSystem.md

Lines changed: 1 addition & 2 deletions
@@ -16,8 +16,7 @@ the specific interface is written using fixed pitch font as `IDataView`.

 IDataView is the data pipeline machinery for ML.NET. The ML.NET codebase has
 an extensive library of IDataView related components (loaders, transforms,
-savers, trainers, predictors, etc.). The team is actively working on many
-more.
+savers, trainers, predictors, etc.). More are being worked on.

 The name IDataView was inspired from the database world, where the term table
 typically indicates a mutable body of data, while a view is the result of a
