Opened on Jun 28, 2018
Another followup to #371, in which we discuss changes to `RoleMappedSchema` and `RoleMappedData` to make them less idiosyncratic.
`RoleMappedSchema` and `RoleMappedData` are structures that solve the following problem: once you create a pipeline, before you feed it to an `ITrainer` or similar structure, you must have some mechanism to communicate to consumers of that pipeline what all the columns were actually for, e.g., which column(s) were feature columns, which the label, and so on. (Before this structure existed, our "solution" was that every component consumed an `IDataView` directly and had configurable options for someone to declare which was which. This was good in that each trainer had the chance to be explicit about what it wanted, but having to tell absolutely every component we wanted to use, "OK, these are still the feature columns," became tedious and a source of user error.) So a structure to make this assignment more "sticky" was invented.
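
To make the contrast concrete, here is a minimal sketch of the two styles. Everything in it is a hypothetical stand-in (`SketchData`, `SketchRoleMappedData`, the role names, the `Train*` methods), not the actual types in the code-base, and it simplifies to one column per role even though a role can in principle cover several columns.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a data view; only the column names matter here.
public sealed class SketchData
{
    public IReadOnlyList<string> Columns { get; }
    public SketchData(params string[] columns) => Columns = columns;
}

// Hypothetical wrapper: pairs the data with a role -> column-name assignment,
// so the assignment travels with the data instead of being repeated per component.
public sealed class SketchRoleMappedData
{
    public SketchData Data { get; }
    public IReadOnlyDictionary<string, string> Roles { get; }

    public SketchRoleMappedData(SketchData data, IReadOnlyDictionary<string, string> roles)
    {
        Data = data;
        Roles = roles;
    }
}

public static class Program
{
    // Old style: every component has its own options and is told the column names again.
    static string TrainExplicit(SketchData data, string featureColumn, string labelColumn)
        => $"features={featureColumn}, label={labelColumn}";

    // "Sticky" style: the component reads the roles off the wrapper.
    static string TrainRoleMapped(SketchRoleMappedData data)
        => $"features={data.Roles["Feature"]}, label={data.Roles["Label"]}";

    public static void Main()
    {
        var data = new SketchData("Features", "Label", "Name");
        Console.WriteLine(TrainExplicit(data, "Features", "Label"));

        var roles = new Dictionary<string, string>
        {
            ["Feature"] = "Features",
            ["Label"] = "Label",
        };
        var mapped = new SketchRoleMappedData(data, roles);
        Console.WriteLine(TrainRoleMapped(mapped));
    }
}
```

In the second style the caller declares the roles once, and any number of downstream consumers can pick them up without further configuration.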
So that is all fine, more or less. And I'd say on the whole it is a pretty good class, insofar as it seems to have worked well for its purpose. However, there are wrinkles we probably ought to clean up.
- Nearly all of the architecture effort went into making it easy to consume, as opposed to easy or sensible to create. Previously this made sense, since it was instantiated in only a handful of places and used in hundreds. With API usage the situation is reversed: we expect everyone to create it, and there will be "only" hundreds of consumers.
- On that subject, creation is somewhat odd: there are `Create` and `CreateOpt` methods, as opposed to how most people would imagine an object is created, through an actual constructor (maybe with a `bool opt = false` parameter).
- "Reapplication" of an existing role-mapping to new data is a common operation in the code-base, yet there is no convenience for it, and it's something we'd want people to be able to do relatively easily (e.g., when applying caching).
- The convenience helpers for the most common cases of creating `RoleMappedData` (e.g., "these are my features, these are my labels") live in a `TrainUtils` class. This makes them impossible to discover unless you know where to look. Probably the easiest place to discover these conveniences would be on the classes themselves. (This would also be a start at cleaning up `TrainUtils`, which is basically a haphazard bag of vaguely useful things.)
- General cleanup of the code. A relic from a bygone time, `Id`, was never removed, despite being irrelevant and no longer used. It's been years since it was replaced with ids on `IRowCursor` directly. Also, a fair amount of code exists in the class to detect conditions that cannot possibly happen.
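
As a rough illustration of where the points above could lead, here is a sketch of a class with an ordinary constructor, a convenience factory on the class itself, and a reapplication helper. All names (`SketchSchema`, `SketchRoleMappedSchema`, `ForFeaturesAndLabel`, `Reapply`) are hypothetical and not a proposal for the final API. It also assumes that `opt = true` means "skip roles whose column is missing" and `opt = false` means "throw"; that reading of `opt` is an assumption of this sketch.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a schema: just a set of column names.
public sealed class SketchSchema
{
    private readonly HashSet<string> _columns;
    public SketchSchema(params string[] columns) => _columns = new HashSet<string>(columns);
    public bool HasColumn(string name) => _columns.Contains(name);
}

// Hypothetical role-mapped schema, for illustration only.
public sealed class SketchRoleMappedSchema
{
    public SketchSchema Schema { get; }
    public IReadOnlyList<KeyValuePair<string, string>> Roles { get; }

    // A plain constructor in place of separate Create/CreateOpt methods.
    // Assumption: opt = true silently drops roles whose column is missing;
    // opt = false throws on the first missing column.
    public SketchRoleMappedSchema(SketchSchema schema,
        IEnumerable<KeyValuePair<string, string>> roles, bool opt = false)
    {
        Schema = schema ?? throw new ArgumentNullException(nameof(schema));
        if (roles == null)
            throw new ArgumentNullException(nameof(roles));

        var kept = new List<KeyValuePair<string, string>>();
        foreach (var role in roles)
        {
            if (schema.HasColumn(role.Value))
                kept.Add(role);
            else if (!opt)
                throw new ArgumentException(
                    $"Column '{role.Value}' not found for role '{role.Key}'.");
        }
        Roles = kept;
    }

    // Convenience for the common "these are my features, this is my label" case,
    // placed on the class itself so it is easy to discover.
    public static SketchRoleMappedSchema ForFeaturesAndLabel(
        SketchSchema schema, string features, string label, bool opt = false)
        => new SketchRoleMappedSchema(schema, new[]
        {
            new KeyValuePair<string, string>("Feature", features),
            new KeyValuePair<string, string>("Label", label),
        }, opt);

    // "Reapplication": carry the same role assignment over to a new schema,
    // for example the schema of a cached copy of the data.
    public SketchRoleMappedSchema Reapply(SketchSchema newSchema, bool opt = false)
        => new SketchRoleMappedSchema(newSchema, Roles, opt);
}

public static class Program
{
    public static void Main()
    {
        var schema = new SketchSchema("Features", "Label");
        var mapped = SketchRoleMappedSchema.ForFeaturesAndLabel(schema, "Features", "Label");

        // Same roles, applied to a different (hypothetical) schema.
        var cached = new SketchSchema("Features", "Label", "CacheId");
        var reapplied = mapped.Reapply(cached);
        Console.WriteLine(reapplied.Roles.Count);
    }
}
```

The sketch works at the schema level for brevity; the same shape would presumably carry over to the data-level class.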