Direct API: RoleMappedSchema/Data Cleanup, Improvement #445

Closed

Description

Another followup to #371, in which we discuss changes to RoleMappedSchema and RoleMappedData to make them less idiosyncratic.

RoleMappedSchema and RoleMappedData are structures that solve the following problem: once you create a pipeline, before you feed it to an ITrainer or similar structure, you must have some mechanism to communicate to consumers of that pipeline what all the columns were actually for, e.g., which column(s) were feature columns, which the label, and so on. (Before this structure existed, our "solution" was that every component consumed an IDataView directly and had configurable options for someone to declare which was which. This was good in that each trainer had the chance to be explicit about what it wanted, but having to tell absolutely every component we wanted to use, "OK, these are still the feature columns," became troublesome and a source of user error. So a structure to make this assignment more "sticky" was invented.)

So that is all fine, more or less. And I'd say on the whole it is a pretty good class, insofar as it seems to have worked well for its purpose. However, there are wrinkles we probably ought to clean up.

Nearly all architecture effort went into making it easy to consume, as opposed to being easy or sensible to create. Previously, this made sense, since it was only instantiated in a handful of places, and used in hundreds of places. With API usage, the situation is reversed: we expect everyone to create it, and there will be "only" hundreds of consumers.

  • On that subject, creation is somewhat odd: there are Create and CreateOpt methods, as opposed to how most people would imagine an object is created, through an actual constructor (maybe with a bool opt = false parameter.)

  • "Reapplication" of an existing role-mapping to new data is a common operation performed in the code-base, yet there is no convenience for it, and it's something we'd want people to be able to do relatively easily (e.g., when applying caching).

  • The convenience helpers for the most common cases of creating RoleMappedData (e.g., "these are my features, these are my labels") live in a TrainUtils class. This makes them impossible to discover unless you know where to look. Probably the easiest place to discover these conveniences would be on the classes themselves. (This would also be a start at cleaning up TrainUtils, which is basically a haphazard bag of vaguely useful things.)

  • General cleanup of the code. A relic from a bygone time, Id, was never removed, despite being irrelevant and no longer used. It's been years since it was replaced with ids on IRowCursor directly. Also, a fair amount of code exists in the class to detect conditions that cannot possibly happen.
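
To make the bullets above concrete, here is a minimal sketch of what the proposed creation surface might look like. It is not the actual ML.NET code: DataSketch, RoleMappedSchemaSketch, RoleMappedDataSketch, GetColumns, GetColumnRoleNames, and the "Label"/"Feature" role strings are all hypothetical stand-ins chosen for illustration, and roles are plain strings rather than whatever type the real classes use. It only shows the three ideas from the list: a real constructor with a bool opt = false parameter, a label/feature convenience on the class itself, and a constructor that reapplies an existing mapping to new data.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for IDataView: only the column names matter here.
public sealed class DataSketch
{
    public IReadOnlyList<string> Columns { get; }
    public DataSketch(params string[] columns) { Columns = columns; }
}

// Hypothetical sketch of a role-to-column mapping with an ordinary constructor.
public sealed class RoleMappedSchemaSketch
{
    private readonly Dictionary<string, List<string>> _map =
        new Dictionary<string, List<string>>();

    // opt = false: a role naming a missing column throws.
    // opt = true: such a role is silently skipped.
    public RoleMappedSchemaSketch(DataSketch data,
        IEnumerable<KeyValuePair<string, string>> roles, bool opt = false)
    {
        foreach (var kv in roles)
        {
            if (!data.Columns.Contains(kv.Value))
            {
                if (opt)
                    continue;
                throw new ArgumentException(
                    $"Column '{kv.Value}' for role '{kv.Key}' not found.");
            }
            if (!_map.TryGetValue(kv.Key, out var cols))
                _map[kv.Key] = cols = new List<string>();
            cols.Add(kv.Value);
        }
    }

    // Columns assigned to a role, or an empty list if the role is unmapped.
    public IReadOnlyList<string> GetColumns(string role) =>
        _map.TryGetValue(role, out var cols)
            ? (IReadOnlyList<string>)cols
            : Array.Empty<string>();

    // Flattens the mapping back into (role, column) pairs for reuse.
    public IEnumerable<KeyValuePair<string, string>> GetColumnRoleNames() =>
        _map.SelectMany(kv =>
            kv.Value.Select(c => new KeyValuePair<string, string>(kv.Key, c)));
}

// Hypothetical sketch of data paired with its role mapping.
public sealed class RoleMappedDataSketch
{
    public DataSketch Data { get; }
    public RoleMappedSchemaSketch Schema { get; }

    // General constructor: explicit (role, column) pairs.
    public RoleMappedDataSketch(DataSketch data,
        IEnumerable<KeyValuePair<string, string>> roles, bool opt = false)
    {
        Data = data;
        Schema = new RoleMappedSchemaSketch(data, roles, opt);
    }

    // Convenience for the most common case, discoverable on the class itself.
    public RoleMappedDataSketch(DataSketch data, string label, string feature,
        bool opt = false)
        : this(data, new[]
        {
            new KeyValuePair<string, string>("Label", label),
            new KeyValuePair<string, string>("Feature", feature),
        }, opt)
    {
    }

    // "Reapplication": carry an existing role mapping over to new data.
    public RoleMappedDataSketch(DataSketch newData,
        RoleMappedSchemaSketch existing, bool opt = false)
        : this(newData, existing.GetColumnRoleNames(), opt)
    {
    }
}
```

Under these assumptions, the caching scenario would read as new RoleMappedDataSketch(cachedData, original.Schema), with no trip through a separate utility class; whether the real classes should validate eagerly in the constructor, as this sketch does, is a design choice the issue leaves open.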
