SPARK-4588 [MLLIB] [WIP] Add API for feature attributes #4460
Conversation
Test build #27053 has started for PR 4460 at commit …

Test build #27053 has finished for PR 4460 at commit …

Test PASSed.
@srowen I'm thinking of connecting …

And we can build … I see that you made feature names fundamental. It might be hard to provide every feature a name, especially after a series of feature transformations. Essentially, we want to maintain a bi-map between feature indices and names, while some names might be missing. It would be better if we can use the feature index as the feature identifier. Given a …

The construct will look like …

And we maintain a map from feature name to feature indices internally. For each …

Then we can have … This is basically my brain dump ... we definitely need to revise it when we get into the details.
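A minimal sketch of the construct described above, using the PR's working names (`FeatureAttributes`, `Attribute`); the fields and the internal name-to-index map are assumptions for illustration, not settled API:

```scala
// Hypothetical sketch: feature indices are the primary identifiers,
// names are optional (they may be lost after transformations), and a
// name-to-index map is maintained internally as one side of the bi-map.

sealed trait FeatureType
case object Continuous extends FeatureType
case object Categorical extends FeatureType

case class Attribute(
    index: Int,                   // primary identifier
    name: Option[String] = None,  // optional; may be missing
    featureType: FeatureType = Continuous)

class FeatureAttributes(val attributes: Array[Attribute]) {
  // internal map from feature name to feature index
  private val nameToIndex: Map[String, Int] =
    attributes.flatMap(a => a.name.map(_ -> a.index)).toMap

  def apply(i: Int): Attribute = attributes(i)
  def indexOf(name: String): Option[Int] = nameToIndex.get(name)
}
```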
@mengxr Great, that helps me. I took another shot at implementing the above ideas.
Test build #27294 has started for PR 4460 at commit …

Test build #27294 has finished for PR 4460 at commit …

Test FAILed.
I like the current sketch but also want to think about it more. A few thoughts:

- I'm not quite clear on how the Array of Attributes in FeatureAttributes corresponds to the columns of the DataFrame. Is it one-to-one, or will Attributes be nested? (I'm basically thinking about groups of features, especially individual features grouped into vectors.)
- How will propagation of feature names work? Will we try to impose a standard, such as Transformers maintaining the same (or a modified) feature name whenever possible?
- By the way, do we want to call this "FeatureAttributes," or should we name it something like "ColumnAttributes" so it more obviously applies to other types of columns like labels, users, products, etc.?
- +1 for moving FeatureType from mllib.tree to attribute. It should be more general.
This is perhaps contained in @jkbradley's question, but how does this work with features that are represented with multiple entries in the feature vector, e.g. when we're doing a one-hot encoding? With a one-hot encoding, is each category its own feature, or can a feature span multiple indices in the vector?
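For concreteness, a hedged illustration of the two readings of this question (the feature name and both readings below are hypothetical, not the PR's API):

```scala
// Hypothetical: a categorical "color" feature with three categories
// (red, green, blue) one-hot encodes to three vector slots. The open
// question is whether each slot is its own feature, or one feature
// spans indices i..i+2.

// color = "green" encoded at vector indices i, i+1, i+2:
val oneHot = Array(0.0, 1.0, 0.0)

// Reading A: three independent binary attributes
//   color_red (i), color_green (i+1), color_blue (i+2)
// Reading B: a single attribute spanning indices i..i+2 that carries
//   the category names as metadata
```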
There are two types of …

+1 on @jkbradley's suggestion about not calling it …

We don't need to care about the …
Btw, for the feature type, besides continuous and categorical, do we want to make binary special? It could be treated as both continuous and categorical.
So is the idea that … Rename … You could imagine a more elaborate hierarchy of types: discrete is a special case of continuous, and ordinal is a special case of discrete. It's nice to have that expressiveness; it adds somewhat to the complexity for the caller and the code. Maybe you could argue that the schema should force an interpretation for the algorithm, but I kind of like it. The type objects would have methods like …
I wish we could use …
I think having a full hierarchy is a good idea. Could you list all of the types you want to include? Then we can check the complexity. Btw, I don't know whether we should have ML attributes attached to string columns. It seems to me that a string column should be mapped to an integer column first to become an ML column with attributes. Hopefully that reduces the complexity.
I'm OK with a type hierarchy as long as it stays simple (and doesn't turn into a type system parallel to the DataFrame system). To support any type of DataFrame (with Structs and Arrays), we'll need to support nesting for sure.
Call it … So if an … As for the hierarchy, that's all I can think of: ordinal extends discrete extends continuous; binary extends, well, discrete and categorical, I suppose.

Hm, I'd imagine most categorical features come in as strings. This feels like just the kind of thing a framework can accommodate if it has the type information. I don't think it's more or less complex to say that a string column can be categorical? It would take some work to inject a translation to integers where that's needed, but that's great if the framework can do that.
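A quick sketch of that hierarchy in code, purely illustrative (all trait names are assumptions):

```scala
// Hypothetical feature-type hierarchy as described above:
// ordinal extends discrete extends continuous; binary is both
// discrete and categorical.

trait FeatureType
trait Continuous extends FeatureType
trait Discrete extends Continuous  // discrete is a special case of continuous
trait Ordinal extends Discrete     // ordinal is a special case of discrete
trait Categorical extends FeatureType
trait Binary extends Discrete with Categorical // treated as both

case object ContinuousType extends Continuous
case object DiscreteType extends Discrete
case object OrdinalType extends Ordinal
case object CategoricalType extends Categorical
case object BinaryType extends Binary
```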
@srowen If we mark a string column categorical, it may be hard to answer how many categories it has without looking at the data. If a column is marked categorical, it should store the information about the categories in the metadata and store only integers/doubles in the column. The difference is that we cannot merge a string column with a double column into a vector column without first turning the string column into an integer/double column. So I feel that we should treat string columns and categorical columns separately. For the type hierarchy, if we add a type, let's discuss what algorithms would actually use it: …
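For illustration, here is roughly what storing the category information in column metadata could look like, using Spark SQL's `MetadataBuilder`; the metadata keys below are hypothetical, since the actual schema was still being designed in this PR:

```scala
import org.apache.spark.sql.types.{Metadata, MetadataBuilder}

// Hypothetical keys; not the PR's final metadata layout.
val meta: Metadata = new MetadataBuilder()
  .putString("type", "categorical")
  .putStringArray("values", Array("red", "green", "blue"))
  .build()

// The column itself stores only integer codes 0, 1, 2 into `values`,
// so numCategories = 3 is known without scanning the data.
val colorCodes = Seq(0, 2, 1, 0) // red, blue, green, red
```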
Maybe for the base … Not a concern at this stage, for …
Don't you always have to look at the data to determine how many unique values a column has, regardless of type? String and int are encodings, but attribute types like categorical and continuous are interpretations. Those seem orthogonal to me, and I thought …

From a user perspective, I'd be surprised if I had to encode categorical string values, as it seems like something a framework can easily do for me. There's nothing inherently strange about providing a string-valued column that is (necessarily) categorical, and in fact they often are strings in input. But there's no reason it couldn't encode the categorical values internally in a different way if needed. Are you just referring to the latter? Sure, anything can be done within the framework.

So why can't a vector-valued column contain a string -- is the issue that internally we can't have a single array-valued column with different data types? That makes sense. Vector-valued columns are usually used for things like a large number of indicator variables, or related counts, which are all of the same data type, and not typically to encode complex nested schemas containing different data types. But if you really wanted a vector-valued column containing continuous age and categorical gender, then yes, it seems the data representation limitations would demand they're of the same data type, and must be doubles, and that's fine. But does that mean you can't ever have a non-vector string-valued categorical feature?

I agree that discrete and ordinal don't come up. I don't think they're required as types, but they may allow optimizations. For example, there's no point in checking decision rules like >= 3.4, >= 3.5, >= 3.6 for a discrete feature; they're all the same rule. That optimization doesn't exist yet. I can't actually think of a realistic optimization that would depend on knowing a value is ordinal (1, 2, 3, ...), so I'd maybe drop that.

OK, I will get to work on changes discussed so far.
Not if we already have ML attributes saved together with the data or defined by users.
Conceptually, this is true. But adding restrictions would simplify our implementation. The restriction I proposed is that data stored in columns with ML attributes are values (Float/Double/Int/Long/Boolean/Vector), so algorithms and transformers don't need to handle special types. Consider a vector assembler that merges multiple columns into a vector column: if it needs to handle string columns, it needs to call some indexer to turn strings into indices and then merge them. This piece of code would probably appear in every algorithm and its unit tests. If we force users to turn string columns into numeric ones first, the implementation of the rest of the pipeline could be simplified, as sketched below.
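A minimal sketch of such a restricted assembler, assuming it only ever sees the value types listed above (the function name and shape are hypothetical, not the PR's API):

```scala
// Hypothetical vector assembler restricted to numeric/vector inputs.
// Because string columns are excluded, no indexing logic is needed here;
// users must run a string indexer first.
def assemble(row: Seq[Any]): Array[Double] = {
  row.flatMap {
    case d: Double  => Seq(d)
    case f: Float   => Seq(f.toDouble)
    case i: Int     => Seq(i.toDouble)
    case l: Long    => Seq(l.toDouble)
    case b: Boolean => Seq(if (b) 1.0 else 0.0)
    case v: Seq[_]  => v.map(_.asInstanceOf[Number].doubleValue)
    case other => throw new IllegalArgumentException(
      s"Unsupported type ${other.getClass}; index string columns first.")
  }.toArray
}
```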
scikit-learn has a clear separation of string values and numeric values. All string values must be encoded into categorical columns through transformers before calling ML algorithms, and all ML algorithms take a matrix …
For trees, if the features are integers and there is a split …
+1 for the feature assembler or some other algorithm handling munging and indexing as needed.
About categorical types for decision trees: There should ideally be a distinction between categorical types with arbitrary values and categorical types known to be in a range {0, 1, ..., numCategories-1}.
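One hedged way to express that distinction in attribute metadata (all names below are hypothetical):

```scala
// Hypothetical: distinguish categorical attributes whose values are
// known to be 0-based indices from those with arbitrary values.

sealed trait CategoricalDomain
// values are exactly {0, 1, ..., numCategories - 1}
case class ZeroBasedDomain(numCategories: Int) extends CategoricalDomain
// arbitrary (possibly sparse or unknown) set of values
case class ArbitraryDomain(values: Option[Array[Double]]) extends CategoricalDomain
```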
@mengxr Understood about being told the number of distinct values for a column by the caller/schema. I thought you were saying this was a difference between string/integer columns. Yes, the point of this value is that it can be fed in as metadata.

I think we are saying the same thing regarding strings. Yes, there's no point in every algorithm handling string types. It's fine if they only consume numeric types, and some separate optional transformation stage re-encodes strings if needed. That's a transformation stage the framework could provide, for users to drop in. Algorithms could indeed refuse to operate on string types. To me, that means the idea of a string-valued categorical column still has a place in the representation, since it exists at some stage of a pipeline; it's just that such a thing would never reach an algorithm as-is. Is that aligned with what you guys think? Let me get down to the code to see if this actually matters to the metadata. Right now nothing about this PR has to do with the underlying data type. As long as you aren't saying "string" is a type mutually exclusive with "categorical", then I think we're saying the same thing.

Yes, I think my tree example wasn't a good one; never mind. The algorithm is already going to choose only actual data values as split points. Hm, it might matter to algorithms that want to restrict themselves to discrete inputs, like multinomial naive Bayes? Even there, I don't think it's an example of requiring the distinction. I'm neutral about adding 'discrete' as a type, I suppose.

Interesting point: are numeric categorical values always assumed to be in {0, 1, 2, ..., n-1}? That strikes me as something that is often true, not always. A particular algorithm may require it, and may provide an optional transformation to put a feature into this form. It doesn't seem like a conceptually different type so much as an additional restriction on a categorical feature's representation? That is, is this maybe an additional bit of metadata? Hm. Let me make some code changes to reflect the discussion so far.
@mengxr Please feel free at any time to take this over if it's easier. It will become more efficient at some point for you to just finish it as you think best. I personally believe your vision sounds good and am more concerned with making it match the vision of the other code you're creating, and letting you get on with that.
Test build #27813 has started for PR 4460 at commit …

Test build #27813 has finished for PR 4460 at commit …

Test FAILed.
@jkbradley The issue with algorithms handling munging and indexing is the increased complexity. For example, if …
I agree: ML types should be applicable to any SQL type, as long as it makes sense.
I think we are saying the same thing. (There will be some mutually exclusive types, such as Strings not being continuous.)
Algorithms which want 0-based indices for categories (for efficient vector indexing) could handle the re-indexing themselves, but it would be nice to encode it in metadata for the benefit of ensemble algorithms (where you would only want to do the re-indexing once).
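A small sketch of doing the re-indexing once and keeping the mapping around for reuse, e.g. across ensemble members (a hypothetical helper, not the PR's API):

```scala
// Hypothetical: re-index arbitrary category values to 0-based indices
// once, and record the mapping (e.g. in metadata) so ensemble members
// can reuse it instead of re-indexing per tree.
def reindex(values: Array[Double]): (Map[Double, Int], Array[Int]) = {
  val mapping = values.distinct.sorted.zipWithIndex.toMap
  (mapping, values.map(mapping))
}

val (mapping, indexed) = reindex(Array(7.0, 3.0, 7.0, 11.0))
// mapping: Map(3.0 -> 0, 7.0 -> 1, 11.0 -> 2); indexed: Array(1, 0, 1, 2)
```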
@mengxr It won't really increase the complexity of the code much, since the same code could be re-used for all algorithms (with a few options for the feature types the algorithm can handle). The main issue is the API: …
I've argued for algorithms handling featurization before, but I can see the reasoning behind forcing users to know what they are doing. This discussion may not belong in this PR anyway, since this functionality could be added later on.
If we want to have attributes parallel to arbitrary nestings of SQL types, then it will need to be changed. I'm not clear right now what the consensus is on supporting nested schemas.
This continues the work in #4460 from srowen. The design doc is published on the JIRA page with some minor changes. Short description of ML attributes: https://github.com/apache/spark/pull/4925/files?diff=unified#diff-95e7f5060429f189460b44a3f8731a35R24 More details can be found in the design doc. srowen, could you help review this PR? There are many lines, but most of them are boilerplate code.

Author: Xiangrui Meng <meng@databricks.com>
Author: Sean Owen <sowen@cloudera.com>

Closes #4925 from mengxr/SPARK-4588-new and squashes the following commits:

71d1bd0 [Xiangrui Meng] add JavaDoc for package ml.attribute
617be40 [Xiangrui Meng] remove final; rename cardinality to numValues
393ffdc [Xiangrui Meng] forgot to include Java attribute group tests
b1aceef [Xiangrui Meng] more tests
e7ab467 [Xiangrui Meng] update ML attribute impl
7c944da [Sean Owen] Add FeatureType hierarchy and categorical cardinality
2a21d6d [Sean Owen] Initial draft of FeatureAttributes class
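For reference, a small usage sketch of the ML attribute API as it landed in the follow-up PR #4925 (package org.apache.spark.ml.attribute); the method names follow that PR (e.g. numValues replaced cardinality), but treat the details here as approximate rather than authoritative:

```scala
import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, NominalAttribute, NumericAttribute}

// Per-feature attributes: a continuous "age" and a categorical "gender".
val age: Attribute = NumericAttribute.defaultAttr.withName("age")
val gender: Attribute = NominalAttribute.defaultAttr
  .withName("gender")
  .withValues("male", "female") // categories stored as metadata; numValues = 2

// Group them as the attributes of a vector-valued "features" column.
val group = new AttributeGroup("features", Array(age, gender))

// The group serializes to SQL column metadata, so attributes travel
// with the DataFrame schema rather than with the data itself.
val metadata = group.toMetadata()
```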
@mengxr
This is an early check-in to see if this is in the right direction at all.