SPARK-4588 [MLLIB] [WIP] Add API for feature attributes #4460
Conversation
Test build #27053 has started for PR 4460 at commit …

Test build #27053 has finished for PR 4460 at commit …

Test PASSed.
@srowen I'm thinking of connecting …

And we can build … I see that you made feature names fundamental. It might be hard to provide every feature a name, especially after a series of feature transformations. Essentially, we want to maintain a bi-map between feature indices and names, while some names might be missing. It would be better if we can use the feature index as the feature identifier. Given a …

The construct will look like …

And we maintain a map from feature name to feature indices internally. For each …

Then we can have … This is basically my brain dump ... we definitely need to revise it when we get into the details.
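A minimal sketch of the construct described above, using the PR's working names (`FeatureAttributes`, `Attribute`); the fields and the internal name-to-index map are assumptions for illustration, not settled API:

```scala
// Hypothetical sketch: feature indices are the primary identifiers,
// names are optional (they may be lost after transformations), and a
// name-to-index map is maintained internally as one side of the bi-map.

sealed trait FeatureType
case object Continuous extends FeatureType
case object Categorical extends FeatureType

case class Attribute(
    index: Int,                   // primary identifier
    name: Option[String] = None,  // optional; may be missing
    featureType: FeatureType = Continuous)

class FeatureAttributes(val attributes: Array[Attribute]) {
  // internal map from feature name to feature index
  private val nameToIndex: Map[String, Int] =
    attributes.flatMap(a => a.name.map(_ -> a.index)).toMap

  def apply(i: Int): Attribute = attributes(i)
  def indexOf(name: String): Option[Int] = nameToIndex.get(name)
}
```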
@mengxr Great, that helps me. I took another shot at implementing the above ideas.
Test build #27294 has started for PR 4460 at commit …

Test build #27294 has finished for PR 4460 at commit …

Test FAILed.
I like the current sketch but also want to think about it more. A few thoughts:

- I'm not quite clear on how the Array of Attributes in FeatureAttributes corresponds to the columns of the DataFrame. Is it one-to-one, or will Attributes be nested? (I'm basically thinking about groups of features, especially individual features grouped into vectors.)
- How will propagation of feature names work? Will we try to impose a standard, such as Transformers maintaining the same (or a modified) feature name whenever possible?
- By the way, do we want to call this "FeatureAttributes," or should we name it something like "ColumnAttributes" so it more obviously applies to other types of columns like labels, users, products, etc.?
- +1 for moving FeatureType from mllib.tree to attribute. It should be more general.
This is perhaps contained in @jkbradley's question, but how does this work with features that are represented with multiple entries in the feature vector, e.g. when we're doing a one-hot encoding? With a one-hot encoding, is each category its own feature, or can a feature span multiple indices in the vector?
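For concreteness, a hedged illustration of the two readings of this question (the feature name and both readings below are hypothetical, not the PR's API):

```scala
// Hypothetical: a categorical "color" feature with three categories
// (red, green, blue) one-hot encodes to three vector slots. The open
// question is whether each slot is its own feature, or one feature
// spans indices i..i+2.

// color = "green" encoded at vector indices i, i+1, i+2:
val oneHot = Array(0.0, 1.0, 0.0)

// Reading A: three independent binary attributes
//   color_red (i), color_green (i+1), color_blue (i+2)
// Reading B: a single attribute spanning indices i..i+2 that carries
//   the category names as metadata
```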
There are two types of …

+1 on @jkbradley's suggestion about not calling it …

We don't need to care about the …
Btw, for the feature type, besides continuous and categorical, do we want to make binary special? It could be treated as both continuous and categorical.
So is the idea that … Rename … You could imagine a more elaborate hierarchy of types: discrete is a special case of continuous, and ordinal is a special case of discrete. It's nice to have that expressiveness; it adds somewhat to the complexity for the caller and the code. Maybe you could argue that the schema should force an interpretation for the algorithm, but I kind of like it. The type objects would have methods like …
I wish we could use …
I think having a full hierarchy is a good idea. Could you list all of the types you want to include? Then we can check the complexity. Btw, I don't know whether we should have ML attributes attached to string columns. It seems to me that a string column should be mapped to an integer column first to become an ML column with attributes. Hopefully that reduces the complexity.
I'm OK with a type hierarchy as long as it stays simple (and doesn't turn into a type system parallel to the DataFrame system). To support any type of DataFrame (with Structs and Arrays), we'll need to support nesting for sure.
Call it … So if an … As for the hierarchy, that's all I can think of: ordinal extends discrete extends continuous; binary extends, well, discrete and categorical, I suppose.

Hm, I'd imagine most categorical features come in as strings. This feels like just the kind of thing a framework can accommodate if it has the type information. I don't think it's more or less complex to say that a string column can be categorical? It would take some work to inject a translation to integers where that's needed, but that's great if the framework can do that.
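A quick sketch of that hierarchy in code, purely illustrative (all trait names are assumptions):

```scala
// Hypothetical feature-type hierarchy as described above:
// ordinal extends discrete extends continuous; binary is both
// discrete and categorical.

trait FeatureType
trait Continuous extends FeatureType
trait Discrete extends Continuous  // discrete is a special case of continuous
trait Ordinal extends Discrete     // ordinal is a special case of discrete
trait Categorical extends FeatureType
trait Binary extends Discrete with Categorical // treated as both

case object ContinuousType extends Continuous
case object DiscreteType extends Discrete
case object OrdinalType extends Ordinal
case object CategoricalType extends Categorical
case object BinaryType extends Binary
```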
@srowen If we mark a string column categorical, it may be hard to answer how many categories it has without looking at the data. If a column is marked categorical, it should store the information about the categories in the metadata and store only integers/doubles in the column. The difference is that we cannot merge a string column with a double column into a vector column without first turning the string column into an integer/double column. So I feel that we should treat string columns and categorical columns separately. For the type hierarchy, if we add a type, let's discuss what algorithms would actually use it: …
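For illustration, here is roughly what storing the category information in column metadata could look like, using Spark SQL's `MetadataBuilder`; the metadata keys below are hypothetical, since the actual schema was still being designed in this PR:

```scala
import org.apache.spark.sql.types.{Metadata, MetadataBuilder}

// Hypothetical keys; not the PR's final metadata layout.
val meta: Metadata = new MetadataBuilder()
  .putString("type", "categorical")
  .putStringArray("values", Array("red", "green", "blue"))
  .build()

// The column itself stores only integer codes 0, 1, 2 into `values`,
// so numCategories = 3 is known without scanning the data.
val colorCodes = Seq(0, 2, 1, 0) // red, blue, green, red
```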
Maybe for the base … Not a concern at this stage, for …
Don't you always have to look at the data to determine how many unique values a column has, regardless of type? String and int are encodings, but attribute types like categorical and continuous are interpretations. Those seem orthogonal to me, and I thought …

From a user perspective, I'd be surprised if I had to encode categorical string values, as it seems like something a framework can easily do for me. There's nothing inherently strange about providing a string-valued column that is (necessarily) categorical, and in fact they often are strings in input. But there's no reason it couldn't encode the categorical values internally in a different way if needed. Are you just referring to the latter? Sure, anything can be done within the framework.

So why can't a vector-valued column contain a string -- is the issue that internally we can't have a single array-valued column with different data types? That makes sense. Vector-valued columns are usually used for things like a large number of indicator variables, or related counts, which are all of the same data type, and not typically to encode complex nested schemas containing different data types. But if you really wanted a vector-valued column containing continuous age and categorical gender, then yes, it seems the data representation limitations would demand they're of the same data type, and must be doubles, and that's fine. But does that mean you can't ever have a non-vector string-valued categorical feature?

I agree that discrete and ordinal don't come up. I don't think they're required as types, but they may allow optimizations. For example, there's no point in checking decision rules like >= 3.4, >= 3.5, >= 3.6 for a discrete feature; they're all the same rule. That optimization doesn't exist yet. I can't actually think of a realistic optimization that would depend on knowing a value is ordinal (1, 2, 3, ...), so I'd maybe drop that.

OK, I will get to work on changes discussed so far.
Not if we already have ML attributes saved together with the data or defined by users.
Conceptually, this is true. But adding restrictions would simplify our implementation. The restriction I proposed is that data stored in columns with ML attributes are values (Float/Double/Int/Long/Boolean/Vector), so algorithms and transformers don't need to handle special types. Consider a vector assembler that merges multiple columns into a vector column: if it needs to handle string columns, it needs to call some indexer to turn strings into indices and then merge them. This piece of code would probably appear in every algorithm and its unit tests. If we force users to turn string columns into numeric ones first, the implementation of the rest of the pipeline could be simplified, as sketched below.
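A minimal sketch of such a restricted assembler, assuming it only ever sees the value types listed above (the function name and shape are hypothetical, not the PR's API):

```scala
// Hypothetical vector assembler restricted to numeric/vector inputs.
// Because string columns are excluded, no indexing logic is needed here;
// users must run a string indexer first.
def assemble(row: Seq[Any]): Array[Double] = {
  row.flatMap {
    case d: Double  => Seq(d)
    case f: Float   => Seq(f.toDouble)
    case i: Int     => Seq(i.toDouble)
    case l: Long    => Seq(l.toDouble)
    case b: Boolean => Seq(if (b) 1.0 else 0.0)
    case v: Seq[_]  => v.map(_.asInstanceOf[Number].doubleValue)
    case other => throw new IllegalArgumentException(
      s"Unsupported type ${other.getClass}; index string columns first.")
  }.toArray
}
```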
scikit-learn has a clear separation of string values and numeric values. All string values must be encoded into categorical columns through transformers before calling ML algorithms, and all ML algorithms take a matrix …
For trees, if the features are integers and there is a split …
+1 for the feature assembler or some other algorithm handling munging and indexing as needed.
About categorical types for decision trees: There should ideally be a distinction between categorical types with arbitrary values and categorical types known to be in a range {0, 1, ..., numCategories-1}.
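One hedged way to express that distinction in attribute metadata (all names below are hypothetical):

```scala
// Hypothetical: distinguish categorical attributes whose values are
// known to be 0-based indices from those with arbitrary values.

sealed trait CategoricalDomain
// values are exactly {0, 1, ..., numCategories - 1}
case class ZeroBasedDomain(numCategories: Int) extends CategoricalDomain
// arbitrary (possibly sparse or unknown) set of values
case class ArbitraryDomain(values: Option[Array[Double]]) extends CategoricalDomain
```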
@mengxr Understood about being told the number of distinct values for a column by the caller/schema. I thought you were saying this was a difference between string/integer columns. Yes, the point of this value is that it can be fed in as metadata.

I think we are saying the same thing regarding strings. Yes, there's no point in every algorithm handling string types. It's fine if they only consume numeric types, and some separate optional transformation stage re-encodes strings if needed. That's a transformation stage the framework could provide, for users to drop in. Algorithms could indeed refuse to operate on string types. To me, that means the idea of a string-valued categorical column still has a place in the representation, since it exists at some stage of a pipeline; it's just that such a thing would never reach an algorithm as-is. Is that aligned with what you guys think? Let me get down to the code to see if this actually matters to the metadata. Right now nothing about this PR has to do with the underlying data type. As long as you aren't saying "string" is a type mutually exclusive with "categorical", then I think we're saying the same thing.

Yes, I think my tree example wasn't a good one; never mind. The algorithm is already going to choose only actual data values as split points. Hm, it might matter to algorithms that want to restrict themselves to discrete inputs, like multinomial naive Bayes? Even there, I don't think it's an example of requiring the distinction. I'm neutral about adding 'discrete' as a type, I suppose.

Interesting point: are numeric categorical values always assumed to be in {0, 1, 2, ..., n-1}? That strikes me as something that is often true, not always. A particular algorithm may require it, and may provide an optional transformation to put a feature into this form. It doesn't seem like a conceptually different type so much as an additional restriction on a categorical feature's representation? That is, is this maybe an additional bit of metadata? Hm. Let me make some code changes to reflect the discussion so far.
@mengxr Please feel free at any time to take this over if it's easier. It will become more efficient at some point for you to just finish it as you think best. I personally believe your vision sounds good and am more concerned with making it match the vision of the other code you're creating, and letting you get on with that.
Test build #27813 has started for PR 4460 at commit …

Test build #27813 has finished for PR 4460 at commit …

Test FAILed.
@jkbradley The issue with algorithms handling munging and indexing is the increased complexity. For example, if …
I agree: ML types should be applicable to any SQL type, as long as it makes sense.
I think we are saying the same thing. (There will be some mutually exclusive types, such as Strings not being continuous.)
Algorithms which want 0-based indices for categories (for efficient vector indexing) could handle the re-indexing themselves, but it would be nice to encode it in metadata for the benefit of ensemble algorithms (where you would only want to do the re-indexing once).
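A small sketch of doing the re-indexing once and keeping the mapping around for reuse, e.g. across ensemble members (a hypothetical helper, not the PR's API):

```scala
// Hypothetical: re-index arbitrary category values to 0-based indices
// once, and record the mapping (e.g. in metadata) so ensemble members
// can reuse it instead of re-indexing per tree.
def reindex(values: Array[Double]): (Map[Double, Int], Array[Int]) = {
  val mapping = values.distinct.sorted.zipWithIndex.toMap
  (mapping, values.map(mapping))
}

val (mapping, indexed) = reindex(Array(7.0, 3.0, 7.0, 11.0))
// mapping: Map(3.0 -> 0, 7.0 -> 1, 11.0 -> 2); indexed: Array(1, 0, 1, 2)
```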
@mengxr It won't really increase the complexity of the code much, since the same code could be re-used for all algorithms (with a few options for the feature types the algorithm can handle). The main issue is the API: …
I've argued for algorithms handling featurization before, but I can see the reasoning behind forcing users to know what they are doing. This discussion may not belong in this PR anyway, since this functionality could be added later on.
If we want to have attributes parallel to arbitrary nestings of SQL types, then it will need to be changed. I'm not clear right now what the consensus is on supporting nested schemas.
This continues the work in #4460 from srowen. The design doc is published on the JIRA page with some minor changes. Short description of ML attributes: https://github.com/apache/spark/pull/4925/files?diff=unified#diff-95e7f5060429f189460b44a3f8731a35R24 More details can be found in the design doc. srowen, could you help review this PR? There are many lines, but most of them are boilerplate code.

Author: Xiangrui Meng <meng@databricks.com>
Author: Sean Owen <sowen@cloudera.com>

Closes #4925 from mengxr/SPARK-4588-new and squashes the following commits:

71d1bd0 [Xiangrui Meng] add JavaDoc for package ml.attribute
617be40 [Xiangrui Meng] remove final; rename cardinality to numValues
393ffdc [Xiangrui Meng] forgot to include Java attribute group tests
b1aceef [Xiangrui Meng] more tests
e7ab467 [Xiangrui Meng] update ML attribute impl
7c944da [Sean Owen] Add FeatureType hierarchy and categorical cardinality
2a21d6d [Sean Owen] Initial draft of FeatureAttributes class
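For reference, a small usage sketch of the ML attribute API as it landed in the follow-up PR #4925 (package org.apache.spark.ml.attribute); the method names follow that PR (e.g. numValues replaced cardinality), but treat the details here as approximate rather than authoritative:

```scala
import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, NominalAttribute, NumericAttribute}

// Per-feature attributes: a continuous "age" and a categorical "gender".
val age: Attribute = NumericAttribute.defaultAttr.withName("age")
val gender: Attribute = NominalAttribute.defaultAttr
  .withName("gender")
  .withValues("male", "female") // categories stored as metadata; numValues = 2

// Group them as the attributes of a vector-valued "features" column.
val group = new AttributeGroup("features", Array(age, gender))

// The group serializes to SQL column metadata, so attributes travel
// with the DataFrame schema rather than with the data itself.
val metadata = group.toMetadata()
```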
@mengxr
This is an early check-in to see if this is in the right direction at all.