Skip to content

[SPARK-5885][MLLIB] Add VectorAssembler as a feature transformer #5196

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 9 commits into from

Conversation

mengxr
Copy link
Contributor

@mengxr mengxr commented Mar 26, 2015

VectorAssembler merges multiple columns into a vector column. This PR contains content from #5195.

carry ML attributes (moved to a follow-up PR)

@SparkQA
Copy link

SparkQA commented Mar 26, 2015

Test build #29199 has finished for PR 5196 at commit 29fb6ac.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class VectorAssembler extends Transformer with HasInputCols with HasOutputCol
    • case class CreateStruct(children: Seq[Expression]) extends Expression

override lazy val resolved: Boolean = childrenResolved

override def dataType: StructType = {
assert(resolved, s"CreateStruct is called with unresolved children: $children.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message is confusing. CreateStruct can be called with unresolved children (and you actually did this here), but CreateStruct.dataType can't, because child types may be unknown.

@liancheng
Copy link
Contributor

The build failure was caused by codegen. When the ExpressionEvaluationSuite.CreateStruct test case is executed separately, codegen is not enabled by default, and CreateStruct.dataType is not touched. But on Jenkins, some test suite executed earlier enabled codegen, then CreateStruct.dataType got called and threw exception because BoundReference used in the test case if not NamedExpression.

@liancheng
Copy link
Contributor

Actually, the failed test suites are GeneratedEvaluationSuite and GeneratedMutableEvaluationSuite, both of which just extend ExpressionEvaluationSuite.

* A feature transformer than merge multiple columns into a vector column.
*/
@AlphaComponent
class VectorAssembler extends Transformer with HasInputCols with HasOutputCol {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe call it FeatureUnion to keep the same semantic with sklearn

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. I like the idea of trying to keep naming and architecture as similar to sklearn as possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not exactly sklearn's FeatureUnion that takes transformers. VectorAssembler merges arbitrary columns. I'm a little worried that if we use the same name people would expect the same behavior.

@SparkQA
Copy link

SparkQA commented Apr 10, 2015

Test build #30018 has finished for PR 5196 at commit a52b101.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class VectorAssembler extends Transformer with HasInputCols with HasOutputCol
    • class SparseMatrix(Matrix):
  • This patch does not change any dependencies.

@mengxr
Copy link
Contributor Author

mengxr commented Apr 13, 2015

Merged into master. Will send a follow-up PR to handle attributes.

@asfgit asfgit closed this in 9294044 Apr 13, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants