Skip to content

[SPARK-23045][ML][SparkR] Update RFormula to use OneHotEncoderEstimator. #20229

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

MrBago
Copy link
Contributor

@MrBago MrBago commented Jan 11, 2018

What changes were proposed in this pull request?

RFormula should use VectorSizeHint & OneHotEncoderEstimator in its pipeline to avoid using the deprecated OneHotEncoder & to ensure the model produced can be used in streaming.

How was this patch tested?

Unit tests.

Please review http://spark.apache.org/contributing.html before opening a pull request.

@MrBago MrBago changed the title Update RFormula to use VectorSizeHint & OneHotEncoderEstimator. [SPARK-23037][ML] Update RFormula to use VectorSizeHint & OneHotEncoderEstimator. Jan 11, 2018
Copy link
Member

@felixcheung felixcheung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to get it to kick off R tests - could you touch one of the files under R/?
also please update PR to include [SPARKR]

@viirya
Copy link
Member

viirya commented Jan 11, 2018

Actually I think this can be separated in two different PRs. One for OneHotEncoderEstimator and one for VectorSizeHint.

// being used as reference category, we will not drop any category for that feature.
if (!hasIntercept && !keepReferenceCategory) {
keepReferenceCategory = true
encoderStages += new OneHotEncoderEstimator(uid)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As OneHotEncoderEstimator can handle multi-column now, can't we just have one OneHotEncoderEstimator to fit for all terms at once?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see. There is one OneHotEncoderEstimator for dropLast=false.

@SparkQA
Copy link

SparkQA commented Jan 11, 2018

Test build #85950 has finished for PR 20229 at commit 7bad275.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -199,6 +199,7 @@ class RFormula @Since("1.5.0") (@Since("1.5.0") override val uid: String)
val parsedFormula = RFormulaParser.parse($(formula))
val resolvedFormula = parsedFormula.resolve(dataset.schema)
val encoderStages = ArrayBuffer[PipelineStage]()
val oneHotEncodeStages = ArrayBuffer[(String, String)]()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oneHotEncodeColumns

@SparkQA
Copy link

SparkQA commented Jan 11, 2018

Test build #85957 has finished for PR 20229 at commit f67ce68.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 11, 2018

Test build #85958 has finished for PR 20229 at commit 2f64147.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MrBago MrBago changed the title [SPARK-23037][ML] Update RFormula to use VectorSizeHint & OneHotEncoderEstimator. [SPARK-23045][ML] Update RFormula to use OneHotEncoderEstimator. Jan 11, 2018
@MrBago MrBago changed the title [SPARK-23045][ML] Update RFormula to use OneHotEncoderEstimator. [SPARK-23045][ML][SparkR] Update RFormula to use OneHotEncoderEstimator. Jan 11, 2018
@SparkQA
Copy link

SparkQA commented Jan 11, 2018

Test build #85978 has finished for PR 20229 at commit 17854de.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 11, 2018

Test build #85982 has finished for PR 20229 at commit 73fb60c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

encoderStages += new OneHotEncoderEstimator(uid)
.setInputCols(Array(indexed(term)))
.setOutputCols(Array(encodedCol))
.setDropLast(false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here can optimize. You can merge this multiple (probable) OHEs into one. like:
define:

val oneHotEncodeColumnsNotDropLast = ArrayBuffer[(String, String)]()

and:

if (!hasIntercept && !keepReferenceCategory) {
    oneHotEncodeColumnsNotDropLast += indexed(term) -> encodedCol
} else {
   oneHotEncodeColumns += indexed(term) -> encodedCol
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is at most 1 encoder with dropLast(false), the next line sets keepReferenceCategory = true to ensure we won't take this code path for the remaining columns.

Copy link
Contributor

@WeichenXu123 WeichenXu123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. thanks!

@@ -130,3 +130,4 @@ read.ml <- function(path) {
stop("Unsupported model: ", jobj)
}
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once we make sure R tests are passed, I think we can revert this change back.

@@ -25,7 +25,7 @@ import org.apache.hadoop.fs.Path
import org.apache.spark.annotation.{Experimental, Since}
import org.apache.spark.ml.{Estimator, Model, Pipeline, PipelineModel, PipelineStage, Transformer}
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.ml.linalg.VectorUDT
import org.apache.spark.ml.linalg.{Vector, VectorUDT}
Copy link
Member

@viirya viirya Jan 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Vector is an unused import here.

import org.apache.spark.ml.attribute._
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.linalg.{Vector, Vectors}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with minor comments.

Copy link
Member

@felixcheung felixcheung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

R tests passed

@MrBago
Copy link
Contributor Author

MrBago commented Jan 15, 2018

I've rebased on master, I think that should resolve the the issues @viirya raised.

@SparkQA
Copy link

SparkQA commented Jan 15, 2018

Test build #86133 has finished for PR 20229 at commit 68d9ba1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jkbradley
Copy link
Member

This should go into branch-2.3 to fix the deprecation warning from the old OneHotEncoder, so I'll merge it with master and branch-2.3. Thanks @MrBago and everyone who reviewed!

asfgit pushed a commit that referenced this pull request Jan 16, 2018
## What changes were proposed in this pull request?

RFormula should use VectorSizeHint & OneHotEncoderEstimator in its pipeline to avoid using the deprecated OneHotEncoder & to ensure the model produced can be used in streaming.

## How was this patch tested?

Unit tests.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Bago Amirbekian <bago@databricks.com>

Closes #20229 from MrBago/rFormula.

(cherry picked from commit 4371466)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
@asfgit asfgit closed this in 4371466 Jan 16, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants