-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-23045][ML][SparkR] Update RFormula to use OneHotEncoderEstimator. #20229
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need to get it to kick off R tests - could you touch one of the files under R/?
also please update PR to include [SPARKR]
Actually I think this can be separated in two different PRs. One for OneHotEncoderEstimator and one for VectorSizeHint. |
// being used as reference category, we will not drop any category for that feature. | ||
if (!hasIntercept && !keepReferenceCategory) { | ||
keepReferenceCategory = true | ||
encoderStages += new OneHotEncoderEstimator(uid) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As OneHotEncoderEstimator
can handle multi-column now, can't we just have one OneHotEncoderEstimator
to fit for all terms at once?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see. There is one OneHotEncoderEstimator
for dropLast=false
.
Test build #85950 has finished for PR 20229 at commit
|
@@ -199,6 +199,7 @@ class RFormula @Since("1.5.0") (@Since("1.5.0") override val uid: String) | |||
val parsedFormula = RFormulaParser.parse($(formula)) | |||
val resolvedFormula = parsedFormula.resolve(dataset.schema) | |||
val encoderStages = ArrayBuffer[PipelineStage]() | |||
val oneHotEncodeStages = ArrayBuffer[(String, String)]() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oneHotEncodeColumns
Test build #85957 has finished for PR 20229 at commit
|
Test build #85958 has finished for PR 20229 at commit
|
Test build #85978 has finished for PR 20229 at commit
|
Test build #85982 has finished for PR 20229 at commit
|
encoderStages += new OneHotEncoderEstimator(uid) | ||
.setInputCols(Array(indexed(term))) | ||
.setOutputCols(Array(encodedCol)) | ||
.setDropLast(false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here can optimize. You can merge this multiple (probable) OHEs into one. like:
define:
val oneHotEncodeColumnsNotDropLast = ArrayBuffer[(String, String)]()
and:
if (!hasIntercept && !keepReferenceCategory) {
oneHotEncodeColumnsNotDropLast += indexed(term) -> encodedCol
} else {
oneHotEncodeColumns += indexed(term) -> encodedCol
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is at most 1 encoder with dropLast(false)
, the next line sets keepReferenceCategory = true
to ensure we won't take this code path for the remaining columns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. thanks!
R/pkg/R/mllib_utils.R
Outdated
@@ -130,3 +130,4 @@ read.ml <- function(path) { | |||
stop("Unsupported model: ", jobj) | |||
} | |||
} | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once we make sure R tests are passed, I think we can revert this change back.
@@ -25,7 +25,7 @@ import org.apache.hadoop.fs.Path | |||
import org.apache.spark.annotation.{Experimental, Since} | |||
import org.apache.spark.ml.{Estimator, Model, Pipeline, PipelineModel, PipelineStage, Transformer} | |||
import org.apache.spark.ml.attribute.AttributeGroup | |||
import org.apache.spark.ml.linalg.VectorUDT | |||
import org.apache.spark.ml.linalg.{Vector, VectorUDT} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Vector
is an unused import here.
import org.apache.spark.ml.attribute._ | ||
import org.apache.spark.ml.linalg.Vectors | ||
import org.apache.spark.ml.linalg.{Vector, Vectors} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with minor comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
R tests passed
I've rebased on master, I think that should resolve the the issues @viirya raised. |
Test build #86133 has finished for PR 20229 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
This should go into branch-2.3 to fix the deprecation warning from the old OneHotEncoder, so I'll merge it with master and branch-2.3. Thanks @MrBago and everyone who reviewed! |
## What changes were proposed in this pull request? RFormula should use VectorSizeHint & OneHotEncoderEstimator in its pipeline to avoid using the deprecated OneHotEncoder & to ensure the model produced can be used in streaming. ## How was this patch tested? Unit tests. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Bago Amirbekian <bago@databricks.com> Closes #20229 from MrBago/rFormula. (cherry picked from commit 4371466) Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
What changes were proposed in this pull request?
RFormula should use VectorSizeHint & OneHotEncoderEstimator in its pipeline to avoid using the deprecated OneHotEncoder & to ensure the model produced can be used in streaming.
How was this patch tested?
Unit tests.
Please review http://spark.apache.org/contributing.html before opening a pull request.