[SPARK-17356][SQL][1.6] Fix out of memory issue when generating JSON for TreeNode #14973

clockfly · 2016-09-06T10:12:56Z

This is a backport of PR #14915 to branch 1.6.

What changes were proposed in this pull request?

The class org.apache.spark.sql.types.Metadata is widely used in MLlib to store ML attributes. Metadata is commonly stored in Alias expressions.

case class Alias(child: Expression, name: String)(
    val exprId: ExprId = NamedExpression.newExprId,
    val qualifier: Option[String] = None,
    val explicitMetadata: Option[Metadata] = None,
    override val isGenerated: java.lang.Boolean = false)

The Metadata can have a large memory footprint because the number of attributes can be very large (on the order of millions). When toJSON is called on an Alias expression, its Metadata is also converted to a large JSON string.
If a plan contains many such Alias expressions, calling toJSON may trigger an out-of-memory error, since converting every Metadata reference to JSON takes a huge amount of memory.

With this PR, we skip Metadata when doing JSON conversion. For a reproducer of the OOM and an analysis, see the JIRA issue: https://issues.apache.org/jira/browse/SPARK-17356.
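
To illustrate the idea of skipping Metadata during JSON conversion, here is a minimal sketch. It is not the actual patch: Node, Leaf, TreeJson, and the simplified Alias and Metadata below are hypothetical stand-ins, not the real Spark classes, and writing an empty object in place of the metadata is an assumption made for the example.

```scala
// Hypothetical, simplified stand-ins for illustration; NOT the real Spark classes.
final case class Metadata(attrs: Map[String, String])

sealed trait Node
final case class Leaf(name: String) extends Node
final case class Alias(child: Node, name: String, explicitMetadata: Option[Metadata])
  extends Node

object TreeJson {
  // Minimal JSON string escaping (backslash and double quote only).
  private def quote(s: String): String =
    "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\""

  // Serialize a node. The Metadata payload is never traversed: an empty
  // placeholder (an assumption of this sketch) is written instead, so the
  // output size does not depend on how large the metadata is.
  def toJson(node: Node): String = node match {
    case Leaf(name) =>
      "{\"class\":\"Leaf\",\"name\":" + quote(name) + "}"
    case Alias(child, name, _) =>
      "{\"class\":\"Alias\",\"name\":" + quote(name) +
        ",\"child\":" + toJson(child) +
        ",\"explicitMetadata\":{}}"
  }
}
```

In this sketch the Alias node keeps its Metadata in memory, so anything that reads it from the plan is unaffected; only the JSON rendering leaves it out.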

How was this patch tested?

Existing tests.

hvanhovell · 2016-09-06T10:37:50Z

Dumb question, but aren't we using this in MLLib?

clockfly · 2016-09-06T10:40:35Z

@hvanhovell, the metadata is still kept in the plan. MLlib doesn't use toJSON directly.

SparkQA · 2016-09-06T11:46:16Z

Test build #64992 has finished for PR 14973 at commit b0b4b9e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

Author: Sean Zhong <seanzhong@databricks.com>
Closes #14973 from clockfly/json_oom_1.6.

cloud-fan · 2016-09-06T12:08:15Z

LGTM, merging to 1.6!

yhuai · 2016-09-06T17:25:39Z

Thanks!

(cherry picked from commit e6480a6)

clockfly force-pushed the json_oom_1.6 branch from 0b22ec6 to b0b4b9e on September 6, 2016 10:14

OOM

b0b4b9e

clockfly closed this Sep 7, 2016
