Skip to content

[SPARK-17356][SQL][1.6] Fix out of memory issue when generating JSON for TreeNode #14973

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

clockfly
Copy link
Contributor

@clockfly clockfly commented Sep 6, 2016

This is a backport of PR #14915 to branch 1.6.

What changes were proposed in this pull request?

class org.apache.spark.sql.types.Metadata is widely used in mllib to store some ml attributes. Metadata is commonly stored in Alias expression.

case class Alias(child: Expression, name: String)(
    val exprId: ExprId = NamedExpression.newExprId,
    val qualifier: Option[String] = None,
    val explicitMetadata: Option[Metadata] = None,
    override val isGenerated: java.lang.Boolean = false)

The Metadata can take a big memory footprint since the number of attributes is big ( in scale of million). When toJSON is called on Alias expression, the Metadata will also be converted to a big JSON string.
If a plan contains many such kind of Alias expressions, it may trigger out of memory error when toJSON is called, since converting all Metadata references to JSON will take huge memory.

With this PR, we will skip scanning Metadata when doing JSON conversion. For a reproducer of the OOM, and analysis, please look at jira https://issues.apache.org/jira/browse/SPARK-17356.

How was this patch tested?

Existing tests.

@hvanhovell
Copy link
Contributor

Dumb question, but aren't we using this in MLLib?

@clockfly
Copy link
Contributor Author

clockfly commented Sep 6, 2016

@hvanhovell, the meta data is still kept in the plan. MLLib doesn't use toJson directly.

@SparkQA
Copy link

SparkQA commented Sep 6, 2016

Test build #64992 has finished for PR 14973 at commit b0b4b9e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Sep 6, 2016
…for TreeNode

This is a backport of PR #14915 to branch 1.6.

## What changes were proposed in this pull request?

class `org.apache.spark.sql.types.Metadata` is widely used in mllib to store some ml attributes. `Metadata` is commonly stored in `Alias` expression.

```
case class Alias(child: Expression, name: String)(
    val exprId: ExprId = NamedExpression.newExprId,
    val qualifier: Option[String] = None,
    val explicitMetadata: Option[Metadata] = None,
    override val isGenerated: java.lang.Boolean = false)
```

The `Metadata` can take a big memory footprint since the number of attributes is big ( in scale of million). When `toJSON` is called on `Alias` expression, the `Metadata` will also be converted to a big JSON string.
If a plan contains many such kind of `Alias` expressions, it may trigger out of memory error when `toJSON` is called, since converting all `Metadata` references to JSON will take huge memory.

With this PR, we will skip scanning Metadata when doing JSON conversion. For a reproducer of the OOM, and analysis, please look at jira https://issues.apache.org/jira/browse/SPARK-17356.

## How was this patch tested?

Existing tests.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #14973 from clockfly/json_oom_1.6.
@cloud-fan
Copy link
Contributor

LGTM, merging to 1.6!

@yhuai
Copy link
Contributor

yhuai commented Sep 6, 2016

Thanks!

@clockfly clockfly closed this Sep 7, 2016
zzcclp pushed a commit to zzcclp/spark that referenced this pull request Sep 7, 2016
…for TreeNode

This is a backport of PR apache#14915 to branch 1.6.

## What changes were proposed in this pull request?

class `org.apache.spark.sql.types.Metadata` is widely used in mllib to store some ml attributes. `Metadata` is commonly stored in `Alias` expression.

```
case class Alias(child: Expression, name: String)(
    val exprId: ExprId = NamedExpression.newExprId,
    val qualifier: Option[String] = None,
    val explicitMetadata: Option[Metadata] = None,
    override val isGenerated: java.lang.Boolean = false)
```

The `Metadata` can take a big memory footprint since the number of attributes is big ( in scale of million). When `toJSON` is called on `Alias` expression, the `Metadata` will also be converted to a big JSON string.
If a plan contains many such kind of `Alias` expressions, it may trigger out of memory error when `toJSON` is called, since converting all `Metadata` references to JSON will take huge memory.

With this PR, we will skip scanning Metadata when doing JSON conversion. For a reproducer of the OOM, and analysis, please look at jira https://issues.apache.org/jira/browse/SPARK-17356.

## How was this patch tested?

Existing tests.

Author: Sean Zhong <seanzhong@databricks.com>

Closes apache#14973 from clockfly/json_oom_1.6.

(cherry picked from commit e6480a6)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants