[SPARK-17549][sql] Only collect table size stat in driver for cached relation. #15112
Conversation
The existing code caches all stats for all columns for each partition in the driver; for a large relation, this causes extreme memory usage, which leads to GC hell and application failures.

It seems that only the size in bytes of the data is actually used in the driver, so instead just collect that. In executors, the full stats are still kept, but that's not a big problem; we expect the data to be distributed and thus not incur too much memory pressure in each individual executor.

There are also potential improvements on the executor side, since the data being stored currently is very wasteful (e.g. storing boxed types vs. primitive types for stats). But that's a separate issue.

On a mildly related change, I'm also adding code to catch exceptions in the code generator, since Janino was breaking with the test data I tried this patch on.

Tested with unit tests and by doing a count on a very wide table (20k columns) with many partitions.
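The driver-side change the description outlines can be illustrated with a small stand-alone sketch (hypothetical names, not the actual Spark classes): instead of retaining a stats object per column per partition on the driver, only a single running byte count is kept.

```scala
// Hypothetical sketch of the memory-footprint difference; these are
// illustrative stand-ins, not the real Spark classes.

// Before: one of these per column per partition, held on the driver.
// With 20k columns and many partitions this is enormous.
case class ColumnStats(min: Any, max: Any, sizeInBytes: Long)

// After: a single accumulated long on the driver.
final class SizeAccumulator {
  private var total: Long = 0L
  def add(partitionSizeInBytes: Long): Unit = total += partitionSizeInBytes
  def value: Long = total
}

object SizeAccumulatorDemo {
  def main(args: Array[String]): Unit = {
    val acc = new SizeAccumulator
    // Each partition reports only its byte size, not per-column stats.
    Seq(1024L, 2048L, 512L).foreach(acc.add)
    assert(acc.value == 3584L)
    println(acc.value)
  }
}
```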
I haven't touched this part for a long time. I think we also use min/max to evaluate predicates. Can you double check? Also, what stats do we collect right now?

The current code collects […]. On the driver side, all that seems to be done is to sum up the sizes of each column to provide a […].

I see that the stats are used in […].

Thanks! I will take a look. BTW, if you have a comparison of the memory footprint before and after the change, it would be good to add that to the description.

That's all in the bug.
if (a.getClass.getName == codeAttr.getName) {
  CodegenMetrics.METRIC_GENERATED_METHOD_BYTECODE_SIZE.update(
    codeAttrField.get(a).asInstanceOf[Array[Byte]].length)
}
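The snippet above records generated-method bytecode sizes into a metrics histogram; the surrounding discussion is about wrapping that best-effort instrumentation so a failure inside it (e.g. a Janino internals change) cannot abort the query. A minimal sketch of that guard pattern, with hypothetical names for the metrics sink and logger:

```scala
import scala.util.control.NonFatal

object MetricsGuard {
  // Hypothetical stand-ins for the real metrics sink and logger.
  private var recordedBytecodeSize: Option[Int] = None
  private var lastWarning: Option[String] = None

  // Run a best-effort metrics update; never let it fail the caller.
  def updateMetricsSafely(body: => Int): Unit =
    try {
      recordedBytecodeSize = Some(body)
    } catch {
      case NonFatal(e) =>
        lastWarning = Some(s"Error calculating stats of compiled class: ${e.getMessage}")
    }

  def main(args: Array[String]): Unit = {
    // A failing metrics computation is swallowed and logged...
    updateMetricsSafely(throw new RuntimeException("janino internals changed"))
    assert(recordedBytecodeSize.isEmpty)
    assert(lastWarning.isDefined)
    // ...while a successful one is recorded normally.
    updateMetricsSafely(42)
    assert(recordedBytecodeSize.contains(42))
    println("ok")
  }
}
```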
When will this block fail?
See bug.
OK. Thanks. Seems it is a separate issue. Let's not change this part.
Oops, I forgot to send the above comment...
But without this fix you can't read the table I described in the bug at all, because SQL just blows up. If you want a separate patch just for that ok, but seems like overkill to me.
My worry is that we will just forget about this issue if we only make it log a warning. Removing this try/catch will not fail any existing tests, right? We can create a new JIRA to fix this issue for Spark 2.0.
How about my suggestion of adding the workaround and filing a bug? Then there's no worry about forgetting anything.
Because it's most probably a Janino bug, fixing it might not be as simple as just making some change in Spark.
OK. Seems this part is used to record some metrics. I guess it is fine. But, let me ping @ericl who added this method to double check.
In any case I filed SPARK-17565 to track the actual fix. This is just a workaround so Spark doesn't fail.
This seems ok to me. We have a unit test for the metric here so it isn't likely to break entirely without notice.
// Before: driver materializes a stats row per cached batch and sums sizes.
val sizeInBytes =
  batchStats.value.asScala.map(row => sizeOfRow.eval(row).asInstanceOf[Long]).sum
Statistics(sizeInBytes = sizeInBytes)

// After: driver reads a single long accumulator.
Statistics(sizeInBytes = batchStats.value.longValue)
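The shape of the new code can be sketched without Spark by using a plain `AtomicLong` as a stand-in for the long accumulator (hypothetical names; the real code uses Spark's accumulator API):

```scala
import java.util.concurrent.atomic.AtomicLong

// Stand-in for the accumulator backing batchStats: executors add each
// cached batch's byte size; the driver only ever reads one long.
object BatchStatsDemo {
  def main(args: Array[String]): Unit = {
    val batchStats = new AtomicLong(0L)
    // "Executor" side: each cached batch contributes its size in bytes.
    Seq(4096L, 4096L, 1024L).foreach(batchStats.addAndGet)
    // "Driver" side: the whole statistic is one long, not a row per batch.
    val sizeInBytes = batchStats.longValue
    assert(sizeInBytes == 9216L)
    println(sizeInBytes)
  }
}
```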
Can you double check whether we have a test to make sure the total size is correct?
Given that I changed the stat and all tests still passed locally, I doubt we have one... I'll take a look once I find some time to get back to this patch.
Test build #65449 has finished for PR 15112 at commit
// Check that the right size was calculated.
assert(cached.batchStats.value === expectedAnswer.size * INT.defaultSize)
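The assertion above checks that the accumulated size equals the row count times the column type's default size. The arithmetic can be mirrored in isolation (hypothetical helper; assumes a single INT column at 4 bytes per value, matching Catalyst's `IntegerType.defaultSize`):

```scala
object CachedSizeCheck {
  // Default on-heap size of one INT value, as in Catalyst's IntegerType.
  val IntDefaultSize = 4

  // Expected cached size for a single INT column: rows * 4 bytes.
  def expectedSize(rowCount: Int): Long = rowCount.toLong * IntDefaultSize

  def main(args: Array[String]): Unit = {
    assert(expectedSize(10) == 40L)
    println(expectedSize(10))
  }
}
```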
Thanks!
Test build #65459 has finished for PR 15112 at commit
@yhuai any more comments? I really want to keep the codegen metrics change, because otherwise Spark just fails on the large table I tested on. We can file a separate bug to look at the Janino issue and point at this one (and the data attached to the bug) as the source of the issue.
} catch {
  case e: Exception =>
NonFatal(e)?
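The reviewer's suggestion is to match with Scala's `NonFatal` extractor instead of `case e: Exception`: `NonFatal` matches most throwables but deliberately refuses fatal ones (e.g. `VirtualMachineError` such as `OutOfMemoryError`), which should propagate rather than be swallowed. A minimal illustration:

```scala
import scala.util.control.NonFatal

object NonFatalDemo {
  def classify(t: Throwable): String = t match {
    case NonFatal(_) => "non-fatal" // safe to log and continue
    case _           => "fatal"     // should propagate (OOM, etc.)
  }

  def main(args: Array[String]): Unit = {
    assert(classify(new RuntimeException("boom")) == "non-fatal")
    assert(classify(new OutOfMemoryError("heap")) == "fatal")
    println("ok")
  }
}
```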
Test build #65501 has finished for PR 15112 at commit
LGTM. Merging to master and branch 2.0.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #15112 from vanzin/SPARK-17549.
(cherry picked from commit 39e2bad)
Signed-off-by: Yin Huai <yhuai@databricks.com>