
[SPARK-18372][SQL][Branch-1.6].Staging directory fail to be removed #15819


Closed
wants to merge 7 commits into from

Conversation

@merlintang commented Nov 9, 2016

What changes were proposed in this pull request?

This fix is related to the bug https://issues.apache.org/jira/browse/SPARK-18372 .
insertIntoHiveTable generates a .staging directory, but this directory fails to be removed at the end.

This is a backport from the Spark 2.0.x code and is related to PR #12770.

How was this patch tested?

manual tests

Author: Mingjie Tang mtang@hortonworks.com

@rxin (Contributor) commented Nov 9, 2016

Can you add some documentation? The current code is very difficult to follow.

@cloud-fan (Contributor)

Do you have a unit test to reproduce this bug?

@merlintang (Author)

Actually, I do not have a unit test, but the code below (the same as posted in the JIRA) can reproduce this bug:

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("CREATE TABLE IF NOT EXISTS T1 (key INT, value STRING)")
sqlContext.sql("LOAD DATA LOCAL INPATH '../examples/src/main/resources/kv1.txt' INTO TABLE T1")
sqlContext.sql("CREATE TABLE IF NOT EXISTS T2 (key INT, value STRING)")
val sparktestdf = sqlContext.table("T1")
val dfw = sparktestdf.write
dfw.insertInto("T2")
val sparktestcopypydfdf = sqlContext.sql("SELECT * from T2")
sparktestcopypydfdf.show

Our customers, and we ourselves, have also manually reproduced this bug on Spark 1.6.x and 1.5.x.

As for a unit test: because we do not know how to locate the Hive directory for the related table inside a test case, we cannot verify the computed directory at the end.

The solution is to reuse three functions from 2.0.2 to create the staging directory, which fixes this bug.
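The backported approach can be sketched roughly as follows. This is a simplified, hypothetical illustration using the local filesystem; the actual backport goes through Hadoop's FileSystem API, and the helper names here are only indicative:

```scala
import java.io.File
import java.text.SimpleDateFormat
import java.util.{Date, Random}

object StagingDirSketch {
  // Unique execution id, mirroring Hive's directory-naming scheme.
  def executionId: String = {
    val rand = new Random
    val format = new SimpleDateFormat("yyyy-MM-dd_HH-mm-ss_SSS")
    "hive_" + format.format(new Date) + "_" + Math.abs(rand.nextLong)
  }

  // Create the staging directory inside Spark (instead of via Hive's
  // SessionState) and register it for deletion when the JVM exits, so
  // cleanup no longer depends on a Hive session close that Spark never triggers.
  def createStagingDir(parent: File): File = {
    val dir = new File(parent, ".hive-staging_" + executionId)
    if (!dir.mkdirs()) {
      throw new RuntimeException(s"Cannot create staging directory '$dir'")
    }
    dir.deleteOnExit()
    dir
  }
}
```

The key difference from the 1.6 behavior is the deleteOnExit registration: Hive only removed the directory on session close, which Spark does not trigger.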


val rand: Random = new Random
val format: SimpleDateFormat = new SimpleDateFormat("yyyy-MM-dd_HH-mm-ss_SSS")
val executionId: String = "hive_" + format.format(new Date) + "_" + Math.abs(rand.nextLong)
return executionId
@fidato13 (Contributor) commented Nov 11, 2016

Can the return statement in the Scala code be removed, please?

@merlintang (Author)

Hi @fidato13, this is OK as-is, since this part of the code is reused from Spark 2.0.2.

@fidato13 (Contributor)

@merlintang Can we take this opportunity to fix this in the other places as well? Adding a return statement at the end of a simple method with no complex control flow reads like Java-style coding. See the Scala style guide for reference:
https://github.com/databricks/scala-style-guide#return-statements
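For illustration, the style-guide point is that a Scala method's last expression is already its result, so the explicit return (and redundant type annotations) can simply be dropped. A minimal before/after sketch (the method body here is a stand-in, not the PR's actual code):

```scala
// Java-style: explicit return and a redundant type annotation
def executionIdVerbose: String = {
  val id: String = "hive_" + System.nanoTime()
  return id
}

// Idiomatic Scala: the last expression is the result
def executionIdIdiomatic: String =
  "hive_" + System.nanoTime()
```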

@merlintang (Author)

Thanks, I will fix it.

@fidato13 (Contributor)

Cheers.

"Cannot create staging directory '" + dir.toString + "': " + e.getMessage, e)

}
return dir
@fidato13 (Contributor) commented Nov 11, 2016

Can the return statement in the Scala code be removed, please?

@merlintang (Author)

Thanks for your comment; I will update this and push it again.

@fidato13 (Contributor)

Thanks.

@merlintang (Author)

@cloud-fan @rxin Can you review this code? Several customers are complaining about the Hive-generated empty staging files in HDFS.

@srowen (Member) left a comment

I don't quite see how this removes the staging dir. Just the deleteOnExit? Does it need this complexity then?

private def executionId: String = {
val rand: Random = new Random
val format: SimpleDateFormat = new SimpleDateFormat("yyyy-MM-dd_HH-mm-ss_SSS")
val executionId: String = "hive_" + format.format(new Date) + "_" + Math.abs(rand.nextLong)
Member

Why all this -- just use a UUID? You also have a redundant return and redundant types here.
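For comparison, the UUID-based alternative suggested here might look like the sketch below; the PR deliberately keeps the 2.0.x form instead, so this is illustrative only:

```scala
import java.util.UUID

// A UUID already guarantees uniqueness, so no Random,
// SimpleDateFormat, or Math.abs plumbing is needed.
def executionId: String = "hive_" + UUID.randomUUID().toString
```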

Author

Yes, it is. I did it this way because I want the code to be exactly the same as the Spark 2.0.x version.

}
catch {
case e: IOException =>
throw new RuntimeException(
Member

Don't use RuntimeException; why even handle this?

Author

The reason we use this code: (1) the old version needs the Hive package to create the staging directory; in the Hive code, that staging directory is stored in a hash map, and the staging directories are removed when the session is closed. However, our Spark code does not trigger the Hive session close, so these directories are never removed. (2) The pushed code simulates the Hive way of creating the staging directory inside Spark rather than relying on Hive, so the staging directory does get removed. (3) I will fix the return type issue. Thanks for your comments, @srowen.

Member

Almost all the code in this PR is copied from the existing master; this PR is just for branch-1.6.

@merlintang (Author) commented Dec 4, 2016 via email

@gatorsmile (Member)

@merlintang Could you please add [Branch-1.6] in your PR title?

@merlintang merlintang changed the title [SPARK-18372][SQL].Staging directory fail to be removed [SPARK-18372][SQL][Branch-1.6].Staging directory fail to be removed Dec 4, 2016
@merlintang (Author) commented Dec 4, 2016 via email

@cloud-fan (Contributor)

OK, so the problem becomes: do we want to backport this to 1.6? cc @rxin

@rxin (Contributor) commented Dec 5, 2016

If it is a bug fix and low risk, sure.

@merlintang (Author) commented Dec 5, 2016 via email

@rxin (Contributor) commented Dec 5, 2016

We have stopped making new releases for 1.5, so it makes no sense to backport.

@merlintang (Author) commented Dec 5, 2016 via email

@cloud-fan (Contributor)

OK @merlintang, can you find out which PR added this code to 2.0? Then other people can know what we are backporting in this PR.

@merlintang (Author)

@cloud-fan This is related to this PR in 2.0.x: #12770

@lichenglin

I'm using Spark 2.0.2, and I got a really big hive-staging folder.
May I delete the folder manually? Does it have any bad effect on the warehouse?

@gatorsmile (Member) commented Dec 6, 2016

@lichenglin Could you post the layout of that staging folder?

@lichenglin

Here is some output from du -h --max-depth=1 .:

3.3G ./.hive-staging_hive_2016-12-06_18-17-48_899_1400956608265117052-5
13G  ./.hive-staging_hive_2016-12-06_15-43-35_928_6647980494630196053-5
8.6G ./.hive-staging_hive_2016-12-06_17-05-51_951_8422682528744006964-5
9.7G ./.hive-staging_hive_2016-12-06_17-14-44_748_6947381677226271245-5
9.2G ./day=2016-12-01
8.5G ./day=2016-11-19

Every day I run a SQL statement like insert overwrite db.table partition(day='2016-12-06') select * from tmpview, and each run creates a hive-staging folder.

Can I delete the folders manually?
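As a hedged illustration only (this is not a supported Spark tool, and deleting is only safe while no insert job is writing to the table), stale staging directories can be found and removed by matching Hive's naming pattern:

```scala
import java.io.File

object StagingCleanupSketch {
  // Recursively delete a file or directory.
  def deleteRecursively(f: File): Unit = {
    Option(f.listFiles()).getOrElse(Array.empty[File]).foreach(deleteRecursively)
    f.delete()
  }

  // Remove directories under `tableDir` whose names match Hive's
  // ".hive-staging_" prefix; returns the directories that were deleted.
  def cleanStagingDirs(tableDir: File): Seq[File] = {
    val stale = Option(tableDir.listFiles())
      .getOrElse(Array.empty[File])
      .filter(f => f.isDirectory && f.getName.startsWith(".hive-staging_"))
      .toSeq
    stale.foreach(deleteRecursively)
    stale
  }
}
```

On HDFS the equivalent would be removing the matching paths with hadoop fs -rm -r; partition directories like day=2016-12-01 hold real data and must not be touched.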

@merlintang (Author) commented Dec 7, 2016 via email

@lichenglin

In fact, I'm using Zeppelin to run SQL. When I restart the Spark interpreter, the folders are deleted. Thanks a lot.

@gatorsmile (Member)

@lichenglin Another PR, #16134, is trying to delete the staging directory and the temporary data files (which are pretty big in your case) after each insert.

@merlintang (Author)

@gatorsmile What is the status of this patch? This is backported code; can you merge it into 1.6.x? More than one user is running into this issue on Spark 1.6.x.

@gatorsmile (Member)

The current fix does not resolve the issue when users hit abnormal termination of the JVM. In addition, if the JVM does not stop, these temporary files could consume a lot of space. Thus, I think #16134 needs to be added too.

This is just my opinion; we also need feedback from the other committers.

@cloud-fan (Contributor)

Yea, I think we should backport a complete staging-dir cleanup functionality to 1.6. Let's wait for #16134.

@merlintang (Author) commented Dec 13, 2016 via email

@gatorsmile (Member)

retest this please

val rand: Random = new Random
val format: SimpleDateFormat = new SimpleDateFormat("yyyy-MM-dd_HH-mm-ss_SSS")
val executionId: String = "hive_" + format.format(new Date) + "_" + Math.abs(rand.nextLong)
return executionId
Member

Please remove the return?

Author

done.

|location '${tmpDir.toURI.toString}'
""".stripMargin)

sqlContext.sql("CREATE TABLE tbl AS SELECT 1 AS a")
Member

You can create a temporary view instead of creating another table.

Author

Is the temporary view supported in 1.6.x? I tried using the HiveContext to create the view, but it does not work. Since this is a small test case, the created table here should be OK. Please advise. Thanks so much, Tao.

Member

In 1.6, the function is registerTempTable. The name was changed to "temp view" in 2.0.

@merlintang (Author) commented Jan 3, 2017

Thanks Xiao, I have created a DataFrame and then called registerTempTable as follows:

val df = sqlContext.createDataFrame((1 to 2).map(i => (i, "a"))).toDF("key", "value")
df.select("value").repartition(1).registerTempTable("tbl")

It works, but it looks clumsy. What do you think?

Member

How about the following line?

Seq((1, "a")).toDF("key", "value").registerTempTable("tbl")

BTW, I am Xiao Li. : )

Member

You just want one column. Then you can do it with:

Seq(Tuple1("a")).toDF("value").registerTempTable("tbl")

Author

Sorry, Xiao; one of my best friends is named Tao. :) It is updated. Thanks again.

@SparkQA commented Jan 2, 2017

Test build #70785 has finished for PR 15819 at commit 8648a46.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@merlintang (Author)

@gatorsmile Can you retest the patch so we can merge? Sorry to ping you multiple times; several users are asking about this.

@gatorsmile (Member)

retest this please

withTable("tab", "tbl") {
sqlContext.sql(
s"""
|CREATE TABLE tab(c1 string)
Member

Nit: two spaces -> one space

Author

Thanks, it is updated.

val rand: Random = new Random
val format: SimpleDateFormat = new SimpleDateFormat("yyyy-MM-dd_HH-mm-ss_SSS")
val executionId: String = "hive_" + format.format(new Date) + "_" + Math.abs(rand.nextLong)
executionId
Member

Nit: an indent issue. Please remove one more space.

Author

Done! Thanks, Xiao.

@SparkQA commented Jan 5, 2017

Test build #70907 has finished for PR 15819 at commit 15da7a8.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

|location '${tmpDir.toURI.toString}'
""".stripMargin)

import sqlContext.implicits._
Member

Nit: move this import to line 231.

Author

Done

@gatorsmile (Member)

retest this please

@SparkQA commented Jan 5, 2017

Test build #70908 has finished for PR 15819 at commit 15da7a8.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member)

Weird... Not sure why the build failed. The build works in my local environment. cc @srowen @JoshRosen

@gatorsmile (Member)

retest this please

@SparkQA commented Jan 6, 2017

Test build #70964 has finished for PR 15819 at commit 4f26b28.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

test(s"$version: Delete the temporary staging directory and files after each insert") {
import sqlContext.implicits._
Member

Let's roll back to the way you did it in the last run, instead of using the temp table. I am not sure whether this triggered the build issue.

Author

Thanks, Xiao. I have reverted that and tested locally.

@gatorsmile (Member)

retest this please

@gatorsmile (Member)

LGTM pending test

@SparkQA commented Jan 6, 2017

Test build #70990 has finished for PR 15819 at commit ab5e369.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Jan 7, 2017
(The commit message repeats the PR description above.)

Author: Mingjie Tang <mtang@hortonworks.com>
Author: Mingjie Tang <mtang@HW12398.local>

Closes #15819 from merlintang/branch-1.6.
@gatorsmile (Member)

Thanks! Merging to 1.6

@gatorsmile (Member)

@merlintang Can you close this PR? Thanks!

@merlintang (Author)

Many thanks, Xiao. I learned a lot.

@merlintang merlintang closed this Jan 7, 2017
zzcclp pushed a commit to zzcclp/spark that referenced this pull request Jan 7, 2017
(The commit message repeats the PR description above.)

Closes apache#15819 from merlintang/branch-1.6.

(cherry picked from commit 2303887)
dosoft pushed a commit to WANdisco/spark that referenced this pull request Jan 24, 2017
(The commit message repeats the PR description above.)

Closes apache#15819 from merlintang/branch-1.6.