
[SPARK-18372][SQL][Branch-1.6].Staging directory fail to be removed #15819


Closed
wants to merge 7 commits into from

Conversation

@merlintang commented Nov 9, 2016

What changes were proposed in this pull request?

This fix is related to the bug https://issues.apache.org/jira/browse/SPARK-18372 .
insertIntoHiveTable generates a .staging directory, but this directory fails to be removed at the end.

This is a backport from the Spark 2.0.x code and is related to PR #12770.

How was this patch tested?

manual tests

Author: Mingjie Tang mtang@hortonworks.com

@rxin (Contributor) commented Nov 9, 2016

Can you add some documentation? The current code is very difficult to follow.

@cloud-fan (Contributor)

Do you have a unit test to reproduce this bug?

@merlintang (Author)

Actually, I do not have a unit test, but the code below (the same as posted in the JIRA) can reproduce this bug:

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("CREATE TABLE IF NOT EXISTS T1 (key INT, value STRING)")
sqlContext.sql("LOAD DATA LOCAL INPATH '../examples/src/main/resources/kv1.txt' INTO TABLE T1")
sqlContext.sql("CREATE TABLE IF NOT EXISTS T2 (key INT, value STRING)")
val sparktestdf = sqlContext.table("T1")
val dfw = sparktestdf.write
dfw.insertInto("T2")
val sparktestcopypydfdf = sqlContext.sql("SELECT * from T2")
sparktestcopypydfdf.show

Our customers, and we ourselves, have also manually reproduced this bug on Spark 1.6.x and 1.5.x.

As for a unit test: because we do not know how to locate the Hive directory for the related table inside a test case, we cannot verify the computed directory at the end.

The solution is to reuse three functions from 2.0.2 to create the staging directory, which fixes this bug.
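The backported approach can be sketched roughly as follows. This is a simplified, hypothetical illustration using the local filesystem; the actual backport goes through Hadoop's FileSystem API, and the helper names here are only indicative:

```scala
import java.io.File
import java.text.SimpleDateFormat
import java.util.{Date, Random}

object StagingDirSketch {
  // Unique execution id, mirroring Hive's directory-naming scheme.
  def executionId: String = {
    val rand = new Random
    val format = new SimpleDateFormat("yyyy-MM-dd_HH-mm-ss_SSS")
    "hive_" + format.format(new Date) + "_" + Math.abs(rand.nextLong)
  }

  // Create the staging directory inside Spark (instead of via Hive's
  // SessionState) and register it for deletion when the JVM exits, so
  // cleanup no longer depends on a Hive session close that Spark never triggers.
  def createStagingDir(parent: File): File = {
    val dir = new File(parent, ".hive-staging_" + executionId)
    if (!dir.mkdirs()) {
      throw new RuntimeException(s"Cannot create staging directory '$dir'")
    }
    dir.deleteOnExit()
    dir
  }
}
```

The key difference from the 1.6 behavior is the deleteOnExit registration: Hive only removed the directory on session close, which Spark does not trigger.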


val rand: Random = new Random
val format: SimpleDateFormat = new SimpleDateFormat("yyyy-MM-dd_HH-mm-ss_SSS")
val executionId: String = "hive_" + format.format(new Date) + "_" + Math.abs(rand.nextLong)
return executionId
@fidato13 (Contributor) commented Nov 11, 2016

Can the return statement in the Scala code be removed, please?

@merlintang (Author)

Hi @fidato13, this is OK as-is, since this part of the code is reused from Spark 2.0.2.

@fidato13 (Contributor)

@merlintang Can we take this opportunity to fix this in the other places as well? Adding a return statement at the end of a simple method with no complex control flow reads like Java-style coding. See the Scala style guide for reference:
https://github.com/databricks/scala-style-guide#return-statements
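For illustration, the style-guide point is that a Scala method's last expression is already its result, so the explicit return (and redundant type annotations) can simply be dropped. A minimal before/after sketch (the method body here is a stand-in, not the PR's actual code):

```scala
// Java-style: explicit return and a redundant type annotation
def executionIdVerbose: String = {
  val id: String = "hive_" + System.nanoTime()
  return id
}

// Idiomatic Scala: the last expression is the result
def executionIdIdiomatic: String =
  "hive_" + System.nanoTime()
```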

@merlintang (Author)

Thanks, I will fix it.

@fidato13 (Contributor)

Cheers.

"Cannot create staging directory '" + dir.toString + "': " + e.getMessage, e)

}
return dir
@fidato13 (Contributor) commented Nov 11, 2016

Can the return statement in the Scala code be removed, please?

@merlintang (Author)

Thanks for your comment; I will update this and push it again.

@fidato13 (Contributor)

Thanks.

@merlintang (Author)

@cloud-fan @rxin Can you review this code? Several customers are complaining about the Hive-generated empty staging files in HDFS.

@srowen (Member) left a comment

I don't quite see how this removes the staging dir. Just the deleteOnExit? Does it need this complexity then?

private def executionId: String = {
val rand: Random = new Random
val format: SimpleDateFormat = new SimpleDateFormat("yyyy-MM-dd_HH-mm-ss_SSS")
val executionId: String = "hive_" + format.format(new Date) + "_" + Math.abs(rand.nextLong)
Member

Why all this -- just use a UUID? You also have a redundant return and redundant types here.
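For comparison, the UUID-based alternative suggested here might look like the sketch below; the PR deliberately keeps the 2.0.x form instead, so this is illustrative only:

```scala
import java.util.UUID

// A UUID already guarantees uniqueness, so no Random,
// SimpleDateFormat, or Math.abs plumbing is needed.
def executionId: String = "hive_" + UUID.randomUUID().toString
```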

Author

Yes, it is. I did it this way because I want the code to be exactly the same as the Spark 2.0.x version.

}
catch {
case e: IOException =>
throw new RuntimeException(
Member

Don't use RuntimeException; why even handle this?

Author

The reason we use this code: (1) the old version needs the Hive package to create the staging directory; in the Hive code, that staging directory is stored in a hash map, and the staging directories are removed when the session is closed. However, our Spark code does not trigger the Hive session close, so these directories are never removed. (2) The pushed code simulates the Hive way of creating the staging directory inside Spark rather than relying on Hive, so the staging directory does get removed. (3) I will fix the return type issue. Thanks for your comments, @srowen.

Member

Almost all the code in this PR is copied from the existing master; this PR is just for branch-1.6.

@merlintang (Author) commented Dec 4, 2016 via email

@gatorsmile (Member)

@merlintang Could you please add [Branch-1.6] in your PR title?

@merlintang merlintang changed the title [SPARK-18372][SQL].Staging directory fail to be removed [SPARK-18372][SQL][Branch-1.6].Staging directory fail to be removed Dec 4, 2016
@merlintang (Author) commented Dec 4, 2016 via email

@cloud-fan (Contributor)

OK, so the problem becomes: do we want to backport this to 1.6? cc @rxin

@rxin (Contributor) commented Dec 5, 2016

If it is a bug fix and low risk, sure.

@merlintang (Author) commented Dec 5, 2016 via email

@rxin (Contributor) commented Dec 5, 2016

We have stopped making new releases for 1.5, so it makes no sense to backport.

@merlintang (Author) commented Dec 5, 2016 via email

@cloud-fan (Contributor)

OK @merlintang, can you find out which PR added this code to 2.0? Then other people can know what we are backporting in this PR.

@merlintang (Author)

@cloud-fan This is related to this PR in 2.0.x: #12770

@lichenglin

I'm using Spark 2.0.2, and I got a really big hive-staging folder.
May I delete the folder manually? Does it have any bad effect on the warehouse?

@gatorsmile (Member) commented Dec 6, 2016

@lichenglin Could you post the layout of that staging folder?

@lichenglin

Here is some output from du -h --max-depth=1 .:

3.3G ./.hive-staging_hive_2016-12-06_18-17-48_899_1400956608265117052-5
13G  ./.hive-staging_hive_2016-12-06_15-43-35_928_6647980494630196053-5
8.6G ./.hive-staging_hive_2016-12-06_17-05-51_951_8422682528744006964-5
9.7G ./.hive-staging_hive_2016-12-06_17-14-44_748_6947381677226271245-5
9.2G ./day=2016-12-01
8.5G ./day=2016-11-19

Every day I run a SQL statement like insert overwrite db.table partition(day='2016-12-06') select * from tmpview, and each run creates a hive-staging folder.

Can I delete the folders manually?
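As a hedged illustration only (this is not a supported Spark tool, and deleting is only safe while no insert job is writing to the table), stale staging directories can be found and removed by matching Hive's naming pattern:

```scala
import java.io.File

object StagingCleanupSketch {
  // Recursively delete a file or directory.
  def deleteRecursively(f: File): Unit = {
    Option(f.listFiles()).getOrElse(Array.empty[File]).foreach(deleteRecursively)
    f.delete()
  }

  // Remove directories under `tableDir` whose names match Hive's
  // ".hive-staging_" prefix; returns the directories that were deleted.
  def cleanStagingDirs(tableDir: File): Seq[File] = {
    val stale = Option(tableDir.listFiles())
      .getOrElse(Array.empty[File])
      .filter(f => f.isDirectory && f.getName.startsWith(".hive-staging_"))
      .toSeq
    stale.foreach(deleteRecursively)
    stale
  }
}
```

On HDFS the equivalent would be removing the matching paths with hadoop fs -rm -r; partition directories like day=2016-12-01 hold real data and must not be touched.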

@merlintang (Author) commented Dec 7, 2016 via email

@lichenglin

In fact, I'm using Zeppelin to run SQL. When I restart the Spark interpreter, the folders are deleted. Thanks a lot.

@gatorsmile (Member)

@lichenglin Another PR, #16134, is trying to delete the staging directory and the temporary data files (which are pretty big in your case) after each insert.

@merlintang (Author)

@gatorsmile What is the status of this patch? This is backported code; can you merge it into 1.6.x? More than one user is running into this issue on Spark 1.6.x.

@gatorsmile (Member)

The current fix does not resolve the issue when users hit abnormal termination of the JVM. In addition, if the JVM does not stop, these temporary files could consume a lot of space. Thus, I think #16134 needs to be added too.

This is just my opinion; we also need feedback from the other committers.

@cloud-fan (Contributor)

Yea, I think we should backport a complete staging-dir cleanup functionality to 1.6. Let's wait for #16134.

@merlintang (Author) commented Dec 13, 2016 via email

@gatorsmile (Member)

retest this please

val rand: Random = new Random
val format: SimpleDateFormat = new SimpleDateFormat("yyyy-MM-dd_HH-mm-ss_SSS")
val executionId: String = "hive_" + format.format(new Date) + "_" + Math.abs(rand.nextLong)
return executionId
Member

Please remove the return?

Author

done.

|location '${tmpDir.toURI.toString}'
""".stripMargin)

sqlContext.sql("CREATE TABLE tbl AS SELECT 1 AS a")
Member

You can create a temporary view instead of creating another table.

Author

Is the temporary view supported in 1.6.x? I tried using the HiveContext to create the view, but it does not work. Since this is a small test case, the created table here should be OK. Please advise. Thanks so much, Tao.

Member

In 1.6, the function is registerTempTable. The name was changed to "temp view" in 2.0.

@merlintang (Author) commented Jan 3, 2017

Thanks Xiao, I have created a DataFrame and then called registerTempTable as follows:

val df = sqlContext.createDataFrame((1 to 2).map(i => (i, "a"))).toDF("key", "value")
df.select("value").repartition(1).registerTempTable("tbl")

It works, but it looks clumsy. What do you think?

Member

How about the following line?

Seq((1, "a")).toDF("key", "value").registerTempTable("tbl")

BTW, I am Xiao Li. : )

Member

You just want one column. Then you can do it with:

Seq(Tuple1("a")).toDF("value").registerTempTable("tbl")

Author

Sorry, Xiao; one of my best friends is named Tao. :) It is updated. Thanks again.

@SparkQA commented Jan 2, 2017

Test build #70785 has finished for PR 15819 at commit 8648a46.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@merlintang (Author)

@gatorsmile Can you retest the patch so we can merge? Sorry to ping you multiple times; several users are asking about this.

@gatorsmile (Member)

retest this please

withTable("tab", "tbl") {
sqlContext.sql(
s"""
|CREATE TABLE tab(c1 string)
Member

Nit: two spaces -> one space

Author

Thanks, it is updated.

val rand: Random = new Random
val format: SimpleDateFormat = new SimpleDateFormat("yyyy-MM-dd_HH-mm-ss_SSS")
val executionId: String = "hive_" + format.format(new Date) + "_" + Math.abs(rand.nextLong)
executionId
Member

Nit: an indent issue. Please remove one more space.

Author

Done! Thanks, Xiao.

@SparkQA commented Jan 5, 2017

Test build #70907 has finished for PR 15819 at commit 15da7a8.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

|location '${tmpDir.toURI.toString}'
""".stripMargin)

import sqlContext.implicits._
Member

Nit: move this import to line 231.

Author

Done

@gatorsmile (Member)

retest this please

@SparkQA commented Jan 5, 2017

Test build #70908 has finished for PR 15819 at commit 15da7a8.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member)

Weird... Not sure why the build failed. The build works in my local environment. cc @srowen @JoshRosen

@gatorsmile (Member)

retest this please

@SparkQA commented Jan 6, 2017

Test build #70964 has finished for PR 15819 at commit 4f26b28.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

test(s"$version: Delete the temporary staging directory and files after each insert") {
import sqlContext.implicits._
Member

Let's roll back to the way you did it in the last run, instead of using the temp table. I am not sure whether this triggered the build issue.

Author

Thanks, Xiao. I have reverted that and tested locally.

@gatorsmile (Member)

retest this please

@gatorsmile (Member)

LGTM pending test

@SparkQA commented Jan 6, 2017

Test build #70990 has finished for PR 15819 at commit ab5e369.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Jan 7, 2017
(The commit message repeats the PR description above.)

Author: Mingjie Tang <mtang@hortonworks.com>
Author: Mingjie Tang <mtang@HW12398.local>

Closes #15819 from merlintang/branch-1.6.
@gatorsmile (Member)

Thanks! Merging to 1.6

@gatorsmile (Member)

@merlintang Can you close this PR? Thanks!

@merlintang (Author)

Many thanks, Xiao. I learned a lot.

@merlintang merlintang closed this Jan 7, 2017
zzcclp pushed a commit to zzcclp/spark that referenced this pull request Jan 7, 2017
(The commit message repeats the PR description above.)

Closes apache#15819 from merlintang/branch-1.6.

(cherry picked from commit 2303887)
dosoft pushed a commit to WANdisco/spark that referenced this pull request Jan 24, 2017
(The commit message repeats the PR description above.)

Closes apache#15819 from merlintang/branch-1.6.