[SPARK-17475] [STREAMING] Delete CRC files if the filesystem doesn't use checksum files #15027

Closed
frreiss wants to merge 2 commits into apache:master from frreiss:fred-17475

Conversation

frreiss
Contributor

@frreiss frreiss commented Sep 9, 2016

What changes were proposed in this pull request?

When the metadata logs for various parts of Structured Streaming are stored on non-HDFS filesystems such as NFS or ext4, the HDFSMetadataLog class leaves hidden HDFS-style checksum (CRC) files in the log directory, one file per batch. This PR modifies HDFSMetadataLog so that it detects the use of a filesystem that doesn't use CRC files and removes the CRC files.

How was this patch tested?

Modified an existing test case in HDFSMetadataLogSuite to check whether HDFSMetadataLog correctly removes CRC files on the local POSIX filesystem. Ran the entire regression suite.
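
As a rough illustration of what such a check can look like, here is a hedged sketch, not the suite's actual code; `logDir` and the use of plain `java.io.File` are placeholders:

```scala
import java.io.File

// Hypothetical assertion in the spirit of the modified test: after writing and
// purging batches on the local filesystem, no hidden .crc sidecars remain in
// the metadata log directory.
val logDir = new File("/tmp/metadata-log-dir") // placeholder path
val orphanCrcFiles = Option(logDir.listFiles())
  .getOrElse(Array.empty[File])
  .filter(_.getName.endsWith(".crc"))
assert(orphanCrcFiles.isEmpty,
  s"orphan CRC files left behind: ${orphanCrcFiles.mkString(", ")}")
```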

@@ -146,6 +146,11 @@ class HDFSMetadataLog[T: ClassTag](sparkSession: SparkSession, path: String)
// It will fail if there is an existing file (someone has committed the batch)
logDebug(s"Attempting to write log #${batchIdToPath(batchId)}")
fileManager.rename(tempPath, batchIdToPath(batchId))

// SPARK-17475: HDFSMetadataLog should not leak CRC files
// If the underlying filesystem didn't rename the CRC file, delete it.
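
The hunk above shows only the new comment lines; judging from the delete call frreiss quotes later in this thread, the added code is presumably along these lines (a sketch, not the verbatim patch — the hidden `.<name>.crc` sidecar naming follows Hadoop's checksum-file convention, and `Path` here is `org.apache.hadoop.fs.Path`):

```scala
// Sketch: derive the path of the hidden CRC sidecar that the checksummed
// filesystem created for the temp file, and remove it if the rename above
// left it behind.
val crcPath = new Path(tempPath.getParent, s".${tempPath.getName}.crc")
if (fileManager.exists(crcPath)) fileManager.delete(crcPath)
```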
Member

Is this specific to streaming, or would other parts of spark benefit if this behavior were changed in the file manager?

Contributor Author

@frreiss frreiss Sep 12, 2016

I believe HDFSMetadataLog is only called from Structured Streaming classes currently. The FileManager trait here is part of HDFSMetadataLog.

Member

@jodersky jodersky Sep 13, 2016

ok, looks good otherwise

@rxin
Contributor

rxin commented Sep 17, 2016

Jenkins, test this please.

@SparkQA

SparkQA commented Sep 17, 2016

Test build #3275 has finished for PR 15027 at commit 9ff89c0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing
Member

zsxwing commented Oct 24, 2016

@frreiss I saw that ChecksumFileSystem also renames the checksum file: https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/ChecksumFileSystem.java#L520

So I think this fix is not needed. [EDITED]

@frreiss
Contributor Author

frreiss commented Oct 27, 2016

When I comment out line 155 of HDFSMetadataLog.scala on this branch (`if (fileManager.exists(crcPath)) fileManager.delete(crcPath)`) and run the test case attached to this PR, it fails:

[freiss@fuzzy]:~/spark/from_git/spark-17475$ build/sbt -Phadoop-2.7 -Pscala-2.11 "test-only org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite"
...
[info] - HDFSMetadataLog: purge *** FAILED *** (135 milliseconds)
[info]   Array(/Users/freiss/spark/from_git/spark-17475/target/tmp/spark-682aa8da-3f04-494f-846f-13c97d3e5538/..29ef67f7-1712-4350-8552-1f8bc6424d0b.tmp.crc, /Users/freiss/spark/from_git/spark-17475/target/tmp/spark-682aa8da-3f04-494f-846f-13c97d3e5538/..ab9bafcb-bdf5-4411-9a9b-60d293d653a6.tmp.crc, /Users/freiss/spark/from_git/spark-17475/target/tmp/spark-682aa8da-3f04-494f-846f-13c97d3e5538/..f79fbc34-58c5-4856-a40d-84eef49c8b9e.tmp.crc, /Users/freiss/spark/from_git/spark-17475/target/tmp/spark-682aa8da-3f04-494f-846f-13c97d3e5538/2) had size 4 instead of expected size 1 (HDFSMetadataLogSuite.scala:126)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
...

Note how the test case failure message contains a list of orphan .crc files left in HDFSMetadataLog's temp directory. So, while the filesystem code appears on the surface to be immune to this problem, the problem is clearly happening in the context of the unit tests. Determining exactly what is going on will require more in-depth investigation. Depending on the true root cause, it's possible that this problem also occurs in some distributed settings.

@zsxwing
Member

zsxwing commented Nov 1, 2016

retest this please

@zsxwing
Member

zsxwing commented Nov 1, 2016

It turns out there is a bug in Hadoop's FileContext that causes it not to rename the checksum file.

LGTM pending tests.
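
For anyone who wants to observe the FileContext behavior directly, here is a minimal, self-contained sketch (not code from this PR): it writes a file through the local FileContext, renames it, and reports whether the `.crc` sidecar for the old name was left behind. The paths are placeholders, and the observed outcome is an assumption to verify locally:

```scala
import java.util.EnumSet
import org.apache.hadoop.fs.{CreateFlag, FileContext, Options, Path}

object CrcRenameCheck {
  def main(args: Array[String]): Unit = {
    val fc = FileContext.getLocalFSFileContext()
    val src = new Path("/tmp/crc-check/file.tmp")   // placeholder paths
    val dst = new Path("/tmp/crc-check/file.final")

    // Writing through the local FileContext creates a hidden .crc sidecar.
    val out = fc.create(src, EnumSet.of(CreateFlag.CREATE, CreateFlag.OVERWRITE),
      Options.CreateOpts.createParent())
    out.writeUTF("batch metadata")
    out.close()

    fc.rename(src, dst, Options.Rename.OVERWRITE)

    // If the bug is present, the sidecar for the old name is still here.
    val orphanCrc = new Path(src.getParent, s".${src.getName}.crc")
    println(s"orphan crc left behind: ${fc.util().exists(orphanCrc)}")
  }
}
```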

@SparkQA

SparkQA commented Nov 2, 2016

Test build #67931 has finished for PR 15027 at commit 9ff89c0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member

viirya commented Nov 2, 2016

Should we just delete the crc file, or rename it instead? Since the crc file is created for the temp file, I assume the filesystem still uses it. And since this is a bug in Hadoop's FileContext that fails to rename the crc file, it doesn't look like a fault of the underlying filesystem.

@rxin
Contributor

rxin commented Nov 2, 2016

Merging in master/branch-2.1.

@asfgit asfgit closed this in 620da3b Nov 2, 2016
asfgit pushed a commit that referenced this pull request Nov 2, 2016
[SPARK-17475][STREAMING] Delete CRC files if the filesystem doesn't use checksum files

Author: frreiss <frreiss@us.ibm.com>

Closes #15027 from frreiss/fred-17475.

(cherry picked from commit 620da3b)
Signed-off-by: Reynold Xin <rxin@databricks.com>
@frreiss
Contributor Author

frreiss commented Nov 2, 2016

@viirya to answer your question re deleting vs moving the files: Deleting is easier to implement, because once the .crc file is deleted, you can be sure it won't appear again. Moving the checksum files would be better but would require tracking down other Spark code that does filesystem operations on the files later on. These log files are very short by HDFS standards, so the checksums aren't providing much value anyway.
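
As a hedged illustration of the rename alternative frreiss describes (not code from this PR; `finalPath` and the sidecar naming are assumptions for the sketch), moving the sidecar would look roughly like:

```scala
// Hypothetical alternative: move the CRC sidecar along with the log file, so
// the checksum stays valid instead of being discarded.
val srcCrc = new Path(tempPath.getParent, s".${tempPath.getName}.crc")
val dstCrc = new Path(finalPath.getParent, s".${finalPath.getName}.crc")
if (fileManager.exists(srcCrc)) fileManager.rename(srcCrc, dstCrc)
```

The drawback frreiss notes still applies: any later code that renames or deletes the log file would also have to manage the sidecar.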

@viirya
Member

viirya commented Nov 3, 2016

@frreiss Thanks for the explanation!

uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
[SPARK-17475][STREAMING] Delete CRC files if the filesystem doesn't use checksum files

Author: frreiss <frreiss@us.ibm.com>

Closes apache#15027 from frreiss/fred-17475.