[SPARK-17475] [STREAMING] Delete CRC files if the filesystem doesn't use checksum files #15027
Conversation
```diff
@@ -146,6 +146,11 @@ class HDFSMetadataLog[T: ClassTag](sparkSession: SparkSession, path: String)
       // It will fail if there is an existing file (someone has committed the batch)
       logDebug(s"Attempting to write log #${batchIdToPath(batchId)}")
       fileManager.rename(tempPath, batchIdToPath(batchId))
+
+      // SPARK-17475: HDFSMetadataLog should not leak CRC files
+      // If the underlying filesystem didn't rename the CRC file, delete it.
```
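For readers skimming the hunk, the added lines boil down to something like the sketch below. It is a minimal illustration, assuming the hidden checksum follows Hadoop's `.<name>.crc` naming convention and that the `FileManager` trait exposes `exists`/`delete`; it is not a verbatim quote of the patch:

```scala
import org.apache.hadoop.fs.Path

// Sketch of the cleanup the new comment lines describe. `fileManager` and
// `tempPath` come from the surrounding HDFSMetadataLog code; the exact
// FileManager methods used here are assumptions.
val crcPath = new Path(tempPath.getParent, s".${tempPath.getName}.crc")
if (fileManager.exists(crcPath)) {
  // The rename above moved the data file but left its checksum behind,
  // so delete the orphaned CRC file rather than leak one per batch.
  fileManager.delete(crcPath)
}
```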
Is this specific to streaming, or would other parts of Spark benefit if this behavior were changed in the file manager?
I believe HDFSMetadataLog is only called from Structured Streaming classes currently. The FileManager trait here is part of HDFSMetadataLog.
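For context, the trait looks roughly like the sketch below; the method set is reconstructed for illustration and may not match the source exactly:

```scala
import org.apache.hadoop.fs.{FSDataInputStream, FSDataOutputStream, FileStatus, Path, PathFilter}

// Approximate shape of the FileManager trait nested in HDFSMetadataLog.
trait FileManager {
  def list(path: Path, filter: PathFilter): Array[FileStatus]
  def mkdirs(path: Path): Unit
  def exists(path: Path): Boolean
  def open(path: Path): FSDataInputStream
  def create(path: Path): FSDataOutputStream
  def rename(srcPath: Path, destPath: Path): Unit
  def delete(path: Path): Unit
}
```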
OK, looks good otherwise.
Jenkins, test this please.

Test build #3275 has finished for PR 15027 at commit
@frreiss I saw that ChecksumFileSystem also renames the checksum file: https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/ChecksumFileSystem.java#L520, so I think this fix is not needed. [EDITED]
When I comment out line 155 in HDFSMetadataLog.scala on this branch (…), the modified test case fails. Note how the test case failure message contains a list of orphan .crc files left in HDFSMetadataLog's temp directory. So, while the filesystem code appears on the surface to be immune to this problem, the problem is clearly happening in the context of the unit tests. Determining exactly what is going on will require more in-depth investigation. Depending on the true root cause, it's possible that this problem also occurs in some distributed settings.
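The orphan-file check amounts to an assertion along these lines; `metadataDir` and the exact filter are placeholders rather than the actual HDFSMetadataLogSuite code:

```scala
import java.io.File

// After writing a few batches, scan the metadata directory for leftover
// hidden checksum files; with the fix in place there should be none.
val orphanCrcFiles = new File(metadataDir).listFiles()
  .filter(f => f.getName.startsWith(".") && f.getName.endsWith(".crc"))
assert(orphanCrcFiles.isEmpty,
  s"Orphan CRC files left behind: ${orphanCrcFiles.mkString(", ")}")
```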
retest this please
It turns out there is a bug in Hadoop's FileContext: it doesn't rename the checksum file. LGTM pending tests.
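A hedged repro of that FileContext behavior on a local filesystem might look like the following; the paths are placeholders, and whether the .crc file actually survives the rename depends on the Hadoop version:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileContext, FileSystem, Path}

val conf = new Configuration()
// The local FileSystem is a ChecksumFileSystem, so create() also writes a
// hidden .data.tmp.crc file next to the data file.
val fs = FileSystem.getLocal(conf)
val src = new Path("/tmp/crc-demo/data.tmp")
val dst = new Path("/tmp/crc-demo/data")

val out = fs.create(src)
out.writeBytes("batch metadata")
out.close()

// Renaming through FileContext moves the data file but, per the bug
// described above, can leave .data.tmp.crc orphaned in /tmp/crc-demo.
FileContext.getFileContext(conf).rename(src, dst)
```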
Test build #67931 has finished for PR 15027 at commit
Should we just delete the CRC file, or rename it? Since the CRC file is created for the temp file, I suppose the filesystem uses it? Especially since this is a bug in Hadoop's FileContext that fails to rename the CRC file, it looks like it's not a fault of the underlying filesystem.
Merging in master/branch-2.1.
[SPARK-17475][STREAMING] Delete CRC files if the filesystem doesn't use checksum files

## What changes were proposed in this pull request?

When the metadata logs for various parts of Structured Streaming are stored on non-HDFS filesystems such as NFS or ext4, the HDFSMetadataLog class leaves hidden HDFS-style checksum (CRC) files in the log directory, one file per batch. This PR modifies HDFSMetadataLog so that it detects the use of a filesystem that doesn't use CRC files and removes the CRC files.

## How was this patch tested?

Modified an existing test case in HDFSMetadataLogSuite to check whether HDFSMetadataLog correctly removes CRC files on the local POSIX filesystem. Ran the entire regression suite.

Author: frreiss <frreiss@us.ibm.com>

Closes #15027 from frreiss/fred-17475.

(cherry picked from commit 620da3b)
Signed-off-by: Reynold Xin <rxin@databricks.com>
@viirya To answer your question about deleting vs. moving the files: deleting is easier to implement, because once the .crc file is deleted, you can be sure it won't appear again. Moving the checksum files would be better, but it would require tracking down other Spark code that performs filesystem operations on the files later on. These log files are very short by HDFS standards, so the checksums aren't providing much value anyway.
@frreiss Thanks for the explanation!