-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ignore_missing_files Hive config #8615
Add ignore_missing_files Hive config #8615
Conversation
✅ Deploy Preview for meta-velox ready!
To edit notification comments on pull requests, go to your Netlify site configuration. |
@rui-mo and @mbasmanova , could you help review this PR? |
std::shared_ptr<FileHandle> fileHandle = nullptr; | ||
try { | ||
fileHandle = fileHandleFactory_->generate(hiveSplit_->filePath).second; | ||
} catch (VeloxRuntimeError& e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have seen some discussions about the poor performance of try-catch, and am not sure if there's better way to achieve this. Maybe Status is an option. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rui-mo Throwing is problematic when it happens per-row. Here, throwing should be fine.
velox/common/file/File.cpp
Outdated
folly::errnoStr(errno)); | ||
if (fd_ < 0) { | ||
if (errno == ENOENT) { | ||
VELOX_FILE_NOT_FOUND_ERROR("file {} does not exist", path); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: messages ends with dot .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zhli1142015 Looks good overall. Some comments.
velox/common/file/tests/FileTest.cpp
Outdated
auto tempFolder = ::exec::test::TempDirectoryPath::create(); | ||
auto path = fmt::format("{}/file", tempFolder->path); | ||
auto localFs = filesystems::getFileSystem(path, nullptr); | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: remove this scope
velox/common/file/tests/FileTest.cpp
Outdated
{ | ||
try { | ||
LocalReadFile readFile(path); | ||
ASSERT_FALSE(true) << "Function should throw."; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FAIL()
velox/connectors/hive/HiveConfig.cpp
Outdated
@@ -152,6 +152,14 @@ bool HiveConfig::isPartitionPathAsLowerCase(const Config* session) const { | |||
return session->get<bool>(kPartitionPathAsLowerCaseSession, true); | |||
} | |||
|
|||
bool HiveConfig::ignoreMissingFiles(const Config* session) const { | |||
if (session->isValueExists(kIgnoreMissingFilesSession)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This pattern of double checking (exists + get) is an anti-pattern. Let's change to session->get<T>(name, defaultValue)
. Would you make a separate PR for that? CC: @kewang1024
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated, thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#8633, thanks.
std::shared_ptr<FileHandle> fileHandle = nullptr; | ||
try { | ||
fileHandle = fileHandleFactory_->generate(hiveSplit_->filePath).second; | ||
} catch (VeloxRuntimeError& e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rui-mo Throwing is problematic when it happens per-row. Here, throwing should be fine.
@@ -91,7 +92,19 @@ void SplitReader::prepareSplit( | |||
VELOX_CHECK_NE( | |||
baseReaderOpts_.getFileFormat(), dwio::common::FileFormat::UNKNOWN); | |||
|
|||
auto fileHandle = fileHandleFactory_->generate(hiveSplit_->filePath).second; | |||
std::shared_ptr<FileHandle> fileHandle = nullptr; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
drop = nullptr
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated, thanks.
velox/exec/tests/TableScanTest.cpp
Outdated
connector::hive::HiveConfig::kIgnoreMissingFilesSession, | ||
std::to_string(ignoreMissingFile)) | ||
.splits({split}) | ||
.assertResults(""); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use assertEmptyResults
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated, thanks.
velox/exec/tests/TableScanTest.cpp
Outdated
kHiveConnectorId, | ||
connector::hive::HiveConfig::kIgnoreMissingFilesSession, | ||
std::to_string(ignoreMissingFile)) | ||
.splits({split}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use split(split)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated, thanks.
velox/exec/tests/TableScanTest.cpp
Outdated
EXPECT_THROW(cursor->moveNext(), VeloxRuntimeError); | ||
auto split = HiveConnectorSplitBuilder("/path/to/nowhere.orc").build(); | ||
for (bool ignoreMissingFiles : {true, false}) { | ||
auto assertQueryForMissingFile = [&](bool ignoreMissingFile) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
auto assertMissingFile = .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated, thanks.
velox/exec/tests/TableScanTest.cpp
Outdated
assertQueryForMissingFile(ignoreMissingFiles); | ||
} else { | ||
try { | ||
assertQueryForMissingFile(ignoreMissingFiles); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use VELOX_ASSERT_THROW
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added, thanks.
velox/common/file/tests/FileTest.cpp
Outdated
LocalReadFile readFile(path); | ||
ASSERT_FALSE(true) << "Function should throw."; | ||
} catch (VeloxRuntimeError& e) { | ||
ASSERT_EQ(e.errorCode(), error_code::kFileNotFound); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add a macro similar to VELOX_ASSERT_THROW that can check error code as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added, thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great % a couple nits.
velox/common/base/tests/GTestUtils.h
Outdated
@@ -63,6 +63,16 @@ | |||
<< "Expected error message to contain '" << (_errorMessage) \ | |||
<< "', but received '" << status.message() << "'." | |||
|
|||
#define VELOX_ASSERT_ERROR_CODE(_expression, _type, _errorCode) \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: perhaps, VELOX_ASSERT_RUNTIME_THROW_CODE(_expression, _errorCode) to match VELOX_ASSERT_RUNTIME_THROW
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added, thanks.
velox/exec/tests/TableScanTest.cpp
Outdated
cursor->task()->addSplit("0", makeHiveSplit("/path/to/nowhere.orc")); | ||
EXPECT_THROW(cursor->moveNext(), VeloxRuntimeError); | ||
auto split = HiveConnectorSplitBuilder("/path/to/nowhere.orc").build(); | ||
for (bool ignoreMissingFiles : {true, false}) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: perhaps, it would be more readable as
assertMissingFile(true);
VELOX_ASSERT_RUNTIME_CODE_THROW(
assertMissingFile(false),
error_code::kFileNotFound);
without the loop
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated, thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
auto fileHandle = fileHandleFactory_->generate(hiveSplit_->filePath).second; | ||
std::shared_ptr<FileHandle> fileHandle; | ||
try { | ||
fileHandle = fileHandleFactory_->generate(hiveSplit_->filePath).second; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you also document 'generate' API to explain that it should throw a specific error code to indicate file-not-found?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I plan to add doc for this in the follow up pr, sth like: c195d90#diff-b2b8b019f80c44db000432c6c90bfffeb215c772b15908246f736ddd30011d1eR84. Please let me know if other docs are needed, thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zhli1142015 This change is breaking Prestissimo tests.
[ RUN ] BroadcastTest.invalidBroadcastFilePath
fbcode/github/presto-trunk/presto-native-execution/presto_cpp/main/operators/tests/BroadcastTest.cpp:339: Failure
Value of: e.message().find("No such file or directory") != std::string::npos
Actual: false
Expected: true
Expected error message to contain 'No such file or directory', but received 'file /tmp/this-should-not-exist/velox--missing-broadcast-file.bin does not exist.'.
Maybe change
VELOX_FILE_NOT_FOUND_ERROR("file {} does not exist.", path);
to
VELOX_FILE_NOT_FOUND_ERROR("No such file or directory: {}", path);
Updated, thanks. |
@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@mbasmanova merged this pull request in 3c2b74a. |
Conbench analyzed the 1 benchmark run on commit There were no benchmark performance regressions. 🎉 The full Conbench report has more details. |
Summary: Spark provides support for the data source configuration `spark.sql.files.ignoreMissingFiles`, to ignore missing files while reading data from files. https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html#ignore-missing-files Introduce Hive config ignore_missing_files to match this behavior. Pull Request resolved: facebookincubator#8615 Reviewed By: xiaoxmeng Differential Revision: D53335663 Pulled By: mbasmanova fbshipit-source-id: 328e9abb20dba7a9b1c724e372967aa86b67c90f
Spark provides support for the data source configuration
spark.sql.files.ignoreMissingFiles
, to ignore missing files while reading datafrom files.
https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html#ignore-missing-files
Introduce Hive config ignore_missing_files to match this behavior.