Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ignore_missing_files Hive config #8615

Closed

Conversation

zhli1142015
Copy link
Contributor

@zhli1142015 zhli1142015 commented Jan 31, 2024

Spark provides support for the data source configuration
spark.sql.files.ignoreMissingFiles, to ignore missing files while reading data
from files.
https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html#ignore-missing-files

Introduce Hive config ignore_missing_files to match this behavior.

Copy link

netlify bot commented Jan 31, 2024

Deploy Preview for meta-velox ready!

Name Link
🔨 Latest commit d42a35f
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/65bcda10b264550008d41aa7
😎 Deploy Preview https://deploy-preview-8615--meta-velox.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jan 31, 2024
@zhli1142015
Copy link
Contributor Author

@rui-mo and @mbasmanova , could you help review this PR?
Thanks.

std::shared_ptr<FileHandle> fileHandle = nullptr;
try {
fileHandle = fileHandleFactory_->generate(hiveSplit_->filePath).second;
} catch (VeloxRuntimeError& e) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have seen some discussions about the poor performance of try-catch, and am not sure if there's better way to achieve this. Maybe Status is an option. Thanks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rui-mo Throwing is problematic when it happens per-row. Here, throwing should be fine.

folly::errnoStr(errno));
if (fd_ < 0) {
if (errno == ENOENT) {
VELOX_FILE_NOT_FOUND_ERROR("file {} does not exist", path);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: messages ends with dot .

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhli1142015 Looks good overall. Some comments.

auto tempFolder = ::exec::test::TempDirectoryPath::create();
auto path = fmt::format("{}/file", tempFolder->path);
auto localFs = filesystems::getFileSystem(path, nullptr);
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove this scope

{
try {
LocalReadFile readFile(path);
ASSERT_FALSE(true) << "Function should throw.";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FAIL()

@@ -152,6 +152,14 @@ bool HiveConfig::isPartitionPathAsLowerCase(const Config* session) const {
return session->get<bool>(kPartitionPathAsLowerCaseSession, true);
}

bool HiveConfig::ignoreMissingFiles(const Config* session) const {
if (session->isValueExists(kIgnoreMissingFilesSession)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pattern of double checking (exists + get) is an anti-pattern. Let's change to session->get<T>(name, defaultValue). Would you make a separate PR for that? CC: @kewang1024

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, thanks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#8633, thanks.

std::shared_ptr<FileHandle> fileHandle = nullptr;
try {
fileHandle = fileHandleFactory_->generate(hiveSplit_->filePath).second;
} catch (VeloxRuntimeError& e) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rui-mo Throwing is problematic when it happens per-row. Here, throwing should be fine.

@@ -91,7 +92,19 @@ void SplitReader::prepareSplit(
VELOX_CHECK_NE(
baseReaderOpts_.getFileFormat(), dwio::common::FileFormat::UNKNOWN);

auto fileHandle = fileHandleFactory_->generate(hiveSplit_->filePath).second;
std::shared_ptr<FileHandle> fileHandle = nullptr;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drop = nullptr

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, thanks.

connector::hive::HiveConfig::kIgnoreMissingFilesSession,
std::to_string(ignoreMissingFile))
.splits({split})
.assertResults("");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use assertEmptyResults

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, thanks.

kHiveConnectorId,
connector::hive::HiveConfig::kIgnoreMissingFilesSession,
std::to_string(ignoreMissingFile))
.splits({split})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use split(split)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, thanks.

EXPECT_THROW(cursor->moveNext(), VeloxRuntimeError);
auto split = HiveConnectorSplitBuilder("/path/to/nowhere.orc").build();
for (bool ignoreMissingFiles : {true, false}) {
auto assertQueryForMissingFile = [&](bool ignoreMissingFile) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

auto assertMissingFile = .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, thanks.

assertQueryForMissingFile(ignoreMissingFiles);
} else {
try {
assertQueryForMissingFile(ignoreMissingFiles);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use VELOX_ASSERT_THROW

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added, thanks.

LocalReadFile readFile(path);
ASSERT_FALSE(true) << "Function should throw.";
} catch (VeloxRuntimeError& e) {
ASSERT_EQ(e.errorCode(), error_code::kFileNotFound);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add a macro similar to VELOX_ASSERT_THROW that can check error code as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added, thanks.

@mbasmanova mbasmanova changed the title Add hive config 'ignore_missing_files' Add Hive config 'ignore_missing_files' Feb 1, 2024
@mbasmanova mbasmanova changed the title Add Hive config 'ignore_missing_files' Add ignore_missing_files Hive config Feb 1, 2024
Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great % a couple nits.

@@ -63,6 +63,16 @@
<< "Expected error message to contain '" << (_errorMessage) \
<< "', but received '" << status.message() << "'."

#define VELOX_ASSERT_ERROR_CODE(_expression, _type, _errorCode) \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: perhaps, VELOX_ASSERT_RUNTIME_THROW_CODE(_expression, _errorCode) to match VELOX_ASSERT_RUNTIME_THROW

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added, thanks.

cursor->task()->addSplit("0", makeHiveSplit("/path/to/nowhere.orc"));
EXPECT_THROW(cursor->moveNext(), VeloxRuntimeError);
auto split = HiveConnectorSplitBuilder("/path/to/nowhere.orc").build();
for (bool ignoreMissingFiles : {true, false}) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: perhaps, it would be more readable as

assertMissingFile(true);

VELOX_ASSERT_RUNTIME_CODE_THROW(
          assertMissingFile(false),
          error_code::kFileNotFound);

without the loop

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, thanks.

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

@facebook-github-bot
Copy link
Contributor

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

auto fileHandle = fileHandleFactory_->generate(hiveSplit_->filePath).second;
std::shared_ptr<FileHandle> fileHandle;
try {
fileHandle = fileHandleFactory_->generate(hiveSplit_->filePath).second;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you also document 'generate' API to explain that it should throw a specific error code to indicate file-not-found?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I plan to add doc for this in the follow up pr, sth like: c195d90#diff-b2b8b019f80c44db000432c6c90bfffeb215c772b15908246f736ddd30011d1eR84. Please let me know if other docs are needed, thanks.

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhli1142015 This change is breaking Prestissimo tests.

[ RUN      ] BroadcastTest.invalidBroadcastFilePath
fbcode/github/presto-trunk/presto-native-execution/presto_cpp/main/operators/tests/BroadcastTest.cpp:339: Failure
Value of: e.message().find("No such file or directory") != std::string::npos
  Actual: false
Expected: true
Expected error message to contain 'No such file or directory', but received 'file /tmp/this-should-not-exist/velox--missing-broadcast-file.bin does not exist.'.

Maybe change

VELOX_FILE_NOT_FOUND_ERROR("file {} does not exist.", path);

to

VELOX_FILE_NOT_FOUND_ERROR("No such file or directory: {}", path);

@zhli1142015
Copy link
Contributor Author

@zhli1142015 This change is breaking Prestissimo tests.

[ RUN      ] BroadcastTest.invalidBroadcastFilePath
fbcode/github/presto-trunk/presto-native-execution/presto_cpp/main/operators/tests/BroadcastTest.cpp:339: Failure
Value of: e.message().find("No such file or directory") != std::string::npos
  Actual: false
Expected: true
Expected error message to contain 'No such file or directory', but received 'file /tmp/this-should-not-exist/velox--missing-broadcast-file.bin does not exist.'.

Maybe change

VELOX_FILE_NOT_FOUND_ERROR("file {} does not exist.", path);

to

VELOX_FILE_NOT_FOUND_ERROR("No such file or directory: {}", path);

Updated, thanks.

@facebook-github-bot
Copy link
Contributor

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@mbasmanova merged this pull request in 3c2b74a.

Copy link

Conbench analyzed the 1 benchmark run on commit 3c2b74a7.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

FelixYBW pushed a commit to FelixYBW/velox that referenced this pull request Feb 12, 2024
Summary:
Spark provides support for the data source configuration
`spark.sql.files.ignoreMissingFiles`, to ignore missing files while reading data
from files.
https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html#ignore-missing-files

Introduce Hive config ignore_missing_files to match this behavior.

Pull Request resolved: facebookincubator#8615

Reviewed By: xiaoxmeng

Differential Revision: D53335663

Pulled By: mbasmanova

fbshipit-source-id: 328e9abb20dba7a9b1c724e372967aa86b67c90f
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants