Skip to content

GH-37626: [Java][Cpp] support using existing java file system #37690

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

zinking
Copy link

@zinking zinking commented Sep 13, 2023

Rationale for this change

when using the dataset api to repeatedly read chunks of file, FileSystem instances are created every time. causing performance degrade.

What changes are included in this PR?

to reuse existing filesystem passed from java context.

Are these changes tested?

tested on my local integration. as there seems no end to end dataset testing in the base

Are there any user-facing changes?

NO

@github-actions
Copy link

⚠️ GitHub issue apache/arrow-java#166 has been automatically assigned in GitHub to PR creator.

@lidavidm
Copy link
Member

I think what we'd want is to add a FileSystem parameter to FileSystemDatasetFactory::Make or FileSystemFactoryOptions, instead of a global.

@zinking
Copy link
Author

zinking commented Sep 13, 2023

I think what we'd want is to add a FileSystem parameter to FileSystemDatasetFactory::Make or FileSystemFactoryOptions, instead of a global.

hmm, just the naming, it is a field in FileSystemFactoryOptions and then passed down not a global variable.

@lidavidm
Copy link
Member

Ah, sorry. In that case it should be named something less Java-specific and should probably be typed std::shared_ptr<FileSystem>? You can use PathFromUri to validate that the filesystem is compatible with the given URI

Comment on lines +195 to +198

/// when java context have a file system reference to be used
// then use this file system reference in the context
std::shared_ptr<void> file_system_java = nullptr;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this needs to be Java-specific at all.

This should be shared_ptr<FileSystem>.

@@ -217,7 +217,7 @@ Result<std::shared_ptr<DatasetFactory>> FileSystemDatasetFactory::Make(
// ARROW-12481.
std::string internal_path;
ARROW_ASSIGN_OR_RAISE(std::shared_ptr<fs::FileSystem> filesystem,
fs::FileSystemFromUri(uri, &internal_path))
fs::FileSystemFromUriAndFs(uri, &internal_path, options.file_system_java))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what you would do is, if a filesystem is provided, call FileSystem::PathFromUri to set internal_path. If it succeeds, then proceed, else raise an error.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds correct, will do later

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Sep 15, 2023
Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change (allowing a filesystem to be specified when running discovery) seems good. I agree with @lidavidm that this shouldn't be java-specific.

Copy link

⚠️ GitHub issue #37626 has no components, please add labels for components.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Java] new FileSystem is called repeatedly and cause performance degrade
3 participants