Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automatically register tables if ObjectStore root is configured #4095

Merged
merged 12 commits into from
Nov 4, 2022

Conversation

avantgardnerio
Copy link
Contributor

@avantgardnerio avantgardnerio commented Nov 3, 2022

Which issue does this PR close?

Closes #4094.

Rationale for this change

Described in issue.

What changes are included in this PR?

  1. a ListingSchemaProvider is defined
  2. if appropriate config_options are set, the ListingSchemaProvider is automatically registered
  3. if the ListingSchemaProvider is registered, it scans the ObjectStore and registers the appropriate tables

Are there any user-facing changes?

When they connect, tables are there as they would expect.

Screenshot from 2022-11-02 15-52-18

Screenshot from 2022-11-02 15-50-55

@avantgardnerio
Copy link
Contributor Author

@andygrove and @alamb as usual, I'd appreciate a review as well as adding anyone else who is appropriate. Thanks guys 🙌

@github-actions github-actions bot added the core Core DataFusion crate label Nov 3, 2022
@alamb alamb changed the title Bg listing catalog Automatically register tables if ObjectStore root is configured Nov 3, 2022
@alamb
Copy link
Contributor

alamb commented Nov 3, 2022

Until the fix for #4100 is merged, clippy will likely fail on this PR as well

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great @avantgardnerio -- I think the only thing I think is missing is some sort of test (mostly so we don't accidentally break this feature when it is refactored, etc)

datafusion/core/src/catalog/listing_schema.rs Show resolved Hide resolved
datafusion/core/src/catalog/listing_schema.rs Show resolved Hide resolved
datafusion/core/src/catalog/listing_schema.rs Outdated Show resolved Hide resolved
}
}

/// Reload table information from ObjectStore
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend documenting here when refresh() should be called -- like it has to be called explicitly after construction for example

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is the big open question with this PR 😄 . Presently, in flight_sql.rs in Ballista, when FlightSql clients enumerate the catalogs/schemas/tables, I am doing a checked downcast_ref to see if it is a ListingSchemaProvider and if so calling this method. Ultimately, I think we should probably extend the SchemaProvider API? The downcast trick certainly doesn't seem elegant.

Unfortunately this is a state synchronization problem, and I'm not sure that the ObjectStore has APIs for file system listeners, so we will need to figure out the best times to try to synchronize the state. Every time we run a query perhaps? My worry is that this could get expensive, I can imagine (or have heard about) each delta table containing 1000 parquet files, and there could probably be 100s of tables, which means a lot of files to scan.

Also unfortunately, it looks like the ObjectStore doesn't let us list only children of a folder - it lists all the files in the entire bucket, thus the weird recursive .parent() stuff.

I thought I would file this PR to get the discussion going.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that there is no policy that will work well for all implementations / usecases so the refresh policy will need to be decided by whatever the upstream system is (e.g. ballista or iox, or whatever)

Adding a refresh() method to SchemaProvider seems fine, though to be honest I think using downcast_ref() also seems fine to me

Would definitely be interested in hearing other opinions on this

datafusion/core/src/catalog/listing_schema.rs Outdated Show resolved Hide resolved
@avantgardnerio
Copy link
Contributor Author

missing is some sort of test

For sure, working on it...

benchmarks/Cargo.toml Outdated Show resolved Hide resolved
@avantgardnerio
Copy link
Contributor Author

I don't understand why the hash collision test is failing, and I can't get it to happen locally :(

@avantgardnerio
Copy link
Contributor Author

If it looks anything like local, we should see

Looking for data in /home/bgardner/workspace/ballista/arrow-datafusion/datafusion/core/tests/tpch-csv exists=true

@avantgardnerio
Copy link
Contributor Author

So the directory exists, so it seems like:

  1. either the files in it were deleted
  2. or, the ObjectStore has some problem accessing it
Looking for data in /__w/arrow-datafusion/arrow-datafusion/datafusion/core/tests/tpch-csv exists=true
thread 'execution::context::tests::with_listing_schema_provider' panicked at 'assertion failed: `(left == right)`
  left: `0`,
 right: `8`', datafusion/core/src/execution/context.rs:2288:9

@avantgardnerio
Copy link
Contributor Author

Oh, could the underscores be messing with it?

@avantgardnerio
Copy link
Contributor Author

Yup, a bug in ObjectStore:

Looking for data in /home/bgardner/workspace/ballista/arrow-datafusion/datafusion/core/tests/tpch_csv exists=true


Left:  0
Right: 8
<Click to see difference>

@avantgardnerio
Copy link
Contributor Author

Yup, a bug in ObjectStore:

Nope, it was my faulty test combined with hashing inconsistency that happened to get triggered by underscores in this case 🤦 .

@avantgardnerio avantgardnerio requested review from andygrove and alamb and removed request for alamb and andygrove November 4, 2022 00:16
@avantgardnerio avantgardnerio requested review from alamb and andygrove and removed request for alamb and andygrove November 4, 2022 00:17
@avantgardnerio
Copy link
Contributor Author

Oh wow. The UI went nuts. Sorry for all the review requests.

@andygrove
Copy link
Member

I really like this feature. It will save me time as a user once we plumb this through to the CLI and Python 👍

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks good @avantgardnerio -- thank you!

I had some minor naming suggestions on the parameters, but I would also be fine with this being merged as is

datafusion/core/src/catalog/listing_schema.rs Show resolved Hide resolved
datafusion/core/src/config.rs Outdated Show resolved Hide resolved
datafusion/core/src/config.rs Outdated Show resolved Hide resolved
datafusion/core/src/config.rs Outdated Show resolved Hide resolved
Brent Gardner and others added 3 commits November 4, 2022 10:32
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @avantgardnerio

@andygrove andygrove merged commit 4a67d0d into apache:master Nov 4, 2022
@avantgardnerio avantgardnerio deleted the bg_listing_catalog branch November 4, 2022 18:25
@ursabot
Copy link

ursabot commented Nov 4, 2022

Benchmark runs are scheduled for baseline = fc669d5 and contender = 4a67d0d. 4a67d0d is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Dandandan pushed a commit to yuuch/arrow-datafusion that referenced this pull request Nov 5, 2022
…he#4095)

* squash

* Debug test in CI :'(

* Hashing inconsistency

* Address Andy's concerns

* Docs

* Docs

* fmt

* treat empty string like None :(

* clippy

* PR feedback

* Update datafusion/core/src/config.rs

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* Update datafusion/core/src/config.rs

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Ted-Jiang pushed a commit to Ted-Jiang/arrow-datafusion that referenced this pull request Nov 5, 2022
…he#4095)

* squash

* Debug test in CI :'(

* Hashing inconsistency

* Address Andy's concerns

* Docs

* Docs

* fmt

* treat empty string like None :(

* clippy

* PR feedback

* Update datafusion/core/src/config.rs

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* Update datafusion/core/src/config.rs

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Automatically register tables if ObjectStore root is configured
4 participants