-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Automatically register tables if ObjectStore root is configured #4095
Conversation
@andygrove and @alamb as usual, I'd appreciate a review as well as adding anyone else who is appropriate. Thanks guys 🙌 |
Until the fix for #4100 is merged, clippy will likely fail on this PR as well |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great @avantgardnerio -- I think the only thing I think is missing is some sort of test (mostly so we don't accidentally break this feature when it is refactored, etc)
} | ||
} | ||
|
||
/// Reload table information from ObjectStore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend documenting here when refresh()
should be called -- like it has to be called explicitly after construction for example
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is the big open question with this PR 😄 . Presently, in flight_sql.rs
in Ballista, when FlightSql clients enumerate the catalogs/schemas/tables, I am doing a checked downcast_ref
to see if it is a ListingSchemaProvider
and if so calling this method. Ultimately, I think we should probably extend the SchemaProvider
API? The downcast trick certainly doesn't seem elegant.
Unfortunately this is a state synchronization problem, and I'm not sure that the ObjectStore
has APIs for file system listeners, so we will need to figure out the best times to try to synchronize the state. Every time we run a query perhaps? My worry is that this could get expensive, I can imagine (or have heard about) each delta table containing 1000 parquet files, and there could probably be 100s of tables, which means a lot of files to scan.
Also unfortunately, it looks like the ObjectStore
doesn't let us list only children of a folder - it lists all the files in the entire bucket, thus the weird recursive .parent()
stuff.
I thought I would file this PR to get the discussion going.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that there is no policy that will work well for all implementations / usecases so the refresh policy will need to be decided by whatever the upstream system is (e.g. ballista or iox, or whatever)
Adding a refresh()
method to SchemaProvider
seems fine, though to be honest I think using downcast_ref()
also seems fine to me
Would definitely be interested in hearing other opinions on this
For sure, working on it... |
ee992ad
to
9f75172
Compare
I don't understand why the hash collision test is failing, and I can't get it to happen locally :( |
If it looks anything like local, we should see
|
So the directory exists, so it seems like:
|
Oh, could the underscores be messing with it? |
Yup, a bug in ObjectStore:
|
Nope, it was my faulty test combined with hashing inconsistency that happened to get triggered by underscores in this case 🤦 . |
Oh wow. The UI went nuts. Sorry for all the review requests. |
I really like this feature. It will save me time as a user once we plumb this through to the CLI and Python 👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks good @avantgardnerio -- thank you!
I had some minor naming suggestions on the parameters, but I would also be fine with this being merged as is
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks @avantgardnerio
Benchmark runs are scheduled for baseline = fc669d5 and contender = 4a67d0d. 4a67d0d is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
…he#4095) * squash * Debug test in CI :'( * Hashing inconsistency * Address Andy's concerns * Docs * Docs * fmt * treat empty string like None :( * clippy * PR feedback * Update datafusion/core/src/config.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Update datafusion/core/src/config.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
…he#4095) * squash * Debug test in CI :'( * Hashing inconsistency * Address Andy's concerns * Docs * Docs * fmt * treat empty string like None :( * clippy * PR feedback * Update datafusion/core/src/config.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Update datafusion/core/src/config.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Which issue does this PR close?
Closes #4094.
Rationale for this change
Described in issue.
What changes are included in this PR?
ListingSchemaProvider
is definedconfig_options
are set, theListingSchemaProvider
is automatically registeredListingSchemaProvider
is registered, it scans theObjectStore
and registers the appropriate tablesAre there any user-facing changes?
When they connect, tables are there as they would expect.