Local cache for catalog #816

Closed
v0y4g3r opened this issue Jan 3, 2023 · 9 comments

@v0y4g3r
Contributor

v0y4g3r commented Jan 3, 2023

What type of enhancement is this?

Performance

What does the enhancement do?

CatalogManager's schema/table API is in the critical path of every table request. The current implementation is a workaround, since arrow-datafusion's catalog API is sync while our implementation is async, which may introduce extra overhead.

I opened an issue for an async version of the Catalog APIs, and arrow-datafusion's async catalog API is still under development. We may end up building a local cache for catalogs/schemas/tables.
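To make the idea of a local cache concrete, here is a minimal sketch using only the Rust standard library. It assumes a sync lookup that reads from an in-memory snapshot, while some async task (not shown) fetches fresh metadata and swaps the snapshot in. `TableInfo`, `LocalCatalogCache`, and `refresh` are hypothetical names for illustration, not the project's actual types.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Hypothetical table metadata; stands in for a real table handle.
#[derive(Debug, Clone, PartialEq)]
pub struct TableInfo {
    pub name: String,
    pub version: u64,
}

/// Cache key: (catalog, schema, table).
type TableKey = (String, String, String);

/// A local snapshot of catalog contents that sync code can read without awaiting.
#[derive(Default)]
pub struct LocalCatalogCache {
    tables: RwLock<HashMap<TableKey, Arc<TableInfo>>>,
}

impl LocalCatalogCache {
    /// Sync lookup: the kind of call a sync catalog API could make on the hot path.
    pub fn table(&self, catalog: &str, schema: &str, table: &str) -> Option<Arc<TableInfo>> {
        let key = (catalog.to_string(), schema.to_string(), table.to_string());
        self.tables.read().unwrap().get(&key).cloned()
    }

    /// Replace the whole snapshot. Assumed to be called by whatever async
    /// task fetched fresh metadata from the remote backend.
    pub fn refresh(&self, fresh: Vec<(TableKey, TableInfo)>) {
        let map: HashMap<TableKey, Arc<TableInfo>> =
            fresh.into_iter().map(|(k, v)| (k, Arc::new(v))).collect();
        *self.tables.write().unwrap() = map;
    }
}
```

Readers only take a read lock, so lookups do not block each other; the cost is that a lookup can return stale data until the next `refresh`.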

Implementation challenges

NA

@v0y4g3r v0y4g3r added the C-enhancement Category Enhancements label Jan 3, 2023
@v0y4g3r
Contributor Author

v0y4g3r commented Jan 6, 2023

Now that the PR for async SchemaProvider::table is merged, we can update to the head version of datafusion.

There's still one caveat: the CatalogProvider API is still sync. We may choose another workaround like ballista's, which refreshes the catalog on demand.

@MichaelScofield MichaelScofield self-assigned this Feb 15, 2023
@MichaelScofield
Collaborator

After a quick look at datafusion and our code, there are mainly two usages of the catalog/schema manager: 1. listing all tables in a catalog; 2. finding a specific table.

For listing tables, I think we can ignore the caching demand, because:

  • In datafusion, it's used in information_schema and for "refreshing catalog", neither of which we need.
  • In our code, Tables in InformationSchema relies on it. However, I can't find any actual use of Tables. Can it be deleted? @v0y4g3r
  • Also in our code, "show databases/tables" requires it. But these are not frequent SQL statements, and neither of them is on any latency-critical job execution path.

For finding tables, I think we can now use moka to cache tables in RemoteCatalogManager in Datanode and FrontendCatalogManager in Frontend. WDYT? @v0y4g3r

@v0y4g3r
Contributor Author

v0y4g3r commented Mar 9, 2023

Since datafusion has already refactored their requirements for table resolution, the whole "schema provider"/"catalog provider" can be removed I think.

As for table cache, I'm ok with moka, as long as we can explicitly notify datanode to refresh cache on demand.
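A rough sketch of what "cache with explicit refresh on demand" could look like. The thread proposes moka; this stand-in uses only the standard library so it stays self-contained, and it has no TTL or size-based eviction. `TableCache`, `get_or_load`, and `invalidate` are hypothetical names, not moka's API or the project's.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Read-through cache keyed by table name (std-only stand-in for a moka cache).
pub struct TableCache<V: Clone> {
    entries: Mutex<HashMap<String, V>>,
}

impl<V: Clone> TableCache<V> {
    pub fn new() -> Self {
        Self { entries: Mutex::new(HashMap::new()) }
    }

    /// Return the cached value, or call `load` on a miss and cache its result.
    /// Simplification: the lock is held while `load` runs, which a real
    /// implementation would likely avoid for slow remote lookups.
    pub fn get_or_load<F>(&self, table: &str, load: F) -> Option<V>
    where
        F: FnOnce() -> Option<V>,
    {
        let mut entries = self.entries.lock().unwrap();
        if let Some(v) = entries.get(table) {
            return Some(v.clone());
        }
        let loaded = load()?;
        entries.insert(table.to_string(), loaded.clone());
        Some(loaded)
    }

    /// Explicit invalidation: the hook a "refresh cache" notification would call.
    /// Returns true if an entry was actually removed.
    pub fn invalidate(&self, table: &str) -> bool {
        self.entries.lock().unwrap().remove(table).is_some()
    }
}
```

After `invalidate`, the next `get_or_load` for that table goes back to the loader, which is the behavior the datanode notification is meant to trigger.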

@MichaelScofield
Collaborator

Let's wait for datafusion's catalog refactor in apache/datafusion#5291

@fengys1996
Contributor

There may be a problem:
Assume there is a table a, and the frontend caches the info of table a. When someone else alters table a, there is no mechanism to trigger an update of the frontend's cache. @MichaelScofield @v0y4g3r

@MichaelScofield
Collaborator

You can implement a heartbeat in the frontend first, and carry this cache invalidation in the heartbeat response.

@fengys1996
Contributor

Can we establish a heartbeat mechanism between frontend and meta, similar to the heartbeat between datanode and meta? That way, we can transmit information such as table alterations through heartbeats to trigger frontend cache updates or do other things.
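A minimal sketch of the heartbeat idea, assuming the heartbeat response from meta carries the names of tables altered since the last beat and the frontend evicts those entries. `HeartbeatResponse`, `invalidated_tables`, and `FrontendTableCache` are hypothetical; the real heartbeat message layout is not described in this thread.

```rust
use std::collections::HashMap;

/// Hypothetical heartbeat payload sent from meta to the frontend.
#[derive(Debug, Default)]
pub struct HeartbeatResponse {
    /// Fully qualified names of tables altered since the previous heartbeat.
    pub invalidated_tables: Vec<String>,
}

/// Hypothetical frontend-side cache: table name -> cached schema version.
#[derive(Default)]
pub struct FrontendTableCache {
    tables: HashMap<String, u64>,
}

impl FrontendTableCache {
    pub fn insert(&mut self, table: &str, version: u64) {
        self.tables.insert(table.to_string(), version);
    }

    pub fn get(&self, table: &str) -> Option<u64> {
        self.tables.get(table).copied()
    }

    /// Evict every entry the heartbeat marks as stale.
    /// Returns how many cached entries were actually removed.
    pub fn apply_heartbeat(&mut self, resp: &HeartbeatResponse) -> usize {
        resp.invalidated_tables
            .iter()
            .filter(|name| self.tables.remove(name.as_str()).is_some())
            .count()
    }
}
```

With this shape, staleness is bounded by the heartbeat interval: an altered table can still be served from cache until the next response arrives.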

@MichaelScofield
Collaborator

of course

@MichaelScofield
Collaborator

completed in #1592
