-
Notifications
You must be signed in to change notification settings - Fork 66
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(ibis): persistent query cache #1088
Conversation
WalkthroughThe update integrates a query caching feature into the ibis-server application. A new Changes
Sequence Diagram(s)Query Endpoint FlowsequenceDiagram
participant Client
participant Router
participant CacheManager
participant DataSource
Client->>Router: Send query request (cache_enable=true)
Router->>CacheManager: get(data_source, sql, info)
alt Cache Hit
CacheManager-->>Router: Cached result
Router->>Client: Return cached result (header: X-Cache-Hit: true)
else Cache Miss
CacheManager-->>Router: No cache found
Router->>DataSource: Execute rewritten query
DataSource-->>Router: Query result
Router->>CacheManager: set(data_source, sql, result, info)
Router->>Client: Return new result (header: X-Cache-Hit: false)
end
Application Initialization FlowsequenceDiagram
participant Lifespan
participant State
participant CacheManager
Lifespan->>State: Start application lifecycle
State->>CacheManager: Instantiate QueryCacheManager
CacheManager-->>State: Return manager instance
State-->>Lifespan: Yield state with QueryCacheManager and JavaEngineConnector
Poem
📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (3)
🚧 Files skipped from review as they are similar to previous changes (1)
🧰 Additional context used🧬 Code Definitions (2)ibis-server/app/routers/v2/connector.py (6)
ibis-server/tests/routers/v3/connector/postgres/test_fallback_v2.py (1)
⏰ Context from checks skipped due to timeout of 90000ms (1)
🔇 Additional comments (10)
✨ Finishing Touches
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (9)
ibis-server/app/routers/v2/connector.py (2)
57-93
: Consider adding cache invalidation mechanisms.While the current implementation provides a solid foundation for query caching, it might be beneficial to add mechanisms for cache invalidation, such as a time-to-live (TTL) parameter or explicit invalidation endpoints. This would help manage cache freshness, especially for frequently changing data.
57-93
: Consider adding cache control headers for more granular caching control.For enhanced client control over caching behavior, you might want to implement support for standard HTTP cache control headers. This would allow clients to specify caching preferences like max-age or no-cache on a per-request basis.
ibis-server/app/query_cache/__init__.py (5)
9-11
: Constructor could be more robust with configuration optionsThe
__init__
method is currently empty. Consider adding configuration options such as base cache directory path and cache expiration policy to make the cache manager more flexible.-def __init__(self): - pass +def __init__(self, cache_dir: str = "/tmp/wren-engine/", cache_expiry_seconds: Optional[int] = None): + self.cache_dir = cache_dir + self.cache_expiry_seconds = cache_expiry_seconds + os.makedirs(self.cache_dir, exist_ok=True)
13-27
: Add cache expiry check and more specific return typeThe cache retrieval logic doesn't check for stale cache entries. Additionally, the return type could be more specific.
-def get(self, data_source: str, sql: str, info) -> Optional[Any]: +def get(self, data_source: str, sql: str, info) -> Optional[pd.DataFrame]: cache_key = self._generate_cache_key(data_source, sql, info) cache_path = self._get_cache_path(cache_key) # Check if cache file exists if os.path.exists(cache_path): + # Check cache expiry if configured + if self.cache_expiry_seconds: + file_age = time.time() - os.path.getmtime(cache_path) + if file_age > self.cache_expiry_seconds: + logger.debug(f"Cache expired: {cache_path}") + return None + try: cache = ibis.read_parquet(cache_path) df = cache.execute() return df except Exception as e: logger.debug(f"Failed to read query cache {e}") return None
37-37
: Use consistent log levels for caching operationsYou're using
logger.info
for successful writes butlogger.debug
for failed operations. For consistency, either use debug level for all cache operations or info level for all.-logger.info(f"\nWriting query cache to {cache_path}\n") +logger.debug(f"Writing query cache to {cache_path}")
43-53
: Consider security implications of including sensitive information in cache keysThe cache key includes sensitive connection information. While this ensures cache entries are specific to different connections, it could potentially expose sensitive information in logs or error messages.
Consider using non-sensitive identifiers or applying additional hashing to sensitive values:
def _generate_cache_key(self, data_source: str, sql: str, info) -> str: key_parts = [ data_source, sql, - info.host.get_secret_value(), - info.port.get_secret_value(), - info.user.get_secret_value(), + hashlib.sha256(info.host.get_secret_value().encode()).hexdigest(), + hashlib.sha256(str(info.port.get_secret_value()).encode()).hexdigest(), + hashlib.sha256(info.user.get_secret_value().encode()).hexdigest(), ] key_string = "|".join(key_parts) return hashlib.sha256(key_string.encode()).hexdigest()
1-7
: Add missing imports for suggested improvementsIf implementing the suggested improvements, you'll need these additional imports.
import hashlib import os +import time from typing import Any, Optional import ibis +import pandas as pd from loguru import loggeribis-server/tests/routers/v2/query_cache/test_cache.py (2)
124-131
: Use a more robust approach for test cache directoryThe current approach creates a directory in
/tmp
, which may not work on all platforms. Consider using the built-in pytesttmp_path
fixture instead.@pytest.fixture(scope="function") -def cache_dir(): - temp_dir = "/tmp/wren-engine-test" - os.makedirs(temp_dir, exist_ok=True) - yield temp_dir - # Clean up after the test - shutil.rmtree(temp_dir, ignore_errors=True) +def cache_dir(tmp_path): + cache_dir = tmp_path / "wren-engine-test" + cache_dir.mkdir(exist_ok=True) + yield str(cache_dir) + # Clean up happens automatically with tmp_path
1-14
: Add a test for theenable_cache
parameterThe PR summary mentions an
enable_cache
parameter to opt-in to caching, but there's no test to verify that caching is disabled when this parameter is not set toTrue
.Consider adding a test function that verifies caching behavior when the enable_cache flag is set to False:
async def test_query_without_cache( client, manifest_str, postgres: PostgresContainer, cache_dir, monkeypatch ): # Override the cache path to use our test directory monkeypatch.setattr( QueryCacheManager, "_get_cache_path", lambda self, key: f"{cache_dir}/{key}.cache", ) connection_info = _to_connection_info(postgres) test_sql = 'SELECT * FROM "Orders" LIMIT 10' # First request with enable_cache=False response1 = await client.post( url=f"{base_url}/query", json={ "connectionInfo": connection_info, "manifestStr": manifest_str, "sql": test_sql, "enable_cache": False, }, ) assert response1.status_code == 200 # Second request with same SQL - should still miss cache response2 = await client.post( url=f"{base_url}/query", json={ "connectionInfo": connection_info, "manifestStr": manifest_str, "sql": test_sql, "enable_cache": False, }, ) assert response2.status_code == 200 # Verify cache wasn't used assert response2.headers.get("X-Cache-Hit") == "false", "Cache should not be used when enable_cache is False" # No cache files should be created assert len(os.listdir(cache_dir)) == 0, "No cache files should be created when enable_cache is False"
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
ibis-server/app/main.py
(2 hunks)ibis-server/app/model/__init__.py
(1 hunks)ibis-server/app/query_cache/__init__.py
(1 hunks)ibis-server/app/routers/v2/connector.py
(3 hunks)ibis-server/pyproject.toml
(1 hunks)ibis-server/tests/routers/v2/query_cache/test_cache.py
(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: ci
🔇 Additional comments (14)
ibis-server/app/model/__init__.py (1)
21-21
: LGTM! Clean implementation of the cache opt-in flag.The addition of
enable_cache
as a boolean field with a default value ofFalse
is a good implementation of the opt-in caching approach mentioned in the PR objectives. This ensures backward compatibility as existing clients won't have their queries cached unless they explicitly set this flag.ibis-server/pyproject.toml (1)
78-78
: Good test organization with the new cache marker.Adding a dedicated pytest marker for cache tests facilitates better test organization and allows selectively running only cache-related tests during development or debugging.
ibis-server/app/main.py (3)
16-16
: LGTM! Appropriate import for the new cache manager.The import statement for the QueryCacheManager is correctly placed with other application imports.
25-26
: LGTM! Properly extending the application state.Adding the QueryCacheManager to the State class ensures it's accessible throughout the application lifecycle.
31-37
: LGTM! Proper initialization and state management for the cache manager.The changes correctly initialize the QueryCacheManager and include it in the yielded state dictionary. This approach properly integrates the cache manager into the application's lifecycle.
ibis-server/app/routers/v2/connector.py (8)
22-22
: LGTM! Appropriate import for the QueryCacheManager.The import is correctly placed with other application imports.
33-35
: LGTM! Well-structured dependency injection for the cache manager.The helper function follows the same pattern as
get_java_engine_connector
, which maintains consistency in the codebase.
44-44
: LGTM! Properly injecting the cache manager dependency.Correctly added the QueryCacheManager as a dependency to the query endpoint.
53-56
: LGTM! Good initialization of cache-related variables.Variables are clearly named and initialized with appropriate default values.
57-62
: LGTM! Clean implementation of cache retrieval.The code properly checks if caching is enabled before attempting to retrieve cached results. The cache key appears to be constructed from the data source, SQL query, and connection info, which should provide good cache specificity.
64-68
: LGTM! Clean cache hit handling.The implementation correctly returns the cached result and sets the X-Cache-Hit header to indicate a cache hit, as specified in the PR objectives.
78-83
: LGTM! Consistent header handling for dry runs.The X-Cache-Hit header is consistently applied even for dry run responses, which helps maintain a uniform API behavior.
84-93
: LGTM! Complete cache miss and result storage implementation.The code properly executes the query when there's a cache miss, and conditionally stores the result in the cache if caching is enabled. The X-Cache-Hit header is consistently set to indicate the cache status.
ibis-server/tests/routers/v2/query_cache/test_cache.py (1)
18-101
: LGTM! Comprehensive test manifestThe test manifest provides a thorough representation of the database schema for testing, including various column types and relationships.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @douenergy. Overall, it looks great, but something is missing.
- This PR only implements the cache for the v2 API. However, the v3 API also requires it.
- As per our offline discussion yesterday, we want to make the cache storing be async. However, I think we don't need to do that in this PR. We can have another follow-up PR for the async thing. If you want to do that here, it's also ok for me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
ibis-server/app/routers/v2/connector.py
(4 hunks)ibis-server/pyproject.toml
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- ibis-server/pyproject.toml
🧰 Additional context used
🧬 Code Definitions (1)
ibis-server/app/routers/v2/connector.py (1)
ibis-server/app/query_cache/__init__.py (2)
QueryCacheManager
(9-56)get
(13-27)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: ci
🔇 Additional comments (8)
ibis-server/app/routers/v2/connector.py (8)
23-23
: Import of QueryCacheManager added correctlyThe new import for
QueryCacheManager
has been properly added to support the caching functionality.
34-36
: LGTM - Clean implementation of cache manager dependencyThis function follows the same pattern as
get_java_engine_connector
, providing a consistent approach to dependency injection in FastAPI.
47-47
: Cache manager dependency properly injectedGood use of FastAPI's dependency injection system to pass the query cache manager to the query endpoint.
78-89
: Consider adding tracing for cache-related eventsAdding tracing events for cache operations would improve observability, especially for debugging and monitoring cache effectiveness.
if enable_cache: cached_result = query_cache_manager.get( str(data_source), dto.sql, dto.connection_info ) cache_hit = cached_result is not None + if cache_hit: + tracer.get_current_span().add_event("cache_hit") + else: + tracer.get_current_span().add_event("cache_miss")
90-94
: LGTM - Clean implementation of cache hit response pathThe code properly handles the case when a cached result is found, returning it with the appropriate cache hit header.
94-103
: LGTM - Query execution and caching logic looks goodThe implementation correctly executes the query when no cached result is found, caches the result if caching is enabled, and sets the appropriate cache header.
80-82
: Clarify dry run behavior with cachingIt's unclear what the expected behavior should be when both
dry_run
andenable_cache
are true. Should we skip the cache check for dry runs entirely (as implemented), or should we check the cache first and skip the dry run if we get a hit?Based on a previous review comment, there are two potential approaches:
- Skip cache for dry runs (current implementation) since dry runs are used to validate SQL which is already validated if in cache
- Always perform dry runs regardless of cache since they're inexpensive
Could you clarify the intended behavior for this scenario?
92-92
: Consider using a query parameter instead of a header for cache hit informationAs suggested in a previous review comment, you might want to consider using a query parameter instead of a header for indicating cache hits, similar to the dry-run parameter pattern. This would provide a more consistent API design.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
ibis-server/app/routers/v3/connector.py (1)
63-65
: X-Cache-Hit header in dry run response.Setting X-Cache-Hit to "true" for dry run responses may be misleading since no actual caching occurs during a dry run.
Consider setting this to a more appropriate value like "not_applicable" or removing it for dry runs to avoid confusion:
- dry_response.headers["X-Cache-Hit"] = "true" + dry_response.headers["X-Cache-Hit"] = "not_applicable"
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
ibis-server/app/routers/v3/connector.py
(3 hunks)ibis-server/tests/routers/v3/connector/postgres/test_query.py
(3 hunks)
🧰 Additional context used
🧬 Code Definitions (2)
ibis-server/tests/routers/v3/connector/postgres/test_query.py (1)
ibis-server/tests/routers/v2/connector/test_postgres.py (3)
cache_dir
(131-136)test_query_with_cache
(179-221)manifest_str
(109-110)
ibis-server/app/routers/v3/connector.py (2)
ibis-server/app/routers/v2/connector.py (3)
get_java_engine_connector
(30-31)get_query_cache_manager
(34-35)query
(41-108)ibis-server/app/model/connector.py (10)
dry_run
(59-63)dry_run
(76-77)dry_run
(84-92)dry_run
(118-120)dry_run
(206-212)query
(56-57)query
(72-73)query
(112-115)query
(143-181)query
(197-203)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: ci
🔇 Additional comments (12)
ibis-server/tests/routers/v3/connector/postgres/test_query.py (4)
2-3
: Appropriate imports added for caching functionality.These imports are necessary for managing the temporary cache directory used in testing.
os
is used for directory operations, andshutil
for cleanup.
8-8
: Imports QueryCacheManager correctly.The import of QueryCacheManager is properly added to support the new cache testing functionality.
103-109
: Well-implemented cache directory fixture.This fixture correctly creates a temporary directory for testing cache functionality and ensures proper cleanup after tests. Using
exist_ok=True
is a good practice to prevent errors if the directory already exists.
151-191
: Comprehensive test case for caching functionality.This test thoroughly validates the cache mechanism by:
- Correctly overriding the cache path using monkeypatch
- Testing both cache miss (first request) and cache hit (second request) scenarios
- Verifying the cache hit/miss is reported in the response headers
- Confirming that the data returned is consistent between cached and non-cached responses
The test structure follows best practices by making assertions for both the cache hit status and data consistency.
ibis-server/app/routers/v3/connector.py (8)
23-23
: Appropriate import for query cache functionality.The QueryCacheManager import is correctly added to support the new caching feature.
25-25
: Updated import for cache manager dependency.The import has been properly updated to include the get_query_cache_manager function.
40-40
: New cache_enable parameter correctly implemented.The cache_enable parameter is properly added with a default value of False, making caching opt-in as described in the PR objectives.
44-44
: Query cache manager dependency correctly added.The QueryCacheManager is properly injected as a dependency using the Depends function.
46-51
: Improved span naming for observability.The span naming logic has been enhanced to include cache status, which is helpful for observability and tracing performance differences between cached and non-cached requests.
67-78
: Cache lookup implementation.The cache lookup logic is properly implemented. It checks for cache hits only when caching is enabled and uses the correct parameters (data source, SQL, and connection info) to generate the cache key.
79-84
: Cache hit handling.The cache hit logic is correctly implemented:
- Adding a span event for monitoring cache hits
- Returning the cached result with the appropriate header
- Converting the boolean cache_hit to lowercase string
85-92
: Cache miss handling.The cache miss and update logic is properly implemented. It executes the query, stores the result in the cache if enabled, and includes the X-Cache-Hit header in the response.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (2)
ibis-server/tests/routers/v3/connector/postgres/test_query.py (1)
139-171
: Well-structured test for verifying cache hits and misses.Your test correctly checks that the response headers reflect a cache miss on the first request and a hit on the second, verifying that the data, columns, and data types are identical. As an enhancement, consider adding additional test cases, such as varying the SQL query slightly or testing different parameters to ensure that only truly identical queries produce cache hits.
ibis-server/app/query_cache/__init__.py (1)
13-14
: Consider externalizing the cache root path further.Although the constructor parameterizes
root
, defaulting to"/tmp/wren-engine/"
, you might consider reading it from a config file or an environment variable for flexibility across different platforms and deployment environments (e.g., Windows vs. Linux).
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
ibis-server/app/query_cache/__init__.py
(1 hunks)ibis-server/tests/routers/v2/connector/test_postgres.py
(1 hunks)ibis-server/tests/routers/v3/connector/postgres/test_query.py
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- ibis-server/tests/routers/v2/connector/test_postgres.py
🧰 Additional context used
🧬 Code Definitions (1)
ibis-server/tests/routers/v3/connector/postgres/test_query.py (1)
ibis-server/tests/routers/v2/connector/test_postgres.py (2)
test_query_with_cache
(167-200)manifest_str
(106-107)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: ci
🔇 Additional comments (1)
ibis-server/app/query_cache/__init__.py (1)
55-66
: Review potential PII within cache keys.The
_generate_cache_key
method includes user credentials such as host, port, and username in the hash input. While the resulting file name is a SHA-256 hash, ensure no logs or error messages leak these credential strings. Restricting logs to the hash or redacting sensitive info helps maintain data privacy.
@goldmedal I have addressed all the comments including
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @douenergy. 👍
Implemented a persistent query cache system that saves results to disk as Parquet files (
/tmp/wren-engine/
)Added an enable_cache parameter to the query endpoint to allow clients to opt-in to cache functionality.
Integrated cache hit/miss tracking via an
X-Cache-Hit
response header for monitoring and debugging.I prefer header instead of a json body field. Because clean separation of metadata from data, follows HTTP convention for metadata, doesn't modify the existing response structure.
Summary by CodeRabbit
New Features
Tests