feat(caches): Add DuckLakeCache implementation (do not merge) #695
base: main
…tion
- Add DuckLakeCache as subclass of DuckDBCache following MotherDuck pattern
- Support DuckLake table format with metadata database and Parquet storage
- Provide sensible defaults using cache_dir pattern:
  - db_path defaults to 'ducklake.db' in cache directory
  - metadata_connection_string defaults to 'sqlite:metadata.db' in cache directory
  - data_path defaults to 'data' subdirectory in cache directory
  - catalog_name defaults to 'ducklake_catalog'
- Enable minimal configuration requiring only catalog_name parameter
- Add example script demonstrating usage with source-faker
- Export DuckLakeCache and DuckLakeConfig in caches module

Co-Authored-By: AJ Steers <aj@airbyte.io>
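The defaults listed in this commit message can be sketched roughly as follows. This is only an illustration: `DuckLakeDefaultsSketch` is a hypothetical stand-in built on a plain dataclass, not the PR's actual `DuckLakeConfig`, and the reading of "sqlite:metadata.db in cache directory" as `sqlite:<cache_dir>/metadata.db` is an assumption.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class DuckLakeDefaultsSketch:
    """Hypothetical stand-in for DuckLakeConfig; not the PR's actual class."""

    cache_dir: Path
    catalog_name: str = "ducklake_catalog"
    db_path: Optional[str] = None
    metadata_connection_string: Optional[str] = None
    data_path: Optional[str] = None

    def __post_init__(self) -> None:
        # Resolve every unset location relative to the cache directory.
        if self.db_path is None:
            self.db_path = str(self.cache_dir / "ducklake.db")
        if self.metadata_connection_string is None:
            # Assumption: the SQLite metadata file lives inside cache_dir.
            self.metadata_connection_string = f"sqlite:{self.cache_dir / 'metadata.db'}"
        if self.data_path is None:
            self.data_path = str(self.cache_dir / "data")


defaults = DuckLakeDefaultsSketch(cache_dir=Path("my-cache"))
```

Explicitly passed values are left untouched, so a caller can override any single path while keeping the other defaults.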
Devin - I'm applying several changes to your PR. Wait 30 seconds and then pull the latest.
airbyte/caches/ducklake.py
Outdated
    def get_database_name(self) -> str:
        """Return the name of the database."""
        if self.db_path == ":memory:":
            return "memory"

        split_on = "\\" if "\\" in str(self.db_path) else "/"

        return str(self.db_path).split(sep=split_on)[-1].split(".")[0]
I am not sure, but I believe this should just return self.catalog_name.
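A minimal sketch of what this suggestion might look like, using a hypothetical `CatalogNamedConfigSketch` class rather than the real `DuckLakeConfig` (whose full definition is not shown in this thread):

```python
class CatalogNamedConfigSketch:
    """Hypothetical config holder used only to illustrate the suggestion."""

    def __init__(self, catalog_name: str = "ducklake_catalog", db_path: str = ":memory:") -> None:
        self.catalog_name = catalog_name
        self.db_path = db_path

    def get_database_name(self) -> str:
        """Return the attached catalog name instead of parsing db_path."""
        return self.catalog_name


config = CatalogNamedConfigSketch(catalog_name="my_lake", db_path="/tmp/cache/ducklake.db")
database_name = config.get_database_name()
```

With this shape the database name no longer depends on the file name or on the path separator in `db_path`.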
- Fix MyPy errors by moving path resolution to model_post_init method
- Update get_database_name() to return catalog_name instead of parsing db_path
- Remove unused imports and fix linting issues
- Preserve AJ's documentation improvements for data_path and storage_credentials

Co-Authored-By: AJ Steers <aj@airbyte.io>
- Update duckdb dependency from ^1.1.0 to ^1.3.1
- Update duckdb-engine dependency from ^0.13.2 to ^0.17.0
- Add DuckLakeSqlProcessor with ATTACH statement handling
- Update example script to use ducklake-data directory for verification
- Verify DuckLake extension works and files land in correct directory

Co-Authored-By: AJ Steers <aj@airbyte.io>
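For context, the ATTACH handling mentioned in this commit might build a statement along these lines. This is a sketch under assumptions: the `ducklake:` prefix and `DATA_PATH` option follow the ATTACH form described in the DuckLake documentation as I understand it, `build_ducklake_attach_sql` is a hypothetical helper, and the SQL actually emitted by the PR's `DuckLakeSqlProcessor` is not shown here.

```python
def build_ducklake_attach_sql(
    metadata_connection_string: str,
    catalog_name: str,
    data_path: str,
) -> str:
    """Build an ATTACH statement for a DuckLake catalog (hypothetical helper)."""

    def quote_literal(value: str) -> str:
        # Escape single quotes so the value is a safe SQL string literal.
        return "'" + value.replace("'", "''") + "'"

    # The catalog name is used as a bare identifier, so keep it conservative.
    if not catalog_name.isidentifier():
        raise ValueError(f"Invalid catalog name: {catalog_name!r}")

    target = quote_literal("ducklake:" + metadata_connection_string)
    return f"ATTACH {target} AS {catalog_name} (DATA_PATH {quote_literal(data_path)})"


attach_sql = build_ducklake_attach_sql(
    metadata_connection_string="sqlite:metadata.db",
    catalog_name="ducklake_catalog",
    data_path="data",
)
```

The function only builds the string; executing it would additionally require a DuckDB connection with the DuckLake extension available.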
…with ducklake-data path Co-Authored-By: AJ Steers <aj@airbyte.io>
Co-Authored-By: AJ Steers <aj@airbyte.io>
/fix-pr
Notes from AJ (@aaronsteers)
Most of the implementation appears ready to go once the feature below is added.
🟥 Blocked: As of now, it appears UPDATE/MERGE are not yet supported.
Workaround found: First DELETE, then INSERT.
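The DELETE-then-INSERT workaround can be sketched as below. To keep the example runnable with only the standard library it uses `sqlite3` as a stand-in engine, not DuckLake itself; `delete_then_insert` and the `users` table are hypothetical names chosen for illustration, and the table and column names are assumed to come from trusted code rather than user input.

```python
import sqlite3
from typing import Any


def delete_then_insert(
    conn: sqlite3.Connection,
    table: str,
    key_column: str,
    rows: list[dict[str, Any]],
) -> None:
    """Emulate an upsert without UPDATE/MERGE: delete matching keys, then insert."""
    if not rows:
        return
    columns = list(rows[0].keys())
    placeholders = ", ".join("?" for _ in columns)
    # Run both steps in one transaction so readers never see the gap.
    with conn:
        conn.executemany(
            f"DELETE FROM {table} WHERE {key_column} = ?",
            [(row[key_column],) for row in rows],
        )
        conn.executemany(
            f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})",
            [tuple(row[column] for column in columns) for row in rows],
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'old'), (2, 'keep')")
delete_then_insert(
    conn,
    "users",
    "id",
    [{"id": 1, "name": "new"}, {"id": 3, "name": "added"}],
)
result = conn.execute("SELECT id, name FROM users ORDER BY id").fetchall()
```

Existing keys are replaced, new keys are appended, and untouched rows are left alone, which matches the effect a MERGE would have for this simple case.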
Add DuckLakeCache implementation

Summary
This PR implements a new DuckLakeCache as a subclass of DuckDBCache to support the DuckLake table format in PyAirbyte. The implementation follows the established patterns from the MotherDuck cache and requires only minimal configuration.

Changes
- airbyte/caches/ducklake.py
  - DuckLakeConfig class extending DuckDBConfig with DuckLake-specific fields
  - DuckLakeCache class extending DuckDBCache with minimal configuration
- airbyte/caches/__init__.py to export new classes
- examples/run_faker_to_ducklake.py demonstrating usage

Configuration Parameters
- metadata_connection_string: Connection string for DuckLake metadata database (defaults to sqlite:metadata.db)
- data_path: Local directory for Parquet data files (defaults to data subdirectory)
- catalog_name: Name for attached DuckLake catalog (defaults to ducklake_catalog)
- storage_credentials: Optional dict for storage credentials (defaults to None)

Key Features
- catalog_name required, all other fields have sensible defaults
- .cache directory pattern for default paths

Usage Example

Testing
- ruff format, ruff check

Implementation Notes
Link to Devin run
https://app.devin.ai/sessions/1a262eb5a472438ba8f04088ad0b91bb
Requested by
AJ Steers (aj@airbyte.io)