feat(caches): Add DuckLakeCache implementation (do not merge) #695


Draft · wants to merge 14 commits into base: main

Conversation

devin-ai-integration[bot]
Contributor

@devin-ai-integration devin-ai-integration bot commented Jun 17, 2025

Notes from AJ (@aaronsteers)

Most of the implementation appears ready to go once the features below are added.

🟥 Blocked

As of now, it appears UPDATE/MERGE are not yet supported:

Workaround found: First DELETE, then INSERT.
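The DELETE-then-INSERT workaround can be sketched generically as follows. This uses Python's built-in sqlite3 as a stand-in engine purely for illustration (the actual cache would run the equivalent statements through DuckDB), and the table and column names are hypothetical:

```python
import sqlite3

# Stand-in for the cache's target table; engine, table, and column
# names are illustrative only.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
con.execute("INSERT INTO users VALUES (1, 'old'), (2, 'keep')")

incoming = [(1, "new"), (3, "added")]

# Emulate MERGE/UPSERT without UPDATE support:
# 1) DELETE any existing rows whose keys appear in the incoming batch ...
con.executemany("DELETE FROM users WHERE id = ?", [(row[0],) for row in incoming])
# 2) ... then INSERT the incoming rows.
con.executemany("INSERT INTO users VALUES (?, ?)", incoming)

rows = sorted(con.execute("SELECT id, name FROM users").fetchall())
print(rows)  # [(1, 'new'), (2, 'keep'), (3, 'added')]
```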

Add DuckLakeCache implementation

Summary

This PR implements a new DuckLakeCache as a subclass of DuckDBCache to support the DuckLake table format in PyAirbyte. The implementation follows the established patterns from MotherDuck cache and provides minimal configuration requirements.

Changes

  • New cache implementation: airbyte/caches/ducklake.py
    • DuckLakeConfig class extending DuckDBConfig with DuckLake-specific fields
    • DuckLakeCache class extending DuckDBCache with minimal configuration
    • Sensible defaults using existing cache_dir pattern
  • Module exports: Updated airbyte/caches/__init__.py to export new classes
  • Example script: Added examples/run_faker_to_ducklake.py demonstrating usage

Configuration Parameters

  • metadata_connection_string: Connection string for DuckLake metadata database (defaults to sqlite:metadata.db)
  • data_path: Local directory for Parquet data files (defaults to data subdirectory)
  • catalog_name: Name for attached DuckLake catalog (defaults to ducklake_catalog)
  • storage_credentials: Optional dict for storage credentials (defaults to None)
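As a rough sketch of how these defaults might hang together, consider the following. This is a plain dataclass stand-in, not the real `DuckLakeConfig` (which extends `DuckDBConfig`); only the field names and documented defaults are taken from the list above:

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class DuckLakeConfigSketch:
    """Illustrative stand-in for DuckLakeConfig; not the real class."""

    cache_dir: Path = Path(".cache")
    catalog_name: str = "ducklake_catalog"
    metadata_connection_string: str = ""
    data_path: str = ""
    storage_credentials: Optional[dict] = None

    def __post_init__(self) -> None:
        # Mirror the documented defaults: a SQLite metadata DB and a
        # "data" subdirectory, both resolved under the cache directory.
        if not self.metadata_connection_string:
            self.metadata_connection_string = f"sqlite:{self.cache_dir / 'metadata.db'}"
        if not self.data_path:
            self.data_path = str(self.cache_dir / "data")


cfg = DuckLakeConfigSketch()
print(cfg.metadata_connection_string)
print(cfg.data_path)
```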

Key Features

  • Minimal configuration: Only catalog_name required, all other fields have sensible defaults
  • Cache directory integration: Uses existing .cache directory pattern for default paths
  • DuckDB compatibility: Maintains compatibility with existing DuckDB destination pairing
  • Local storage focus: Uses SQLite for metadata and local directories for data storage

Usage Example

```python
from airbyte.caches import DuckLakeCache

# Minimal configuration - only catalog_name required
cache = DuckLakeCache(catalog_name="my_ducklake_catalog")

# Full configuration example
cache = DuckLakeCache(
    metadata_connection_string="sqlite:./metadata.db",
    data_path="./ducklake_data/",
    catalog_name="my_catalog",
    schema_name="myschema",
)
```

Testing

  • ✅ Example script runs successfully with minimal configuration
  • ✅ All linting and formatting checks pass (ruff format, ruff check)
  • ✅ Import verification test confirms proper defaults
  • ✅ Processes 20,100 records across 3 streams in example run

Implementation Notes

  • Follows the same inheritance pattern as MotherDuckCache
  • Uses standard DuckDBSqlProcessor (no custom processor needed initially)
  • Provides foundation for future DuckLake ATTACH statement implementation
  • Maintains backward compatibility with existing cache infrastructure
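For context, a future processor would presumably issue an ATTACH statement on connect, built from the config fields above. The helper below is hypothetical (it only assembles SQL text; quoting and escaping are simplified, and the exact DuckLake ATTACH syntax should be checked against the DuckDB documentation):

```python
def build_attach_sql(
    metadata_connection_string: str,
    catalog_name: str,
    data_path: str,
) -> str:
    """Assemble a DuckLake ATTACH statement from the cache's config fields.

    Hypothetical helper; quoting/escaping is simplified for illustration.
    """
    return (
        f"ATTACH 'ducklake:{metadata_connection_string}' AS {catalog_name} "
        f"(DATA_PATH '{data_path}')"
    )


sql = build_attach_sql("sqlite:metadata.db", "my_catalog", "./ducklake_data/")
print(sql)
```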

Link to Devin run

https://app.devin.ai/sessions/1a262eb5a472438ba8f04088ad0b91bb

Requested by

AJ Steers (aj@airbyte.io)

…tion

- Add DuckLakeCache as subclass of DuckDBCache following MotherDuck pattern
- Support DuckLake table format with metadata database and Parquet storage
- Provide sensible defaults using cache_dir pattern:
  - db_path defaults to 'ducklake.db' in cache directory
  - metadata_connection_string defaults to 'sqlite:metadata.db' in cache directory
  - data_path defaults to 'data' subdirectory in cache directory
  - catalog_name defaults to 'ducklake_catalog'
- Enable minimal configuration requiring only catalog_name parameter
- Add example script demonstrating usage with source-faker
- Export DuckLakeCache and DuckLakeConfig in caches module

Co-Authored-By: AJ Steers <aj@airbyte.io>
Contributor Author

Original prompt from AJ Steers:

Received message in Slack channel #ask-devin-ai:

@Devin - Let's add a DuckLakeCache option in PyAirbyte. Use this article for background and basic connectivity instructions. <https://motherduck.com/blog/getting-started-ducklake-table-format/|https://motherduck.com/blog/getting-started-ducklake-table-format/>

We'll build this as a subclass of DuckDBCache, with custom connectivity args. For now, we'll keep things generic and take a connection string (a subset of which would go in the "attach" statement). If we can get away with using SQLite for the DB and local directories for the raw data store (instead of blob storage), let's do that. Otherwise, we'll need to configure either special creds for testing or another emulation layer like <http://Min.io|Min.io>. Don't try any kind of emulation layer for now, and we'll try to keep things simple for as long as we can.

Contributor Author

devin-ai-integration bot commented Jun 17, 2025

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring


github-actions bot commented Jun 17, 2025

PyTest Results (Fast Tests Only, No Creds)

229 tests  ±0   229 ✅ ±0   3m 14s ⏱️ -10s
  1 suites ±0     0 💤 ±0 
  1 files   ±0     0 ❌ ±0 

Results for commit 0b6cb53. ± Comparison against base commit 796200b.

♻️ This comment has been updated with latest results.

Contributor

@aaronsteers aaronsteers left a comment


Devin - I'm applying several changes to your PR. Wait 30 seconds and then pull the latest.

Comment on lines 104 to 111
```python
def get_database_name(self) -> str:
    """Return the name of the database."""
    if self.db_path == ":memory:":
        return "memory"

    split_on = "\\" if "\\" in str(self.db_path) else "/"

    return str(self.db_path).split(sep=split_on)[-1].split(".")[0]
```
Contributor


I am not sure, but I believe this should just return self.catalog_name.
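The suggested simplification might look like the sketch below (a minimal stand-in class carrying only the attribute used; the real method lives on the config class):

```python
class _ConfigSketch:
    """Minimal stand-in carrying only the attribute the method uses."""

    catalog_name = "ducklake_catalog"

    def get_database_name(self) -> str:
        """Return the catalog name rather than parsing db_path."""
        return self.catalog_name


name = _ConfigSketch().get_database_name()
print(name)  # ducklake_catalog
```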

- Fix MyPy errors by moving path resolution to model_post_init method
- Update get_database_name() to return catalog_name instead of parsing db_path
- Remove unused imports and fix linting issues
- Preserve AJ's documentation improvements for data_path and storage_credentials

Co-Authored-By: AJ Steers <aj@airbyte.io>
aaronsteers and others added 2 commits June 17, 2025 11:11
- Update duckdb dependency from ^1.1.0 to ^1.3.1
- Update duckdb-engine dependency from ^0.13.2 to ^0.17.0
- Add DuckLakeSqlProcessor with ATTACH statement handling
- Update example script to use ducklake-data directory for verification
- Verify DuckLake extension works and files land in correct directory

Co-Authored-By: AJ Steers <aj@airbyte.io>

github-actions bot commented Jun 17, 2025

PyTest Results (Full)

291 tests  ±0   277 ✅ ±0   15m 46s ⏱️ - 1m 24s
  1 suites ±0    14 💤 ±0 
  1 files   ±0     0 ❌ ±0 

Results for commit 0b6cb53. ± Comparison against base commit 796200b.

♻️ This comment has been updated with latest results.

@aaronsteers
Contributor

aaronsteers commented Jun 17, 2025

/fix-pr

Auto-Fix Job Info

This job attempts to auto-fix any linting or formatting issues. If any fixes are made, those changes will be automatically committed and pushed back to the PR. (This job requires that the PR author has "Allow edits from maintainers" enabled.)

PR auto-fix job started... Check job output.

✅ Changes applied successfully.
