Background
We need to determine whether two dlt.Dataset instances can be joined in a single SQL query. The current is_same_physical_destination (dlt/dataset/dataset.py:426) compares str(config) — which calls str(credentials), rendering the full connection URL including username, port, and database:
```python
def is_same_physical_destination(dataset1, dataset2):
    return str(dataset1.destination_client.config) == str(dataset2.destination_client.config)
```

This is too strict for destinations that support cross-database joins (Snowflake, BigQuery, MSSQL, ClickHouse, Databricks, Dremio, Athena): the check returns False even when tables can be joined across databases on the same server/account.
It is also semantically wrong for the filesystem destination, where joinability comes from the in-process DuckDB query engine, not from data location.
A fingerprint() method exists on DestinationClientConfiguration that hashes a location identifier (e.g. host), but it is not used by is_same_physical_destination and is missing on several destinations (DuckDB, Athena, SQLAlchemy, Fabric).
What to implement
1. Add location() to DestinationClientConfiguration
Returns the unhashed physical location string — the identity of where this destination lives. Replaces per-destination fingerprint() overrides. fingerprint() becomes derived on the base class and is no longer overridden per-destination:
```python
def location(self) -> str:
    """Returns the physical location of this destination (e.g. host, bucket URL, project id)"""
    return ""

def fingerprint(self) -> str:
    """Returns a hash of location()"""
    loc = self.location()
    return digest128(loc) if loc else ""
```

2. Add can_join_with(other) to DestinationClientConfiguration
Determines if tables from this destination can be joined with tables from other in a single query. This reflects the dlt destination's current capability, not the theoretical capability of the underlying database engine (e.g. DuckDB supports ATTACH for cross-database joins, but dlt doesn't implement it yet).
Default implementation compares locations. Destinations override for special semantics (e.g. filesystem, Postgres).
```python
def can_join_with(self, other: "DestinationClientConfiguration") -> bool:
    """Check if tables from this destination can be joined with tables from other in a single query"""
    self_loc = self.location()
    other_loc = other.location()
    if not self_loc or not other_loc:
        return False
    return self_loc == other_loc
```

3. Update is_same_physical_destination in dlt/dataset/dataset.py
Delegate to can_join_with on the config.
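A minimal sketch of the delegation, using stand-in classes (the real dlt configuration, client, and dataset objects carry many more fields; only the shape of the call matters here):

```python
from dataclasses import dataclass


@dataclass
class _Config:
    """Stand-in for DestinationClientConfiguration with the proposed methods."""

    _location: str

    def location(self) -> str:
        return self._location

    def can_join_with(self, other: "_Config") -> bool:
        # Default rule from the proposal: both locations known and equal.
        self_loc, other_loc = self.location(), other.location()
        if not self_loc or not other_loc:
            return False
        return self_loc == other_loc


@dataclass
class _Client:
    config: _Config


@dataclass
class _Dataset:
    destination_client: _Client


def is_same_physical_destination(dataset1, dataset2) -> bool:
    # The function no longer compares str(config); it delegates to the config.
    return dataset1.destination_client.config.can_join_with(
        dataset2.destination_client.config
    )


a = _Dataset(_Client(_Config("acme-account")))
b = _Dataset(_Client(_Config("acme-account")))
c = _Dataset(_Client(_Config("other-account")))
d = _Dataset(_Client(_Config("")))  # unknown location -> never joinable
```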
Join compatibility per destination
location() is the physical identity of the destination. can_join_with determines joinability — it defaults to comparing locations but can be overridden.
| Destination | location() | can_join_with rule | Notes |
|---|---|---|---|
| Postgres | host:port | same host:port/database | No cross-DB joins, override needed |
| Redshift | host:port | same host:port/database | No cross-DB joins, override needed |
| Snowflake | host (account identifier) | same location (default) | Cross-DB within account |
| BigQuery | project_id | same location (default) | Cross-dataset within project |
| MSSQL | host | same location (default) | Cross-DB via db.schema.table |
| Synapse | host (inherited from MSSQL) | same location (default) | Cross-DB via db.schema.table |
| ClickHouse | host | same location (default) | Cross-DB via db.table |
| Databricks | server_hostname | same location (default) | Cross-catalog via Unity Catalog |
| Athena | aws_data_catalog | same location (default) | Cross-DB within Glue catalog |
| Dremio | host | same location (default) | Federation engine, cross-source |
| DuckDB | database file path or ":memory:" | same location (default) | No ATTACH support in dlt currently |
| MotherDuck | access token (account) | same location (default) | Cross-DB within account |
| Filesystem | scheme://netloc or "" for local | always True vs other filesystem, override needed | In-process DuckDB scans files directly |
| DuckLake | catalog connection string (no credentials) + ducklake_name | same location (default) | Catalog is the identity, not storage (paths are relative, bucket is swappable) |
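To make the two "override needed" rows concrete, here is a hedged sketch of what the Postgres and Filesystem overrides could look like. Class and attribute names are illustrative stand-ins, not the real dlt configuration classes:

```python
class PostgresLikeConfiguration:
    """Stand-in for the Postgres destination config (illustrative names)."""

    def __init__(self, host: str, port: int, database: str):
        self.host, self.port, self.database = host, port, database

    def location(self) -> str:
        # Physical identity is the server, not a single database.
        return f"{self.host}:{self.port}"

    def can_join_with(self, other) -> bool:
        # Postgres cannot join across databases in one query, so joinability
        # additionally requires the same database name.
        return (
            isinstance(other, PostgresLikeConfiguration)
            and self.location() == other.location()
            and self.database == other.database
        )


class FilesystemLikeConfiguration:
    """Stand-in for the filesystem destination config (illustrative names)."""

    def __init__(self, bucket_url: str):
        self.bucket_url = bucket_url

    def location(self) -> str:
        # scheme://netloc, or "" for local paths
        return self.bucket_url

    def can_join_with(self, other) -> bool:
        # Joinability comes from the in-process DuckDB engine, which can scan
        # any two filesystem locations in one query, so any filesystem
        # destination is joinable with any other, regardless of location().
        return isinstance(other, FilesystemLikeConfiguration)
```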
SQLAlchemy — dialect-dependent can_join_with
SQLAlchemy's location() is host:port (or file path for sqlite). can_join_with depends on the backend dialect:
| Backend | can_join_with rule |
|---|---|
| postgresql | same host:port/database (no cross-DB) |
| mysql | same host:port (cross-DB supported) |
| mssql | same host:port (cross-DB supported) |
| oracle | same host:port (cross-DB supported) |
| db2 | same host:port (cross-DB supported) |
| sqlite | same database file path |
| unknown | same host:port/database (conservative default) |
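A sketch of the dialect-dependent rule (the dialect names are SQLAlchemy's; the class and its attributes are stand-ins for the real dlt configuration):

```python
# Dialects whose engines can join across databases on the same server.
CROSS_DB_DIALECTS = {"mysql", "mssql", "oracle", "db2"}


class SqlalchemyLikeConfiguration:
    """Stand-in for the SQLAlchemy destination config (illustrative names)."""

    def __init__(self, dialect: str, host_port: str, database: str):
        self.dialect = dialect
        self.host_port = host_port  # "host:port", or file path for sqlite
        self.database = database

    def location(self) -> str:
        return self.host_port

    def can_join_with(self, other) -> bool:
        if self.dialect != getattr(other, "dialect", None):
            return False
        if self.location() != other.location():
            return False
        if self.dialect in CROSS_DB_DIALECTS:
            # Same server is enough: these backends join across databases.
            return True
        # postgresql, sqlite, and unknown dialects: conservative default,
        # require the same database (or the same file path) as well.
        return self.database == other.database
```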
Missing location() implementations
These destinations currently have no fingerprint() override and return "":
| Destination | Needed location() |
|---|---|
| DuckDB | database file path or ":memory:" |
| Athena | aws_data_catalog identifier |
| SQLAlchemy | host:port (or file path for sqlite) |
| Fabric | host/workspace identifier from credentials |
Existing corrections needed
| Destination | Current fingerprint() | Problem | Fix |
|---|---|---|---|
| DuckLake | digest128(storage bucket) | Storage is swappable via OVERRIDE_DATA_PATH; catalog is the real identity | location() = catalog connection string (no credentials) + ducklake_name |
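One way the DuckLake fix could be sketched: derive identity from the catalog connection string with credentials stripped, combined with the lake name. The function and parameter names (`ducklake_location`, `catalog_dsn`, `ducklake_name`) are assumptions for illustration, not the real dlt API:

```python
from urllib.parse import urlsplit, urlunsplit


def ducklake_location(catalog_dsn: str, ducklake_name: str) -> str:
    # Strip username/password from the catalog connection string; the
    # catalog server plus the lake name is the identity, not the storage.
    parts = urlsplit(catalog_dsn)
    netloc = parts.hostname or ""
    if parts.port:
        netloc += f":{parts.port}"
    sanitized = urlunsplit((parts.scheme, netloc, parts.path, parts.query, ""))
    return f"{sanitized}#{ducklake_name}"
```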