Skip to content

Conversation

deepyaman
Copy link
Contributor

@deepyaman deepyaman commented Sep 16, 2025

Description of changes

Implement Backend.upsert() using sqlglot.expressions.merge() under the hood. Upsert support is very important, especially for data engineering use cases.

Starting with a most basic implementation, including only supporting one join column. I think this could be expanded to support a list without much effort.

MERGE INTO support is limited. DuckDB only added support for MERGE statements earlier today in 1.4.0, and many other backends don't support it. However, it seems like the more standard/correct approach for supporting upserts, and it doesn't require merge keys defined ahead of time on tables.

Backends that work:

  • DuckDB (from 1.4.0)
  • Flink
  • Oracle (currently using a hack to work around "AS" getting added to MERGE statement)
  • MS SQL (currently throwing a ; onto the end of every statement 😅)
  • Postgres

Should work, need help to test:

  • Databricks
  • Snowflake
  • BigQuery

Backends that don't work:

  • PySpark ("MERGE INTO TABLE is not supported temporarily.")
  • Clickhouse
  • DataFusion
  • SQLite (supports the nonstandard UPSERT statement)
  • Impala ("The MERGE statement is only supported for Iceberg tables.")
  • MySQL
  • Polars
  • RisingWave
  • Athena ("MERGE INTO is transactional and is supported only for Apache Iceberg tables in Athena engine version 3.")
  • Trino ("connector does not support modifying table rows")

Issues closed

@github-actions github-actions bot added tests Issues or PRs related to tests sql Backends that generate SQL labels Sep 16, 2025
@deepyaman deepyaman added the feature Features or general enhancements label Sep 16, 2025
@github-actions github-actions bot added the oracle The Oracle backend label Sep 17, 2025
@deepyaman deepyaman force-pushed the feat/api/backend-upsert branch from 5a27b34 to 21994eb Compare September 17, 2025 23:14
@deepyaman deepyaman requested a review from cpcloud September 18, 2025 00:52
@deepyaman
Copy link
Contributor Author

@cpcloud Requesting review for initial feedback while I try to improve backend support; I assume I could always mark a lot of them notyet if the approach is correct but some translations just need work.

Comment on lines 277 to 281
# TODO(deepyaman): Fix Oracle MERGE query generation in SQLGlot.
if "MERGE" in query:
assert "AS " in query
query = query.replace("AS ", "")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is obviously unacceptable, but will wait on tobymao/sqlglot#5910 to try and figure out what's the best way to work arounds this (unless somebody else has ideas).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed here tobymao/sqlglot#5911, will release soon.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@georgesittas Tested locally with SQLGlot 27.16.3, and removing this hack works perfectly. Thank you!!

(So impressed by the 0-issue inbox on SQLGlot BTW.)

expressions=[
sg.column(
col,
table=target_alias if self.name == "oracle" else None,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These if statements can before backend-specific variables, something like https://github.com/ibis-project/ibis/blob/main/ibis/backends/flink/__init__.py#L65-L66? I wanted to try and see what the prevailing pattern is before finalizing the default.

@github-actions github-actions bot added the mssql The Microsoft SQL Server backend label Sep 18, 2025
@deepyaman deepyaman force-pushed the feat/api/backend-upsert branch from 46b0e4a to 8e47b79 Compare September 18, 2025 04:54
query = query.sql(self.dialect)

if "MERGE" in query:
query = f"{query};"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm honestly not sure how to better do this; I wasn't able to figure out how to add a semicolon to an expression in SQLGlot.

@deepyaman deepyaman force-pushed the feat/api/backend-upsert branch from 2e1403d to b40103e Compare September 19, 2025 12:57
@deepyaman deepyaman force-pushed the feat/api/backend-upsert branch from 25061ad to d052b43 Compare September 20, 2025 16:45
@github-actions github-actions bot added the dependencies Issues or PRs related to dependencies label Sep 20, 2025
@deepyaman deepyaman force-pushed the feat/api/backend-upsert branch from 0ea427e to 9ae5426 Compare September 20, 2025 20:26
@deepyaman deepyaman force-pushed the feat/api/backend-upsert branch 2 times, most recently from ea32ee2 to d2b5922 Compare September 20, 2025 23:51
@deepyaman deepyaman force-pushed the feat/api/backend-upsert branch from d2b5922 to 8b0d6fe Compare September 21, 2025 02:57
@github-actions github-actions bot added the bigquery The BigQuery backend label Sep 21, 2025
@deepyaman
Copy link
Contributor Author

@cpcloud I'm temporarily cherry-picked the changes from #11636 and the updates to xfail/xpass two DuckDB tests in order to get DuckDB and Oracle tests working (need the newer DuckDB and SQLGlot releases). All of these changes are in the last 3 commits:

With that, all of the functionality to implement Backend.upsert() is working and ready for review. The remaining issues all also exist in #11637, so I won't duplicate solving them. Whenever you do merge #11637, I should be able to back out the last 3 commits and rebase this on top.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bigquery The BigQuery backend dependencies Issues or PRs related to dependencies feature Features or general enhancements mssql The Microsoft SQL Server backend oracle The Oracle backend sql Backends that generate SQL tests Issues or PRs related to tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add Backend.upsert
2 participants