Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Source MSSQL: CDC commit logs are not fetched properly #25237

Open
uuboyscy opened this issue Apr 17, 2023 · 5 comments
Open

Source MSSQL: CDC commit logs are not fetched properly #25237

uuboyscy opened this issue Apr 17, 2023 · 5 comments
Labels
area/connectors Connector related issues community connectors/source/mssql connectors/sources-database frozen Not being actively worked on releaseStage/beta team/db-dw-sources Backlog for Database and Data Warehouse Sources team type/bug Something isn't working

Comments

@uuboyscy
Copy link

uuboyscy commented Apr 17, 2023

Environment

  • Airbyte version: 0.43.1
  • OS Version / Instance: Container-Optimized OS with containerd (cos_containerd) / GKE
  • Deployment: Google Kubernetes Engine
  • Source Connector and version: airbyte/1.0.11
  • Destination Connector and version: airbyte/destination-bigquery:1.2.20
  • Step where error happened: Sync job

Current Behavior

I integrate data from SQL Server 2017 with CDC enable to BigQuery, and configure "Incremental Deduped+History" on Airbyte.
Some of data are not integrated correctly.
These data would be modified multiple times in SQL Server in almost the same time, and the wrong order of commit would be kept in target destination.
As shown in following BigQuery photo, I rerun the SQLs generated when Airbyte syncing, found that the order would lead to wrong row of data kept.

Expected Behavior

Keep correct data in target destination.
As Bigquery screenshot shown below, the correct data is the row with TC027 = 'Y', but the one with TC027 = 'U' is kept now.
Further, maybe Airbyte does not use column __$command_id in MSSQL CDC log screenshot to order the latest commit, so that the older commit is update to target destination.

Logs

  • MSSQL CDC log screenshot
    cdc logs

  • Bigquery screenshot
    bigquery transformation temp table

Steps to Reproduce

As shown in above Bigquery screenshot, I rerun the SQLs generated when Airbyte syncing, found that the order would lead to wrong row of data kept (The data with TC027 = 'U' is kept, but TC027 = 'Y' is correct row).

@HonzaPavel
Copy link

HonzaPavel commented Apr 21, 2023

I have a very similar problem. I have a connection between MS SQL Server and BigQuery. I have set up CDC on MS SQL Server. I am using Incremental Sync - Deduped History. SCD historization is not working correctly because CDC on MS SQL Server uses local time zone instead of UTC. In BigQuery the difference is 2 hours. Isn't that the reason for your problems? Is it not possible to set the time zone change to UTC in Airbyte?

image

@uuboyscy
Copy link
Author

uuboyscy commented Apr 27, 2023

Hi @HonzaPavel

It is one of problems. This also cause that timezone of the first update would be your local timezone, but later incremental update become UTC. I'm not sure why it happen, I think it is a bug of Debezium.

To solve this problem for short-term, as you can see my reply in this post, using k8tz to simply set global timezone for all newly created pods can solve the timezone issue if you establish Airbyte on K8s.
But the key problem is source-mssql-connector. Add timezone configuration to this Dockerfile, and execute gradle command here to build custom source-connector image, then you can solve the key problem.

For another problem you can refer to this PR. Source-mssql-connector missed change_lsn, which you can find in Debezium document, so that the basic-normalization fail to sort the latest and correct CDC commit. If you also have this problem, try changes in the PR and re-build a custom source-mssql-connector image via gradle command mentioned above.

Even if you fix the problems mentioned above, destination data may still incorrect because of basic-normalization not using change_lsn to sort data. Then all you need to do is modify stream_processor.py so that the basic-normalization process can generate correct DBT model.

@HonzaPavel
Copy link

HonzaPavel commented May 1, 2023

Hi @uuboyscy,
It is more complicated than I thought. Changing the time zone in the Dockerfile makes sense. Thank you. I will try it.
Shouldn't the other problems be fixed in Airbyte? I think they're bugs. I'd like to help, but I don't have enough experience contributing to open source projects. Have you tried adding these changes to the Airbyte code?

@igrankova igrankova changed the title MSSQL CDC commit logs are not fetched properly Source MSSQL: CDC commit logs are not fetched properly Jun 6, 2023
@igrankova igrankova added area/connectors Connector related issues connectors/sources-database connectors/source/mssql team/db-dw-sources Backlog for Database and Data Warehouse Sources team releaseStage/beta and removed needs-triage team/tse Technical Support Engineers autoteam labels Jun 6, 2023
@visma-henna-pekkala
Copy link

Hey! I just found this issue and I think it might be the same that has caused trouble for us, which I have described here!
Has the fix you described above worked properly for the time being? Unfortunately, the docker link cannot be found.

@plenti-jacob-roe
Copy link
Contributor

Hi @HonzaPavel

It is one of problems. This also cause that timezone of the first update would be your local timezone, but later incremental update become UTC. I'm not sure why it happen, I think it is a bug of Debezium.

To solve this problem for short-term, as you can see my reply in this post, using k8tz to simply set global timezone for all newly created pods can solve the timezone issue if you establish Airbyte on K8s. But the key problem is source-mssql-connector. Add timezone configuration to this Dockerfile, and execute gradle command here to build custom source-connector image, then you can solve the key problem.

For another problem you can refer to this PR. Source-mssql-connector missed change_lsn, which you can find in Debezium document, so that the basic-normalization fail to sort the latest and correct CDC commit. If you also have this problem, try changes in the PR and re-build a custom source-mssql-connector image via gradle command mentioned above.

Even if you fix the problems mentioned above, destination data may still incorrect because of basic-normalization not using change_lsn to sort data. Then all you need to do is modify stream_processor.py so that the basic-normalization process can generate correct DBT model.

in case people are still having this issue we had a similar issue not including change_lsn is the cause of the problem since multiple records share the same commit_lsn and same timestamp. #24206 had the fix for the problem and we are running this code in a custom connector and it solves the problem. Unfortunately it was closed and no fix has been implemented so if you have a transaction in sql server with multiple changes to the same row you can get incorrect data in the destination. @prateekmukhedkar is there a planned fix for this, your last comment on the pr seems to leave it unanswered.

@bleonard bleonard added the frozen Not being actively worked on label Mar 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/connectors Connector related issues community connectors/source/mssql connectors/sources-database frozen Not being actively worked on releaseStage/beta team/db-dw-sources Backlog for Database and Data Warehouse Sources team type/bug Something isn't working
Projects
None yet
Development

No branches or pull requests

7 participants