-
Notifications
You must be signed in to change notification settings - Fork 4.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Source MSSQL: CDC commit logs are not fetched properly #25237
Comments
I have a very similar problem. I have a connection between MS SQL Server and BigQuery. I have set up CDC on MS SQL Server. I am using Incremental Sync - Deduped History. SCD historization is not working correctly because CDC on MS SQL Server uses local time zone instead of UTC. In BigQuery the difference is 2 hours. Isn't that the reason for your problems? Is it not possible to set the time zone change to UTC in Airbyte? |
Hi @HonzaPavel It is one of problems. This also cause that timezone of the first update would be your local timezone, but later incremental update become UTC. I'm not sure why it happen, I think it is a bug of Debezium. To solve this problem for short-term, as you can see my reply in this post, using k8tz to simply set global timezone for all newly created pods can solve the timezone issue if you establish Airbyte on K8s. For another problem you can refer to this PR. Source-mssql-connector missed Even if you fix the problems mentioned above, destination data may still incorrect because of basic-normalization not using |
Hi @uuboyscy, |
Hey! I just found this issue and I think it might be the same that has caused trouble for us, which I have described here! |
in case people are still having this issue we had a similar issue not including change_lsn is the cause of the problem since multiple records share the same commit_lsn and same timestamp. #24206 had the fix for the problem and we are running this code in a custom connector and it solves the problem. Unfortunately it was closed and no fix has been implemented so if you have a transaction in sql server with multiple changes to the same row you can get incorrect data in the destination. @prateekmukhedkar is there a planned fix for this, your last comment on the pr seems to leave it unanswered. |
Environment
Current Behavior
I integrate data from SQL Server 2017 with CDC enable to BigQuery, and configure "Incremental Deduped+History" on Airbyte.
Some of data are not integrated correctly.
These data would be modified multiple times in SQL Server in almost the same time, and the wrong order of commit would be kept in target destination.
As shown in following BigQuery photo, I rerun the SQLs generated when Airbyte syncing, found that the order would lead to wrong row of data kept.
Expected Behavior
Keep correct data in target destination.
As Bigquery screenshot shown below, the correct data is the row with TC027 = 'Y', but the one with TC027 = 'U' is kept now.
Further, maybe Airbyte does not use column __$command_id in MSSQL CDC log screenshot to order the latest commit, so that the older commit is update to target destination.
Logs
MSSQL CDC log screenshot
Bigquery screenshot
Steps to Reproduce
As shown in above Bigquery screenshot, I rerun the SQLs generated when Airbyte syncing, found that the order would lead to wrong row of data kept (The data with TC027 = 'U' is kept, but TC027 = 'Y' is correct row).
The text was updated successfully, but these errors were encountered: