the auto schema detection does not order the columns in the same way for different tables #17451
Comments
I tried an example. Say I created two tables in MySQL:
They have the same set of columns, but the order of columns is different.
The schema discovery library we use orders the columns by their column ordinals: https://github.com/SeaQL/sea-schema/blob/master/src/mysql/query/column.rs#L72 The ordering produced by this third-party library is inherited by RW: https://github.com/risingwavelabs/risingwave/blob/release-1.10/src/connector/src/source/cdc/external/mysql.rs#L102-L118 I don't know if we should do anything on our end to avoid this inconvenience. cc: @StrikeW @neverchanje
IIUC, this is determined by their business: there exist tables that have the same columns but in a different order, am I right? It is surprising to me that
Yes, and the upstream tables are managed by a different team. The user is not allowed to alter the orderings of the columns in the upstream table.
Thanks, this is a good idea.
Sounds like Without it, the user can easily work around this by manually reordering the columns before the union. Or for ease of reuse, create a Furthermore, the original issue description stated:
The example in #17451 (comment) above does not satisfy the "even when" part. Does auto schema detection guarantee that the column order in RisingWave is always the same as upstream?
Manual re-ordering by specifying the names of columns again makes the auto schema detection essentially meaningless because we want to avoid typing the names in the first place. But we now need to do it again. There are different levels of "willing". I believe the fact that auto-schema detection is implemented in RW and other systems indicates that people are not really that willing. For the change of the upstream table, the producer just needs to do it once.
As the two lines of code (one in sea-schema and one in RW) suggest, the "column ordinal", i.e., the order in which you specify columns in the table definition, determines the ordering we get when auto-detecting the schema from the upstream database into RW.
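To make the ordinal-based behavior concrete, here is a minimal sketch in Python. It does not call sea-schema or RisingWave; the `Column` type, the two example tables, and both helper functions are hypothetical stand-ins for illustration only.

```python
from typing import NamedTuple


class Column(NamedTuple):
    """Hypothetical stand-in for one row of upstream column metadata."""
    name: str
    ordinal: int  # 1-based position in the upstream table definition


# Two hypothetical upstream tables: same column set, different declaration order.
orders_a = [Column("id", 1), Column("amount", 2), Column("created_at", 3)]
orders_b = [Column("id", 1), Column("created_at", 2), Column("amount", 3)]


def by_ordinal(columns):
    """Mimic ordinal-based discovery: keep the upstream declaration order."""
    return [c.name for c in sorted(columns, key=lambda c: c.ordinal)]


def by_name(columns):
    """Alternative ordering: sort the column names alphabetically."""
    return sorted(c.name for c in columns)


print(by_ordinal(orders_a))  # declaration order of the first table
print(by_ordinal(orders_b))  # declaration order of the second table
print(by_name(orders_a) == by_name(orders_b))
```

Under these assumptions, ordinal-based ordering yields two different column lists for the same column set, while name-based ordering yields the same list for both tables.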
Another approach is to just sort the columns by name on our side. Is there any concern?
However, users are also not able to directly
I don't get the logic here But what exactly prevents us from helping in this case?
It seems we agreed to help via
It would be a surprise to users that the order of columns in a CDC table is different from its upstream.
Why not go one step further and automatically order the columns of the CDC table by name? I think users can live with the "UNION BY NAME" or "UNION CORRESPONDING" approach. I am not against it. But I don't see why ordering the columns of the table by name is bad.
But this is a big assumption. I cannot imagine why the user would care about the order of columns in the table other than the If there is a meaningful example, case closed. I don't have any counter-arguments anymore.
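As a rough illustration of the by-name approach mentioned above, the following Python sketch builds a `UNION ALL` query with an explicit, name-sorted column list for every table. `union_by_name_sql` is a hypothetical helper, not a RisingWave or DuckDB API, and it assumes plain identifiers that need no quoting.

```python
def union_by_name_sql(tables):
    """Build a UNION ALL query that aligns columns by name.

    `tables` maps a table name to its column names in detected order and
    must contain at least one table. Hypothetical helper: identifiers are
    not quoted or escaped, and mismatched column sets raise ValueError.
    """
    names = list(tables)
    reference = sorted(tables[names[0]])
    for name in names[1:]:
        if sorted(tables[name]) != reference:
            raise ValueError(f"{name} has a different column set")
    column_list = ", ".join(reference)
    # Every branch selects the same explicit column list, so position and name agree.
    return " UNION ALL ".join(f"SELECT {column_list} FROM {n}" for n in names)


print(union_by_name_sql({"t1": ["a", "b"], "t2": ["b", "a"]}))
```

The generated query no longer depends on the order in which each table's columns were detected.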
I don't understand what you mean by "automatically order". How to order? 👀
If the upstream table is
This has to be discussed under the context of auto schema detection for the CDC tables. It's weird to see that people would do this to trick themselves on purpose (never mind if you thought I was referring to automatically re-ordering in This issue occurs again when we later implement the auto-union feature for CDCing multiple upstream databases/tables.
cc @xiangjinwu @neverchanje might want to chime in
I agree with @lmatz that if the upstream schema is automatically mapped to RW, we don't have to preserve the original column order, especially considering that we'll support CDC from sharded tables and auto schema change in the future. In general, the alignment of column orders is a nice-to-have, but not a must. We don't guarantee it, and users should not rely on this behavior. On the other hand, I also think enforcing the order of both sides of UNION is unnecessary. I tried running a query that unions different column orders in DuckDB and MySQL, and both produce results successfully. Say, giving a query Therefore, here is my proposal:
Things could be simplified a little bit if we completely ignore the column order of CDC tables, as proposed by Martin. But users will still encounter such inconvenience when dealing with UNION of views, materialized views, Kafka tables, etc.
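For readers who want to see the positional behavior of a plain UNION, here is a small sketch using SQLite through Python's standard `sqlite3` module. SQLite is only a stand-in here, not RisingWave, DuckDB, or MySQL, and the tables `t1` and `t2` are made up. SQLite is loosely typed, so the query succeeds; a system with stricter typing may instead reject it when the positional types differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE t1 (a TEXT, b TEXT);
    CREATE TABLE t2 (b TEXT, a TEXT);  -- same columns, different order
    INSERT INTO t1 (a, b) VALUES ('a1', 'b1');
    INSERT INTO t2 (a, b) VALUES ('a2', 'b2');
    """
)

# SELECT * expands in declaration order and UNION ALL pairs columns by position,
# so t2's b value ends up in the first output column.
positional = conn.execute(
    "SELECT * FROM t1 UNION ALL SELECT * FROM t2 ORDER BY 1"
).fetchall()

# Listing the columns explicitly aligns both sides by name.
aligned = conn.execute(
    "SELECT a, b FROM t1 UNION ALL SELECT a, b FROM t2 ORDER BY 1"
).fetchall()

print(positional)
print(aligned)
conn.close()
```

The first query silently mixes the `a` and `b` values of `t2`, which is the kind of surprise the explicit column list avoids.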
Sharing my thoughts:
But a SQL
Did not really get what you mean. I tried
Thanks. I got it wrong.
My updated proposal:
Do you think it's easy enough to be implemented correctly?
Addendum: The reason DuckDB uses
I don't get it. The idea of re-ordering columns by us does not alter the behavior of "union".
Describe the bug
The user reported that even when the upstream tables' schemas are the same, the ordering of the columns is "sometimes" not the same. Therefore, union cannot union them by default via *.
Error message/log
No response
To Reproduce
No response
Expected behavior
No response
How did you deploy RisingWave?
No response
The version of RisingWave
No response
Additional context
No response