-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Add non-unique u16 Id to ColumnDefinition #25388
Conversation
I'm having a test fail that's not my fault and can't get it to not flake:
I wonder if some address conflict is happening. I think for some tests we bind the address but then drop it and free it up for others to take. Possibly a race condition. |
Yeah one of the troubles with our integration tests is that we are launching the service from the CLI, but checking for an open port before starting. So yeah, there may be a race condition between selecting the available port, dropping the port, then starting the CLI with that port- or it is leaving a gap there where it could conflict with something else on the CI server. Since we are launching with the CLI, we could pass in the |
This commit adds the column_id field to ColumnDefinition so that the output for a Catalog will contain the id of that column. This is non unique, whereas TableIds and DbIds will be unique. The column_id corresponds to it's index in the schema. Closes #25386
0c3e4c3
to
c6da459
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have some concerns about using enumerate
for generating the ID, but otherwise, I think some tests added to check serialization before and after a column is added would be useful.
influxdb3_catalog/src/serialize.rs
Outdated
@@ -135,10 +138,12 @@ impl<'a> From<&'a TableDefinition> for TableSnapshot<'a> { | |||
let cols = def | |||
.schema() | |||
.iter() | |||
.map(|(col_type, f)| { | |||
.enumerate() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if using enumerate
to determine the column ID is a good idea. I think that generally, columns are always appended, in which case, it is okay, but in the event that we allow for dropping columns, then this would change their order and mess up the IDs.
We probably need some way to generate the IDs, based on what was the largest already used ID for a given table, and then ensure that that ID remains fixed for the column it is applied to for all time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, that's a good catch. They should have a well defined ID that remains static regardless of what schema changes later happen to the table.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay I've taken a look at this more @pauldix and @hiltontj and I've come to the conclusion we either use enumerate or we don't even bother with a column id. Here's why:
- Here's where we define a new TableDefinition, it's what ends up in our Catalog when serialized
- When we create a new one here is where we see what we use to make a new Schema
- This becomes a
Schema
- Which is just a wrapper around arrow::datatypes::SchemaRef
- This has no way to add IDs for columns and the way most arrow stuff works is by indexing on a column essentially. There is no stable Id for a column.
I don't think this is viable beyond doing enumerate or we'd have to upstream changes to arrow itself and that seems like it wouldn't be worth the trade off or something that would make sense upstream. I don't think this change is worth it given how everything in arrow works off indexing or column name not id.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could have our own TableSchema
which includes the SchemaRef
and also includes a map of column name to id. Then the TableSchema
is what we serialize in the catalog. We want to have the ids and not the string identifiers in the WALContents because that's much cheaper to serialize and deserialize. Also cheaper to index when inserting the data into the WriteBuffer
.
So we don't need to update arrow, we just need a wrapper around the arrow struct where we can add our own stuff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Is the follow up to modify the WalContents
so that it uses column IDs rather than column names?
Yeah. I'll open a separate issue for it |
This commit adds the column_id field to ColumnDefinition so that the output for a Catalog will contain the id of that column. This is non unique, whereas TableIds and DbIds will be unique. The column_id corresponds to it's index in the schema.
Closes #25386