feat: Add non-unique u16 Id to ColumnDefinition #25388

mgattozzi · 2024-09-24T17:27:29Z

This commit adds a column_id field to ColumnDefinition so that the serialized output for a Catalog contains the id of each column. This id is non-unique, whereas TableIds and DbIds will be unique. The column_id corresponds to the column's index in the schema.

Closes #25386

mgattozzi · 2024-09-26T16:37:43Z

One test keeps failing for reasons unrelated to this change, and I can't get it to stop flaking:

Serve command failed: Failed to bind address

I wonder if some address conflict is happening. I think for some tests we bind the address but then drop it and free it up for others to take. Possibly a race condition.

hiltontj · 2024-09-30T13:19:47Z

Yeah, one of the troubles with our integration tests is that we launch the service from the CLI but check for an open port before starting it. So there may be a race between selecting the available port, dropping it, and then starting the CLI with that port; that gap leaves room for a conflict with something else on the CI server.

Since we are launching with the CLI, we could pass in the --http-bind 0.0.0.0:0 arg directly, which would have it pick a random port and use it to start the server directly, but I'm not sure how we would then report the selected port back to the test harness.
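For readers following along, the race and the port-0 alternative can be sketched with the standard library alone. This is a minimal illustration, not the project's test harness: `pick_free_port` and `bind_ephemeral` are hypothetical helper names, and the sketch assumes the harness probes for a port by binding and then dropping a listener.

```rust
use std::io;
use std::net::{SocketAddr, TcpListener};

/// Probe-then-release (hypothetical helper): bind to port 0, read back the
/// port the OS chose, then drop the listener. The port is free again as soon
/// as this function returns, so another process may claim it before the
/// server under test binds it.
fn pick_free_port() -> io::Result<u16> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let port = listener.local_addr()?.port();
    Ok(port) // `listener` is dropped here; the race window opens.
}

/// Bind-and-keep (hypothetical helper): bind to port 0 and keep the listener,
/// reporting the address the OS assigned. There is no window in which the
/// port is unowned.
fn bind_ephemeral() -> io::Result<(TcpListener, SocketAddr)> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;
    Ok((listener, addr))
}

fn main() -> io::Result<()> {
    let probed = pick_free_port()?;
    println!("probed port {probed}, but nothing holds it any more");

    let (_listener, addr) = bind_ephemeral()?;
    println!("listening on {addr} with no gap between choosing and binding");
    Ok(())
}
```

The open question from the comment above remains: when the server itself binds port 0, it still has to report the chosen address back to whoever launched it.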

hiltontj

I have some concerns about using enumerate for generating the ID, but otherwise, I think some tests added to check serialization before and after a column is added would be useful.

hiltontj · 2024-10-08T14:18:32Z

influxdb3_catalog/src/serialize.rs

@@ -135,10 +138,12 @@ impl<'a> From<&'a TableDefinition> for TableSnapshot<'a> {
        let cols = def
            .schema()
            .iter()
-            .map(|(col_type, f)| {
+            .enumerate()


I don't know if using enumerate to determine the column ID is a good idea. Generally, columns are only ever appended, in which case it is okay. But if we ever allow dropping columns, the positions of the remaining columns would shift and mess up the IDs.

We probably need some way to generate the IDs, based on what was the largest already used ID for a given table, and then ensure that that ID remains fixed for the column it is applied to for all time.
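The concern can be reproduced in a few lines. This is a standalone sketch, not the catalog code: `ids_by_position` is a hypothetical function that mimics assigning IDs with `.enumerate()`, and the column names are made up for illustration.

```rust
/// Assign each column an ID equal to its position, as `.enumerate()` would.
/// (Hypothetical helper for illustration only.)
fn ids_by_position(columns: &[&str]) -> Vec<(u16, String)> {
    columns
        .iter()
        .enumerate()
        .map(|(index, name)| (index as u16, name.to_string()))
        .collect()
}

fn main() {
    // Three columns: "usage" sits at position 2.
    let before = ids_by_position(&["time", "host", "usage"]);
    assert_eq!(before[2], (2, "usage".to_string()));

    // After dropping "host", "usage" moves to position 1, so its ID changes.
    let after = ids_by_position(&["time", "usage"]);
    assert_eq!(after[1], (1, "usage".to_string()));

    println!("before: {before:?}");
    println!("after:  {after:?}");
}
```

Anything that persisted ID 2 for "usage" before the drop would no longer resolve to the same column afterwards.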

pauldix · 2024-10-08

Ah yes, that's a good catch. They should have a well-defined ID that remains static regardless of what schema changes later happen to the table.

mgattozzi · 2024-10-10

Okay, I've taken a closer look at this, @pauldix and @hiltontj, and I've come to the conclusion that we either use enumerate or don't bother with a column id at all. Here's why:

Here's where we define a new TableDefinition; it's what ends up in our Catalog when serialized

When we create a new one, here is where we see what we use to make a new Schema

This becomes a Schema

Which is just a wrapper around arrow::datatypes::SchemaRef

This has no way to add IDs for columns, and most arrow code works by indexing on a column position. There is no stable Id for a column.

I don't think anything beyond enumerate is viable unless we upstream changes to arrow itself, and that doesn't seem worth the trade-off or like something that would make sense upstream. I don't think this change is worth it, given that everything in arrow works off the column index or column name, not an id.

pauldix · 2024-10-10

We could have our own TableSchema which includes the SchemaRef and also includes a map of column name to id. Then the TableSchema is what we serialize in the catalog. We want to have the ids and not the string identifiers in the WALContents because that's much cheaper to serialize and deserialize. It's also cheaper to index when inserting the data into the WriteBuffer.

So we don't need to update arrow, we just need a wrapper around the arrow struct where we can add our own stuff.
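A rough sketch of what such a wrapper could look like follows. It is an assumption-laden illustration, not the TableSchema that landed in d202994: the `SchemaRef` alias here is a stand-in for `arrow::datatypes::SchemaRef` so the example compiles without the arrow crate, and the field and method names (`column_ids`, `next_column_id`, `add_column`, `drop_column`, `column_id`) are hypothetical.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Stand-in for `arrow::datatypes::SchemaRef` (an assumption, so that the
/// sketch has no external dependencies). Here it is just a list of names.
type SchemaRef = Arc<Vec<String>>;

type ColumnId = u16;

/// Hypothetical wrapper: the arrow-style schema plus a name-to-id map.
#[derive(Debug, Default)]
struct TableSchema {
    schema: SchemaRef,
    column_ids: HashMap<String, ColumnId>,
    /// Next ID to hand out; it only ever grows, so IDs are never reused.
    next_column_id: ColumnId,
}

impl TableSchema {
    /// Add a column and return its ID. Re-adding an existing name returns
    /// the ID it already has.
    fn add_column(&mut self, name: &str) -> ColumnId {
        if let Some(id) = self.column_ids.get(name) {
            return *id;
        }
        let id = self.next_column_id;
        self.next_column_id = id.checked_add(1).expect("column id overflow");
        self.column_ids.insert(name.to_string(), id);
        Arc::make_mut(&mut self.schema).push(name.to_string());
        id
    }

    /// Drop a column. The IDs of the remaining columns are untouched.
    fn drop_column(&mut self, name: &str) -> Option<ColumnId> {
        let id = self.column_ids.remove(name)?;
        Arc::make_mut(&mut self.schema).retain(|c| c.as_str() != name);
        Some(id)
    }

    fn column_id(&self, name: &str) -> Option<ColumnId> {
        self.column_ids.get(name).copied()
    }
}

fn main() {
    let mut table = TableSchema::default();
    table.add_column("time");
    table.add_column("host");
    let usage = table.add_column("usage");

    table.drop_column("host");
    let region = table.add_column("region");

    // "usage" keeps its ID after the drop, and "region" gets a fresh one.
    assert_eq!(table.column_id("usage"), Some(usage));
    assert_ne!(region, usage);
    println!("{table:?}");
}
```

Keeping a monotonically increasing counter, rather than deriving IDs from position, is what lets an ID stay fixed for a column across schema changes, which is the property asked for earlier in the review.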

mgattozzi · 2024-10-10T20:30:16Z

@pauldix and @hiltontj the changes in d202994 should hopefully address your concerns.

pauldix

Looks good. Is the follow up to modify the WalContents so that it uses column IDs rather than column names?

mgattozzi · 2024-10-10T20:58:56Z

Yeah. I'll open a separate issue for it.

mgattozzi requested review from pauldix, hiltontj and praveen-influx September 24, 2024 17:27

hiltontj approved these changes Oct 1, 2024

mgattozzi force-pushed the mgattozzi/column-id branch from 0c3e4c3 to c6da459 Compare October 7, 2024 15:17

mgattozzi requested a review from hiltontj October 7, 2024 15:18

hiltontj reviewed Oct 8, 2024

refactor: Have TableDefinition use a TableSchema

d202994

mgattozzi requested a review from hiltontj October 10, 2024 20:28

pauldix approved these changes Oct 10, 2024

mgattozzi merged commit 724a7e9 into main Oct 10, 2024
11 of 12 checks passed

mgattozzi deleted the mgattozzi/column-id branch October 10, 2024 20:59

mgattozzi added a commit that referenced this pull request Oct 10, 2024

fix: lint fixes for #25388

0a32215

mgattozzi mentioned this pull request Oct 10, 2024

fix: lint fixes for #25388 #25451

Merged

mgattozzi added a commit that referenced this pull request Oct 10, 2024

fix: lint fixes for #25388 (#25451)

eb24b3b

mgattozzi mentioned this pull request Oct 11, 2024

Switch field in the influxdb3_wal crate to use ColumnId instead of a Arc<str> #25461

Closed
