Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add non-unique u16 Id to ColumnDefinition #25388

Merged
merged 2 commits into from
Oct 10, 2024
Merged

Conversation

mgattozzi
Copy link
Contributor

This commit adds the column_id field to ColumnDefinition so that the output for a Catalog will contain the id of that column. This is non unique, whereas TableIds and DbIds will be unique. The column_id corresponds to it's index in the schema.

Closes #25386

@mgattozzi
Copy link
Contributor Author

I'm having a test fail that's not my fault and can't get it to not flake:

Serve command failed: Failed to bind address

I wonder if some address conflict is happening. I think for some tests we bind the address but then drop it and free it up for others to take. Possibly a race condition.

@hiltontj
Copy link
Contributor

Yeah one of the troubles with our integration tests is that we are launching the service from the CLI, but checking for an open port before starting. So yeah, there may be a race condition between selecting the available port, dropping the port, then starting the CLI with that port- or it is leaving a gap there where it could conflict with something else on the CI server.

Since we are launching with the CLI, we could pass in the --http-bind 0.0.0.0:0 arg directly, which would have it pick a random port and use it to start the server directly, but I'm not sure how we would then report the selected port back to the test harness.

This commit adds the column_id field to ColumnDefinition so that the
output for a Catalog will contain the id of that column. This is non
unique, whereas TableIds and DbIds will be unique. The column_id
corresponds to it's index in the schema.

Closes #25386
Copy link
Contributor

@hiltontj hiltontj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some concerns about using enumerate for generating the ID, but otherwise, I think some tests added to check serialization before and after a column is added would be useful.

@@ -135,10 +138,12 @@ impl<'a> From<&'a TableDefinition> for TableSnapshot<'a> {
let cols = def
.schema()
.iter()
.map(|(col_type, f)| {
.enumerate()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if using enumerate to determine the column ID is a good idea. I think that generally, columns are always appended, in which case, it is okay, but in the event that we allow for dropping columns, then this would change their order and mess up the IDs.

We probably need some way to generate the IDs, based on what was the largest already used ID for a given table, and then ensure that that ID remains fixed for the column it is applied to for all time.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, that's a good catch. They should have a well defined ID that remains static regardless of what schema changes later happen to the table.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay I've taken a look at this more @pauldix and @hiltontj and I've come to the conclusion we either use enumerate or we don't even bother with a column id. Here's why:

I don't think this is viable beyond doing enumerate or we'd have to upstream changes to arrow itself and that seems like it wouldn't be worth the trade off or something that would make sense upstream. I don't think this change is worth it given how everything in arrow works off indexing or column name not id.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could have our own TableSchema which includes the SchemaRef and also includes a map of column name to id. Then the TableSchema is what we serialize in the catalog. We want to have the ids and not the string identifiers in the WALContents because that's much cheaper to serialize and deserialize. Also cheaper to index when inserting the data into the WriteBuffer.

So we don't need to update arrow, we just need a wrapper around the arrow struct where we can add our own stuff.

@mgattozzi
Copy link
Contributor Author

@pauldix and @hiltontj the changes in d202994 should hopefully address your concerns.

Copy link
Member

@pauldix pauldix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Is the follow up to modify the WalContents so that it uses column IDs rather than column names?

@mgattozzi
Copy link
Contributor Author

Yeah. I'll open a separate issue for it

@mgattozzi mgattozzi merged commit 724a7e9 into main Oct 10, 2024
11 of 12 checks passed
@mgattozzi mgattozzi deleted the mgattozzi/column-id branch October 10, 2024 20:59
mgattozzi added a commit that referenced this pull request Oct 10, 2024
mgattozzi added a commit that referenced this pull request Oct 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add uint16 ID to Column in the Catalog
3 participants