
Add a field 'write-id' to track table change. Because when data files … #4460

Closed
wants to merge 1 commit into from

Conversation


@huleilei huleilei commented Apr 1, 2022

…compacted, we don't know which data file(s) are up to date using the sequence number.

See: https://docs.google.com/document/d/17q0pukixKR2a2BESJWykz4ENhCKSf8yQr751P2i7PF8/edit#

For example, when data files are compacted, the value of sequence_number does not increase, so writer_id is used to track file changes.

{"status":1,"snapshot_id":{"long":7161067251255095053},
"sequence_number":{"long":1},
"writer_id":{"long":2},"
data_file":{"content":0,"file_path":"file:/var/folders/lw/pf18s0dj1lv67sthrh4089lh0000gp/T/hive5259352165618126566/table/data/c2=foo/00000-14-922f96a7-b88e-43dd-afa0-d33f56471fef-00001.parquet","file_format":"PARQUET","partition":{"c2":{"string":"foo"}},"record_count":5,"file_size_in_bytes":951,"column_sizes":{"array":[{"key":1,"value":94},{"key":2,"value":97},{"key":3,"value":49}]},"value_counts":{"array":[{"key":1,"value":5},{"key":2,"value":5},{"key":3,"value":5}]},"null_value_counts":{"array":[{"key":1,"value":0},{"key":2,"value":0},{"key":3,"value":5}]},"nan_value_counts":{"array":[]},"lower_bounds":{"array":[{"key":1,"value":"\u0001\u0000\u0000\u0000"},{"key":2,"value":"foo"}]},"upper_bounds":{"array":[{"key":1,"value":"\u0001\u0000\u0000\u0000"},{"key":2,"value":"foo"}]},"key_metadata":null,"split_offsets":{"array":[4]},"equality_ids":null,"sort_order_id":{"int":0}}}

…compacted, we don't know which data file(s) are up to date using the sequence number
@rdblue
Contributor

rdblue commented Apr 3, 2022

I don't think a "write ID" is correct. For secondary indexing, we need a second sequence number that is more of a physical sequence number (one that is fixed for a file) rather than a logical one (one that can be changed by operations).

Can you add a second sequence number instead? I think it should use the same default mechanism.
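The "default mechanism" referred to here is, as I understand it, Iceberg-style sequence-number inheritance: an entry written with a null sequence number inherits the committing snapshot's sequence number when the manifest is read. A hedged sketch (function and parameter names are illustrative):

```python
# Sketch of sequence-number inheritance as used for defaults in Iceberg
# manifests. resolve() is an illustrative helper, not Iceberg's API.

def resolve(entry_seq, manifest_seq):
    """Return an entry's effective sequence number.

    None means "inherit from the manifest/snapshot that added the entry",
    which is how new entries get their default without rewriting them.
    """
    return manifest_seq if entry_seq is None else entry_seq

# A new entry committed in a snapshot with sequence number 3 is written as null:
print(resolve(None, 3))  # 3 -- inherited default
# An existing entry carried forward keeps its explicit value:
print(resolve(1, 3))     # 1 -- explicit value wins
```

A second, physical sequence number could plausibly reuse the same null-means-inherit convention, so new files need no extra bookkeeping at write time.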

@huleilei
Author

huleilei commented Apr 4, 2022

@rdblue
The purpose of write-id is not only to track file changes when data files are compacted. The old name of write-id was file-sequence-name; we renamed it to distinguish it from sequence-number. The write-id is fixed for a data file, because the data file itself is unchanged. For example, for an insert operation, the manifest-entry info is:

{"status":1,"snapshot_id":{"long":4470921710704596635},"sequence_number":null,"writer_id":{"long":1},"data_file":{"content":0,"file_path":"file:/var/folders/lw/pf18s0dj1lv67sthrh4089lh0000gp/T/hive5259352165618126566/table/data/c2=foo/00009-11-a8106282-5ddb-400b-a4c1-bdd10889fc31-00001.parquet","file_format":"PARQUET","partition":{"c2":{"string":"foo"}},"record_count":1,"file_size_in_bytes":841,"column_sizes":{"array":[{"key":1,"value":51},{"key":2,"value":54},{"key":3,"value":49}]},"value_counts":{"array":[{"key":1,"value":1},{"key":2,"value":1},{"key":3,"value":1}]},"null_value_counts":{"array":[{"key":1,"value":0},{"key":2,"value":0},{"key":3,"value":1}]},"nan_value_counts":{"array":[]},"lower_bounds":{"array":[{"key":1,"value":"\u0001\u0000\u0000\u0000"},{"key":2,"value":"foo"}]},"upper_bounds":{"array":[{"key":1,"value":"\u0001\u0000\u0000\u0000"},{"key":2,"value":"foo"}]},"key_metadata":null,"split_offsets":{"array":[4]},"equality_ids":null,"sort_order_id":{"int":0}}}

After performing update and delete operations on the table, the value of write-id for the data file (00009-11-a8106282-5ddb-400b-a4c1-bdd10889fc31-00001.parquet) is still '"writer_id":{"long":1}'.
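A sketch of that claim (hypothetical helper; status codes follow Iceberg's manifest-entry convention of 1 = ADDED, 0 = EXISTING): when later snapshots carry an unmodified data file forward, its entry keeps the writer_id assigned when the file was first written.

```python
# Sketch: carrying an unmodified data file forward after update/delete
# snapshots. carry_forward() is illustrative, not Iceberg's API.
entry = {
    "status": 1,              # 1 = ADDED in the snapshot that wrote the file
    "sequence_number": None,  # inherited at read time
    "writer_id": 1,           # assigned when the file was physically written
    "file_path": "00009-11-a8106282-5ddb-400b-a4c1-bdd10889fc31-00001.parquet",
}

def carry_forward(e):
    """Re-list an existing, unmodified data file in a later snapshot."""
    carried = dict(e)
    carried["status"] = 0  # 0 = EXISTING; nothing else changes
    return carried

after_delete = carry_forward(entry)
print(after_delete["writer_id"])  # 1 -- fixed for the life of the file
```

Only the entry status changes across snapshots; writer_id stays at 1 because the physical file was never rewritten.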

@huleilei huleilei closed this Aug 2, 2022