Doc: add a page to explain row-level deletes #3432
Conversation
site/docs/cow-and-mor.md
Outdated
## Copy-on-write

As the definition states, given a user's update/delete requirement, the CoW write process would search for all the affected data files and perform rewrite.
Spark supports CoW `DELETE`, `UPDATE` and `MERGE` operations through Spark extensions. More details can be found in [Spark Writes](../spark-writes) page.
We have only one SQL statement, but we support both CoW and MoR (row-level deletes), so I am not clear on how to switch between CoW and MoR.
Also, what is the default behaviour: MoR or CoW?
Just as a point of clarification: it's not users who choose whether to use merge-on-read or copy-on-write but the engines. E.g., Spark probably writes using copy-on-write, whereas Flink writes assuming the data will be merged on read.
Unless I am mistaken and there is indeed a way for the user to influence this.
If so, I expect to have a table which clarifies which engines support CoW and which support MoR. If an engine supports both, we should have a property to control when to use which one from that engine.
I can see that the V2 spec supports merge-on-read (row-level deletes), but that doesn't mean every engine that creates V2 tables supports merge-on-read.
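For what it's worth, more recent Iceberg releases do give users a per-operation-type switch: Spark picks the mode for each row-level operation from table properties. A minimal PySpark sketch, assuming a session with the Iceberg runtime and SQL extensions configured and a hypothetical table `catalog.db.tbl`:

```python
from pyspark.sql import SparkSession

# Assumes the Iceberg runtime jar and SQL extensions are already configured.
spark = SparkSession.builder.getOrCreate()

# Each row-level operation type has its own mode; `copy-on-write` is the
# default for all three, and `merge-on-read` switches that operation to
# writing delete files instead of rewriting data files.
spark.sql("""
    ALTER TABLE catalog.db.tbl SET TBLPROPERTIES (
        'write.delete.mode' = 'merge-on-read',
        'write.update.mode' = 'merge-on-read',
        'write.merge.mode'  = 'copy-on-write'
    )
""")
```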
site/docs/cow-and-mor.md
Outdated
In Iceberg, copy-on-write and merge-on-read are different ways to handle row-level update and delete operations. Here are their definitions:

- **copy-on-write (CoW)**: an update/delete directly rewrites the entire affected data files.
- **merge-on-read (MoR)**: update/delete information is encoded in the form of delete files. The table reader can apply all delete information at read time. A compaction process takes care of merging delete files into data files asynchronously.
I think that this is really hard to read using `CoW` and `MoR` in place of copy-on-write and merge-on-read. I think that we should not shorten them, but I'd like to hear if other people share this opinion.
Yeah, I am never a fan of abbreviations; they sometimes tend to add more confusion.
+1, and the articles are tricky as well depending on how you read the abbreviations. For example, "a MoR" vs. "an MoR": which one sounds right depends on whether you read it as the acronym or expand it to the full "merge-on-read" as you're reading.
site/docs/cow-and-mor.md
Outdated
Also note that because row-level delete files are valid Iceberg data files, each file must define the partition it belongs to.
If the file belongs to `Unpartitioned` (the partition spec has no partition field), then the delete file is called a **global delete**.
Otherwise, it is called a **partition delete**.
Probably worth calling out that only an equality delete file can be a global delete file.
site/docs/cow-and-mor.md
Outdated
In Iceberg, copy-on-write and merge-on-read are different ways to handle row-level update and delete operations. Here are their definitions:

- **copy-on-write (CoW)**: an update/delete directly rewrites the entire affected data files.
I would make this even more explicit
"When rows within a data file are deleted or updated, all rows within the original file will be rewritten into new files containing all of the original data but with the effected rows removed or modified."
Feel free to take or leave :)
+1. I like explicit.
Nit: "effected" -> "affected".
site/docs/cow-and-mor.md
Outdated
In Iceberg, update is modeled as a delete with an insert within the same transaction, so there is no concept of an "update file".
During a MoR write transaction, new data files and delete files are committed with the same sequence number.
During a MoR read process, delete files are applied to data files of strictly lower sequence numbers.
According to the spec, position deletes are applied to data files with sequence number less than OR EQUAL TO their sequence number. So this statement is not correct, I think.
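The spec's two rules can be captured in a small illustrative sketch (plain Python, not Iceberg's actual planner code): equality deletes apply only to data files with a strictly lower sequence number, while position deletes also apply to data files committed with the same sequence number:

```python
def delete_applies_to(delete_type: str, delete_seq: int, data_seq: int) -> bool:
    """Spec rule: equality deletes apply strictly below their own sequence
    number; position deletes apply at or below it, so they can mask rows in
    data files written by the same commit."""
    if delete_type == "equality":
        return data_seq < delete_seq
    if delete_type == "position":
        return data_seq <= delete_seq
    raise ValueError(f"unknown delete file type: {delete_type}")
```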
site/docs/cow-and-mor.md
Outdated
During MoR read time, the Iceberg reader indexes all the delete files and determines the associated delete files for each data file. Typically,

- as described before, delete files are only applied to data files that has a strictly lower sequence number
Again, please make it consistent with the spec which says position deletes should be applied to data files with the same sequence number.
site/docs/cow-and-mor.md
Outdated
In Iceberg, copy-on-write and merge-on-read are different ways to handle row-level update and delete operations. Here are their definitions:

- **copy-on-write (CoW)**: an update/delete directly rewrites the entire affected data files.
- **merge-on-read (MoR)**: update/delete information is encoded in the form of delete files. The table reader can apply all delete information at read time. A compaction process takes care of merging delete files into data files asynchronously.
Similar comment to above: When rows within a data file are modified, separate additional "delete" files are created containing information about which rows were modified in the original file. These delete files represent the delta between the original file and the actual state of the table. When the table is read in the future, the "delete" files are read and their information is merged with the original data files to create the modified version of the file.
I like the elaboration. However, I think that while that is a good explanation for deletes, it does not give the full picture for updates. Later in this doc, it is stated that an update is represented as a delete plus an insert. Should we state it here up front?
site/docs/cow-and-mor.md
Outdated
- **copy-on-write (CoW)**: an update/delete directly rewrites the entire affected data files.
- **merge-on-read (MoR)**: update/delete information is encoded in the form of delete files. The table reader can apply all delete information at read time. A compaction process takes care of merging delete files into data files asynchronously.

Clearly, CoW is more efficient in reading data, but MoR is more efficient in writing data.
Would just skip the word "Clearly" here; computers are hard, so it may not be clear :)
Perhaps something like:
Copy on write provides better read performance:
Copy on write provides better performance on reading because no additional files need to be read and merged to get the current state of the table, but this comes at the cost of worse performance at write time. At write time, Copy on Write must rewrite an entire file even if only a single row is changed within that file. Data that was previously written and is unmodified must still be rewritten if any adjacent row was modified.
Merge on Read provides better write performance:
Merge on Read only needs to write new files with the data that has changed in the table, which means writing significantly less information than Copy on Write. Write performance is increased, but every read now requires reading not just the requested data files but also all delete files which apply to those data files.
site/docs/cow-and-mor.md
Outdated
- **merge-on-read (MoR)**: update/delete information is encoded in the form of delete files. The table reader can apply all delete information at read time. A compaction process takes care of merging delete files into data files asynchronously.

Clearly, CoW is more efficient in reading data, but MoR is more efficient in writing data.
Users can choose to use **BOTH** CoW and MoR against the same Iceberg table based on different situations.
I feel like the lead is buried a bit in this sentence.
On a per-operation (merge/update) basis, a user can choose whether to use Copy on Write or Merge on Read, depending on which would be better for the table at that moment.
site/docs/cow-and-mor.md
Outdated
## Merge-on-read

In the next few sections, we provide more details around the Iceberg MoR design.
Could just skip this sentence
site/docs/cow-and-mor.md
Outdated
In the next few sections, we provide more details around the Iceberg MoR design.

### Row-Level Delete File Spec
Do we want the word "Row-Level" here? I feel like it becomes slightly confusing when talking about "equality" deletes vs "position" deletes, just because one literally has row information and the other does not. Maybe just "Delete File Spec"?
site/docs/cow-and-mor.md
Outdated
However, there is one more feature that data compaction does not provide but MERGE provides,
that is to rewrite data files while preserving the sequence number.
In data compaction, data files are always rewritten with a new sequence number.
Why does data compaction have to rewrite with a new sequence number? Why can't it also do what MERGE is doing with respect to sequence numbers.
site/docs/cow-and-mor.md
Outdated
As documented in the [Spec](../spec/#row-level-deletes) page, Iceberg supports 2 different types of row-level delete files: **position deletes** and **equality deletes**.
If you are unfamiliar with these concepts, please read the related sections in the spec for more information before proceeding.

Also note that because row-level delete files are valid Iceberg data files, each file must define the partition it belongs to.
This is a little confusing, since we say "must define the partition it belongs to" and then follow up with "it can be unpartitioned".
Maybe just instead say "each file reports the partition it was written for".
site/docs/cow-and-mor.md
Outdated
### MoR Update as Delete + Insert

In Iceberg, update is modeled as a delete with an insert within the same transaction, so there is no concept of an "update file".
During a MoR write transaction, new data files and delete files are committed with the same sequence number.
The concept of a sequence number is used here for the first time in this doc. Probably needs an explanation.
For this section I think I would probably elaborate at the beginning, something like:
An update in Merge on Read mode consists of two sets of files, Deletes and Inserts. Delete files are created to mark all existing data rows which have been updated as having been deleted in their original data files. Insert files are normal Iceberg data files consisting of the new updated rows. On read, the delete files cause the original records to not appear and only the new updated rows will appear.
When creating data and delete files, they are associated with a monotonically increasing "sequence number" which increases with every operation. Delete files can only apply to data files written with an earlier sequence number than the delete file is written with, preventing a delete file from modifying future data.
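To make the delete-plus-insert model concrete, here is an illustrative sketch (plain Python with hypothetical structures, not Iceberg's API) of the files one merge-on-read update produces and how sequence numbers keep the delete from masking the new row:

```python
# Hypothetical in-memory model of table state after two commits.
files = [
    # Commit 1: plain append.
    {"kind": "data", "seq": 1, "rows": [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]},
    # Commit 2: an UPDATE of id=2 becomes a delete plus an insert, both at seq 2.
    {"kind": "equality_delete", "seq": 2, "values": {"id": 2}},
    {"kind": "data", "seq": 2, "rows": [{"id": 2, "name": "b-updated"}]},
]
# On read, the seq-2 equality delete masks the seq-1 row with id=2 but not
# the seq-2 insert, because equality deletes apply only to data files with
# strictly lower sequence numbers.
```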
site/docs/cow-and-mor.md
Outdated
### Delete File Writer

From the end user's perspective, it is very rare that they could directly request deletion of a specific row of a specific file given the abstraction provided by Iceberg.
I would skip this sentence, not sure it helps too much and I think the examples below are pretty clear.
site/docs/cow-and-mor.md
Outdated
A delete requirement almost always comes as a predicate such as `id = 5` or `date < '2000-01-01'`.
Given the predicate, a delete writer can write delete files in one or some combinations of the following ways:

1. **partition position deletes**: perform a scan \[1\] to know the data files and row positions affected by the predicate and then write partition \[2\] position deletes
"to know" -> "to determine"
site/docs/cow-and-mor.md
Outdated
- limitations under the License.
-->

# Copy-on-Write and Merge-on-Read
Should this title just be `# Merge-on-Read`? Almost the entire file is covering merge-on-read, and it feels like it may only be mentioning copy-on-write in the context of merge-on-read, in other words to help readers grasp the concept of merge-on-read. We could probably remove the `## Copy-on-write` sub-section entirely and place that one sentence that links to the spark-writes page in the introduction.
site/docs/cow-and-mor.md
Outdated
Clearly, CoW is more efficient in reading data, but MoR is more efficient in writing data.
Users can choose to use **BOTH** CoW and MoR against the same Iceberg table based on different situations.
A common example is that, for a time-partitioned table, newer partitions with more frequent updates are maintained in the MoR approach through a CDC streaming pipeline,
nit: I don't think the comma is needed after streaming pipeline
site/docs/cow-and-mor.md
Outdated
## Copy-on-write

As the definition states, given a user's update/delete requirement, the CoW write process would search for all the affected data files and perform rewrite.
s/perform rewrite/perform a rewrite
site/docs/cow-and-mor.md
Outdated
## Copy-on-write

As the definition states, given a user's update/delete requirement, the CoW write process would search for all the affected data files and perform rewrite.
Spark supports CoW `DELETE`, `UPDATE` and `MERGE` operations through Spark extensions. More details can be found in [Spark Writes](../spark-writes) page.
s/found in Spark Writes page/found on the Spark Writes page
(or alternatively "in the Spark Writes section")
site/docs/cow-and-mor.md
Outdated
- partition deletes are pruned based on their statistics and secondary index information so that each data file is associated with the minimum number of necessary delete files possible.

Because position deletes must be sorted by file path and row position, applying position deletes to data files can be done by streaming the rows in position deletes.
Therefore, there is not too much burden on memory side, but the number of IO increases as the number of position delete files increases, so it's desirable to have a low number of position deletes for each data file.
s/on memory side/on the memory side
site/docs/cow-and-mor.md
Outdated
#### Rewrite position deletes (REWRITE)

As we discussed in the [reader](#data-with-delete-reader) section, having fewer position deletes per data file is computationally more efficient.
This compaction optimizes the positions deletes layout, such as combining all position deletes in a single partition to a single file.
Something about this sentence I can't put my finger on, but I would maybe restate it as:
"This compaction includes optimizations on the position deletes layout, such as combining all position deletes for an individual partition into a single file."
site/docs/cow-and-mor.md
Outdated
2. because of that, requests could potentially be processed in batch
2. the data to delete spans across multiple partitions in a table

CoW is a viable option for users with GDPR requirements, but MoR can also be used for such use case if there are specific SLA requirements around aspects such as logical delete latency.
s/such use case/such use cases
site/docs/cow-and-mor.md
Outdated
#### Use case 2: CDC Streaming

In CDC streaming, for each new event, writer writes a new data row in memory (or temp storage like RocksDB) with:
s/writer writes/the writer writes
site/docs/cow-and-mor.md
Outdated
In CDC streaming, for each new event, writer writes a new data row in memory (or temp storage like RocksDB) with:

- a new equality delete row to remove any duplicated rows from existing Iceberg data files in storage, the equality constraint is formulated based on the primary key (Iceberg identifier fields)
Maybe hard stop after storage and start "The equality constraint is..." as a new sentence?
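An illustrative sketch of that per-event logic (plain Python with hypothetical buffer types, not a real connector):

```python
def handle_upsert(event, data_rows, equality_deletes):
    """For each CDC event: buffer an equality delete keyed on the primary
    key (Iceberg identifier fields) to mask any older copy of the row
    already in storage, then buffer the new version of the row itself."""
    equality_deletes.append({"id": event["id"]})  # equality constraint on the key
    data_rows.append(event["row"])                # the new row version
```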
site/docs/cow-and-mor.md
Outdated
With these critical characteristics, MERGE with preserved sequence number could safely remove delete files without commit conflicts, and any update commit can safely retry optimistically.

If users would like to use CDC streaming and GDPR updates at the same time, it is recommended that for hot partitions with very frequent CDC updates,
equality deletes is used instead of position deletes to minimize MERGE compaction conflicts.
s/is used/are used
Thanks for all the feedback! For anyone interested, we are discussing preserving sequence numbers in Slack: https://apache-iceberg.slack.com/archives/C02JDAQF81E/p1635922079047800 I will update this doc with all the comments after that discussion is finalized.
This doc is very helpful to me (new to Iceberg). Thanks for working on this.
site/docs/row-level-deletes.md
Outdated
1. **partition position deletes**: perform a scan \[1\] to know the data files and row positions affected by the predicate and then write partition position deletes \[2\]
2. **partition equality deletes**: convert input predicate to equality predicates \[3\] for each affected partition and write partition equality deletes
3. **partition global deletes**: convert input predicate to equality predicates and write global equality deletes
Yes I think so.
During Iceberg's scan file planning phase, a delete file index is constructed to filter the delete files associated with any data file, with the following rules:

1. equality delete files are applied to data files of strictly lower sequence numbers
These rules are covered in the spec. I think we can cite the rules verbatim, or link to that section?
If a delete file belongs to `Unpartitioned` (the partition spec has no partition field), then the delete file is called a **global delete**.
Otherwise, it is called a **partition delete**.

### Writing delete files
Do we also need to mention in this doc that the delete manifests are separate from data manifests?
I think it's fine to not mention it here; it does not affect the user's understanding of the write flow. I can mention it at the end when talking about why we want to have a procedure to expire deletes.
@RussellSpitzer Yes, I agree, but it's also a bit awkward to leave it unexplained... please let me know if you have a better idea, or we can try to improve that section over time.
Updated the document by removing the comparison between position and equality deletes, and instead just discussing their tradeoffs in the specific sections. Also removed separation of
@rdblue @RussellSpitzer @liuml07 please let me know if you have any more comments; otherwise I hope to merge this documentation because it has been referenced quite a lot recently.
Iceberg supports 2 different types of row-level delete files: **position deletes** and **equality deletes**.
The **sequence number** concept is also needed to describe the relative age of data and delete files.
If you are unfamiliar with these concepts, please read the [row-level deletes](../spec/#row-level-deletes) and [sequence numbers](../spec/#sequence-numbers) sections in the Iceberg spec for more information before proceeding.
I tried to open the two links 'row-level deletes' and 'sequence numbers', but I got a 404; the path configuration should be '../spec.md#row-level-deletes' and '../spec.md#sequence-numbers'.
This should be fixed now! It was actually an issue with the theme.
After the planning phase, each data file to read is associated with a set of delete files to merge with.
In general, position deletes are easier to merge, because they are already sorted by file path and row position when writing.
Applying position deletes to a data file can be viewed as merging two sorted lists, which can be done efficiently.
In contrast, applying equality deletes to a data files requires loading all rows to memory and checking every row in a data file against every equality predicate.
By `equality predicate` do you mean `equality delete predicate`?
Suggested change:
- In contrast, applying equality deletes to a data files requires loading all rows to memory and checking every row in a data file against every equality predicate.
+ In contrast, applying equality deletes to a data files requires loading all rows to memory and checking every row in a data file against every equality delete predicate.
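A sketch of why the two paths differ in cost (illustrative Python, not Iceberg's reader code): position deletes are applied as a merge of two sorted streams, while equality deletes check every row against every delete predicate:

```python
def apply_position_deletes(rows, deleted_positions):
    """rows: iterable of (pos, row) in ascending position order;
    deleted_positions: ascending list of deleted row ordinals for this file.
    A two-pointer merge: neither side needs to be held fully in memory."""
    deletes = iter(deleted_positions)
    next_del = next(deletes, None)
    for pos, row in rows:
        while next_del is not None and next_del < pos:
            next_del = next(deletes, None)
        if pos == next_del:
            continue  # row was deleted
        yield row

def apply_equality_deletes(rows, delete_predicates):
    """delete_predicates: list of {column: value} dicts; every row must be
    checked against every predicate, which is why many equality deletes
    per data file get expensive."""
    for row in rows:
        if not any(all(row.get(c) == v for c, v in p.items()) for p in delete_predicates):
            yield row
```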
@jackye1995: Can we take this PR forward? One of my PRs (#4223) depends on it 😁
@@ -0,0 +1,190 @@
<!--
Suggested change, adding front-matter above the opening `<!--`:

---
url: row-level-deletes
aliases:
  - "tables/row-level-deletes"
---
<!--

This front-matter will need to be added to the top, and the file should be relocated to the docs/tables directory.
Copy-on-write and merge-on-read are two different approaches to handle row-level delete operations. Here are their definitions in Iceberg:

- **copy-on-write**: a delete directly rewrites all the affected data files.
- **merge-on-read**: delete information is encoded in the form of _delete files_. The table reader can apply all delete information at read time.
It may just be me, but I strongly prefer the terms eager and lazy to describe these differences rather than copy-on-write or merge-on-read. Even if we clarify in the definitions here that we mean copy-on-write and merge-on-read, I think it would be better so that we can refer to the behavior with more descriptive terms.
For example:
In the eager approach, given a user's delete requirement, the write process will search for all the affected data files and perform a rewrite operation.
I think that makes it really clear what is happening and why.
@aokolnychyi, what do you think?
- limitations under the License.
-->

# Row-Level Deletes
Most pages start with `##` and break up further sections with `###` or more.
Please check out the documentation of the specific compute engines to see the details of their capabilities related to row-level deletes.
This article will focus on explaining Iceberg's core design of copy-on-write and merge-on-read.

!!!Note
This requires a space after !!!
# Row-Level Deletes

Iceberg supports metadata-based deletion through the `DeleteFiles` interface.
I don't think this is very clear. I think the main distinction is that the `DeleteFiles` API deletes whole data files. Saying "metadata-based" doesn't make much sense to someone trying to understand why you'd need different operations.
# Row-Level Deletes

Iceberg supports metadata-based deletion through the `DeleteFiles` interface.
It allows you to quickly delete a specific file or any file that might match a given expression without the need to read or write any data in the table.
I think this sentence deserves a "but ..." that notes that deleting whole files may not be what you want.
Iceberg supports metadata-based deletion through the `DeleteFiles` interface.
It allows you to quickly delete a specific file or any file that might match a given expression without the need to read or write any data in the table.

Row-level deletes target more complicated use cases such as general data protection regulation (GDPR).
I think it would be helpful to have an example to make this more clear: You probably don't have a file per user ID so if you need to delete a specific user by ID then you have to either rewrite the file or note the rows in it that have been deleted. Those are the eager and lazy approaches.
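A sketch of that contrast in Spark SQL (table, partition column, and ids are hypothetical): a predicate aligned with partition boundaries can drop whole files from metadata, while a per-user predicate forces either a copy-on-write rewrite or merge-on-read delete files:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Iceberg runtime + extensions assumed

# If `ts_day` is the partition column, this can be a metadata-only delete:
# whole data files are removed without reading or rewriting any rows.
spark.sql("DELETE FROM catalog.db.events WHERE ts_day < '2020-01-01'")

# Rows for one user are scattered across many files, so this needs real
# row-level deletes: rewrite the files (copy-on-write) or write delete
# files (merge-on-read).
spark.sql("DELETE FROM catalog.db.events WHERE user_id = 12345")
```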
- **copy-on-write**: a delete directly rewrites all the affected data files.
- **merge-on-read**: delete information is encoded in the form of _delete files_. The table reader can apply all delete information at read time.

Overall, copy-on-write is more efficient in reading data, whereas merge-on-read is more efficient in writing deletes, but requires more maintenance and tuning to be performant in reading data with deletes.
I think this is too vague. What does "requires more maintenance and tuning to be performant" mean?
How about:
The eager approach requires the least amount of work for readers, but frequent changes can become expensive if the same files are rewritten over and over. The lazy approach is the quickest way to complete the write operation, but requires work to remove deleted rows at read time that can cause queries to slow down if there are too many deletes to apply. Both approaches are useful and use cases often combine them.
For example, a time-partitioned table can have newer partitions maintained with the merge-on-read approach through a streaming pipeline,
and older partitions maintained with the copy-on-write approach to apply less frequent GDPR deletes from batch ETL jobs.

There are use cases that could only be supported by one approach such as change data capture (CDC).
I don't think this statement is true. A large `MERGE INTO` can definitely be used. You probably mean low latency CDC?
+1. Since this is the primary explainer, we should be careful about how absolutely we state things, as many people might not read more than just this.
There are use cases that could only be supported by one approach such as change data capture (CDC).
There are also limitations for different compute engines that lead them to prefer one approach over another.
Please check out the documentation of the specific compute engines to see the details of their capabilities related to row-level deletes.
"Please" is unnecessary, we would typically just say "For the behavior and capabilities of specific engines, check the engine's documentation."
```
file A: (1, 'c1', 'data1'), (2, 'c1', 'data2')
file B: (3, 'c2', 'data1'), (4, 'c2', 'data2')
file C: (5, 'c3', 'data3'), (6, 'c3', 'data2')
```
Fake data like `c1`, `c2`, etc. is really, really hard to follow. I suggest using real words or names so that it is clear what's happening. The spec has an example of different bears that you can use here. That is easy to follow:

| 1: id | 2: category | 3: name |
|-------|-------------|---------|
| 1     | marsupial   | Koala   |
| 2     | toy         | Teddy   |
| 3     | ursidae     | Grizzly |
| 4     | ursidae     | Polar   |
| 5     | candy       | Gummi   |
There are use cases that could only be supported by one approach such as change data capture (CDC).
There are also limitations for different compute engines that lead them to prefer one approach over another.
Please check out the documentation of the specific compute engines to see the details of their capabilities related to row-level deletes.
This article will focus on explaining Iceberg's core design of copy-on-write and merge-on-read.
I think this should also mention compaction and link to it.
@jackye1995: I think this PR is very useful in clearing up FAQs about copy-on-write vs merge-on-read. Do you have the bandwidth to work on this? If not, I can take it forward and keep you as a co-author.
This is great stuff and I think we should get this in. Let me know, @jackye1995 and @ajantha-bhat, if you need help taking this forward.
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.
@rdblue @samredai @openinx @RussellSpitzer @aokolnychyi @chenjunjiedada
Adding a page to explain concepts related to merge-on-read, delete files, and delete compaction, based on our past design docs, meetings, and discussions.