|
2 | 2 |
|
3 | 3 | ## 1. Summary |
4 | 4 |
|
5 | | -This specification describes the internals of the documents soft-deletion algorithm. |
| 5 | +This specification describes the internals of the document soft-deletion algorithm. |
6 | 6 |
|
7 | 7 | ## 2. Motivation |
8 | 8 |
|
9 | 9 | Deleting documents is extremely slow and can happen when: |
10 | | -- A user delete a single document. |
11 | | -- A user delete a batch of documents. |
12 | | -- A user update one or multiple documents (i.e., the primary key is the same, but the document's content is not the same). |
| 10 | +- A user deletes a single document. |
| 11 | +- A user deletes a batch of documents. |
| 12 | +- A user updates one or multiple documents (i.e., the primary key is the same, but the document's content is not the same). |
13 | 13 |
|
14 | | -The purpose of the documents soft-deletion feature is to make the deletion of documents almost instantaneous by **not** deleting the document when asked. |
| 14 | +The purpose of the document soft-deletion feature is to make the deletion of documents almost instantaneous by **not** deleting the document when asked. |
15 | 15 |
|
16 | 16 | ## 3. Functional Specification |
17 | 17 |
|
18 | | -Instead of deleting the documents, Meilisearch mark them internally as deleted and then exclude them from all the other algorithms of the engine. |
19 | | -That's fast but takes space; thus, at some point, we need to _really_ delete the soft deleted documents. |
| 18 | +Instead of deleting the documents, Meilisearch marks them internally as deleted and then excludes them from all the other algorithms of the engine. |
| 19 | +That's fast but takes up space; thus, at some point, we need to _really_ delete the soft-deleted documents. |
20 | 20 |
|
21 | 21 | This can happen for two reasons: |
22 | | -- When 90% of the total available space is used. |
23 | | -- When 10% of the total space is dedicated to the soft deleted documents. |
| 22 | +1. when there are more soft-deleted documents than regular documents in the database, or |
| 23 | +2. when the soft-deleted documents occupy more disk space than a fixed threshold. |
24 | 24 |
|
25 | | -The idea is good, but there are two technical issues; |
26 | | - |
27 | | -1. We don't know the size a document really occupies. |
28 | | - This means we don't know the size used by the soft deleted documents. |
29 | | - That can be imprecise in the case of a really heterogeneous dataset with large and small documents. |
30 | | -2. We don't know the total available space. The only information available to meilisearch is the `max-index-size` which is by default at 100GB, but meilisearch could be deployed on a smaller disk. |
31 | | - |
32 | | -The second point could be a real issue for the case of someone who has very few documents but update them frequently on a small disk without updating the `max-index-size` parameter. |
33 | | -The soft-deleted documents would grow until they use 10GB of disk even though the user only has like 100MB of documents. |
| 25 | +Reason (2) has a drawback: for technical reasons, we don't know the precise disk space taken by a single document. The only information we have is the total size taken by all documents (soft-deleted or not) and the number of documents, so we approximate the size of each document by the average document size. |
| 26 | +This means that if a few unusually large documents are updated or deleted, the soft-deleted documents can occupy much more disk space than the fixed threshold before the real deletion is triggered. |
34 | 27 |
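The two triggers and the average-size approximation described above can be sketched as follows. This is a minimal illustration, not Meilisearch's actual implementation: the `IndexStats` struct, its field names, the function names, and the `threshold_bytes` parameter are all hypothetical, and the spec does not state the value of the fixed threshold.

```rust
/// Hypothetical snapshot of the counters the spec says are available.
/// The struct and its field names are illustrative assumptions.
#[derive(Debug, Clone, Copy)]
pub struct IndexStats {
    /// Number of regular (not soft-deleted) documents.
    pub live_documents: u64,
    /// Number of documents currently marked as soft-deleted.
    pub soft_deleted_documents: u64,
    /// Total bytes taken by all documents, soft-deleted or not.
    pub total_documents_size: u64,
}

/// Estimates the bytes held by soft-deleted documents by assuming that
/// every document has the average document size.
pub fn estimated_soft_deleted_size(stats: &IndexStats) -> u64 {
    let total_count = stats.live_documents + stats.soft_deleted_documents;
    if total_count == 0 {
        return 0;
    }
    // average size * soft-deleted count, computed in u128 to avoid overflow.
    let estimate = (stats.total_documents_size as u128
        * stats.soft_deleted_documents as u128)
        / total_count as u128;
    estimate as u64
}

/// Returns true when either reason for a real deletion applies:
/// 1. there are more soft-deleted documents than regular documents, or
/// 2. the estimated size of the soft-deleted documents exceeds the threshold.
pub fn should_purge(stats: &IndexStats, threshold_bytes: u64) -> bool {
    let more_deleted_than_live = stats.soft_deleted_documents > stats.live_documents;
    let over_threshold = estimated_soft_deleted_size(stats) > threshold_bytes;
    more_deleted_than_live || over_threshold
}
```

Because the estimate uses the average, a handful of very large soft-deleted documents is counted as if each were average-sized, which is exactly the imprecision noted above.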
|
35 | 28 | ## 4. Future Possibilities |
36 | 29 |
|
37 | 30 | - Revisit how to get the size of the disk that `data.ms` is currently stored on. This would improve the analytics as well. |
38 | | -- Provide a cli parameter to select how much space can be used to store the soft deleted documents. |
| 31 | +- Provide a CLI parameter to select how much space can be used to store the soft-deleted documents. |
39 | 32 | - It could be expressed as a real size or in terms of percentage. |
40 | | -- Provide a route to delete the soft deleted documents. |
41 | | - - It could be useful if a user **know** he will have a lot of updates during the day but nothing around midnight, for example. |
42 | | - - It would allow a user to clear the soft deleted when meilisearch is not under pressure to ensure all your updates stay fast during the day. |
| 33 | +- Provide a route to delete the soft-deleted documents. |
| 34 | + - It could be useful if a user **knows** they will have a lot of updates during the day but nothing around midnight, for example. |
| 35 | + - It would allow a user to clear the soft-deleted documents when Meilisearch is not under pressure, to ensure all their updates stay fast during the day. |