Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: prevent retention service creating orphaned shard files #24530

Merged
merged 5 commits into from
Jan 4, 2024

Conversation

gwossum
Copy link
Member

@gwossum gwossum commented Dec 21, 2023

Under certain circumstances, the retention service can fail to delete shards from the store in a timely manner. When the shard groups are pruned based on age, this leaves orphaned shard files on the disk. The retention service will then not attempt to remove the obsolete shard files because the meta store does not know about them. This can cause excessive disk space usage for some users.

This corrects that by requiring shards files be deleted before they can be removed from the meta store.

Closes #24529

  • I've read the contributing section of the project README.
  • Signed CLA (if not already signed).

Under certain circumstances, the retention service can fail to delete shards from
the store in a timely manner. When the shard groups are pruned based on age, this
leaves orphaned shard files on the disk. The retention service will then not attempt
to remove the obsolete shard files because the meta store does not know about them.
This can cause excessive disk space usage for some users.

This corrects that by requiring shards files be deleted before they can be removed
from the meta store.

fixes: #24529
Under certain circumstances, the retention service can fail to delete shards from
the store in a timely manner. When the shard groups are pruned based on age, this
leaves orphaned shard files on the disk. The retention service will then not attempt
to remove the obsolete shard files because the meta store does not know about them.
This can cause excessive disk space usage for some users.

This corrects that by requiring shards files be deleted before they can be removed
from the meta store.

fixes: #24529
Copy link
Contributor

@davidby-influx davidby-influx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is complex enough I think I need a walk-through. Sorry that I am not fully grasping how this works.

@davidby-influx
Copy link
Contributor

This is complex enough I think I need a walk-through. Sorry that I am not fully grasping how this works.

What I am missing is how the scenario where the retention service leaves shards on disk but deletes them from the meta-data occurs, and how these changes fix that scenario. You explained the scenario to me once, but I don't remember enough details to walk through it with the new code....

Copy link
Contributor

@davidby-influx davidby-influx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comment typos, and a suggestion to simplify a little code.

services/meta/data.go Show resolved Hide resolved
services/meta/data.go Show resolved Hide resolved
cmd/influxd/run/server.go Show resolved Hide resolved
services/retention/service.go Show resolved Hide resolved
Copy link
Contributor

@davidby-influx davidby-influx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more change requested, an additional log message.

services/retention/service.go Show resolved Hide resolved
for _, id := range s.TSDBStore.ShardIDs() {
if info, ok := deletedShardIDs[id]; ok {
delete(deletedShardIDs, id)
if err := s.TSDBStore.DeleteShard(id); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more change - what do you think about printing one more message before calling DeleteShard:

  • attempting deletion of shard X, database Y, retention policy Z

We have no visibility into why the retention service hangs for so long, and this might give us more clues; we can look at the size and cardinality of a shard that prints the message and then never prints a success or failure message.

Copy link
Contributor

@davidby-influx davidby-influx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yay!

@gwossum gwossum merged commit 7bd3f89 into master-1.x Jan 4, 2024
9 checks passed
@gwossum gwossum deleted the gw_retention_fix branch January 4, 2024 17:59
gwossum added a commit that referenced this pull request Jan 4, 2024
Under certain circumstances, the retention service can fail to delete shards from
the store in a timely manner. When the shard groups are pruned based on age, this
leaves orphaned shard files on the disk. The retention service will then not attempt
to remove the obsolete shard files because the meta store does not know about them.
This can cause excessive disk space usage for some users.

This corrects that by requiring shards files be deleted before they can be removed
from the meta store.

Backport via clean cherry-pick of #24530.

fixes: #24543

(cherry picked from commit 7bd3f89)
gwossum added a commit that referenced this pull request Jan 4, 2024
Under certain circumstances, the retention service can fail to delete shards from
the store in a timely manner. When the shard groups are pruned based on age, this
leaves orphaned shard files on the disk. The retention service will then not attempt
to remove the obsolete shard files because the meta store does not know about them.
This can cause excessive disk space usage for some users.

This corrects that by requiring shards files be deleted before they can be removed
from the meta store.

Backport via clean cherry-pick of #24530.

fixes: #24543

(cherry picked from commit 7bd3f89)
davidby-influx pushed a commit that referenced this pull request Jan 4, 2024
Under certain circumstances, the retention service can fail to delete shards from
the store in a timely manner. When the shard groups are pruned based on age, this
leaves orphaned shard files on the disk. The retention service will then not attempt
to remove the obsolete shard files because the meta store does not know about them.
This can cause excessive disk space usage for some users.

This corrects that by requiring shards files be deleted before they can be removed
from the meta store.

fixes: #24529
(cherry picked from commit 7bd3f89)

closes #24545
davidby-influx added a commit that referenced this pull request Jan 5, 2024
…#24547)

Under certain circumstances, the retention service can fail to delete shards from
the store in a timely manner. When the shard groups are pruned based on age, this
leaves orphaned shard files on the disk. The retention service will then not attempt
to remove the obsolete shard files because the meta store does not know about them.
This can cause excessive disk space usage for some users.

This corrects that by requiring shards files be deleted before they can be removed
from the meta store.

fixes: #24529
(cherry picked from commit 7bd3f89)
closes #24545

Co-authored-by: Geoffrey Wossum <gwossum@influxdata.com>
davidby-influx added a commit that referenced this pull request Jan 5, 2024
…#24547)

Under certain circumstances, the retention service can fail to delete shards from
the store in a timely manner. When the shard groups are pruned based on age, this
leaves orphaned shard files on the disk. The retention service will then not attempt
to remove the obsolete shard files because the meta store does not know about them.
This can cause excessive disk space usage for some users.

This corrects that by requiring shards files be deleted before they can be removed
from the meta store.

fixes: #24529
(cherry picked from commit 7bd3f89)
closes #24545

Co-authored-by: Geoffrey Wossum <gwossum@influxdata.com>
(cherry picked from commit 0dc48b1)
closes #24546
davidby-influx added a commit that referenced this pull request Jan 5, 2024
…#24547) (#24548)

Under certain circumstances, the retention service can fail to delete shards from
the store in a timely manner. When the shard groups are pruned based on age, this
leaves orphaned shard files on the disk. The retention service will then not attempt
to remove the obsolete shard files because the meta store does not know about them.
This can cause excessive disk space usage for some users.

This corrects that by requiring shards files be deleted before they can be removed
from the meta store.

fixes: #24529
(cherry picked from commit 7bd3f89)
closes #24545

Co-authored-by: Geoffrey Wossum <gwossum@influxdata.com>
(cherry picked from commit 0dc48b1)
closes #24546
gwossum added a commit that referenced this pull request Jun 28, 2024
…#24547)

Under certain circumstances, the retention service can fail to delete shards from
the store in a timely manner. When the shard groups are pruned based on age, this
leaves orphaned shard files on the disk. The retention service will then not attempt
to remove the obsolete shard files because the meta store does not know about them.
This can cause excessive disk space usage for some users.

This corrects that by requiring shards files be deleted before they can be removed
from the meta store.

closes: #25116
(cherry picked from commit 7bd3f89)
closes #24545

Co-authored-by: Geoffrey Wossum <gwossum@influxdata.com>
(cherry picked from commit 0dc48b1)
chengshiwen pushed a commit to chengshiwen/influxdb that referenced this pull request Aug 27, 2024
…ata#24530)

* fix: prevent retention service creating orphaned shard files

Under certain circumstances, the retention service can fail to delete shards from
the store in a timely manner. When the shard groups are pruned based on age, this
leaves orphaned shard files on the disk. The retention service will then not attempt
to remove the obsolete shard files because the meta store does not know about them.
This can cause excessive disk space usage for some users.

This corrects that by requiring shards files be deleted before they can be removed
from the meta store.

fixes: influxdata#24529
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants