
pageserver: accounting error on secondary tenant resident size #9628

Open

Description

This testing assertion fired during test_storage_controller_many_tenants on PR #8613 (Allure report).
There's a Slack thread with a bit of context here.

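I spent some time looking at it but didn't find the root cause. The panic happened shortly after a live migration, and the right-hand value in the assertion is exactly twice the left-hand one (5210112 = 2 × 2605056), which makes me think
we somehow counted the only existing layer twice.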

2024-11-01T15:41:55.639814Z ERROR secondary_download{tenant_id=11bd770f785975106c00a57e1b60ae2c shard_id=0408}:panic{thread=background op worker location=pageserver/src/tenant/secondary/downloader.rs:778:13}: assertion `left == right` failed
  left: 2605056
 right: 5210112

<backtrace snipped>

2024-11-01T15:41:55.873307Z ERROR secondary_download_scheduler:panic{thread=background op worker location=pageserver/src/tenant/secondary/scheduler.rs:307:36}: Panic in background task: JoinError::Panic(Id(1360288), ...)

<backtrace snipped>

2024-11-01T15:41:55.883651Z ERROR Task panicked, exiting process: Any { .. } task_name="secondary tenant downloads"
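
To make the double-counting hypothesis concrete, here is a minimal sketch of how a running size counter can drift to exactly twice the real resident size when the same layer is recorded twice. It is not the pageserver's code: `ResidentSizeTracker`, `record_naive`, `record_idempotent` and the layer name are hypothetical names invented for illustration, and the sketch makes no claim about which side of the real assertion holds which value.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for per-tenant resident size bookkeeping.
/// Tracks each layer's size plus a running total that should always
/// equal the sum of the per-layer sizes.
#[derive(Default)]
pub struct ResidentSizeTracker {
    layers: HashMap<String, u64>,
    resident_size: u64,
}

impl ResidentSizeTracker {
    /// Naive accounting: bumps the running total on every event,
    /// even if the layer is already present in the map.
    pub fn record_naive(&mut self, name: &str, size: u64) {
        self.layers.insert(name.to_string(), size);
        self.resident_size += size;
    }

    /// Idempotent accounting: subtracts the previously recorded size
    /// (if any) before adding the new one, so repeats are harmless.
    pub fn record_idempotent(&mut self, name: &str, size: u64) {
        let previous = self.layers.insert(name.to_string(), size).unwrap_or(0);
        self.resident_size = self.resident_size - previous + size;
    }

    /// The running total.
    pub fn counter(&self) -> u64 {
        self.resident_size
    }

    /// The ground truth, recomputed from the per-layer map.
    pub fn sum_of_layers(&self) -> u64 {
        self.layers.values().sum()
    }
}
```

Recording a single 2605056-byte layer twice through `record_naive` leaves `sum_of_layers()` at 2605056 while `counter()` reads 5210112, the same pair of values as in the assertion above. With `record_idempotent` both stay at 2605056. If the real bug has this shape, the place to look would be any path after a live migration that can register an already-known layer a second time.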

One observation: the panic killed the pageserver process entirely. This was unexpected to me, so panic handling on that code path should be checked as well.
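
For reference when reviewing the panic handling, below is a minimal sketch of the general pattern of containing a panic in a background job and surfacing it as an error instead of letting it take the process down. It uses only the standard library (`std::thread` rather than the async runtime the pageserver uses), and `run_isolated` and `panic_message` are hypothetical helper names. It assumes the default `panic = "unwind"` setting and says nothing about what the right policy for the secondary download path is: the last log line above suggests the exit may be a deliberate decision by whatever supervises the task.

```rust
use std::any::Any;
use std::thread;

/// Extracts a readable message from a panic payload.
/// Payloads produced by `panic!` and `assert_eq!` are `&str` or `String`.
fn panic_message(payload: &(dyn Any + Send)) -> String {
    if let Some(s) = payload.downcast_ref::<&str>() {
        (*s).to_string()
    } else if let Some(s) = payload.downcast_ref::<String>() {
        s.clone()
    } else {
        "non-string panic payload".to_string()
    }
}

/// Runs `job` on its own thread and converts a panic into `Err`,
/// so the caller can log it and decide whether to retry or shut down.
pub fn run_isolated<T, F>(job: F) -> Result<T, String>
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    thread::spawn(job)
        .join()
        .map_err(|payload| panic_message(&*payload))
}
```

A caller would match on the result, for example logging the message and skipping the affected tenant on `Err`. Whether a failed accounting assertion should be contained like this or should still stop the process is exactly the question the third todo item below asks.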

Todo:

  • Root cause the bug
  • Fix
  • Check that panic handling is correct on the secondary download code path