Opened on Nov 4, 2024
This testing assertion fired during test_storage_controller_many_tenants
on PR #8613 (allure).
There's a Slack thread with a bit of context here.
I spent some time looking at it but didn't spot the issue. It happened shortly after a live migration, and the error (the right-hand value is exactly twice the left-hand one) makes me think
we somehow counted the only existing layer twice.
2024-11-01T15:41:55.639814Z ERROR secondary_download{tenant_id=11bd770f785975106c00a57e1b60ae2c shard_id=0408}:panic{thread=background op worker location=pageserver/src/tenant/secondary/downloader.rs:778:13}: assertion `left == right` failed
left: 2605056
right: 5210112
<backtrace snipped>
2024-11-01T15:41:55.873307Z ERROR secondary_download_scheduler:panic{thread=background op worker location=pageserver/src/tenant/secondary/scheduler.rs:307:36}: Panic in background task: JoinError::Panic(Id(1360288), ...)
<backtrace snipped>
2024-11-01T15:41:55.883651Z ERROR Task panicked, exiting process: Any { .. } task_name="secondary tenant downloads"
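To make the double-counting hypothesis concrete: 5210112 is exactly 2 × 2605056, which is what a size total would look like if a single layer were listed twice. The sketch below is only an illustration of that arithmetic. `Layer`, `total_size`, and `total_size_deduped` are hypothetical names made up here, not the types or functions in downloader.rs, and the real cause may well be different.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a layer entry; NOT the pageserver's real type.
#[derive(Clone, Debug)]
struct Layer {
    name: &'static str,
    file_size: u64,
}

/// Sums layer sizes exactly as listed, so a layer that appears twice
/// contributes its size twice.
fn total_size(layers: &[Layer]) -> u64 {
    layers.iter().map(|l| l.file_size).sum()
}

/// Sums layer sizes while counting each layer name only once.
fn total_size_deduped(layers: &[Layer]) -> u64 {
    let mut seen = HashSet::new();
    layers
        .iter()
        // `insert` returns false when the name was already seen.
        .filter(|l| seen.insert(l.name))
        .map(|l| l.file_size)
        .sum()
}
```

With one 2605056-byte layer listed twice, `total_size` yields 5210112 while `total_size_deduped` yields 2605056, which matches the `left` and `right` values in the assertion. If the hypothesis holds, the thing to look for is two accounting paths that disagree on whether a layer was already counted.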
One observation was that the panic completely killed the pageserver process. This was unexpected to me, so panic handling on that code path should be checked as well.
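For reference when checking the panic handling, here is a minimal sketch of what isolating a panic in one unit of background work can look like. It uses `std::panic::catch_unwind` from the standard library as a stand-in. The log above mentions `JoinError::Panic`, so the real code path presumably surfaces panics through its task runtime instead. `run_isolated` is a hypothetical helper, not something that exists in the pageserver.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Hypothetical helper: runs one unit of background work and converts a
/// panic into an `Err` instead of letting it unwind into the caller.
fn run_isolated<F, T>(work: F) -> Result<T, String>
where
    F: FnOnce() -> T,
{
    panic::catch_unwind(AssertUnwindSafe(work)).map_err(|payload| {
        // Panic payloads are usually `&str` or `String`.
        if let Some(s) = payload.downcast_ref::<&str>() {
            s.to_string()
        } else if let Some(s) = payload.downcast_ref::<String>() {
            s.clone()
        } else {
            "non-string panic payload".to_string()
        }
    })
}
```

A caller can then log the error and decide whether to retry, skip the tenant, or still abort. Whether exiting the process is the intended policy for this task is exactly what the todo item below should establish. Note that `catch_unwind` only works when panics unwind, not when the binary is built with `panic = "abort"`.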
Todo:
- Root cause the bug
- Fix
- Check that panic handling is correct on the secondary download code path