Describe the bug
The compactor somehow failed to upload the block's index file to Swift, but still deleted the source blocks. There are warnings in the logs, but the compactor does not seem to be aware of them. We lost one day of metrics for our main tenant. (I was hoping to be able to re-generate the index file from the chunks, but that doesn't seem possible as the chunk files only have samples, not the labels themselves.)
We opened a bug in Thanos (thanos-io/thanos#3958), but we're wondering whether Cortex is the more relevant place for it.
To Reproduce
We're not sure how it happens, so here's our best attempt at recollection:
Running Cortex 1.7.0, the Compactor compacted a series of blocks. It then uploaded all resulting files to Swift, but the index file never made it there; in Swift's own logs there is no trace of the index file ever being uploaded. We think an error might have been detected by "CloseWithLogOnErr" but, since it runs in a deferred call, the error never made its way back to the Compactor and was ignored.
See logs below.
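For what it's worth, here is a minimal, self-contained Go sketch of the pattern we suspect. The writer type and helpers below are our own simplified stand-ins, not actual Cortex/Thanos code: a deferred close that only logs its error lets the enclosing upload function return nil, while capturing the close error into the named return surfaces it to the caller.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"log"
)

// failingWriter stands in for the Swift object writer: Close returns the
// same timeout error we saw in the Compactor logs.
type failingWriter struct{}

func (failingWriter) Write(p []byte) (int, error) { return len(p), nil }
func (failingWriter) Close() error {
	return errors.New("upload object close: Timeout when reading or writing data")
}

// closeWithLogOnErr mimics a log-only close helper: the error is printed
// as a warning but never returned to the caller.
func closeWithLogOnErr(c io.Closer, msg string) {
	if err := c.Close(); err != nil {
		log.Printf("level=warn msg=%q err=%q", msg, err)
	}
}

// uploadLogOnly models the suspected current behaviour: the deferred close
// only logs, so the function returns nil even though the upload failed.
func uploadLogOnly(w io.WriteCloser, data []byte) error {
	defer closeWithLogOnErr(w, "detected close error")
	_, err := w.Write(data)
	return err // nil here, despite the failing Close
}

// uploadCaptureErr shows the alternative: capture the Close error into the
// named return value so the caller can retry or abort.
func uploadCaptureErr(w io.WriteCloser, data []byte) (err error) {
	defer func() {
		if cerr := w.Close(); cerr != nil && err == nil {
			err = cerr
		}
	}()
	_, err = w.Write(data)
	return err
}

func main() {
	fmt.Println("log-only close returns:", uploadLogOnly(failingWriter{}, []byte("index")))
	fmt.Println("error-capturing close returns:", uploadCaptureErr(failingWriter{}, []byte("index")))
}
```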
Expected behavior
The Compactor should retry uploading a file when an error occurs and, failing that, should not mark the source blocks for deletion.
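Concretely, what we would expect is roughly the sketch below. The function names and signatures are ours, purely for illustration (this is not the real Cortex/Thanos upload API): the upload is retried a few times with a small backoff, the local file is re-opened on each attempt, and the final error is returned so the caller can refuse to delete the source blocks.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"io"
	"time"
)

// uploadWithRetry is a sketch only: `upload` stands in for whatever the
// object-store client exposes (hypothetical signature). The local file is
// re-opened on every attempt and the final error is surfaced to the caller.
func uploadWithRetry(ctx context.Context,
	upload func(ctx context.Context, name string, r io.Reader) error,
	open func() (io.Reader, error),
	name string, attempts int, backoff time.Duration) error {

	var lastErr error
	for i := 0; i < attempts; i++ {
		r, err := open()
		if err != nil {
			return err
		}
		if lastErr = upload(ctx, name, r); lastErr == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
	}
	return fmt.Errorf("upload %s failed after %d attempts: %w", name, attempts, lastErr)
}

func main() {
	calls := 0
	// Fake uploader that times out on the first attempt, like our index upload did.
	upload := func(ctx context.Context, name string, r io.Reader) error {
		calls++
		if calls == 1 {
			return fmt.Errorf("upload object close: Timeout when reading or writing data")
		}
		return nil
	}
	open := func() (io.Reader, error) { return bytes.NewReader([]byte("index")), nil }

	err := uploadWithRetry(context.Background(), upload, open,
		"01F16ZRT8TYA08VJQR1ZPCC5EP/index", 3, 10*time.Millisecond)
	fmt.Println("final error:", err)
}
```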
Environment:
- Infrastructure: Kubernetes
- Deployment tool: helmfile
Storage Engine:
- Blocks
Additional Context
Compactor logs:
{
"caller": "runutil.go:124",
"err": "upload object close: Timeout when reading or writing data",
"level": "warn",
"msg": "detected close error",
"ts": "2021-03-20T05:12:44.877771796Z"
}
{
"bucket": "tracing: cortex-tsdb-prod04",
"caller": "objstore.go:159",
"component": "compactor",
"dst": "01F16ZRT8TYA08VJQR1ZPCC5EP/index",
"from": "data/compact/0@14583055817248146110/01F16ZRT8TYA08VJQR1ZPCC5EP/index",
"group": "0@{__org_id__=\"1\"}",
"groupKey": "0@14583055817248146110",
"level": "debug",
"msg": "uploaded file",
"org_id": "1",
"ts": "2021-03-20T05:12:44.877834603Z"
}
{
"caller": "compact.go:810",
"component": "compactor",
"duration": "4m41.662527735s",
"group": "0@{__org_id__=\"1\"}",
"groupKey": "0@14583055817248146110",
"level": "info",
"msg": "uploaded block",
"org_id": "1",
"result_block": "01F16ZRT8TYA08VJQR1ZPCC5EP",
"ts": "2021-03-20T05:12:45.140243007Z"
}
{
"caller": "compact.go:832",
"component": "compactor",
"group": "0@{__org_id__=\"1\"}",
"groupKey": "0@14583055817248146110",
"level": "info",
"msg": "marking compacted block for deletion",
"old_block": "01F15H6D6CXE1ASE788HQECHM4",
"org_id": "1",
"ts": "2021-03-20T05:12:45.627586825Z"
}
$ openstack object list cortex-tsdb-prod04 --prefix 1/01F16ZRT8TYA08VJQR1ZPCC5EP
+--------------------------------------------+
| Name |
+--------------------------------------------+
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000001 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000002 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000003 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000004 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000005 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000006 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000007 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000008 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000009 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000010 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000011 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000012 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000013 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000014 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000015 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000016 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000017 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000018 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000019 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000020 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000021 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000022 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000023 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/meta.json |
+--------------------------------------------+