Influxdb 2.0 Retention service hang trying to delete shardΒ #20729
Open
Description
Steps to reproduce:
- Create two buckets:
- Metrics1h: 1 hour retention policy
- Metrics1d: 1 day retention policy
- Run python script to generate metrics, each point contains 1000 fields, generate 100 points every 10 seconds for every service tag value.
for i in range(100):
service = 'service_{}'.format(i)
p = influxdb_client.Point('service_stats')
p.tag('service', service)
for idx in range (1000):
field = 'field_{}'.format(idx)
p = p.field(field, idx)
p.time(now, write_precision=WritePrecision.S)
res = write_api.write(bucket=bucket, org=org, record=p)
- Add a task to do downsampling with from Metrics1h to Metrics1d.
- Not sure if this matters, but add here to present the full picture.
option v = {timeRangeStart: -20m, timeRangeStop: -10m}
option task = {name: "MetricsDownsample1hTo1d", every: 10m}
from(bucket: "Metrics1h")
|> range(start: v.timeRangeStart, stop: v.timeRangeStop)
|> filter(fn: (r) =>
(r["_measurement"] == "service_stats"))
|> aggregateWindow(every: 5m, fn: sum, createEmpty: false)
|> yield(name: "sum")
|> to(bucket: "Metrics1d", org: "Test")
Expected behavior:
Retention service to run periodically, start and ends quickly.
Actual behavior:
ts=2021-02-07T14:01:43.058796Z lvl=info msg="Retention policy deletion check (start)" log_id=0SAze8nW000 service=retention op_name=retention_delete_check op_event=start
Saw the start log, but it never ends.
From /debug/pprof/goroutine?debug=2:
goroutine 200 [semacquire, 2058 minutes]:
sync.runtime_Semacquire(0xc00dc84710)
/usr/local/go/src/runtime/sema.go:56 +0x45
sync.(*WaitGroup).Wait(0xc00dc84708)
/usr/local/go/src/sync/waitgroup.go:130 +0x65
github.com/influxdata/influxdb/v2/tsdb/engine/tsm1.(*TSMReader).Close(0xc00dc84700, 0x0, 0x0)
/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/reader.go:365 +0x4a
github.com/influxdata/influxdb/v2/tsdb/engine/tsm1.(*FileStore).Close(0xc00b9c2dd0, 0x407600, 0xc0061adeb8)
/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/file_store.go:615 +0xd0
github.com/influxdata/influxdb/v2/tsdb/engine/tsm1.(*Engine).Close(0xc001f3a8c0, 0x0, 0x0)
/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:780 +0x9e
github.com/influxdata/influxdb/v2/tsdb.(*Shard).close(0xc003033680, 0xc001e326c0, 0x0)
/go/src/github.com/influxdata/influxdb/tsdb/shard.go:378 +0x47
github.com/influxdata/influxdb/v2/tsdb.(*Shard).Close(0xc003033680, 0x0, 0x0)
/go/src/github.com/influxdata/influxdb/tsdb/shard.go:368 +0x7b
github.com/influxdata/influxdb/v2/tsdb.(*Store).DeleteShard(0xc00035b000, 0xc1, 0x0, 0x0)
/go/src/github.com/influxdata/influxdb/tsdb/store.go:804 +0x325
github.com/influxdata/influxdb/v2/v1/services/retention.(*Service).run(0xc001552af0, 0x3a46ae0, 0xc0052fbec0)
/go/src/github.com/influxdata/influxdb/v1/services/retention/service.go:141 +0x209
github.com/influxdata/influxdb/v2/v1/services/retention.(*Service).Open.func1(0xc001552af0, 0x3a46ae0, 0xc0052fbec0)
/go/src/github.com/influxdata/influxdb/v1/services/retention/service.go:55 +0x6c
created by github.com/influxdata/influxdb/v2/v1/services/retention.(*Service).Open
/go/src/github.com/influxdata/influxdb/v1/services/retention/service.go:53 +0x1ee
The delete stuck waiting on a delete reference of TSMReader object.
I am guessing that the downsampling task may be contributing to it.
goroutine 4613564 [select, 1714 minutes]:
github.com/influxdata/flux/execute.(*result).Do(0xc0057492c0, 0x3724850, 0xc00daaf840, 0xc0057492c0)
/go/pkg/mod/github.com/influxdata/flux@v0.99.0/execute/result.go:61 +0xd9
github.com/influxdata/influxdb/v2/query/control.(*errorCollectingTableIterator).Do(0xc00daaf840, 0x3724850, 0xc00daaf840, 0xc00daaf800)
/go/src/github.com/influxdata/influxdb/query/control/controller.go:947 +0x42
github.com/influxdata/influxdb/v2/task/backend/executor.exhaustResultIterators(0x3a212a0, 0xc00daaf800, 0xc00daaf800, 0xc0042ea9b0)
/go/src/github.com/influxdata/influxdb/task/backend/executor/executor.go:616 +0x51
github.com/influxdata/influxdb/v2/task/backend/executor.(*worker).executeQuery(0xc019b3ae80, 0xc004dc4d80)
/go/src/github.com/influxdata/influxdb/task/backend/executor/executor.go:527 +0x57e
github.com/influxdata/influxdb/v2/task/backend/executor.(*worker).work(0xc019b3ae80)
/go/src/github.com/influxdata/influxdb/task/backend/executor/executor.go:421 +0x545
github.com/influxdata/influxdb/v2/task/backend/executor.(*Executor).startWorker.func1(0xc0001581c0, 0xc019b3ae80)
/go/src/github.com/influxdata/influxdb/task/backend/executor/executor.go:291 +0x76
created by github.com/influxdata/influxdb/v2/task/backend/executor.(*Executor).startWorker
/go/src/github.com/influxdata/influxdb/task/backend/executor/executor.go:288 +0xb6
On Crtl-C, I see a bunch of piled up tasks failed and shortly after it, the retention service run was unblocked.
ts=2021-02-09T00:26:17.108700Z lvl=info msg=Stopping log_id=0SAze8nW000 service=nats
ts=2021-02-09T00:26:17.109767Z lvl=info msg=Stopping log_id=0SAze8nW000 service=bolt
ts=2021-02-09T00:26:17.109808Z lvl=info msg=Stopping log_id=0SAze8nW000 service=query
ts=2021-02-09T00:26:17.110168Z lvl=info msg="Error exhausting result iterator" log_id=0SAze8nW000 service=task-executor error="context canceled" name=to8
ts=2021-02-09T00:26:17.121881Z lvl=info msg="Error exhausting result iterator" log_id=0SAze8nW000 service=task-executor error="context canceled" name=to8
ts=2021-02-09T00:26:17.121922Z lvl=info msg="Error exhausting result iterator" log_id=0SAze8nW000 service=task-executor error="context canceled" name=to8
ts=2021-02-09T00:26:17.122255Z lvl=info msg="Error exhausting result iterator" log_id=0SAze8nW000 service=task-executor error="context canceled" name=to8
ts=2021-02-09T00:26:17.122349Z lvl=info msg="Error exhausting result iterator" log_id=0SAze8nW000 service=task-executor error="context canceled" name=to8
ts=2021-02-09T00:26:17.122498Z lvl=info msg="Error exhausting result iterator" log_id=0SAze8nW000 service=task-executor error="context canceled" name=to8
ts=2021-02-09T00:26:17.128444Z lvl=info msg="Error exhausting result iterator" log_id=0SAze8nW000 service=task-executor error="context canceled" name=to8
ts=2021-02-09T00:26:17.128477Z lvl=info msg="Error exhausting result iterator" log_id=0SAze8nW000 service=task-executor error="context canceled" name=to8
ts=2021-02-09T00:26:17.128498Z lvl=info msg="Error exhausting result iterator" log_id=0SAze8nW000 service=task-executor error="context canceled" name=to8
ts=2021-02-09T00:26:17.129889Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bc23cd0234000 error="database not open"
ts=2021-02-09T00:26:17.130362Z lvl=info msg="Error exhausting result iterator" log_id=0SAze8nW000 service=task-executor error="context canceled" name=to8
ts=2021-02-09T00:26:17.148484Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bc91aa0234000 error="database not open"
ts=2021-02-09T00:26:17.148550Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bcb6490234000 error="database not open"
ts=2021-02-09T00:26:17.148657Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bcdae80234000 error="database not open"
ts=2021-02-09T00:26:17.148735Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bcff870234000 error="database not open"
ts=2021-02-09T00:26:17.148806Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bd24260234000 error="database not open"
ts=2021-02-09T00:26:17.148876Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bd48c50234000 error="database not open"
ts=2021-02-09T00:26:17.148958Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bd6d640234000 error="database not open"
ts=2021-02-09T00:26:17.149026Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bd92030234000 error="database not open"
ts=2021-02-09T00:26:17.149089Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bdb6a20234000 error="database not open"
ts=2021-02-09T00:26:17.149156Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bddb410234000 error="database not open"
ts=2021-02-09T00:26:17.149235Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070b4fcbf0234000 error="database not open"
ts=2021-02-09T00:26:17.149316Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bab5971234000 error="database not open"
ts=2021-02-09T00:26:17.149399Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070b28e300e34000 error="database not open"
ts=2021-02-09T00:26:17.153145Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070b822690234000 error="database not open"
ts=2021-02-09T00:26:17.153281Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070b5d8790234000 error="database not open"
ts=2021-02-09T00:26:17.153373Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bc6d0b0234000 error="database not open"
ts=2021-02-09T00:26:17.153454Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bbda8f0234000 error="database not open"
ts=2021-02-09T00:26:17.197384Z lvl=info msg="Deleted shard" log_id=0SAze8nW000 service=retention op_name=retention_delete_check db_instance=958a746c4d6ae0d4 db_shard_id=193 db_rp=autogen
ts=2021-02-09T00:26:17.197621Z lvl=info msg="Retention policy deletion check (end)" log_id=0SAze8nW000 service=retention op_name=retention_delete_check op_event=end op_elapsed=123874138.839ms
Environment info:
Linux 4.15.0-55-CUSTOMIZED x86_64
InfluxDB 2.0.3 (git: fe04d34) build_date: 2020-12-15T01:00:16Z
Running inside podman container
Config:
N/A