Skip to content

Influxdb 2.0 Retention service hang trying to delete shardΒ #20729

Open
@WeiliangLuo

Description

Steps to reproduce:

  1. Create two buckets:
  • Metrics1h: 1 hour retention policy
  • Metrics1d: 1 day retention policy
  1. Run python script to generate metrics, each point contains 1000 fields, generate 100 points every 10 seconds for every service tag value.
        for i in range(100):
            service = 'service_{}'.format(i)
            p = influxdb_client.Point('service_stats')
            p.tag('service', service)
            for idx in range (1000):
                field = 'field_{}'.format(idx)
                p = p.field(field, idx)
            p.time(now, write_precision=WritePrecision.S)
            res = write_api.write(bucket=bucket, org=org, record=p)
  1. Add a task to do downsampling with from Metrics1h to Metrics1d.
  • Not sure if this matters, but add here to present the full picture.
option v = {timeRangeStart: -20m, timeRangeStop: -10m}
option task = {name: "MetricsDownsample1hTo1d", every: 10m}

from(bucket: "Metrics1h")
	|> range(start: v.timeRangeStart, stop: v.timeRangeStop)
	|> filter(fn: (r) =>
		(r["_measurement"] == "service_stats"))
	|> aggregateWindow(every: 5m, fn: sum, createEmpty: false)
	|> yield(name: "sum")
	|> to(bucket: "Metrics1d", org: "Test")

Expected behavior:
Retention service to run periodically, start and ends quickly.

Actual behavior:

ts=2021-02-07T14:01:43.058796Z lvl=info msg="Retention policy deletion check (start)" log_id=0SAze8nW000 service=retention op_name=retention_delete_check op_event=start

Saw the start log, but it never ends.

From /debug/pprof/goroutine?debug=2:

goroutine 200 [semacquire, 2058 minutes]:
sync.runtime_Semacquire(0xc00dc84710)
	/usr/local/go/src/runtime/sema.go:56 +0x45
sync.(*WaitGroup).Wait(0xc00dc84708)
	/usr/local/go/src/sync/waitgroup.go:130 +0x65
github.com/influxdata/influxdb/v2/tsdb/engine/tsm1.(*TSMReader).Close(0xc00dc84700, 0x0, 0x0)
	/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/reader.go:365 +0x4a
github.com/influxdata/influxdb/v2/tsdb/engine/tsm1.(*FileStore).Close(0xc00b9c2dd0, 0x407600, 0xc0061adeb8)
	/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/file_store.go:615 +0xd0
github.com/influxdata/influxdb/v2/tsdb/engine/tsm1.(*Engine).Close(0xc001f3a8c0, 0x0, 0x0)
	/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:780 +0x9e
github.com/influxdata/influxdb/v2/tsdb.(*Shard).close(0xc003033680, 0xc001e326c0, 0x0)
	/go/src/github.com/influxdata/influxdb/tsdb/shard.go:378 +0x47
github.com/influxdata/influxdb/v2/tsdb.(*Shard).Close(0xc003033680, 0x0, 0x0)
	/go/src/github.com/influxdata/influxdb/tsdb/shard.go:368 +0x7b
github.com/influxdata/influxdb/v2/tsdb.(*Store).DeleteShard(0xc00035b000, 0xc1, 0x0, 0x0)
	/go/src/github.com/influxdata/influxdb/tsdb/store.go:804 +0x325
github.com/influxdata/influxdb/v2/v1/services/retention.(*Service).run(0xc001552af0, 0x3a46ae0, 0xc0052fbec0)
	/go/src/github.com/influxdata/influxdb/v1/services/retention/service.go:141 +0x209
github.com/influxdata/influxdb/v2/v1/services/retention.(*Service).Open.func1(0xc001552af0, 0x3a46ae0, 0xc0052fbec0)
	/go/src/github.com/influxdata/influxdb/v1/services/retention/service.go:55 +0x6c
created by github.com/influxdata/influxdb/v2/v1/services/retention.(*Service).Open
	/go/src/github.com/influxdata/influxdb/v1/services/retention/service.go:53 +0x1ee

The delete stuck waiting on a delete reference of TSMReader object.

I am guessing that the downsampling task may be contributing to it.

goroutine 4613564 [select, 1714 minutes]:
github.com/influxdata/flux/execute.(*result).Do(0xc0057492c0, 0x3724850, 0xc00daaf840, 0xc0057492c0)
	/go/pkg/mod/github.com/influxdata/flux@v0.99.0/execute/result.go:61 +0xd9
github.com/influxdata/influxdb/v2/query/control.(*errorCollectingTableIterator).Do(0xc00daaf840, 0x3724850, 0xc00daaf840, 0xc00daaf800)
	/go/src/github.com/influxdata/influxdb/query/control/controller.go:947 +0x42
github.com/influxdata/influxdb/v2/task/backend/executor.exhaustResultIterators(0x3a212a0, 0xc00daaf800, 0xc00daaf800, 0xc0042ea9b0)
	/go/src/github.com/influxdata/influxdb/task/backend/executor/executor.go:616 +0x51
github.com/influxdata/influxdb/v2/task/backend/executor.(*worker).executeQuery(0xc019b3ae80, 0xc004dc4d80)
	/go/src/github.com/influxdata/influxdb/task/backend/executor/executor.go:527 +0x57e
github.com/influxdata/influxdb/v2/task/backend/executor.(*worker).work(0xc019b3ae80)
	/go/src/github.com/influxdata/influxdb/task/backend/executor/executor.go:421 +0x545
github.com/influxdata/influxdb/v2/task/backend/executor.(*Executor).startWorker.func1(0xc0001581c0, 0xc019b3ae80)
	/go/src/github.com/influxdata/influxdb/task/backend/executor/executor.go:291 +0x76
created by github.com/influxdata/influxdb/v2/task/backend/executor.(*Executor).startWorker
	/go/src/github.com/influxdata/influxdb/task/backend/executor/executor.go:288 +0xb6

On Crtl-C, I see a bunch of piled up tasks failed and shortly after it, the retention service run was unblocked.

ts=2021-02-09T00:26:17.108700Z lvl=info msg=Stopping log_id=0SAze8nW000 service=nats
ts=2021-02-09T00:26:17.109767Z lvl=info msg=Stopping log_id=0SAze8nW000 service=bolt
ts=2021-02-09T00:26:17.109808Z lvl=info msg=Stopping log_id=0SAze8nW000 service=query
ts=2021-02-09T00:26:17.110168Z lvl=info msg="Error exhausting result iterator" log_id=0SAze8nW000 service=task-executor error="context canceled" name=to8
ts=2021-02-09T00:26:17.121881Z lvl=info msg="Error exhausting result iterator" log_id=0SAze8nW000 service=task-executor error="context canceled" name=to8
ts=2021-02-09T00:26:17.121922Z lvl=info msg="Error exhausting result iterator" log_id=0SAze8nW000 service=task-executor error="context canceled" name=to8
ts=2021-02-09T00:26:17.122255Z lvl=info msg="Error exhausting result iterator" log_id=0SAze8nW000 service=task-executor error="context canceled" name=to8
ts=2021-02-09T00:26:17.122349Z lvl=info msg="Error exhausting result iterator" log_id=0SAze8nW000 service=task-executor error="context canceled" name=to8
ts=2021-02-09T00:26:17.122498Z lvl=info msg="Error exhausting result iterator" log_id=0SAze8nW000 service=task-executor error="context canceled" name=to8
ts=2021-02-09T00:26:17.128444Z lvl=info msg="Error exhausting result iterator" log_id=0SAze8nW000 service=task-executor error="context canceled" name=to8
ts=2021-02-09T00:26:17.128477Z lvl=info msg="Error exhausting result iterator" log_id=0SAze8nW000 service=task-executor error="context canceled" name=to8
ts=2021-02-09T00:26:17.128498Z lvl=info msg="Error exhausting result iterator" log_id=0SAze8nW000 service=task-executor error="context canceled" name=to8
ts=2021-02-09T00:26:17.129889Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bc23cd0234000 error="database not open"
ts=2021-02-09T00:26:17.130362Z lvl=info msg="Error exhausting result iterator" log_id=0SAze8nW000 service=task-executor error="context canceled" name=to8
ts=2021-02-09T00:26:17.148484Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bc91aa0234000 error="database not open"
ts=2021-02-09T00:26:17.148550Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bcb6490234000 error="database not open"
ts=2021-02-09T00:26:17.148657Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bcdae80234000 error="database not open"
ts=2021-02-09T00:26:17.148735Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bcff870234000 error="database not open"
ts=2021-02-09T00:26:17.148806Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bd24260234000 error="database not open"
ts=2021-02-09T00:26:17.148876Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bd48c50234000 error="database not open"
ts=2021-02-09T00:26:17.148958Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bd6d640234000 error="database not open"
ts=2021-02-09T00:26:17.149026Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bd92030234000 error="database not open"
ts=2021-02-09T00:26:17.149089Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bdb6a20234000 error="database not open"
ts=2021-02-09T00:26:17.149156Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bddb410234000 error="database not open"
ts=2021-02-09T00:26:17.149235Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070b4fcbf0234000 error="database not open"
ts=2021-02-09T00:26:17.149316Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bab5971234000 error="database not open"
ts=2021-02-09T00:26:17.149399Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070b28e300e34000 error="database not open"
ts=2021-02-09T00:26:17.153145Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070b822690234000 error="database not open"
ts=2021-02-09T00:26:17.153281Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070b5d8790234000 error="database not open"
ts=2021-02-09T00:26:17.153373Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bc6d0b0234000 error="database not open"
ts=2021-02-09T00:26:17.153454Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bbda8f0234000 error="database not open"
ts=2021-02-09T00:26:17.197384Z lvl=info msg="Deleted shard" log_id=0SAze8nW000 service=retention op_name=retention_delete_check db_instance=958a746c4d6ae0d4 db_shard_id=193 db_rp=autogen
ts=2021-02-09T00:26:17.197621Z lvl=info msg="Retention policy deletion check (end)" log_id=0SAze8nW000 service=retention op_name=retention_delete_check op_event=end op_elapsed=123874138.839ms

Environment info:
Linux 4.15.0-55-CUSTOMIZED x86_64
InfluxDB 2.0.3 (git: fe04d34) build_date: 2020-12-15T01:00:16Z
Running inside podman container

Config:
N/A

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions