
Negative WaitGroup counter during analyze causes TiDB to be killed #52601

Closed
hawkingrei opened this issue Apr 15, 2024 · 5 comments · Fixed by #52634 · May be fixed by #52604
Labels
affects-6.5 affects-7.1 affects-7.5 affects-8.1 report/customer Customers have encountered this bug. severity/moderate sig/planner SIG: Planner type/bug The issue is confirmed as a bug.

Comments

@hawkingrei (Member) commented Apr 15, 2024

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

Analyze a table with large columns such as BLOB.

2. What did you expect to see? (Required)

success

3. What did you see instead (Required)

[image]

4. What is your TiDB version? (Required)

v6.5.5

@elsa0520 (Contributor) commented Apr 16, 2024

Triggering conditions

  1. auto analyze is triggered
  2. the analyze memory limit control is enabled via tidb_mem_quota_analyze (https://docs.pingcap.com/zh/tidb/v6.5/system-variables#tidb_mem_quota_analyze-%E4%BB%8E-v610-%E7%89%88%E6%9C%AC%E5%BC%80%E5%A7%8B%E5%BC%95%E5%85%A5)
  3. the analyze job uses more memory than the quota, for example because the table has large column types such as BLOB, JSON, or TEXT

Even if all three conditions are met, a TiDB restart is not guaranteed to be triggered; the issue fires only with a certain probability.
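
As an illustration of the third condition, here is a minimal, self-contained Go sketch of the mechanism (this is not TiDB's actual memory.Tracker API; QuotaTracker and the 64 MiB quota are made up for this example): once accumulated consumption crosses the quota, the tracker's action panics, and that panic surfaces inside whatever analyze worker happened to call Consume.

package main

import "fmt"

// QuotaTracker is a toy stand-in for a global analyze-memory tracker.
// Once accumulated consumption exceeds the quota, its "action" panics,
// mirroring the "Out Of Global Analyze Memory Limit!" behavior above.
type QuotaTracker struct {
	quota    int64
	consumed int64
}

// Consume adds bytes to the running total and panics once the quota is exceeded.
func (t *QuotaTracker) Consume(bytes int64) {
	t.consumed += bytes
	if t.consumed > t.quota {
		panic("Out Of Global Analyze Memory Limit!")
	}
}

func main() {
	// Hypothetical 64 MiB quota, standing in for tidb_mem_quota_analyze.
	t := &QuotaTracker{quota: 64 << 20}
	// A batch of rows with large BLOB/JSON/TEXT columns can consume far more
	// than a batch of small columns, so a single call can trip the quota.
	t.Consume(100 << 20)
	fmt.Println("never reached")
}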

@elsa0520 (Contributor) commented Apr 16, 2024

Phenomenon and Judgment Evidence

  1. TiDB restarted with the following panic on stdout:
panic: Out Of Global Analyze Memory Limit! [recovered]
        panic: sync: negative WaitGroup counter

goroutine 168489963 [running]:
sync.(*WaitGroup).Add(0x4672ca0?, 0xc0801be170?)
        /usr/local/go/src/sync/waitgroup.go:83 +0xda
sync.(*WaitGroup).Done(...)
        /usr/local/go/src/sync/waitgroup.go:108
github.com/pingcap/tidb/executor.(*AnalyzeColumnsExecV2).subMergeWorker.func1()
        /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/analyze_col_v2.go:589 +0x3f4
panic({0x4672ca0, 0xc0801be170})
        /usr/local/go/src/runtime/panic.go:884 +0x212
github.com/pingcap/tidb/executor.(*globalPanicOnExceed).Action(0x212524e?, 0xc000377f80?)
        /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/executor.go:186 +0x9f
github.com/pingcap/tidb/util/memory.(*Tracker).Consume.func1(0xc0012ce1f0, 0xc0d515c280?)
        /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/util/memory/tracker.go:440 +0x10b
github.com/pingcap/tidb/util/memory.(*Tracker).Consume(0xc06ccf5680?, 0x11264fb)
        /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/util/memory/tracker.go:484 +0x332
github.com/pingcap/tidb/statistics.(*baseCollector).FromProto(0xc0235a2000, 0xc0c7ed6080, 0x1177ba6?)
        /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/statistics/row_sampler.go:307 +0x617
github.com/pingcap/tidb/executor.(*AnalyzeColumnsExecV2).subMergeWorker(0xc1027b4108, 0xc00d602b40, 0xc00d602ba0, 0x10, 0x0)
        /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/analyze_col_v2.go:624 +0x6f1
created by github.com/pingcap/tidb/executor.(*AnalyzeColumnsExecV2).buildSamplingStats
        /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/analyze_col_v2.go:260 +0xa37
  2. Before TiDB is killed, the log includes "analyze table xxx has failed", "analyze worker panicked", and "Out Of Global Analyze Memory Limit!":
[2024/04/15 22:09:27.477 +08:00] [INFO] [update.go:1156] ["[stats] auto analyze triggered"] [sql="analyze table `zwcdb`.`plat_trans_rsp`"] [reason="too many modifications(67
5962/275669>0.1)"]
[2024/04/15 22:09:34.415 +08:00] [INFO] [analyze.go:575] ["analyze table `zwcdb`.`plat_trans_rsp` has failed"] [partition=] ["job info"="auto analyze table all columns with
256 buckets, 500 topn, 0.17925586003702448 samplerate"] ["start time"=2024/04/15 22:09:27.486 +08:00] ["end time"=2024/04/15 22:09:34.410 +08:00] [cost=6.924309188s]
[2024/04/15 22:09:40.927 +08:00] [ERROR] [analyze_col_v2.go:578] ["analyze worker panicked"] [recover="Out Of Global Analyze Memory Limit!"] [stack="github.com/pingcap/tidb/
executor.(*AnalyzeColumnsExecV2).subMergeWorker.func1\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/analyze_col_v2.go:578\nruntime.go
panic\n\t/usr/local/go/src/runtime/panic.go:884\ngithub.com/pingcap/tidb/executor.(*globalPanicOnExceed).Action\n\t/home/jenkins/agent/workspace/build-common/go/src/github.c
om/pingcap/tidb/executor/executor.go:186\ngithub.com/pingcap/tidb/util/memory.(*Tracker).Consume.func1\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingca
p/tidb/util/memory/tracker.go:440\ngithub.com/pingcap/tidb/util/memory.(*Tracker).Consume\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/util/m
emory/tracker.go:484\ngithub.com/pingcap/tidb/statistics.(*baseCollector).FromProto\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/statistics/r
ow_sampler.go:307\ngithub.com/pingcap/tidb/executor.(*AnalyzeColumnsExecV2).subMergeWorker\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/execu
tor/analyze_col_v2.go:624"]
  3. Immediately after the time of this log, [2024/04/15 22:09:34.415 +08:00] [INFO] [analyze.go:575] ["analyze table `zwcdb`.`plat_trans_rsp` has failed"], a new analyze job is created:

[image]

  4. The analyze job
  • is in the running state and its end_time is NULL
  • its start_time is later than the time of the "analyze ... has failed" log
  • its start_time is earlier than the TiDB restart time

@elsa0520 (Contributor) commented Apr 16, 2024

Root cause and analysis

Root cause

The analyze job panics when it retries after exceeding the memory limit. (There is a certain probability that this happens.)

Analysis

  1. The analyze job is retried when analyzeColumnsPushDownV2() returns an error.
  2. The retried job and the first job share the same parameter samplingMergeWg *util.WaitGroupWrapper, which leads to the "panic: sync: negative WaitGroup counter":
func (e *AnalyzeColumnsExecV2) analyzeColumnsPushDownWithRetryV2() *statistics.AnalyzeResults {
	analyzeResult := e.analyzeColumnsPushDownV2()
	// do not retry if succeed / not oom error / not auto-analyze / samplerate not set
	if analyzeResult.Err == nil || analyzeResult.Err != errAnalyzeOOM ||
		!e.ctx.GetSessionVars().InRestrictedSQL ||
		e.analyzePB.ColReq == nil || *e.analyzePB.ColReq.SampleRate <= 0 {
		return analyzeResult
	}
	finishJobWithLog(e.ctx, analyzeResult.Job, analyzeResult.Err)
	......
	prepareV2AnalyzeJobInfo(e.AnalyzeColumnsExec, true)
	AddNewAnalyzeJob(e.ctx, e.job)
	StartAnalyzeJob(e.ctx, e.job)
	return e.analyzeColumnsPushDownV2() // the retry reuses the same executor e, including e.samplingMergeWg
}
func (e *AnalyzeColumnsExecV2) subMergeWorker(resultCh chan<- *samplingMergeResult, taskCh <-chan []byte, l int, index int) {
	isClosedChanThread := index == 0
	defer func() {
		if r := recover(); r != nil {
			logutil.BgLogger().Error("analyze worker panicked", zap.Any("recover", r), zap.Stack("stack"))
			metrics.PanicCounter.WithLabelValues(metrics.LabelAnalyze).Inc()
			resultCh <- &samplingMergeResult{err: getAnalyzePanicErr(r)}
		}
		// Consume the remaining things.
		for {
			_, ok := <-taskCh
			if !ok {
				break
			}
		}
		e.samplingMergeWg.Done() // if the retried job has already replaced e.samplingMergeWg, this Done hits the new WaitGroup
		if isClosedChanThread {
			e.samplingMergeWg.Wait()
			close(resultCh)
		}
	}()
	failpoint.Inject("mockAnalyzeSamplingMergeWorkerPanic", func() {
		panic("failpoint triggered")
	})
	failpoint.Inject("mockAnalyzeMergeWorkerSlowConsume", func(val failpoint.Value) {
		times := val.(int)
		for i := 0; i < times; i++ {
			e.memTracker.Consume(5 << 20)
			time.Sleep(100 * time.Millisecond)
		}
	})
	retCollector := statistics.NewRowSampleCollector(int(e.analyzePB.ColReq.SampleSize), e.analyzePB.ColReq.GetSampleRate(), l)
	for i := 0; i < l; i++ {
		retCollector.Base().FMSketches = append(retCollector.Base().FMSketches, statistics.NewFMSketch(maxSketchSize))
	}
	for {
		data, ok := <-taskCh
		if !ok {
			break
		}
		dataSize := int64(cap(data))
		colResp := &tipb.AnalyzeColumnsResp{}
		err := colResp.Unmarshal(data)
		if err != nil {
			resultCh <- &samplingMergeResult{err: err}
			return
		}
		colRespSize := int64(colResp.Size())
		e.memTracker.Consume(colRespSize)
		subCollector := statistics.NewRowSampleCollector(int(e.analyzePB.ColReq.SampleSize), e.analyzePB.ColReq.GetSampleRate(), l)
		subCollector.Base().FromProto(colResp.RowCollector, e.memTracker)
		UpdateAnalyzeJob(e.ctx, e.job, subCollector.Base().Count)
		........
	}
	resultCh <- &samplingMergeResult{collector: retCollector}
}
  3. Since both jobs share e.samplingMergeWg, the second job overwrites it before the subMergeWorker() goroutines of the first job have completed, so the first job's remaining Done() calls drive the new counter negative (a minimal standalone sketch of this overwrite scenario follows the excerpt below).
defer func() {
		if r := recover(); r != nil {
			logutil.BgLogger().Error("analyze worker panicked", zap.Any("recover", r), zap.Stack("stack"))
			metrics.PanicCounter.WithLabelValues(metrics.LabelAnalyze).Inc()
			resultCh <- &samplingMergeResult{err: getAnalyzePanicErr(r)}
		}
		......
		e.samplingMergeWg.Done()
		if isClosedChanThread {
			e.samplingMergeWg.Wait()
			close(resultCh)
		}
	}()
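
To make the overwrite concrete, here is a minimal standalone sketch (exec, startWorkers, and the timings are invented for this illustration; it is not the TiDB code) in which a retry replaces a WaitGroup stored on a shared struct while workers from the first attempt are still running, which reproduces the same "sync: negative WaitGroup counter" panic:

package main

import (
	"fmt"
	"sync"
	"time"
)

// exec is a toy stand-in for AnalyzeColumnsExecV2: the WaitGroup lives on the
// shared executor, so a retry that resets the field affects workers that the
// first attempt has already started.
type exec struct {
	wg *sync.WaitGroup
}

// startWorkers mimics one analyze attempt starting its merge workers.
func (e *exec) startWorkers(n int, work time.Duration) {
	e.wg = &sync.WaitGroup{} // the retry re-runs this line and replaces the WaitGroup
	e.wg.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			defer func() {
				e.wg.Done() // e.wg is read when this closure runs, possibly after the retry replaced it
			}()
			time.Sleep(work)
		}()
	}
}

func main() {
	e := &exec{}

	// First attempt: three slow workers are still running when the retry begins.
	e.startWorkers(3, 200*time.Millisecond)

	// Retry: replaces e.wg with a fresh WaitGroup that only counts one fast worker.
	// (The unsynchronized write to e.wg is also racy, which is part of why the
	// real bug only triggers with a certain probability.)
	time.Sleep(50 * time.Millisecond)
	e.startWorkers(1, 10*time.Millisecond)

	// The three leftover workers from the first attempt now call Done on the new
	// WaitGroup and drive its counter below zero:
	//   panic: sync: negative WaitGroup counter
	time.Sleep(time.Second)
	fmt.Println("not reached once the panic fires")
}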
  4. The panic happens inside a deferred function that itself runs inside a goroutine. This kind of panic cannot be recovered even though there is a recovering defer in the outer logic.

I wrote a demo of a panic raised inside a deferred function within a goroutine. The panic cannot be recovered and the process is killed.

package main

import (
	"fmt"
	"sync"
)

func main() {
	defer func() {
		// A recover in main cannot catch panics raised in the spawned goroutines.
		if r := recover(); r != nil {
			fmt.Println("Recovered in main", r)
		}
	}()
	for i := 0; i < 2; i++ {
		go f()
	}
	fmt.Println("Returned normally from f.")
}

func f() {
	var samplingMergeWg *sync.WaitGroup
	samplingMergeWg = &sync.WaitGroup{}
	samplingMergeWg.Add(1)
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("Recovered in f", r)
		}
		samplingMergeWg.Add(-2) // drives the counter negative and panics inside this deferred function; no later recover in this goroutine can catch it
	}()
	fmt.Println("Calling g.")
	g(0)
	fmt.Println("Returned normally from g.")
}

func g(i int) {
	if i > 3 {
		fmt.Println("Panicking!")
		panic(fmt.Sprintf("%v", i))
	}
	defer fmt.Println("Defer in g", i)
	fmt.Println("Printing in g", i)
	g(i + 1)
}

The panic output of my demo code:

Calling g.
Printing in g 0
Printing in g 1
Printing in g 2
Printing in g 3
Panicking!
Defer in g 3
Defer in g 2
Defer in g 1
Returned normally from f.
Defer in g 0
Recovered in f 4
panic: 4 [recovered]
        panic: sync: negative WaitGroup counter

goroutine 18 [running]:
sync.(*WaitGroup).Add(0x104f4b098?, 0x140000a0008?)
        /Users/elsa/Package/go/src/sync/waitgroup.go:62 +0x108
main.f.func1()
        /Users/elsa/Workspace/pingcap/tidb/testpanic.go:42 +0x84
panic({0x104f374c0?, 0x14000010050?})
        /Users/elsa/Package/go/src/runtime/panic.go:914 +0x218
main.g(0x4)
        /Users/elsa/Workspace/pingcap/tidb/testpanic.go:52 +0x190
main.g(0x3)
        /Users/elsa/Workspace/pingcap/tidb/testpanic.go:56 +0xec
main.g(0x2)
        /Users/elsa/Workspace/pingcap/tidb/testpanic.go:56 +0xec
main.g(0x1)
        /Users/elsa/Workspace/pingcap/tidb/testpanic.go:56 +0xec
main.g(0x0)
        /Users/elsa/Workspace/pingcap/tidb/testpanic.go:56 +0xec
main.f()
        /Users/elsa/Workspace/pingcap/tidb/testpanic.go:45 +0xa8
created by main.main in goroutine 1
        /Users/elsa/Workspace/pingcap/tidb/testpanic.go:29 +0x4c

Process finished with the exit code 2

  1. When I tested, I also found that whether this type of panic can be caught is not consistent: some runs recover it and some do not, depending on where the recover is deferred relative to the panicking defer and on which goroutine it runs in (see the sketch below).

  2. TiDB happened to hit the uncatchable case, so the outer layer could not catch this exception and TiDB exited abnormally.
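
A minimal sketch of the two cases, assuming the difference comes from where the recover is deferred rather than from randomness: a recover deferred earlier in the same function runs after the panicking defer (defers run LIFO) and can catch it, while a recover in another goroutine, such as main, never can.

package main

import (
	"fmt"
	"sync"
	"time"
)

// caught: the recover is deferred before the panicking defer in the same
// function, so it runs later and catches the panic.
func caught() {
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("recovered in the same goroutine:", r)
		}
	}()
	wg := &sync.WaitGroup{}
	defer wg.Done() // runs first and panics: sync: negative WaitGroup counter
}

// notCaught: nothing later in this goroutine recovers, so the panic in the
// deferred Done escapes and kills the whole process.
func notCaught() {
	wg := &sync.WaitGroup{}
	defer wg.Done()
}

func main() {
	caught() // prints "recovered in the same goroutine: ..."

	defer func() {
		// Useless here: recover cannot catch panics raised in other goroutines.
		if r := recover(); r != nil {
			fmt.Println("recovered in main:", r)
		}
	}()
	go notCaught()
	time.Sleep(100 * time.Millisecond) // the process exits with the panic instead
}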

@elsa0520 (Contributor) commented:

Thinking

  1. Streamline the defer logic so that we avoid panicking inside deferred functions, especially deferred functions that run inside goroutines.
  2. Avoid resetting a WaitGroup that is shared by two different goroutines (a rough sketch of this idea follows below).
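
A rough sketch of the second point (this is only an illustration of the idea, not the actual fix in the linked PRs): give every attempt its own local WaitGroup and hand it to the workers explicitly, so a retry can never replace the counter that a previous attempt's goroutines are still using, and recover from the panic-prone work separately so the counter bookkeeping never runs during a live panic.

package main

import (
	"fmt"
	"sync"
)

// runWorkers gives each attempt its own WaitGroup. Workers capture this local
// variable, which is never reassigned, so every Done pairs with the Add above.
func runWorkers(n int, work func(id int)) {
	wg := &sync.WaitGroup{}
	wg.Add(n)
	for i := 0; i < n; i++ {
		go func(id int) {
			defer wg.Done() // bookkeeping only, paired 1:1 with the Add above
			defer func() {
				// Recover from the panic-prone work in its own defer, so the
				// defer that maintains the counter never sees a live panic.
				if r := recover(); r != nil {
					fmt.Println("worker recovered:", r)
				}
			}()
			work(id)
		}(i)
	}
	wg.Wait()
}

func main() {
	// First attempt: one worker panics, but the counter stays consistent.
	runWorkers(2, func(id int) {
		if id == 0 {
			panic("Out Of Global Analyze Memory Limit!")
		}
	})
	// Retry: uses a brand-new WaitGroup, unaffected by the first attempt.
	runWorkers(2, func(id int) {})
	fmt.Println("both attempts finished without a negative counter")
}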

@elsa0520 elsa0520 changed the title negative WaitGroup counter during the analyze negative WaitGroup counter during the analyze cause TiDB is killed Apr 16, 2024
@elsa0520 elsa0520 changed the title negative WaitGroup counter during the analyze cause TiDB is killed Negative WaitGroup counter during the analyze cause TiDB is killed Apr 16, 2024
@elsa0520 elsa0520 added the sig/planner SIG: Planner label Apr 16, 2024
ti-chi-bot bot pushed a commit that referenced this issue Apr 16, 2024
@YangKeao YangKeao removed the sig/sql-infra SIG: SQL Infra label Apr 19, 2024
@seiya-annie

/found customer

@ti-chi-bot ti-chi-bot bot added the report/customer Customers have encountered this bug. label Jun 4, 2024