Description
What version of Go are you using (go version)?
$ go version
go version go1.15.1 linux/amd64
Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (go env)?
OS: Fedora/YUM based OS (RHEL, CentOS, OEL, etc)
Architecture: amd64
Kernel: Reproduced on anything from 4.1 to 5.8
go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/root/.cache/go-build"
GOENV="/root/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/root/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/root/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/root/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/root/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build513552128=/tmp/go-build -gno-record-gcc-switches"
What did you do?
While writing an extremely basic HTTP load tool, I noticed some funky behavior when re-using the *http.Client (and its embedded *http.Transport).
The tool prepares a *http.Client and a *http.Request, then spins up multiple goroutines that each clone the *http.Request (since a *http.Request isn't supposed to be used in this concurrent fashion) and use the shared *http.Client (which is safe for concurrent use) to send as many requests per second as possible to a server of the user's choice.
Here is a hacky, simplified and shortened but functioning example that showcases my observation. It sends requests to a URL specified on the command line and goes through two passes of 30 seconds each: the first pass uses a shared client/transport, while the second pass uses a new client/transport for each worker.
Code Example
package main

import (
	"context"
	"crypto/tls"
	"flag"
	"fmt"
	"io"
	"io/ioutil"
	"net/http"
	"sync"
	"time"
)

func fire(ctx context.Context, wg *sync.WaitGroup, r *http.Request, c *http.Client, chGo chan struct{}, chResult chan bool) {
	defer wg.Done()
	<-chGo
	for {
		select {
		case <-ctx.Done():
			return
		default:
			chResult <- true
			resp, err := c.Do(r)
			if err != nil {
				chResult <- false
				continue
			}
			_, err = io.Copy(ioutil.Discard, resp.Body)
			if err != nil {
				chResult <- false
			}
			resp.Body.Close()
		}
	}
}

func main() {
	flagURL := flag.String("url", "", "URL for the request")
	flag.Parse()

	if *flagURL == "" {
		panic(fmt.Errorf("URL can not be empty"))
	}

	req, err := http.NewRequest("GET", *flagURL, nil)
	if err != nil {
		panic(err)
	}

	tp := http.DefaultTransport.(*http.Transport).Clone()
	// In the case of HTTPS, disable verification for testing purposes
	tp.TLSClientConfig = &tls.Config{InsecureSkipVerify: true}
	// Bump up connection limits
	tp.MaxConnsPerHost = 200
	tp.MaxIdleConns = 200
	tp.MaxIdleConnsPerHost = 200
	// Disable HTTP/2
	tp.TLSNextProto = make(map[string]func(string, *tls.Conn) http.RoundTripper)

	// First pass is with a shared transport
	hClt := &http.Client{Transport: tp}

	chGo := make(chan struct{})
	chResult := make(chan bool, 200)
	ctx, cancel := context.WithCancel(context.Background())
	d := 30 * time.Second

	var wgFire sync.WaitGroup
	for i := 1; i <= 200; i++ {
		wgFire.Add(1)
		// Need to clone request
		//
		// See https://github.com/golang/go/issues/21780
		go fire(ctx, &wgFire, req.Clone(context.Background()), hClt, chGo, chResult)
	}

	var wgStat sync.WaitGroup
	wgStat.Add(1)
	go func() {
		defer wgStat.Done()
		var reqs uint64
		var errs uint64
		for {
			v, more := <-chResult
			if !more {
				break
			}
			switch v {
			case true:
				reqs++
			case false:
				errs++
			}
		}
		fmt.Println("\nWorker sharing HTTP transport")
		fmt.Printf("Duration: %v\nRequests: %v\nRequests per second: %v\nErrors: %v\n\n", d, reqs, reqs/uint64(d.Seconds()), errs)
	}()

	close(chGo)
	time.Sleep(d)
	cancel()
	wgFire.Wait()
	close(chResult)
	wgStat.Wait()

	// Let's go again but with non-shared transports
	chGo = make(chan struct{})
	chResult = make(chan bool, 200)
	ctx, cancel = context.WithCancel(context.Background())

	for i := 1; i <= 200; i++ {
		wgFire.Add(1)
		// Here each worker gets their own transport
		hClt := &http.Client{Transport: tp.Clone()}
		// Need to clone request
		//
		// See https://github.com/golang/go/issues/21780
		go fire(ctx, &wgFire, req.Clone(context.Background()), hClt, chGo, chResult)
	}

	wgStat.Add(1)
	go func() {
		defer wgStat.Done()
		var reqs uint64
		var errs uint64
		for {
			v, more := <-chResult
			if !more {
				break
			}
			switch v {
			case true:
				reqs++
			case false:
				errs++
			}
		}
		fmt.Println("Each worker getting a HTTP transport copy")
		fmt.Printf("Duration: %v\nRequests: %v\nRequests per second: %v\nErrors: %v\n", d, reqs, reqs/uint64(d.Seconds()), errs)
	}()

	close(chGo)
	time.Sleep(d)
	cancel()
	wgFire.Wait()
	close(chResult)
	wgStat.Wait()
}
What did you expect to see?
The same RPS throughput for both scenarios, in any environment.
What did you see instead?
When running this on my local DEV machine (Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz) against a server (Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz), I get the expected results:
$ ./http_perf_compare -url http://perf.domain.local
Worker sharing HTTP transport
Duration: 30s
Requests: 1640557
Requests per second: 54685
Errors: 0
Each worker getting a HTTP transport copy
Duration: 30s
Requests: 1553561
Requests per second: 51785
Errors: 0
Both approaches result in a similar RPS.
However, when I run the same code on a client machine with the same CPU as the server (Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz), I see a ~90% increase in RPS throughput when using a separate client/transport for each worker goroutine:
$ ./http_perf_compare -url http://perf.domain.local
Worker sharing HTTP transport
Duration: 30s
Requests: 2292353
Requests per second: 76411
Errors: 0
Each worker getting a HTTP transport copy
Duration: 30s
Requests: 4442219
Requests per second: 148073
Errors: 0
This happens independently of whether plain HTTP or HTTP-over-TLS (HTTPS) is used.
Is this expected?
Thank you in advance for the time anyone takes to look into this.