Skip to content

Commit 466d309

Browse files
authored
fix(bigquery): make additional errors retriable: tcp timeout and http2 client connection lost (#13269)
#### Description The [cloud.google.com/go/bigquery](http://cloud.google.com/go/bigquery) client does not automatically retry API calls that fail with a dial tcp: i/o timeout error. This type of error is a common transient network failure, especially in distributed cloud environments, and often occurs when initiating a connection. The underlying Go error wrapper correctly identifies this as a retryable error (as seen by retryable: true in the error message), but the BigQuery client's internal retry predicate fails to catch it, immediately propagating the error to the user. This forces developers to build their own complex retry wrappers around the client, which should ideally be handled by the library's built-in resilience mechanisms. #### Expected Behavior When an API call (such as `jobs.insert` or `jobs.query`) fails with a `dial tcp: i/o timeout`, the client library should recognize this as a transient, retryable error and automatically retry the operation using its built-in exponential backoff strategy. #### Actual Behavior The API call fails immediately and returns the i/o timeout error directly to the caller. No retry is attempted by the library. The full error message is similar to the following: ``` update table from struct: Post \"https://.../bigquery/v2/projects/.../jobs?alt=json&prettyPrint=false&uploadType=multipart\": dial tcp 34.50.146.6:443: i/o timeout (type: wrapError, retryable: true): Post \"https://.../bigquery/v2/projects/.../jobs?alt=json&prettyPrint=false&uploadType=multipart\": dial tcp 34.50.146.6:443: i/o timeout ``` #### Code Snippet The issue can be observed with any standard API call that initiates a network request. For example, when using a Loader to start a job: ```go package main import ( "context" "log" "cloud.google.com/go/bigquery" ) func main() { ctx := context.Background() projectID := "your-project-id" client, err := bigquery.NewClient(ctx, projectID) if err!= nil { log.Fatalf("bigquery.NewClient: %v", err) } defer client.Close() // Assume 'gcsRef' is a *bigquery.GCSReference pointing to a file. // This call to Run() initiates a jobs.insert API call. loader := client.Dataset("my_dataset").Table("my_table").LoaderFrom(gcsRef) job, err := loader.Run(ctx) if err!= nil { // When a "dial tcp: i/o timeout" occurs, the error is returned here // immediately without any retry attempts from the library. log.Fatalf("Failed to start load job: %v", err) } //... wait for job completion } ``` #### Additional Context & Analysis The root cause appears to be in the library's internal `retryableError` predicate. This function does not check for errors that satisfy the `net.Error Timeout()` method. The current implementation checks for `interface{ Temporary() bool }`: ```go //... case interface{ Temporary() bool }: if e.Temporary() { return true } //... ``` However, a `dial tcp: i/o timeout` is a `net.Error` where `Timeout()` returns `true`, but `Temporary()` may not. The `Temporary()` method was deprecated in Go 1.18 because its definition was ambiguous and ill-defined. Most errors that were once "temporary" are now more accurately classified as timeouts. Because the library's predicate relies on this deprecated method and omits a check for the `Timeout()` method, it fails to identify one of the most common types of transient network errors. The proposed fix in this PR is to update the `retryableError` predicate to also include a check for timeout errors, for example: ```go //... case interface{ Timeout() bool }: if e.Timeout() { return true } case interface{ Temporary() bool }: //... ``` Adding this case will improve the client's resilience and align its behavior with the expectation that transient network timeouts are handled automatically. Hoping for positive feedback on this one and that we can get it merged quickly. Cheers!
1 parent 29b3b28 commit 466d309

File tree

2 files changed

+22
-0
lines changed

2 files changed

+22
-0
lines changed

bigquery/bigquery.go

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -268,6 +268,9 @@ func retryableError(err error, allowedReasons []string) bool {
268268
if err.Error() == "http2: stream closed" {
269269
return true
270270
}
271+
if err.Error() == "http2: client connection lost" {
272+
return true
273+
}
271274

272275
switch e := err.(type) {
273276
case *googleapi.Error:
@@ -293,6 +296,10 @@ func retryableError(err error, allowedReasons []string) bool {
293296
return true
294297
}
295298
}
299+
case interface{ Timeout() bool }:
300+
if e.Timeout() {
301+
return true
302+
}
296303
case interface{ Temporary() bool }:
297304
if e.Temporary() {
298305
return true

bigquery/bigquery_test.go

Lines changed: 15 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -17,6 +17,7 @@ package bigquery
1717
import (
1818
"errors"
1919
"io"
20+
"net"
2021
"net/http"
2122
"net/url"
2223
"testing"
@@ -42,6 +43,20 @@ func TestRetryableErrors(t *testing.T) {
4243
in: errors.New("http2: stream closed"),
4344
wantRetry: true,
4445
},
46+
{
47+
description: "http client connection lost",
48+
in: errors.New("http2: client connection lost"),
49+
wantRetry: true,
50+
},
51+
{
52+
description: "tcp timeout error",
53+
in: &net.DNSError{
54+
Err: "timeout",
55+
56+
IsTimeout: true,
57+
},
58+
wantRetry: true,
59+
},
4560
{
4661
description: "io ErrUnexpectedEOF",
4762
in: io.ErrUnexpectedEOF,

0 commit comments

Comments
 (0)