fix(bigquery): make additional errors retriable: tcp timeout and http2 client connection lost #13269

Merged
shollyman merged 2 commits into googleapis:main from alvindotai:bigquery-tcp-timeout-fix on Nov 14, 2025

@MartinSahlen
Contributor

Description

The cloud.google.com/go/bigquery client does not automatically retry API calls that fail with a dial tcp: i/o timeout error. This type of error is a common transient network failure, especially in distributed cloud environments, and often occurs when initiating a connection.

The underlying Go error wrapper correctly identifies this as a retryable error (as seen by retryable: true in the error message), but the BigQuery client's internal retry predicate fails to catch it, immediately propagating the error to the user. This forces developers to build their own complex retry wrappers around the client, which should ideally be handled by the library's built-in resilience mechanisms.
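
To illustrate the kind of wrapper callers currently end up writing, here is a minimal sketch using only the standard library. The names `retryWithBackoff`, `errTransient` and `demo` are hypothetical and not part of the bigquery package; in real code `fn` would wrap a call such as `loader.Run(ctx)`, and the predicate would inspect the actual network error rather than a simulated one.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errTransient simulates a transient failure such as "dial tcp: i/o timeout".
var errTransient = errors.New("simulated transient timeout")

// retryWithBackoff calls fn up to maxAttempts times. After each failure that
// the retryable predicate accepts, it waits and doubles the delay. It returns
// early on success, on a non-retryable error, or when ctx is done.
func retryWithBackoff(ctx context.Context, maxAttempts int, base time.Duration,
	retryable func(error) bool, fn func() error) error {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	delay := base
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err = fn()
		if err == nil || !retryable(err) {
			return err
		}
		if attempt == maxAttempts {
			break
		}
		select {
		case <-time.After(delay):
			delay *= 2
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

// demo fails twice with errTransient, then succeeds on the third call.
func demo() (calls int, err error) {
	err = retryWithBackoff(context.Background(), 4, time.Millisecond,
		func(e error) bool { return errors.Is(e, errTransient) },
		func() error {
			calls++
			if calls < 3 {
				return errTransient
			}
			return nil
		})
	return calls, err
}

func main() {
	calls, err := demo()
	fmt.Println(calls, err) // 3 <nil>
}
```

Every caller has to get the predicate, the delay growth and the context handling right on its own, which is the duplication this PR aims to remove for timeout errors.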

Expected Behavior

When an API call (such as jobs.insert or jobs.query) fails with a dial tcp: i/o timeout, the client library should recognize this as a transient, retryable error and automatically retry the operation using its built-in exponential backoff strategy.

Actual Behavior

The API call fails immediately and returns the i/o timeout error directly to the caller. No retry is attempted by the library.
The full error message is similar to the following:

update table from struct: Post \"https://.../bigquery/v2/projects/.../jobs?alt=json&prettyPrint=false&uploadType=multipart\": dial tcp 34.50.146.6:443: i/o timeout (type: wrapError, retryable: true): Post \"https://.../bigquery/v2/projects/.../jobs?alt=json&prettyPrint=false&uploadType=multipart\": dial tcp 34.50.146.6:443: i/o timeout

Code Snippet

The issue can be observed with any standard API call that initiates a network request. For example, when using a Loader to start a job:

package main

import (
	"context"
	"log"

	"cloud.google.com/go/bigquery"
)

func main() {
	ctx := context.Background()
	projectID := "your-project-id"

	client, err := bigquery.NewClient(ctx, projectID)
	if err != nil {
		log.Fatalf("bigquery.NewClient: %v", err)
	}
	defer client.Close()

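	// gcsRef points at the source file; the URI below is a placeholder.
	gcsRef := bigquery.NewGCSReference("gs://your-bucket/your-file.csv")

	// This call to Run() initiates a jobs.insert API call.
	loader := client.Dataset("my_dataset").Table("my_table").LoaderFrom(gcsRef)

	job, err := loader.Run(ctx)
	if err != nil {
		// When a "dial tcp: i/o timeout" occurs, the error is returned here
		// immediately without any retry attempts from the library.
		log.Fatalf("Failed to start load job: %v", err)
	}

	// Wait for job completion.
	status, err := job.Wait(ctx)
	if err != nil {
		log.Fatalf("Failed to wait for load job: %v", err)
	}
	if err := status.Err(); err != nil {
		log.Fatalf("Load job failed: %v", err)
	}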
}

Additional Context & Analysis

The root cause appears to be in the library's internal retryableError predicate. This function does not check whether an error implements a Timeout() bool method, as net.Error values do.
The current implementation checks for interface{ Temporary() bool }:

//...
	case interface{ Temporary() bool }:
		if e.Temporary() {
			return true
		}
//...

However, a dial tcp: i/o timeout is a net.Error for which Timeout() returns true but Temporary() may not. The Temporary() method was deprecated in Go 1.18 because its meaning was ill-defined. Most errors that were once considered "temporary" are in fact timeouts.

Because the library's predicate relies on this deprecated method and omits a check for the Timeout() method, it fails to identify one of the most common types of transient network errors.

The proposed fix in this PR is to update the retryableError predicate to also include a check for timeout errors, for example:

//...
	case interface{ Timeout() bool }:
		if e.Timeout() {
			return true
		}
	case interface{ Temporary() bool }:
//...

Adding this case will improve the client's resilience and align its behavior with the expectation that transient network timeouts are handled automatically.
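
As a standalone illustration of the idea (not the library's actual retryableError), the sketch below checks for Timeout() before Temporary(). The names `isTransient` and `timeoutErr` are made up for this example, and it uses errors.As to look through wrapped errors, which is an assumption about how one might write this outside the library rather than what the PR does.

```go
package main

import (
	"errors"
	"fmt"
)

// timeoutErr is a stand-in error for this example. It implements Timeout()
// but not Temporary(), matching the case described above.
type timeoutErr struct{ timeout bool }

func (e timeoutErr) Error() string { return "i/o timeout" }
func (e timeoutErr) Timeout() bool { return e.timeout }

// isTransient is a hypothetical predicate. It reports true when the first
// error in the chain that has a Timeout() method reports a timeout, or when
// the first error that has a Temporary() method reports a temporary failure.
func isTransient(err error) bool {
	var te interface{ Timeout() bool }
	if errors.As(err, &te) && te.Timeout() {
		return true
	}
	var tmp interface{ Temporary() bool }
	if errors.As(err, &tmp) && tmp.Temporary() {
		return true
	}
	return false
}

func main() {
	wrapped := fmt.Errorf("dial tcp: %w", timeoutErr{timeout: true})
	fmt.Println(isTransient(wrapped))                   // true
	fmt.Println(isTransient(errors.New("bad request"))) // false
}
```

An error type that only implements Timeout() is never seen by a Temporary()-only check, which is the gap the added case closes.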

Hoping for positive feedback on this one and that we can get it merged quickly. Cheers!

@MartinSahlen MartinSahlen requested review from a team and Linchin October 30, 2025 15:02
@google-cla

google-cla bot commented Oct 30, 2025

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@product-auto-label product-auto-label bot added the api: bigquery label (Issues related to the BigQuery API.) Oct 30, 2025
@gemini-code-assist
Contributor

Summary of Changes

Hello @MartinSahlen, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the resilience of the cloud.google.com/go/bigquery client by modifying its error retry mechanism. It addresses a critical gap where dial tcp: i/o timeout errors, common in distributed cloud environments, were not being automatically retried. By incorporating a check for Timeout() errors, the client will now transparently handle these transient network issues with its built-in exponential backoff strategy, improving application stability and reducing the need for manual retry implementations.

Highlights

  • BigQuery Client Retry Logic: The BigQuery client's internal retryableError predicate has been updated to correctly identify dial tcp: i/o timeout errors as retryable. Previously, these transient network failures were not retried automatically, forcing users to implement custom retry logic.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request provides a valuable fix for an important resilience issue in the BigQuery client. The detailed description accurately identifies that dial tcp: i/o timeout errors were not being automatically retried. The implemented change, which adds a check for timeout errors using the Timeout() method in the retryableError function, is the correct and modern approach in Go for handling such transient network issues. This small and targeted modification will significantly improve the client's robustness and user experience by handling common network timeouts automatically. The change is well-implemented and I have no further suggestions. It is ready for merging.

@alvarowolfx alvarowolfx requested review from alvarowolfx and shollyman and removed request for Linchin October 30, 2025 15:06
@alvarowolfx alvarowolfx changed the title from 'fix(bigquery): BigQuery client does automatic retry on "dial tcp: i/o timeout" errors' to 'fix(bigquery): retry on "dial tcp: i/o timeout" errors' Oct 30, 2025
@alvarowolfx alvarowolfx changed the title from 'fix(bigquery): retry on "dial tcp: i/o timeout" errors' to 'fix(bigquery): retry on tcp timeout errors' Oct 30, 2025
@alvarowolfx
Contributor

> Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).
>
> View this failed invocation of the CLA check for more information.
>
> For the most up to date status, view the checks section at the bottom of the pull request.

@MartinSahlen can you sign the CLA?

@MartinSahlen
Contributor Author

> @MartinSahlen can you sign the CLA?

Yes, I signed it shortly after submitting the PR and also retriggered the test, which has already passed.

@shollyman
Contributor

Thanks for the PR and the detailed writeup!

One minor request: could you add a test case for this to TestRetryableErrors in bigquery_test.go? It looks like https://pkg.go.dev/net#DNSError gives us an easy-to-use example error for this case.
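
One possible shape for such a case, sketched as a runnable program rather than a real `_test.go` file: `retryable`, `testCase`, `cases` and `failures` are stand-ins invented for this example and do not reflect the actual layout of TestRetryableErrors. The only standard-library fact relied on is that `(*net.DNSError).Timeout()` returns the value of its `IsTimeout` field.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// retryable is a simplified stand-in for the predicate under test.
func retryable(err error) bool {
	var te interface{ Timeout() bool }
	return errors.As(err, &te) && te.Timeout()
}

// testCase mirrors the usual table-driven layout: input error, expectation.
type testCase struct {
	description string
	in          error
	want        bool
}

func cases() []testCase {
	return []testCase{
		{"nil error", nil, false},
		{"DNS timeout", &net.DNSError{IsTimeout: true}, true},
		{"DNS not found", &net.DNSError{Err: "no such host", IsNotFound: true}, false},
		{"wrapped DNS timeout", fmt.Errorf("dial: %w", &net.DNSError{IsTimeout: true}), true},
	}
}

// failures runs every case and returns how many did not match expectations.
func failures() int {
	n := 0
	for _, tc := range cases() {
		if got := retryable(tc.in); got != tc.want {
			fmt.Printf("case %q: got %t, want %t\n", tc.description, got, tc.want)
			n++
		}
	}
	return n
}

func main() {
	fmt.Println("failures:", failures()) // failures: 0
}
```

In a real test file the loop body would call `t.Errorf` instead of printing, and the predicate would be the library's own.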

@MartinSahlen
Contributor Author

> Thanks for the PR and the detailed writeup!
>
> One minor request: could you add a test case for this to TestRetryableErrors in bigquery_test.go? It looks like https://pkg.go.dev/net#DNSError gives us an easy-to-use example error for this case.

Thanks! I can give it a go. First I need to understand the test structure and how to set up some mock errors/responses; I'll give a shout if I get stuck.

@MartinSahlen
Contributor Author

Hi @shollyman, I added a test. In the meantime, however, we have observed one more error, "http2: client connection lost", which I decided to also add to this PR with a corresponding test. Perhaps the PR title should change to reflect this when squash-merging.
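
For readers wondering how an error like this can be recognized at all, here is a rough sketch that matches on the message text. It assumes there is no exported sentinel value to compare against, so a substring check is the fallback; `isConnectionLost`, `errLost` and `errOther` are hypothetical names, and this is not necessarily how the change in this PR is implemented.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// http2ConnLost is the message quoted in the comment above.
const http2ConnLost = "http2: client connection lost"

// Sample errors for the demonstration. The URL is a placeholder.
var (
	errLost  = fmt.Errorf("Post \"https://example.invalid/jobs\": %w", errors.New(http2ConnLost))
	errOther = errors.New("invalid query")
)

// isConnectionLost reports whether err's message mentions the lost-connection
// text. Wrapping with %w keeps the inner message in the outer one, so a
// single substring check on the outermost error covers the wrapped case.
func isConnectionLost(err error) bool {
	return err != nil && strings.Contains(err.Error(), http2ConnLost)
}

func main() {
	fmt.Println(isConnectionLost(errLost))  // true
	fmt.Println(isConnectionLost(errOther)) // false
	fmt.Println(isConnectionLost(nil))      // false
}
```

String matching is brittle if the upstream message ever changes, which is one reason a test pinning the exact text is useful.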

@MartinSahlen MartinSahlen force-pushed the bigquery-tcp-timeout-fix branch from 9d64469 to add8366 Compare November 7, 2025 20:13
@MartinSahlen
Contributor Author

Hi @shollyman and @alvarowolfx, any next steps here? Or any idea of a timeline on your end?

@MartinSahlen MartinSahlen changed the title from 'fix(bigquery): retry on tcp timeout errors' to 'fix(bigquery): additional retriable errors: tcp timeout and http2 client connection lost' Nov 11, 2025
@MartinSahlen MartinSahlen changed the title from 'fix(bigquery): additional retriable errors: tcp timeout and http2 client connection lost' to 'fix(bigquery): make additional errors retriable: tcp timeout and http2 client connection lost' Nov 11, 2025
@joshk0

joshk0 commented Nov 11, 2025

By the way, the storage client handles it through an extension point that gives the user the ability to supply a custom method to determine whether a given error is retryable, in addition to the built-in logic.

https://github.com/googleapis/google-cloud-go/blob/storage/v1.57.1/storage/storage.go#L2528

I'm not sure what the overall SDK strategy is, but for bigquery, it might be nice to add some of these errors we have seen in the wild (as this change accomplishes), as well as future proofing with an extension point that lets users easily add their own retry cases.
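
To make the suggestion concrete, here is a minimal sketch of what such an extension point could look like. Everything in it is hypothetical: `retryPolicy`, its `extra` field, `builtinRetryable` and the sample errors are invented for illustration, and the bigquery package does not expose anything like this as part of this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// errProxy stands in for an application-specific error the library does not
// know about. errFatal stands in for an error nobody wants retried.
var (
	errProxy = errors.New("corporate proxy hiccup")
	errFatal = errors.New("permission denied")
)

// builtinRetryable is a placeholder for the library's built-in logic. Here it
// only recognizes errors whose chain reports Timeout() == true.
func builtinRetryable(err error) bool {
	var te interface{ Timeout() bool }
	return errors.As(err, &te) && te.Timeout()
}

// retryPolicy is a hypothetical extension point. If extra is non-nil, it is
// consulted in addition to the built-in checks.
type retryPolicy struct {
	extra func(error) bool
}

func (p retryPolicy) shouldRetry(err error) bool {
	if err == nil {
		return false
	}
	if builtinRetryable(err) {
		return true
	}
	return p.extra != nil && p.extra(err)
}

// userPolicy shows a caller opting errProxy into retries.
var userPolicy = retryPolicy{
	extra: func(err error) bool { return errors.Is(err, errProxy) },
}

func main() {
	fmt.Println(userPolicy.shouldRetry(errProxy))    // true
	fmt.Println(userPolicy.shouldRetry(errFatal))    // false
	fmt.Println(retryPolicy{}.shouldRetry(errProxy)) // false
}
```

Running the user hook after the built-in checks means callers can widen the set of retried errors but cannot accidentally narrow it. Whether that is the right trade-off is a design question for the maintainers.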

@shollyman
Contributor

Apologies, I've been OOO and playing catch-up. Taking another look now.

@shollyman shollyman added the kokoro:force-run label (Add this label to force Kokoro to re-run the tests.) Nov 14, 2025
@kokoro-team kokoro-team removed the kokoro:force-run label (Add this label to force Kokoro to re-run the tests.) Nov 14, 2025
@shollyman shollyman merged commit 466d309 into googleapis:main Nov 14, 2025
147 of 148 checks passed
@shollyman
Contributor

Thanks again for the contribution!

@MartinSahlen
Contributor Author

> Thanks again for the contribution!

No worries! That being said, I think @joshk0's suggestion might be one to consider. We don't see these errors in the Python clients, most likely because Python accounts for 95% (or more) of users' interaction with the BigQuery APIs and, as such, its clients have more robust error handling. Until we see "everything" and can make all errors retryable, giving users some way to tell the library manually what should be retriable seems like a good stop-gap.

bhshkh added a commit that referenced this pull request Feb 4, 2026
PR created by the Librarian CLI to initialize a release. Merging this PR will auto trigger a release.

Librarian Version: v0.8.0
Language Image: us-central1-docker.pkg.dev/cloud-sdk-librarian-prod/images-prod/librarian-go@sha256:01189c9771ac4150742aed38eb52e19a008018889066002742034b7f82db070f
<details><summary>bigquery: 1.73.0</summary>

## [1.73.0](bigquery/v1.72.0...bigquery/v1.73.0) (2026-02-04)

### Features

* add Stored Procedure Sharing support for analyticshub listings (PiperOrigin-RevId: 827828462) ([185951b](185951b3))

* add tags support for Pub/Sub subscriptions (PiperOrigin-RevId: 827828462) ([185951b](185951b3))

* Support picosecond timestamp precision in BigQuery Storage API (PiperOrigin-RevId: 829486853) ([185951b](185951b3))

* add timestamp precision support to schema (#13421) ([52020af](52020af5))

* transition format options (#13422) ([59efe32](59efe323))

### Bug Fixes

* make additional errors retriable: tcp timeout and http2 client connection lost (#13269) ([466d309](466d309d))

* roundtrip readonly fields (#13370) ([9e84705](9e847052))

### Documentation

* change comment indicating `enable_gemini_in_bigquery` field for BigQuery Reservation Assignments is deprecated (PiperOrigin-RevId: 850121797) ([35d7578](35d75787))

</details>