
fix(remote): use OS DNS resolver in pyqwest transport#1077

Open
EngHabu wants to merge 6 commits into main from haytham/use-system-dns

EngHabu commented May 16, 2026

Motivation

Every Flyte SDK RPC goes through ConnectRPC's pyqwest HTTP transport.
pyqwest.HTTPTransport defaults to use_system_dns=False, which routes
all name lookups through the bundled Rust trust-dns resolver instead
of the OS's getaddrinfo. This subtly breaks on a common real-world
network condition.

The break

On networks that advertise IPv6 via RA but don't actually have a usable
v6 default route — typical of tethered phones, hotel Wi-Fi captive
portals after handoff, some corporate VPNs, and most mobile hotspots —
the bundled resolver can return AAAA records that the kernel then refuses
to route. Every RPC then hangs until the connect timeout and fails with:

client error (Connect): dns error: proto error: io error:
No route to host (os error 65)

curl against the same hostname succeeds on the same machine at the
same time, because curl uses getaddrinfo, which honors the OS resolver
policy and address selection. As a result, users report the confusing
symptom that "Flyte CLI is broken but my browser/curl works fine".

Repro on a hotspot:

$ curl -sI https://demo.hosted.unionai.cloud >/dev/null && echo curl-ok
curl-ok

$ python -c "import flyte; from flyte.remote import Run; \
  flyte.init_from_config(); list(Run.listall(limit=1))"
ConnectError: ... No route to host (os error 65)
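
To check what the OS resolver hands back on a suspect network, something like the following can be run next to the repro. It is a minimal diagnostic sketch using only Python's standard socket module; resolve_families is a hypothetical helper name and is not part of Flyte or pyqwest. It asks getaddrinfo for a host with and without AI_ADDRCONFIG and groups the results by address family.

```python
import socket


def resolve_families(host: str, port: int = 443, addrconfig: bool = True) -> dict:
    """Group the addresses getaddrinfo returns for `host` by address family.

    Hypothetical diagnostic helper; not part of Flyte or pyqwest.
    """
    # AI_ADDRCONFIG asks the OS to return only families it considers configured.
    flags = socket.AI_ADDRCONFIG if addrconfig else 0
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM, flags=flags)

    families = {"IPv4": [], "IPv6": []}
    for family, _socktype, _proto, _canonname, sockaddr in infos:
        if family == socket.AF_INET:
            families["IPv4"].append(sockaddr[0])
        elif family == socket.AF_INET6:
            families["IPv6"].append(sockaddr[0])
    return families


if __name__ == "__main__":
    host = "demo.hosted.unionai.cloud"  # hostname taken from the repro above
    print("with AI_ADDRCONFIG:   ", resolve_families(host, addrconfig=True))
    print("without AI_ADDRCONFIG:", resolve_families(host, addrconfig=False))
```

If the IPv6 list differs between the two calls, the OS is filtering address families that the bundled resolver would not. The exact filtering rules depend on the platform's getaddrinfo implementation, so treat the output as a hint rather than proof.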

Change

Pass use_system_dns=True to pyqwest.HTTPTransport in
_build_pyqwest_client by default. This routes lookups through
getaddrinfo, matching curl's and the rest of the OS's behavior, and
eliminates the spurious EHOSTUNREACH failures on flaky / tethered
networks.

Server deployments that prefer application-owned DNS behavior can opt
back into pyqwest's bundled resolver with:

_FLYTE_USE_PYQWEST_DNS_RESOLVER=true

No new dependencies.
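
The opt-in logic described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: should_use_system_dns is a hypothetical helper name, and the set of values treated as "true" is an assumption (the description only documents the value true). Only the environment variable name and the use_system_dns keyword come from this PR.

```python
import os
from typing import Mapping

# Environment variable name as documented in this PR.
ENV_VAR = "_FLYTE_USE_PYQWEST_DNS_RESOLVER"

# Assumption: which spellings count as "true". Only "true" is documented.
_TRUTHY = {"1", "true", "yes", "on"}


def should_use_system_dns(env: Mapping[str, str] = os.environ) -> bool:
    """Return True (use the OS resolver) unless the opt-in variable is truthy.

    Hypothetical helper for illustration; the real logic lives in
    _build_pyqwest_client and may differ.
    """
    raw = env.get(ENV_VAR, "")
    return raw.strip().lower() not in _TRUTHY


# The result would then be passed to the transport, per the PR description:
#   pyqwest.HTTPTransport(use_system_dns=should_use_system_dns())
```

Taking the environment as a parameter keeps the helper easy to unit test without patching os.environ.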

Test plan

  • Reproduced the failure on a mobile hotspot before the fix
    (Run.listall fails in ~3s with EHOSTUNREACH while curl to the
    same host succeeds)
  • Verified the fix on the same network — Run.listall(limit=3)
    completes in ~1s
  • Added unit coverage for the default system resolver behavior and
    the _FLYTE_USE_PYQWEST_DNS_RESOLVER=true opt-in
  • uv run python -m pytest tests/flyte/remote/test_session.py
  • make fmt
  • make mypy
  • GitHub CI green on 9286007d975bed8567e5b073108010c0ca0311a1

EngHabu added 2 commits May 15, 2026 19:44
pyqwest defaults to use_system_dns=False, which routes all RPCs through
the bundled trust-dns resolver. trust-dns happily returns AAAA records
even on hosts with no usable IPv6 default route (e.g. tethered mobile
hotspots that advertise IPv6 via RA but don't actually route it). The
result is every RPC hangs and eventually fails with:

    client error (Connect): dns error: proto error: io error:
    No route to host (os error 65)

curl works on the same network because it uses getaddrinfo, which
honors AI_ADDRCONFIG and suppresses AAAA records when there's no v6
default route.

Setting use_system_dns=True on the HTTPTransport routes lookups through
getaddrinfo, matching curl's behavior and eliminating the spurious
EHOSTUNREACH failures on flaky/tethered networks.

Signed-off-by: Haytham Abuelfutuh <haytham@afutuh.com>
EngHabu force-pushed the haytham/use-system-dns branch from 5ecf678 to cf2cdf9 May 20, 2026 12:51
Signed-off-by: Haytham Abuelfutuh <haytham@afutuh.com>
EngHabu force-pushed the haytham/use-system-dns branch from cf2cdf9 to 9286007 May 20, 2026 14:57