Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kube::runtime::watcher exits with The resourceVersion for the provided watch is too old #1615

Closed
yorik opened this issue Oct 24, 2024 · 5 comments · Fixed by #1616
Closed
Labels
discussions possibly more of a discussion piece than an issue

Comments

@yorik
Copy link

yorik commented Oct 24, 2024

Current and expected behavior

When I'm trying to watch k8s events, watcher exits with error ErrorResponse { status: "Failure", message: "The resourceVersion for the provided watch is too old.", reason: "Expired", code: 410 } instead of re-trying connection.

Looks like this happens only for Event and probably when there is nothing going on in a namespace. The namespace from the example has bunch of deployments, but all of them with 0 replicas.

Possible solution

No response

Additional context

Code to prove the bug:

use anyhow::Result;
use futures::{StreamExt, TryStreamExt};
use k8s_openapi::api::core::v1::Event;
use kube::Client;
use kube::{
    api::Api,
    runtime::{watcher, WatchStreamExt},
};

#[tokio::main]
async fn main() -> Result<()> {
    tracing_subscriber::fmt()
        .with_max_level(tracing::Level::DEBUG)
        .with_target(false)
        .init();

    let client = Client::try_default().await?;
    let namespaces = vec!["prover-blue"];
    let mut tasks = vec![];

    for ns in namespaces {
        let client_clone = client.clone();
        let task = tokio::spawn(async move {
            let events: Api<Event> = Api::namespaced(client_clone, ns);

            let mut stream = watcher(events, Default::default())
                .applied_objects()
                .boxed();

            while let Some(event) = stream.try_next().await.unwrap() {
                println!(
                    "Event in namespace {}: reason: {:?}, message: {:?}",
                    ns, event.reason, event.message
                );
            }

            Ok::<_, anyhow::Error>(())
        });

        tasks.push(task);
    }

    futures::future::try_join_all(tasks).await?;
    Ok(())
}

Log before exits:

2024-10-24T16:13:57.172693Z DEBUG service.ready=true processing request
2024-10-24T16:13:57.172726Z DEBUG service.ready=true message=processing request    
2024-10-24T16:13:57.172914Z DEBUG ++ HTTP; http.method=GET http.url=https://10.0.0.1/api/v1/namespaces/prover-blue/events?&watch=true&timeoutSeconds=290&allowWatchBookmarks=true&resourceVersion=67611575 otel.name="watch" otel.kind="client" span=274877906945    
2024-10-24T16:13:57.172934Z DEBUG HTTP{http.method=GET http.url=https://10.0.0.1/api/v1/namespaces/prover-blue/events?&watch=true&timeoutSeconds=290&allowWatchBookmarks=true&resourceVersion=67611575 otel.name="watch" otel.kind="client"}: requesting
2024-10-24T16:13:57.172962Z DEBUG HTTP{http.method=GET http.url=https://10.0.0.1/api/v1/namespaces/prover-blue/events?&watch=true&timeoutSeconds=290&allowWatchBookmarks=true&resourceVersion=67611575 otel.name="watch" otel.kind="client"}: requesting    
2024-10-24T16:13:57.173036Z DEBUG HTTP{http.method=GET http.url=https://10.0.0.1/api/v1/namespaces/prover-blue/events?&watch=true&timeoutSeconds=290&allowWatchBookmarks=true&resourceVersion=67611575 otel.name="watch" otel.kind="client"}: reuse idle connection for ("https", 10.0.0.1)
2024-10-24T16:13:57.173056Z DEBUG HTTP{http.method=GET http.url=https://10.0.0.1/api/v1/namespaces/prover-blue/events?&watch=true&timeoutSeconds=290&allowWatchBookmarks=true&resourceVersion=67611575 otel.name="watch" otel.kind="client"}: reuse idle connection for ("https", 10.0.0.1)    
2024-10-24T16:13:57.294799Z DEBUG HTTP{http.method=GET http.url=https://10.0.0.1/api/v1/namespaces/prover-blue/events?&watch=true&timeoutSeconds=290&allowWatchBookmarks=true&resourceVersion=67611575 otel.name="watch" otel.kind="client" http.status_code=200}: HTTP; http.status_code=200 span=274877906945    
2024-10-24T16:18:47.297228Z DEBUG pooling idle connection for ("https", 10.0.0.1)
2024-10-24T16:18:47.297350Z DEBUG pooling idle connection for ("https", 10.0.0.1)    
2024-10-24T16:18:47.297887Z DEBUG service.ready=true processing request
2024-10-24T16:18:47.298043Z DEBUG service.ready=true message=processing request    
2024-10-24T16:18:47.298409Z DEBUG ++ HTTP; http.method=GET http.url=https://10.0.0.1/api/v1/namespaces/prover-blue/events?&watch=true&timeoutSeconds=290&allowWatchBookmarks=true&resourceVersion=67611575 otel.name="watch" otel.kind="client" span=549755813889    
2024-10-24T16:18:47.298448Z DEBUG HTTP{http.method=GET http.url=https://10.0.0.1/api/v1/namespaces/prover-blue/events?&watch=true&timeoutSeconds=290&allowWatchBookmarks=true&resourceVersion=67611575 otel.name="watch" otel.kind="client"}: requesting
2024-10-24T16:18:47.298492Z DEBUG HTTP{http.method=GET http.url=https://10.0.0.1/api/v1/namespaces/prover-blue/events?&watch=true&timeoutSeconds=290&allowWatchBookmarks=true&resourceVersion=67611575 otel.name="watch" otel.kind="client"}: requesting    
2024-10-24T16:18:47.298641Z DEBUG HTTP{http.method=GET http.url=https://10.0.0.1/api/v1/namespaces/prover-blue/events?&watch=true&timeoutSeconds=290&allowWatchBookmarks=true&resourceVersion=67611575 otel.name="watch" otel.kind="client"}: reuse idle connection for ("https", 10.0.0.1)
2024-10-24T16:18:47.298684Z DEBUG HTTP{http.method=GET http.url=https://10.0.0.1/api/v1/namespaces/prover-blue/events?&watch=true&timeoutSeconds=290&allowWatchBookmarks=true&resourceVersion=67611575 otel.name="watch" otel.kind="client"}: reuse idle connection for ("https", 10.0.0.1)    
2024-10-24T16:18:47.459567Z DEBUG HTTP{http.method=GET http.url=https://10.0.0.1/api/v1/namespaces/prover-blue/events?&watch=true&timeoutSeconds=290&allowWatchBookmarks=true&resourceVersion=67611575 otel.name="watch" otel.kind="client" http.status_code=200}: HTTP; http.status_code=200 span=549755813889    
2024-10-24T16:18:47.527182Z DEBUG error watchevent error: ErrorResponse { status: "Failure", message: "The resourceVersion for the provided watch is too old.", reason: "Expired", code: 410 }
2024-10-24T16:18:47.527297Z DEBUG error watchevent error: ErrorResponse { status: "Failure", message: "The resourceVersion for the provided watch is too old.", reason: "Expired", code: 410 }    
thread 'tokio-runtime-worker' panicked at src/main.rs:153:61:
called `Result::unwrap()` on an `Err` value: WatchError(ErrorResponse { status: "Failure", message: "The resourceVersion for the provided watch is too old.", reason: "Expired", code: 410 })
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
2024-10-24T16:18:47.528049Z DEBUG pooling idle connection for ("https", 10.0.0.1)
2024-10-24T16:18:47.528115Z DEBUG pooling idle connection for ("https", 10.0.0.1)    
Error: task 14 panicked with message "called `Result::unwrap()` on an `Err` value: WatchError(ErrorResponse { status: \"Failure\", message: \"The resourceVersion for the provided watch is too old.\", reason: \"Expired\", code: 410 })"

Environment

$ kubectl version        
Client Version: v1.31.1
Kustomize Version: v5.4.2
Server Version: v1.31.1-gke.1000000

GKE
Debian sid

Configuration and features

k8s-openapi = { version = "0.23.0", features = ["latest"] }
kube = { version = "0.96.0", features = ["runtime"] }

Affected crates

kube-runtime

Would you like to work on fixing this bug?

maybe

@yorik yorik added the bug Something isn't working label Oct 24, 2024
@clux
Copy link
Member

clux commented Oct 24, 2024

Perhaps this is a bit counter-intuitive, but watcher::Error is an error type that we mostly propagate for logging and our own recovery. The errors may occur, but the errors in that enum are generally recoverable.

The problem is your unwrap(). You can run it with a for_each (as e.g. this) and it should recover (the watcher will relist from rv=0).

@nightkr
Copy link
Member

nightkr commented Oct 24, 2024

Hey, there are really several separate issues going on here:

  1. Panicking on errors: stream.try_next().await.unwrap()

This will panic on the first error received. watcher is designed to retry in the face of errors, but will still propagate them to you as well. However, unwrap() will cause it to panic (and implicitly terminate the watch due to unwinding).

Instead, you could do something like:

            while let Some(event) = stream.try_next().await {
                match event {
                    Ok(event) => println!(
                        "Event in namespace {}: reason: {:?}, message: {:?}",
                        ns, event.reason, event.message
                    ),
                    Err(err) => eprintln!("Error during watch: {err:?}"),
                }
            }
  1. The connection cannot be resumed

When reconnecting, watcher will try to pick up the stream based on the timestamp (resourceVersion) of last message it received (resourceVersion).

However, Kubernetes only keeps the replay cache for so long, and will reject attempts to reconnect if the resourceVersion is too old. This can happen if the newest message is older than the size of the replay cache. kube tries to work around this using BOOKMARK messages (https://docs.rs/kube/latest/kube/runtime/watcher/struct.Config.html#structfield.bookmarks), which requests that the API server should send periodic empty messages with just the latest timestamp during idle periods. It may be worth investigating whether these bookmark messages are not being sent/interpreted for some reason.

@clux clux added discussions possibly more of a discussion piece than an issue and removed bug Something isn't working labels Oct 24, 2024
@yorik
Copy link
Author

yorik commented Oct 24, 2024

I didn't expect that I should just continue stream.try_next(), but it works!
Thank you so much for the explanation!

Maybe https://github.com/kube-rs/kube/blob/main/examples/event_watcher.rs#L48 should be updated with the match?

nightkr added a commit to nightkr/kube-rs that referenced this issue Oct 24, 2024
Fixes kube-rs#1615

Signed-off-by: Natalie Klestrup Röijezon <nat@nullable.se>
@nightkr
Copy link
Member

nightkr commented Oct 24, 2024

Updated the example in #1616.

@nightkr nightkr linked a pull request Oct 24, 2024 that will close this issue
@yorik
Copy link
Author

yorik commented Oct 25, 2024

Thank you! Looks good to me.

@clux clux closed this as completed in 4a63e81 Oct 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discussions possibly more of a discussion piece than an issue
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants