
List/Watch/Get of objects associated with node #443

Merged
merged 1 commit into from
Jul 7, 2017

Conversation

wojtek-t
Member

Ref kubernetes/kubernetes#40476

@lavalamp @liggitt @kubernetes/sig-api-machinery-misc
@dchen1107

@shyamjvs
Member

shyamjvs commented Mar 10, 2017

I have a (probably silly) high-level question about this mechanism. Since we are aiming for read access of secrets at the node level, that IIUC means we are fine with pods within a node potentially being able to read each other's secrets. If that is the case, then consider the following scenario:

  • Pods A and B (needing secrets P and Q respectively) were initially running on Node-1. Since the secrets are authorized for viewing at the node level, A and B can potentially read each other's secrets
  • Now pod B gets evicted for some reason and is rescheduled onto Node-2
  • The backend gets updated with these changes and, assuming everything goes fine, Node-1 should now have access only to secret P. But Q could possibly still be read from its cache, or pod A probably already knows it from before.

Does this mean that we need to make secrets authorized at the pod level?
Otherwise, if we totally ignore the risk of pods within the same node being able to read each other's secrets, then why really bother about pods from different nodes being able to read each other's secrets (as it's just a matter of scheduling)?

@liggitt
Member

liggitt commented Mar 10, 2017

Since we are aiming for read access of secrets at the node level, that IIUC means we are fine with pods within a node potentially being able to read each other's secrets.

Pods do not get to make API calls with the node's identity. PodSecurityPolicy can be used to disallow privileged pods, host IPC, and volume mounts that would let a pod access the node host environment.

@deads2k
Contributor

deads2k commented Mar 10, 2017

Have I missed the bit describing how the resourceVersion is honored on the watch? This will effectively be a watch whose resourceVersion is a tuple of multiple resource versions, since the determination of which secrets have been added since RV=6 depends on the state of the observed bound pods at the time you asked. You also end up with odd behavior where you see an ADD with RV=10, then a different pod is bound and you see an ADD with RV=6. That messes with reflector behavior, which uses the lastSyncResourceVersion.

We've done this in openshift with the projects resource and found the semantics to be challenging.
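The reflector concern above can be sketched in a few lines of Go. This is an illustrative simulation only (the `Event` and `Reflector` types are hypothetical, not client-go code): a consumer that tracks the last resourceVersion it synced and drops anything at or below it will silently miss the later ADD that carries the older RV=6.

```go
package main

import "fmt"

// Hypothetical sketch of reflector-style bookkeeping: events with a
// resourceVersion at or below lastSyncRV are treated as already seen.
type Event struct {
	Name string
	RV   int // resourceVersion, assumed monotonically increasing per resource
}

type Reflector struct {
	lastSyncRV int
	applied    []string
}

// Apply mimics the "RVs only move forward" assumption: a stale-looking
// event is dropped, which is exactly the lost-event problem described above.
func (r *Reflector) Apply(e Event) bool {
	if e.RV <= r.lastSyncRV {
		return false // silently dropped
	}
	r.lastSyncRV = e.RV
	r.applied = append(r.applied, e.Name)
	return true
}

func main() {
	r := &Reflector{}
	fmt.Println(r.Apply(Event{"secret-a", 10})) // true: ADD at RV=10
	// A different pod is bound, exposing an older secret with RV=6.
	fmt.Println(r.Apply(Event{"secret-b", 6})) // false: dropped, cache misses secret-b
}
```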

@wojtek-t
Member Author

@deads2k - thanks for pointing it out; I thought about the former and thought it's not a big deal. But I didn't fully realize the problem with the second one (which is kind of covered by the TODO in the doc). I will think more about it.

@deads2k
Contributor

deads2k commented Mar 10, 2017

@deads2k - thanks for pointing it out; I thought about the former and thought it's not a big deal.

I think it's still a big deal, since your list isn't synchronized to your watch. Normal resources allow a coherent and consistent view when doing:

  1. list and get RV
  2. watch using RV

With the situation as described, your watch will produce different output depending on whether it's called one millisecond later or one second later, which is a big semantic difference for list/watch behavior.
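The list-then-watch contract being defended here can be made concrete with a toy in-memory model (hypothetical `Change`/`Server` types, not the real apiserver): List returns a snapshot plus the RV it was taken at, and Watch(rv) replays exactly the changes after that RV, so the two compose with no gaps and no duplicates regardless of when the watch starts.

```go
package main

import "fmt"

// Toy sketch of the normal list-then-watch contract.
type Change struct {
	RV   int
	Name string
}

type Server struct {
	log []Change // ordered history of all changes
}

// List returns the latest RV per object and the RV of the snapshot.
func (s *Server) List() (map[string]int, int) {
	state := map[string]int{}
	maxRV := 0
	for _, c := range s.log {
		state[c.Name] = c.RV
		if c.RV > maxRV {
			maxRV = c.RV
		}
	}
	return state, maxRV
}

// Watch replays every change strictly after fromRV, in order.
func (s *Server) Watch(fromRV int) []Change {
	var out []Change
	for _, c := range s.log {
		if c.RV > fromRV {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	s := &Server{log: []Change{{1, "p"}, {2, "q"}}}
	_, rv := s.List() // snapshot at rv == 2
	s.log = append(s.log, Change{3, "r"})
	// Exactly the one change after the snapshot, whenever Watch is called.
	fmt.Println(len(s.Watch(rv)))
}
```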

@wojtek-t
Member Author

yeah - makes sense.

@deads2k - is there any place where I can read what you did in openshift for projects? Just for my own education?

@deads2k
Contributor

deads2k commented Mar 10, 2017

@deads2k - is there any place where I can read what you did in openshift for projects? Just for my own education?

I didn't solve it. I found the problem and just punted. The watch for projects doesn't respect resourceVersion at all. It just starts a "watch from now-ish" (yeah, it's not even "now"). Since its consumer is human interaction, it didn't much matter. I think it's a bigger issue for the thing driving our workloads.

@wojtek-t
Member Author

@deads2k - got it. Thanks. I will think about it and hopefully get back with something next week.

@wojtek-t wojtek-t changed the title List/Watch/Get of objects associated with node [1.7] [WIP] List/Watch/Get of objects associated with node [1.7] Mar 13, 2017
@wojtek-t
Member Author

wojtek-t commented Mar 13, 2017

I thought about it, and so far don't have a good solution. One thing that we can potentially do (that reduces the number of bad scenarios, but doesn't eliminate all of them) is:

  1. add a precondition to NodeSelector that requires "PodInformer to be synced to at least RV=X"
  2. Change (define differently) the semantics of the watch so that:
    "if binding a pod to a node results in binding some object (secret/...) to it, this doesn't result in sending an ADD event for this object (similarly for deleting a pod that unbinds an object).
    It is the responsibility of the watcher to grab the current object version when it is newly bound to a node.
    We only guarantee to send add/update/delete events for secrets that were already bound to a node".

With those changes:

  1. It should be pretty easy to modify the watchers in a node to issue a GET for a secret/... whenever a pod spec changes (and it's only a constant number of requests per pod modification, so it's not a big deal). I will describe it in more detail if/when we solve all the other issues, but it's not very difficult.

  2. Calling list/watch with a precondition that the RV of the pod informer in the watch equals the current RV of pods in the node (kubelet) will result in returning all secrets/... of pods that the kubelet is already aware of. Which solves a bunch of issues.

  3. I think that the only remaining problem that I don't yet know how to solve is a scenario like this:
    "bind a pod to node X that is referencing secret Y, and a moment later create secret Y"

If the PodInformer in the watch implementation is lagging, we can potentially observe the secret creation before the pod is bound to the node, which would mean we will not deliver the "add secret" event to the watcher.
That is probably solvable, but it would require doing a "multi-object-type" watch at the etcd level (otherwise we would have to do the serialization on our side, but that wouldn't work in non-HA setups).
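Point 1 above (the kubelet issuing a GET for newly referenced objects whenever a pod spec changes) could be sketched roughly like this; `PodSpec` and `newlyReferenced` are made-up names for illustration, not real kubelet code:

```go
package main

import "fmt"

// Hypothetical sketch of "GET on pod change": when a pod spec changes,
// diff its secret references against the old spec and fetch only the
// newly referenced ones directly, rather than waiting for watch events.
type PodSpec struct {
	Secrets []string
}

func newlyReferenced(oldSpec, newSpec PodSpec) []string {
	seen := map[string]bool{}
	for _, s := range oldSpec.Secrets {
		seen[s] = true
	}
	var added []string
	for _, s := range newSpec.Secrets {
		if !seen[s] {
			added = append(added, s)
		}
	}
	return added
}

func main() {
	oldSpec := PodSpec{Secrets: []string{"p"}}
	newSpec := PodSpec{Secrets: []string{"p", "q"}}
	// Only "q" needs a direct GET; "p" is already cached.
	fmt.Println(newlyReferenced(oldSpec, newSpec))
}
```

The number of GETs per pod modification is bounded by the number of references in the spec, which matches the "constant number of requests per pod modification" point above.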

@liggitt
Member

liggitt commented Mar 13, 2017

add a precondition to NodeSelector that requires "PodInformer to be synced to at least RV=X"

the issue is that there are lots of informers (at least pods, pvcs, pvs, secrets) that determine the objects a node should be able to see, and all of those factor in to what shows up in the list or watch...

Change (define differently) the semantics of the watch

I don't think different semantics are a good idea... it means the existing list/watch utilities could not be used to keep a cache store up to date.

@wojtek-t
Member Author

the issue is that there are lots of informers (at least pods, pvcs, pvs, secrets) that determine the objects a node should be able to see, and all of those factor in to what shows up in the list or watch...

but secrets, pvcs and config maps are kind of independent. What matters for them is only pods. I.e. all of them depend on pods, and nothing else.

I don't think different semantics are a good idea... it means the existing list/watch utilities could not be used to keep a cache store up to date.

I think we can reuse most of them, probably with some small layer on top. But I wanted to avoid describing it as long as not all issues are resolved. We need to solve everything to make it useful.

@liggitt
Member

liggitt commented Mar 13, 2017

but secrets, pvcs and config maps are kind of independent. What matters for them is only pods. I.e. all of them depend on pods, and nothing else.

A node can gain access to a secret for the following reasons:

  • a pod bound to the node references the secret directly (image pull secret, secret volume mount, secret used by other volume mount, env, or envfrom)
  • a pod bound to the node references a PVC which references a PV which references the secret needed to mount the PV (e.g. RBD, Ceph, Gluster)

For the node->pod->pvc->pv->secret case, all items in the chain matter, and changes can add/remove a secret from the set a node is allowed to get.
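The reference graph described here can be illustrated with a toy traversal (all types and the `secretsForNode` helper are hypothetical): the set of secrets a node may read is everything reachable via node -> pod -> secret (direct references) plus node -> pod -> pvc -> pv -> secret, and a change to any link in the chain changes the resulting set.

```go
package main

import "fmt"

// Hypothetical sketch of the reference graph: not real Kubernetes types.
type Pod struct {
	Node    string
	Secrets []string // direct refs (volumes, env, envFrom, image pull secrets)
	PVCs    []string
}

type PVC struct{ PV string }
type PV struct{ Secret string } // e.g. the secret needed to mount RBD/Ceph/Gluster

// secretsForNode walks the graph and returns every secret the node may read.
func secretsForNode(node string, pods []Pod, pvcs map[string]PVC, pvs map[string]PV) map[string]bool {
	allowed := map[string]bool{}
	for _, p := range pods {
		if p.Node != node {
			continue
		}
		for _, s := range p.Secrets {
			allowed[s] = true
		}
		for _, claim := range p.PVCs {
			if pvc, ok := pvcs[claim]; ok {
				if pv, ok := pvs[pvc.PV]; ok && pv.Secret != "" {
					allowed[pv.Secret] = true
				}
			}
		}
	}
	return allowed
}

func main() {
	pods := []Pod{{Node: "n1", Secrets: []string{"s-direct"}, PVCs: []string{"c1"}}}
	pvcs := map[string]PVC{"c1": {PV: "v1"}}
	pvs := map[string]PV{"v1": {Secret: "s-mount"}}
	// Both the directly referenced secret and the one reached via pvc->pv.
	fmt.Println(secretsForNode("n1", pods, pvcs, pvs))
}
```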

@wojtek-t
Member Author

@liggitt - thanks for pointing it out. I wasn't aware of it.

I think I have some ideas for how to solve it - I will try to update the doc today.

@wojtek-t
Member Author

@liggitt @deads2k - I've significantly rewritten the doc and changed the design. The new approach has some assumptions, but should address all the concerns you mentioned above.

PTAL when you find some time.


Fortunately, we can solve it in a much simpler way, with one additional assumption:

1. All object types necessary to determine the in-memory mapping share the same
Contributor

This statement means that we're explicitly giving up on being able to shard etcd for any resource used by a node, and that any new backend would have to provide the same ordering guarantee. That's a significant departure from our previous posture of "we'll shard on resource types first".

Member Author

I agree that this is a limitation. Though it doesn't block sharding completely - it's still possible to shard a bunch of different resource types into separate etcds (e.g. Nodes, Services, Endpoints, ...).

So maybe it's something we can live with? [Obviously, if you have a better suggestion for how to solve it, I'm more than happy to hear it.]

detail hidden in the code (see more details in the next section).


### Implementation details
Contributor

I haven't yet thought about whether I think this works. The "pod references a PVC which references a PV which references a secret" case, where I then update the resource links, worries me for ordering concerns. I think that in such a case, you have to keep getting the secret until the watch finally returns it to you. Even so, a later update to a PV could remove the secret (you should get a DELETE, I think), and a subsequent update to the PV would expose it to you again, and we wouldn't be able to show you the ADD because it could be moving back in time.

Member Author

I haven't yet thought about whether I think this works. The "pod references a PVC which references a PV which references a secret" case, where I then update the resource links, worries me for ordering concerns.

Sorry, I don't understand this part.

I think that in such a case, you have to keep getting the secret until the watch finally returns it to you.

Why? If at the moment with rv=X you call GET for a secret and retrieve its current version (because you already have access to it, so there is a pod that binds it to the node via some path), then when the resource graph reaches rv=X too and this secret changes, the modification will be delivered to the user. Or did I misunderstand your concern?

Even so, a later update to a PV could remove the secret (you should get a DELETE, I think).

I'm not sure it is that crucial. Even if we deliver the delete to the watcher, it may potentially still keep it in memory. So it's more important to disallow it from getting it again.

and a subsequent update to the PV would expose it to you again and we wouldn't be able to show you the ADD because it could be moving back in time.

Are we allowing for such operations? Is every PV/PVC update a valid one? Should it be?

Contributor

Are we allowing for such operations? Is every PV/PVC update a valid one? Should it be?

I think changing a secret reference is a pretty reasonable thing to do and allow. Particularly on a resource where the cost for deletion is high.

Member Author

I think that we can't really drop the assumption that RVs of objects returned via watch are in non-decreasing order, because it would break all the machinery we have.

That said, one potential thing that comes to my mind is to say that we deliver all those ADD and DELETE events resulting from pod/pv/... modifications (which I described above as not delivered), but with the following caveat:

  1. if a pod change at rv=X (or a change of any other object) causes adds or deletes of secrets (or anything else) s1, s2, s3, then we also deliver them as watch events, but all of them with the same RV=X (equal to the one of the changed pod).

However, that has two main implications:

  1. we end up with objects whose RVs are not their real RVs (though I don't think this will have any bad implications)
  2. we end up with (potentially) multiple watch events with the same RV. This is a problem if the watch breaks/closes in between sending them.

A workaround would then be to resend all of them when the watch is renewed.
So when we get a request with rv=x, instead of starting from x, we start the watch by sending all events from x-1, and then the user needs to protect themselves when processing the re-add of an already-added object or the delete of a non-existing one. Though in something like ThreadSafeStore it would just work.
But that is again a change in the semantics.

Another potential modification to the above would be, instead of having a watch event be a secret (or any other object), to change it to actually be a list of secrets. Then the above problem disappears, but we end up with a completely different API. So it probably can't be an extension of the existing WATCH, but should be a different API endpoint (like WATCHLIST, though this name has already been used for a different purpose :)). And we end up in a situation where our machinery (e.g. reflector) again doesn't work (it's possible to make it work, but it requires work).

@deads2k WDYT?
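The "resend from x-1" workaround above relies on the consumer's store being idempotent, as noted for ThreadSafeStore-like caches. A minimal illustration (the `Store` type is hypothetical): re-adding an existing key behaves as an update, and deleting a missing key is a no-op, so replayed events are harmless.

```go
package main

import "fmt"

// Hypothetical idempotent store: duplicate ADDs overwrite, DELETEs of
// missing keys are no-ops, so replaying events from rv=x-1 after a watch
// renewal leaves the cache in the same state.
type Store struct {
	items map[string]int // name -> rv
}

func NewStore() *Store { return &Store{items: map[string]int{}} }

// Add inserts or overwrites; a replayed ADD is just a harmless update.
func (s *Store) Add(name string, rv int) { s.items[name] = rv }

// Delete is a no-op when the key is absent.
func (s *Store) Delete(name string) { delete(s.items, name) }

func main() {
	s := NewStore()
	s.Add("secret-a", 7)
	s.Add("secret-a", 7) // replayed ADD: harmless
	s.Delete("secret-b") // DELETE of a never-seen object: harmless
	fmt.Println(len(s.items)) // 1
}
```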

Member

I think my suggestion about touching objects and letting them go through the watch machinery again is much simpler.

Member

mutating objects in response to ACL changes that change visibility to them does not seem like a good idea (or even possible, depending on the authorization system in use)

@deads2k
Contributor

deads2k commented Mar 20, 2017

we end up with objects whose RVs are not their real RVs (though I don't think this will have any bad implications)

This would concern me. It means that anyone using that resource version won't see the behavior they expect. For instance, a watch with it won't properly watch the particular resource. I think even a get which hits the cache would be suspect too.


1. There can be more objects referenced by a given pod (so we can't send all of
them with a rv corresponding to that pod add/update/delete)
2. If we decide for sending those with their original rv`s, then we could
Member

nit: unmatched backtick


# Detailed design

We will introduce the following ```node selector ``` filtering mechanism:
Member

@lavalamp lavalamp Mar 20, 2017

Alternative proposal:

  1. Set up a controller that actively gives kubelets read permission on objects they need to view.
  2. Make API server filter by permission when a user requests to watch across namespaces.
  3. Kubelets just watch everything.

This approach is extendable to help other users with the same problem.

Member Author

@lavalamp - thanks for the suggestion. However, I think there are a few things we need to think about in this solution. The main one is:
"are we going to reflect this information in etcd?"

If not (e.g. it will be just in-memory), then it might be tricky to synchronize it between apiservers in HA setups.

If it will be in etcd, then:

  1. having "allowed kubelets" listed together with the object doesn't seem like a good idea
  2. if we are not going to store the "newly allowed/disallowed kubelet" together with the "contentless update", then how are we going to derive its purpose in the apiserver?

Maybe these questions are stupid, but I guess I'm not following your thoughts here.

Member Author

And it seems that @liggitt and @deads2k are against having "artificial" resource versions for the purpose of changing ACLs.

Contributor

Recording a third alternative here (the approach discussed by David and Jordan):

  1. pods come from LIST / WATCH on a field selector
  2. all other resources from kubelet are direct GETs, possibly implemented by batch GET on the server, protected by ACL, retry after some window on 403 (maybe double the backoff)
  3. protect ACL via a separate authorizer, use the reference graph
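Step 2 above ("retry after some window on 403, maybe double the backoff") could be sketched like this; the `getWithRetry` helper and the backoff numbers are made up for illustration, and the ACL-controller lag is simulated with a closure instead of a real API server:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical sketch of "direct GET, retry on 403 with doubling backoff":
// the ACL controller may lag behind the scheduler, so a kubelet's first GET
// for a newly referenced secret can legitimately be forbidden for a moment.
func getWithRetry(get func() (status int, body string), base time.Duration, attempts int) (string, error) {
	backoff := base
	for i := 0; i < attempts; i++ {
		status, body := get()
		if status == 200 {
			return body, nil
		}
		if status != 403 {
			return "", fmt.Errorf("unexpected status %d", status)
		}
		time.Sleep(backoff)
		backoff *= 2 // double the backoff, as suggested above
	}
	return "", errors.New("still forbidden after retries")
}

func main() {
	calls := 0
	// Simulated server: forbidden until the ACL controller catches up.
	get := func() (int, string) {
		calls++
		if calls < 3 {
			return 403, ""
		}
		return 200, "secret-payload"
	}
	body, err := getWithRetry(get, time.Millisecond, 5)
	fmt.Println(body, err, calls)
}
```

As the follow-up comment notes, a 403 retry is no different from any other retry loop the kubelet already runs.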

Contributor

Retry on 403 is really no different from any other retry - you can't start unless you can access it, and malicious clients are the ones we reject.

In the future batch GET could be useful, but not strictly necessary here.

Member

However, I think there are a few things we need to think about in this solution. The main one is: "are we going to reflect this information in etcd?"

E.g. RBAC permissions are stored in etcd afaik.

having "allowed kubelets" be listed together with the object doesn't seem like a good idea

We shouldn't add the permissions directly to the secret/configmap/etc objects.

if we are not going to store the "newly allowed/disallowed kubelet" together with the "contentless update", then how are we going to derive its purpose in the apiserver?

The contentless update would just exist for the purpose of getting all existing watches to reevaluate the object. In this model, we've added a visibility-checking filter to the watches. The filter would have to do a consistent read of the permission source.

Recording third alternative here

I could live with that, but fixing visibility for watches seems more useful long-term.

Contributor

So there's a sig-auth discussion about allowing selective retrieval of secrets for the purposes of using them. In that model you'd have to say up front who you are and access them via a subresource (potentially so that you can get redirected to a third party system). So we're already considering decoupling secrets strongly, which makes watches impossible.

Part of the reason for direct access is because it keeps the core simple. Yes, we could do all sorts of crazy filtering, and watch for changing ACL rules, and match those up on the fly to access. The question is - why bother, if we can put in place a reasonably performant "lazy" access?

kubernetes/kubernetes#40403 had gone into some of the options here, but the sheer simplicity of having only two types of filter on LIST/WATCH:

  1. field/label
  2. can you start watching X

means that the scale complexity is at least tractable - you can effectively detect when ACL changes and break the watch (which we need to start doing at some point in the future). But when you bring in linking and relationships, you have to do a much more complex join, and the potential security risks are much higher if you get the join ordering wrong.

That said, if we can define a maximum window for watch breaking, we could in theory do the ACL check on a single item watch every X minutes (at the window for maximum time open) and look at allowing bulk WATCH (watch these keys). Not sure how computationally tractable that is, but it separates WATCH from ACL still, which I think is a net win.

Member

a much more complex join, and the potential security risks are much higher if you get the join ordering wrong.

That's a fair point.

can you start watching X

look at allowing bulk WATCH (watch these keys)

If we want a bulk watch it seems like we'd have to do the ACL check there, in which case doing it up front is redundant.

Contributor

Sorry, by bulk watch I mean the long-standing request of multiplexing watches:

  1. open session
  2. add watch 1
  3. add watch 2
  4. add watch 3
  5. close watch 2
  6. close session

I think that can be efficiently optimized, but each watch has its own ACL check - so we move the distribution of the check from inside the watch to outside the watch. I think we need to close watches when security rules change anyway (it's just a matter of time until that becomes broken), hence why I think we could potentially start down the ACL path, then do watch with ACL checks breaking the watch, then do bulk watch.

But in the meantime that would mean that we have to do GET / poll / cache from the kubelet.
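The session flow outlined above (open session, add/close individual watches, close session) could be modeled with a small multiplexer; the `Session` type and integer channel ids here are illustrative only:

```go
package main

import "fmt"

// Hypothetical multiplexer for the bulk-watch session above: each AddWatch
// gets its own channel id and would carry its own ACL check, independent of
// the other watches sharing the session.
type Session struct {
	nextID  int
	watches map[int]string // channel id -> watched resource
}

func NewSession() *Session { return &Session{watches: map[int]string{}} }

// AddWatch registers a new watch and returns its channel id.
func (s *Session) AddWatch(resource string) int {
	s.nextID++
	s.watches[s.nextID] = resource
	return s.nextID
}

// CloseWatch removes one watch without disturbing the others.
func (s *Session) CloseWatch(id int) { delete(s.watches, id) }

func main() {
	s := NewSession()          // open session
	w1 := s.AddWatch("pods")   // add watch 1
	w2 := s.AddWatch("secrets") // add watch 2
	_ = s.AddWatch("configmaps") // add watch 3
	s.CloseWatch(w2) // close watch 2
	fmt.Println(w1, len(s.watches)) // 1 2
}
```

Because the ACL check sits at AddWatch time (outside the individual watch streams), the server can break a single channel when its rules change without tearing down the whole session.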

Member Author

@smarterclayton - thanks for the explanation. The bulk watch that you suggested above makes a lot of sense - I like the idea.
Do you think that we can work on "bulk watch" in parallel with the other two things? I didn't think carefully about it, but it seems that "bulk watch" is supposed to be some wrapper (wrapper is probably not a good word here) around regular single-resource watches. If so, adding ACL to watches seems kind of independent to me.
Maybe I can help with designing and implementing bulk watch?

type NodeSelector struct {
// TODO: Should this be repeated field to allow for some fancy controllers
// that will have access to multiple nodes?
nodeName string
Member

This is very single-purpose.

of events is always the same (e.g. a very slow lagging watch may not cause dropped
events).

To satisfy the above requirements, we can't really rely only on the existing
Member

This is going to be really hard to get right, I think. An alternative that comes to mind:

  1. When the controller adds or removes a permission to view an object, it "touches" the object, e.g. increments a counter in a new field.
  2. Then the normal filtering mechanism will work with no changes whatsoever.
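The "touch" idea above can be illustrated with a toy controller (the `Object` type and `permissionGen` counter are hypothetical): granting or revoking visibility bumps a counter field on the object, which produces a fresh resourceVersion, so the ordinary watch filtering path re-evaluates the object with no new semantics.

```go
package main

import "fmt"

// Toy sketch of "touch on permission change": an ACL change becomes an
// ordinary object update flowing through the existing watch machinery.
type Object struct {
	Name          string
	RV            int
	PermissionGen int // bumped whenever visibility of the object changes
}

// grant simulates the controller adding a permission: it "touches" the
// object, so storage assigns a fresh, monotonically increasing RV.
func grant(o *Object, nextRV func() int) {
	o.PermissionGen++
	o.RV = nextRV()
}

func main() {
	rv := 100
	nextRV := func() int { rv++; return rv }
	o := &Object{Name: "secret-a", RV: rv}
	grant(o, nextRV)
	// Fresh RV, so existing watch machinery re-filters the object normally.
	fmt.Println(o.RV, o.PermissionGen)
}
```

This is the property the later comments debate: it keeps watch semantics unchanged, at the cost of mutating objects in response to ACL changes.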


One potential solution would be to identify this watch by a resource version
combined from resource versions coming from different object kinds (e.g.
pods have rv = rv1, secrets have rv = rv2, ...). Then we could keep the history
Member

I do not think we should produce a solution that depends on any particular fact about RVs. We don't let users do it, we don't do it in the GC even though it would be very nice, we shouldn't start doing it now.

Member Author

Does this one depend on any fact? I think it doesn't (though this was kind of "considered alternative" and I don't think we should proceed with this one anyway).

}
```

The NodeSelector field will be added to ```ListOptions ``` (next to label & field
Member

Yeah, I really think this is the wrong direction to go, and it's much better to build reusable visibility rules.

Contributor

Yeah I am -1 on this - I want generic visibility, or selective ACL + field rules.

Member Author

@wojtek-t wojtek-t left a comment


```
type BulkListOptions struct {
Selectors []ListSelector
Member Author

Yeah - I really prefer this one. I would like to avoid having a single endpoint supporting different kinds of operations.
Since it will still be alpha, I agree that if we learn that this is painful we will be able to change it.

So, I'm changing it to GetOperations and GetOption for now - we can revisit in the future (while still being in alpha).

}
```

We will create a dedicated admission plugin (or other filter mechanism) that
Member Author

@smarterclayton - I see. So I think that I unconsciously mixed two different things here.

Basically, if we ignore watch for a moment, we still need to be able to answer whether a given GET request is allowed or not. And this is what I was referring to as a "dedicated admission plugin" (though maybe it won't be that dedicated).
The second thing is watch, and for that we also need to do periodic checking (or lazy checking as you suggested, which is a nice optimization). But as I wrote somewhere at the beginning of the doc, this isn't bulk-specific; we need pretty much the same mechanism for a regular watch. If we do it low enough in our machinery, we should be able to simply reuse it.

I just clarified this part of the doc to reflect those thoughts.

will be responsible for detecting whether a given user is allowed to list/watch
objects requested by a given `BulkSelector` (the exact mechanism for it is out
of scope for this document) and either rejecting the whole request or letting it
go. We are not going to support partial rejections (e.g. you cannot proceed with
Member Author

This should be clarified now.


At the high level, the protocol will look like:
1. client opens a new websocket connection to a bulk watch endpoint on the
server via http GET
Member Author

done

At the high level, the protocol will look like:
1. client opens a new websocket connection to a bulk watch endpoint on the
server via http GET
1. this results in creating a single channel that is used only
Member Author

done

We will start with introducing the `getoperation` resource and supporting the
following operation:
```
POST /apis/bulk.k8s.io/v1/getoperation <body defines filtering>
Member

How about "subscriptions" as the resource name, if the only thing one can do is subscribe and unsubscribe?

Member Author

Now when I'm thinking about it, subscriptions would be good for watch, but not necessarily for get, in my opinion.
@smarterclayton - WDYT?

Contributor

If we're going to call it subscriptions, we could just call it watches. However, I think BulkGetOperation is probably appropriate - it's not too generic, and because it supports multiple watches, the incremental "subscribe" style works as well, and leaves the door open. I originally worried about stuttering, but I think given the novelty of this operation BulkGetOperation (for the endpoint) and GetOperation (for the incremental watch) is appropriate.

Member Author

Just to clarify that I understand correctly. You suggest the following change:

s/getoperation/bulkgetoperation/

or did I misunderstand? And if so, should I also change the structs in lines 140 into:
GetOperations -> BulkGetOperations
GetOperation -> BulkGetOperation

?

Contributor

The top level object would be BulkGetOperation, the resource would be bulkgetoperations, and the thing users send over the watch would be GetOperation (singular)

Member Author

Thanks for the clarification - makes perfect sense to me.
Fixed.

# Detailed design

As stated in above requirements, we need to make bulk operations work across
different resource types (e.g. watch pod P and secret S within a single watch
Member

So this means that we probably need to implement this in the aggregation layer. The first step in the implementation could just be issuing a watch for every subscription request. Later, we could optimize to also do bulk watch between aggregator and backing apiserver, but that is pure optimization.

Member Author

Incorporated this comment into "Implementation details" section.

1. to subscribe for a watch of a given (set of) objects, user sends `Watch`
object over the channel; in response a new channel is created and the message
with the channel identifier is sent back to the user (we will be using integers
as channel identifiers).
Member:

TODO: check if integers are mandatory or if we could do something like "ch1", "ch2"...

Member Author:

Added as a TODO for now.

1. to stop watching for a given (set of) objects, user sends `CloseWatch`
object over the channel; in response the corresponding watch is broken and
corresponding channel within websocket is closed
1. once done, user can close the whole websocket connection (this results in
Member:

should/must instead of can?

Member Author:

done
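The subscribe/close lifecycle in the steps above amounts to simple server-side bookkeeping: each `Watch` request allocates a fresh integer channel identifier, and `CloseWatch` closes and forgets the corresponding channel. A hypothetical sketch of that bookkeeping (the type and method names are assumptions, not from the proposal):

```go
// Sketch of the server-side channel table implied by the protocol steps:
// Watch allocates an incrementing integer identifier, CloseWatch breaks
// the watch and closes its channel within the websocket.
package main

import "fmt"

type event struct{ Type, Object string }

type mux struct {
	nextID   int
	channels map[int]chan event
}

func newMux() *mux { return &mux{channels: map[int]chan event{}} }

// subscribe creates a new per-watch channel and returns its identifier,
// which the server would send back to the user.
func (m *mux) subscribe() int {
	m.nextID++
	m.channels[m.nextID] = make(chan event, 16)
	return m.nextID
}

// closeWatch breaks the watch and closes the channel for the given id.
func (m *mux) closeWatch(id int) {
	if ch, ok := m.channels[id]; ok {
		close(ch)
		delete(m.channels, id)
	}
}

func main() {
	m := newMux()
	id1 := m.subscribe() // e.g. watch pod P
	id2 := m.subscribe() // e.g. watch secret S
	m.closeWatch(id1)
	fmt.Println(id1, id2, len(m.channels)) // prints: 1 2 1
}
```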

```
type Request struct {
// Only one of those is set.
Watch Watch
Member:

Watch *Watch (optional)?

Member Author:

Done.

}

// Depending on the request, channel that was created or deleted.
type Response struct {
Member:

TODO: add some way of correlating the responses with the requests.

Member:

My preference is to have an int that the client increments with every request which the server echos back in the response.

Member Author:

Done.

Channel Identifier
}
```
With the following structure we can guarantee that we only send and receive
Member:

s/following/previous/

Member Author:

done

@wojtek-t (Member Author) left a comment:

Comments applied. PTAL

We will start with introducing `getoperation` resource and supporting the
following operation:
```
POST /apis/bulk.k8s.io/v1/getoperation <body defines filtering>
Member Author:

Now that I'm thinking about it, subscriptions would be good for watch, but not necessarily for get, in my opinion.
@smarterclayton - WDYT?


As stated in above requirements, we need to make bulk operations work across
different resource types (e.g. watch pod P and secret S within a single watch
call). Spanning multiple resources, resource types or conditions will be more
and more important for large number of watches. As an example, deferation will
Reviewer:

s/deferation/federation/

Member Author:

done

As a result, we need another API for watch that will also support incremental
subscriptions - it will look as follows:
```
websocket /apis/bulk.k8s.io/v1/getoperation?watch=1
Contributor:

Should be v1alpha1

Member Author:

In line 98 above I have:
"In all text below, we are assuming v1 version of the API, but it will obviously go through alpha and beta stages before (it will start as v1alpha1)."

// Depending on the request, channel that was created or deleted.
type Response struct {
// Propagated from the Request.
RequestId Identified
Contributor:

id and requestID externally, ID and RequestID internally (style)

Member Author:

done

@smarterclayton (Contributor):

A few minor api quibbles, but overall I'm pretty happy with this.

@wojtek-t (Member Author) left a comment:

Thanks Clayton!

PTAL


@smarterclayton (Contributor):

LGTM, squash and I'll merge

@wojtek-t (Member Author):

wojtek-t commented Jul 7, 2017

@smarterclayton - thanks a lot! Commits squashed.

@smarterclayton (Contributor):

One more thing (we can do in a follow up). The web sockets impl must correctly support protocols, and also support base64 token encoding in the protocol (fixes a security hole) kubernetes/kubernetes#47967.

@smarterclayton smarterclayton merged commit 794c44c into kubernetes:master Jul 7, 2017
@wojtek-t wojtek-t deleted the kubelet_watch branch July 3, 2018 12:54
danehans pushed a commit to danehans/community that referenced this pull request Jul 18, 2023
Labels: cncf-cla: yes (the PR's author has signed the CNCF CLA)