-
Notifications
You must be signed in to change notification settings - Fork 554
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add definitions for RDMA controller/cgroup of Linux kernel 4.11 #942
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once this settles down, it would be good to get JSON Schema entries too.
config-linux.md
Outdated
@@ -169,7 +169,7 @@ In addition to any devices configured with this setting, the runtime MUST also s | |||
## <a name="configLinuxControlGroups" />Control groups | |||
|
|||
Also known as cgroups, they are used to restrict resource usage for a container and handle device access. | |||
cgroups provide controls (through controllers) to restrict cpu, memory, IO, pids and network for the container. | |||
cgroups provide controls (through controllers) to restrict cpu, memory, IO, pids, network and rdma resources for the container. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The kernel docs aren't particularly consistent about this, but since RDMA is an acronym for remote direct memory access, I think we want to use RDMA here (and in other places, except for the literal rdma
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok. yes, I will change rdma to RDMA.
config-linux.md
Outdated
|
||
* **`hca_device`** *(string, REQUIRED)* - specifies the device name whose resources limit to be configured | ||
* **`hca_handles`** *(uint32, REQUIRED)* - specifies the maximum number of hca_objects in the cgroup for a specified device | ||
* **`hca_objects`** *(uint32, REQUIRED)* - specifies the maximum number of hca_handles in the cgroup for a specified device |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have “Default is "no limit".” in the Go comments, but REQUIRED here. Reading the kernel docs, I think hca_handles
and hca_objects
should both be OPTIONAL. If you like, you can add a requirement like this
You MUST specify at least one of
hca_handles
orhca_objects
in a given entry, and MAY specify both.
The kernel docs also make a distinction between leaving those unset (no change from the previous limit) or setting them to max
(clearing a possible previous limit). Since the spec supports joining and tweaking existing cgroups, I think we need a way to distinguish between the two cases. The current policy seems to be using signed integers and using -1 for “explicitly remove any limits”.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. I initially had it optional but than made it required to simplify the -1. But I think its fine as it already exist.
I will make both of them as optional.
So when user doesn't want to configure any one of the value, it should set to -1. Correct?
Because at kernel level they are just optional KV pair, but at cgroups spec level it is a structure. So no_limit should be initialized as -1. Right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So when user doesn't want to configure any one of the value, it should set to -1. Correct?
No, leave it unset to not change the limit (nothing written to the control file for this parameter). Set to -1 to clear the limit (max
written to the cgroup control file for this parameter).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so when cgroups.New() is invokes, there are 4 valid values for hca_objects/hca_handles.
- 0 - no allocation possible
- +ve - valid positive value
- -1 to clear the limits
- do not touch the limits
If we make these two parameters optional, what should user set in specs.LinuxRdma.HcaObjects to convey - 'do not touch the limits'.
Setting 0 will disable it (which is valid value).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... what should user set in specs.LinuxRdma.HcaObjects to convey - 'do not touch the limits'.
nil
, which is why we want pointers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok. Sounds good. I will revise the spec.
specs-go/config.go
Outdated
// Maximum number of HCA handles that can be opened. Default is "no limit". | ||
HcaHandles uint32 `json:"hca_handles"` | ||
// Maximum number of HCA objects that can be created. Default is "no limit". | ||
HcaObjects uint32 `json:"hca_objects"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are zero values allowed (e.g. for “you can't use any HCA handles”)? If not, we should say so in the Markdown spec. If so, we should make these omitempty
pointers so we can distinguish between “set to zero” and “unset”.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes. setting zero is allowed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also on the “don't allow no-op noise” front would be requiring at least one of hca_handles
or hca_objects
to be set. If you add that contraint, no-op configs like "rdma": [{"hca_device": "m1x5_1"}]
become illegal. I'm not strongly against no-op noise, but I am weakly against it to maintain consistency with the existing block I/O pattern.
config-linux.md
Outdated
|
||
The following parameters can be specified to set up the controller: | ||
|
||
* **`hca_device`** *(string, OPTIONAL)* - specifies the device name whose resources limit to be configured |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather keep this REQUIRED, because there's no point in supporting "rdma": [{}]
. It also lets you drop the later line about “You must specify a valid hca_device
…”.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok. so if I understand correctly, "rdma": [{}] can be treated as illegal config.
"rdma": [{"hca_device": "m1x5_1"}] is valid. And other two parameters are optional anyway.
So you are saying if end user passes hca_device only then we create the rdma controller?
Won't Docker or other engines create all empty cgroup controllers as "rdma": [{}]?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so if I understand correctly,
"rdma": [{}]
can be treated as illegal config
That's what you'd get if you REQUIRED hca_device
(which I think is a good idea for consistency with rlimits[].type
).
"rdma": [{"hca_device": "m1x5_1"}]
is valid…
As you have it now. But if you requiring at least one of hca_handles
or hca_objects
to be set, then that would be invalid too (and we'd be consistent with the current block I/O approach).
So you are saying if end user passes
hca_device
only then we create the rdma controller? Won't Docker or other engines create all empty cgroup controllers as"rdma": [{}]
?
If you have rdma
unset or set to [{}]
(or [{"hca_device": "m1x5_1"}]
, etc.), you're not adding new limits. The spec position on that is:
Runtimes MAY attach the container process to additional cgroup controllers beyond those necessary to fulfill the
resources
settings.
which means that, for a config that does not adjust limits via linux.resources.rdma
, the runtime can attach the container process to the rdma controller at cgroupsPath
or not as it sees fit. See #493, #834, and opencontainers/runc#861 for more background.
In case it helps clarify, I think this section of text should look like:
Each entry has the following structure:
* **`hca_device`** *(string, REQUIRED)* - specifies the device name whose resources limit to be configured
* **`hca_handles`** *(uint32, OPTIONAL)* - specifies the maximum number of hca_objects in the cgroup for a specified device
* **`hca_objects`** *(uint32, OPTIONAL)* - specifies the maximum number of hca_handles in the cgroup for a specified device
You MUST specify at least one of `hca_handles` or `hca_objects` in a given entry, and MAY specify both.
To close another no-op loophole we could follow this and require rdma
to, when set, contain at least one entry. That would be new wording though, since the spec doesn't currently do that for any OPTIONAL property.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
section of text that you described looks good to me. If we follow the text you described, it closes the no-op loophole too, isn't it for rdma controller?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we follow the text you described, it closes the no-op loophole too, isn't it for rdma controller?
You could still have "rdma": []
unless you REQUIRE the array to have at least one entry. We don't close that loophole for any other OTIONAL properties at the moment (e.g. "mounts": []
is ok), so we'd need maintainer input on whether it was worth setting a new precedent.
Another issue is the possibility of duplicate hca_device
entries. My preferred approach to tgat would be to use an object:
"rdma": {
"m1x5_1": {
"hca_handles": 3
}
}
My attempt to do that for rlimits was rejected (#583), but I'm not clear on why. It's possible that maintainers would support an object for this new property.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is case (1), where runtimes MAY create the cgroup (if necessary) and attach the container process to the cgroup. Apparently containerd/cgroups decides to do that. But runtimes don't have to. And not creating the cgroup (and not attaching the container process to the non-existent cgroup) is the lazy-cgroups approach discussed in opencontainers/runc#861. If containerd/cgroups decided to implement the lazy-cgroups approach on nil, their Create would instead look like:
I agree to what you described.
However current scheme (1) of runc-spec and its implementation sounds reasonble to me and fits the requirements of rdma too without #861.
When opencontainers/runc#861 scheme is adopted, rdma could possibly follow that.
In order to make progress, I propose to move forward with currently defined scheme and implementation of (1), (2) and (3).
(2) and (3) are already matches generic runc-spec definition.
(1) is implemented by all cgroups (not just pids) in https://github.com/containerd/cgroups.
If (1) needs change in runc-spec definition or wording, it is going to be different PR which will affect other resources too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If (1) needs change in runc-spec definition or wording, it is going to be different PR which will affect other resources too.
That makes sense to me. I had been under the impression (based on this comment), that you intended to start requiring cgroup creation (if necessary) and joining in the "rmda": {}
case, which is not a requirement that the current spec has for other controllers. Currently the spec, as I read it, allows runtimes to decide whether they want to create/join or not for both the “unset” (e.g. "resources": {}
) and “set empty” cases (e.g. "resources": {"memory": {}}
) because of this clause. If you're ok using that same logic (for now) for rmda
, then I'm back to preferring the wording I floated here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense to me. I had been under the impression (based on this comment), that you intended to start requiring cgroup creation (if necessary) and joining in the "rmda": {} case, which is not a requirement that the current spec has for other controllers. Currently the spec, as I read it, allows runtimes to decide whether they want to create/join or not for both the “unset” (e.g. "resources": {}) and “set empty” cases (e.g. "resources": {"memory": {}}) because of this clause. If you're ok using that same logic (for now) for rmda, then I'm back to preferring the wording I floated here.
Based on the clause you mention https://github.com/opencontainers/runtime-spec/blame/v1.0.1/config-linux.md#L197 and based on implementation at https://github.com/containerd/cgroups, which follows the MAY part, -> "rdma": [{}] and "rdma": [{"hca_device": "m1x5_1"}] is also valid similar to priorities array of objects in Network cgroup as better example than Pids. For rdma instead of array of objects, as you suggested, we will have map as
"rdma": {
"m1x5_1": {
"hca_handles": 3
}
}
Are you ok with that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For rdma instead of array of objects, as you suggested, we will have map as
"rdma": { "m1x5_1": { "hca_handles": 3 } }
Are you ok with that?
Ah, right, that would change the wording. I'm in favor of the object/map approach, yes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok. Great. So let me revise the spec to have treat them as optional and as map instead of array.
specs-go/config.go
Outdated
// LinuxRdma for Linux cgroup 'rdma' resource management (Linux 4.11) | ||
type LinuxRdma struct { | ||
// Hca device name whose resources to be restricted | ||
HcaDevice string `json:"hca_device,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you go back to REQUIRing hca_device
, you'll want to drop omitempty
here.
5b2e096
to
a1acc47
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please review the updated changes. Reference implementation also updated to match these updates.
Once spec is approved, will send PR for https://github.com/paravmellanox/cgroups/blob/master/rdma.go
config-linux.md
Outdated
```json | ||
"rdma": { | ||
"limits": [ | ||
"mlx5_1": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not valid JSON. I think you want to drop limits
and just make rdma
an object with device-name keys and handles/objects-limit values.
specs-go/config.go
Outdated
type LinuxRdma struct { | ||
// Limits are a set of key value pairs that define RDMA resource limits, | ||
// where the key is device name and value is resource limits. | ||
Limits map[string]LinuxRdmaLimit `json:"limits,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is whatI think you want for rdma
; no need for a Limits
. Just use:
type LinuxRdma map[string]LinuxRdmaLimit
Or skip LinuxRdma entirely and just inline
map[string]LinuxRdmaLimitin
LinuxResources`.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or skip LinuxRdma entirely and just inlinemap[string]LinuxRdmaLimitinLinuxResources`.
Simplification looks fine to me. Sending updates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you please approve the changes if they look good now, we need to get this going as users are requesting it.
Hi @wking , any more changes to be done or we are good to go. I believe we addressed all the comments.
We like to get this small part integrated so that we can make this available to users.
@wking any updates? Do you need any more information for this PR? |
On Tue, Jan 23, 2018 at 02:16:29PM +0000, Parav Pandit wrote:
@wking any updates?
I've filed paravmellanox/runtime-spec#1 with some suggested changes.
If those look good, feel free to squash them into your PR.
|
With paravmellanox#1 merged, the new content looks good to me. You might want to squash this PR down to a single commit. And passing tests are blocked on #945. Any maintainers want to look this over? |
I am not sure that if I squash and force push the new squashed commit, will the discussions happened here will be lost or not. |
On Wed, Jan 24, 2018 at 11:28:54PM +0000, Parav Pandit wrote:
I am not sure that if I squash and force push the new squashed
commit, will the discussions happened here will be lost or not.
Whatever happens here, discussion in the PRs will be preserved
(possibly under “Show outdated” accordions like [1]).
I was hoping that If maintainer does squash merge, than discussion
is preserved and final merged code also has single commit.
I don't do much squash-merging via GitHub's UI, so I'm not sure. But
one commit of mine that was squash-merged didn't end up with any
useful commit message [2] despite the original commits having commit
messages [3]. If you are comfortable squashing, I suggest you do so
and write up notes about alternatives you considered and why you
eventually went with what you did for the squashed commit's message.
That way you don't have to rely on the squash-merge to get something
sensible.
I little new in this area, so please guild me.
And I'm not a maintainer, so this is all just my 2¢. Do whatever
makes sense to you.
[1]: #942 (comment)
[2]: opencontainers/runtime-tools@7a4cb36
[3]: opencontainers/runtime-tools@a2725b4
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
846209f looks good to me. We'll see what the maintainers think ;).
@wking , thanks a lot for the inputs, fixup and additional JSON definitions. |
Hi, @vishh @vbatts @philips @mrunalp @hqhq @dqminh @tianon @crosbymichael @caniszczyk Can you please review this PR? @wking and I have went through several iterations. Feel that this is good start now and like to merge this PR at earliest to provide this feature to users waiting for it. |
config-linux.md
Outdated
The name of the device to limit is the entry key. | ||
Entry values are objects with the following properties: | ||
|
||
* **`hca_handles`** *(uint32, OPTIONAL)* - specifies the maximum number of hca_objects in the cgroup |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should be no _
in the json properties
Overall, this looks good to me once the |
846209f
to
75c847a
Compare
Removed the underscore and updated the PR. |
Why did you close? |
Closed by mistake. I am sorry. Hi @crosbymichael, would you please marked as reviewed/approved? |
c5cc655
to
5d9bbad
Compare
Got closed during rebase process, must be something wrong I did from the GUI interface. Reopening it again. |
After the rebase, travis-ci still seems to fail while executing: ERROR: invalid document-file path: stat test/config/good/invalid-json.json: no such file or directory testing test/config/bad/linux-rdma.json @wking or @crosbymichael, how to check which out of the 2 check failed the build and where to see what caused the failure if it is due to error in linux-rdma.json |
Looks like you haven't pushed the cherry-pick: $ git fetch origin
$ git log -1 --oneline --decorate origin/pr/942
5d9bbad (origin/pr/942, origin/pr/936) config: Dedent root paragraphs, since they aren't a list entry |
I followed below steps, likely not optimal.
This updated the PR with 4 patches (patches after rebase and new patch). As a result of this push command, travis-ci build triggered. schema/test/config/good directory contains only 3 files. minimal*.json and spec-example.json. |
On Fri, Feb 16, 2018 at 12:56:45PM -0800, Parav Pandit wrote:
3. git am <rdma-cgroup.patch>
Without the content of the patch, it's hard to know what this is
doing. It'd easier to tell with @crosbymichael's cherry-pick
suggestion [1]. Filling in a bit more detail:
$ git remote -v
origin git://github.com/opencontainers/runtime-spec.git (fetch)
origin git://github.com/opencontainers/runtime-spec.git (push)
paravmellanox https://github.com/paravmellanox/runtime-spec.git (fetch)
paravmellanox https://github.com/paravmellanox/runtime-spec.git (push)
I'm guessing there, you may be using other remote names for the
opencontainers repo and/or your fork. Just substitute your names
below. The scheme (git:// vs. https:// vs. ssh://) doesn't matter
either, as long as you have something pushable for paravmellanox's
push entry.
$ git fetch origin
$ git checkout master
$ git reset --hard origin/master
$ git cherry-pick 5f77847
$ git push -f paravmellanox master
[1]: #942 (comment)
|
c730cb3
to
f25d7b4
Compare
Hi @wking, I did follow the steps as described. However I continue to see the error: ERROR: invalid document-file path: stat test/config/good/invalid-json.json: no such file or directory As I mentioned, this file doesn't exist in https://github.com/opencontainers/runtime-spec/tree/master/schema/test/config/good directory where Travis-CI is looking for a file. Or may be the error is actually below one? testing test/config/bad/linux-rdma.json |
Hi @wking, @crosbymichael, Is there anything I can do to remove the bump of #947? I do not have much know-how of Travis-CI though, but I can attempt. |
On Thu, Feb 22, 2018 at 12:09:29AM +0000, Parav Pandit wrote:
Hi @wking, @crosbymichael, Is there anything I can do to remove the
bump of #947?
You can avoid the problem by dropping the schema/test/config/bad/…
additions. But #947 is a four-character fix, I'd just wait for the
maintainers to get around to it.
|
f25d7b4
to
51c4bec
Compare
In order to make progress to make use of this functionality, I updated PR to drop schema/test/config/bad/linux-rdma.json for now. @crosbymichael and others please continue to review and merge the PR. |
config-linux.md
Outdated
#### Example | ||
|
||
```json | ||
"rdmaLimits": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a reason why you named this rdmaLimits
instead of just rdma
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a reason why you named this
rdmaLimits
instead of justrdma
?
I like rdma
better, although I don't care much. Previously, @paravmellanox cited hugepageLimits
for precedent, although the spec obviously also contains limiting properties without a Limits
suffix.
#947 landed today, so you can restore the bad example. |
Linux kernel 4.11 adds support for RDMA cgroup resource controller. This allows limiting maximum number of open hca_handle and maximum number of hca_objects which can be created by processes. config-linux: Add documentation for Linux RDMA cgroup Add documentation, example and link to kernel documentation for Linux RDMA cgroup. additionalProperties is defined for the JSON Schema draft-04 in [1] with clearer documentation in draft-07 [2]. It is supportd by gojsonschema since xeipuuv/gojsonschema@0572d9d (added additionalProperties with inner schema, 2013-06-21). [1]: https://tools.ietf.org/html/draft-fge-json-schema-validation-00#section-5.4.4 [2]: https://tools.ietf.org/html/draft-handrews-json-schema-validation-00#section-6.5.6 Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: W. Trevor King <wking@tremily.us>
51c4bec
to
2e241f7
Compare
both changes done. @wking , @crosbymichael |
@crosbymichael, who else from the group should be reviewing these changes? |
ping @opencontainers/runtime-spec-maintainers |
This request is for the inclusion of RDMA controller/cgroup which is added in Linux kernel 4.11.
Background:
RDMA Networking adapters provide low latency, direct access to hardware (HCA) to user space applications to send/receive messages and data in a clustered systems.
With latest HCAs, application to application level messages delivery latency can be as low as < 1usec.
RDMA kernel subsystem achieves this using RDMA resources such as QPs, MRs CQs objects.
RDMA cgroup allows limiting resources of the such RDMA HCA networking adapters.
RDMA resource creations such as hca_handle and hca_objects must be controlled so that certain set of processes doesn't take away all hardware resources.
RDMA controller allows limiting such HCA resource on per HCA basis.
Example implementation of this controller is already available below.
paravmellanox/cgroups@469ac1c