kernel: Add an "identifier" to AppId #1566

Merged Mar 2, 2020 (14 commits)

Conversation

@bradjc (Contributor) commented Jan 27, 2020

Pull Request Overview

Currently a capsule can store an AppId to refer to a process, and there is no real mechanism for ensuring that that AppId is still valid. Today in Tock we are OK with this, since we only create processes once, they don't restart, and we don't move them around in the processes array. However, as process management becomes more sophisticated, the idea that an AppId is always valid will no longer be true. Since AppId only stores the index of the relevant process in that array, if a process restarts, or if a process is removed and a new one is added in the same slot, then the stored AppId will appear valid even though it actually refers to a different process.

To enable verifying that AppIds are still valid, this PR adds a new field (identifier) to the AppId object. This allows AppIds to be used as before (i.e. they are still copyable), but now operations using AppId can verify that the identifier in the AppId matches the identifier assigned to the process at a particular index.

AppId retains the index field, which still corresponds to the index in the processes array (at least while the AppId is valid). This preserves direct lookups and simplifies the IPC implementation.

Changes to AppId API

The existing .idx() function has been removed, and .index() and .id() have been added. .index() checks that the AppId is still valid before returning the index in the array of the corresponding app. .id() just returns the identifier, which can be passed to userspace or over a UART channel, for example.

These changes resulted in a couple of changes to capsules that need usize identifiers for apps; those capsules now must call .id().
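
Roughly, the new shape of AppId looks like this (a simplified sketch; the real struct also carries a reference to the kernel so that .index() can be validated against the processes array, which is elided here):

    // Simplified sketch of the new AppId. The actual struct also holds a
    // reference to the Kernel so that index() can be validated against the
    // processes array; that validation is elided here.
    #[derive(Clone, Copy)]
    pub struct AppId {
        /// Slot in the kernel's processes array; internal bookkeeping only.
        index: usize,
        /// Stable identifier for the process this AppId was created for.
        identifier: usize,
    }

    impl AppId {
        /// Returns the array index, or None if this AppId no longer refers
        /// to a valid process.
        pub fn index(&self) -> Option<usize> {
            // For now the index is always considered valid; eventually this
            // will check the kernel's processes array.
            Some(self.index)
        }

        /// Returns the bare identifier, suitable for passing to userspace
        /// or printing over UART. Succeeds even if the process is gone.
        pub fn id(&self) -> usize {
            self.identifier
        }
    }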

Changes to kernel APIs

The major change in the kernel (beyond AppId) is that functions which used to take an index now just take an AppId. Mostly this is a minor change, except for kernel.process_map_or(). This function runs a closure on a specific app if it exists, and it uses both the index and the identifier to verify the app is still valid.
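
A rough sketch of the updated lookup (illustrative; the real method lives on the Kernel struct and the exact body differs):

    // Illustrative sketch: the closure only runs if the slot at the AppId's
    // index holds a process whose identifier matches the one stored in the
    // AppId.
    crate fn process_map_or<F, R>(&self, default: R, appid: AppId, closure: F) -> R
    where
        F: FnOnce(&dyn ProcessType) -> R,
    {
        match self.processes.get(appid.index) {
            Some(Some(process)) if process.appid().id() == appid.id() => closure(*process),
            _ => default,
        }
    }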

Beyond that, IPC had to be updated quite a bit to handle the changes. I also took the opportunity to update how grant iterators are implemented, since this PR reduces the reliance on an app's identifier being its index in the processes array.

How does this fix the capsule problem?

A capsule can still store an AppId, and that AppId can still become invalid. However, the main use of an AppId between a capsule and the kernel is to call grant.enter(appid). Without this PR, if appid refers to an old process and a new process has started in the same array index, grant.enter(appid) would succeed (after initializing a new grant region for the new process). Now, when grant.enter() calls kernel.process_map_or(appid), an error will be returned, which will bubble back to the capsule.
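
From the capsule side the failure now surfaces as an error result; something like this (grant type and field names are hypothetical):

    // Hypothetical capsule-side handling: entering the grant of a stale
    // AppId now returns Err instead of silently creating a grant region
    // for whatever new process occupies the same slot.
    match self.apps.enter(stored_appid, |app, _| {
        app.pending_command = true;
    }) {
        Ok(()) => { /* stored_appid still refers to the same process */ }
        Err(_) => {
            // Stale AppId: the process restarted or was replaced. Drop it.
            self.current_app.clear();
        }
    }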

What does this PR not do?

This PR just improves the mechanism for detecting app changes, but does not change the policy on how app identifiers are assigned. Identifiers remain equivalent to their index in the array.

I iterated on the design in this PR several times, and the changes became substantial enough without also changing how app identifiers are assigned.

This is part of #1082.

Testing Strategy

This pull request was tested by running the IPC apps. More testing is needed.

TODO or Help Wanted

n/a

Documentation Updated

  • Updated the relevant files in /docs, or no updates are required.

Formatting

  • Ran make formatall.

/// not exist (i.e. it is `None` in the `processes` array, or has stopped,
/// will be returned. Otherwise the closure will be executed and passed a
/// reference to the process.
crate fn process_map_or<F, R>(&self, default: R, process_index: usize, closure: F) -> R
Contributor:

Is it true that if a process has been stopped this will return default? From looking through, it seems that a stopped process still has a valid and unchanged AppId, and this function does not seem to call anything that checks whether a process has been stopped.

Contributor Author:

Ah ok, good catch! This raises a case I hadn't really considered...what should happen if an app is in the StoppedFaulted state (or perhaps an Ended state, if one were to exist)? The process is still there, and it's the same process, so its identifier probably shouldn't change, and certain capsules may want to be able to see that it exists (like the process console). However, other capsules should be able to know it is no longer active (and never will be again). Perhaps process_map_or() should not change, but users of process_map_or() may need to check that the process is still active. In particular, grant.enter() shouldn't succeed on an ended process.

In any case I will update the comment.

Contributor Author:

I updated the comment.

@bradjc (Contributor Author) commented Jan 31, 2020

Added a commit to address the issue of what happens with a "stopped" process, in particular a process that cannot be executed again. I call this an "inactive" process; this can happen if the process crashes and the kernel does not restart it, or (hypothetically) if the process exits itself and the kernel does not free its Process struct.

The issue in this case is to ensure that capsules, or anything that holds an AppId pointing to that app, can learn that the app is inactive, will not run again, and should not have state held for it. Since the process is still valid in the processes array, calling kernel.process_map_or() will find a valid match, so that will not signal to the caller that the process is inactive. I thought about requiring callers to check if the process is active, but that seems easy to forget to do.

Instead, I went through all of the functions in the ProcessType trait and made it clear that any function that attempts to modify a process in any way will fail if the process is inactive. That allows something to still query the process (say, to access its debug information, or for the process console to show that the app exists and is inactive), but any attempt by an AppId holder to modify the process will cause an error to propagate.

The canonical example of this is when accessing a grant. Capsules access a specific grant by calling grant.enter(appid). If the appid is completely invalid, then when the grant module tries to find the process, an error will be returned (as described above). However, if the appid is valid but the process is inactive, the grant module will find the process, but when it tries to manipulate the grant region an error will occur, and that error will propagate back to the capsule.
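
The guard added to the mutating ProcessType methods looks roughly like this (a sketch; is_active() stands in for the internal state check, and set_yielded_state() is just one example of an affected method):

    // Sketch of the guard pattern: mutating operations bail out early on an
    // inactive process. is_active() is an assumed internal helper.
    fn set_yielded_state(&self) {
        // Do not modify an inactive process.
        if !self.is_active() {
            return;
        }
        self.state.set(State::Yielded);
    }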

@bradjc force-pushed the appid-no-index-enter branch from 70d4a4c to b046c51 on January 31, 2020 21:14
@hudson-ayers (Contributor) previously approved these changes Feb 4, 2020, leaving a comment:

I think this looks good, and is a great step towards better handling restarted apps.

That said, a few thoughts I have:

  1. Whatever solution we choose for assigning the "identifier" field, we will probably need to be careful to avoid reuse of identifiers.

  2. It seems like this could potentially be implemented without storing the index field at all. This would require restructuring IPC to store tuples of (callbacks/identifiers) and (AppSlices/identifiers), and replacing the few places where we directly index into the process array with searches for the identifier. Given that no board currently supports more than 20 processes, this is not a huge overhead, though I realize this could be a limiting design decision if that were to change. I think the main overhead is that you would have to iterate through the list on every call to .enter(). The upside is that the resulting design is less confusing and entirely decouples processes from how the kernel stores them.

impl PartialEq for AppId {
    fn eq(&self, other: &AppId) -> bool {
-       self.idx == other.idx
+       self.index == other.index && self.identifier == other.identifier
Member:

Should this just compare identifier?

Semantically, I think an app reference is a reference to the same app whether or not that app is alive or dead.

Contributor Author:

I think in the current implementation it is probably more correct to compare both fields. The reason is that using an AppId with a valid identifier but the wrong index will not work, but using one with both fields correct will. It would be a little strange to have AppIdA == AppIdB, but then using AppIdA fails while AppIdB works.

At one point I did implement the logic for an AppId to automatically update its index, but since processes cannot currently move, that seemed like unnecessary complexity right now.

Member:

Hm, I'm not sure I'm convinced. The opposite argument here is that the .id() method is provided to

retrieve a simple identifier for the process that can be communicated over a UART bus or syscall interface. This call is guaranteed to return a suitable identifier for the AppId

So now we have a situation where it's possible that AppIdA.id() == AppIdB.id() but AppIdA != AppIdB.

I feel like the identifier is the identifier of an app, and the index is just an internal optimization that saves an otherwise expensive lookup.

What situation would arise where the identifiers match but the indexes don't?

Contributor Author:

What situation would arise where the identifiers match but the indexes don't?

The kernel moves the application to a different index in the processes array. But there isn't any functionality to do this, so it's a case that doesn't matter.

The fact that processes cannot move slots in the array (since there is currently no code that will do that) can support either implementation of eq. However, since using just identifier would be required if there were ever a reason to move processes around in the array, that implementation makes more sense. I will update.
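
With that change, the comparison reduces to identifier-only (matching the commit that followed):

    impl PartialEq for AppId {
        fn eq(&self, other: &AppId) -> bool {
            // index is internal bookkeeping; identity is the identifier.
            self.identifier == other.identifier
        }
    }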

/// originally referred to, even if that app no longer exists. For example,
/// the app may have restarted, or may have been ended or removed by the
/// kernel. Therefore, calling `id()` is _not_ a valid way to check
/// that an application still exists.
Member:

Thinking about global uniqueness of identifiers: since we're proposing this as a mechanism that labels a process possibly off-core (here tockloader is given as an example, but one could also imagine using it as an RPC identifier), I think part of the future design discussion for the actual identifiers should include whether we have an entropy source of some kind in the kernel to try to ensure that identifiers are unique even across reboots.

Contributor:

[...] whether we have an entropy source of some kind in the kernel to try to ensure that identifiers are unique even across reboots

Note that this would require more bits than usize from a cryptographic perspective, if you really want to avoid collisions in all realistic scenarios (even considering that embedded CPUs are "slow" enough that one can only generate so many identifiers per second). Within a set of random 32-bit identifiers, a collision becomes likely after only around 2^16 identifiers.

I don't have much context on the use cases, so I guess it depends whether uniqueness is a nice-to-have property for e.g. debugging, or if a collision is to be avoided at all costs.
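
For concreteness, the birthday-bound arithmetic behind that 2^16 figure:

    p \approx 1 - e^{-n^2/(2N)}; with N = 2^{32} possible identifiers and
    n = 2^{16} draws, p \approx 1 - e^{-1/2} \approx 0.39

So after roughly 65,000 random 32-bit identifiers a collision is quite likely; a 64-bit space pushes the same risk out to around 2^32 identifiers.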

@bradjc (Contributor Author) commented Feb 5, 2020

1. Whatever solution we choose for assigning the "identifier" field, we will probably need to be careful to avoid reuse of identifiers.

Yes. My current simple implementation is just to keep counting up, and hope we don't hit 2^32. But we can discuss this more on the next PR.
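
A minimal sketch of that policy (struct and field names assumed, not the actual kernel code):

    use core::cell::Cell;

    // Minimal sketch of the "keep counting up" policy: identifiers are
    // handed out monotonically and never reused unless the counter wraps.
    pub struct IdentifierCounter {
        next_identifier: Cell<usize>,
    }

    impl IdentifierCounter {
        pub fn create_identifier(&self) -> usize {
            let id = self.next_identifier.get();
            self.next_identifier.set(id.wrapping_add(1));
            id
        }
    }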

2. It seems like this could potentially be implemented without storing the `index` field at all. This would require restructuring IPC to store tuples of (callbacks/identifiers) and (AppSlices/identifiers), and replacing the few places where we directly index into the process array with searches for the identifier. Given that no board currently supports more than 20 processes, this is not a huge overhead, though I realize this could be a limiting design decision if that were to change. I think the main overhead is that you would have to iterate through the list on every call to `.enter()`. The upside is that the resulting design is less confusing and entirely decouples processes from how the kernel stores them.

Yes, definitely, and my original implementation did just that, in two different forms. The first stored tuples of <identifier, callback/slice>; it definitely complicates IPC, and while conceptually it is pretty straightforward, the Rust code gets much more complicated because there is a lot more checking required, with more map()s, etc. It eventually became too much of a burden to actually implement in Rust.

My second version stored an array of identifiers, and used the index of the identifier in that array as the lookup into the arrays for callbacks and app slices. This was much easier to actually implement, and worked, but was a little clumsy, since there are these extra arrays and a scan of those arrays. Plus, process_map_or() then had to be a scan, and that could potentially lead to nested scans if implementers aren't careful. Complicating IPC and forcing an array scan on every process lookup seemed like a rather large cost just to eliminate the index that we already have in AppId, hence I settled on the design you see in this PR.
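
For comparison, the index-free lookup from that second approach would look roughly like this (a reconstructed sketch, not the actual code): every lookup becomes an O(n) scan for the identifier.

    // Reconstructed sketch of the rejected index-free design: find a
    // process by scanning the processes array for a matching identifier.
    crate fn process_map_or<F, R>(&self, default: R, identifier: usize, closure: F) -> R
    where
        F: FnOnce(&dyn ProcessType) -> R,
    {
        for process in self.processes.iter() {
            if let Some(p) = process {
                if p.appid().id() == identifier {
                    return closure(*p);
                }
            }
        }
        default
    }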

@hudson-ayers (Contributor):

I came across a pretty obscure piece of code affected by this change while looking into modifying the ProcessConsole. Currently, the ProcessConsole lists processes in the form:

 PID    Name                Quanta  Syscalls  Dropped Callbacks  Restarts    State
  00    hello3                 298        60                  0         0  Running
  01    hello2                 280        54                  0         0  Running
  02    hello_loop             262        99                  0         0  Running

where the PID is not actually the index of the process in the processes array, but the index of the iterator through the processes array (as output by .iter().enumerate() in process_each_capability() in grant.rs). This means that if there are empty slots in the processes array prior to a process, the PID assigned by the process console will match neither the identifier of the process nor its index in the processes array.

This is unchanged by this PR, but it would probably make more sense to have the ProcessConsole print the actual identifier of the process now that we have one.

@bradjc (Contributor Author) commented Feb 18, 2020

I've tested this with various apps on Hail, including the tutorials/05 IPC apps, the Hail test app, and other collections of apps. I also tried these five:

 PID    Name                Quanta  Syscalls  Dropped Callbacks  Restarts    State
  00	hello_loop               0       359                  0         0  Yielded
  01	button_print             0         7                  0         0  Yielded
  02	adc_continuous           0       675                  0         0  Yielded
  03	rot13_client             0       829                  0         0  Yielded
  04	org.tockos.ex.rot13      0       131                  0         0  Yielded

and all 5 seem to work.

@@ -769,6 +829,11 @@ impl<C: Chip> ProcessType for Process<'a, C> {
    fn setup_mpu(&self) {
+       // Do not modify an inactive process.
Member:

I'm curious about the proliferation of these:

  • Are all the functions with these checks actually getting called on stopped processes?
  • Should that be a bug rather than a silent ignore?
  • How do we make sure we don't miss these in future process management stuff?

Contributor Author:

* Are all the functions with these checks actually getting called on stopped processes?

Well, the one that initiated this change is: notably, a capsule can hold a copy of an AppId for a stopped process and then use it to enter its grant region. The others are not, as most are called as a result of the process being scheduled, which won't happen since it is stopped.

* Should that be a bug rather than a silent ignore?

In the case of entering a grant region it returns an error to the capsule.

For the others, maybe? It adds complexity in the common case. Maybe the better approach (at least for now) is to make the documentation clear that calling the functions without return codes is only valid when the process is active.

* How do we make sure we don't miss these in future process management stuff?

I'm not sure. Generally, functions with return values need to handle this, as those return values likely are passed (eventually) outside of the kernel crate. Functions without return values tend to be only called inside the kernel.

This is also complicated a little by code like the process console which does obtain references to ProcessType objects. But that is really a separate issue.

bradjc and others added 14 commits February 25, 2020 14:59
  • AppId.idx used to refer to the index of the app in the processes array. This enforces a very strong association between the two, which hurts our ability to change AppId in the future to handle apps that restart, are moved, or are removed. This removes that direct association in code outside of sched.rs. Since at this point AppId.idx() is still the index of the process in the array, we go ahead and keep using it. But code that does not manage the processes array should not use it as a direct index.

  • The IPC structure was based on application IDs starting at 0 and being able to correspond to array indices. This adds a layer of indirection.

  • No more correlation between app id and index in the array.

  • This change allows AppId to both: use an index into the array to ensure fast access to a process, since it is stored at a particular location in the array; and express that the AppId is no longer valid by returning `None`. So far the index is always considered valid (hence the return is always `Some(self.idx)`), but eventually we will be able to check if the index is valid.

  • Allows the kernel struct more access to the AppId to make lookups more optimized. Also adds documentation.

  • If a process is in an inactive state (e.g. it has faulted and the kernel stopped it) then many of the operations in the `ProcessType` trait should not be functional and instead should return errors. This makes those functions return errors or explicitly say that they are not valid. This is needed to help signal to capsules or other users of a process that the process is not active and will not run again. For example, if a capsule tries to call `.enter()` on a grant with an inactive process, the `Kernel` object in sched.rs will find the matching process, but trying to access one of its grant pointers will fail, since it is not valid to set up or access a grant on an inactive process. This error will then get back to the capsule.

  • Don't stop at processes which don't have the grant set up.

  • Since index is an internal bookkeeping item, we do not use it to compare AppIds. Note: this works because processes cannot move in the processes array. If we ever want to move processes, we will need to update AppId to match.

  • Note that several functions can only be called on an active process. These functions are only called from the kernel scheduler, and will not currently be called unless the kernel can schedule the process.
@bradjc force-pushed the appid-no-index-enter branch from 8d4cdf5 to 21c1f85 on February 25, 2020 20:00
@bradjc (Contributor Author) commented Feb 25, 2020

I rebased, and updated the AppId equality check, and undid some of the changes to the ProcessType trait and instead added a little more documentation for some methods.

@ppannuto (Member) left a comment:

bors d+

I think this looks good to me at this point. The inactive process stuff is probably still imperfect, but definitely best-of-our-current-world.

@bors bot commented Feb 25, 2020

✌️ bradjc can now approve this pull request. To approve and merge a pull request, simply reply with bors r+. More detailed instructions are available here.

@bradjc (Contributor Author) commented Feb 25, 2020

Seems reasonable to add a last-call tag for this.

@bradjc added the last-call label (Final review period for a pull request) on Feb 25, 2020
@hudson-ayers (Contributor) left a comment:

I am still in favor of changing the body of process_each_capability to:

    for process in self.processes.iter() {
        match process {
            Some(p) => {
                closure(p.appid().id(), *p);
            }
            None => {}
        }
    }

so that the PID printed by the process console would be correct, but either way LGTM.

@bradjc (Contributor Author) commented Mar 2, 2020

bors r+

@bors bot commented Mar 2, 2020

Build succeeded

@bors bot merged commit e50ca0a into master on Mar 2, 2020
@bors bot deleted the appid-no-index-enter branch on March 2, 2020 22:10
bors bot added a commit that referenced this pull request Mar 3, 2020
1588: kernel: give processes unique identifiers r=alevy a=bradjc

### Pull Request Overview

This builds on top of PR #1566 and makes app identifiers for processes unique by incrementing a counter every time a new process needs an identifier.

Since this requires #1566 there are many commits here, but the relevant one (and where you should look) is here: a2020b6

This is part of #1082.

### Testing Strategy

todo


### TODO or Help Wanted

This is a redo of a prior PR, and one question that came up is whether the counter should be 64-bit in case a process restarts continuously. Changing from a `usize` to `u64` was not as straightforward as I thought, so I did not include that change here. Should we do something about the counter wrapping around? Or should boards use one of the restart policies to avoid restarting an app that many times?


### Documentation Updated

- [x] Updated the relevant files in `/docs`, or no updates are required.

### Formatting

- [x] Ran `make formatall`.


1639: doc: wg: OpenTitan: Update charter with note about libtock-rs r=alevy a=bradjc

Since robust RISC-V support for libtock-rs is essential to the OpenTitan effort, the working group has agreed it should be added to the charter in the same way that rv32i arch support is.


### Testing Strategy

n/a


### TODO or Help Wanted

Any comments or feedback?


### Documentation Updated

- [x] Updated the relevant files in `/docs`, or no updates are required.

### Formatting

- [x] Ran `make formatall`.


Co-authored-by: Brad Campbell <bradjc5@gmail.com>
bors bot added a commit that referenced this pull request Mar 3, 2020
1609: ADC Capsule: More work on supporting restarts. r=alevy a=bradjc

### Pull Request Overview

This pull request continues to update the ADC capsule to allow applications to restart while using the ADC driver and for the system to continue working afterwards.

Highlights of changes:

- Handle `InactiveApp` error when checking if a grant region is valid.
- Check on `sample_ready()` and `samples_ready()` if the app is still valid, and properly handle it if not.
- Check on every command syscall if the app that has exclusive control of the adc driver still exists.
- Call stop sampling if a callback from the adc peripheral driver occurs and the app is no longer valid.

Note, if/when the ADC capsule is virtualized then this will no longer be an issue.


### Testing Strategy

I ran the adc and adc_continuous apps numerous times and used the process console to cause them to fault in various stages of their operation. I can no longer get the ADC capsule into a state where apps can only error.


### TODO or Help Wanted

I tested this with a combination of my other PRs #1565 #1566 #1588. Until those are merged this PR/commit alone will not work.


### Documentation Updated

- [x] Updated the relevant files in `/docs`, or no updates are required.

### Formatting

- [x] Ran `make formatall`.


Co-authored-by: Brad Campbell <bradjc5@gmail.com>