kernel: Add an "identifier" to AppId #1566

Merged Mar 2, 2020 (14 commits)

Conversation

@bradjc (Contributor) commented Jan 27, 2020

Pull Request Overview

Currently a capsule can store an AppId to refer to a process, and there is no real mechanism for ensuring that that AppId is still valid. Today in Tock we are OK with this, since we only create processes once, they don't restart, and we don't move them around in the processes array. However, as process management becomes more sophisticated, the idea that an AppId is always valid will no longer be true. Since AppId only stores the index of the relevant process in that array, if a process restarts, or if a process is removed and a new one is added in the same slot, then the stored AppId will appear valid even though it actually refers to a different process.

To enable verifying that AppIds are still valid, this PR adds a new field (identifier) to the AppId object. This allows AppIds to be used as before (i.e. they are still copyable), but now operations using AppId can verify that the identifier in the AppId matches the identifier assigned to the process at a particular index.

AppId retains the index field, which still corresponds to the index in the processes array (at least while the AppId is valid). This preserves direct lookups and simplifies the IPC implementation.

Changes to AppId API

The existing .idx() function has been removed, and .index() and .id() have been added. .index() checks that the AppId is still valid before returning the index in the array of the corresponding app. .id() just returns the identifier, which can be passed to userspace or over a UART channel, for example.

These changes resulted in a couple of changes to capsules that need usize identifiers for apps; those capsules now must call .id().
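
Roughly, the new shape of AppId looks like this (a simplified sketch; the real struct also carries a reference to the kernel so that .index() can be validated against the processes array, which is elided here):

    // Simplified sketch of the new AppId. The actual struct also holds a
    // reference to the Kernel so that index() can be validated against the
    // processes array; that validation is elided here.
    #[derive(Clone, Copy)]
    pub struct AppId {
        /// Slot in the kernel's processes array; internal bookkeeping only.
        index: usize,
        /// Stable identifier for the process this AppId was created for.
        identifier: usize,
    }

    impl AppId {
        /// Returns the array index, or None if this AppId no longer refers
        /// to a valid process.
        pub fn index(&self) -> Option<usize> {
            // For now the index is always considered valid; eventually this
            // will check the kernel's processes array.
            Some(self.index)
        }

        /// Returns the bare identifier, suitable for passing to userspace
        /// or printing over UART. Succeeds even if the process is gone.
        pub fn id(&self) -> usize {
            self.identifier
        }
    }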

Changes to kernel APIs

The major change in the kernel (beyond AppId) is that functions which used to take an index now just take an AppId. Mostly this is a minor change, except for kernel.process_map_or(). This function runs a closure on a specific app if it exists, and it uses both the index and the identifier to verify the app is still valid.
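
A rough sketch of the updated lookup (illustrative; the real method lives on the Kernel struct and the exact body differs):

    // Illustrative sketch: the closure only runs if the slot at the AppId's
    // index holds a process whose identifier matches the one stored in the
    // AppId.
    crate fn process_map_or<F, R>(&self, default: R, appid: AppId, closure: F) -> R
    where
        F: FnOnce(&dyn ProcessType) -> R,
    {
        match self.processes.get(appid.index) {
            Some(Some(process)) if process.appid().id() == appid.id() => closure(*process),
            _ => default,
        }
    }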

Beyond that, IPC had to be updated quite a bit to handle the changes. I also took the opportunity to update how grant iterators are implemented, since this PR reduces the reliance on an app's identifier being its index in the processes array.

How does this fix the capsule problem?

A capsule can still store an AppId, and that AppId can still become invalid. However, the main use of an AppId between a capsule and the kernel is to call grant.enter(appid). Without this PR, if appid refers to an old process and a new process has started in the same array index, grant.enter(appid) would succeed (after initializing a new grant region for the new process). Now, when grant.enter() calls kernel.process_map_or(appid), an error will be returned, which will bubble back to the capsule.
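
From the capsule side the failure now surfaces as an error result; something like this (grant type and field names are hypothetical):

    // Hypothetical capsule-side handling: entering the grant of a stale
    // AppId now returns Err instead of silently creating a grant region
    // for whatever new process occupies the same slot.
    match self.apps.enter(stored_appid, |app, _| {
        app.pending_command = true;
    }) {
        Ok(()) => { /* stored_appid still refers to the same process */ }
        Err(_) => {
            // Stale AppId: the process restarted or was replaced. Drop it.
            self.current_app.clear();
        }
    }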

What does this PR not do?

This PR just improves the mechanism for detecting app changes, but does not change the policy on how app identifiers are assigned. Identifiers remain equivalent to their index in the array.

I iterated on the design in this PR several times, and the changes became substantial enough without also changing how app identifiers are assigned.

This is part of #1082.

Testing Strategy

This pull request was tested by running the IPC apps. More testing is needed.

TODO or Help Wanted

n/a

Documentation Updated

  • Updated the relevant files in /docs, or no updates are required.

Formatting

  • Ran make formatall.

/// not exist (i.e. it is `None` in the `processes` array, or has stopped,
/// will be returned. Otherwise the closure will be executed and passed a
/// reference to the process.
crate fn process_map_or<F, R>(&self, default: R, process_index: usize, closure: F) -> R
Contributor:

Is it true that if a process has been stopped this will return default? From looking through, it seems that a stopped process still has a valid and unchanged AppId, and this function does not seem to call anything that checks whether a process has been stopped.

Contributor Author:

Ah ok, good catch! This raises a case I hadn't really considered...what should happen if an app is in the StoppedFaulted state (or perhaps an Ended state, if one were to exist)? The process is still there, and it's the same process, so its identifier probably shouldn't change, and certain capsules may want to be able to see that it exists (like the process console). However, other capsules should be able to know it is no longer active (and never will be again). Perhaps process_map_or() should not change, but users of process_map_or() may need to check that the process is still active. In particular, grant.enter() shouldn't succeed on an ended process.

In any case I will update the comment.

Contributor Author:

I updated the comment.

@bradjc (Contributor Author) commented Jan 31, 2020

Added a commit to address the issue of what happens with a "stopped" process, in particular a process that cannot be executed again. I call this an "inactive" process; this can happen if the process crashes and the kernel does not restart it, or (hypothetically) if the process exits itself and the kernel does not free its Process struct.

The issue in this case is to ensure that capsules, or anything that holds an AppId pointing to that app, can learn that the app is inactive, will not run again, and should not have state held for it. Since the process is still valid in the processes array, calling kernel.process_map_or() will find a valid match, so that will not signal to the caller that the process is inactive. I thought about requiring callers to check if the process is active, but that seems easy to forget to do.

Instead, I went through all of the functions in the ProcessType trait and made it clear that any function that attempts to modify a process in any way will fail if the process is inactive. That allows something to still query the process (say, to access its debug information, or for the process console to show that the app exists and is inactive), but any attempt by an AppId holder to modify the process will cause an error to propagate.

The canonical example of this is when accessing a grant. Capsules access a specific grant by calling grant.enter(appid). If the appid is completely invalid, then when the grant module tries to find the process, an error will be returned (as described above). However, if the appid is valid but the process is inactive, the grant module will find the process, but when it tries to manipulate the grant region an error will occur, and that error will propagate back to the capsule.
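
The guard added to the mutating ProcessType methods looks roughly like this (a sketch; is_active() stands in for the internal state check, and set_yielded_state() is just one example of an affected method):

    // Sketch of the guard pattern: mutating operations bail out early on an
    // inactive process. is_active() is an assumed internal helper.
    fn set_yielded_state(&self) {
        // Do not modify an inactive process.
        if !self.is_active() {
            return;
        }
        self.state.set(State::Yielded);
    }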

@bradjc force-pushed the appid-no-index-enter branch from 70d4a4c to b046c51 on January 31, 2020 21:14
@hudson-ayers (Contributor) previously approved these changes Feb 4, 2020, leaving a comment:

I think this looks good, and is a great step towards better handling restarted apps.

That said, a few thoughts I have:

  1. Whatever solution we choose for assigning the "identifier" field, we will probably need to be careful to avoid reuse of identifiers.

  2. It seems like this could potentially be implemented without storing the index field at all. This would require restructuring IPC to store tuples of (callbacks/identifiers) and (AppSlices/identifiers), and replacing the few places where we directly index into the process array with searches for the identifier. Given that no board currently supports more than 20 processes, this is not a huge overhead, though I realize this could be a limiting design decision if that were to change. I think the main overhead is that you would have to iterate through the list on every call to .enter(). The upside is that the resulting design is less confusing and entirely decouples processes from how the kernel stores them.

impl PartialEq for AppId {
    fn eq(&self, other: &AppId) -> bool {
-       self.idx == other.idx
+       self.index == other.index && self.identifier == other.identifier
Member:

Should this just compare identifier?

Semantically, I think an app reference is a reference to the same app whether or not that app is alive or dead.

Contributor Author:

I think in the current implementation it is probably more correct to compare both fields. The reason is that using an AppId with a valid identifier but the wrong index will not work, but using one with both fields correct will. It would be a little strange to have AppIdA == AppIdB, but then using AppIdA fails while AppIdB works.

At one point I did implement the logic for an AppId to automatically update its index, but since processes cannot currently move, that seemed like unnecessary complexity right now.

Member:

Hm, I'm not sure I'm convinced. The opposite argument here is that the .id() method is provided to

retrieve a simple identifier for the process that can be communicated over a UART bus or syscall interface. This call is guaranteed to return a suitable identifier for the AppId

So now we have a situation where it's possible that AppIdA.id() == AppIdB.id() but AppIdA != AppIdB.

I feel like the identifier is the identifier of an app, and the index is just an internal optimization that saves an otherwise expensive lookup.

What situation would arise where the identifiers match but the indexes don't?

Contributor Author:

What situation would arise where the identifiers match but the indexes don't?

The kernel moves the application to a different index in the processes array. But there isn't any functionality to do this, so it's a case that doesn't matter.

The fact that processes cannot move slots in the array (since there is currently no code that will do that) can support either implementation of eq. However, since using just identifier would be required if there were ever a reason to move processes around in the array, that implementation makes more sense. I will update.
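
With that change, the comparison reduces to identifier-only (matching the commit that followed):

    impl PartialEq for AppId {
        fn eq(&self, other: &AppId) -> bool {
            // index is internal bookkeeping; identity is the identifier.
            self.identifier == other.identifier
        }
    }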

/// originally referred to, even if that app no longer exists. For example,
/// the app may have restarted, or may have been ended or removed by the
/// kernel. Therefore, calling `id()` is _not_ a valid way to check
/// that an application still exists.
Member:

Thinking about global uniqueness of identifiers: since we're proposing this as a mechanism that labels a process possibly off-core (here tockloader is given as an example, but one could also imagine using it as an RPC identifier), I think part of the future design discussion for the actual identifiers should include whether we have an entropy source of some kind in the kernel to try to ensure that identifiers are unique even across reboots.

Contributor:

[...] whether we have an entropy source of some kind in the kernel to try to ensure that identifiers are unique even across reboots

Note that this would require more bits than usize from a cryptographic perspective, if you really want to avoid collisions in all realistic scenarios (even considering that embedded CPUs are "slow" enough that one can only generate so many identifiers per second). Within a set of random 32-bit identifiers, a collision becomes likely after only around 2^16 identifiers.

I don't have much context on the use cases, so I guess it depends whether uniqueness is a nice-to-have property for e.g. debugging, or if a collision is to be avoided at all costs.
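
For concreteness, the birthday-bound arithmetic behind that 2^16 figure:

    p \approx 1 - e^{-n^2/(2N)}; with N = 2^{32} possible identifiers and
    n = 2^{16} draws, p \approx 1 - e^{-1/2} \approx 0.39

So after roughly 65,000 random 32-bit identifiers a collision is quite likely; a 64-bit space pushes the same risk out to around 2^32 identifiers.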

@bradjc (Contributor Author) commented Feb 5, 2020

1. Whatever solution we choose for assigning the "identifier" field, we will probably need to be careful to avoid reuse of identifiers.

Yes. My current simple implementation is just to keep counting up, and hope we don't hit 2^32. But we can discuss this more on the next PR.
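
A minimal sketch of that policy (struct and field names assumed, not the actual kernel code):

    use core::cell::Cell;

    // Minimal sketch of the "keep counting up" policy: identifiers are
    // handed out monotonically and never reused unless the counter wraps.
    pub struct IdentifierCounter {
        next_identifier: Cell<usize>,
    }

    impl IdentifierCounter {
        pub fn create_identifier(&self) -> usize {
            let id = self.next_identifier.get();
            self.next_identifier.set(id.wrapping_add(1));
            id
        }
    }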

2. It seems like this could potentially be implemented without storing the `index` field at all. This would require restructuring IPC to store tuples of (callbacks/identifiers) and (AppSlices/identifiers), and replacing the few places where we directly index into the process array with searches for the identifier. Given that no board currently supports more than 20 processes, this is not a huge overhead, though I realize this could be a limiting design decision if that were to change. I think the main overhead is that you would have to iterate through the list on every call to `.enter()`. The upside is that the resulting design is less confusing and entirely decouples processes from how the kernel stores them.

Yes, definitely, and my original implementation did just that, in two different forms. The first stored tuples of <identifier, callback/slice>; it definitely complicates IPC, and while conceptually it is pretty straightforward, the Rust code gets much more complicated because there is a lot more checking required, with more map()s, etc. It eventually became too much of a burden to actually implement in Rust.

My second version stored an array of identifiers, and used the index of the identifier in that array as the lookup into the arrays for callbacks and app slices. This was much easier to actually implement, and worked, but was a little clumsy, since there are these extra arrays and a scan of those arrays. Plus, process_map_or() then had to be a scan, and that could potentially lead to nested scans if implementers aren't careful. Complicating IPC and forcing an array scan on every process lookup seemed like a rather large cost just to eliminate the index that we already have in AppId, hence I settled on the design you see in this PR.
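
For comparison, the index-free lookup from that second approach would look roughly like this (a reconstructed sketch, not the actual code): every lookup becomes an O(n) scan for the identifier.

    // Reconstructed sketch of the rejected index-free design: find a
    // process by scanning the processes array for a matching identifier.
    crate fn process_map_or<F, R>(&self, default: R, identifier: usize, closure: F) -> R
    where
        F: FnOnce(&dyn ProcessType) -> R,
    {
        for process in self.processes.iter() {
            if let Some(p) = process {
                if p.appid().id() == identifier {
                    return closure(*p);
                }
            }
        }
        default
    }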

@hudson-ayers (Contributor):

I came across a pretty obscure piece of code affected by this change while looking into modifying the ProcessConsole. Currently, the ProcessConsole lists processes in the form:

 PID    Name                Quanta  Syscalls  Dropped Callbacks  Restarts    State
  00    hello3                 298        60                  0         0  Running
  01    hello2                 280        54                  0         0  Running
  02    hello_loop             262        99                  0         0  Running

where the PID is not actually the index of the process in the processes array, but the index of the iterator through the processes array (as output by .iter().enumerate() in process_each_capability() in grant.rs). This means that if there are empty slots in the processes array prior to a process, the PID assigned by the process console will match neither the identifier of the process nor its index in the processes array.

This is unchanged by this PR, but it would probably make more sense to have the ProcessConsole print the actual identifier of the process now that we have one.

@bradjc (Contributor Author) commented Feb 18, 2020

I've tested this with various apps on Hail, including the tutorials/05 IPC apps, the Hail test app, and other collections of apps. I also tried these five:

 PID    Name                Quanta  Syscalls  Dropped Callbacks  Restarts    State
  00	hello_loop               0       359                  0         0  Yielded
  01	button_print             0         7                  0         0  Yielded
  02	adc_continuous           0       675                  0         0  Yielded
  03	rot13_client             0       829                  0         0  Yielded
  04	org.tockos.ex.rot13      0       131                  0         0  Yielded

and all 5 seem to work.

@@ -769,6 +829,11 @@ impl<C: Chip> ProcessType for Process<'a, C> {
    fn setup_mpu(&self) {
+       // Do not modify an inactive process.
Member:

I'm curious about the proliferation of these:

  • Are all the functions with these checks actually getting called on stopped processes?
  • Should that be a bug rather than a silent ignore?
  • How do we make sure we don't miss these in future process management stuff?

Contributor Author:

* Are all the functions with these checks actually getting called on stopped processes?

Well, the one that initiated this change is: notably, a capsule can hold a copy of an AppId for a stopped process and then use it to enter its grant region. The others are not, as most are called as a result of the process being scheduled, which won't happen since it is stopped.

* Should that be a bug rather than a silent ignore?

In the case of entering a grant region it returns an error to the capsule.

For the others, maybe? It adds complexity in the common case. Maybe the better approach (at least for now) is to make the documentation clear that calling the functions without return codes is only valid when the process is active.

* How do we make sure we don't miss these in future process management stuff?

I'm not sure. Generally, functions with return values need to handle this, as those return values likely are passed (eventually) outside of the kernel crate. Functions without return values tend to be only called inside the kernel.

This is also complicated a little by code like the process console which does obtain references to ProcessType objects. But that is really a separate issue.

bradjc and others added 14 commits February 25, 2020 14:59
  • AppId.idx used to refer to the index of the app in the processes array. This enforces a very strong association between the two, which hurts our ability to change AppId in the future to handle apps that restart, are moved, or are removed. This removes that direct association in code outside of sched.rs. Since at this point AppId.idx() is still the index of the process in the array, we go ahead and keep using it. But code that does not manage the processes array should not use it as a direct index.

  • The IPC structure was based on application IDs starting at 0 and being able to correspond to array indices. This adds a layer of indirection.

  • No more correlation between app id and index in the array.

  • This change allows AppId to both: use an index into the array to ensure fast access to a process, since it is stored at a particular location in the array; and express that the AppId is no longer valid by returning `None`. So far the index is always considered valid (hence the return is always `Some(self.idx)`), but eventually we will be able to check if the index is valid.

  • Allows the kernel struct more access to the AppId to make lookups more optimized. Also adds documentation.

  • If a process is in an inactive state (e.g. it has faulted and the kernel stopped it) then many of the operations in the `ProcessType` trait should not be functional and instead should return errors. This makes those functions return errors or explicitly say that they are not valid. This is needed to help signal to capsules or other users of a process that the process is not active and will not run again. For example, if a capsule tries to call `.enter()` on a grant with an inactive process, the `Kernel` object in sched.rs will find the matching process, but trying to access one of its grant pointers will fail, since it is not valid to set up or access a grant on an inactive process. This error will then get back to the capsule.

  • Don't stop at processes which don't have the grant set up.

  • Since index is an internal bookkeeping item, we do not use it to compare AppIds. Note: this works because processes cannot move in the processes array. If we ever want to move processes, we will need to update AppId to match.

  • Note that several functions can only be called on an active process. These functions are only called from the kernel scheduler, and will not currently be called unless the kernel can schedule the process.
@bradjc force-pushed the appid-no-index-enter branch from 8d4cdf5 to 21c1f85 on February 25, 2020 20:00
@bradjc (Contributor Author) commented Feb 25, 2020

I rebased, and updated the AppId equality check, and undid some of the changes to the ProcessType trait and instead added a little more documentation for some methods.

@ppannuto (Member) left a comment:

bors d+

I think this looks good to me at this point. The inactive process stuff is probably still imperfect, but definitely best-of-our-current-world.

@bors bot commented Feb 25, 2020

✌️ bradjc can now approve this pull request. To approve and merge a pull request, simply reply with bors r+. More detailed instructions are available here.

@bradjc (Contributor Author) commented Feb 25, 2020

Seems reasonable to add a last-call tag for this.

@bradjc added the last-call label (Final review period for a pull request) on Feb 25, 2020
@hudson-ayers (Contributor) left a comment:

I am still in favor of changing the body of process_each_capability to:

    for process in self.processes.iter() {
        match process {
            Some(p) => {
                closure(p.appid().id(), *p);
            }
            None => {}
        }
    }

so that the PID printed by the process console would be correct, but either way LGTM.

@bradjc (Contributor Author) commented Mar 2, 2020

bors r+

@bors bot commented Mar 2, 2020

Build succeeded

@bors bot merged commit e50ca0a into master on Mar 2, 2020
@bors bot deleted the appid-no-index-enter branch on March 2, 2020 22:10
bors bot added a commit that referenced this pull request Mar 3, 2020
1588: kernel: give processes unique identifiers r=alevy a=bradjc

### Pull Request Overview

This builds on top of PR #1566 and makes app identifiers for processes unique by incrementing a counter every time a new process needs an identifier.

Since this requires #1566 there are many commits here, but the relevant one (and where you should look) is here: a2020b6

This is part of #1082.

### Testing Strategy

todo


### TODO or Help Wanted

This is a redo of a prior PR, and one question that came up is whether the counter should be 64-bit in case a process restarts continuously. Changing from a `usize` to `u64` was not as straightforward as I thought, so I did not include that change here. Should we do something about the counter wrapping around? Or should boards use one of the restart policies to avoid restarting an app that many times?


### Documentation Updated

- [x] Updated the relevant files in `/docs`, or no updates are required.

### Formatting

- [x] Ran `make formatall`.


1639: doc: wg: OpenTitan: Update charter with note about libtock-rs r=alevy a=bradjc

Since robust RISC-V support for libtock-rs is essential to the OpenTitan effort, the working group has agreed it should be added to the charter in the same way that rv32i arch support is.


### Testing Strategy

n/a


### TODO or Help Wanted

Any comments or feedback?


### Documentation Updated

- [x] Updated the relevant files in `/docs`, or no updates are required.

### Formatting

- [x] Ran `make formatall`.


Co-authored-by: Brad Campbell <bradjc5@gmail.com>
bors bot added a commit that referenced this pull request Mar 3, 2020
1609: ADC Capsule: More work on supporting restarts. r=alevy a=bradjc

### Pull Request Overview

This pull request continues to update the ADC capsule to allow applications to restart while using the ADC driver and for the system to continue working afterwards.

Highlights of changes:

- Handle `InactiveApp` error when checking if a grant region is valid.
- Check on `sample_ready()` and `samples_ready()` if the app is still valid, and properly handle it if not.
- Check on every command syscall if the app that has exclusive control of the adc driver still exists.
- Call stop sampling if a callback from the adc peripheral driver occurs and the app is no longer valid.

Note, if/when the ADC capsule is virtualized then this will no longer be an issue.


### Testing Strategy

I ran the adc and adc_continuous apps numerous times and used the process console to cause them to fault in various stages of their operation. I can no longer get the ADC capsule into a state where apps can only error.


### TODO or Help Wanted

I tested this with a combination of my other PRs #1565 #1566 #1588. Until those are merged this PR/commit alone will not work.


### Documentation Updated

- [x] Updated the relevant files in `/docs`, or no updates are required.

### Formatting

- [x] Ran `make formatall`.


Co-authored-by: Brad Campbell <bradjc5@gmail.com>