From d21965866ece73f9552d9d06c012e4f526ab8c15 Mon Sep 17 00:00:00 2001 From: Andreas Hartmann Date: Wed, 21 Sep 2022 10:48:56 +0200 Subject: [PATCH] docs: Add ERROR_HANDLING that explains how we plan to change error handling in zellij and invites new contributors to join the fun. Details the currently existent error handling capabilities and gives a bunch of examples taken from #1670. --- docs/ERROR_HANDLING.md | 313 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 313 insertions(+) create mode 100644 docs/ERROR_HANDLING.md diff --git a/docs/ERROR_HANDLING.md b/docs/ERROR_HANDLING.md new file mode 100644 index 0000000000..5479ebf4a1 --- /dev/null +++ b/docs/ERROR_HANDLING.md @@ -0,0 +1,313 @@ +# Help wanted + +As the zellij code-base changed, a lot of places where a call to `unwrap()` +previously made sense can now potentially cause errors which we'd like to +handle. While we don't consider `unwrap` to be a bad thing in general, it hides +the underlying error and leaves the user only with a stack trace to go on. +Worse than this, it will crash the application without giving us a chance to +potentially recover. This is particularly bad when the user is using +long-running sessions to perform tasks. + +Hence, we would like to eliminate `unwrap()` statements from the code where +possible, and apply better error handling instead. This way, functions higher +up in the call stack can react to errors from underlying functions and either +try to recover, or give some meaningful error messages if recovery isn't +possible. + +Since the zellij codebase is pretty big and growing rapidly, this endeavor +will continue to be pursued over time, as zellij develops. The idea is that +modules or single files are converted bit by bit, preferably in small PRs that +each target a specific module or file. **If you are looking to contribute to +zellij, this may be an ideal start for you!** This way you get to know the +codebase and get an idea which modules are used at which other places in the +code. + +If you have an interest in this, don't hesitate to get in touch with us and +refer to the [tracking issue][tracking_issue] to see what has already been +done. + + +# Error handling facilities + +You get access to all the relevant functions and traits mentioned in the +remainder of this document by including/adding this in the code you're working +on: + +```rust +use zellij_utils::errors::prelude::*; +``` + +## Displaying panic messages + +Panics are generally handled via the `Panic` error type and the +[`handle_panic`][handle_panic] panic handler function. The fancy formatting +is performed by the [`miette`][miette] crate. + + +## Propagating errors + +We use the [`anyhow`][anyhow] crate to propagate errors up the call stack. At +the moment, zellij doesn't have custom error types, so we wrap whatever errors +the underlying libraries give us, if any. [`anyhow`][anyhow] serves the purpose +of providing [`context`][context] about where (i.e. under which circumstances) +an error happened. + +A critical requirement for propagating errors is that all functions involved +must return the [`Result`][Result] type. This allows convenient error handling +with the `?` operator. + +At some point you will likely stop propagating errors and decide what to do +with the error. Generally you can: + +1. Try to recover from the error, or +2. Report the error to the user and either + 1. Terminate program execution (See [`fatal`][fatal]), or + 2. Continue program execution (See [`non_fatal`][non_fatal]) + + +## Handling errors + +Ideally, when the program encounters an error it will try to recover as good as +it can. This can mean falling back to some sane default if a specific value +(e.g. an environment variable) cannot be found. Note that this isn't always +applicable. If in doubt, don't hesitate to ask. + +Recovery usually isn't an option if an operation has changed the internal state +(i.e. the value or content of specific variables) of objects in the code. In +this case, if an error is encountered, it is best to declare the program state +corrupted and terminate the whole application. This can be done by `unwrap`ing +on the [`Result`][Result] type. Always try to propagate the error as good as +you can and attach meaningful context before `unwrap`ing. This gives the user +an idea what went wrong and can also help developers in quickly identifying +which parts of the code to debug if necessary. + +When you encounter such a fatal error and cannot propagate it further up (e.g. +because the current function cannot be changed to return a [`Result`][Result], +or because it is the "root" function of a program thread), use the +[`fatal`][fatal] function to panic the application. It will attach some small +context to the error and finally `unwrap` it. Using this function over the +regular `unwrap` has the added benefit that other developers seeing this in the +code know that someone has previously spent some thought about error handling +at this location. + +If you encounter a non-fatal error, use the [`non_fatal`][non_fatal] function +to handle it. Instead of `panic`ing the application, the error is written to +the application log and execution continues. Please use this sparingly, as an +error usually calls for actions to be taken rather than ignoring it. + + +# Examples of applied error handling + +You can have a look at the commit that introduced error handling to the +`zellij_server::screen` module [right here][1] (look at the changes in +`zellij-server/src/screen.rs`). We'll use this to demonstrate a few things in +the following text. + + +## Converting a function to return a `Result` type + +Here's an example of the `Screen::render` function as it looked before: + +```rust + pub fn render(&mut self) { + // ... + let serialized_output = output.serialize(); + self.bus + .senders + .send_to_server(ServerInstruction::Render(Some(serialized_output))) + .unwrap(); + } +``` + +It performs a few actions (not shown here for brevity) and then sends an IPC +message to the server. As you can see it calls `unwrap()` on the result from +sending a message to the server. This means: If sending the message to the +server fails, execution is terminated and the program crashes. Let's assume +that crashing the application in this case is a reasonable course of action. + +In total (as of writing this), the `render()` function is called 80 times from +various places in the code of the `Screen` struct. Hence, if sending the +message fails, we only see that the application crashed trying to send an IPC +message to the server. We won't know which of the 80 different code paths lead +to the execution of this function. + +So what can we do? Instead of `unwrap`ing on the `Result` type here, we can +pass it up to the calling functions. Each of the callers can then decide for +themselves what to do: Continue regardless, propagate the error further up or +terminate execution. + +Here's what the function looked like after the change linked above: + +```rust + pub fn render(&mut self) -> Result<()> { + let err_context = || "failed to render screen".to_string(); + // ... + + let serialized_output = output.serialize(); + self.bus + .senders + .send_to_server(ServerInstruction::Render(Some(serialized_output))) + .with_context(err_context) + } +``` + +We leverage the [`Context`][context] trait from [`anyhow`][anyhow] to attach a +context message to the error and make the function return a `Result` type +instead. As you can see, the `Result` here contains a `()`, which is the empty +type. It's primary purpose here is allowing us to propagate errors to callers +of this function. + +Hence, for example the `resize_to_screen` function changed from this: + +```rust + pub fn resize_to_screen(&mut self, new_screen_size: Size) { + // ... + self.render(); + } +``` + +to this: + +```rust + pub fn resize_to_screen(&mut self, new_screen_size: Size) -> Result<()> { + // ... + self.render() + .context("failed to resize to screen size: {new_screen_size:#?}") + } +``` + +Note how it returns a `Result` type now, too. This way we can pass the error up +to callers of `resize_to_screen` and keep going like this until we decide it's +time to do something about the error. + +In general, any function calling `unwrap` or `expect` is a good candidate to be +rewritten to return a `Result` type instead. + + +## Attaching context + +[Anyhow][anyhow]s [`Context`][context] trait gives us two methods to attach +context to an error: `context` and `with_context`. In functions where `Result`s +can be returned in multiple places, it is preferable to use `with_context`, as +it saves you from duplicating the same context message in multiple places in +the code. This was shown in the `render` method above, but the call to the +other function returning a `Result` wasn't shown: + +```rust + pub fn render(&mut self) -> Result<()> { + let err_context = || "failed to render screen".to_string(); + // ... + + for tab_index in tabs_to_close { + // ... + self.close_tab_at_index(tab_index) + .with_context(err_context)?; + } + // ... + self.bus + .senders + .send_to_server(ServerInstruction::Render(Some(serialized_output))) + .with_context(err_context) + } +``` + +When there is only a single `Result` to be returned from your function, use +`context` as shown above for `resize_to_screen`. + + +## Choosing helpful context messages + +When attaching context to an error, usually you want to express what you were +doing when the error occurred, i.e. in what context the error occurred. In the +`render` method, we could have done something like this instead: + +```rust + pub fn render(&mut self) -> Result<()> { + // ... + + for tab_index in tabs_to_close { + // ... + self.close_tab_at_index(tab_index) + .context("Failed to close tab at index: {tab_index}")?; + } + // ... + self.bus + .senders + .send_to_server(ServerInstruction::Render(Some(serialized_output))) + .context("Failed to send message to server") + } +``` + +Why do we add the message "failed to render screen" instead? Because that is +what we were trying to do when we received the error from the underlying +functions (`close_tab_at_index` and `send_to_server` in this case). Functions +from libraries usually already return an error that describes what went wrong +(Example: When we try to open a file that doesn't exist, the std library will +give us a [`NotFound`][2] error), so we don't have to repeat that. + +In case of doubt, look at the name of the function you're currently working in +and write a context message somehow mentioning this. + + +## Terminating execution + +We want to propagate errors as far up as we can. This way, every function along +the way can at least attach a context message giving us an idea what chain of +events lead to the error. Where do we terminate execution in `Screen`? If you +study the code in `screen.rs`, you'll notice all the components of zellij +interact with the `Screen` instance by means of IPC messages. These messages +are handled in the `screen_thread_main` function. Here's an excerpt: + +```rust + ScreenInstruction::Render => { + screen.render()?; + }, + ScreenInstruction::NewPane(pid, client_or_tab_index) => { + // ... + screen.update_tabs()?; + + screen.render()?; + }, + ScreenInstruction::OpenInPlaceEditor(pid, client_id) => { + // ... + screen.update_tabs()?; + + screen.render()?; + }, +``` + +The code goes on like this for quite a while, so there are many places where an +error may occur. In this case, since all the functions are called from this +central location, we forego attaching a context message to every error. +Instead, we propagate the errors to the caller of this function, which happens +to be the function `init_session` in `zellij-server/src/lib.rs`. We see that +`screen_thread_main` is spawned to run in a separate thread. Hence, we cannot +propagate the error further up and terminate execution at this point: + +```rust + // ... + screen_thread_main( + screen_bus, + max_panes, + client_attributes_clone, + config_options, + ) + .fatal(); +``` + +Remember the call to [`fatal`][fatal] will log the error and afterwards panic +the application (i.e. crash zellij). Since we made sure to attach context +messages to the errors on their way up, we will see these messages in the +resulting output! + + +[tracking_issue]: https://github.com/zellij-org/zellij/issues/1753 +[handle_panic]: https://docs.rs/zellij-utils/latest/zellij_utils/errors/fn.handle_panic.html +[miette]: https://crates.io/crates/miette +[anyhow]: https://crates.io/crates/anyhow +[context]: https://docs.rs/anyhow/latest/anyhow/trait.Context.html +[Result]: https://doc.rust-lang.org/stable/std/result/enum.Result.html +[fatal]: https://docs.rs/zellij-utils/latest/zellij_utils/errors/trait.FatalError.html#tymethod.fatal +[non_fatal]: https://docs.rs/zellij-utils/latest/zellij_utils/errors/trait.FatalError.html#tymethod.non_fatal +[1]: https://github.com/zellij-org/zellij/commit/99e2bef8c68bd166cf89e90c8ffe8c02272ab4d3 +[2]: https://doc.rust-lang.org/stable/std/io/enum.ErrorKind.html#variant.NotFound