Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add data_safety.md #2768

Merged
merged 16 commits into from
Mar 30, 2023
Merged
1 change: 1 addition & 0 deletions docs/README.md

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

183 changes: 183 additions & 0 deletions docs/design/data_safety.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,183 @@
Data Safety in ICU4X
====================

At its core, ICU4X is a set of algorithms that map input data to human-ready outputs. Since a core value proposition of ICU4X is the ability to dynamically load data, we can't always assume that our data are valid and pristine. This doc covers how we think about this problem and handle it in ICU4X.

## Problem and Policy Statement

Given the following goals…

1. The ICU4X core library should not panic internally.
1. Both baked and dynamically-loaded data are core features, and we seek to minimize the tradeoffs between them.

…and the the following evidence…

1. It is rare to be 100% confident about the safety of your data.[^1]
1. Data loading and validation is known to be a performance bottleneck in prior-art libraries such as ICU4C.
1. Validating data invariants can be expensive at deserialization time.[^2]
1. Validating data invariants at runtime is _sometimes free_ and _usually cheaper_ than at deserialization time.
1. It is useful to _be able to_ learn whether code traverses unexpected code paths.
1. End users of ICU4X algorithms are not in a position to reason about invalid data in otherwise-infallible terminal functions.
1. All types of ICU4X data, including CLDR and Unicode property data, may be loaded and deserialized dynamically.
1. Algorithms that panic on malformed data increase the vulnerability space of an application.[^3]
1. "Garbage in, garbage out" (GIGO) does not increase the space that malicious actors could leverage.[^3]

…the ICU4X project has adopted the following policy:

1. Code should never panic at runtime based on invalid data.
1. Data structs should minimize the number of internal invariants, _especially_ ones that are expensive to validate.[^2]
1. Code paths only reachable by invalid data should use GIGO with debug assertions.

[^1]: *As a thought experiment, if you were 100% confident, you could use `get_unchecked` and other unsafe operations. If you are not confident enough to use unsafe code, then you are not 100% confident.*

[^2]: *"Expensive" refers to both code size and performance.*

[^3]: *If an attacker is able to mutate ICU4X data sources, they could construct data to produce a result they desire, regardless of whether unexpected operations end in GIGO. However, panicky algorithms could enable attackers to perform denial-of-service attacks.*

Additional points:

1. As a rule of thumb, validation that requires a single linear pass over the data with no memory allocations is fine, similar to UTF-8 validation, but validation that requires memory allocations, runs in superlinear time, or pulls in large amounts of code or dependencies is discouraged.
2. If performing binary search on a vector from data, it is not necessary to validate that the vector is sorted. The binary search will fail to find elements in an unsorted vector, but it won't panic, so this is consistent with GIGO.
3. The best data structs are those that don't need to be validated. See the below example on how to write data structs without internal invariants.

## Example

Consider the struct

```rust
#[derive(serde::Deserialize)]
pub struct MonthNamesBad {
// Invariant 1: the first element in the vector is the default value
// Invariant 2: the length of the vector is the same as month_count
// Invariant 3: month_count is at least 1
month_names: Vec<String>,
month_count: usize,
}

impl MonthNamesBad {
pub fn get_first_month_name(&self) -> &str {
&self.month_names[0]
}
/// Returns None only if `idx` is greater than the month count.
pub fn get_month_name_at_index(&self, idx: usize) -> Option<&str> {
if idx >= self.month_count {
return None;
}
Some(&self.month_names[idx])
}
pub fn get_month_count(&self) -> usize {
self.month_count
}
}
```

This struct has 2 invariants, and each of the functions could panic if those invariants are not upheld.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there not an implicit 3rd invariant, which is that the Vec has at least 1 element? Or do you want to consider that a part of Invariant 1?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're listing 3 invariants, but stating that there are only two.


Below are 3 approaches to resolve this issue that are consistent with the ICU4X style.

*Editor's note: the examples in this document are tested along with the tutorials.*

### Solution 1: Garbage In, Garbage Out

Change the functions to return default fallback values if the data is not in the expected form. This code also consolidates `month_count` and `month_names` into the same field, a change reflected in all of the proposed solutions.

```rust
#[derive(serde::Deserialize)]
pub struct MonthNamesGIGO {
// WEAK Invariant: the first element in the vector is the default value
month_names: Vec<String>,
}

impl MonthNamesGIGO {
pub fn get_first_month_name(&self) -> &str {
match self.month_names.get(0) {
Some(v) => v,
None => {
debug_assert!(false, "month_names is empty");
// Return a GIGO result:
""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd call out that here you are returning a "garbage" response "" in case the data is not upholding to invariants.

}
}
}
pub fn get_month_name_at_index(&self, idx: usize) -> Option<&str> {
self.month_names.get(idx).map(String::as_str)
}
pub fn get_month_count(&self) -> usize {
self.month_names.len()
}
}
```

### Solution 2: Remove Internal Invariants

Change the struct so that it doesn't have internal invariants.

```rust
#[derive(serde::Deserialize)]
pub struct MonthNamesNoInvariants {
pub first_month_name: String,
pub remaining_months: Vec<String>,
}

impl MonthNamesNoInvariants {
pub fn get_first_month_name(&self) -> &str {
&self.first_month_name
}
pub fn get_month_name_at_index(&self, idx: usize) -> Option<&str> {
if idx == 0 {
Some(&self.first_month_name)
} else {
self.remaining_months.get(idx - 1).map(String::as_str)
}
}
pub fn get_month_count(&self) -> usize {
self.remaining_months.len() + 1
}
}
```

This struct has no internal invariants. The deserialization step automatically enforces that the default value is present.

### Solution 3: Use Helper Struct

With this solution, check that there are the expected number of items at deserialization time. Leverage Serde's [try_from](https://serde.rs/container-attrs.html#try_from) attribute to perform the validation.

```rust
use icu_provider::DataError;

#[derive(serde::Deserialize)]
struct MonthNamesInner {
pub month_names: Vec<String>,
}

#[derive(serde::Deserialize)]
#[serde(try_from = "MonthNamesInner")]
pub struct MonthNamesCheckedInvariants {
// Invariant: the vector is non-empty.
month_names: Vec<String>,
}

impl TryFrom<MonthNamesInner> for MonthNamesCheckedInvariants {
type Error = DataError;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: I realized that this error gets promptly cast to serde::de::Error::custom(...) as part of serde::try_from

https://github.com/serde-rs/serde/blob/fb2fe409c8f7ad6c95e3096e5e9ede865c8cfb49/serde_derive/src/de.rs#L401

What would be the best practice error type for this position?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah we should ideally use serde's error type directly

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the most efficient type to use is &'static str because serde only exposes a trait, and the trait casts whatever comes in as Display.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah let's do that

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

errors for private types should be tailored, AND errors in doc examples should typically be simple

(though string errors aren't exactly best practice; the goal of this is not to teach people errors, and usually it's reasonably obvious that string errors should be replaced with something more codebase specific)

fn try_from(other: MonthNamesInner) -> Result<Self, Self::Error> {
if other.month_names.is_empty() {
Err(DataError::custom("month_names must not be empty"))
} else {
Ok(Self { month_names: other.month_names })
}
}
}

impl MonthNamesCheckedInvariants {
pub fn get_first_month_name(&self) -> &str {
#[allow(clippy::indexing_slicing)] // validated invariant
&self.month_names[0]
}
pub fn get_month_name_at_index(&self, idx: usize) -> Option<&str> {
self.month_names.get(idx).map(String::as_str)
}
pub fn get_month_count(&self) -> usize {
self.month_names.len()
}
}
```
2 changes: 2 additions & 0 deletions docs/tutorials/writing_a_new_data_struct.md
Original file line number Diff line number Diff line change
Expand Up @@ -36,6 +36,8 @@ In general, data structs should be annotated with `#[icu_provider::data_struct]`

As explained in *data_pipeline.md*, the data struct should support zero-copy deserialization. The `#[icu_provider::data_struct]` annotation will enforce this for you. **See more information in [style_guide.md](https://github.com/unicode-org/icu4x/blob/main/docs/process/style_guide.md#zero-copy-in-dataprovider-structs--required),** as well as the example below in this tutorial.

Additionally, data structs should keep internal invariants to a minimum. For more information, see [data_safety.md](../design/data_safety.md).

### Data Download

The first step to introduce data into the ICU4X pipeline is to download it from an external source. This corresponds to step 1 above.
Expand Down
2 changes: 2 additions & 0 deletions tools/tutorials-test/src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -18,3 +18,5 @@ mod js {}
mod markdown_tips {}
#[doc = include_str!("../../../README.md")]
mod readme {}
#[doc = include_str!("../../../docs/design/data_safety.md")]
mod data_safety {}