Added guide

jorgecarleitao · Feb 25, 2021 · 8b7bdee · 8b7bdee
1 parent 9cc7c10
commit 8b7bdee
Show file tree

Hide file tree

Showing 26 changed files with 396 additions and 290 deletions.
diff --git a/.gitignore b/.gitignore
@@ -1,2 +1 @@
-/target
-Cargo.lock
+target
diff --git a/Cargo.toml b/Cargo.toml
@@ -31,6 +31,7 @@ rand = "0.7"
 criterion = "0.3"
 tempfile = "3"
 flate2 = "1"
+doc-comment = "0.3"
 
 [features]
 default = ["io_csv", "io_json", "io_ipc", "io_json_integration"]

diff --git a/guide/.gitignore b/guide/.gitignore
@@ -0,0 +1 @@
+book
diff --git a/guide/book.toml b/guide/book.toml
@@ -0,0 +1,6 @@
+[book]
+authors = ["Jorge C. Leitao"]
+language = "en"
+multilingual = false
+src = "src"
+title = "arrow 2 documentation"
diff --git a/guide/src/README.md b/guide/src/README.md
@@ -0,0 +1,13 @@
+# Arrow2
+
+Arrow2 is a `safe` Rust library that implements data structures and functionality enabling
+interoperability with the arrow format.
+
+The typical use-case for this library is to perform CPU and memory-intensive analytics in a format
+that allows for IPC and FFI.
+
+Arrow2 is divided in two main parts: a [low-end API](./low_end.md) to efficiently
+operate with contiguous memory regions, and a [high-end API](./high_end.md) to operate with
+arrow arrays, logical types, schemas, etc.
+
+Finally, this crate has an anxiliary section devoted to IO (CSV, JSON).
diff --git a/guide/src/SUMMARY.md b/guide/src/SUMMARY.md
@@ -0,0 +1,6 @@
+# Summary
+
+- [Arrow2](./README.md)
+- [The arrow format](./arrow.md)
+- [Low end API](./low_end.md)
+- [High end API](./high_end.md)
diff --git a/guide/src/arrow.md b/guide/src/arrow.md
@@ -0,0 +1,8 @@
+# Arrow
+
+[Arrow](https://arrow.apache.org/) as in-memory columnar format for modern analytics that natively supports the concept of a null value, nested structs, and many others.
+
+It has an IPC protocol (based on flat buffers) and a stable C-represented ABI for intra-process communication via foreign interfaces (FFI).
+
+The arrow format recommends allocating memory along cache lines to leverage modern archtectures,
+as well as using a shared memory model that enables multi-threading.
diff --git a/guide/src/high_end.md b/guide/src/high_end.md
@@ -0,0 +1,152 @@
+# High end API
+
+The simplest way to think about an arrow `Array` is that it is a `Vec<Option<T>>`
+with a logical type associated with it.
+
+Probably the simplest array is `PrimitiveArray<T>`. It can be constructed
+from an iterator as follows:
+
+```rust
+# use arrow2::array::{Array, PrimitiveArray, Primitive};
+# use arrow2::datatypes::DataType;
+# fn main() {
+let array: PrimitiveArray<i32> = [Some(1), None, Some(123)]
+    .iter()
+    .collect::<Primitive<i32>>()
+    .to(DataType::Int32);
+assert_eq!(array.len(), 3)
+# }
+```
+
+A `PrimitiveArray` has 3 components:
+
+1. A physical type (`i32`)
+2. A logical type (`DataType::Int32`)
+3. Data
+
+Its main difference vs a `Vec<Option<T>>` are:
+
+* Its data is layed out in memory as a `Buffer<T>` and a `Option<Bitmap>`.
+* `PrimitivArray<T>` has an associated datatype.
+
+The first difference allows interoperability with Arrow's ecosystem and efficient operations (we will re-visit this in a second);
+the second difference allows semantic meaning to the array: in the example
+
+```rust
+# use arrow2::array::Primitive;
+# use arrow2::datatypes::DataType;
+# fn main() {
+let ints = Primitive::<i32>::from(&[Some(1), None]).to(DataType::Int32);
+let dates = Primitive::<i32>::from(&[Some(1), None]).to(DataType::Date32);
+# }
+```
+
+`ints` and `dates` have the same in-memory representation but different logic representations,
+(e.g. dates are usually represented as a string).
+
+## Dynamic Array
+
+There is a more powerful aspect of arrow arrays, and that is that they all
+implement the trait `Array` and can be casted to `&dyn Array`, i.e. they can be turned into
+a trait object. This in particular enables arrays to have types that are dynamic in nature.
+`ListArray<i32>` is an example of a nested (dynamic) array:
+
+```rust
+# use std::sync::Arc;
+# use arrow2::array::{Array, ListPrimitive, ListArray, Primitive};
+# use arrow2::datatypes::DataType;
+# fn main() {
+let data = vec![
+    Some(vec![Some(1i32), Some(2), Some(3)]),
+    None,
+    Some(vec![Some(4), None, Some(6)]),
+];
+
+let a: ListArray<i32> = data
+    .into_iter()
+    .collect::<ListPrimitive<i32, Primitive<i32>, i32>>()
+    .to(ListArray::<i32>::default_datatype(DataType::Int32));
+
+let inner: &Arc<dyn Array> = a.values();
+# }
+```
+
+Note how we have not specified the the inner type explicitely the types of `ListArray<i32>`.
+Instead, `ListArray` does have an inner `Array` representing all its values.
+
+## Downcast and `as_any`
+
+Given a trait object `&dyn Array`, we know its logical type via `Array::data_type()` and can use it to downcast the array to its concrete type:
+
+```rust
+# use arrow2::array::{Array, PrimitiveArray, Primitive};
+# use arrow2::datatypes::DataType;
+# fn main() {
+let array: PrimitiveArray<i32> = [Some(1), None, Some(123)]
+    .iter()
+    .collect::<Primitive<i32>>()
+    .to(DataType::Int32);
+let array = &array as &dyn Array;
+
+let array = array.as_any().downcast_ref::<PrimitiveArray<i32>>().unwrap();
+# }
+```
+
+## Iterators
+
+We've already seen how to create an array from an iterator. Most arrays also implement
+`IntoIterator`:
+
+```rust
+# use arrow2::array::{Array, PrimitiveArray, Primitive};
+# use arrow2::datatypes::DataType;
+# fn main() {
+let array: PrimitiveArray<i32> = [Some(1), None, Some(123)]
+    .iter()
+    .collect::<Primitive<i32>>()
+    .to(DataType::Int32);
+
+for item in array.iter() {
+    if let Some(value) = item {
+        println!("{}", value);
+    } else {
+        println!("NULL");
+    }
+}
+# }
+```
+
+## Vectorized operations
+
+One of the main advantages of the arrow format and its memory layout is that 
+it often enables SIMD. For example, an unary operation `op` on a `PrimitiveArray` is likely auto-vectorized on the following code:
+
+```rust
+# use arrow2::buffer::Buffer;
+# use arrow2::{
+#     array::{Array, PrimitiveArray},
+#     buffer::NativeType,
+#     datatypes::DataType,
+# };
+
+pub fn unary<I, F, O>(array: &PrimitiveArray<I>, op: F, data_type: &DataType) -> PrimitiveArray<O>
+where
+    I: NativeType,
+    O: NativeType,
+    F: Fn(I) -> O,
+{
+    let values = array.values().iter().map(|v| op(*v));
+    //  Soundness
+    //      `values` is an iterator with a known size because arrays are sized.
+    let values = unsafe { Buffer::from_trusted_len_iter(values) };
+
+    PrimitiveArray::<O>::from_data(data_type.clone(), values, array.nulls().clone())
+}
+```
+
+Two notes:
+
+1. We have used `from_trusted_len_iter`. This is because [`TrustedLen`](https://doc.rust-lang.org/std/iter/trait.TrustedLen.html) and trait specialization are still unstable.
+
+2. Note how we use `op` on the array's values irrespectively of their validity
+and cloned the validity `Bitmap`. This approach is suitable for operations whose branching off is more expensive than operating over all values. If the operation is expensive, then using a normal iterator may be justified.
diff --git a/guide/src/low_end.md b/guide/src/low_end.md
@@ -0,0 +1,63 @@
+# Low end API
+
+The most important design decision of this crate is that contiguous regions are shared via an `Arc`. In this context, the operation of slicing a memory region is `O(1)` because it corresponds to incrementing an offset. The tradeoff is that once under and `Arc`, memory regions are imutable.
+Because bitmaps' offsets are measured in bits, but other quantities are measured bytes, this crate contains 2 main types of containers of contiguous memory regions:
+
+* `Buffer<T>`: handle contigous memory regions of type T whose offsets are measured in items
+* `Bitmap`: handle contigous memory regions of bits whose offsets are measured in bits
+
+These hold _all_ data-related memory in this crate.
+
+Due to their intrinsic imutability, each container has a corresponding mutable (and non-sharable) variant:
+
+* `MutableBuffer<T>`
+* `Buffer<T>`
+* `MutableBitmap`
+* `Bitmap`
+
+Some examples:
+
+Create a new `Buffer<u32>`:
+
+```rust
+# use arrow2::buffer::Buffer;
+# fn main() {
+let x = Buffer::from(&[1u32, 2, 3]);
+assert_eq!(x.as_slice(), &[1u32, 2, 3]);
+
+let x = x.slice(1, 2);
+assert_eq!(x.as_slice(), &[2, 3]);
+# }
+```
+
+Using a `MutableBuffer<f64>`:
+
+```rust
+# use arrow2::buffer::{Buffer, MutableBuffer};
+# fn main() {
+let mut x = MutableBuffer::with_capacity(4);
+(0..3).for_each(|i| {
+    x.push(i as f64)
+});
+let x: Buffer<f64> = x.into();
+assert_eq!(x.as_slice(), &[0.0, 1.0, 2.0]);
+# }
+```
+
+In this context, `MutableBuffer` is the closest API to rust's `Vec`.
+
+The following demonstrates how to efficiently 
+perform an operation from an iterator of [TrustedLen](https://doc.rust-lang.org/std/iter/trait.TrustedLen.html):
+
+```rust
+# use std::iter::FromIterator;
+# use arrow2::buffer::{Buffer, MutableBuffer};
+# fn main() {
+let x = Buffer::from_iter((0..1000));
+let iter = x.as_slice().iter().map(|x| x * 2);
+let y = unsafe { Buffer::from_trusted_len_iter(iter) };
+assert_eq!(y.as_slice()[50], 100);
+# }
+```
+
+Using `from_trusted_len_iter` often causes the compiler to auto-vectorize.
diff --git a/src/datatypes/field.rs b/src/datatypes/field.rs
@@ -127,7 +127,7 @@ impl Field {
     /// Example:
     ///
     /// ```
-    /// use arrow::datatypes::*;
+    /// use arrow2::datatypes::*;
     ///
     /// let mut field = Field::new("c1", DataType::Int64, false);
     /// assert!(field.try_merge(&Field::new("c1", DataType::Int64, true)).is_ok());

diff --git a/src/datatypes/schema.rs b/src/datatypes/schema.rs
@@ -49,8 +49,7 @@ impl Schema {
     /// # Example
     ///
     /// ```
-    /// # extern crate arrow;
-    /// # use arrow::datatypes::{Field, DataType, Schema};
+    /// # use arrow2::datatypes::{Field, DataType, Schema};
     /// let field_a = Field::new("a", DataType::Int64, false);
     /// let field_b = Field::new("b", DataType::Boolean, false);
     ///
@@ -66,8 +65,7 @@ impl Schema {
     /// # Example
     ///
     /// ```
-    /// # extern crate arrow;
-    /// # use arrow::datatypes::{Field, DataType, Schema};
+    /// # use arrow2::datatypes::{Field, DataType, Schema};
     /// # use std::collections::HashMap;
     /// let field_a = Field::new("a", DataType::Int64, false);
     /// let field_b = Field::new("b", DataType::Boolean, false);
@@ -95,7 +93,7 @@ impl Schema {
     /// Example:
     ///
     /// ```
-    /// use arrow::datatypes::*;
+    /// use arrow2::datatypes::*;
     ///
     /// let merged = Schema::try_merge(vec![
     ///     Schema::new(vec![
@@ -125,7 +123,7 @@ impl Schema {
                 let Schema {
                     metadata,
                     fields,
-                    is_little_endian,
+                    is_little_endian: _,
                 } = schema;
                 for (key, value) in metadata.into_iter() {
                     // merge metadata

diff --git a/src/docs.rs b/src/docs.rs
@@ -0,0 +1,2 @@
+doc_comment::doctest!("../guide/src/low_end.md");
+doc_comment::doctest!("../guide/src/high_end.md");
diff --git a/src/ffi/ffi.rs b/src/ffi/ffi.rs
@@ -15,47 +15,6 @@
 // specific language governing permissions and limitations
 // under the License.
 
-//! Contains declarations to bind to the [C Data Interface](https://arrow.apache.org/docs/format/CDataInterface.html).
-//!
-//! Generally, this module is divided in two main interfaces:
-//! One interface maps C ABI to native Rust types, i.e. convert c-pointers, c_char, to native rust.
-//! This is handled by [FFI_ArrowSchema] and [FFI_ArrowArray].
-//!
-//! The second interface maps native Rust types to the Rust-specific implementation of Arrow such as `format` to `Datatype`,
-//! `Buffer`, etc. This is handled by `ArrowArray`.
-//!
-//! ```rust
-//! # use std::sync::Arc;
-//! # use arrow::array::{Int32Array, Array, ArrayData, make_array_from_raw};
-//! # use arrow::error::{Result, ArrowError};
-//! # use arrow::compute::kernels::arithmetic;
-//! # use std::convert::TryFrom;
-//! # fn main() -> Result<()> {
-//! // create an array natively
-//! let array = Int32Array::from(vec![Some(1), None, Some(3)]);
-//!
-//! // export it
-//! let (array_ptr, schema_ptr) = array.to_raw()?;
-//!
-//! // consumed and used by something else...
-//!
-//! // import it
-//! let array = unsafe { make_array_from_raw(array_ptr, schema_ptr)? };
-//!
-//! // perform some operation
-//! let array = array.as_any().downcast_ref::<Int32Array>().ok_or(
-//!     ArrowError::ParseError("Expects an int32".to_string()),
-//! )?;
-//! let array = arithmetic::add(&array, &array)?;
-//!
-//! // verify
-//! assert_eq!(array, Int32Array::from(vec![Some(2), None, Some(6)]));
-//!
-//! // (drop/release)
-//! Ok(())
-//! }
-//! ```
-
 /*
 # Design: