This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Commit

Added guide
jorgecarleitao committed Feb 25, 2021
1 parent 9cc7c10 commit 8b7bdee
Showing 26 changed files with 396 additions and 290 deletions.
3 changes: 1 addition & 2 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -1,2 +1 @@
-/target
-Cargo.lock
+target
1 change: 1 addition & 0 deletions Cargo.toml
@@ -31,6 +31,7 @@ rand = "0.7"
criterion = "0.3"
tempfile = "3"
flate2 = "1"
+doc-comment = "0.3"

[features]
default = ["io_csv", "io_json", "io_ipc", "io_json_integration"]
1 change: 1 addition & 0 deletions guide/.gitignore
@@ -0,0 +1 @@
book
6 changes: 6 additions & 0 deletions guide/book.toml
@@ -0,0 +1,6 @@
[book]
authors = ["Jorge C. Leitao"]
language = "en"
multilingual = false
src = "src"
title = "arrow 2 documentation"
13 changes: 13 additions & 0 deletions guide/src/README.md
@@ -0,0 +1,13 @@
# Arrow2

Arrow2 is a `safe` Rust library that implements data structures and functionality enabling
interoperability with the arrow format.

The typical use-case for this library is to perform CPU and memory-intensive analytics in a format
that allows for IPC and FFI.

Arrow2 is divided into two main parts: a [low-end API](./low_end.md) to efficiently
operate with contiguous memory regions, and a [high-end API](./high_end.md) to operate with
arrow arrays, logical types, schemas, etc.

Finally, this crate has an auxiliary section devoted to IO (CSV, JSON).
6 changes: 6 additions & 0 deletions guide/src/SUMMARY.md
@@ -0,0 +1,6 @@
# Summary

- [Arrow2](./README.md)
- [The arrow format](./arrow.md)
- [Low end API](./low_end.md)
- [High end API](./high_end.md)
8 changes: 8 additions & 0 deletions guide/src/arrow.md
@@ -0,0 +1,8 @@
# Arrow

[Arrow](https://arrow.apache.org/) is an in-memory columnar format for modern analytics that natively supports null values, nested structs, and many other features.

It has an IPC protocol (based on FlatBuffers) and a stable C ABI for intra-process communication via a foreign function interface (FFI).

The arrow format recommends allocating memory along cache lines to leverage modern architectures,
as well as using a shared memory model that enables multi-threading.
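
To make "columnar" concrete, here is a minimal sketch in plain Rust that contrasts a row-oriented layout with a column-oriented one. The `Row` and `Columns` types and both functions are hypothetical illustrations; they are not part of arrow2 or of the Arrow specification.

```rust
/// Row-oriented: each record stores all of its fields together.
struct Row {
    id: i32,
    price: f64,
}

/// Column-oriented: each field lives in its own contiguous vector.
struct Columns {
    ids: Vec<i32>,
    prices: Vec<f64>,
}

/// Transposes a slice of rows into columns.
fn to_columns(rows: &[Row]) -> Columns {
    Columns {
        ids: rows.iter().map(|r| r.id).collect(),
        prices: rows.iter().map(|r| r.price).collect(),
    }
}

/// Summing one column only touches that column's contiguous memory.
fn total_price(columns: &Columns) -> f64 {
    columns.prices.iter().sum()
}
```

An analytic query that reads a single field scans one dense vector instead of striding over whole records, which is the access pattern a columnar format is designed for.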
152 changes: 152 additions & 0 deletions guide/src/high_end.md
@@ -0,0 +1,152 @@
# High end API

The simplest way to think about an arrow `Array` is that it is a `Vec<Option<T>>`
with a logical type associated with it.

Probably the simplest array is `PrimitiveArray<T>`. It can be constructed
from an iterator as follows:

```rust
# use arrow2::array::{Array, PrimitiveArray, Primitive};
# use arrow2::datatypes::DataType;
# fn main() {
let array: PrimitiveArray<i32> = [Some(1), None, Some(123)]
    .iter()
    .collect::<Primitive<i32>>()
    .to(DataType::Int32);
assert_eq!(array.len(), 3)
# }
```

A `PrimitiveArray` has 3 components:

1. A physical type (`i32`)
2. A logical type (`DataType::Int32`)
3. Data

Its main differences from a `Vec<Option<T>>` are:

* Its data is laid out in memory as a `Buffer<T>` and an `Option<Bitmap>`.
* `PrimitiveArray<T>` has an associated datatype.
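
As a rough illustration of the first point, the following std-only sketch splits a slice of `Option<i32>` into a dense values vector and a bit-packed validity mask. The function `split` is hypothetical and uses `Vec<i32>` and `Vec<u8>` as stand-ins for `Buffer<T>` and `Bitmap`; it only assumes the least-significant-bit numbering that the Arrow format uses for validity bits.

```rust
/// Splits optional values into a dense values vector and a bit-packed
/// validity mask (bit `i` set means slot `i` is non-null).
/// Returns `None` for the mask when there are no nulls.
fn split(data: &[Option<i32>]) -> (Vec<i32>, Option<Vec<u8>>) {
    let mut values = Vec::with_capacity(data.len());
    let mut validity = vec![0u8; (data.len() + 7) / 8];
    let mut has_nulls = false;
    for (i, item) in data.iter().enumerate() {
        match item {
            Some(v) => {
                values.push(*v);
                validity[i / 8] |= 1u8 << (i % 8);
            }
            None => {
                // A null slot still occupies a value; its content is arbitrary.
                values.push(0);
                has_nulls = true;
            }
        }
    }
    (values, if has_nulls { Some(validity) } else { None })
}
```

Keeping the values dense, with nullness stored out of band, is what lets operations run over the values without inspecting an `Option` discriminant per item.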

The first difference enables interoperability with Arrow's ecosystem and efficient operations (we will revisit this in a second);
the second difference gives the array semantic meaning: in the example

```rust
# use arrow2::array::Primitive;
# use arrow2::datatypes::DataType;
# fn main() {
let ints = Primitive::<i32>::from(&[Some(1), None]).to(DataType::Int32);
let dates = Primitive::<i32>::from(&[Some(1), None]).to(DataType::Date32);
# }
```

`ints` and `dates` have the same in-memory representation but different logical representations
(e.g. dates are usually displayed as strings).
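
The sketch below shows, in plain Rust, how one physical `i32` can be rendered differently depending on a logical type. The `Logical` enum and the `display` function are hypothetical and are not arrow2's `DataType` or its display logic; the only fact taken from the Arrow format is that `Date32` counts days since the UNIX epoch. The day-to-date conversion is a standard proleptic Gregorian calendar algorithm.

```rust
/// Hypothetical stand-in for a logical type; not arrow2's `DataType`.
enum Logical {
    Int32,
    /// Days since the UNIX epoch (1970-01-01).
    Date32,
}

/// Converts days since 1970-01-01 to (year, month, day) in the
/// proleptic Gregorian calendar.
fn civil_from_days(days: i64) -> (i64, i64, i64) {
    let z = days + 719_468; // shift the epoch to 0000-03-01
    let era = z.div_euclid(146_097); // number of 400-year cycles
    let doe = z.rem_euclid(146_097); // day of era, in [0, 146096]
    let yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365;
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100); // March-based day of year
    let mp = (5 * doy + 2) / 153; // month index, March = 0
    let day = doy - (153 * mp + 2) / 5 + 1;
    let month = if mp < 10 { mp + 3 } else { mp - 9 };
    let year = yoe + era * 400 + if month <= 2 { 1 } else { 0 };
    (year, month, day)
}

/// Renders the same physical `i32` according to its logical type.
fn display(value: i32, logical: &Logical) -> String {
    match logical {
        Logical::Int32 => value.to_string(),
        Logical::Date32 => {
            let (y, m, d) = civil_from_days(value as i64);
            format!("{:04}-{:02}-{:02}", y, m, d)
        }
    }
}
```

With this sketch, `display(1, &Logical::Int32)` yields `"1"` while `display(1, &Logical::Date32)` yields `"1970-01-02"`: same bits, different meaning.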

## Dynamic Array

A more powerful aspect of arrow arrays is that they all
implement the trait `Array` and can be cast to `&dyn Array`, i.e. they can be turned into
a trait object. In particular, this enables arrays to have types that are dynamic in nature.
`ListArray<i32>` is an example of a nested (dynamic) array:

```rust
# use std::sync::Arc;
# use arrow2::array::{Array, ListPrimitive, ListArray, Primitive};
# use arrow2::datatypes::DataType;
# fn main() {
let data = vec![
    Some(vec![Some(1i32), Some(2), Some(3)]),
    None,
    Some(vec![Some(4), None, Some(6)]),
];

let a: ListArray<i32> = data
    .into_iter()
    .collect::<ListPrimitive<i32, Primitive<i32>, i32>>()
    .to(ListArray::<i32>::default_datatype(DataType::Int32));

let inner: &Arc<dyn Array> = a.values();
# }
```

Note how the type `ListArray<i32>` does not specify the type of its inner values.
Instead, `ListArray` holds an inner `Array` (a trait object) that represents all its values.
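
A simplified model of this idea, using only the standard library, is sketched below. The `Array` trait, `Int32Values` and `List` are hypothetical and much smaller than their arrow2 counterparts; validity is omitted for brevity. The sketch assumes the usual Arrow list layout, in which slot `i` covers the child range `offsets[i]..offsets[i + 1]`.

```rust
use std::sync::Arc;

/// Hypothetical minimal array trait; arrow2's `Array` has more methods.
trait Array {
    fn len(&self) -> usize;
}

/// A child array of plain `i32` values.
struct Int32Values(Vec<i32>);

impl Array for Int32Values {
    fn len(&self) -> usize {
        self.0.len()
    }
}

/// A list array: slot `i` spans `values[offsets[i]..offsets[i + 1]]`.
/// The child is a trait object, so its type is not part of `List`'s type.
struct List {
    offsets: Vec<i32>,
    values: Arc<dyn Array>,
}

impl List {
    /// Number of list slots.
    fn len(&self) -> usize {
        self.offsets.len() - 1
    }

    /// Number of child items in slot `i`.
    fn slot_len(&self, i: usize) -> usize {
        (self.offsets[i + 1] - self.offsets[i]) as usize
    }
}
```

For the data in the example above, the offsets would be `[0, 3, 3, 6]`: the null slot simply has zero length, and all six child values sit in one contiguous child array.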

## Downcast and `as_any`

Given a trait object `&dyn Array`, we know its logical type via `Array::data_type()` and can use it to downcast the array to its concrete type:

```rust
# use arrow2::array::{Array, PrimitiveArray, Primitive};
# use arrow2::datatypes::DataType;
# fn main() {
let array: PrimitiveArray<i32> = [Some(1), None, Some(123)]
    .iter()
    .collect::<Primitive<i32>>()
    .to(DataType::Int32);
let array = &array as &dyn Array;

let array = array.as_any().downcast_ref::<PrimitiveArray<i32>>().unwrap();
# }
```
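
The same pattern can be reproduced with the standard library alone, which may help clarify what `as_any` does. In the sketch below, `Kind`, `Array`, `Int32Array` and `Float64Array` are hypothetical and only mimic the shape of the arrow2 types; `std::any::Any::downcast_ref` is the real standard-library method.

```rust
use std::any::Any;

/// Hypothetical logical-type tag; arrow2's `DataType` has many more variants.
enum Kind {
    Int32,
    Float64,
}

/// Hypothetical minimal array trait exposing `as_any` for downcasting.
trait Array {
    fn as_any(&self) -> &dyn Any;
    fn kind(&self) -> Kind;
}

struct Int32Array(Vec<i32>);
struct Float64Array(Vec<f64>);

impl Array for Int32Array {
    fn as_any(&self) -> &dyn Any {
        self
    }
    fn kind(&self) -> Kind {
        Kind::Int32
    }
}

impl Array for Float64Array {
    fn as_any(&self) -> &dyn Any {
        self
    }
    fn kind(&self) -> Kind {
        Kind::Float64
    }
}

/// Sums any supported array as `f64` by dispatching on its kind.
fn sum(array: &dyn Array) -> f64 {
    match array.kind() {
        Kind::Int32 => {
            // The kind tells us which concrete type to expect.
            let a = array.as_any().downcast_ref::<Int32Array>().unwrap();
            a.0.iter().map(|v| *v as f64).sum()
        }
        Kind::Float64 => {
            let a = array.as_any().downcast_ref::<Float64Array>().unwrap();
            a.0.iter().sum()
        }
    }
}
```

The `unwrap` is only sound as long as the tag and the concrete type agree, which is why the logical type is checked before downcasting.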

## Iterators

We've already seen how to create an array from an iterator. Most arrays also implement
`IntoIterator`:

```rust
# use arrow2::array::{Array, PrimitiveArray, Primitive};
# use arrow2::datatypes::DataType;
# fn main() {
let array: PrimitiveArray<i32> = [Some(1), None, Some(123)]
    .iter()
    .collect::<Primitive<i32>>()
    .to(DataType::Int32);

for item in array.iter() {
    if let Some(value) = item {
        println!("{}", value);
    } else {
        println!("NULL");
    }
}
# }
```
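
Conceptually, such an iterator pairs each value with its validity bit. The following std-only sketch shows one way this could look; `iter_optional` is a hypothetical helper over plain slices and is not how arrow2 necessarily implements its iterators.

```rust
/// Yields `Some(value)` for valid slots and `None` for null slots by
/// pairing each value with its bit in an LSB-ordered validity mask.
fn iter_optional<'a>(
    values: &'a [i32],
    validity: &'a [u8],
) -> impl Iterator<Item = Option<i32>> + 'a {
    values.iter().enumerate().map(move |(i, v)| {
        let is_valid = ((validity[i / 8] >> (i % 8)) & 1) == 1;
        if is_valid {
            Some(*v)
        } else {
            None
        }
    })
}
```

Note that this form branches on every item, which is convenient but can be slower than operating on the dense values directly.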

## Vectorized operations

One of the main advantages of the arrow format and its memory layout is that
it often enables SIMD. For example, a unary operation `op` on a `PrimitiveArray` is likely to be auto-vectorized in the following code:

```rust
# use arrow2::buffer::Buffer;
# use arrow2::{
# array::{Array, PrimitiveArray},
# buffer::NativeType,
# datatypes::DataType,
# };

pub fn unary<I, F, O>(array: &PrimitiveArray<I>, op: F, data_type: &DataType) -> PrimitiveArray<O>
where
    I: NativeType,
    O: NativeType,
    F: Fn(I) -> O,
{
    let values = array.values().iter().map(|v| op(*v));
    // Soundness
    // `values` is an iterator with a known size because arrays are sized.
    let values = unsafe { Buffer::from_trusted_len_iter(values) };

    PrimitiveArray::<O>::from_data(data_type.clone(), values, array.nulls().clone())
}
```

Two notes:

1. We have used `from_trusted_len_iter`, which is `unsafe`, because [`TrustedLen`](https://doc.rust-lang.org/std/iter/trait.TrustedLen.html) and trait specialization are still unstable.

2. Note how we apply `op` to all of the array's values irrespective of their validity
and clone the validity `Bitmap`. This approach is suitable for operations where branching on validity is more expensive than operating over all values. If the operation is expensive, then using a normal iterator may be justified.
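
The trade-off in the second note can be sketched with plain vectors. Both functions below are hypothetical and use `Vec<i32>` and `Option<Vec<u8>>` as stand-ins for the values buffer and the validity bitmap: the first applies `op` to every slot without looking at validity, the second branches on each validity bit.

```rust
/// Applies `op` to every slot, including null ones, and reuses the
/// validity mask unchanged. The loop body has no branch on validity.
fn unary_dense<F: Fn(i32) -> i32>(
    values: &[i32],
    validity: &Option<Vec<u8>>,
    op: F,
) -> (Vec<i32>, Option<Vec<u8>>) {
    let out: Vec<i32> = values.iter().map(|v| op(*v)).collect();
    (out, validity.clone())
}

/// Applies `op` only to valid slots, branching on each validity bit.
/// Null slots get a placeholder value of 0.
fn unary_branching<F: Fn(i32) -> i32>(
    values: &[i32],
    validity: &Option<Vec<u8>>,
    op: F,
) -> (Vec<i32>, Option<Vec<u8>>) {
    let out: Vec<i32> = values
        .iter()
        .enumerate()
        .map(|(i, v)| {
            let is_valid = match validity {
                Some(mask) => ((mask[i / 8] >> (i % 8)) & 1) == 1,
                None => true,
            };
            if is_valid {
                op(*v)
            } else {
                0
            }
        })
        .collect();
    (out, validity.clone())
}
```

Both variants yield the same result for valid slots and the same validity; they differ only in what ends up in null slots, which readers of the array must ignore anyway. The dense variant requires that `op` is safe to call on whatever value a null slot happens to hold.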
63 changes: 63 additions & 0 deletions guide/src/low_end.md
@@ -0,0 +1,63 @@
# Low end API

The most important design decision of this crate is that contiguous regions are shared via an `Arc`. In this context, the operation of slicing a memory region is `O(1)` because it corresponds to incrementing an offset. The tradeoff is that once under an `Arc`, memory regions are immutable.
Because bitmaps' offsets are measured in bits, but other quantities are measured in bytes, this crate contains 2 main types of containers of contiguous memory regions:

* `Buffer<T>`: handles contiguous memory regions of type T whose offsets are measured in items
* `Bitmap`: handles contiguous memory regions of bits whose offsets are measured in bits

These hold _all_ data-related memory in this crate.
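
To illustrate why slicing is `O(1)` under this design, here is a minimal std-only sketch. `SharedBuffer` is a hypothetical type written for this guide and is not arrow2's `Buffer<T>`; it only demonstrates the general technique of sharing an allocation through an `Arc` and adjusting an offset.

```rust
use std::sync::Arc;

/// Hypothetical shared, immutable buffer; not arrow2's `Buffer<T>`.
struct SharedBuffer<T> {
    data: Arc<Vec<T>>,
    offset: usize, // measured in items
    len: usize,
}

impl<T> SharedBuffer<T> {
    fn new(data: Vec<T>) -> Self {
        let len = data.len();
        Self {
            data: Arc::new(data),
            offset: 0,
            len,
        }
    }

    /// O(1): bumps the reference count and adjusts the offset; no copy.
    fn slice(&self, offset: usize, len: usize) -> Self {
        assert!(offset + len <= self.len, "slice out of bounds");
        Self {
            data: Arc::clone(&self.data),
            offset: self.offset + offset,
            len,
        }
    }

    /// Views the region this handle refers to.
    fn as_slice(&self) -> &[T] {
        &self.data[self.offset..self.offset + self.len]
    }
}
```

Because every slice shares the same allocation, none of the handles can be allowed to mutate it, which is the immutability trade-off described above.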

Due to their intrinsic immutability, each container has a corresponding mutable (and non-shareable) variant:

* `MutableBuffer<T>` (for `Buffer<T>`)
* `MutableBitmap` (for `Bitmap`)
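
The split between a mutable builder and an immutable, shareable container can be sketched as follows. `BitmapBuilder` and `FrozenBitmap` are hypothetical names chosen for this example and are not arrow2's `MutableBitmap` and `Bitmap`; the sketch also shows an offset measured in bits rather than in items.

```rust
use std::sync::Arc;

/// Hypothetical growable bitmap; not arrow2's `MutableBitmap`.
struct BitmapBuilder {
    bytes: Vec<u8>,
    len: usize, // measured in bits
}

impl BitmapBuilder {
    fn new() -> Self {
        Self { bytes: Vec::new(), len: 0 }
    }

    /// Appends one bit, growing the byte vector when needed.
    fn push(&mut self, bit: bool) {
        if self.len % 8 == 0 {
            self.bytes.push(0);
        }
        if bit {
            let last = self.bytes.len() - 1;
            self.bytes[last] |= 1u8 << (self.len % 8);
        }
        self.len += 1;
    }

    /// Consumes the builder; afterwards the bits can be shared but not mutated.
    fn freeze(self) -> FrozenBitmap {
        FrozenBitmap {
            bytes: Arc::new(self.bytes),
            offset: 0,
            len: self.len,
        }
    }
}

/// Hypothetical immutable bitmap whose offset is measured in bits.
struct FrozenBitmap {
    bytes: Arc<Vec<u8>>,
    offset: usize,
    len: usize,
}

impl FrozenBitmap {
    /// Reads bit `i` relative to this bitmap's offset.
    fn get(&self, i: usize) -> bool {
        assert!(i < self.len, "index out of bounds");
        let bit = self.offset + i;
        ((self.bytes[bit / 8] >> (bit % 8)) & 1) == 1
    }

    /// O(1) slice at bit granularity; shares the underlying bytes.
    fn slice(&self, offset: usize, len: usize) -> Self {
        assert!(offset + len <= self.len, "slice out of bounds");
        Self {
            bytes: Arc::clone(&self.bytes),
            offset: self.offset + offset,
            len,
        }
    }
}
```

Because `freeze` takes `self` by value, the type system guarantees that no mutable handle survives once the data is placed behind the `Arc`.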

Some examples:

Create a new `Buffer<u32>`:

```rust
# use arrow2::buffer::Buffer;
# fn main() {
let x = Buffer::from(&[1u32, 2, 3]);
assert_eq!(x.as_slice(), &[1u32, 2, 3]);

let x = x.slice(1, 2);
assert_eq!(x.as_slice(), &[2, 3]);
# }
```

Using a `MutableBuffer<f64>`:

```rust
# use arrow2::buffer::{Buffer, MutableBuffer};
# fn main() {
let mut x = MutableBuffer::with_capacity(4);
(0..3).for_each(|i| {
    x.push(i as f64)
});
let x: Buffer<f64> = x.into();
assert_eq!(x.as_slice(), &[0.0, 1.0, 2.0]);
# }
```

In this context, `MutableBuffer` is the closest API to Rust's `Vec`.

The following demonstrates how to efficiently
build a buffer from an iterator with a trusted length (see [`TrustedLen`](https://doc.rust-lang.org/std/iter/trait.TrustedLen.html)):

```rust
# use std::iter::FromIterator;
# use arrow2::buffer::{Buffer, MutableBuffer};
# fn main() {
let x = Buffer::from_iter(0..1000);
let iter = x.as_slice().iter().map(|x| x * 2);
// Soundness: `iter` maps over a slice, so its reported length is exact.
let y = unsafe { Buffer::from_trusted_len_iter(iter) };
assert_eq!(y.as_slice()[50], 100);
# }
```

Using `from_trusted_len_iter` often allows the compiler to auto-vectorize the loop.
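
The benefit of knowing the length up front can be approximated in safe, stable Rust with `ExactSizeIterator`. The helper `collect_exact` below is a hypothetical std-only sketch, not arrow2's implementation: it allocates once and then fills the vector. Unlike `TrustedLen`, `ExactSizeIterator` is a safe trait, so unsafe code must not rely on its `len()` for memory safety; this sketch only uses it as a capacity hint.

```rust
/// Collects an iterator whose length is known up front into a vector,
/// reserving the full capacity before pushing any item.
fn collect_exact<T, I: ExactSizeIterator<Item = T>>(iter: I) -> Vec<T> {
    let mut out = Vec::with_capacity(iter.len());
    for item in iter {
        // No reallocation happens here as long as `len()` was accurate.
        out.push(item);
    }
    out
}
```

With a single allocation and no growth checks that depend on the data, the remaining loop is a simple map over contiguous memory, which is the shape compilers tend to vectorize.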
2 changes: 1 addition & 1 deletion src/datatypes/field.rs
@@ -127,7 +127,7 @@ impl Field {
/// Example:
///
/// ```
-/// use arrow::datatypes::*;
+/// use arrow2::datatypes::*;
///
/// let mut field = Field::new("c1", DataType::Int64, false);
/// assert!(field.try_merge(&Field::new("c1", DataType::Int64, true)).is_ok());
10 changes: 4 additions & 6 deletions src/datatypes/schema.rs
@@ -49,8 +49,7 @@ impl Schema {
/// # Example
///
/// ```
-/// # extern crate arrow;
-/// # use arrow::datatypes::{Field, DataType, Schema};
+/// # use arrow2::datatypes::{Field, DataType, Schema};
/// let field_a = Field::new("a", DataType::Int64, false);
/// let field_b = Field::new("b", DataType::Boolean, false);
///
@@ -66,8 +65,7 @@ impl Schema {
/// # Example
///
/// ```
-/// # extern crate arrow;
-/// # use arrow::datatypes::{Field, DataType, Schema};
+/// # use arrow2::datatypes::{Field, DataType, Schema};
/// # use std::collections::HashMap;
/// let field_a = Field::new("a", DataType::Int64, false);
/// let field_b = Field::new("b", DataType::Boolean, false);
@@ -95,7 +93,7 @@ impl Schema {
/// Example:
///
/// ```
-/// use arrow::datatypes::*;
+/// use arrow2::datatypes::*;
///
/// let merged = Schema::try_merge(vec![
/// Schema::new(vec![
@@ -125,7 +123,7 @@ impl Schema {
let Schema {
metadata,
fields,
-is_little_endian,
+is_little_endian: _,
} = schema;
for (key, value) in metadata.into_iter() {
// merge metadata
2 changes: 2 additions & 0 deletions src/docs.rs
@@ -0,0 +1,2 @@
doc_comment::doctest!("../guide/src/low_end.md");
doc_comment::doctest!("../guide/src/high_end.md");
41 changes: 0 additions & 41 deletions src/ffi/ffi.rs
@@ -15,47 +15,6 @@
// specific language governing permissions and limitations
// under the License.

//! Contains declarations to bind to the [C Data Interface](https://arrow.apache.org/docs/format/CDataInterface.html).
//!
//! Generally, this module is divided in two main interfaces:
//! One interface maps C ABI to native Rust types, i.e. convert c-pointers, c_char, to native rust.
//! This is handled by [FFI_ArrowSchema] and [FFI_ArrowArray].
//!
//! The second interface maps native Rust types to the Rust-specific implementation of Arrow such as `format` to `Datatype`,
//! `Buffer`, etc. This is handled by `ArrowArray`.
//!
//! ```rust
//! # use std::sync::Arc;
//! # use arrow::array::{Int32Array, Array, ArrayData, make_array_from_raw};
//! # use arrow::error::{Result, ArrowError};
//! # use arrow::compute::kernels::arithmetic;
//! # use std::convert::TryFrom;
//! # fn main() -> Result<()> {
//! // create an array natively
//! let array = Int32Array::from(vec![Some(1), None, Some(3)]);
//!
//! // export it
//! let (array_ptr, schema_ptr) = array.to_raw()?;
//!
//! // consumed and used by something else...
//!
//! // import it
//! let array = unsafe { make_array_from_raw(array_ptr, schema_ptr)? };
//!
//! // perform some operation
//! let array = array.as_any().downcast_ref::<Int32Array>().ok_or(
//! ArrowError::ParseError("Expects an int32".to_string()),
//! )?;
//! let array = arithmetic::add(&array, &array)?;
//!
//! // verify
//! assert_eq!(array, Int32Array::from(vec![Some(2), None, Some(6)]));
//!
//! // (drop/release)
//! Ok(())
//! }
//! ```

/*
# Design: