This repository has been archived by the owner on Feb 18, 2024. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 221
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
1 parent
9cc7c10
commit 8b7bdee
Showing
26 changed files
with
396 additions
and
290 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,2 +1 @@ | ||
/target | ||
Cargo.lock | ||
target |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
book |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
[book] | ||
authors = ["Jorge C. Leitao"] | ||
language = "en" | ||
multilingual = false | ||
src = "src" | ||
title = "arrow 2 documentation" |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,13 @@ | ||
# Arrow2 | ||
|
||
Arrow2 is a `safe` Rust library that implements data structures and functionality enabling | ||
interoperability with the arrow format. | ||
|
||
The typical use-case for this library is to perform CPU and memory-intensive analytics in a format | ||
that allows for IPC and FFI. | ||
|
||
Arrow2 is divided in two main parts: a [low-end API](./low_end.md) to efficiently | ||
operate with contiguous memory regions, and a [high-end API](./high_end.md) to operate with | ||
arrow arrays, logical types, schemas, etc. | ||
|
||
Finally, this crate has an anxiliary section devoted to IO (CSV, JSON). |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
# Summary | ||
|
||
- [Arrow2](./README.md) | ||
- [The arrow format](./arrow.md) | ||
- [Low end API](./low_end.md) | ||
- [High end API](./high_end.md) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
# Arrow | ||
|
||
[Arrow](https://arrow.apache.org/) as in-memory columnar format for modern analytics that natively supports the concept of a null value, nested structs, and many others. | ||
|
||
It has an IPC protocol (based on flat buffers) and a stable C-represented ABI for intra-process communication via foreign interfaces (FFI). | ||
|
||
The arrow format recommends allocating memory along cache lines to leverage modern archtectures, | ||
as well as using a shared memory model that enables multi-threading. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,152 @@ | ||
# High end API | ||
|
||
The simplest way to think about an arrow `Array` is that it is a `Vec<Option<T>>` | ||
with a logical type associated with it. | ||
|
||
Probably the simplest array is `PrimitiveArray<T>`. It can be constructed | ||
from an iterator as follows: | ||
|
||
```rust | ||
# use arrow2::array::{Array, PrimitiveArray, Primitive}; | ||
# use arrow2::datatypes::DataType; | ||
# fn main() { | ||
let array: PrimitiveArray<i32> = [Some(1), None, Some(123)] | ||
.iter() | ||
.collect::<Primitive<i32>>() | ||
.to(DataType::Int32); | ||
assert_eq!(array.len(), 3) | ||
# } | ||
``` | ||
|
||
A `PrimitiveArray` has 3 components: | ||
|
||
1. A physical type (`i32`) | ||
2. A logical type (`DataType::Int32`) | ||
3. Data | ||
|
||
Its main difference vs a `Vec<Option<T>>` are: | ||
|
||
* Its data is layed out in memory as a `Buffer<T>` and a `Option<Bitmap>`. | ||
* `PrimitivArray<T>` has an associated datatype. | ||
|
||
The first difference allows interoperability with Arrow's ecosystem and efficient operations (we will re-visit this in a second); | ||
the second difference allows semantic meaning to the array: in the example | ||
|
||
```rust | ||
# use arrow2::array::Primitive; | ||
# use arrow2::datatypes::DataType; | ||
# fn main() { | ||
let ints = Primitive::<i32>::from(&[Some(1), None]).to(DataType::Int32); | ||
let dates = Primitive::<i32>::from(&[Some(1), None]).to(DataType::Date32); | ||
# } | ||
``` | ||
|
||
`ints` and `dates` have the same in-memory representation but different logic representations, | ||
(e.g. dates are usually represented as a string). | ||
|
||
## Dynamic Array | ||
|
||
There is a more powerful aspect of arrow arrays, and that is that they all | ||
implement the trait `Array` and can be casted to `&dyn Array`, i.e. they can be turned into | ||
a trait object. This in particular enables arrays to have types that are dynamic in nature. | ||
`ListArray<i32>` is an example of a nested (dynamic) array: | ||
|
||
```rust | ||
# use std::sync::Arc; | ||
# use arrow2::array::{Array, ListPrimitive, ListArray, Primitive}; | ||
# use arrow2::datatypes::DataType; | ||
# fn main() { | ||
let data = vec![ | ||
Some(vec![Some(1i32), Some(2), Some(3)]), | ||
None, | ||
Some(vec![Some(4), None, Some(6)]), | ||
]; | ||
|
||
let a: ListArray<i32> = data | ||
.into_iter() | ||
.collect::<ListPrimitive<i32, Primitive<i32>, i32>>() | ||
.to(ListArray::<i32>::default_datatype(DataType::Int32)); | ||
|
||
let inner: &Arc<dyn Array> = a.values(); | ||
# } | ||
``` | ||
|
||
Note how we have not specified the the inner type explicitely the types of `ListArray<i32>`. | ||
Instead, `ListArray` does have an inner `Array` representing all its values. | ||
|
||
## Downcast and `as_any` | ||
|
||
Given a trait object `&dyn Array`, we know its logical type via `Array::data_type()` and can use it to downcast the array to its concrete type: | ||
|
||
```rust | ||
# use arrow2::array::{Array, PrimitiveArray, Primitive}; | ||
# use arrow2::datatypes::DataType; | ||
# fn main() { | ||
let array: PrimitiveArray<i32> = [Some(1), None, Some(123)] | ||
.iter() | ||
.collect::<Primitive<i32>>() | ||
.to(DataType::Int32); | ||
let array = &array as &dyn Array; | ||
|
||
let array = array.as_any().downcast_ref::<PrimitiveArray<i32>>().unwrap(); | ||
# } | ||
``` | ||
|
||
## Iterators | ||
|
||
We've already seen how to create an array from an iterator. Most arrays also implement | ||
`IntoIterator`: | ||
|
||
```rust | ||
# use arrow2::array::{Array, PrimitiveArray, Primitive}; | ||
# use arrow2::datatypes::DataType; | ||
# fn main() { | ||
let array: PrimitiveArray<i32> = [Some(1), None, Some(123)] | ||
.iter() | ||
.collect::<Primitive<i32>>() | ||
.to(DataType::Int32); | ||
|
||
for item in array.iter() { | ||
if let Some(value) = item { | ||
println!("{}", value); | ||
} else { | ||
println!("NULL"); | ||
} | ||
} | ||
# } | ||
``` | ||
|
||
## Vectorized operations | ||
|
||
One of the main advantages of the arrow format and its memory layout is that | ||
it often enables SIMD. For example, an unary operation `op` on a `PrimitiveArray` is likely auto-vectorized on the following code: | ||
|
||
```rust | ||
# use arrow2::buffer::Buffer; | ||
# use arrow2::{ | ||
# array::{Array, PrimitiveArray}, | ||
# buffer::NativeType, | ||
# datatypes::DataType, | ||
# }; | ||
|
||
pub fn unary<I, F, O>(array: &PrimitiveArray<I>, op: F, data_type: &DataType) -> PrimitiveArray<O> | ||
where | ||
I: NativeType, | ||
O: NativeType, | ||
F: Fn(I) -> O, | ||
{ | ||
let values = array.values().iter().map(|v| op(*v)); | ||
// Soundness | ||
// `values` is an iterator with a known size because arrays are sized. | ||
let values = unsafe { Buffer::from_trusted_len_iter(values) }; | ||
|
||
PrimitiveArray::<O>::from_data(data_type.clone(), values, array.nulls().clone()) | ||
} | ||
``` | ||
|
||
Two notes: | ||
|
||
1. We have used `from_trusted_len_iter`. This is because [`TrustedLen`](https://doc.rust-lang.org/std/iter/trait.TrustedLen.html) and trait specialization are still unstable. | ||
|
||
2. Note how we use `op` on the array's values irrespectively of their validity | ||
and cloned the validity `Bitmap`. This approach is suitable for operations whose branching off is more expensive than operating over all values. If the operation is expensive, then using a normal iterator may be justified. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,63 @@ | ||
# Low end API | ||
|
||
The most important design decision of this crate is that contiguous regions are shared via an `Arc`. In this context, the operation of slicing a memory region is `O(1)` because it corresponds to incrementing an offset. The tradeoff is that once under and `Arc`, memory regions are imutable. | ||
Because bitmaps' offsets are measured in bits, but other quantities are measured bytes, this crate contains 2 main types of containers of contiguous memory regions: | ||
|
||
* `Buffer<T>`: handle contigous memory regions of type T whose offsets are measured in items | ||
* `Bitmap`: handle contigous memory regions of bits whose offsets are measured in bits | ||
|
||
These hold _all_ data-related memory in this crate. | ||
|
||
Due to their intrinsic imutability, each container has a corresponding mutable (and non-sharable) variant: | ||
|
||
* `MutableBuffer<T>` | ||
* `Buffer<T>` | ||
* `MutableBitmap` | ||
* `Bitmap` | ||
|
||
Some examples: | ||
|
||
Create a new `Buffer<u32>`: | ||
|
||
```rust | ||
# use arrow2::buffer::Buffer; | ||
# fn main() { | ||
let x = Buffer::from(&[1u32, 2, 3]); | ||
assert_eq!(x.as_slice(), &[1u32, 2, 3]); | ||
|
||
let x = x.slice(1, 2); | ||
assert_eq!(x.as_slice(), &[2, 3]); | ||
# } | ||
``` | ||
|
||
Using a `MutableBuffer<f64>`: | ||
|
||
```rust | ||
# use arrow2::buffer::{Buffer, MutableBuffer}; | ||
# fn main() { | ||
let mut x = MutableBuffer::with_capacity(4); | ||
(0..3).for_each(|i| { | ||
x.push(i as f64) | ||
}); | ||
let x: Buffer<f64> = x.into(); | ||
assert_eq!(x.as_slice(), &[0.0, 1.0, 2.0]); | ||
# } | ||
``` | ||
|
||
In this context, `MutableBuffer` is the closest API to rust's `Vec`. | ||
|
||
The following demonstrates how to efficiently | ||
perform an operation from an iterator of [TrustedLen](https://doc.rust-lang.org/std/iter/trait.TrustedLen.html): | ||
|
||
```rust | ||
# use std::iter::FromIterator; | ||
# use arrow2::buffer::{Buffer, MutableBuffer}; | ||
# fn main() { | ||
let x = Buffer::from_iter((0..1000)); | ||
let iter = x.as_slice().iter().map(|x| x * 2); | ||
let y = unsafe { Buffer::from_trusted_len_iter(iter) }; | ||
assert_eq!(y.as_slice()[50], 100); | ||
# } | ||
``` | ||
|
||
Using `from_trusted_len_iter` often causes the compiler to auto-vectorize. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
doc_comment::doctest!("../guide/src/low_end.md"); | ||
doc_comment::doctest!("../guide/src/high_end.md"); |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.