JavaScript GeoArrow Module Proposal
The strength of Arrow is in its interoperability, and therefore I think it's worthwhile to discuss how to ensure all the pieces around GeoArrow in JavaScript fit together really well.
This is a corollary to the Python GeoArrow Module Proposal but focused on GeoArrow interoperability in JavaScript and WebAssembly. I don't know anyone doing GeoArrow-Wasm stuff in C, so this will focus on my efforts in Rust and TypeScript. Unlike in Python, there aren't other people currently working on JavaScript GeoArrow infrastructure, so this is a manifesto to solidify my ideas.
WebAssembly limitations
WebAssembly is sandboxed, which means that Wasm code can only access and modify memory within its own memory space. So Wasm code cannot access JavaScript objects directly.
This also means that two Wasm modules can't share memory. So if you have one Wasm-based NPM library that loads GeoParquet to GeoArrow and another Wasm-based NPM library that implements spatial operations on GeoArrow, there must be a copy from the first module's memory space into JavaScript and then into the second module's memory space.
This means that grouping Wasm functionality together into a single module is more performant, as I/O and operations can be done in a single memory space. This runs up against bundle size: JavaScript bundlers are able to tree-shake JavaScript code, but they can't tree-shake a prebuilt Wasm binary. Instead, the original Rust would have to be recompiled, excluding unwanted functions.
The solution I'm gravitating towards is to have a variety of NPM libraries, described in this document, where I/O and operations are distributed both as their own libraries and in a "kitchen sink" build, which contains everything at the cost of a larger bundle size. Advanced users can compile custom Wasm binaries from the Rust source, with only the desired functionality.
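To make the cross-module copy concrete, here is a minimal sketch that uses two bare `WebAssembly.Memory` objects as stand-ins for two separately instantiated Wasm libraries. The byte offsets and the `copyBetweenMemories` helper are assumptions for illustration; a real module would hand out addresses through its own allocator.

```typescript
// Two independent linear memories stand in for two separately instantiated
// Wasm modules (for example a file reader and a compute library).
const readerMemory = new WebAssembly.Memory({ initial: 1 }); // 1 page = 64 KiB
const computeMemory = new WebAssembly.Memory({ initial: 1 });

// Pretend the reader module produced four float64 values at byte offset 1024.
const produced = new Float64Array(readerMemory.buffer, 1024, 4);
produced.set([1.5, 2.5, 3.5, 4.5]);

// Neither module can address the other's memory, so JS has to mediate:
// view the source bytes, then copy them into the destination memory.
// (Hypothetical helper; offsets are hard-coded for the sketch.)
function copyBetweenMemories(
  src: WebAssembly.Memory,
  srcOffset: number,
  dst: WebAssembly.Memory,
  dstOffset: number,
  byteLength: number,
): void {
  const srcView = new Uint8Array(src.buffer, srcOffset, byteLength);
  const dstView = new Uint8Array(dst.buffer, dstOffset, byteLength);
  dstView.set(srcView); // the unavoidable copy
}

copyBetweenMemories(readerMemory, 1024, computeMemory, 2048, produced.byteLength);

// The compute module now has its own copy of the data.
const received = new Float64Array(computeMemory.buffer, 2048, 4);
```

When reader and compute code are compiled into one module they share a single memory, so this copy disappears. That is the performance argument for the combined build.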
Goals
Similar goals to the Python module proposal:
- Modular: the user can install what they need and choose which dependencies they want, with the goal of somewhat fine-tuned control of bundle size.
- Interoperable: the user can use WebAssembly-based and pure-JavaScript GeoArrow libraries together smoothly.
- Extensible: future developers can develop on top of `geoarrow-wasm` and largely reuse its JS bindings without having to create ones from scratch.
- Strongly typed: a method like `convex_hull` should always return a `PolygonArray` instead of a generic `GeometryArray` that the user can't "see into" statically.
- Static typing: full typing support and IDE autocompletion.
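As a rough illustration of the "strongly typed" goal, the sketch below shows how a concrete return type lets the compiler and the IDE know what comes back. Every class and method name here is hypothetical, not the actual `geoarrow-wasm` API, and `convexHull` is a stub that does no geometry.

```typescript
// Hypothetical class hierarchy, for illustration only.
abstract class GeometryArray {
  readonly length: number;
  constructor(length: number) {
    this.length = length;
  }
}

class PolygonArray extends GeometryArray {
  readonly kind = "polygon" as const;
}

class LineStringArray extends GeometryArray {
  readonly kind = "linestring" as const;

  // Stub: a real binding would call into Wasm. Only the signature matters here:
  // the return type is PolygonArray, not the generic GeometryArray.
  convexHull(): PolygonArray {
    return new PolygonArray(this.length);
  }
}

const lines = new LineStringArray(3);
const hulls = lines.convexHull();

// `hulls.kind` is statically typed as the literal "polygon",
// so no cast or runtime check is needed before using polygon-only APIs.
const kind: "polygon" = hulls.kind;
```

Had `convexHull` been declared to return `GeometryArray`, the last line would not type-check without a runtime narrowing step.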
Data Movement
In contrast to Python, which is able to share the same memory space with native code, data movement between Wasm and JS is not always free, because they occupy two separate memory spaces. JS can see into Wasm memory but not the opposite. This means that data movement from Wasm -> JS can be zero-copy, but JS -> Wasm requires a copy.
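The asymmetry can be sketched with a plain `WebAssembly.Memory` and typed arrays. No real Wasm module is loaded here; `wasmSide` simply stands in for writes that Wasm code would perform, and the byte offsets are arbitrary assumptions.

```typescript
const memory = new WebAssembly.Memory({ initial: 1 });

// Wasm -> JS: a typed array constructed over memory.buffer is a view, not a copy.
const wasmSide = new Int32Array(memory.buffer, 0, 3); // stands in for writes made by Wasm code
const jsView = new Int32Array(memory.buffer, 0, 3);   // what JS reads
wasmSide.set([10, 20, 30]);
// jsView now reads 10, 20, 30 without any bytes having moved.

// JS -> Wasm: data owned by JS lives in a different ArrayBuffer,
// so it has to be copied into Wasm memory before Wasm code can see it.
const jsOwned = new Int32Array([7, 8, 9]);
const destination = new Int32Array(memory.buffer, 64, 3);
destination.set(jsOwned); // the copy
jsOwned[0] = -1;          // later JS-side changes do not reach Wasm memory
```

One caveat with the zero-copy direction: when the Wasm memory grows, the previous `ArrayBuffer` is detached, so views like `jsView` must be recreated (or their contents copied out) before that can happen.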
The easiest data movement in JS is to use Arrow IPC buffers to move serialized data between JS and Wasm, but this has a number of drawbacks:
- Significant memory overhead: when constructing the IPC buffer, all `Data` chunks need to be copied into a new `ArrayBuffer`, a full copy of the dataset, before the copy into/out of Wasm.
- All `Data` chunks in JS memory are references onto the same backing `ArrayBuffer` (from the original IPC buffer), which means a `Data` instance can't be transferred to a WebWorker without a copy.
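The shared-backing-buffer problem can be shown with plain typed arrays, without Arrow JS. In this sketch `ipcBuffer`, `columnA`, and `columnB` are stand-ins for a decoded IPC message and two of its `Data` chunks.

```typescript
// One ArrayBuffer stands in for a decoded IPC message holding two columns.
const ipcBuffer = new ArrayBuffer(32);
const columnA = new Float64Array(ipcBuffer, 0, 2);
const columnB = new Float64Array(ipcBuffer, 16, 2);
columnA.set([1, 2]);
columnB.set([3, 4]);

// Both "chunks" are windows onto the same allocation.
const sharesBacking = columnA.buffer === columnB.buffer;

// Transferring ipcBuffer to a worker would detach columnA as well,
// so isolating columnB means copying its bytes into a fresh buffer first.
const isolatedB = columnB.slice(); // slice() allocates a new ArrayBuffer
```

After the `slice()`, `isolatedB.buffer` can be transferred on its own, but only because a copy was paid for.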
The most performant data movement in JS is to directly view data from Wasm memory and conversely for JS to write array data directly into the Wasm memory space. I've been working on this in `arrow-js-ffi` and it's a crucial part of Arrow interoperability in Wasm. This solves both of the downsides of Arrow IPC, as it avoids an extra data copy and the `Data` instances in JS have a backing buffer not shared with any other `Data`.
Module hierarchy
Here's a quick (messy) picture of the dependency graph. An arrow points to the library it depends on, so here `geoarrow-wasm` depends on `geoarrow-rs`.
The most important part is that there are no dependency cycles.
Rust Core (non-Wasm)
`geoarrow-rs` is the Rust core with all core GeoArrow functionality. All algorithms, core I/O, etc. are implemented in this crate so that as much as possible can be shared among pure-Rust, JS, and Python.
This crate does not on its own have any JS bindings. All JS functionality is exported in separate crates/packages below.
- Rust crate name: `geoarrow`
Arrow-Wasm Core
Shared arrow definitions and FFI functionality to/from Arrow JS.
- Rust crate name: `arrow-wasm`
- JS package name: None? It's unclear whether this should even be published to NPM, as it's not useful on its own; it's useful as a building block for other libraries.
- Dependencies:
  - Only the `arrow` crate.
- Defines common abstractions in Rust with JS-facing APIs for `Table`, `Vector`, `Data`, `DataType`.
- Enables zero-copy (or one-copy, but serialization-free) interop with Arrow JS.
Computational library
Standalone library for spatial operations on GeoArrow arrays, without any I/O except for Arrow IPC and FFI. The `slim` compilation feature of `geoarrow-wasm`.
- Rust crate name: `geoarrow-wasm`
- JS package name: `@geoarrow/geoarrow-wasm-slim`
- Dependencies:
  - `geoarrow-rs` for computational algorithms to wrap for JS
  - `arrow-wasm` for JS bindings for Arrow FFI with Arrow JS
  - Other dependencies in the graph are only used with the `full` compilation feature, described below under "Kitchen Sink"
- Algorithms to operate on GeoArrow memory
- All operations that have a pure-Rust core and can be compiled seamlessly to Wasm
- For now, includes all algorithms. Maybe in the future, we could have different NPM packages for different sets of libraries, but that sounds like a lot of work.
I/O Wasm libraries
There should exist standalone libraries with a minimal bundle size to read and write various file formats to/from GeoArrow.
parquet-wasm
Standalone library to read and write Parquet files in Wasm.
- Rust crate name: `parquet-wasm`
- JS package name: `parquet-wasm`
- Dependencies:
  - `arrow-wasm` for JS bindings for Arrow FFI with Arrow JS
geoparquet-wasm
Standalone library to read and write GeoParquet files in Wasm.
- Rust crate name: `geoparquet-wasm`
- JS package name: `@geoarrow/geoparquet-wasm`
- Dependencies:
  - `parquet-wasm` for JS bindings to read/write Parquet
  - `geoarrow-rs` to encode/decode WKB geometries to/from GeoArrow
- Functional API:
  - `readGeoParquet`: wraps `parquet-wasm`'s `readParquet`, converting the WKB column to GeoArrow before returning an `arrow-wasm` `Table` instance.
  - `writeGeoParquet`: wraps `parquet-wasm`'s `writeParquet`, converting GeoArrow in the `Table` input to WKB before passing on to `writeParquet`.
  - `readGeoParquetStream`: wraps `parquet-wasm`'s `readParquetStream`
  - TODO: more async APIs
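To illustrate what "converting a WKB column to GeoArrow" involves, here is a deliberately narrow sketch that decodes 2D WKB points into a single interleaved coordinate buffer (`[x0, y0, x1, y1, ...]`), one of the coordinate layouts GeoArrow allows. In the proposal this conversion happens in Rust inside `geoarrow-rs`; the TypeScript below is only an assumption-laden illustration, and both function names are hypothetical.

```typescript
// Encode a 2D point as little-endian WKB:
// 1 byte-order byte, a uint32 geometry type (1 = Point), then two float64 values.
function encodeWkbPoint(x: number, y: number): Uint8Array {
  const bytes = new Uint8Array(21);
  const view = new DataView(bytes.buffer);
  view.setUint8(0, 1);        // 1 = little endian
  view.setUint32(1, 1, true); // geometry type 1 = Point
  view.setFloat64(5, x, true);
  view.setFloat64(13, y, true);
  return bytes;
}

// Decode a column of WKB points into one interleaved coordinate buffer.
// Only plain 2D points are handled; anything else throws.
function wkbPointsToInterleaved(column: Uint8Array[]): Float64Array {
  const coords = new Float64Array(column.length * 2);
  column.forEach((wkb, i) => {
    const view = new DataView(wkb.buffer, wkb.byteOffset, wkb.byteLength);
    const littleEndian = view.getUint8(0) === 1;
    const geometryType = view.getUint32(1, littleEndian);
    if (geometryType !== 1) {
      throw new Error(`expected WKB Point, got geometry type ${geometryType}`);
    }
    coords[2 * i] = view.getFloat64(5, littleEndian);
    coords[2 * i + 1] = view.getFloat64(13, littleEndian);
  });
  return coords;
}

const decoded = wkbPointsToInterleaved([
  encodeWkbPoint(1, 2),
  encodeWkbPoint(-3.5, 4.25),
]);
```

The payoff of the GeoArrow side is visible even in this toy: the result is one contiguous `Float64Array` rather than one small binary blob per geometry.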
flatgeobuf-wasm
Standalone library to read and write FlatGeobuf files in Wasm.
- Rust crate name: `flatgeobuf-wasm`
- JS package name: `@geoarrow/flatgeobuf-wasm`
- Dependencies:
  - `arrow-wasm` for JS bindings for Arrow FFI with Arrow JS
  - `geoarrow-rs` to read/write FlatGeobuf to/from GeoArrow
- Functional API:
  - `readFlatGeobuf`: parses a FlatGeobuf buffer, returning an `arrow-wasm` `Table` instance.
  - `writeFlatGeobuf`: creates a FlatGeobuf buffer from an `arrow-wasm` `Table` instance.
  - Future: `readFlatGeobufStream`: generates an async iterable of `arrow-wasm` `RecordBatch` from a remote FlatGeobuf file
  - Future: read data by bounding-box from a remote file
The kitchen sink
The `full` compilation feature of `geoarrow-wasm`.
- Rust crate name: `geoarrow-wasm`
- JS package name: `@geoarrow/geoarrow-wasm`
- Dependencies:
  - `arrow-wasm` for JS bindings for Arrow FFI with Arrow JS
  - `geoparquet-wasm` for JS bindings for GeoParquet
  - `flatgeobuf-wasm` for JS bindings for FlatGeobuf
  - `geoarrow-rs` for algorithms
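The "no dependency cycles" requirement can be checked mechanically. The sketch below transcribes the "Dependencies" lists from this proposal (using the `full` build of `geoarrow-wasm`) into a plain object and runs a depth-first search over it. The `topologicalOrder` helper is an illustration, not part of any of the libraries.

```typescript
// Edges are taken from the "Dependencies" lists in this proposal.
const dependsOn: Record<string, string[]> = {
  "geoarrow-rs": [],
  "arrow": [],
  "arrow-wasm": ["arrow"],
  "parquet-wasm": ["arrow-wasm"],
  "geoparquet-wasm": ["parquet-wasm", "geoarrow-rs"],
  "flatgeobuf-wasm": ["arrow-wasm", "geoarrow-rs"],
  "geoarrow-wasm": ["arrow-wasm", "geoparquet-wasm", "flatgeobuf-wasm", "geoarrow-rs"],
};

// Depth-first search: returns an order in which every package comes after
// its dependencies, or throws if the graph contains a cycle.
function topologicalOrder(graph: Record<string, string[]>): string[] {
  const state = new Map<string, "visiting" | "done">();
  const order: string[] = [];

  const visit = (node: string, path: string[]): void => {
    const s = state.get(node);
    if (s === "done") return;
    if (s === "visiting") {
      throw new Error(`dependency cycle: ${[...path, node].join(" -> ")}`);
    }
    state.set(node, "visiting");
    for (const dep of graph[node] ?? []) visit(dep, [...path, node]);
    state.set(node, "done");
    order.push(node);
  };

  for (const node of Object.keys(graph)) visit(node, []);
  return order;
}

const buildOrder = topologicalOrder(dependsOn);
```

The `slim` build uses the same graph minus the `geoparquet-wasm` and `flatgeobuf-wasm` edges out of `geoarrow-wasm`, so it is acyclic whenever the full graph is.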
Pure JS Interop
This is designed to smoothly interop with pure-JavaScript Arrow libraries.
Arrow JS
The canonical implementation of Arrow in JS. It only supports IPC for data I/O.
Arrow JS FFI
A library to read/write Arrow data across the Wasm boundary. This interops with the core `arrow-wasm` crate above.
GeoArrow JS
A pure-JavaScript (TypeScript) implementation of GeoArrow. This uses the exact same memory layout as GeoArrow in Rust, so it should be possible to mix and match between pure-JS and Wasm-based algorithms without changing data representations.