JavaScript GeoArrow Module Proposal #283

@kylebarron

JavaScript GeoArrow Module Proposal

The strength of Arrow is in its interoperability, and therefore I think it's worthwhile to discuss how to ensure all the pieces around GeoArrow in JavaScript fit together really well.

This is a corollary to the Python GeoArrow Module Proposal but focused on GeoArrow interoperability in JavaScript and WebAssembly. I don't know anyone doing GeoArrow-Wasm stuff in C, so this will focus on my efforts in Rust and TypeScript. Unlike in Python, there aren't other people currently working on JavaScript GeoArrow infrastructure, so this is a manifesto to solidify my ideas.

WebAssembly limitations

WebAssembly is sandboxed, which means that Wasm code can only access and modify memory within its own memory space. So Wasm code cannot access JavaScript objects directly.

This also means that two Wasm modules can't share memory. So if you have one Wasm-based NPM library that loads GeoParquet to GeoArrow and another Wasm-based NPM library that implements spatial operations on GeoArrow, there must be a copy from the first module's memory space into JavaScript and then into the second module's memory space.
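As a rough illustration of that cost, the sketch below uses two independent `WebAssembly.Memory` objects to stand in for two separately compiled modules. The helper `copyBetweenModules` and the byte offsets are hypothetical; a real module would hand out offsets through its own allocator.

```typescript
// Two independent linear memories stand in for two separately compiled Wasm
// modules, e.g. a GeoParquet reader and a spatial-operations library.
const readerMemory = new WebAssembly.Memory({ initial: 1 }); // 1 page = 64 KiB
const opsMemory = new WebAssembly.Memory({ initial: 1 });

// Pretend the reader module wrote four float64 coordinates at byte offset 0.
const produced = new Float64Array(readerMemory.buffer, 0, 4);
produced.set([1.5, 2.5, 3.5, 4.5]);

// Hypothetical helper: JS views the source memory, then copies the bytes into
// the destination memory. Neither module can read the other's memory itself.
function copyBetweenModules(
  src: WebAssembly.Memory,
  srcOffset: number,
  dst: WebAssembly.Memory,
  dstOffset: number,
  length: number,
): Float64Array {
  const srcView = new Float64Array(src.buffer, srcOffset, length);
  const dstView = new Float64Array(dst.buffer, dstOffset, length);
  dstView.set(srcView); // the unavoidable copy
  return dstView;
}

const consumed = copyBetweenModules(readerMemory, 0, opsMemory, 0, 4);

// The two views are backed by different buffers, so later writes by the
// reader module are not visible to the operations module.
produced[0] = -1;
console.log(consumed[0], consumed.buffer === produced.buffer);
```

The copy scales with the size of the data, which is why keeping I/O and operations in one module avoids it entirely.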

This means that grouping Wasm functionality together into a single module is more performant, as I/O and operations can be done in a single memory space. This runs up against bundle size: JavaScript bundlers are able to tree-shake JavaScript code, but they can't tree-shake a prebuilt Wasm binary. Instead, the original Rust would have to be recompiled, excluding unwanted functions.

The solution I'm gravitating towards is to have a variety of NPM libraries, described in this document, where I/O and operations are distributed both as their own libraries and in a "kitchen sink" build, which contains everything at the cost of a larger bundle size. Advanced users can compile custom Wasm binaries from the Rust source with only the desired functionality.

Goals

Similar goals to the Python module proposal:

  • Modular: the user can install what they need and choose which dependencies they want, with the goal of somewhat fine-tuned control of bundle size.
  • Interoperable: the user can use WebAssembly-based and pure-JavaScript GeoArrow libraries together smoothly.
  • Extensible: future developers can build on top of geoarrow-wasm and largely reuse its JS bindings without having to create new ones from scratch.
  • Strongly typed: a method like convex_hull should always return a PolygonArray instead of a generic GeometryArray that the user can't "see into" statically.
  • Static typing: Full typing support and IDE autocompletion.
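To make the typing goals concrete, here is a minimal TypeScript sketch. The class shapes, the `convexHull` and `area` method names, and their placeholder bodies are all assumptions for illustration, not the actual geoarrow-wasm bindings; only the idea of returning `PolygonArray` rather than a generic `GeometryArray` comes from the goals above.

```typescript
// Hypothetical, simplified array types. Method bodies are placeholders that
// only preserve the array length; they do not compute real geometry.
interface GeometryArray {
  readonly geometryType: string;
  readonly length: number;
}

class PolygonArray implements GeometryArray {
  readonly geometryType = "polygon" as const;
  constructor(readonly length: number) {}

  // Only polygon-like arrays expose area(), so the compiler can reject
  // calls on arrays where it makes no sense.
  area(): Float64Array {
    return new Float64Array(this.length); // placeholder values
  }
}

class MultiPointArray implements GeometryArray {
  readonly geometryType = "multipoint" as const;
  constructor(readonly length: number) {}

  // The return type is the concrete PolygonArray, not GeometryArray, so
  // callers get autocompletion for polygon-specific methods.
  convexHull(): PolygonArray {
    return new PolygonArray(this.length); // placeholder result
  }
}

const hulls = new MultiPointArray(3).convexHull();
const areas = hulls.area(); // type-checks because `hulls` is a PolygonArray
console.log(hulls.geometryType, areas.length);
```

If `convexHull` were declared to return `GeometryArray`, the call to `area()` would need a cast or a runtime check.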

Data Movement

In contrast to Python, which is able to share the same memory space with native code, data movement between Wasm and JS is not always free, because they occupy two separate memory spaces. JS can see into Wasm memory but not the opposite. This means that data movement from Wasm -> JS can be zero-copy, but JS -> Wasm requires a copy.

The easiest data movement in JS is to use Arrow IPC buffers to move serialized data between JS and Wasm, but this has a number of drawbacks:

  • Significant memory overhead: when constructing the IPC buffer, all Data chunks need to be copied into a new ArrayBuffer, a full copy of the dataset, before the copy into/out of Wasm.
  • All Data chunks in JS memory are views onto the same backing ArrayBuffer (the original IPC buffer), which means a single Data instance can't be transferred to a WebWorker without first copying it out.
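The shared-buffer drawback can be shown without Arrow at all. In this sketch a plain `ArrayBuffer` stands in for a decoded IPC buffer and two typed arrays stand in for the Data chunks of two columns; the names and sizes are illustrative assumptions.

```typescript
// One ArrayBuffer stands in for the original IPC buffer.
const ipcBuffer = new ArrayBuffer(64);

// Two "columns" are views onto different byte ranges of that one buffer.
const xs = new Float64Array(ipcBuffer, 0, 4);
const ys = new Float64Array(ipcBuffer, 32, 4);
xs.set([1, 2, 3, 4]);
ys.set([10, 20, 30, 40]);

// Both views share the same backing buffer. Transferring that buffer to a
// worker would detach every view onto it, not just the column being sent.
const sharesBuffer = xs.buffer === ys.buffer;

// To hand only `xs` to a worker, it first has to be copied into a buffer of
// its own. slice() allocates a new ArrayBuffer holding just these 32 bytes.
const xsOwned = xs.slice();
const ownsBuffer = xsOwned.buffer !== ipcBuffer && xsOwned.buffer.byteLength === 32;

// In a browser, the owned copy could then be transferred with
// worker.postMessage(xsOwned, [xsOwned.buffer]) without affecting `ys`.
console.log(sharesBuffer, ownsBuffer);
```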

The most performant data movement in JS is to directly view data from Wasm memory and conversely for JS to write array data directly into the Wasm memory space. I've been working on this in arrow-js-ffi and it's a crucial part of Arrow interoperability in Wasm. This solves both of the downsides of Arrow IPC, as it avoids an extra data copy and the Data instances in JS have a backing buffer not shared with any other Data.
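The difference between viewing and copying can be sketched with a small helper. `readFloat64` below is a hypothetical function, not the arrow-js-ffi API: it assumes the Wasm side has reported a byte offset and element count for a float64 buffer, and it either views that range in place or copies it into a buffer that JS owns.

```typescript
// Hypothetical helper. `ptr` is a byte offset into Wasm memory and must be
// 8-byte aligned for Float64Array.
function readFloat64(
  memory: WebAssembly.Memory,
  ptr: number,
  length: number,
  copy: boolean,
): Float64Array {
  const view = new Float64Array(memory.buffer, ptr, length);
  // copy = false: zero-copy view, only valid while Wasm keeps the data alive.
  // copy = true: one copy into a fresh ArrayBuffer owned by JS alone.
  return copy ? view.slice() : view;
}

// A bare memory stands in for a Wasm module that produced three values.
const wasmMemory = new WebAssembly.Memory({ initial: 1 });
new Float64Array(wasmMemory.buffer, 16, 3).set([1, 2, 3]);

const zeroCopy = readFloat64(wasmMemory, 16, 3, false);
const owned = readFloat64(wasmMemory, 16, 3, true);

// Simulate the Wasm side overwriting the first value afterwards.
new Float64Array(wasmMemory.buffer, 16, 1)[0] = 99;

// The view observes the change; the owned copy does not.
console.log(zeroCopy[0], owned[0]);
```

The owned copy has a backing buffer of exactly its own size, which is the property that makes it transferable on its own.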

Module hierarchy

Here's a quick (messy) picture of the dependency graph. An arrow points to the library it depends on, so here geoarrow-wasm depends on geoarrow-rs.

[Image: dependency graph of the libraries described below]

The most important part is that there are no dependency cycles.

Rust Core (non-Wasm)

geoarrow-rs is the Rust core with all core GeoArrow functionality. All algorithms, core I/O, etc. are implemented in this crate so that as much as possible can be shared among pure-Rust, JS, and Python.

This crate does not on its own have any JS bindings. All JS functionality is exported in separate crates/packages below.

  • Rust crate name: geoarrow

Arrow-Wasm Core

Shared Arrow definitions and FFI functionality to/from Arrow JS.

  • Rust crate name: arrow-wasm
  • JS package name: None? It's unclear whether this should even be published to NPM, as it's not useful on its own; it's useful as a building block for other libraries.
  • Dependencies:
    • Only the arrow crate.
  • Defines common abstractions in Rust with JS-facing APIs for Table, Vector, Data, DataType.
  • Enables zero-copy (or one-copy, but serialization-free) interop with Arrow JS.

Computational library

Standalone library for spatial operations on GeoArrow arrays, without any I/O except for Arrow IPC and FFI. This is the slim compilation feature of geoarrow-wasm.

  • Rust crate name: geoarrow-wasm
  • JS package name: @geoarrow/geoarrow-wasm-slim
  • Dependencies:
    • geoarrow-rs for computational algorithms to wrap for JS
    • arrow-wasm for JS bindings for Arrow FFI with Arrow JS
    • Other dependencies in the graph are only used with the full compilation feature, described below under "Kitchen Sink"
  • Algorithms to operate on GeoArrow memory
    • All operations that have a pure-Rust core and can be compiled seamlessly to Wasm
    • For now, includes all algorithms. Maybe in the future we could have different NPM packages for different sets of algorithms, but that sounds like a lot of work.

I/O Wasm libraries

There should exist standalone libraries with a minimal bundle size to read and write various file formats to/from GeoArrow.

parquet-wasm

Standalone library to read and write Parquet files in Wasm.

  • Rust crate name: parquet-wasm
  • JS package name: parquet-wasm
  • Dependencies:
    • arrow-wasm for JS bindings for Arrow FFI with Arrow JS

geoparquet-wasm

Standalone library to read and write GeoParquet files in Wasm.

  • Rust crate name: geoparquet-wasm
  • JS package name: @geoarrow/geoparquet-wasm
  • Dependencies:
    • parquet-wasm for JS bindings to read/write Parquet
    • geoarrow-rs to encode/decode WKB geometries to/from GeoArrow
  • Functional API:
    • readGeoParquet: wraps parquet-wasm's readParquet, converting the WKB column to GeoArrow before returning an arrow-wasm Table instance.
    • writeGeoParquet: wraps parquet-wasm's writeParquet, converting GeoArrow in the Table input to WKB before passing on to writeParquet.
    • readGeoParquetStream: wraps parquet-wasm's readParquetStream
    • TODO: more async APIs

flatgeobuf-wasm

Standalone library to read and write FlatGeobuf files in Wasm.

  • Rust crate name: flatgeobuf-wasm
  • JS package name: @geoarrow/flatgeobuf-wasm
  • Dependencies:
    • arrow-wasm for JS bindings for Arrow FFI with Arrow JS
    • geoarrow-rs to read/write FlatGeobuf to/from GeoArrow
  • Functional API:
    • readFlatGeobuf: parses FlatGeobuf buffer, returning an arrow-wasm Table instance
    • writeFlatGeobuf: creates a FlatGeobuf buffer from an arrow-wasm Table instance.
    • Future: readFlatGeobufStream: generates an async iterable of arrow-wasm RecordBatch from a remote FlatGeobuf file
    • Future: read data by bounding-box from a remote file

The kitchen sink

This is the full compilation feature of geoarrow-wasm, bundling the computational library together with the I/O libraries above.

  • Rust crate name: geoarrow-wasm
  • JS package name: @geoarrow/geoarrow-wasm
  • Dependencies:
    • arrow-wasm for JS bindings for Arrow FFI with Arrow JS
    • geoparquet-wasm for JS bindings for GeoParquet
    • flatgeobuf-wasm for JS bindings for FlatGeobuf
    • geoarrow-rs for algorithms
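The dependency lists in the sections above can be encoded as a small graph and checked for cycles. This is only a sketch: the node labels (including the "slim" and "full" suffixes) are my own, geoarrow-rs is given no dependencies because none are listed in this proposal, and `findCycle` is a plain depth-first search written for illustration.

```typescript
// Edges point from a library to the libraries it depends on, as listed above.
const dependencies: Record<string, string[]> = {
  "arrow": [],
  "geoarrow-rs": [], // no dependencies are listed in this proposal
  "arrow-wasm": ["arrow"],
  "parquet-wasm": ["arrow-wasm"],
  "geoparquet-wasm": ["parquet-wasm", "geoarrow-rs"],
  "flatgeobuf-wasm": ["arrow-wasm", "geoarrow-rs"],
  "geoarrow-wasm (slim)": ["geoarrow-rs", "arrow-wasm"],
  "geoarrow-wasm (full)": ["arrow-wasm", "geoparquet-wasm", "flatgeobuf-wasm", "geoarrow-rs"],
};

// Depth-first search that returns the first cycle found, or null if none.
function findCycle(graph: Record<string, string[]>): string[] | null {
  const state = new Map<string, "visiting" | "done">();
  const path: string[] = [];

  const visit = (node: string): string[] | null => {
    if (state.get(node) === "done") return null;
    if (state.get(node) === "visiting") {
      // Reached a node already on the current path: that is a cycle.
      return [...path.slice(path.indexOf(node)), node];
    }
    state.set(node, "visiting");
    path.push(node);
    for (const dep of graph[node] ?? []) {
      const found = visit(dep);
      if (found) return found;
    }
    path.pop();
    state.set(node, "done");
    return null;
  };

  for (const node of Object.keys(graph)) {
    const found = visit(node);
    if (found) return found;
  }
  return null;
}

const proposalCycle = findCycle(dependencies);
console.log(proposalCycle === null ? "no dependency cycles" : proposalCycle.join(" -> "));
```

A check like this could run in CI so that a new package can't accidentally introduce a cycle.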

Pure JS Interop

The Wasm libraries above are designed to interoperate smoothly with pure-JavaScript Arrow libraries.

Arrow JS

The canonical implementation of Arrow in JS. It only supports IPC for data I/O.

Arrow JS FFI

A library to read/write Arrow data across the Wasm boundary. This interops with the core arrow-wasm crate above.

GeoArrow JS

A pure-JavaScript (TypeScript) implementation of GeoArrow. This uses the exact same memory layout as GeoArrow in Rust, so it should be possible to mix and match between pure-JS and Wasm-based algorithms without changing data representations.
