Easily create and run user-defined functions (UDFs) on Apache Arrow. You can define functions in Rust, Python, Java, or JavaScript. The functions can be executed natively, in WebAssembly, or on a remote server.

| Language | Native | WebAssembly | Remote |
|---|---|---|---|
| Rust | arrow-udf | arrow-udf-runtime/wasm | |
| Python | arrow-udf-runtime/python | | arrow-udf-remote/python |
| JavaScript | arrow-udf-runtime/javascript | | |
| Java | | | arrow-udf-remote/java |

**Note:** `arrow-udf` generates `RecordBatch` Rust functions from scalar functions, and can be used in more general contexts whenever you need to work with Arrow data in Rust, not only to run user-provided code. The other crates focus on providing runtimes or protocols for running user-provided code.

- `arrow-udf`: You call `fn(&RecordBatch) -> RecordBatch` directly, as if you wrote it by hand.
- `arrow-udf-runtime/python` / `arrow-udf-runtime/javascript`: You first `add_function` to a `Runtime`, and then call it with the `Runtime`.
- `arrow-udf-runtime/wasm`: You first create a `Runtime` with a compiled WASM binary, and then `find_function` and call it.
- `arrow-udf-runtime/remote`: You start a `Client` to call the function running in a remote `Server` process.

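
To make the idea of turning a scalar function into a batch function concrete, here is a minimal sketch using only the standard library. It is not the code that `arrow-udf` generates: plain slices of `Option<i32>` stand in for Arrow arrays, the name `gcd_batch` is hypothetical, and the null handling (a null in either input gives a null output) is an assumption made for illustration.

```rust
/// Scalar function: greatest common divisor via Euclid's algorithm.
fn gcd(mut a: i32, mut b: i32) -> i32 {
    while b != 0 {
        (a, b) = (b, a % b);
    }
    a
}

/// Hypothetical batch wrapper (not the arrow-udf API): applies `gcd` row by row
/// to two nullable columns. A row is null if either input is null (assumption).
fn gcd_batch(xs: &[Option<i32>], ys: &[Option<i32>]) -> Vec<Option<i32>> {
    assert_eq!(xs.len(), ys.len(), "columns must have the same length");
    xs.iter()
        .zip(ys)
        .map(|(x, y)| match (x, y) {
            (Some(x), Some(y)) => Some(gcd(*x, *y)),
            _ => None,
        })
        .collect()
}

fn main() {
    let xs = vec![Some(12), Some(7), None, Some(0)];
    let ys = vec![Some(18), Some(5), Some(3), Some(9)];
    // prints [Some(6), Some(1), None, Some(9)]
    println!("{:?}", gcd_batch(&xs, &ys));
}
```

The caller only ever sees the batch-level function, which is the sense in which the generated function can be called "as if you wrote it by hand".
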
You can also use this library to add custom functions to DuckDB; see arrow-udf-duckdb-example.

In addition to the standard types defined by Arrow, these crates also support the following data types through Arrow's extension types. When using an extension type, you need to add the `ARROW:extension:name` key to the field's metadata.

| Extension Type | Physical Type | ARROW:extension:name |
|---|---|---|
| JSON | Utf8, Binary, LargeBinary | arrowudf.json |
| Decimal | Utf8 | arrowudf.decimal |
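
As a rough illustration of how the metadata in the table can drive type detection, the sketch below classifies a field from a metadata map. It uses a plain `HashMap` in place of an Arrow field's metadata, and the `Logical` enum and `classify` function are hypothetical names, not part of these crates. The key is passed as a parameter so the same lookup works with a key other than `ARROW:extension:name`.

```rust
use std::collections::HashMap;

/// Hypothetical classification of a field, for illustration only.
#[derive(Debug, PartialEq)]
enum Logical {
    Json,
    Decimal,
    Plain,
}

/// Classify a field by looking up `key` in its metadata map.
/// Fields without the key, or with an unknown value, are treated as plain.
fn classify(metadata: &HashMap<String, String>, key: &str) -> Logical {
    match metadata.get(key).map(String::as_str) {
        Some("arrowudf.json") => Logical::Json,
        Some("arrowudf.decimal") => Logical::Decimal,
        _ => Logical::Plain,
    }
}

fn main() {
    let key = "ARROW:extension:name";

    let mut json_meta = HashMap::new();
    json_meta.insert(key.to_string(), "arrowudf.json".to_string());

    let mut decimal_meta = HashMap::new();
    decimal_meta.insert(key.to_string(), "arrowudf.decimal".to_string());

    let plain_meta: HashMap<String, String> = HashMap::new();

    println!("{:?}", classify(&json_meta, key));
    println!("{:?}", classify(&decimal_meta, key));
    println!("{:?}", classify(&plain_meta, key));
}
```
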
Alternatively, you can configure the extension metadata key and values to look for when converting between Arrow and extension types:
```rust
let mut js_runtime = arrow_udf_runtime::javascript::Runtime::new().unwrap();
let converter = js_runtime.converter_mut();
// Look for a custom metadata key and custom extension names
// instead of the defaults listed in the table above.
converter.set_arrow_extension_key("Extension");
converter.set_json_extension_name("Variant");
converter.set_decimal_extension_name("Decimal");
```

The JSON type is stored in a string array in text form.

```rust
// `name` is the column name; the metadata marks this Utf8 field as JSON.
let json_field = Field::new(name, DataType::Utf8, true)
    .with_metadata([("ARROW:extension:name".into(), "arrowudf.json".into())].into());
let json_array = StringArray::from(vec![r#"{"key": "value"}"#]);
```

Unlike the fixed-point decimal type built into Arrow, this decimal type represents floating-point numbers with arbitrary precision and scale, that is, the unconstrained `numeric` in Postgres. The decimal type is stored in a string array in text form.

```rust
// `name` is the column name; the metadata marks this Utf8 field as a decimal.
let decimal_field = Field::new(name, DataType::Utf8, true)
    .with_metadata([("ARROW:extension:name".into(), "arrowudf.decimal".into())].into());
let decimal_array = StringArray::from(vec!["0.0001", "-1.23", "0"]);
```

We have benchmarked the performance of function calls in different environments. You can run the benchmarks with the following command:

```sh
cargo bench --bench bench
```

Performance comparison of calling `gcd` on a chunk of 1024 rows:

```
gcd/native          1.5237 µs    x1
gcd/wasm            15.547 µs    x10
gcd/js(quickjs)     85.007 µs    x55
gcd/python          175.29 µs    x115
```
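
The multipliers in the last column appear to be each timing divided by the native timing, rounded down. The short sketch below reproduces that arithmetic from the figures above and also derives an approximate per-row cost for a 1024-row chunk. The `slowdown` helper is a hypothetical name used only for this illustration.

```rust
/// Slowdown factor of a timing relative to a baseline (both in microseconds).
fn slowdown(t_us: f64, baseline_us: f64) -> f64 {
    t_us / baseline_us
}

fn main() {
    // Number of rows per chunk, as stated for the benchmark above.
    const ROWS: f64 = 1024.0;

    // Timings copied from the results above, in microseconds per chunk.
    let results = [
        ("gcd/native", 1.5237),
        ("gcd/wasm", 15.547),
        ("gcd/js(quickjs)", 85.007),
        ("gcd/python", 175.29),
    ];

    let baseline = results[0].1;
    for &(name, t_us) in results.iter() {
        // Convert microseconds per chunk to nanoseconds per row.
        let per_row_ns = t_us * 1000.0 / ROWS;
        println!(
            "{:<16} x{:<6.1} {:>8.2} ns/row",
            name,
            slowdown(t_us, baseline),
            per_row_ns
        );
    }
}
```
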

The following projects use this library:

- RisingWave: A Distributed SQL Database for Stream Processing.
- Databend: An open-source cloud data warehouse that serves as a cost-effective alternative to Snowflake.