Builder API and further removal of ONNX from the IR #170
base: master
Conversation
This is awesome! Nice work, looking forward to trying it out 🚀
@philpax you should be able to try this out now (see limitations above). Looking for some feedback!
Unfortunately I'm pretty busy this week, but I'll see if one of our devs can try it out 🤞
What will the developer experience be like? Is some sort of "interactive" mode planned where I can see the results an operation will have on a tensor while stepping through with a debugger? Or will it be similar to GGML, where the graph is built and then executed, which makes debugging extremely tedious?
Our execution happens on the GPU and we generate a single set of commands to execute all ops in one go, so breakpoints are difficult to support. Also, intermediate buffers remain in VRAM, are re-used, and cannot easily be made readable. That said, you can mark any node output as an inference output (add it to the output list), so you can obtain intermediate results for debugging. This will introduce some copying and extra buffers but is probably useful.
Alright, got it; being able to mark any tensor as an output is a nice addition. Would it also be possible to build and execute the commands after each tensor operation and copy the results back from VRAM? It would be highly inefficient, but I could see all tensor operation results while debugging. I'm just asking, as I worked with GGML a bit and debugging it is an ..... experience.
Well, given the current API, you could build your graph iteratively, create a new session each time (with the most recently added node as output) and run it. Another way to do it would be to generate a graph and session for a single op each time (and then run that with the previous node's output). This is a bit more performant, but you are still copying everything back and forth from the GPU of course. Also, you need to somehow deal with ops that take more than one input (but given the relative simplicity of the LLaMA models that should be doable).
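For illustration, here is a toy sketch of that "one session per node" loop; the `Graph` and `Session` types below are stand-ins meant only to show the shape of the workflow, not wonnx's actual API:

```rust
// Stand-in types illustrating the "fresh session per intermediate node"
// debugging pattern described above; none of this is wonnx's real API.
struct Graph {
    nodes: Vec<String>,
}

struct Session {
    output: String,
}

impl Graph {
    fn new() -> Self {
        Graph { nodes: Vec::new() }
    }
    fn push(&mut self, node: &str) {
        self.nodes.push(node.to_string());
    }
    // Pretend the most recently added node is marked as the inference output.
    fn session_for_last(&self) -> Session {
        Session { output: self.nodes.last().cloned().unwrap_or_default() }
    }
}

impl Session {
    // Stand-in for dispatching the GPU commands and copying the buffer back.
    fn run(&self) -> Vec<f32> {
        println!("running with output node: {}", self.output);
        vec![0.0; 4]
    }
}

fn main() {
    let mut graph = Graph::new();
    // Hypothetical node names; in practice these would be the ops of the model.
    for op in ["embed", "matmul_0", "softmax_0"] {
        graph.push(op);
        let intermediate = graph.session_for_last().run();
        println!("{op}: {intermediate:?}"); // inspect every intermediate result
    }
}
```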
Hey, I finally got some time to take a look at these changes and I will start to play around a bit. Is there a list of all supported tensor ops somewhere? And how would I integrate a custom op, like ALiBi?
Currently it is probably easiest to check the giant `match` over op types in the compiler source. To implement a new op, see these instructions. Remember that in addition to implementing the op itself, we also prefer to have some tests and (if relevant) shape inference code.
I converted a GPT-2 model to ONNX and visualized the operations here. The main things still missing are the different shape-manipulating operations like Squeeze, Unsqueeze, Concat, Gather and stuff like that. But maybe you could also take a look; this is the first time I'm working with ONNX, so maybe I'm missing something 😅. I will probably try to implement a multi-head attention layer with the current graph API.
Forgot to mention two things that might be useful:
Off the top of my head, Concat and Gather are implemented (but if not, they are relatively trivial operations as they just copy data).
Yeah, I will give this a try; maybe it will produce a more readable output. But GPT-2 should be a good starting point as it is relatively simple, openly available, and contains all the building blocks we need for other LLMs.
I just looked at the functions implemented by the compiler. I'm also kinda confused about how I could create an F16 tensor.
Yeah, that's a bit ugly :-) But also quite easily cleaned up.
F16 is not yet supported. However, it should be easy to add as long as WGSL supports it. All shaders are 'generic' over the 'scalar type' (through the shader templates). My strategy would be to start by adding an F16 scalar type.
Got it. My plan was to somehow get a GPT-2 F16 GGML model loaded and executed, and then work my way toward some more exotic quantization formats, which will require more work. Is there a way to instantiate a tensor from a memory address, an offset and dimensions? Or do I have to first load the GGML tensor into lists and then instantiate a tensor with these lists? Would something like direct GPU storage be possible, or is this out of scope of the WebGPU standard?
If I were you, I would write a script that packs the GGML weights. This would pack two F16 values into a single u32 value. You could then use https://www.w3.org/TR/WGSL/#unpack2x16float-builtin to unpack them inside your shader. I've done something similar here: https://gist.github.com/FL33TW00D/d81562557279d887705985f7c6ae4481
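For illustration, a minimal Rust sketch of that packing step (this assumes the `half` crate; per the WGSL spec, unpack2x16float reads component 0 from the low 16 bits, so the first value of each pair goes into the low half):

```rust
// Pack pairs of F16 weights into u32 words so a WGSL shader can recover them
// with unpack2x16float(). Uses the `half` crate for the f32 -> f16 conversion.
use half::f16;

fn pack_f16_pairs(values: &[f32]) -> Vec<u32> {
    values
        .chunks(2)
        .map(|pair| {
            let lo = f16::from_f32(pair[0]).to_bits() as u32;
            // Odd-length input: pad the high half with zero.
            let hi = pair.get(1).map_or(0, |v| f16::from_f32(*v).to_bits()) as u32;
            (hi << 16) | lo // component 0 of unpack2x16float = low 16 bits
        })
        .collect()
}

fn main() {
    let weights = [1.5_f32, -0.25, 3.0];
    let packed = pack_f16_pairs(&weights);
    // Upload `packed` as a storage buffer of u32; in the shader, each element
    // yields a vec2<f32> via unpack2x16float(element).
    println!("{packed:x?}");
}
```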
This could be interesting, thanks. But I thought WebGPU had native F16 support?
Also very interesting; I probably have to try something similar when I try to implement the other quantization formats. But I guess I can also reference the BLAS GGML implementation a bit.
Chrome is in the process of shipping F16, but not yet.
Not sure what you mean by this - as far as I know there are no special provisions for unified memory in WebGPU/wgpu, but I might be mistaken.
I was referring to something like GPUDirect Storage, where data can be moved directly into VRAM without being loaded into RAM first.
If I find some time this weekend, I'll give it a try. Thanks for the info. And yes, you are right, we are working with either mmap'ed memory or a pointer to a file. As I don't want to spam this thread further, is there a way to reach you directly (e.g. Discord) if I have further questions? And again, sorry for being nooby, but as I said this is my first time working with wgpu. 😅
Feel free to spam the thread ;-) I am not (regularly) on Discord but can be reached by email (see profile).
This is a first attempt at addressing #169 as well as an earlier suggestion to move ONNX out of the IR.
tl;dr, this PR exposes a 'builder API' that allows you to use WONNX inference without an ONNX model (of course currently only ops from ONNX are available, but due to the further decoupling of the IR and ONNX, it will be easier to add custom ops in the future):
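As a rough illustration of the idea only (the `input`, `tensor` and `add` constructors below are toy stand-ins, not necessarily the PR's actual function names or signatures):

```rust
// Toy stand-in for the builder-API idea: describe a small graph of ops in
// code instead of loading an ONNX model. All types and functions here are
// illustrative stubs, not wonnx's real API.
#[derive(Debug)]
enum Node {
    Input { name: String, shape: Vec<usize> },
    Tensor { name: String, shape: Vec<usize>, data: Vec<f32> },
    Add(Box<Node>, Box<Node>),
}

// Inference-time input (data supplied when running the session).
fn input(name: &str, shape: &[usize]) -> Node {
    Node::Input { name: name.into(), shape: shape.to_vec() }
}

// Constant tensor baked into the graph (e.g. model weights).
fn tensor(name: &str, shape: &[usize], data: Vec<f32>) -> Node {
    Node::Tensor { name: name.into(), shape: shape.to_vec(), data }
}

// Elementwise addition node.
fn add(a: Node, b: Node) -> Node {
    Node::Add(Box::new(a), Box::new(b))
}

fn main() {
    // y = x + w, where `w` is a stored weight and `x` is provided at inference time.
    let y = add(
        input("x", &[1, 4]),
        tensor("w", &[1, 4], vec![0.5; 4]),
    );
    // A session would then be built from `y` (the requested output) and run
    // with a map from input names ("x") to tensor data.
    println!("{y:#?}");
}
```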
Currently the API is very limited (only `add`, `neg` and `conv` ops, as well as `input` and `tensor` to define inference-time inputs and weights). Some work that needs to be done:

- … (`AttributeValue` in particular)
- An `enum` containing all (supported) ops and their parameters.
- Shape inference (currently for `conv` the user has to specify the output shape, but the code to determine the output shape based on input shapes is already in `wonnx_preprocessing`; we might want to move that into the core crate, however the shape inference code is still very specific to ONNX).
- Split into `wonnx-backend` and `wonnx` (= `pub use wonnx-backend::*` + the ONNX bits); clean up the API surface in general.
- … (`TensorData<'static>`)

Help is very much appreciated!
Nice to haves