Builder API and further removal of ONNX from the IR #170

Draft · wants to merge 12 commits into master

Conversation

pixelspark
Collaborator

@pixelspark pixelspark commented May 21, 2023

This is a first attempt at addressing #169 as well as an earlier suggestion to move ONNX out of the IR.

tl;dr, this PR exposes a 'builder API' that allows you to use WONNX inference without an ONNX model (of course currently only ops from ONNX are available, but due to the further decoupling of the IR and ONNX, it will be easier to add custom ops in the future):

```rust
use std::collections::HashMap;
use wonnx::builder::*;

let a = tensor("x", &[1, 3], vec![0.1, 0.2, 0.3].into());
let b = tensor("y", &[1, 3], vec![3.0, 2.0, 1.0].into());
let axb = a.add(&b);

let sesh = session_for_outputs(&["result"], &[axb], 13).await.unwrap();
let result = sesh.run(&HashMap::new()).await.unwrap();
assert_eq!(
    result["result"],
    TensorData::F32(vec![3.1, 2.2, 1.3].into())
);
```

Currently the API is very limited (only the add, neg and conv ops, as well as input and tensor to define inference-time inputs and weights; see the sketch after this list for the input path). Some work that still needs to be done:

  • Remove the last bits of ONNX from the IR (AttributeValue in particular)
  • Replace "op name + attributes" with an actual enum containing all (supported) ops and their parameters.
  • Employ the existing shape inference logic for the more complicated ops (currently, for conv, the user has to specify the output shape, but the code to determine the output shape from the input shapes already exists in wonnx_preprocessing. We might want to move that into the core crate; however, the shape inference code is still very specific to ONNX).
  • Split ONNX support off from the backend into a separate crate (wonnx-backend, with wonnx = pub use wonnx_backend::* plus the ONNX bits), and clean up the API surface in general.
  • Add all other ops
  • Write documentation for end users, and a README explaining that we are not ONNX-only anymore
  • Documentation for developers (in particular explain things like TensorData<'static>)
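
For illustration, here is a minimal sketch of the input path, analogous to the tensor example above. The exact signature of input and the key/value types that run expects are assumptions based on the description in this PR, not the final API:

```rust
use std::collections::HashMap;
use wonnx::builder::*;

// `input` declares a value supplied at inference time (signature assumed here),
// while `tensor` bakes constant weights into the graph as in the example above.
let x = input("x", &[1, 3]);
let w = tensor("w", &[1, 3], vec![1.0f32, 2.0, 3.0].into());
let y = x.add(&w);

let session = session_for_outputs(&["y"], &[y], 13).await.unwrap();

// Provide the inference-time input by name (map key/value types assumed).
let mut inputs = HashMap::new();
inputs.insert("x".to_string(), TensorData::F32(vec![0.5f32, 0.5, 0.5].into()));
let result = session.run(&inputs).await.unwrap();
```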

Help is very much appreciated!

Nice to haves

  • Expose this API through Python/WASM?
  • Discuss operator overloading (so you can write e.g. a + b on two tensors to generate an Add). I did not add this yet because I think it is better to keep things explicit (i.e. in the case of an Add in ONNX you can still set attributes for e.g. broadcasting).

@philpax
Contributor

philpax commented May 22, 2023

This is awesome! Nice work, looking forward to trying it out 🚀

@pixelspark
Collaborator Author

@philpax you should be able to try this out now (see limitations above). Looking for some feedback!

@philpax
Contributor

philpax commented May 23, 2023

Unfortunately I'm pretty busy this week, but I'll see if one of our devs can try it out 🤞

@LLukas22

LLukas22 commented May 24, 2023

What will the developer experience be like? Is some sort of "interactive" mode planned where I can see the results an operation will have on a tensor while stepping through with a debugger? Or will it be similar to GGML, where the graph is built and then executed, which makes debugging extremely tedious?

@pixelspark
Collaborator Author

What will the developer experience be like? Is some sort of "interactive" mode planned where I can see the results an operation will have on a tensor while stepping through with a debugger? Or will it be similar to GGML, where the graph is built and then executed, which makes debugging extremely tedious?

Our execution happens on the GPU and we generate a single set of commands to execute all ops in one go, so breakpoints are difficult to support. Also intermediate buffers remain in VRAM, are re-used, and cannot easily be made readable.

That said, you can mark any node output as inference output (add it to the output list), so you can obtain intermediate results for debugging. This will introduce some copying and extra buffers but is probably useful.
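
For example, reusing the builder example from the PR description, the intermediate sum could be listed as an extra output next to the final result. Whether TensorRef values need to be cloned when used in more than one place is an assumption here:

```rust
use std::collections::HashMap;
use wonnx::builder::*;

let a = tensor("x", &[1, 3], vec![0.1f32, 0.2, 0.3].into());
let b = tensor("y", &[1, 3], vec![3.0f32, 2.0, 1.0].into());
let sum = a.add(&b);              // intermediate node we want to inspect
let result = sum.clone().neg();   // final node (clone assumed to keep `sum` usable)

// Listing `sum` as an output as well makes its buffer readable after the run
// (at the cost of an extra buffer and a copy, as described above).
let session = session_for_outputs(&["sum", "result"], &[sum, result], 13)
    .await
    .unwrap();
let outputs = session.run(&HashMap::new()).await.unwrap();
println!("intermediate: {:?}", outputs["sum"]);
println!("final result: {:?}", outputs["result"]);
```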

@LLukas22

Our execution happens on the GPU and we generate a single set of commands to execute all ops in one go, so breakpoints are difficult to support. Also intermediate buffers remain in VRAM, are re-used, and cannot easily be made readable.

That said, you can mark any node output as inference output (add it to the output list), so you can obtain intermediate results for debugging. This will introduce some copying and extra buffers but is probably useful.

Alright, got it; being able to mark any tensor as an output is a nice addition. Would it also be possible to build and execute the commands after each tensor operation and copy the results back from VRAM? It would be highly inefficient, but I could see all tensor operation results while debugging. I'm just asking because I worked with GGML a bit and debugging it is an ..... experience.

@pixelspark
Collaborator Author

Our execution happens on the GPU and we generate a single set of commands to execute all ops in one go, so breakpoints are difficult to support. Also intermediate buffers remain in VRAM, are re-used, and cannot easily be made readable.
That said, you can mark any node output as inference output (add it to the output list), so you can obtain intermediate results for debugging. This will introduce some copying and extra buffers but is probably useful.

Alright, got it; being able to mark any tensor as an output is a nice addition. Would it also be possible to build and execute the commands after each tensor operation and copy the results back from VRAM? It would be highly inefficient, but I could see all tensor operation results while debugging. I'm just asking because I worked with GGML a bit and debugging it is an ..... experience.

Well, given the current API, you could build your graph iteratively, create a new session each time (with the most recently added node as output) and call .run on it. (The TensorRefs you get from the graph builder API are not tied to a session and so can be re-used; a session does exclusively maintain its own GPU buffers, though.) It would be terribly slow (also because intermediate results are not re-used, so you are calculating 1 + 2 + 3 + ... + N ops instead of just N), but I've used this approach in the past with some success (let it generate a file containing each node's output, then use a few scripts to find differences against a known reference, stop execution when the difference is too large, etc.).

Another way to do it would be to generate a graph and session for a single op each time (and then run that with the previous node's output). This is a bit more performant, but you are still copying everything back and forth to the GPU, of course. You also need to somehow deal with ops that take more than one input (but given the relative simplicity of the LLaMA models, that should be doable).
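
A rough sketch of the first approach, under the same caveats as above (a fresh session per newly added node; whether TensorRef needs to be cloned is an assumption):

```rust
use std::collections::HashMap;
use wonnx::builder::*;

// Re-run the whole graph up to the newest node after every step; this recomputes
// all earlier nodes each time, hence the 1 + 2 + ... + N cost mentioned above.
let a = tensor("x", &[1, 3], vec![0.1f32, 0.2, 0.3].into());
let b = tensor("y", &[1, 3], vec![3.0f32, 2.0, 1.0].into());

let step1 = a.add(&b);
let session1 = session_for_outputs(&["step1"], &[step1.clone()], 13).await.unwrap();
let out1 = session1.run(&HashMap::new()).await.unwrap();
println!("after add: {:?}", out1["step1"]);

let step2 = step1.neg();
let session2 = session_for_outputs(&["step2"], &[step2], 13).await.unwrap();
let out2 = session2.run(&HashMap::new()).await.unwrap();
println!("after neg: {:?}", out2["step2"]);
```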

@LLukas22

Hey, I finally got some time to take a look at these changes and will start to play around a bit. Is there a list of all supported tensor ops somewhere? And how would I integrate a custom op, like ALiBi?

@pixelspark
Collaborator Author

Hey, I finally got some time to take a look at these changes and will start to play around a bit. Is there a list of all supported tensor ops somewhere? And how would I integrate a custom op, like ALiBi?

Currently it is probably easiest to check the giant match statement in wonnx/src/compiler.rs. It matches on the ONNX op name and then generates shader code (through tera templates). I want to change this to an enum/trait with associated methods in this PR, but haven't yet had the time to do so.

To implement a new op, see these instructions. Remember that in addition to implementing the op itself we also prefer to have some tests and (if relevant) shape inference code.
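
As a rough illustration of where that refactor could go (purely a sketch; the variant names, fields and trait are made up and not part of this PR):

```rust
// Hypothetical replacement for "op name + attributes": one typed variant per op.
pub enum Op {
    Add,
    Neg,
    Conv {
        kernel_shape: Vec<u64>,
        strides: Vec<u64>,
        pads: Vec<u64>,
    },
    // ...one variant per supported op, with its parameters as typed fields
}

// The shader generation that currently lives in the big `match` in
// wonnx/src/compiler.rs could then become a method per op.
pub trait CompileOp {
    fn wgsl_source(&self) -> String;
}
```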

@LLukas22

I converted a GPT-2 model to ONNX and visualized the operations here. The main things still missing are the different shape-manipulating operations like "Squeeze, Unsqueeze, Concat, Gather" and the like. But maybe you could also take a look; this is the first time I'm working with ONNX, so maybe I'm missing something 😅.

I will probably try to implement a multi-head attention layer with the current graph API.

@pixelspark
Collaborator Author

I converted a GPT-2 model to ONNX and visualized the operations here. The main things still missing are the different shape-manipulating operations like "Squeeze, Unsqueeze, Concat, Gather" and the like. But maybe you could also take a look; this is the first time I'm working with ONNX, so maybe I'm missing something 😅.

Forgot to mention two things that might be useful:

  • You can use the nnx (wonnx-cli) tool to list operations used in an ONNX model (it is not smart enough to check WONNX support itself). That is a bit easier to read than your chart I guess :-)
  • Some operations (Reshape, Squeeze, Unsqueeze notably) only modify shape information and are 'optimized away' in wonnx/src/optimizer.rs (removed from the DAG).

Off the top of my head, Concat and Gather are implemented (and if not, they are relatively trivial operations, as they just copy data).

@LLukas22

Forgot to mention two things that might be useful:

  • You can use the nnx (wonnx-cli) tool to list operations used in an ONNX model (it is not smart enough to check WONNX support itself). That is a bit easier to read than your chart I guess :-)
  • Some operations (Reshape, Squeeze, Unsqueeze notably) only modify shape information and are 'optimized away' in wonnx/src/optimizer.rs (removed from the DAG).

Yeah, I will give this a try; maybe it will produce a more readable output. But GPT-2 should be a good starting point, as it is relatively simple, openly available, and contains all the building blocks we need for other LLMs.

Off the top of my head, Concat and Gather are implemented (and if not, they are relatively trivial operations, as they just copy data).

I just looked at the functions implemented by TensorRef, which isn't much, but later realized these are just wrappers around the functions defined in compiler.rs. But I now know what you were referring to with "Replace 'op name + attributes' with an actual enum containing all (supported) ops and their parameters". 😅

I'm also kind of confused about how I could create an f16 tensor, as the TensorData enum only specifies F32. Would the existing operations even work with f16 tensors?

@pixelspark
Collaborator Author

pixelspark commented May 31, 2023

But I now know what you were referring to with "Replace 'op name + attributes' with an actual enum containing all (supported) ops and their parameters". 😅

Yeah, that's a bit ugly :-) But also quite easily cleaned up.

I'm also kind of confused about how I could create an f16 tensor, as the TensorData enum only specifies F32. Would the existing operations even work with f16 tensors?

F16 is not yet supported. However, it should be easy to add as long as WGSL supports it. All shaders are 'generic' over the 'scalar type' (through the scalar_type variable in the shader code templates), so all that is needed is basically some code to tell WONNX the correct WGSL type name and a few other things (related to e.g. alignment), as well as some From trait implementations for loading/reading F16 tensors.

My strategy would be: add an F16 variant to the TensorData enum (note: there is no native Rust f16 type, so either add a crate to support it or convert from f32...), then fix all the compiler complaints about missing match arms.
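
A minimal sketch of that first step, assuming the half crate is used for the missing native f16 type (the other existing TensorData variants are elided):

```rust
use half::f16; // assumption: the `half` crate supplies the f16 type
use std::borrow::Cow;

pub enum TensorData<'model> {
    F32(Cow<'model, [f32]>),
    // New variant; every `match` on TensorData now needs an extra arm,
    // and the compiler will point out each missing one.
    F16(Cow<'model, [f16]>),
    // ...other existing variants elided
}
```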

@LLukas22

Got it, my plan was to somehow get a GPT-2 F16 GGML model loaded and executed and then work my way toward some more exotic quantization formats, which will require more work.

Is there a way to instantiate a tensor from a memory address, an offset and dimensions? Or do I have to first load the GGML tensor into lists and then instantiate a tensor from those lists? Would something like direct GPU storage be possible, or is that out of scope for the WebGPU standard?

@FL33TW00D

FL33TW00D commented Jun 1, 2023

Got it, my plan was to somehow get a GPT-2 F16 GGML model loaded and executed and then work my way toward some more exotic quantization formats, which will require more work.

Is there a way to instantiate a tensor from a memory address, an offset and dimensions? Or do I have to first load the GGML tensor into lists and then instantiate a tensor from those lists? Would something like direct GPU storage be possible, or is that out of scope for the WebGPU standard?

If I were you, I would write a script that packs the GGML weights. This would pack 2 Float 16 values into a single u32 value. You could then use https://www.w3.org/TR/WGSL/#unpack2x16float-builtin to unpack them inside your shader.

I've done something similar here: https://gist.github.com/FL33TW00D/d81562557279d887705985f7c6ae4481
But it operates on an existing ONNX model.
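
A sketch of the packing side in Rust, assuming the half crate; the low 16 bits hold the first value, so that WGSL's unpack2x16float returns the pair in order:

```rust
use half::f16;

/// Pack pairs of f16 values into u32 words so a WGSL shader can recover them
/// with unpack2x16float() (low 16 bits = first value, high 16 bits = second).
fn pack_f16_pairs(values: &[f16]) -> Vec<u32> {
    values
        .chunks(2)
        .map(|pair| {
            let lo = pair[0].to_bits() as u32;
            // Pad an odd-length input with a zero half in the high bits (arbitrary choice).
            let hi = pair.get(1).map(|v| v.to_bits() as u32).unwrap_or(0);
            lo | (hi << 16)
        })
        .collect()
}
```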

@LLukas22

LLukas22 commented Jun 1, 2023

If I were you, I would write a script that packs the GGML weights. This would pack 2 Float 16 values into a single u32 value. You could then use https://www.w3.org/TR/WGSL/#unpack2x16float-builtin to unpack them inside your shader.

This could be interesting, thanks. But I thought WebGPU had native f16 support since this got merged. That's the reason I was asking for f16 support in WONNX.

I've done something similar here: https://gist.github.com/FL33TW00D/d81562557279d887705985f7c6ae4481
But it operates on an existing ONNX model.

Also very interesting; I probably have to try something similar when I implement the other quantization formats. But I guess I can also reference the GGML BLAS implementation a bit.

@FL33TW00D

If I were you, I would write a script that packs the GGML weights. This would pack 2 Float 16 values into a single u32 value. You could then use https://www.w3.org/TR/WGSL/#unpack2x16float-builtin to unpack them inside your shader.

This could be interesting, thanks. But I thought WebGPU had native f16 support since this got merged. That's the reason I was asking for f16 support in WONNX.

I've done something similar here: https://gist.github.com/FL33TW00D/d81562557279d887705985f7c6ae4481
But it operates on an existing ONNX model.

Also very interesting; I probably have to try something similar when I implement the other quantization formats. But I guess I can also reference the GGML BLAS implementation a bit.

gpuweb/gpuweb hosts the specification, but that does not mean it has been implemented anywhere.
F16 is gated behind a feature and is not currently available in any stable browser.

Chrome is in the process of shipping F16, but it is not there yet. naga and wgpu support will take even longer, and that would be required for F16 in WONNX.

@pixelspark
Collaborator Author

Is there a way to instantiate a tensor from a memory address, an offset and dimensions? Or do I have to first load the GGML tensor into lists and then instantiate a tensor from those lists?

The TensorData enum takes a Cow<'model, [f32]>, so it will happily use any slice you can throw at it as long as it has a lifetime of 'model or longer (use bytemuck, as used elsewhere in wonnx, to cast slices if necessary; I suppose you will have a pointer to some mmap'ed memory that you want to convert into a slice).
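
For instance, here is a sketch assuming the mmap'ed region outlives the model ('model), contains little-endian f32 data, and is 4-byte aligned at the given offset (TensorData and the import path follow the builder example above):

```rust
use std::borrow::Cow;
use wonnx::builder::*;

// `mapped` is the mmap'ed model file, kept alive for the whole 'model lifetime.
fn weights_from_mmap<'model>(
    mapped: &'model [u8],
    offset: usize,
    n_elements: usize,
) -> TensorData<'model> {
    let bytes = &mapped[offset..offset + n_elements * std::mem::size_of::<f32>()];
    // bytemuck::cast_slice panics if the slice is not aligned for f32, so the
    // offset must be 4-byte aligned (otherwise copy into an aligned Vec<f32> first).
    let floats: &[f32] = bytemuck::cast_slice(bytes);
    TensorData::F32(Cow::Borrowed(floats))
}
```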

Would something like direct GPU storage be possible, or is that out of scope for the WebGPU standard?

Not sure what you mean by this - as far as I know there are no special provisions for unified memory in WebGPU/wgpu but I might be mistaken.

@LLukas22

LLukas22 commented Jun 2, 2023

Not sure what you mean by this - as far as I know there are no special provisions for unified memory in WebGPU/wgpu but I might be mistaken.

I was referring to something like GPUDirect Storage, where data can be moved directly into VRAM without being loaded into RAM first.

The TensorData enum takes a Cow<'model, [f32]>, so it will happily use any slice you can throw at it as long as it has a lifetime of 'model or longer (use bytemuck, as used elsewhere in wonnx, to cast slices if necessary; I suppose you will have a pointer to some mmap'ed memory that you want to convert into a slice).

If I find some time this weekend, I'll give it a try. Thanks for the info. And yes, you are right: we are working with either mmap'ed memory or a pointer to a file.

As I don't want to spam this thread further, is there a way to reach you directly (e.g. Discord) if I have further questions? And again, sorry for being a noob, but as I said, this is my first time working with wgpu. 😅

@pixelspark
Collaborator Author

As I don't want to spam this thread further, is there a way to reach you directly (e.g. Discord) if I have further questions? And again, sorry for being a noob, but as I said, this is my first time working with wgpu. 😅

Feel free to spam the thread ;-) I am not (regularly) on Discord but can be reached by email (see profile).
