Builder API and further removal of ONNX from the IR #170
base: master
Conversation
This is awesome! Nice work, looking forward to trying it out 🚀
@philpax you should be able to try this out now (see limitations above). Looking for some feedback!
Unfortunately I'm pretty busy this week, but I'll see if one of our devs can try it out 🤞
What will the developer experience be like? Is some sort of "interactive" mode planned where I can see the results an operation will have on a tensor while stepping through with a debugger? Or will it be similar to GGML, where the graph is built and then executed, which makes debugging extremely tedious?
Our execution happens on the GPU and we generate a single set of commands to execute all ops in one go, so breakpoints are difficult to support. Also, intermediate buffers remain in VRAM, are re-used, and cannot easily be made readable. That said, you can mark any node output as an inference output (add it to the output list), so you can obtain intermediate results for debugging. This will introduce some copying and extra buffers but is probably useful.
Alright, got it; being able to mark any tensor as an output is a nice addition. Would it also be possible to build and execute the commands after each tensor operation and copy the results back from VRAM? It would be highly inefficient, but I could see all tensor operation results while debugging. I'm just asking, as I worked with GGML a bit and debugging it is an ..... experience.
Well, given the current API, you could build your graph iteratively, create a new session each time (with the most recently added node as output) and run it. Another way to do it would be to generate a graph and session for a single op each time (and then run that with the previous node's output). This is a bit more performant, but you are still copying everything back and forth from the GPU of course. Also, you need to somehow deal with ops that take more than one input (but given the relative simplicity of the LLaMA models that should be doable).
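For illustration, here is a toy sketch of that "one session per node" loop; the `Graph` and `Session` types below are stand-ins meant only to show the shape of the workflow, not wonnx's actual API:

```rust
// Stand-in types illustrating the "fresh session per intermediate node"
// debugging pattern described above; none of this is wonnx's real API.
struct Graph {
    nodes: Vec<String>,
}

struct Session {
    output: String,
}

impl Graph {
    fn new() -> Self {
        Graph { nodes: Vec::new() }
    }
    fn push(&mut self, node: &str) {
        self.nodes.push(node.to_string());
    }
    // Pretend the most recently added node is marked as the inference output.
    fn session_for_last(&self) -> Session {
        Session { output: self.nodes.last().cloned().unwrap_or_default() }
    }
}

impl Session {
    // Stand-in for dispatching the GPU commands and copying the buffer back.
    fn run(&self) -> Vec<f32> {
        println!("running with output node: {}", self.output);
        vec![0.0; 4]
    }
}

fn main() {
    let mut graph = Graph::new();
    // Hypothetical node names; in practice these would be the ops of the model.
    for op in ["embed", "matmul_0", "softmax_0"] {
        graph.push(op);
        let intermediate = graph.session_for_last().run();
        println!("{op}: {intermediate:?}"); // inspect every intermediate result
    }
}
```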
Hey, I finally got some time to take a look at these changes and I will start to play around a bit. Is there a list of all supported tensor ops somewhere? And how would I integrate a custom op, like ALiBi?
Currently it is probably easiest to check the giant `match` over op types in the compiler source. To implement a new op, see these instructions. Remember that in addition to implementing the op itself, we also prefer to have some tests and (if relevant) shape inference code.
I converted a GPT-2 model to ONNX and visualized the operations here. The main things still missing are the different shape-manipulating operations like Squeeze, Unsqueeze, Concat, Gather and stuff like that. But maybe you could also take a look; this is the first time I'm working with ONNX, so maybe I'm missing something 😅. I will probably try to implement a multi-head attention layer with the current graph API.
Forgot to mention two things that might be useful:
Off the top of my head, Concat and Gather are implemented (but if not, they are relatively trivial operations as they just copy data).
Yeah, I will give this a try; maybe it will produce a more readable output. But GPT-2 should be a good starting point as it is relatively simple, openly available, and contains all the building blocks we need for other LLMs.
I just looked at the functions implemented by the compiler. I'm also kinda confused about how I could create an F16 tensor.
Yeah, that's a bit ugly :-) But also quite easily cleaned up.
F16 is not yet supported. However, it should be easy to add as long as WGSL supports it. All shaders are 'generic' over the 'scalar type' (through the shader templates). My strategy would be to start by adding an F16 scalar type.
Got it. My plan was to somehow get a GPT-2 F16 GGML model loaded and executed, and then work my way toward some more exotic quantization formats, which will require more work. Is there a way to instantiate a tensor from a memory address, an offset and dimensions? Or do I have to first load the GGML tensor into lists and then instantiate a tensor with these lists? Would something like direct GPU storage be possible, or is this out of scope of the WebGPU standard?
If I were you, I would write a script that packs the GGML weights. This would pack two F16 values into a single u32 value. You could then use https://www.w3.org/TR/WGSL/#unpack2x16float-builtin to unpack them inside your shader. I've done something similar here: https://gist.github.com/FL33TW00D/d81562557279d887705985f7c6ae4481
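For illustration, a minimal Rust sketch of that packing step (this assumes the `half` crate; per the WGSL spec, unpack2x16float reads component 0 from the low 16 bits, so the first value of each pair goes into the low half):

```rust
// Pack pairs of F16 weights into u32 words so a WGSL shader can recover them
// with unpack2x16float(). Uses the `half` crate for the f32 -> f16 conversion.
use half::f16;

fn pack_f16_pairs(values: &[f32]) -> Vec<u32> {
    values
        .chunks(2)
        .map(|pair| {
            let lo = f16::from_f32(pair[0]).to_bits() as u32;
            // Odd-length input: pad the high half with zero.
            let hi = pair.get(1).map_or(0, |v| f16::from_f32(*v).to_bits()) as u32;
            (hi << 16) | lo // component 0 of unpack2x16float = low 16 bits
        })
        .collect()
}

fn main() {
    let weights = [1.5_f32, -0.25, 3.0];
    let packed = pack_f16_pairs(&weights);
    // Upload `packed` as a storage buffer of u32; in the shader, each element
    // yields a vec2<f32> via unpack2x16float(element).
    println!("{packed:x?}");
}
```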
This could be interesting, thanks. But I thought WebGPU had native F16 support?
Also very interesting; I probably have to try something similar when I try to implement the other quantization formats. But I guess I can also reference the BLAS GGML implementation a bit.
Chrome is in the process of shipping F16, but not yet.
Not sure what you mean by this - as far as I know there are no special provisions for unified memory in WebGPU/wgpu, but I might be mistaken.
I was referring to something like GPUDirect Storage, where data can be moved directly into VRAM without being loaded into RAM first.
If I find some time this weekend, I'll give it a try. Thanks for the info. And yes, you are right, we are working with either mmap'ed memory or a pointer to a file. As I don't want to spam this thread further, is there a way to reach you directly (e.g. Discord) if I have further questions? And again, sorry for being nooby, but as I said this is my first time working with wgpu. 😅
Feel free to spam the thread ;-) I am not (regularly) on Discord but can be reached by email (see profile).
This is a first attempt at addressing #169 as well as an earlier suggestion to move ONNX out of the IR.
tl;dr, this PR exposes a 'builder API' that allows you to use WONNX inference without an ONNX model (of course currently only ops from ONNX are available, but due to the further decoupling of the IR and ONNX, it will be easier to add custom ops in the future):
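As a rough illustration of the idea only (the `input`, `tensor` and `add` constructors below are toy stand-ins, not necessarily the PR's actual function names or signatures):

```rust
// Toy stand-in for the builder-API idea: describe a small graph of ops in
// code instead of loading an ONNX model. All types and functions here are
// illustrative stubs, not wonnx's real API.
#[derive(Debug)]
enum Node {
    Input { name: String, shape: Vec<usize> },
    Tensor { name: String, shape: Vec<usize>, data: Vec<f32> },
    Add(Box<Node>, Box<Node>),
}

// Inference-time input (data supplied when running the session).
fn input(name: &str, shape: &[usize]) -> Node {
    Node::Input { name: name.into(), shape: shape.to_vec() }
}

// Constant tensor baked into the graph (e.g. model weights).
fn tensor(name: &str, shape: &[usize], data: Vec<f32>) -> Node {
    Node::Tensor { name: name.into(), shape: shape.to_vec(), data }
}

// Elementwise addition node.
fn add(a: Node, b: Node) -> Node {
    Node::Add(Box::new(a), Box::new(b))
}

fn main() {
    // y = x + w, where `w` is a stored weight and `x` is provided at inference time.
    let y = add(
        input("x", &[1, 4]),
        tensor("w", &[1, 4], vec![0.5; 4]),
    );
    // A session would then be built from `y` (the requested output) and run
    // with a map from input names ("x") to tensor data.
    println!("{y:#?}");
}
```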
Currently the API is very limited (only `add`, `neg` and `conv` ops, as well as `input` and `tensor` to define inference-time inputs and weights). Some work that needs to be done:

- … (`AttributeValue` in particular)
- An `enum` containing all (supported) ops and their parameters.
- Shape inference (currently for `conv` the user has to specify the output shape, but the code to determine the output shape based on input shapes is already in `wonnx_preprocessing`; we might want to move that into the core crate, however the shape inference code is still very specific to ONNX).
- Split into `wonnx-backend` and `wonnx` (= `pub use wonnx-backend::*` + the ONNX bits); clean up the API surface in general.
- … (`TensorData<'static>`)

Help is very much appreciated!
Nice to haves