Conversation

@alexcrichton
Member

The fastest way to call a WebAssembly function with Wasmtime is to use
the `TypedFunc` API and its methods. This is only available to Rust code,
however, because it relies on generics. As a result the C API is left
only able to use `Func::call`, which is quite slow today (a sketch
contrasting the two paths follows the list below). `Func::call` is slow
for a number of reasons, but some major contributors are:

* Space must be allocated for the arguments/return values to call the
  trampoline with. This `u128` storage is allocated on all
  `Func::call`-based calls today.

* The function's type is loaded to typecheck the arguments, and this
  requires taking an rwlock in the `Engine` as well as cloning out the
  `FuncType` itself.

* For the C API the slice of inputs needs to be translated to a slice of
  `Val`, and the results are translated from a vector of `Val` back to a
  vector of `wasmtime_val_t`.

These operations are all costly, and the goal of this commit is to
address them. The solution implemented here is a new structure, called
`FuncStorage`, which can be created within an `Engine` on a
per-function-type basis. This storage is then used with a new API,
`Func::call_with_storage`, which removes the first two slowdowns
mentioned above. Each `FuncStorage` stores a copy of the `FuncType` it's
intended to be used with, along with an appropriately-sized `Vec<u128>`
for storage of trampoline-encoded arguments.

The final bullet above is solved with tweaks to the
`Func::call_with_storage` API relative to `Func::call`, where the
parameters/results are both iterators instead of slices.
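
To make that shape concrete, here is a hypothetical sketch of how the proposed API might be used, reusing the `add`/`store`/`engine` setup from the sketch above. The `FuncStorage::new` constructor and the exact `call_with_storage` signature shown are assumptions inferred from this description, not the code in this PR:

```rust
// Hypothetical usage of the API proposed here; the constructor and the exact
// shape of `call_with_storage` are assumptions for illustration only.
let ty = add.ty(&store);
// Assumed: per-function-type storage created from the `Engine`, holding a copy
// of the `FuncType` plus an appropriately-sized `Vec<u128>` scratch buffer.
let mut storage = FuncStorage::new(&engine, &ty);

let params = [Val::I32(1), Val::I32(2)];
let mut results = [Val::I32(0)];
// Parameters and results are iterators rather than slices, which would let the
// C API translate `wasmtime_val_t` values on the fly instead of first building
// a `Vec<Val>`.
add.call_with_storage(
    &mut store,
    &mut storage,
    params.iter().cloned(),
    results.iter_mut(),
)?;
assert_eq!(results[0].i32(), Some(3));
```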

This new API is intended to be a "power user" API for the Rust crate,
but is expected to be more commonly used with the C API since it's such
a large performance improvement to calling wasm functions.

Overall I'm not overly happy with this API. It solves much of the slow
`wasmtime_func_call` problem, but I think the APIs added here are pretty
unfortunate. Ideally we could solve this issue with no additional API
surface area. For example, the first bullet could be solved along the
lines of #3294, where vectors are stored in a `Store` and reused
per-call. The third bullet could probably be fixed in the same style,
combined with changing `Func::call` to take a `&mut [Val]` as an
argument instead of returning a boxed slice. The second bullet, though,
is probably one of the harder ones to fix. Each `Func` could store its
fully-fleshed-out `FuncType`, but that's a relatively large impact and
would also likely require changing `FuncType` to internally use
`Arc<[WasmType]>` or similar (a hypothetical sketch of that idea follows
below). In any case I'm hoping that this can help spur on some
creativity for someone to find a better solution to this issue.
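
As an illustration of that last point, here is a purely hypothetical sketch of an `Arc`-backed `FuncType` layout. This is not Wasmtime's actual internal representation, just the shape that would make storing a clone of the type in every `Func` cheap:

```rust
use std::sync::Arc;

// Hypothetical stand-in for Wasmtime's internal value-type enum.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum WasmType {
    I32,
    I64,
    F32,
    F64,
}

// Hypothetical `FuncType` where the parameter/result lists live behind `Arc`s,
// so cloning the type is two reference-count bumps rather than copying vectors.
#[derive(Clone)]
struct FuncType {
    params: Arc<[WasmType]>,
    results: Arc<[WasmType]>,
}

impl FuncType {
    fn new(params: &[WasmType], results: &[WasmType]) -> Self {
        FuncType {
            params: Arc::from(params),
            results: Arc::from(results),
        }
    }
}
```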

github-actions bot added the wasmtime:api (Related to the API of the `wasmtime` crate itself) and wasmtime:c-api (Issues pertaining to the C API) labels on Sep 3, 2021

github-actions bot commented Sep 3, 2021

Subscribe to Label Action

cc @peterhuene

This issue or pull request has been labeled: "wasmtime:api", "wasmtime:c-api"

Thus the following users have been cc'd because of the following labels:

  • peterhuene: wasmtime:api, wasmtime:c-api

To subscribe or unsubscribe from this label, edit the .github/subscribe-to-label.json configuration file.


@alexcrichton
Member Author

Pondering this more, I'm not happy with this. I'm going to try the alternate strategies to speed up `Func::call` and see what happens.

alexcrichton added a commit to alexcrichton/wasmtime that referenced this pull request Sep 8, 2021
This commit is an alternative to bytecodealliance#3298 which achieves effectively the
same goal of optimizing the `Func::call` API as well as its C API
sibling of `wasmtime_func_call`. The strategy taken here is different
from bytecodealliance#3298, though: rather than creating a new API, a
small tweak is made to an existing one. Specifically, this commit
handles the major sources of slowness in `Func::call` as follows:

* Looking up the type of a function, to typecheck the arguments with and
  use to guide how the results should be loaded, no longer hits the
  rwlock in the `Engine` but instead each `Func` contains its own
  `FuncType`. This can be an unnecessary allocation for funcs not used
  with `Func::call`, so this is a downside of this implementation
  relative to bytecodealliance#3298. A mitigating factor, though, is that instance
  exports are loaded lazily into the `Store` and in theory not too many
  funcs are active in the store as `Func` objects.

* Temporary storage is amortized with a long-lived `Vec` in the `Store`
  rather than allocating a new vector on each call. This is basically
  the same strategy as bytecodealliance#3294 only applied to different types in
  different places. Specifically `wasmtime::Store` now retains a
  `Vec<u128>` for `Func::call`, and the C API retains a `Vec<Val>` for
  calling `Func::call`.

* Finally, an API breaking change is made to `Func::call` and its type
  signature (as well as `Func::call_async`). Instead of returning
  `Box<[Val]>` as it did before, this function now takes a
  `results: &mut [Val]` parameter. This allows the caller to manage the
  allocation and we can amortize-remove it in `wasmtime_func_call` by
  using space after the parameters in the `Vec<Val>` we're passing in.
  This change is naturally a breaking change and we'll want to consider
  it carefully, but mitigating factors are that most embeddings are
  likely using `TypedFunc::call` instead and this signature taking a
  mutable slice better aligns with `Func::new` which receives a mutable
  slice for the results.

Overall this change, in the benchmark of "call a nop function from the C
API", is not quite as good as bytecodealliance#3298. It's still a bit slower, on the
order of 15ns, because there are lots of capacity checks around vectors
and the type checks are slightly less optimized than before. Overall
though this is still significantly better than today because allocations
and the rwlock to acquire the type information are both avoided. I
personally feel that this change is the best to do because it has less
of an API impact than bytecodealliance#3298.
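
For reference, here is a minimal sketch of the call shape described in this commit message, which is roughly the signature that eventually shipped; it reuses the `add`/`store` setup from the first sketch above:

```rust
// `Func::call` under the alternative approach: the caller supplies the results
// buffer, so no boxed slice is allocated per call, while the `u128` trampoline
// scratch space is amortized in a long-lived `Vec` inside the `Store`.
let params = [Val::I32(1), Val::I32(2)];
let mut results = [Val::I32(0)];
add.call(&mut store, &params, &mut results)?;
assert_eq!(results[0].i32(), Some(3));
```
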
@alexcrichton
Member Author

Ok, I have an alternative implementation with less (not zero, just less) API impact at #3319. I'm currently favoring that one, despite it being slightly slower in benchmarks because there's still so much movement of `Val` values, since I'm not sure the new API added here is really worth it.

alexcrichton added a commit to alexcrichton/wasmtime that referenced this pull request Sep 8, 2021
alexcrichton added a commit to alexcrichton/wasmtime that referenced this pull request Sep 9, 2021
alexcrichton added a commit to alexcrichton/wasmtime that referenced this pull request Sep 13, 2021
@alexcrichton
Member Author

I'm going to close this because I don't think it's the best way to optimize this; it futzes with too many APIs and I think it will be annoying to maintain.

alexcrichton added a commit to alexcrichton/wasmtime that referenced this pull request Sep 13, 2021
alexcrichton added a commit that referenced this pull request Sep 21, 2021
* Optimize `Func::call` and its C API

* Rebase issues
alexcrichton deleted the arguments branch on September 27, 2021