Hand off code to a preinstalled optimized runtime if available #2

Open
dtolnay opened this issue Oct 14, 2019 · 33 comments
Labels
help wanted Extra attention is needed

Comments

@dtolnay
Owner

dtolnay commented Oct 14, 2019

From some rough tests, Watt macro expansion is about 15x faster when the runtime is compiled in release mode than when it is compiled in debug mode.

Maybe we can set it up such that users can run something like cargo install watt-runtime, and then our debug-mode runtime can detect whether that optimized runtime is installed and, if it is, hand the program off to it.
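A minimal sketch of the detection step, assuming the optimized runtime would expose a hypothetical watt-runtime binary on PATH after cargo install watt-runtime (the binary name and flag are illustrative, not an existing interface):

use std::process::Command;

// Hypothetical check: is an optimized `watt-runtime` binary installed and responsive?
fn optimized_runtime_available() -> bool {
    Command::new("watt-runtime")
        .arg("--version") // illustrative flag
        .output()
        .map(|out| out.status.success())
        .unwrap_or(false)
}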

@dtolnay added the help wanted label on Oct 14, 2019
@kazimuth

This seems like it should be pretty straightforward to implement. You just need some way to RPC with the tool... which could just be passing token streams to STDIN and reading output / errors from STDOUT / STDERR.
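As a rough illustration of what that hand-off could look like (the watt-runtime binary and its stdin/stdout protocol are hypothetical, with framing and error handling elided):

use std::io::Write;
use std::process::{Command, Stdio};

// Sketch: write the macro's wasm and the input tokens to the runtime's stdin,
// then read the expanded tokens back from its stdout.
fn expand_via_runtime(wasm: &[u8], input_tokens: &str) -> std::io::Result<String> {
    let mut child = Command::new("watt-runtime")
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .spawn()?;
    {
        let stdin = child.stdin.as_mut().expect("piped stdin");
        stdin.write_all(wasm)?;
        stdin.write_all(input_tokens.as_bytes())?;
    }
    let output = child.wait_with_output()?;
    Ok(String::from_utf8_lossy(&output.stdout).into_owned())
}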

It would also be possible to use an entirely different runtime for this, such as wasmtime, which includes a JIT written in Rust. I'm not sure how much faster / lower-latency that is compared to the watt runtime; it would be worth benchmarking. It would be especially worthwhile if this eventually gets added to the Rust toolchain, since then users wouldn't need to worry about the release-mode compile time.

Oh also, the tool should have some form of version check built-in.

I might be able to poke at this next weekend.

@dtolnay
Owner Author

dtolnay commented Oct 14, 2019

I am on board with using a JIT runtime for the precompiled one, but we should make sure that it caches the JIT artifacts. In typical usage you might invoke the same macro many times, and we don't want the JIT to need to run on the same code more than once.

@fitzgen

fitzgen commented Oct 14, 2019

wasmtime does indeed have a code cache, fwiw. +cc @sunfishcode

@alexcrichton
Collaborator

First, I wanted to say thanks for exploring this space, @dtolnay; this is all definitely super useful user-experience work for eventual stabilization in rustc/cargo themselves!

On the topic of an optimized runtime, I'd probably discourage making a watt-specific runtime, since running WebAssembly at speed over the long term can be a very difficult project to keep up with. WebAssembly is evolving (albeit somewhat slowly), and as rustc/LLVM keep up it might be a pain to have yet another runtime to keep up to date. Would you be up for having some exploration done to see if wasmtime could be suitable for this purpose?

The wasmtime runtime would indeed be maintained going forward and would get all the new features as they come into WebAssembly itself. Additionally it will have its own installation, which will involve downloading precompiled binaries, so users don't even have to worry about a long compilation process for an optimized wasm runtime. I'm imagining that the build scripts of the wasm runtime support crates here would detect wasmtime on the host system (or something like that), skip all the code currently compiled (not even compile the interpreted runtime), and go straight to using that.

On a technical level it should be possible, using wasi APIs, to communicate either over stdin/stdout or through files. With wasi/wasmtime it's still somewhat early days, so we can add features there too as necessary!

I wouldn't mind setting aside some time to investigate all this if this all sounds reasonable to you @dtolnay?

@dtolnay changed the title from "Experiment with whether we can hand off code to a preinstalled optimized runtime" to "Hand off code to a preinstalled optimized runtime if available" on Oct 15, 2019
@dtolnay
Owner Author

dtolnay commented Oct 15, 2019

What I would have in mind by a watt-specific runtime isn't a whole new implementation of WebAssembly from scratch, but some existing maintained runtime like wasmtime wrapped with any additional proc macro specific logic we want compiled in release mode. Maybe that additional logic is nothing and we can use a vanilla wasmtime binary -- I just want to make sure we are running as little as possible in our debug-mode shim because the performance difference is extreme.

@alexcrichton what you wrote sounds reasonable to me and I would love if you had time to investigate further. Thanks!

@dtolnay
Owner Author

dtolnay commented Oct 15, 2019

I think an amazing milestone would be when proc macros built for Watt running in Wasmtime are faster than natively compiled proc macros in a typical cargo build — because the performance boost from release-mode compilation of the wasm is bigger than any slowdown from the execution model. That seems like it should be within reach, right?

@kazimuth

kazimuth commented Oct 15, 2019

Question: would it be possible to just bundle platform-specific binaries with Watt? You could make a bunch of watt-runtime-[os]-[arch] packages with binaries in the crate, then add #[cfg]'d dependencies on them in watt-runtime, with a fallback of compiling from scratch. That would make installs pretty much instant for 99% of users, which fixes the main downside of using wasmtime / cranelift (compile time). I don't know if cargo allows baking binaries into crates, though.

@alexcrichton
Collaborator

I've settled on a strategy: to at least prove this out, I'm going to attempt to dlopen libwasmtime.so, which has a C API. That C API would be bound in the watt runtime, and watt would dynamically select, at build time, whether to link to libwasmtime.so or to compile in the fallback interpreter runtime. It'll take me a few days, I think, to get all the fiddling right.
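As a sketch of what that build-time selection could look like (the WASMTIME_LIB_DIR env var and the jit_runtime cfg name are made up for illustration; the actual branch wires this up with RUSTFLAGS, as noted below):

// build.rs sketch: probe for the wasmtime C API library and emit a cfg so the
// crate can compile either the JIT bindings or the fallback interpreter.
use std::path::Path;

fn main() {
    let lib_dir = std::env::var("WASMTIME_LIB_DIR").unwrap_or_default();
    let have_wasmtime = Path::new(&lib_dir).join("libwasmtime_api.so").exists();
    if have_wasmtime {
        println!("cargo:rustc-link-search=native={}", lib_dir);
        println!("cargo:rustc-cfg=jit_runtime");
    }
    println!("cargo:rerun-if-env-changed=WASMTIME_LIB_DIR");
}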

@dtolnay do you have some benchmarks in mind already to game out? Or are you thinking of "let's just compile some crates with serde things"

@dtolnay
Owner Author

dtolnay commented Oct 16, 2019

A good benchmark to start off would be derive(Deserialize) on some simple struct with 6 fields, using the wasm file published in wa-serde-derive.

@kazimuth

@alexcrichton do you know if it would be possible to bundle platform-specific libwasmtime.so binaries with Watt on crates.io?

@alexcrichton
Collaborator

Ok this ended up actually being a lot easier than I thought it would be! Note that I don't actually have a "fallback mode" back to the interpreter, I just changed the whole crate and figured that if this panned out we could figure out how to have source code that simultaneously supports both later.

The jit code all lives on this branch, but it's a bit of a mess. It requires using RUSTFLAGS to pass -L so the build can find libwasmtime_api.so, and also requires LD_LIBRARY_PATH to actually load it at runtime.

I compiled with cargo build (debug mode) using this code:

#![allow(dead_code)]

#[derive(wa_serde_derive::Deserialize)]
struct Foo {
    a: f32,
    b: String,
    c: (String, i32),
    d: Option<u32>,
    e: u128,
    f: f64,
}

#[derive(serde_derive::Deserialize)]
struct Bar {
    a: f32,
    b: String,
    c: (String, i32),
    d: Option<u32>,
    e: u128,
    f: f64,
}

fn main() {
    println!("Hello, world!");
}

I also instrumented a local checkout of serde_derive to just print the duration of a derive(Deserialize). The execution time numbers look like:

| Runtime | time |
| --- | --- |
| serde_derive | 9.36ms |
| watt interpreter | 1388.82ms |
| watt jit | 748.51ms |

The breakdown of the jit looks like:

| Step | time |
| --- | --- |
| creation of instance | 706.11ms |
| calling exported function | 24.10ms |
| creating import map | 5.18ms |
| creating the wasm module | 1.55ms |

Next I was curious about the compile time for the entire project. Here I just included one of the impls above and measured the compile time of cargo build with nothing in cache (a full debug mode build).

| Runtime | compile time |
| --- | --- |
| serde_derive | 10.78s |
| serde + derive feature | 17.83s |
| watt interpreter | 9.12s |
| watt jit | 8.50s |

and finally, the compile time of the watt crate (including dependencies in the jit case that I've added) from scratch, for comparison:

| Runtime | compile time |
| --- | --- |
| watt interpreter | 2.69s |
| watt jit | 1.23s |

Some conclusions:

  • The jit can compile really fast since it's just binding a bunch of C APIs. It can probably compile a bit faster with a bit of elbow grease as well, I didn't try to optimize anything and I know that the imports are probably a bit more monomorphic than they need to be.
  • Expansion with the jit is faster than the interpreter, but only modestly so. AFAIK there has not been a huge amount of effort making wasmtime have screaming fast startup times, but there is work underway to improve this. Additionally I'm almost surely not using the code cache since I didn't actually enable it; I'd need to contact other folks to see how to get that enabled. The code cache would eliminate almost all of the 700ms runtime.
  • The actual function call invocation of the wasm is still slower than the serde code itself in debug mode. I haven't profiled this at all though; the byte-by-byte transfer may be hurting performance. Additionally I'm using libwasmtime_api.so, which AFAIK is not at all optimized for performance yet.

Overall seems promising! Not quite ready for prime time (but then again none of this really is per se), but I think this is a solid path forward.


@kazimuth sorry meant to reply earlier but forgot! I do think we can certainly distribute precompiled libwasmtime.so crates on crates.io, but one of the downsides for proc macros (and serde) specifically is that cargo vendor vendors everything for every platform and would download quite a few binaries that wouldn't end up being needed (one for every platform we have a precompiled object for). For that reason I'm not sure it'd be a great idea to do so, but I think we'd still have a good story for "install wasmtime and your builds are now faster".

alexcrichton added a commit to alexcrichton/wasmtime that referenced this issue Oct 17, 2019
This was used when [prototyping] but I found it wasn't implemented yet!

[prototyping]: dtolnay/watt#2 (comment)
@alexcrichton
Collaborator

Ok dug in a bit more with the help of some wasmtime folks.

The wasmtime crate has support for a local code cache on your system, keyed off basically the checksum of the wasm module blob (afaik). That code cache vastly accelerates the instantiation phase since no compilation needs to happen. Above, an instantiation on my machine took 700ms or so, but with the cache enabled it takes 45ms.
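For reference, this is roughly how the cache is enabled through the wasmtime crate's Config (a minimal sketch assuming the current cache_config_load_default API; the 2019-era API measured here differed):

use wasmtime::{Config, Engine, Module};

// Turn on wasmtime's on-disk compilation cache so a module, keyed by its
// checksum, is only compiled once across proc macro invocations.
fn load_cached_module(wasm: &[u8]) -> anyhow::Result<Module> {
    let mut config = Config::new();
    config.cache_config_load_default()?; // use the default cache directory
    let engine = Engine::new(&config)?;
    Module::new(&engine, wasm)
}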

That means with a cached module, expansion as a whole takes 65.97ms, which splits between loading the cache (45ms), calling the exported function (16ms), creating the import map (3ms), and various small amounts elsewhere.

Looks like loading the cache isn't going to be easy to change much; its 45ms breakdown is roughly:

  • 15ms - zstd decompress the data read from disk
  • 15ms - bincode deserialize decompressed data
  • 11ms - go from deserialized data to an actual wasm instance
  • ~0ms - read from disk (SSD for myself locally)

This also doesn't take into account execution time of the macro, which is still slower than the debug mode version, clocking in at 20-24ms vs the 9ms for serde in debug mode.

My read from this is that we'll want to heavily cache things (wasmtime's cache, cache instances in-process for lots of derive(Deserialize), etc.). I think the next thing to achieve is to get the macro itself executing faster than the debug mode, for which I'll need to do some profiling.

@dtolnay
Owner Author

dtolnay commented Oct 17, 2019

That's awesome! I am away at Rust Belt Rust until Sunday so I won't have a lot of time to engage with this until later, but I would be happy to start landing changes in this repo where it makes sense, for example all the simplified signatures in sym.rs in 5925e60. I've added @alexcrichton as a collaborator.

sunfishcode pushed a commit to bytecodealliance/wasmtime that referenced this issue Oct 18, 2019
This was used when [prototyping] but I found it wasn't implemented yet!

[prototyping]: dtolnay/watt#2 (comment)
@mrowqa

mrowqa commented Oct 18, 2019

Adding my two cents regarding the Wasmtime cache system:

So, the things above might slightly affect the performance. I'll take a look at the SecondaryMap serialization.

@mrowqa

mrowqa commented Oct 18, 2019

@alexcrichton when I was considering if Wasmtime cache needs compression, the uncompressed cache had some places with really low entropy. I haven't investigated it, but my guess was that SecondaryMaps were really sparse. I haven't profiled the code, but new deserialization might be faster. You can compile wasmtime with [patch.crates-io] pointing to my cranelift branch (bytecodealliance/cranelift#1158).

@alexcrichton
Collaborator

Thanks for the info @mrowqa! It's good to know that we've got a lot of knobs to turn for the cache here if necessary, and we can definitely investigate them going forward!

My main worry at this point for any viability whatsoever is understanding why the execution of a wasm optimized procedural macro is 2x slower than the execution of the native unoptimized version.

@alexcrichton
Collaborator

Sorry for the radio silence here, I haven't forgotten about this. I still want to dig in more to investigate the performance of wasm code vs not. It's gonna take some more time though, I haven't had a chance to start.

@mystor
Contributor

mystor commented Oct 29, 2019

Sorry for the radio silence here, I haven't forgotten about this. I still want to dig in more to investigate the performance of wasm code vs not. It's gonna take some more time though, I haven't had a chance to start.

Some of the poor performance may be caused by the shape of the wasm/native ffi boundary. For example, until #10, strings were copied into wasm byte-by-byte. As string passing is used frequently to convert things like Ident and Literal into useful values, direct memory copies should be much faster there. In a macro I was playing with, it improved runtime by seconds (although I was dealing with megabyte string literals, so ymmv...).
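To make the contrast concrete, here is a generic sketch of the bulk-copy approach (the guest allocator callback is a stand-in; watt's real host functions look different):

// Copy a whole string into the guest's linear memory with a single memcpy
// instead of one host call per byte. `guest_alloc` stands in for whatever
// allocation export the wasm module provides.
fn write_string_bulk(
    memory: &mut [u8],
    guest_alloc: impl FnOnce(usize) -> usize,
    s: &str,
) -> (usize, usize) {
    let ptr = guest_alloc(s.len());
    memory[ptr..ptr + s.len()].copy_from_slice(s.as_bytes());
    (ptr, s.len()) // pointer and length handed back to the guest
}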

It might also be desirable to use a fork of proc_macro's client-server code directly. It requires no unsafe code (except in the closure/buffer passing part, which we'd need to replace with wasm memory manipulation anyway), requires only a single ffi method, and is known to be fast enough.

@alexcrichton
Copy link
Collaborator

Ok back to some benchmarking. This is based on #11 to gather timing information so it rules out the issue of bulk-data transfers. The benchmark here is:

#[derive(Serialize)]
struct S(f32, f32, f32, /* 1000 `f32` fields in total ..*/);

Here's the timings I'm getting:

|  | debug | release |
| --- | --- | --- |
| serde_derive (native) | 163.29ms | 82.26ms |
| wa-serde-derive | 1.02s | 753.30ms |
| time in imported functions | 912.32ms | 676.61ms |
| time in wasm | 77.87ms | 48.29ms |
| time in instantiation | 26.88ms | 24.66ms |
| time in making imports | 4.08ms | 3.35ms |

So it looks like almost all the time is spent in the imported functions. Taking a look at those with some instrumentation we get:

| function | debug self time | release self time |
| --- | --- | --- |
| watt::sym::token_stream_extend | 667.43ms | 609.87ms |
| watt::sym::token_stream_push_punct | 48.89ms | 30.14ms |
| watt::sym::token_stream_push_ident | 23.59ms | 10.28ms |
| watt::sym::watt_string_new | 22.23ms | 1.43ms |
| watt::sym::ident_eq_str | 18.69ms | 7.64ms |
| watt::sym::punct_set_span | 10.05ms | 2.91ms |

My conclusion from this is that there's probably lower hanging fruit than further optimizing the wasm runtime. It appears that we basically get all the bang for the buck necessary with wasmtime, and the remaining optimization work would be between the boundary of the watt runtime as well as the proc-macro2 shim that's compiled to wasm and patched in.

@dtolnay or @mystor do you have ideas perhaps looking at this profile of ways that the watt APIs could be improved?

@alexcrichton
Collaborator

I should also mention that for this benchmark the interpreter takes 10.99s in debug mode and 1.15s in release mode. If the runtime API calls are themselves optimized then I think it'd definitely be apparent that (as expected) the JIT is at least one order of magnitude faster than the interpreter, if not multiple (debug: ~10s in wasm vs 77ms; release: ~500ms in wasm vs 48.29ms).

@dtolnay
Owner Author

dtolnay commented Oct 29, 2019

Wow this is great!

Question about the "time in wasm" measurements -- how come there is a 60% difference between debug mode (78ms) and release mode (48ms)? Shouldn't everything going on inside the JIT runtime be the same between those two? Is it including some part of the overhead from the hostfunc calls?

It appears that we basically get all the bang for the buck necessary with wasmtime, and the remaining optimization work would be between the boundary of the watt runtime as well as the proc-macro2 shim that's compiled to wasm and patched in.

I agree.

My first thought for optimizing the boundary is: Right now we are routing every proc_macro API call individually out of the JIT. It would be good to experiment with how not to do that. For example we could provide a WASM compiled version of proc-macro2's fallback implementation that we hand off together with the caller's WASM into the JIT, such that the caller's macro runs against the emulated proc macro library and not real proc_macro calls. Then when their macro returns we translate the resulting emulated TokenStream into a proc_macro::TokenStream.

Basically the only tricky bit is indexing all the spans in the input and remapping each output span into which one of the input spans it corresponds to. The emulated Span type would hold just an index into our list of all input spans.

I believe this would be a large performance improvement because native serde_derive executes in 163ms while wa-serde-derive spends 912ms in hostfuncs -- the effect of this redesign would be that all our hostfunc time is replaced by doing a subset of the work that native serde_derive does, so I would expect the time for the translation from emulated TokenStream to real TokenStream to be less than 163ms in debug mode.
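To make the span bookkeeping concrete, here is a tiny sketch of the indexing scheme described above (type and method names are illustrative, not watt's actual API):

extern crate proc_macro; // available to proc-macro crates

// The host keeps every span from the input in a table; the emulated Span
// inside wasm is just a u32 index into that table, so spans survive the
// round trip losslessly.
struct SpanTable {
    spans: Vec<proc_macro::Span>,
}

impl SpanTable {
    fn intern(&mut self, span: proc_macro::Span) -> u32 {
        self.spans.push(span);
        (self.spans.len() - 1) as u32
    }

    fn resolve(&self, index: u32) -> proc_macro::Span {
        self.spans[index as usize]
    }
}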

@alexcrichton
Collaborator

Yeah, I was sort of perplexed at that myself. I did a quick check though and nothing appears awry, so it's either normal timing differences (30ms even is basically just variance unless you run it a bunch of times) or, as you mentioned, the various surrounding "cruft". There are a few small pieces before/after the timing locations which could have attributed more time to the wasm than was actually spent in wasm in debug mode; I was just crudely timing things by instrumenting all calls with Instant::now() and start.elapsed().

I agree with your intuition as well, that makes sense! To achieve that goal I don't think watt would carry anything precompiled, but rather there could be a scheme where the actual wasm blob contains this instead of what it has today:

use watt_proc_macro2::TokenStream; // not a shadow of `proc-macro2`

#[no_mangle]
pub extern "C" fn my_macro(input: TokenStream) -> TokenStream {
    // not necessary since `watt_proc_macro2` has a statically known initialization symbol
    // we call first before we call `my_macro`, and that initialization function does this.
    // proc_macro2::set_wasm_panic_hook();

    let input = input.into_proc_macro2(); // creates a real crates.io `proc_macro2::TokenStream`

    // .. do the real macro on `proc_macro2::TokenStream`, as you usually do

    let ret = ...;

    // and convert back into a watt token stream
    ret.into()
}

The conversion from a watt_proc_macro2::TokenStream to proc_macro2::TokenStream would be the "serialize everything into wasm" step and the other way would be the "deserialize out of wasm" and would ideally be the only two bridges, everything else would remain purely internal while the wasm is executing.

Furthermore you could actually imagine this being on steroids:

use proc_macro2::TokenStream;
use watt::prelude::*;

#[watt::proc_macro]
pub fn my_macro(input: TokenStream) -> TokenStream {
    // ...
}

Basically watt (or some similarly named crate) could provide all the proc-macro attributes and would do all the conversions for you. That way the changes would largely be in Cargo.toml and build-wise rather than in the code.

Anyway, I digress. Basically my main point is that the wasm blob I think will want the translation baked into it. We could play with a few different deserialization/serialization strategies as well to see which is fastest, but it would indeed be pretty slick if everything stayed internal to the wasm blob except at the very edges.

Some of this may require coordination in proc_macro2 to have a third form of "foreign handle", so actually getting rid of the [patch] may not be viable.

@dtolnay
Owner Author

dtolnay commented Oct 29, 2019

That sounds good!

I don't mind relying on [patch] so much for now, since it's only on the part of macro authors and not consumers. I think once the performance is in good shape we can revisit everything from the tooling and integration side.

@alexcrichton
Collaborator

👍

Ok I'll tinker with this and see what I can come up with

@mystor
Contributor

mystor commented Oct 30, 2019

I think the most important thing is improving the transparency of the API to the optimizer. Many of the specific methods where a lot of time is being spent seem like obvious easy-to-optimize places, so it may be possible to make good progress with a better API (TokenStream::extend is a known problem point from dtolnay/proc-macro2#198, as an example).

My first reservation about the "send all of the data into wasm eagerly" approach was that extra data, like the text of unused Literal objects, may not be necessary. I suppose syn is very likely to to_string every token anyway, though, so we're probably better off sending it down eagerly.

As mentioned, one of the biggest issues there would be Span objects, which can't be re-created from raw binary data. We could probably intern these and use u32 indexes to reference them from within wasm. Each item stored in the wasm heap could then start with one of these values attached, in addition to their string values.

On the wasm side, the data would probably look similar to how it looks today, but with #[repr(C)] types, and u32 instead of pointers for indexes into the wasm address space. The wasm code would likely use wrapper methods to convert the u32s into pointers. We could have helper types like WattBox which would only drop the contained memory when in wasm memory. We'd have to ask the wasm code to allocate the memory region for us first (probably with a series of watt_alloc(size: u32, align: u32) calls?) and then read the final data back in before returning, but that seems quite doable.
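For illustration, a guest-side allocator export of the sort described above might look like the following (the name and exact signature are hypothetical):

use std::alloc::{alloc, Layout};

// Exported from the wasm module so the host can reserve guest memory before
// bulk-copying data in; on wasm32 a pointer fits in a u32.
#[no_mangle]
pub extern "C" fn watt_alloc(size: u32, align: u32) -> u32 {
    let layout = Layout::from_size_align(size as usize, align as usize)
        .expect("invalid layout");
    unsafe { alloc(layout) as u32 }
}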

I'm not sure how much other hidden data is associated with individual tokens beyond spans, but we'd also lose any such information with this model. I'm guessing that there is little enough of that for it to not matter.

@alexcrichton
Collaborator

Ok so it turns out that the API of proc_macro is so conservative this is actually pretty easy to experiment with. Here's a pretty messy commit -- alexcrichton@18f2337. The highlights of this commit are:

  • The ABI boundary of procedural macros compiled to wasm use a RawTokenStream type.
  • The RawTokenStream type is exactly a u32 handle, as-is today.
  • There's one method on RawTokenStream to convert it to a TokenStream. This performs a bulk serialization to a binary format in the host runtime, then returns a Bytes handle. This Bytes handle is then copied into the wasm userspace.
  • Eventually once you're done there's a method to convert back. This, in wasm, serializes the TokenStream into the same binary format as before. The binary blob is passed directly to watt's native runtime for parsing. This is then deserialized into an actual proc_macro::TokenStream.
  • Span is handled by always being an opaque u32 in wasm. That way we still have a u32-per-token in wasm, and it's all managed in the preexisting span array we have in Data today. All-in-all it should be lossless.
  • There are a few bugs. Span::call_site() in wasm probably returns some random span. Something about raw identifiers doesn't work b/c the serde expansion fails with try thinking it's a keyword. In any case these bugs don't hinder the timing and proof-of-concept of the API; they're easy to fix later.

So basically a macro looks like "serialize everything to a binary blob" which retains Span information. Next "deserialize binary blob in wasm". Next, process in wasm. Next "serialize back to binary blob" in wasm. Finally "deserialize binary blob" in the native runtime. The goal here was to absolutely minimize the runtime of imported functions and completely maximize the time spent in wasm.
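Purely to illustrate the shape of such an encoding (the real format in the commit above differs), each serialized token can carry its interned span index so the round trip stays lossless:

// Illustrative flat encoding of a token stream as it crosses the boundary;
// every variant records a span index into the host's span table.
enum RawToken {
    GroupStart { delimiter: u8, span: u32 },
    GroupEnd,
    Ident { span: u32, text: String },
    Punct { span: u32, ch: char, joint: bool },
    Literal { span: u32, text: String },
}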

The timings are looking impressive!

|  | debug | release |
| --- | --- | --- |
| serde_derive (native) | 163.29ms | 82.26ms |
| wa-serde-derive | 334.37ms | 305.54ms |
| time in imported functions | 120.99ms | 98.81ms |
| time in wasm | 176.89ms | 170.20ms |
| time in instantiation | 34.95ms | 35.14ms |
| time in making imports | 659.723µs | 584.408µs |

And for each imported function:

| function | debug self time | release self time |
| --- | --- | --- |
| watt::sym::token_stream_deserialize | 109.4704ms | 93.14ms |
| watt::sym::token_stream_serialize | 9.020899ms | 4.87ms |
| watt::sym::token_stream_parse | 1.681334ms | 725.114µs |

This is dramatically faster, by minimizing the time spent crossing a chatty boundary. We're spending 9x less time in imported code in debug mode and ~6x less in release mode. It sort of makes sense here that the deserialization of what's probably about a megabyte of source code takes 110ms in debug mode.

The "time in wasm" should be break down as (a) deserialize the input, (b) do the processing, and (c) serialize the input. I would expect that (b) should be close to the release mode execution time within a small margin (ish), but (a) and (c) are extra done today. If my assertion about (b) is true (which it probably isn't since I think Cranelift is still working on perf) then there's room to optimize in (a) and (c). For example @mystor's idea for perhaps leaving literals as handles to the watt runtime might make sense, once you create a Literal you almost never look at the actual textual literal.

From this I think I would conclude:

  • Expansion in debug/release mode with wasm has a ~50ms overhead per-crate. This is the overhead to instantiate the wasm module in each procedural macro invocation.
  • Expansion in debug mode with wasm is ~2x slower with wasm
    • Actual wasm itself is only marginally slower (~8%)
    • Serialization/deserialization of token streams is biggest contributor, and is proportional to the size of the input/output. In tests with this serde benchmark it's roughly 2x slower.
  • Expansion in release mode is ~3x slower. Same reasons as with debug mode.

Overall this looks like a great way forward. I suspect further tweaking like @mystor mentions in trying to keep as much string-like data on the watt-runtime side of things could further improve performance. Additionally watt::sym::token_stream_parse is me being too lazy to implement a Rust syntax tokenizer in wasm (aka copy it from proc-macro2), but we could likely optimize that slightly by running that in wasm as well.

@dtolnay
Owner Author

dtolnay commented Oct 30, 2019

  • Never mind, this is how you've done it already. Great! Would it be possible for the boundary to not involve serializing to Rust-like syntax but instead some Bincode-like handrolled compact representation? That should be an obvious win for parsing time, though it may take a bit longer to compile watt itself. Still, I think it's likely to be the right tradeoff.

  • Is it possible to allow for the user's entry point to be written directly in terms of proc_macro2::TokenStream rather than a different RawTokenStream type?

    use proc_macro2::TokenStream;
    
    #[no_mangle]
    pub extern "C" fn demo(input: TokenStream) -> TokenStream {

    I am wondering whether there is anything we can do in how we set up the call into the JIT such that we put the right things in memory and on the stack for this to just work.

@alexcrichton
Collaborator

So, as usual, experimenting is faster than typing up the comment saying what we may want to experiment with. Here's timing information where Literal is not serialized across the boundary (and the Span and Ident issues are fixed). Here Literal is always serialized as a handle, so wasm can either use these literally (ha!) or manufacture its own. I also did a few small optimizations to remove to_string where I could.

|  | debug | release |
| --- | --- | --- |
| serde_derive (native) | 154.957733ms | 86.821315ms |
| wa-serde-derive | 300.809265ms | 278.288324ms |
| time in imported functions | 121.252104ms | 99.493415ms |
| time in wasm | 141.912692ms | 141.881934ms |
| time in instantiation | 35.909112ms | 35.374174ms |
| time in making imports | 871.961µs | 749.318µs |

And for each imported function:

| function | debug self time | release self time |
| --- | --- | --- |
| watt::sym::token_stream_deserialize | 111.502829ms | 94.597588ms |
| watt::sym::token_stream_serialize | 7.348556ms | 4.056007ms |
| watt::sym::token_stream_parse | 1.566301ms | 767.402µs |

So that was an easy 30ms win!


@dtolnay to answer your question about the signature, would you be opposed to a macro? Something like #[watt::proc_macro] to hide the details?

@dtolnay
Owner Author

dtolnay commented Oct 30, 2019

It shouldn't require an attribute macro though, right? We control exactly what argument the main entry point receives here. I am imagining something like (pseudocode):

let raw_token_stream = Val::i32(d.tokenstream.push(input) as i32);
let input_token_stream = raw_to_pm2.call(&[raw_token_stream]).unwrap()[0];
let output_token_stream = main.call(&[input_token_stream]).unwrap()[0];
let raw_token_stream = pm2_to_raw.call(&[output_token_stream]).unwrap()[0];
return d.tokenstream[raw_token_stream].clone();

where main is the user-provided no_mangle entry point and raw_to_pm2 + pm2_to_raw are no_mangle functions built into our patched proc-macro2, equivalent to RawTokenStream::into_token_stream and TokenStream::into_raw_token_stream.

@alexcrichton
Collaborator

That's possible but would require specifying the ABI of TokenStream itself as a u32, which today it's a Vec<TokenTree> internally. I've generally found a macro to be useful for decoupling the API and the ABI because we don't necessarily want users to write down the ABI but rather we have an API we want them to adhere to.

@dtolnay
Owner Author

dtolnay commented Oct 30, 2019

Ah, makes sense. Yes I would be on board with an attribute macro to hide the ABI.

@mystor
Contributor

mystor commented Oct 30, 2019

FWIW I experimented a bit, a while ago, with some really hacky macros around watt to allow writing proc_macro crates inline within the module you're working with (https://github.com/mystor/ctrs if anyone's interested, though it's pretty darn hacky). I included a transformation like the one you're talking about for #[watt::proc_macro]. It's perhaps a bit dumber than is needed here, though.

@alexcrichton
Collaborator

Ok I've sent the culmination of all of this in as #14

alexcrichton added a commit to alexcrichton/watt that referenced this issue Jun 14, 2022
I was curious to see the impact of Wasmtime's recent development since I
last added the `WATT_JIT` env var feature to `watt` a few years ago
since quite a lot has changed about Wasmtime in the meantime. The
changes in this PR account for some ABI changes which have happened in
the C API, none of which amount to anything major.

Taking my old benchmark of `#[derive(Serialize)]` on
`struct S(f32, ...  /* 1000 times */)` the timings I get for the latest
version of `serde_derive` are:

|         | native | watt  | watt (cached) |
|---------|--------|-------|---------------|
| debug   | 156ms  | 280ms | 125ms         |
| release |  70ms  | 257ms | 100ms         |

Using instead `#[derive(Serialize)] struct S(f32)` the timings I get are:

|         | native | watt  | watt (cached) |
|---------|--------|-------|---------------|
| debug   |  1ms   | 241ms | 41ms          |
| release |  387us | 205ms | 46ms          |

So for large inputs jit-compiled WebAssembly can be faster than the
native `serde_derive` when serde is itself compiled in debug mode. Note
that this is almost always the default nowadays since `cargo build
--release` will currently build build-dependencies with no
optimizations. Only through explicit profile configuration can
`serde_derive` be built in optimized mode (as I did to collect the
above numbers).

The `watt (cached)` column is where I enabled Wasmtime's global
compilation cache to avoid recompiling the module every time the
proc-macro is loaded which is why the timings are much lower. The
difference between `watt` and `watt (cached)` is the compile time of the
module itself. The 40ms or so in `watt (cached)` is almost entirely
overhead of loading the module from cache which involves decompressing
the module from disk and additionally sloshing bytes around. More
efficient storage mediums exist for Wasmtime modules which means that it
would actually be pretty easy to shave off a good chunk of time from
that. Additionally Wasmtime has a custom C API which significantly
differs from the one used in this repository and which would also be
significantly faster for calling into the host from wasm. Of the current
~3ms runtime in wasm itself that could probably be reduced further with
more optimized calls.

Overall this seems like pretty good progress made on Wasmtime in the
interim since all my initial work in dtolnay#2. In any case I wanted to post
this to get the `WATT_JIT` feature at least working again since
otherwise it's segfaulting right now, and perhaps in the future if
necessary more perf work can be done!