Hand off code to a preinstalled optimized runtime if available #2

Open
dtolnay opened this issue Oct 14, 2019 · 33 comments
Labels
help wanted Extra attention is needed

Comments

@dtolnay
Owner

dtolnay commented Oct 14, 2019

From some rough tests, Watt macro expansion is about 15x faster when the runtime is compiled in release mode than when it is compiled in debug mode.

Maybe we can set it up such that users can run something like cargo install watt-runtime, and then our debug-mode runtime can detect whether that optimized runtime is installed and, if it is, hand the program off to it.
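A minimal sketch of the detection step, assuming the optimized runtime would expose a hypothetical watt-runtime binary on PATH after cargo install watt-runtime (the binary name and flag are illustrative, not an existing interface):

use std::process::Command;

// Hypothetical check: is an optimized `watt-runtime` binary installed and responsive?
fn optimized_runtime_available() -> bool {
    Command::new("watt-runtime")
        .arg("--version") // illustrative flag
        .output()
        .map(|out| out.status.success())
        .unwrap_or(false)
}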

@dtolnay added the help wanted label on Oct 14, 2019
@kazimuth

This seems like it should be pretty straightforward to implement. You just need some way to RPC with the tool... which could just be passing token streams to STDIN and reading output / errors from STDOUT / STDERR.
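As a rough illustration of what that hand-off could look like (the watt-runtime binary and its stdin/stdout protocol are hypothetical, with framing and error handling elided):

use std::io::Write;
use std::process::{Command, Stdio};

// Sketch: write the macro's wasm and the input tokens to the runtime's stdin,
// then read the expanded tokens back from its stdout.
fn expand_via_runtime(wasm: &[u8], input_tokens: &str) -> std::io::Result<String> {
    let mut child = Command::new("watt-runtime")
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .spawn()?;
    {
        let stdin = child.stdin.as_mut().expect("piped stdin");
        stdin.write_all(wasm)?;
        stdin.write_all(input_tokens.as_bytes())?;
    }
    let output = child.wait_with_output()?;
    Ok(String::from_utf8_lossy(&output.stdout).into_owned())
}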

It would also be possible to use an entirely different runtime for this, such as wasmtime, which includes a JIT written in Rust. I'm not sure how much faster / lower-latency that is compared to the watt runtime; it would be worth benchmarking. It would be especially worthwhile if this eventually gets added to the Rust toolchain, since then users wouldn't need to worry about the release-mode compile time.

Oh also, the tool should have some form of version check built-in.

I might be able to poke at this next weekend.

@dtolnay
Owner Author

dtolnay commented Oct 14, 2019

I am on board with using a JIT runtime for the precompiled one, but we should make sure that it caches the JIT artifacts. In typical usage you might invoke the same macro many times, and we don't want the JIT to need to run on the same code more than once.

@fitzgen

fitzgen commented Oct 14, 2019

wasmtime does indeed have a code cache, fwiw. +cc @sunfishcode

@alexcrichton
Collaborator

First, I wanted to say thanks for exploring this space, @dtolnay; this is all definitely super useful user-experience work for eventual stabilization in rustc/cargo themselves!

On the topic of an optimized runtime, I'd probably discourage making a watt-specific runtime, since running WebAssembly at speed over the long term can be a very difficult project to keep up with. WebAssembly is evolving (albeit somewhat slowly), and as rustc/LLVM keep up it might be a pain to have yet another runtime to keep up to date. Would you be up for having some exploration done to see if wasmtime could be suitable for this purpose?

The wasmtime runtime would indeed be maintained going forward and would get all the new features as they come into WebAssembly itself. Additionally it will have its own installation, which will involve downloading precompiled binaries, so users don't even have to worry about a long compilation process for an optimized wasm runtime. I'm imagining that the build scripts of the wasm runtime support crates here would detect wasmtime on the host system (or something like that), skip all the code currently compiled (not even compile the interpreted runtime), and go straight to using that.

On a technical level it should be possible, using wasi APIs, to communicate either over stdin/stdout or through files. With wasi/wasmtime it's still somewhat early days, so we can add features there too as necessary!

I wouldn't mind setting aside some time to investigate all this if this all sounds reasonable to you @dtolnay?

@dtolnay changed the title from "Experiment with whether we can hand off code to a preinstalled optimized runtime" to "Hand off code to a preinstalled optimized runtime if available" on Oct 15, 2019
@dtolnay
Owner Author

dtolnay commented Oct 15, 2019

What I would have in mind by a watt-specific runtime isn't a whole new implementation of WebAssembly from scratch, but some existing maintained runtime like wasmtime wrapped with any additional proc macro specific logic we want compiled in release mode. Maybe that additional logic is nothing and we can use a vanilla wasmtime binary -- I just want to make sure we are running as little as possible in our debug-mode shim because the performance difference is extreme.

@alexcrichton what you wrote sounds reasonable to me and I would love if you had time to investigate further. Thanks!

@dtolnay
Owner Author

dtolnay commented Oct 15, 2019

I think an amazing milestone would be when proc macros built for Watt running in Wasmtime are faster than natively compiled proc macros in a typical cargo build — because the performance boost from release-mode compilation of the wasm is bigger than any slowdown from the execution model. That seems like it should be within reach, right?

@kazimuth

kazimuth commented Oct 15, 2019

Question: would it be possible to just bundle platform-specific binaries with Watt? You could make a bunch of watt-runtime-[os]-[arch] packages with binaries in the crate, then add #[cfg]'d dependencies on them in watt-runtime, with a fallback of compiling from scratch. That would make installs pretty much instant for 99% of users, which fixes the main downside of using wasmtime / cranelift (compile time). I don't know if cargo allows baking binaries into crates, though.

@alexcrichton
Collaborator

I've settled on a strategy: to at least prove this out, I'm going to attempt to dlopen libwasmtime.so, which has a C API. That C API would be bound in the watt runtime, and watt would dynamically select, at build time, whether to link to libwasmtime.so or to compile in the fallback interpreter runtime. It'll take me a few days, I think, to get all the fiddling right.
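As a sketch of what that build-time selection could look like (the WASMTIME_LIB_DIR env var and the jit_runtime cfg name are made up for illustration; the actual branch wires this up with RUSTFLAGS, as noted below):

// build.rs sketch: probe for the wasmtime C API library and emit a cfg so the
// crate can compile either the JIT bindings or the fallback interpreter.
use std::path::Path;

fn main() {
    let lib_dir = std::env::var("WASMTIME_LIB_DIR").unwrap_or_default();
    let have_wasmtime = Path::new(&lib_dir).join("libwasmtime_api.so").exists();
    if have_wasmtime {
        println!("cargo:rustc-link-search=native={}", lib_dir);
        println!("cargo:rustc-cfg=jit_runtime");
    }
    println!("cargo:rerun-if-env-changed=WASMTIME_LIB_DIR");
}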

@dtolnay do you have some benchmarks in mind already to game out? Or are you thinking of "let's just compile some crates with serde things"

@dtolnay
Owner Author

dtolnay commented Oct 16, 2019

A good benchmark to start off would be derive(Deserialize) on some simple struct with 6 fields, using the wasm file published in wa-serde-derive.

@kazimuth

@alexcrichton do you know if it would be possible to bundle platform-specific libwasmtime.so binaries with Watt on crates.io?

@alexcrichton
Collaborator

Ok this ended up actually being a lot easier than I thought it would be! Note that I don't actually have a "fallback mode" back to the interpreter, I just changed the whole crate and figured that if this panned out we could figure out how to have source code that simultaneously supports both later.

The jit code all lives on this branch, but it's a bit of a mess. It requires using RUSTFLAGS to pass -L so the build can find libwasmtime_api.so, and also requires LD_LIBRARY_PATH to actually load it at runtime.

I compiled with cargo build (debug mode) using this code:

#![allow(dead_code)]

#[derive(wa_serde_derive::Deserialize)]
struct Foo {
    a: f32,
    b: String,
    c: (String, i32),
    d: Option<u32>,
    e: u128,
    f: f64,
}

#[derive(serde_derive::Deserialize)]
struct Bar {
    a: f32,
    b: String,
    c: (String, i32),
    d: Option<u32>,
    e: u128,
    f: f64,
}

fn main() {
    println!("Hello, world!");
}

I also instrumented a local checkout of serde_derive to just print the duration of a derive(Deserialize). The execution time numbers look like:

| Runtime | time |
| --- | --- |
| serde_derive | 9.36ms |
| watt interpreter | 1388.82ms |
| watt jit | 748.51ms |

The breakdown of the jit looks like:

| Step | time |
| --- | --- |
| creation of instance | 706.11ms |
| calling exported function | 24.10ms |
| creating import map | 5.18ms |
| creating the wasm module | 1.55ms |

Next I was curious about the compile time for the entire project. Here I just included one of the impls above and measured the compile time of cargo build with nothing in cache (a full debug mode build).

| Runtime | compile time |
| --- | --- |
| serde_derive | 10.78s |
| serde + derive feature | 17.83s |
| watt interpreter | 9.12s |
| watt jit | 8.50s |

and finally, the compile time of the watt crate (including dependencies in the jit case that I've added) from scratch, for comparison:

| Runtime | compile time |
| --- | --- |
| watt interpreter | 2.69s |
| watt jit | 1.23s |

Some conclusions:

  • The jit can compile really fast since it's just binding a bunch of C APIs. It can probably compile a bit faster with a bit of elbow grease as well, I didn't try to optimize anything and I know that the imports are probably a bit more monomorphic than they need to be.
  • Expansion with the jit is faster than the interpreter, but only modestly so. AFAIK there has not been a huge amount of effort making wasmtime have screaming fast startup times, but there is work underway to improve this. Additionally I'm almost surely not using the code cache since I didn't actually enable it; I'd need to contact other folks to see how to get that enabled. The code cache would eliminate almost all of the 700ms runtime.
  • The actual function call invocation of the wasm is still slower than the serde code itself in debug mode. I haven't profiled this at all though; the byte-by-byte transfer may be hurting performance. Additionally I'm using libwasmtime_api.so, which AFAIK is not at all optimized for performance yet.

Overall seems promising! Not quite ready for prime time (but then again none of this really is per se), but I think this is a solid path forward.


@kazimuth sorry meant to reply earlier but forgot! I do think we can certainly distribute precompiled libwasmtime.so crates on crates.io, but one of the downsides for proc macros (and serde) specifically is that cargo vendor vendors everything for every platform and would download quite a few binaries that wouldn't end up being needed (one for every platform we have a precompiled object for). For that reason I'm not sure it'd be a great idea to do so, but I think we'd still have a good story for "install wasmtime and your builds are now faster".

alexcrichton added a commit to alexcrichton/wasmtime that referenced this issue Oct 17, 2019
This was used when [prototyping] but I found it wasn't implemented yet!

[prototyping]: dtolnay/watt#2 (comment)
@alexcrichton
Collaborator

Ok dug in a bit more with the help of some wasmtime folks.

The wasmtime crate has support for a local code cache on your system, keyed off basically the checksum of the wasm module blob (afaik). That code cache vastly accelerates the instantiation phase since no compilation needs to happen. Above, an instantiation on my machine took 700ms or so, but with the cache enabled it takes 45ms.
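For reference, this is roughly how the cache is enabled through the wasmtime crate's Config (a minimal sketch assuming the current cache_config_load_default API; the 2019-era API measured here differed):

use wasmtime::{Config, Engine, Module};

// Turn on wasmtime's on-disk compilation cache so a module, keyed by its
// checksum, is only compiled once across proc macro invocations.
fn load_cached_module(wasm: &[u8]) -> anyhow::Result<Module> {
    let mut config = Config::new();
    config.cache_config_load_default()?; // use the default cache directory
    let engine = Engine::new(&config)?;
    Module::new(&engine, wasm)
}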

That means with a cached module, expansion as a whole takes 65.97ms, which splits between loading the cache (45ms), calling the exported function (16ms), creating the import map (3ms), and various small amounts elsewhere.

Looks like loading the cache isn't going to be easy to change much; its 45ms breakdown is roughly:

  • 15ms - zstd decompress the data read from disk
  • 15ms - bincode deserialize decompressed data
  • 11ms - go from deserialized data to an actual wasm instance
  • ~0ms - read from disk (SSD for myself locally)

This also doesn't take into account execution time of the macro, which is still slower than the debug mode version, clocking in at 20-24ms vs the 9ms for serde in debug mode.

My read from this is that we'll want to heavily cache things (wasmtime's cache, cache instances in-process for lots of derive(Deserialize), etc.). I think the next thing to achieve is to get the macro itself executing faster than the debug mode, for which I'll need to do some profiling.

@dtolnay
Owner Author

dtolnay commented Oct 17, 2019

That's awesome! I am away at Rust Belt Rust until Sunday so I won't have a lot of time to engage with this until later, but I would be happy to start landing changes in this repo where it makes sense, for example all the simplified signatures in sym.rs in 5925e60. I've added @alexcrichton as a collaborator.

sunfishcode pushed a commit to bytecodealliance/wasmtime that referenced this issue Oct 18, 2019
This was used when [prototyping] but I found it wasn't implemented yet!

[prototyping]: dtolnay/watt#2 (comment)
@mrowqa

mrowqa commented Oct 18, 2019

Adding my two cents regarding the Wasmtime cache system:

So, the things above might slightly affect the performance. I'll take a look at the SecondaryMap serialization.

@mrowqa

mrowqa commented Oct 18, 2019

@alexcrichton when I was considering if Wasmtime cache needs compression, the uncompressed cache had some places with really low entropy. I haven't investigated it, but my guess was that SecondaryMaps were really sparse. I haven't profiled the code, but new deserialization might be faster. You can compile wasmtime with [patch.crates-io] pointing to my cranelift branch (bytecodealliance/cranelift#1158).

@alexcrichton
Collaborator

Thanks for the info @mrowqa! It's good to know that we've got a lot of knobs to turn for the cache here if necessary, and we can definitely investigate them going forward!

My main worry at this point for any viability whatsoever is understanding why the execution of a wasm optimized procedural macro is 2x slower than the execution of the native unoptimized version.

@alexcrichton
Collaborator

Sorry for the radio silence here, I haven't forgotten about this. I still want to dig in more to investigate the performance of wasm code vs not. It's gonna take some more time though, I haven't had a chance to start.

@mystor
Contributor

mystor commented Oct 29, 2019

Sorry for the radio silence here, I haven't forgotten about this. I still want to dig in more to investigate the performance of wasm code vs not. It's gonna take some more time though, I haven't had a chance to start.

Some of the poor performance may be caused by the shape of the wasm/native ffi boundary. For example, until #10, strings were copied into wasm byte-by-byte. As string passing is used frequently to convert things like Ident and Literal into useful values, direct memory copies should be much faster there. In a macro I was playing with, it improved runtime by seconds (although I was dealing with megabyte string literals, so ymmv...).
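To make the contrast concrete, here is a generic sketch of the bulk-copy approach (the guest allocator callback is a stand-in; watt's real host functions look different):

// Copy a whole string into the guest's linear memory with a single memcpy
// instead of one host call per byte. `guest_alloc` stands in for whatever
// allocation export the wasm module provides.
fn write_string_bulk(
    memory: &mut [u8],
    guest_alloc: impl FnOnce(usize) -> usize,
    s: &str,
) -> (usize, usize) {
    let ptr = guest_alloc(s.len());
    memory[ptr..ptr + s.len()].copy_from_slice(s.as_bytes());
    (ptr, s.len()) // pointer and length handed back to the guest
}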

It might also be desirable to use a fork of proc_macro's client-server code directly. It requires no unsafe code (except in the closure/buffer passing part, which we'd need to replace with wasm memory manipulation anyway), requires only a single ffi method, and is known to be fast enough.

@alexcrichton
Copy link
Collaborator

Ok back to some benchmarking. This is based on #11 to gather timing information so it rules out the issue of bulk-data transfers. The benchmark here is:

#[derive(Serialize)]
struct S(f32, f32, f32, /* 1000 `f32` fields in total ..*/);

Here's the timings I'm getting:

|  | debug | release |
| --- | --- | --- |
| serde_derive (native) | 163.29ms | 82.26ms |
| wa-serde-derive | 1.02s | 753.30ms |
| time in imported functions | 912.32ms | 676.61ms |
| time in wasm | 77.87ms | 48.29ms |
| time in instantiation | 26.88ms | 24.66ms |
| time in making imports | 4.08ms | 3.35ms |

So it looks like almost all the time is spent in the imported functions. Taking a look at those with some instrumentation we get:

| function | debug self time | release self time |
| --- | --- | --- |
| watt::sym::token_stream_extend | 667.43ms | 609.87ms |
| watt::sym::token_stream_push_punct | 48.89ms | 30.14ms |
| watt::sym::token_stream_push_ident | 23.59ms | 10.28ms |
| watt::sym::watt_string_new | 22.23ms | 1.43ms |
| watt::sym::ident_eq_str | 18.69ms | 7.64ms |
| watt::sym::punct_set_span | 10.05ms | 2.91ms |

My conclusion from this is that there's probably lower hanging fruit than further optimizing the wasm runtime. It appears that we basically get all the bang for the buck necessary with wasmtime, and the remaining optimization work would be between the boundary of the watt runtime as well as the proc-macro2 shim that's compiled to wasm and patched in.

@dtolnay or @mystor do you have ideas perhaps looking at this profile of ways that the watt APIs could be improved?

@alexcrichton
Collaborator

I should also mention that for this benchmark the interpreter takes 10.99s in debug mode and 1.15s in release mode. If the runtime API calls are themselves optimized then I think it'd definitely be apparent that (as expected) the JIT is at least one order of magnitude faster than the interpreter, if not multiple (debug: ~10s in wasm vs 77ms; release: ~500ms in wasm vs 48.29ms).

@dtolnay
Owner Author

dtolnay commented Oct 29, 2019

Wow this is great!

Question about the "time in wasm" measurements -- how come there is a 60% difference between debug mode (78ms) and release mode (48ms)? Shouldn't everything going on inside the JIT runtime be the same between those two? Is it including some part of the overhead from the hostfunc calls?

It appears that we basically get all the bang for the buck necessary with wasmtime, and the remaining optimization work would be between the boundary of the watt runtime as well as the proc-macro2 shim that's compiled to wasm and patched in.

I agree.

My first thought for optimizing the boundary is: Right now we are routing every proc_macro API call individually out of the JIT. It would be good to experiment with how not to do that. For example we could provide a WASM compiled version of proc-macro2's fallback implementation that we hand off together with the caller's WASM into the JIT, such that the caller's macro runs against the emulated proc macro library and not real proc_macro calls. Then when their macro returns we translate the resulting emulated TokenStream into a proc_macro::TokenStream.

Basically the only tricky bit is indexing all the spans in the input and remapping each output span into which one of the input spans it corresponds to. The emulated Span type would hold just an index into our list of all input spans.

I believe this would be a large performance improvement because native serde_derive executes in 163ms while wa-serde-derive spends 912ms in hostfuncs -- the effect of this redesign would be that all our hostfunc time is replaced by doing a subset of the work that native serde_derive does, so I would expect the time for the translation from emulated TokenStream to real TokenStream to be less than 163ms in debug mode.
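To make the span bookkeeping concrete, here is a tiny sketch of the indexing scheme described above (type and method names are illustrative, not watt's actual API):

extern crate proc_macro; // available to proc-macro crates

// The host keeps every span from the input in a table; the emulated Span
// inside wasm is just a u32 index into that table, so spans survive the
// round trip losslessly.
struct SpanTable {
    spans: Vec<proc_macro::Span>,
}

impl SpanTable {
    fn intern(&mut self, span: proc_macro::Span) -> u32 {
        self.spans.push(span);
        (self.spans.len() - 1) as u32
    }

    fn resolve(&self, index: u32) -> proc_macro::Span {
        self.spans[index as usize]
    }
}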

@alexcrichton
Collaborator

Yeah, I was sort of perplexed at that myself. I did a quick check though and nothing appears awry, so it's either normal timing differences (30ms even is basically just variance unless you run it a bunch of times) or, as you mentioned, the various surrounding "cruft". There are a few small pieces before/after the timing locations which could have attributed more time to the wasm than was actually spent in wasm in debug mode; I was just crudely timing things by instrumenting all calls with Instant::now() and start.elapsed().

I agree with your intuition as well, that makes sense! To achieve that goal I don't think watt would carry anything precompiled, but rather there could be a scheme where the actual wasm blob contains this instead of what it has today:

use watt_proc_macro2::TokenStream; // not a shadow of `proc-macro2`

#[no_mangle]
pub extern "C" fn my_macro(input: TokenStream) -> TokenStream {
    // not necessary since `watt_proc_macro2` has a statically known initialization symbol
    // we call first before we call `my_macro`, and that initialization function does this.
    // proc_macro2::set_wasm_panic_hook();

    let input = input.into_proc_macro2(); // creates a real crates.io `proc_macro2::TokenStream`

    // .. do the real macro on `proc_macro2::TokenStream`, as you usually do

    let ret = ...;

    // and convert back into a watt token stream
    ret.into()
}

The conversion from a watt_proc_macro2::TokenStream to proc_macro2::TokenStream would be the "serialize everything into wasm" step and the other way would be the "deserialize out of wasm" and would ideally be the only two bridges, everything else would remain purely internal while the wasm is executing.

Furthermore you could actually imagine this being on steroids:

use proc_macro2::TokenStream;
use watt::prelude::*;

#[watt::proc_macro]
pub fn my_macro(input: TokenStream) -> TokenStream {
    // ...
}

Basically watt (or some similarly named crate) could provide all the proc-macro attributes and would do all the conversions for you. That way the changes would largely be in Cargo.toml and build-wise rather than in the code.

Anyway, I digress. Basically my main point is that the wasm blob I think will want the translation baked into it. We could play with a few different deserialization/serialization strategies as well to see which is fastest, but it would indeed be pretty slick if everything stayed internal to the wasm blob except at the very edges.

Some of this may require coordination in proc_macro2 to have a third form of "foreign handle", so actually getting rid of the [patch] may not be viable.

@dtolnay
Owner Author

dtolnay commented Oct 29, 2019

That sounds good!

I don't mind relying on [patch] so much for now, since it's only on the part of macro authors and not consumers. I think once the performance is in good shape we can revisit everything from the tooling and integration side.

@alexcrichton
Collaborator

👍

Ok I'll tinker with this and see what I can come up with

@mystor
Contributor

mystor commented Oct 30, 2019

I think the most important thing is improving the transparency of the API to the optimizer. Many of the specific methods where a lot of time is being spent seem like obvious easy-to-optimize places, so it may be possible to make good progress with a better API (TokenStream::extend is a known problem point from dtolnay/proc-macro2#198, as an example).

My first reservation about the "send all of the data into wasm eagerly" approach was that extra data, like the text of unused Literal objects, may not be necessary. I suppose syn is very likely to to_string every token anyway, though, so we're probably better off sending it down eagerly.

As mentioned, one of the biggest issues there would be Span objects, which can't be re-created from raw binary data. We could probably intern these and use u32 indexes to reference them from within wasm. Each item stored in the wasm heap could then start with one of these values attached, in addition to their string values.

On the wasm side, the data would probably look similar to how it looks today, but with #[repr(C)] types, and u32 instead of pointers for indexes into the wasm address space. The wasm code would likely use wrapper methods to convert the u32s into pointers. We could have helper types like WattBox which would only drop the contained memory when in wasm memory. We'd have to ask the wasm code to allocate the memory region for us first (probably with a series of watt_alloc(size: u32, align: u32) calls?) and then read the final data back in before returning, but that seems quite doable.
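For illustration, a guest-side allocator export of the sort described above might look like the following (the name and exact signature are hypothetical):

use std::alloc::{alloc, Layout};

// Exported from the wasm module so the host can reserve guest memory before
// bulk-copying data in; on wasm32 a pointer fits in a u32.
#[no_mangle]
pub extern "C" fn watt_alloc(size: u32, align: u32) -> u32 {
    let layout = Layout::from_size_align(size as usize, align as usize)
        .expect("invalid layout");
    unsafe { alloc(layout) as u32 }
}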

I'm not sure how much other hidden data is associated with individual tokens beyond spans, but we'd also lose any such information with this model. I'm guessing that there is little enough of that for it to not matter.

@alexcrichton
Collaborator

Ok so it turns out that the API of proc_macro is so conservative this is actually pretty easy to experiment with. Here's a pretty messy commit -- alexcrichton@18f2337. The highlights of this commit are:

  • The ABI boundary of procedural macros compiled to wasm use a RawTokenStream type.
  • The RawTokenStream type is exactly a u32 handle, as-is today.
  • There's one method on RawTokenStream to convert it to a TokenStream. This performs a bulk serialization to a binary format in the host runtime, then returns a Bytes handle. This Bytes handle is then copied into the wasm userspace.
  • Eventually once you're done there's a method to convert back. This, in wasm, serializes the TokenStream into the same binary format as before. The binary blob is passed directly to watt's native runtime for parsing. This is then deserialized into an actual proc_macro::TokenStream.
  • Span is handled by always being an opaque u32 in wasm. That way we still have a u32-per-token in wasm, and it's all managed in the preexisting span array we have in Data today. All-in-all it should be lossless.
  • There are a few bugs. Span::call_site() in wasm probably returns some random span. Something about raw identifiers doesn't work b/c the serde expansion fails with try thinking it's a keyword. In any case these bugs don't hinder the timing and proof-of-concept of the API; they're easy to fix later.

So basically a macro looks like "serialize everything to a binary blob" which retains Span information. Next "deserialize binary blob in wasm". Next, process in wasm. Next "serialize back to binary blob" in wasm. Finally "deserialize binary blob" in the native runtime. The goal here was to absolutely minimize the runtime of imported functions and completely maximize the time spent in wasm.
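Purely to illustrate the shape of such an encoding (the real format in the commit above differs), each serialized token can carry its interned span index so the round trip stays lossless:

// Illustrative flat encoding of a token stream as it crosses the boundary;
// every variant records a span index into the host's span table.
enum RawToken {
    GroupStart { delimiter: u8, span: u32 },
    GroupEnd,
    Ident { span: u32, text: String },
    Punct { span: u32, ch: char, joint: bool },
    Literal { span: u32, text: String },
}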

The timings are looking impressive!

|  | debug | release |
| --- | --- | --- |
| serde_derive (native) | 163.29ms | 82.26ms |
| wa-serde-derive | 334.37ms | 305.54ms |
| time in imported functions | 120.99ms | 98.81ms |
| time in wasm | 176.89ms | 170.20ms |
| time in instantiation | 34.95ms | 35.14ms |
| time in making imports | 659.723µs | 584.408µs |

And for each imported function:

| function | debug self time | release self time |
| --- | --- | --- |
| watt::sym::token_stream_deserialize | 109.4704ms | 93.14ms |
| watt::sym::token_stream_serialize | 9.020899ms | 4.87ms |
| watt::sym::token_stream_parse | 1.681334ms | 725.114µs |

This is dramatically faster, by minimizing the time spent crossing a chatty boundary. We're spending 9x less time in imported code in debug mode and ~6x less in release mode. It sort of makes sense here that the deserialization of what's probably about a megabyte of source code takes 110ms in debug mode.

The "time in wasm" should be break down as (a) deserialize the input, (b) do the processing, and (c) serialize the input. I would expect that (b) should be close to the release mode execution time within a small margin (ish), but (a) and (c) are extra done today. If my assertion about (b) is true (which it probably isn't since I think Cranelift is still working on perf) then there's room to optimize in (a) and (c). For example @mystor's idea for perhaps leaving literals as handles to the watt runtime might make sense, once you create a Literal you almost never look at the actual textual literal.

From this I think I would conclude:

  • Expansion in debug/release mode with wasm has a ~50ms overhead per-crate. This is the overhead to instantiate the wasm module in each procedural macro invocation.
  • Expansion in debug mode with wasm is ~2x slower with wasm
    • Actual wasm itself is only marginally slower (~8%)
    • Serialization/deserialization of token streams is biggest contributor, and is proportional to the size of the input/output. In tests with this serde benchmark it's roughly 2x slower.
  • Expansion in release mode is ~3x slower. Same reasons as with debug mode.

Overall this looks like a great way forward. I suspect further tweaking like @mystor mentions in trying to keep as much string-like data on the watt-runtime side of things could further improve performance. Additionally watt::sym::token_stream_parse is me being too lazy to implement a Rust syntax tokenizer in wasm (aka copy it from proc-macro2), but we could likely optimize that slightly by running that in wasm as well.

@dtolnay
Owner Author

dtolnay commented Oct 30, 2019

  • Never mind, this is how you've done it already. Great! Would it be possible for the boundary to not involve serializing to Rust-like syntax but instead some Bincode-like handrolled compact representation? That should be an obvious win for parsing time, though it may take a bit longer to compile watt itself. Still, I think it's likely to be the right tradeoff.

  • Is it possible to allow for the user's entry point to be written directly in terms of proc_macro2::TokenStream rather than a different RawTokenStream type?

    use proc_macro2::TokenStream;
    
    #[no_mangle]
    pub extern "C" fn demo(input: TokenStream) -> TokenStream {

    I am wondering whether there is anything we can do in how we set up the call into the JIT such that we put the right things in memory and on the stack for this to just work.

@alexcrichton
Collaborator

So, as usual, experimenting is faster than typing up the comment saying what we may want to experiment with. Here's timing information where Literal is not serialized across the boundary (and the Span and Ident issues are fixed). Here Literal is always serialized as a handle, so wasm can either use these literally (ha!) or manufacture its own. I also did a few small optimizations to remove to_string where I could.

|  | debug | release |
| --- | --- | --- |
| serde_derive (native) | 154.957733ms | 86.821315ms |
| wa-serde-derive | 300.809265ms | 278.288324ms |
| time in imported functions | 121.252104ms | 99.493415ms |
| time in wasm | 141.912692ms | 141.881934ms |
| time in instantiation | 35.909112ms | 35.374174ms |
| time in making imports | 871.961µs | 749.318µs |

And for each imported function:

| function | debug self time | release self time |
| --- | --- | --- |
| watt::sym::token_stream_deserialize | 111.502829ms | 94.597588ms |
| watt::sym::token_stream_serialize | 7.348556ms | 4.056007ms |
| watt::sym::token_stream_parse | 1.566301ms | 767.402µs |

So that was an easy 30ms win!


@dtolnay to answer your question about the signature, would you be opposed to a macro? Something like #[watt::proc_macro] to hide the details?

@dtolnay
Owner Author

dtolnay commented Oct 30, 2019

It shouldn't require an attribute macro though, right? We control exactly what argument the main entry point receives here. I am imagining something like (pseudocode):

let raw_token_stream = Val::i32(d.tokenstream.push(input) as i32);
let input_token_stream = raw_to_pm2.call(&[raw_token_stream]).unwrap()[0];
let output_token_stream = main.call(&[input_token_stream]).unwrap()[0];
let raw_token_stream = pm2_to_raw.call(&[output_token_stream]).unwrap()[0];
return d.tokenstream[raw_token_stream].clone();

where main is the user-provided no_mangle entry point and raw_to_pm2 + pm2_to_raw are no_mangle functions built into our patched proc-macro2, equivalent to RawTokenStream::into_token_stream and TokenStream::into_raw_token_stream.

@alexcrichton
Collaborator

That's possible but would require specifying the ABI of TokenStream itself as a u32, which today it's a Vec<TokenTree> internally. I've generally found a macro to be useful for decoupling the API and the ABI because we don't necessarily want users to write down the ABI but rather we have an API we want them to adhere to.

@dtolnay
Owner Author

dtolnay commented Oct 30, 2019

Ah, makes sense. Yes I would be on board with an attribute macro to hide the ABI.

@mystor
Contributor

mystor commented Oct 30, 2019

FWIW I experimented a bit, a while ago, with some really hacky macros around watt to allow writing proc_macro crates inline within the module you're working with (https://github.com/mystor/ctrs if anyone's interested, though it's pretty darn hacky). I included a transformation like the one you're talking about for #[watt::proc_macro]. It's perhaps a bit dumber than is needed here, though.

@alexcrichton
Collaborator

Ok I've sent the culmination of all of this in as #14

alexcrichton added a commit to alexcrichton/watt that referenced this issue Jun 14, 2022
I was curious to see the impact of Wasmtime's recent development since I
last added the `WATT_JIT` env var feature to `watt` a few years ago
since quite a lot has changed about Wasmtime in the meantime. The
changes in this PR account for some ABI changes which have happened in
the C API, none of which amount to anything major.

Taking my old benchmark of `#[derive(Serialize)]` on
`struct S(f32, ...  /* 1000 times */)` the timings I get for the latest
version of `serde_derive` are:

|         | native | watt  | watt (cached) |
|---------|--------|-------|---------------|
| debug   | 156ms  | 280ms | 125ms         |
| release |  70ms  | 257ms | 100ms         |

Using instead `#[derive(Serialize)] struct S(f32)` the timings I get are:

|         | native | watt  | watt (cached) |
|---------|--------|-------|---------------|
| debug   |  1ms   | 241ms | 41ms          |
| release |  387us | 205ms | 46ms          |

So for large inputs jit-compiled WebAssembly can be faster than the
native `serde_derive` when serde is itself compiled in debug mode. Note
that this is almost always the default nowadays since `cargo build
--release` will currently build build-dependencies with no
optimizations. Only through explicit profile configuration can
`serde_derive` be built in optimized mode (as I did to collect the
above numbers).

The `watt (cached)` column is where I enabled Wasmtime's global
compilation cache to avoid recompiling the module every time the
proc-macro is loaded which is why the timings are much lower. The
difference between `watt` and `watt (cached)` is the compile time of the
module itself. The 40ms or so in `watt (cached)` is almost entirely
overhead of loading the module from cache which involves decompressing
the module from disk and additionally sloshing bytes around. More
efficient storage mediums exist for Wasmtime modules which means that it
would actually be pretty easy to shave off a good chunk of time from
that. Additionally Wasmtime has a custom C API which significantly
differs from the one used in this repository and which would also be
significantly faster for calling into the host from wasm. Of the current
~3ms runtime in wasm itself that could probably be reduced further with
more optimized calls.

Overall this seems like pretty good progress made on Wasmtime in the
interim since all my initial work in dtolnay#2. In any case I wanted to post
this to get the `WATT_JIT` feature at least working again since
otherwise it's segfaulting right now, and perhaps in the future if
necessary more perf work can be done!