[WasmFS] Support --embed-file #16028

kripken · 2022-01-14T18:44:41Z

This supports file embedding in a way that we couldn't do with the JS
filesystem: we emit a C file that contains the data + writes it into the FS,
and the WasmFS code calls that during startup. That is, the data is
truly embedded into the wasm file. This is much better than having it
be in JS where it is encoded as text etc.

The specific mechanism here is that wasmfs looks at a weak symbol
__wasmfs_load_embedded. If that is present we call it. The C code
emitted by the file packager defines that code. It contains something
like this:

#include <stdio.h>
#include <sys/stat.h>

void __wasmfs_load_embedded() {
  static const char fileData0[] = "\x23\x69\x6e\x63";
  FILE* file0 = fopen("/" "/" "a.cpp", "wb");
  fwrite(fileData0, 1, 204, file0);
  fclose(file0);
}

(Perhaps this could use internal WasmFS methods for even better
efficiency?)

Aside from the feature itself being useful, this also unblocks a bunch of
interesting tests that depend on file access.

The new package_files() function is needed in emcc.py because now
we call the file packager from 2 places.

Fixes #16014

sbc100

Love the idea... but not the running of the compiler at link time. I think we can find an alternative way to go from data to object file.. looking up options now.

sbc100 · 2022-01-14T18:59:47Z

emcc.py

+    with open(temp_c, 'w') as f:
+      f.write(package_files(options))
+    shared.check_call([shared.CLANG_CC, temp_c, '-o', temp_o, '-c'] + get_cflags([]))
+    linker_arguments.append(temp_o)


I'm not keen on adding more codegen or calls to clang during link time.

I've been hoping we can remove all compilation from the link phase at some point.

I wonder what alternatives we might be able to some up with? I think there are various ways to create object files from data files that don't involve running the compiler. @dschuff do you remember what different options we have?

It looks like objcopy is one option, but I'm not sure llvm-objcopy is advanced enough yet.

Maybe we can use assembly source code here and call llvm-mc to at least avoid calling the C compiler? I think I'd be OK with that if it try that llvm-objcopy is not mature enough yet.

Also, how about moving all of this logic into file_packager.py so that all of the logic for generated the object file is encapsulated there? That way folks who want to run file_packager.py themselves can generate and object file to feed to the linker, avoiding this "in-linker" compilation step.

Is the issue that running the compiler again at link time is expected to be too slow?

In this case its slower than I think it needs to be yes, but the biggest issue for me is that it breaks the normal layering of a toolchain. I know emcc is historically very magical at link time, so there is precedent, but I'm trying to make it less so where possible.

I think if the logic is moved out to file_packager.py I might feel a little less of a violation since this function could remain unchanged.

Moving it to file_packager would be awkward as we'd need to also pass it the relevant cflags. That script is very simple and standalone and not integrated with the main python logic that knows about how emcc compiles or even compilation at all.

But @sbc100 if you want to investigate other options than compiling during the link step that sgtm, I can wait on this PR. I agree this adds a bunch of magic that ideally we'd do without (but I don't see how).

For the old FS I guess we can also embed in a similar way, but I'm not sure how you'd get it to run at the same time as it currently does from the old call from JS. There might be code size and other issues too. Overall I think trying to get that working would be work for little gain? Seems simpler to just not do it and then when we remove the old FS it would be gone.

I'd kind of like to do it .. since it would avoid bifurcating the methodology for a potentially unbounded amount of time (since it really not clear now long it could take to deprecate the old FS). Plus, we would have a great set of tests for it (sine embed file is well tests with the old filesystem).

Regarding, "There might be code size and other issues too".. wouldn't all those issue apply to wasmfs too?

I'm not sure if the issues would affect wasmfs too. Wasmfs is simpler since it's all in wasm. But maybe you have an idea in mind for a good way to do it that avoids issues?

I'm not in a rush to land this, so if you want to look into that feel free to. I'll convert this to a draft.

Oh, one specific issue is that with wasmfs we can use internal APIs to embed the file - I have a TODO for that in this PR. That's something we can't do in the old FS. I guess we could EM_ASM out to JS perhaps there..?

I got inspired and create a bunch of PRs that cleanup file_packager.py and then one final one that implements preloading via wasm memory: #16050. I believe this PR will work with both the old FS code and with wasmfs (assuming wasmfs supports FS_createDataFile). We do followup by removing the dependency on FS_createDataFile perhaps but I think its a good way to introduce wasm-embedded files without any other changes.

sbc100 · 2022-01-14T19:00:49Z

emcc.py

@@ -2620,7 +2620,7 @@ def compile_source_file(i, input_file):


 @ToolchainProfiler.profile_block('calculate system libraries')
-def phase_calculate_system_libraries(state, linker_arguments, linker_inputs, newargs):
+def phase_calculate_system_libraries(state, linker_arguments, linker_inputs, newargs, options):


Better to be specific and just pass embed_files I think.

sbc100 · 2022-01-14T19:02:59Z

tools/file_packager.py

@@ -218,6 +224,9 @@ def main():
    elif arg == '--no-node':
      support_node = False
      leading = ''
+    elif arg == '--wasmfs-c':


Rather than adding two different modes I wonder if we can instead just move all embedding to C/C++ code, I don't see why it shouldn't work for the JS filesystem too.

That way perhaps we could delete the old JS codegen from here?

There is a trade off that the Wasm memory may be more valuable than JS heap memory and it also can't be garbage collected. I'm not sure whether people are preloading file large enough that it would matter, though.

We are talking about embedding only here, and not preloading. In that case, we are comparing embedded binaries in JS as base64 strings with wasm memory. I doubt those strings ever get collected.. and they are a lot larger than the equivalent in wasm memory because they are base64.

The other big advantage of using wasm memory is that it opens the door for zero-copy files (at least in read-only circumstances)... even for the old FS I think this is do-able.

dschuff · 2022-01-14T23:28:11Z

tools/file_packager.py

+        c_output += f'''static const char fileData{counter}[] = "{data}";\n'''
+        c_output += f'''
+FILE* file{counter} = fopen("{dirname}" "/" "{basename}", "wb");
+fwrite(fileData{counter}, 1, {len(binary)}, file{counter});


If I'm not mistaken, doing it this way causes the file data to be loaded into memory as part of the data section, and then copied into different memory (assuming this is a memory-backed FS), which seems unfortunate. Here are a couple of crazy ideas about how to avoid this:

Somehow put the data into a passive segment, and then drop it after copying. I guess that would probably have to involve assembly or compiler builtins, but might be worth it.

Put the data into a custom section instead of the data section. This has the downside that we'd have to go out to JS to get it (and I guess copy it temporarily into wasm memory during initialization). Also I'm not actually clear whether the VM would retain the custom section in its memory or not...

I'm not familiar with how this was done before. Did the old way of embedding avoid this problem somehow? Could we get the data into a JS arraybuffer which could eventually be dropped and GC'ed. I assume that downloading the data from the server rather than having it be embedded isn't an option?

Yes, this has overhead - a copy, as you said.

The old method copies the data from JS base64 encoded text during startup, so is even worse...

A passive segment or a custom section is an interesting idea. But I was actually hoping we can use an internal WasmFS API, basically giving it a pointer and telling it that it "owns" the data. Then the data would simply stay where it is. That's a little odd if it's written to (modifying "static" data) but it is as efficient as it gets, and for this purpose it is safe since the data corresponds 1:1 to a file. But maybe there's a problem with that idea?

I don't think we need any kind of passive segment or special segment do we? Can't we just use an internal API and pass a pointer and size?

Ah right, avoiding copying at all would be ideal. I guess that would work up until there's any kind of append, then you'd have to allocate more space and copy the whole thing. But for read-only it would be great.

Oh, and wrt memory overhead: The data section is part of the wasm file (some of which the engine keeps... somewhere? maybe in memory?), and it's obviously initialized into wasm memory in the data area, and then into wherever it's copied if applicable. Coming from JS, it would be in a JS file (which... again, we don't really know whether the VM keeps that), and once wherever it was copied if applicable (but no extra copy in the wasm memory). So it could actually be less overhead to have it come from JS if we aren't able to re-use the static data as the single copy (e.g. if we think it will be written, or if we are writing into a persistent FS rather than a memory FS).

I think we should probably consider writable preloaded files to be out of scope, no? I can't imagine many folks wanting that... and if they did then it seems reasonable to make a copy of first write.. but aside from that I think a single copy if wasm memory sounds like a good goal.

But I was actually hoping we can use an internal WasmFS API, basically giving it a pointer and telling it that it "owns" the data. Then the data would simply stay where it is.

It would be pretty easy to have a StaticMemoryFile that just contains a pointer to some static memory in the data section. It could easily be mutable, too, just like any non-const global byte array in C. Growing it could allocate the additional data in a vector without copying the original data; there's no need for the data to be contiguous.

Regarding the delivery mechanism; all our data segments (both passive and active) are dropped on startup, either manually in the case of passive segments or automatically in the case of active segments. So there shouldn't be any extra copies of the data sitting around. Using custom sections, OTOH, would require an extra copy to be kept somewhere on the JS heap since the Wasm API exposes their contents.

kripken · 2022-01-25T01:02:03Z

Closing as this is being fixed by other means.

kripken added 14 commits January 12, 2022 17:07

start [ci skip]

8cf3646

Merge remote-tracking branch 'origin/main' into wfembed

0d7e672

work [ci skip]

b7117e9

[ci skip]

77fb351

acks [ci skip]

5c97e00

why not link [ci skip]

dfd4c72

works [ci skip]

9ea79c9

fixes [ci skip]

8f44c3a

Merge remote-tracking branch 'origin/main' into wfembed

c9857e7

better

43af496

test

299a4fe

cleanup

1ff958f

fix hack

8ecce7b

remove hack

7b65d4a

kripken requested review from sbc100 and tlively January 14, 2022 18:44

flake8

832b173

sbc100 reviewed Jan 14, 2022

View reviewed changes

kripken marked this pull request as draft January 14, 2022 22:53

dschuff reviewed Jan 14, 2022

View reviewed changes

sbc100 mentioned this pull request Jan 17, 2022

file_packager.py: Add option to embed file data in wasm binary #16050

Merged

kripken closed this Jan 25, 2022

kripken deleted the wfembed branch January 25, 2022 01:02

[WasmFS] Support --embed-file #16028

[WasmFS] Support --embed-file #16028

Uh oh!

Conversation

kripken commented Jan 14, 2022

Uh oh!

sbc100 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kripken commented Jan 25, 2022

Uh oh!

Uh oh!