Skip to content

[WasmFS] Support --embed-file #16028

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 15 commits into from
Closed

[WasmFS] Support --embed-file #16028

wants to merge 15 commits into from

Conversation

kripken
Copy link
Member

@kripken kripken commented Jan 14, 2022

This supports file embedding in a way that we couldn't do with the JS
filesystem: we emit a C file that contains the data + writes it into the FS,
and the WasmFS code calls that during startup. That is, the data is
truly embedded into the wasm file. This is much better than having it
be in JS where it is encoded as text etc.

The specific mechanism here is that wasmfs looks at a weak symbol
__wasmfs_load_embedded. If that is present we call it. The C code
emitted by the file packager defines that code. It contains something
like this:

#include <stdio.h>
#include <sys/stat.h>

void __wasmfs_load_embedded() {
  static const char fileData0[] = "\x23\x69\x6e\x63";
  FILE* file0 = fopen("/" "/" "a.cpp", "wb");
  fwrite(fileData0, 1, 204, file0);
  fclose(file0);
}

(Perhaps this could use internal WasmFS methods for even better
efficiency?)

Aside from the feature itself being useful, this also unblocks a bunch of
interesting tests that depend on file access.

The new package_files() function is needed in emcc.py because now
we call the file packager from 2 places.

Fixes #16014

@kripken kripken requested review from sbc100 and tlively January 14, 2022 18:44
Copy link
Collaborator

@sbc100 sbc100 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love the idea... but not the running of the compiler at link time. I think we can find an alternative way to go from data to object file.. looking up options now.

with open(temp_c, 'w') as f:
f.write(package_files(options))
shared.check_call([shared.CLANG_CC, temp_c, '-o', temp_o, '-c'] + get_cflags([]))
linker_arguments.append(temp_o)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not keen on adding more codegen or calls to clang during link time.

I've been hoping we can remove all compilation from the link phase at some point.

I wonder what alternatives we might be able to some up with? I think there are various ways to create object files from data files that don't involve running the compiler. @dschuff do you remember what different options we have?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like objcopy is one option, but I'm not sure llvm-objcopy is advanced enough yet.

Maybe we can use assembly source code here and call llvm-mc to at least avoid calling the C compiler? I think I'd be OK with that if it try that llvm-objcopy is not mature enough yet.

Also, how about moving all of this logic into file_packager.py so that all of the logic for generated the object file is encapsulated there? That way folks who want to run file_packager.py themselves can generate and object file to feed to the linker, avoiding this "in-linker" compilation step.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the issue that running the compiler again at link time is expected to be too slow?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case its slower than I think it needs to be yes, but the biggest issue for me is that it breaks the normal layering of a toolchain. I know emcc is historically very magical at link time, so there is precedent, but I'm trying to make it less so where possible.

I think if the logic is moved out to file_packager.py I might feel a little less of a violation since this function could remain unchanged.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving it to file_packager would be awkward as we'd need to also pass it the relevant cflags. That script is very simple and standalone and not integrated with the main python logic that knows about how emcc compiles or even compilation at all.

But @sbc100 if you want to investigate other options than compiling during the link step that sgtm, I can wait on this PR. I agree this adds a bunch of magic that ideally we'd do without (but I don't see how).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the old FS I guess we can also embed in a similar way, but I'm not sure how you'd get it to run at the same time as it currently does from the old call from JS. There might be code size and other issues too. Overall I think trying to get that working would be work for little gain? Seems simpler to just not do it and then when we remove the old FS it would be gone.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd kind of like to do it .. since it would avoid bifurcating the methodology for a potentially unbounded amount of time (since it really not clear now long it could take to deprecate the old FS). Plus, we would have a great set of tests for it (sine embed file is well tests with the old filesystem).

Regarding, "There might be code size and other issues too".. wouldn't all those issue apply to wasmfs too?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if the issues would affect wasmfs too. Wasmfs is simpler since it's all in wasm. But maybe you have an idea in mind for a good way to do it that avoids issues?

I'm not in a rush to land this, so if you want to look into that feel free to. I'll convert this to a draft.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, one specific issue is that with wasmfs we can use internal APIs to embed the file - I have a TODO for that in this PR. That's something we can't do in the old FS. I guess we could EM_ASM out to JS perhaps there..?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got inspired and create a bunch of PRs that cleanup file_packager.py and then one final one that implements preloading via wasm memory: #16050. I believe this PR will work with both the old FS code and with wasmfs (assuming wasmfs supports FS_createDataFile). We do followup by removing the dependency on FS_createDataFile perhaps but I think its a good way to introduce wasm-embedded files without any other changes.

@@ -2620,7 +2620,7 @@ def compile_source_file(i, input_file):


@ToolchainProfiler.profile_block('calculate system libraries')
def phase_calculate_system_libraries(state, linker_arguments, linker_inputs, newargs):
def phase_calculate_system_libraries(state, linker_arguments, linker_inputs, newargs, options):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to be specific and just pass embed_files I think.

@@ -218,6 +224,9 @@ def main():
elif arg == '--no-node':
support_node = False
leading = ''
elif arg == '--wasmfs-c':
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than adding two different modes I wonder if we can instead just move all embedding to C/C++ code, I don't see why it shouldn't work for the JS filesystem too.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That way perhaps we could delete the old JS codegen from here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a trade off that the Wasm memory may be more valuable than JS heap memory and it also can't be garbage collected. I'm not sure whether people are preloading file large enough that it would matter, though.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are talking about embedding only here, and not preloading. In that case, we are comparing embedded binaries in JS as base64 strings with wasm memory. I doubt those strings ever get collected.. and they are a lot larger than the equivalent in wasm memory because they are base64.

The other big advantage of using wasm memory is that it opens the door for zero-copy files (at least in read-only circumstances)... even for the old FS I think this is do-able.

@kripken kripken marked this pull request as draft January 14, 2022 22:53
c_output += f'''static const char fileData{counter}[] = "{data}";\n'''
c_output += f'''
FILE* file{counter} = fopen("{dirname}" "/" "{basename}", "wb");
fwrite(fileData{counter}, 1, {len(binary)}, file{counter});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm not mistaken, doing it this way causes the file data to be loaded into memory as part of the data section, and then copied into different memory (assuming this is a memory-backed FS), which seems unfortunate. Here are a couple of crazy ideas about how to avoid this:

  1. Somehow put the data into a passive segment, and then drop it after copying. I guess that would probably have to involve assembly or compiler builtins, but might be worth it.
  2. Put the data into a custom section instead of the data section. This has the downside that we'd have to go out to JS to get it (and I guess copy it temporarily into wasm memory during initialization). Also I'm not actually clear whether the VM would retain the custom section in its memory or not...

I'm not familiar with how this was done before. Did the old way of embedding avoid this problem somehow? Could we get the data into a JS arraybuffer which could eventually be dropped and GC'ed. I assume that downloading the data from the server rather than having it be embedded isn't an option?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this has overhead - a copy, as you said.

The old method copies the data from JS base64 encoded text during startup, so is even worse...

A passive segment or a custom section is an interesting idea. But I was actually hoping we can use an internal WasmFS API, basically giving it a pointer and telling it that it "owns" the data. Then the data would simply stay where it is. That's a little odd if it's written to (modifying "static" data) but it is as efficient as it gets, and for this purpose it is safe since the data corresponds 1:1 to a file. But maybe there's a problem with that idea?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need any kind of passive segment or special segment do we? Can't we just use an internal API and pass a pointer and size?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right, avoiding copying at all would be ideal. I guess that would work up until there's any kind of append, then you'd have to allocate more space and copy the whole thing. But for read-only it would be great.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, and wrt memory overhead: The data section is part of the wasm file (some of which the engine keeps... somewhere? maybe in memory?), and it's obviously initialized into wasm memory in the data area, and then into wherever it's copied if applicable. Coming from JS, it would be in a JS file (which... again, we don't really know whether the VM keeps that), and once wherever it was copied if applicable (but no extra copy in the wasm memory). So it could actually be less overhead to have it come from JS if we aren't able to re-use the static data as the single copy (e.g. if we think it will be written, or if we are writing into a persistent FS rather than a memory FS).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should probably consider writable preloaded files to be out of scope, no? I can't imagine many folks wanting that... and if they did then it seems reasonable to make a copy of first write.. but aside from that I think a single copy if wasm memory sounds like a good goal.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I was actually hoping we can use an internal WasmFS API, basically giving it a pointer and telling it that it "owns" the data. Then the data would simply stay where it is.

It would be pretty easy to have a StaticMemoryFile that just contains a pointer to some static memory in the data section. It could easily be mutable, too, just like any non-const global byte array in C. Growing it could allocate the additional data in a vector without copying the original data; there's no need for the data to be contiguous.

Regarding the delivery mechanism; all our data segments (both passive and active) are dropped on startup, either manually in the case of passive segments or automatically in the case of active segments. So there shouldn't be any extra copies of the data sitting around. Using custom sections, OTOH, would require an extra copy to be kept somewhere on the JS heap since the Wasm API exposes their contents.

@kripken
Copy link
Member Author

kripken commented Jan 25, 2022

Closing as this is being fixed by other means.

@kripken kripken closed this Jan 25, 2022
@kripken kripken deleted the wfembed branch January 25, 2022 01:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

WasmFS: embed-file support
4 participants