-
Notifications
You must be signed in to change notification settings - Fork 3.4k
[WasmFS] Support --embed-file #16028
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
8cf3646
0d7e672
b7117e9
77fb351
5c97e00
dfd4c72
9ea79c9
8f44c3a
c9857e7
43af496
299a4fe
1ff958f
8ecce7b
7b65d4a
832b173
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1135,7 +1135,7 @@ def run(args): | |
logger.debug('stopping after linking to object file') | ||
return 0 | ||
|
||
phase_calculate_system_libraries(state, linker_arguments, linker_inputs, newargs) | ||
phase_calculate_system_libraries(state, linker_arguments, linker_inputs, newargs, options) | ||
|
||
phase_link(linker_arguments, wasm_target) | ||
|
||
|
@@ -2620,7 +2620,7 @@ def compile_source_file(i, input_file): | |
|
||
|
||
@ToolchainProfiler.profile_block('calculate system libraries') | ||
def phase_calculate_system_libraries(state, linker_arguments, linker_inputs, newargs): | ||
def phase_calculate_system_libraries(state, linker_arguments, linker_inputs, newargs, options): | ||
extra_files_to_link = [] | ||
# link in ports and system libraries, if necessary | ||
if not settings.SIDE_MODULE: | ||
|
@@ -2630,6 +2630,19 @@ def phase_calculate_system_libraries(state, linker_arguments, linker_inputs, new | |
extra_files_to_link += system_libs.calculate(all_linker_inputs, newargs, forced=state.forced_stdlibs) | ||
linker_arguments.extend(extra_files_to_link) | ||
|
||
if settings.WASMFS and options.embed_files: | ||
# wasmfs file embedding is done via emitting C code that contains the data | ||
# and code to set them up. we add that as another input, like a system | ||
# library, that we compile on the fly here | ||
temp_files = shared.configuration.get_temp_files() | ||
temp_c = temp_files.get(suffix='.c').name | ||
temp_o = unsuffixed(temp_c) + '.o' | ||
temp_files.note(temp_o) | ||
with open(temp_c, 'w') as f: | ||
f.write(package_files(options)) | ||
shared.check_call([shared.CLANG_CC, temp_c, '-o', temp_o, '-c'] + get_cflags([])) | ||
linker_arguments.append(temp_o) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not keen on adding more codegen or calls to clang during link time. I've been hoping we can remove all compilation from the link phase at some point. I wonder what alternatives we might be able to some up with? I think there are various ways to create object files from data files that don't involve running the compiler. @dschuff do you remember what different options we have? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It looks like Maybe we can use assembly source code here and call Also, how about moving all of this logic into There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is the issue that running the compiler again at link time is expected to be too slow? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In this case its slower than I think it needs to be yes, but the biggest issue for me is that it breaks the normal layering of a toolchain. I know emcc is historically very magical at link time, so there is precedent, but I'm trying to make it less so where possible. I think if the logic is moved out to There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Moving it to file_packager would be awkward as we'd need to also pass it the relevant cflags. That script is very simple and standalone and not integrated with the main python logic that knows about how emcc compiles or even compilation at all. But @sbc100 if you want to investigate other options than compiling during the link step that sgtm, I can wait on this PR. I agree this adds a bunch of magic that ideally we'd do without (but I don't see how). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For the old FS I guess we can also embed in a similar way, but I'm not sure how you'd get it to run at the same time as it currently does from the old call from JS. There might be code size and other issues too. Overall I think trying to get that working would be work for little gain? Seems simpler to just not do it and then when we remove the old FS it would be gone. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'd kind of like to do it .. since it would avoid bifurcating the methodology for a potentially unbounded amount of time (since it really not clear now long it could take to deprecate the old FS). Plus, we would have a great set of tests for it (sine embed file is well tests with the old filesystem). Regarding, "There might be code size and other issues too".. wouldn't all those issue apply to wasmfs too? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not sure if the issues would affect wasmfs too. Wasmfs is simpler since it's all in wasm. But maybe you have an idea in mind for a good way to do it that avoids issues? I'm not in a rush to land this, so if you want to look into that feel free to. I'll convert this to a draft. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Oh, one specific issue is that with wasmfs we can use internal APIs to embed the file - I have a TODO for that in this PR. That's something we can't do in the old FS. I guess we could EM_ASM out to JS perhaps there..? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I got inspired and create a bunch of PRs that cleanup file_packager.py and then one final one that implements preloading via wasm memory: #16050. I believe this PR will work with both the old FS code and with wasmfs (assuming wasmfs supports FS_createDataFile). We do followup by removing the dependency on |
||
|
||
|
||
@ToolchainProfiler.profile_block('link') | ||
def phase_link(linker_arguments, wasm_target): | ||
|
@@ -2707,33 +2720,43 @@ def phase_emscript(options, in_wasm, wasm_target, memfile): | |
save_intermediate('original') | ||
|
||
|
||
def package_files(options, target='default'): | ||
logger.debug('setting up files') | ||
file_args = ['--from-emcc', '--export-name=' + settings.EXPORT_NAME] | ||
if options.preload_files: | ||
file_args.append('--preload') | ||
file_args += options.preload_files | ||
if options.embed_files: | ||
file_args.append('--embed') | ||
file_args += options.embed_files | ||
if options.exclude_files: | ||
file_args.append('--exclude') | ||
file_args += options.exclude_files | ||
if options.use_preload_cache: | ||
file_args.append('--use-preload-cache') | ||
if settings.LZ4: | ||
file_args.append('--lz4') | ||
if options.use_preload_plugins: | ||
file_args.append('--use-preload-plugins') | ||
if not settings.ENVIRONMENT_MAY_BE_NODE: | ||
file_args.append('--no-node') | ||
wasmfs_c = settings.WASMFS and options.embed_files | ||
if wasmfs_c: | ||
file_args += ['--wasmfs-c'] | ||
file_code = shared.check_call([shared.FILE_PACKAGER, shared.replace_suffix(target, '.data')] + file_args, stdout=PIPE).stdout | ||
if wasmfs_c: | ||
return file_code | ||
else: | ||
options.pre_js = js_manipulation.add_files_pre_js(options.pre_js, file_code) | ||
|
||
|
||
@ToolchainProfiler.profile_block('source transforms') | ||
def phase_source_transforms(options, target): | ||
global final_js | ||
|
||
# Embed and preload files | ||
if len(options.preload_files) or len(options.embed_files): | ||
logger.debug('setting up files') | ||
file_args = ['--from-emcc', '--export-name=' + settings.EXPORT_NAME] | ||
if len(options.preload_files): | ||
file_args.append('--preload') | ||
file_args += options.preload_files | ||
if len(options.embed_files): | ||
file_args.append('--embed') | ||
file_args += options.embed_files | ||
if len(options.exclude_files): | ||
file_args.append('--exclude') | ||
file_args += options.exclude_files | ||
if options.use_preload_cache: | ||
file_args.append('--use-preload-cache') | ||
if settings.LZ4: | ||
file_args.append('--lz4') | ||
if options.use_preload_plugins: | ||
file_args.append('--use-preload-plugins') | ||
if not settings.ENVIRONMENT_MAY_BE_NODE: | ||
file_args.append('--no-node') | ||
file_code = shared.check_call([shared.FILE_PACKAGER, shared.replace_suffix(target, '.data')] + file_args, stdout=PIPE).stdout | ||
options.pre_js = js_manipulation.add_files_pre_js(options.pre_js, file_code) | ||
package_files(options, target) | ||
|
||
# Apply pre and postjs files | ||
if final_js and (options.pre_js or options.post_js): | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -21,7 +21,7 @@ | |
|
||
Usage: | ||
|
||
file_packager TARGET [--preload A [B..]] [--embed C [D..]] [--exclude E [F..]]] [--js-output=OUTPUT.js] [--no-force] [--use-preload-cache] [--indexedDB-name=EM_PRELOAD_CACHE] [--separate-metadata] [--lz4] [--use-preload-plugins] [--no-node] | ||
file_packager TARGET [--preload A [B..]] [--embed C [D..]] [--exclude E [F..]]] [--js-output=OUTPUT.js] [--no-force] [--use-preload-cache] [--indexedDB-name=EM_PRELOAD_CACHE] [--separate-metadata] [--lz4] [--use-preload-plugins] [--no-node] [--wasmfs-c] | ||
|
||
--preload , | ||
--embed See emcc --help for more details on those options. | ||
|
@@ -51,6 +51,11 @@ | |
|
||
--no-node Whether to support Node.js. By default we do, which emits some extra code. | ||
|
||
--wasmfs-c Whether to emit C code for wasmfs. This only supports embedding | ||
(as it literally embeds the data in the C). If you prefer | ||
preloading, you can use that normally and wasmfs will interact | ||
with the JS normally. | ||
|
||
Notes: | ||
|
||
* The file packager generates unix-style file paths. So if you are on windows and a file is accessed at | ||
|
@@ -185,6 +190,7 @@ def main(): | |
lz4 = False | ||
use_preload_plugins = False | ||
support_node = True | ||
wasmfs_c = False | ||
|
||
for arg in sys.argv[2:]: | ||
if arg == '--preload': | ||
|
@@ -218,6 +224,9 @@ def main(): | |
elif arg == '--no-node': | ||
support_node = False | ||
leading = '' | ||
elif arg == '--wasmfs-c': | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Rather than adding two different modes I wonder if we can instead just move all embedding to C/C++ code, I don't see why it shouldn't work for the JS filesystem too. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That way perhaps we could delete the old JS codegen from here? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There is a trade off that the Wasm memory may be more valuable than JS heap memory and it also can't be garbage collected. I'm not sure whether people are preloading file large enough that it would matter, though. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We are talking about embedding only here, and not preloading. In that case, we are comparing embedded binaries in JS as base64 strings with wasm memory. I doubt those strings ever get collected.. and they are a lot larger than the equivalent in wasm memory because they are base64. The other big advantage of using wasm memory is that it opens the door for zero-copy files (at least in read-only circumstances)... even for the old FS I think this is do-able. |
||
wasmfs_c = True | ||
leading = '' | ||
elif arg.startswith('--js-output'): | ||
jsoutput = arg.split('=', 1)[1] if '=' in arg else None | ||
leading = '' | ||
|
@@ -380,6 +389,8 @@ def was_seen(name): | |
|
||
metadata = {'files': []} | ||
|
||
c_output = '' | ||
|
||
# Set up folders | ||
partial_dirs = [] | ||
for file_ in data_files: | ||
|
@@ -392,6 +403,7 @@ def was_seen(name): | |
if partial not in partial_dirs: | ||
code += ('''Module['FS_createPath'](%s, %s, true, true);\n''' | ||
% (json.dumps('/' + '/'.join(parts[:i])), json.dumps(parts[i]))) | ||
c_output += f'mkdir("{partial}", 0700);\n' | ||
partial_dirs.append(partial) | ||
|
||
if has_preloaded: | ||
|
@@ -475,13 +487,35 @@ def was_seen(name): | |
basename = os.path.basename(filename) | ||
if file_['mode'] == 'embed': | ||
# Embed | ||
data = base64_encode(utils.read_binary(file_['srcpath'])) | ||
code += '''var fileData%d = '%s';\n''' % (counter, data) | ||
code += ('''Module['FS_createDataFile']('%s', '%s', decodeBase64(fileData%d), true, true, false);\n''' | ||
% (dirname, basename, counter)) | ||
binary = utils.read_binary(file_['srcpath']) | ||
if not wasmfs_c: | ||
# JS output | ||
data = base64_encode(binary) | ||
code += '''var fileData%d = '%s';\n''' % (counter, data) | ||
code += ('''Module['FS_createDataFile']('%s', '%s', decodeBase64(fileData%d), true, true, false);\n''' | ||
% (dirname, basename, counter)) | ||
else: | ||
# C output. | ||
# convert the binary data into a C escaped string, \xAB for hex code AB | ||
def escape_for_c(char): | ||
if char < 16: | ||
# add a 0 to keep the output in two bytes | ||
return '\\x0' + hex(char)[-1:] | ||
return '\\x' + hex(char)[-2:] | ||
data = ''.join([escape_for_c(char) for char in binary]) | ||
# directories... :( make them | ||
c_output += f'''static const char fileData{counter}[] = "{data}";\n''' | ||
c_output += f''' | ||
FILE* file{counter} = fopen("{dirname}" "/" "{basename}", "wb"); | ||
fwrite(fileData{counter}, 1, {len(binary)}, file{counter}); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If I'm not mistaken, doing it this way causes the file data to be loaded into memory as part of the data section, and then copied into different memory (assuming this is a memory-backed FS), which seems unfortunate. Here are a couple of crazy ideas about how to avoid this:
I'm not familiar with how this was done before. Did the old way of embedding avoid this problem somehow? Could we get the data into a JS arraybuffer which could eventually be dropped and GC'ed. I assume that downloading the data from the server rather than having it be embedded isn't an option? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, this has overhead - a copy, as you said. The old method copies the data from JS base64 encoded text during startup, so is even worse... A passive segment or a custom section is an interesting idea. But I was actually hoping we can use an internal WasmFS API, basically giving it a pointer and telling it that it "owns" the data. Then the data would simply stay where it is. That's a little odd if it's written to (modifying "static" data) but it is as efficient as it gets, and for this purpose it is safe since the data corresponds 1:1 to a file. But maybe there's a problem with that idea? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't think we need any kind of passive segment or special segment do we? Can't we just use an internal API and pass a pointer and size? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah right, avoiding copying at all would be ideal. I guess that would work up until there's any kind of append, then you'd have to allocate more space and copy the whole thing. But for read-only it would be great. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Oh, and wrt memory overhead: The data section is part of the wasm file (some of which the engine keeps... somewhere? maybe in memory?), and it's obviously initialized into wasm memory in the data area, and then into wherever it's copied if applicable. Coming from JS, it would be in a JS file (which... again, we don't really know whether the VM keeps that), and once wherever it was copied if applicable (but no extra copy in the wasm memory). So it could actually be less overhead to have it come from JS if we aren't able to re-use the static data as the single copy (e.g. if we think it will be written, or if we are writing into a persistent FS rather than a memory FS). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think we should probably consider writable preloaded files to be out of scope, no? I can't imagine many folks wanting that... and if they did then it seems reasonable to make a copy of first write.. but aside from that I think a single copy if wasm memory sounds like a good goal. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
It would be pretty easy to have a Regarding the delivery mechanism; all our data segments (both passive and active) are dropped on startup, either manually in the case of passive segments or automatically in the case of active segments. So there shouldn't be any extra copies of the data sitting around. Using custom sections, OTOH, would require an extra copy to be kept somewhere on the JS heap since the Wasm API exposes their contents. |
||
fclose(file{counter}); | ||
''' | ||
|
||
counter += 1 | ||
elif file_['mode'] == 'preload': | ||
# Preload | ||
assert not wasmfs_c, 'wasmfs-c mode only supports embedding' | ||
|
||
counter += 1 | ||
|
||
metadata_el = { | ||
|
@@ -920,6 +954,18 @@ def was_seen(name): | |
})(); | ||
''' % _metadata_template | ||
|
||
if wasmfs_c: | ||
ret = r''' | ||
#include <stdio.h> | ||
#include <sys/stat.h> | ||
|
||
void __wasmfs_load_embedded() { | ||
|
||
%s | ||
|
||
} | ||
''' % c_output | ||
|
||
if force or len(data_files): | ||
if jsoutput is None: | ||
print(ret) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to be specific and just pass embed_files I think.