Skip to content

[WasmFS] Support --embed-file #16028

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 15 commits into from
69 changes: 46 additions & 23 deletions emcc.py
Original file line number Diff line number Diff line change
Expand Up @@ -1135,7 +1135,7 @@ def run(args):
logger.debug('stopping after linking to object file')
return 0

phase_calculate_system_libraries(state, linker_arguments, linker_inputs, newargs)
phase_calculate_system_libraries(state, linker_arguments, linker_inputs, newargs, options)

phase_link(linker_arguments, wasm_target)

Expand Down Expand Up @@ -2620,7 +2620,7 @@ def compile_source_file(i, input_file):


@ToolchainProfiler.profile_block('calculate system libraries')
def phase_calculate_system_libraries(state, linker_arguments, linker_inputs, newargs):
def phase_calculate_system_libraries(state, linker_arguments, linker_inputs, newargs, options):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to be specific and just pass embed_files I think.

extra_files_to_link = []
# link in ports and system libraries, if necessary
if not settings.SIDE_MODULE:
Expand All @@ -2630,6 +2630,19 @@ def phase_calculate_system_libraries(state, linker_arguments, linker_inputs, new
extra_files_to_link += system_libs.calculate(all_linker_inputs, newargs, forced=state.forced_stdlibs)
linker_arguments.extend(extra_files_to_link)

if settings.WASMFS and options.embed_files:
# wasmfs file embedding is done via emitting C code that contains the data
# and code to set them up. we add that as another input, like a system
# library, that we compile on the fly here
temp_files = shared.configuration.get_temp_files()
temp_c = temp_files.get(suffix='.c').name
temp_o = unsuffixed(temp_c) + '.o'
temp_files.note(temp_o)
with open(temp_c, 'w') as f:
f.write(package_files(options))
shared.check_call([shared.CLANG_CC, temp_c, '-o', temp_o, '-c'] + get_cflags([]))
linker_arguments.append(temp_o)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not keen on adding more codegen or calls to clang during link time.

I've been hoping we can remove all compilation from the link phase at some point.

I wonder what alternatives we might be able to some up with? I think there are various ways to create object files from data files that don't involve running the compiler. @dschuff do you remember what different options we have?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like objcopy is one option, but I'm not sure llvm-objcopy is advanced enough yet.

Maybe we can use assembly source code here and call llvm-mc to at least avoid calling the C compiler? I think I'd be OK with that if it try that llvm-objcopy is not mature enough yet.

Also, how about moving all of this logic into file_packager.py so that all of the logic for generated the object file is encapsulated there? That way folks who want to run file_packager.py themselves can generate and object file to feed to the linker, avoiding this "in-linker" compilation step.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the issue that running the compiler again at link time is expected to be too slow?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case its slower than I think it needs to be yes, but the biggest issue for me is that it breaks the normal layering of a toolchain. I know emcc is historically very magical at link time, so there is precedent, but I'm trying to make it less so where possible.

I think if the logic is moved out to file_packager.py I might feel a little less of a violation since this function could remain unchanged.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving it to file_packager would be awkward as we'd need to also pass it the relevant cflags. That script is very simple and standalone and not integrated with the main python logic that knows about how emcc compiles or even compilation at all.

But @sbc100 if you want to investigate other options than compiling during the link step that sgtm, I can wait on this PR. I agree this adds a bunch of magic that ideally we'd do without (but I don't see how).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the old FS I guess we can also embed in a similar way, but I'm not sure how you'd get it to run at the same time as it currently does from the old call from JS. There might be code size and other issues too. Overall I think trying to get that working would be work for little gain? Seems simpler to just not do it and then when we remove the old FS it would be gone.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd kind of like to do it .. since it would avoid bifurcating the methodology for a potentially unbounded amount of time (since it really not clear now long it could take to deprecate the old FS). Plus, we would have a great set of tests for it (sine embed file is well tests with the old filesystem).

Regarding, "There might be code size and other issues too".. wouldn't all those issue apply to wasmfs too?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if the issues would affect wasmfs too. Wasmfs is simpler since it's all in wasm. But maybe you have an idea in mind for a good way to do it that avoids issues?

I'm not in a rush to land this, so if you want to look into that feel free to. I'll convert this to a draft.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, one specific issue is that with wasmfs we can use internal APIs to embed the file - I have a TODO for that in this PR. That's something we can't do in the old FS. I guess we could EM_ASM out to JS perhaps there..?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got inspired and create a bunch of PRs that cleanup file_packager.py and then one final one that implements preloading via wasm memory: #16050. I believe this PR will work with both the old FS code and with wasmfs (assuming wasmfs supports FS_createDataFile). We do followup by removing the dependency on FS_createDataFile perhaps but I think its a good way to introduce wasm-embedded files without any other changes.



@ToolchainProfiler.profile_block('link')
def phase_link(linker_arguments, wasm_target):
Expand Down Expand Up @@ -2707,33 +2720,43 @@ def phase_emscript(options, in_wasm, wasm_target, memfile):
save_intermediate('original')


def package_files(options, target='default'):
logger.debug('setting up files')
file_args = ['--from-emcc', '--export-name=' + settings.EXPORT_NAME]
if options.preload_files:
file_args.append('--preload')
file_args += options.preload_files
if options.embed_files:
file_args.append('--embed')
file_args += options.embed_files
if options.exclude_files:
file_args.append('--exclude')
file_args += options.exclude_files
if options.use_preload_cache:
file_args.append('--use-preload-cache')
if settings.LZ4:
file_args.append('--lz4')
if options.use_preload_plugins:
file_args.append('--use-preload-plugins')
if not settings.ENVIRONMENT_MAY_BE_NODE:
file_args.append('--no-node')
wasmfs_c = settings.WASMFS and options.embed_files
if wasmfs_c:
file_args += ['--wasmfs-c']
file_code = shared.check_call([shared.FILE_PACKAGER, shared.replace_suffix(target, '.data')] + file_args, stdout=PIPE).stdout
if wasmfs_c:
return file_code
else:
options.pre_js = js_manipulation.add_files_pre_js(options.pre_js, file_code)


@ToolchainProfiler.profile_block('source transforms')
def phase_source_transforms(options, target):
global final_js

# Embed and preload files
if len(options.preload_files) or len(options.embed_files):
logger.debug('setting up files')
file_args = ['--from-emcc', '--export-name=' + settings.EXPORT_NAME]
if len(options.preload_files):
file_args.append('--preload')
file_args += options.preload_files
if len(options.embed_files):
file_args.append('--embed')
file_args += options.embed_files
if len(options.exclude_files):
file_args.append('--exclude')
file_args += options.exclude_files
if options.use_preload_cache:
file_args.append('--use-preload-cache')
if settings.LZ4:
file_args.append('--lz4')
if options.use_preload_plugins:
file_args.append('--use-preload-plugins')
if not settings.ENVIRONMENT_MAY_BE_NODE:
file_args.append('--no-node')
file_code = shared.check_call([shared.FILE_PACKAGER, shared.replace_suffix(target, '.data')] + file_args, stdout=PIPE).stdout
options.pre_js = js_manipulation.add_files_pre_js(options.pre_js, file_code)
package_files(options, target)

# Apply pre and postjs files
if final_js and (options.pre_js or options.post_js):
Expand Down
2 changes: 2 additions & 0 deletions system/lib/wasmfs/file.h
Original file line number Diff line number Diff line change
Expand Up @@ -130,6 +130,8 @@ class File : public std::enable_shared_from_this<File> {
std::weak_ptr<File> parent;

// This specifies which backend a file is associated with.
// TODO: Should this be a shared_ptr? Or do we assume backends are never
// deallocated?
backend_t backend;
};

Expand Down
22 changes: 16 additions & 6 deletions system/lib/wasmfs/wasmfs.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -53,12 +53,13 @@ std::shared_ptr<Directory> WasmFS::initRootDirectory() {
return rootDirectory;
}

// Initialize files specified by the --preload-file option.
// Set up directories and files in wasmFS$preloadedDirs and
// wasmFS$preloadedFiles from JS. This function will be called before any file
// operation to ensure any preloaded files are eagerly available for use.
void WasmFS::preloadFiles() {
// Debug builds only: add check to ensure preloadFiles() is called once.
// If files are embedded in the program, then this symbol is defined. We will
// call it and it will set those files up.
__attribute__((__weak__))
extern "C" void __wasmfs_load_embedded();

void WasmFS::loadInitialFiles() {
// Debug builds only: add check to ensure loadInitialFiles() is called once.
#ifndef NDEBUG
static std::atomic<int> timesCalled;
timesCalled++;
Expand All @@ -71,6 +72,15 @@ void WasmFS::preloadFiles() {
// Ensure that files are preloaded from the main thread.
assert(emscripten_is_main_runtime_thread());

// First, handle embedded files, if there are any.
if (__wasmfs_load_embedded) {
__wasmfs_load_embedded();
}

// Handle preloaded files.
// Set up directories and files in wasmFS$preloadedDirs and
// wasmFS$preloadedFiles from JS. This function will be called before any file
// operation to ensure any preloaded files are eagerly available for use.
auto numFiles = _wasmfs_get_num_preloaded_files();
auto numDirs = _wasmfs_get_num_preloaded_dirs();

Expand Down
6 changes: 3 additions & 3 deletions system/lib/wasmfs/wasmfs.h
Original file line number Diff line number Diff line change
Expand Up @@ -33,15 +33,15 @@ class WasmFS {
// dev/stderr. Refers to the same std streams in the open file table.
std::shared_ptr<Directory> initRootDirectory();

// Initialize files specified by --preload-file option.
void preloadFiles();
// Initialize files specified by --preload-file and --embed-file options.
void loadInitialFiles();

public:
// Files will be preloaded in this constructor.
// This global constructor has init_priority 100. Please see wasmfs.cpp.
// The current working directory is initialized to the root directory.
WasmFS() : rootDirectory(initRootDirectory()), cwd(rootDirectory) {
preloadFiles();
loadInitialFiles();
}

// This get method returns a locked file table.
Expand Down
3 changes: 2 additions & 1 deletion tests/test_core.py
Original file line number Diff line number Diff line change
Expand Up @@ -528,6 +528,7 @@ def test_wasm32_unknown_emscripten(self):
# No other configuration is supported, so always run this.
self.do_runf(test_file('wasm32-unknown-emscripten.c'), '')

@also_with_wasmfs # tests file embedding
def test_cube2md5(self):
self.emcc_args += ['--embed-file', 'cube2md5.txt']
shutil.copyfile(test_file('cube2md5.txt'), 'cube2md5.txt')
Expand Down Expand Up @@ -8953,7 +8954,7 @@ def setUp(self):

simd2 = make_run('simd2', emcc_args=['-O2', '-msimd128'])
bulkmem2 = make_run('bulkmem2', emcc_args=['-O2', '-mbulk-memory'])
wasmfs = make_run('wasmfs', emcc_args=['-s', 'WASMFS'])
wasmfs = make_run('wasmfs', emcc_args=['-s', 'WASMFS', '--profiling'])

# SAFE_HEAP/STACK_OVERFLOW_CHECK
core2s = make_run('core2s', emcc_args=['-O2'], settings={'SAFE_HEAP': 1})
Expand Down
56 changes: 51 additions & 5 deletions tools/file_packager.py
Original file line number Diff line number Diff line change
Expand Up @@ -21,7 +21,7 @@

Usage:

file_packager TARGET [--preload A [B..]] [--embed C [D..]] [--exclude E [F..]]] [--js-output=OUTPUT.js] [--no-force] [--use-preload-cache] [--indexedDB-name=EM_PRELOAD_CACHE] [--separate-metadata] [--lz4] [--use-preload-plugins] [--no-node]
file_packager TARGET [--preload A [B..]] [--embed C [D..]] [--exclude E [F..]]] [--js-output=OUTPUT.js] [--no-force] [--use-preload-cache] [--indexedDB-name=EM_PRELOAD_CACHE] [--separate-metadata] [--lz4] [--use-preload-plugins] [--no-node] [--wasmfs-c]

--preload ,
--embed See emcc --help for more details on those options.
Expand Down Expand Up @@ -51,6 +51,11 @@

--no-node Whether to support Node.js. By default we do, which emits some extra code.

--wasmfs-c Whether to emit C code for wasmfs. This only supports embedding
(as it literally embeds the data in the C). If you prefer
preloading, you can use that normally and wasmfs will interact
with the JS normally.

Notes:

* The file packager generates unix-style file paths. So if you are on windows and a file is accessed at
Expand Down Expand Up @@ -185,6 +190,7 @@ def main():
lz4 = False
use_preload_plugins = False
support_node = True
wasmfs_c = False

for arg in sys.argv[2:]:
if arg == '--preload':
Expand Down Expand Up @@ -218,6 +224,9 @@ def main():
elif arg == '--no-node':
support_node = False
leading = ''
elif arg == '--wasmfs-c':
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than adding two different modes I wonder if we can instead just move all embedding to C/C++ code, I don't see why it shouldn't work for the JS filesystem too.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That way perhaps we could delete the old JS codegen from here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a trade off that the Wasm memory may be more valuable than JS heap memory and it also can't be garbage collected. I'm not sure whether people are preloading file large enough that it would matter, though.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are talking about embedding only here, and not preloading. In that case, we are comparing embedded binaries in JS as base64 strings with wasm memory. I doubt those strings ever get collected.. and they are a lot larger than the equivalent in wasm memory because they are base64.

The other big advantage of using wasm memory is that it opens the door for zero-copy files (at least in read-only circumstances)... even for the old FS I think this is do-able.

wasmfs_c = True
leading = ''
elif arg.startswith('--js-output'):
jsoutput = arg.split('=', 1)[1] if '=' in arg else None
leading = ''
Expand Down Expand Up @@ -380,6 +389,8 @@ def was_seen(name):

metadata = {'files': []}

c_output = ''

# Set up folders
partial_dirs = []
for file_ in data_files:
Expand All @@ -392,6 +403,7 @@ def was_seen(name):
if partial not in partial_dirs:
code += ('''Module['FS_createPath'](%s, %s, true, true);\n'''
% (json.dumps('/' + '/'.join(parts[:i])), json.dumps(parts[i])))
c_output += f'mkdir("{partial}", 0700);\n'
partial_dirs.append(partial)

if has_preloaded:
Expand Down Expand Up @@ -475,13 +487,35 @@ def was_seen(name):
basename = os.path.basename(filename)
if file_['mode'] == 'embed':
# Embed
data = base64_encode(utils.read_binary(file_['srcpath']))
code += '''var fileData%d = '%s';\n''' % (counter, data)
code += ('''Module['FS_createDataFile']('%s', '%s', decodeBase64(fileData%d), true, true, false);\n'''
% (dirname, basename, counter))
binary = utils.read_binary(file_['srcpath'])
if not wasmfs_c:
# JS output
data = base64_encode(binary)
code += '''var fileData%d = '%s';\n''' % (counter, data)
code += ('''Module['FS_createDataFile']('%s', '%s', decodeBase64(fileData%d), true, true, false);\n'''
% (dirname, basename, counter))
else:
# C output.
# convert the binary data into a C escaped string, \xAB for hex code AB
def escape_for_c(char):
if char < 16:
# add a 0 to keep the output in two bytes
return '\\x0' + hex(char)[-1:]
return '\\x' + hex(char)[-2:]
data = ''.join([escape_for_c(char) for char in binary])
# directories... :( make them
c_output += f'''static const char fileData{counter}[] = "{data}";\n'''
c_output += f'''
FILE* file{counter} = fopen("{dirname}" "/" "{basename}", "wb");
fwrite(fileData{counter}, 1, {len(binary)}, file{counter});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm not mistaken, doing it this way causes the file data to be loaded into memory as part of the data section, and then copied into different memory (assuming this is a memory-backed FS), which seems unfortunate. Here are a couple of crazy ideas about how to avoid this:

  1. Somehow put the data into a passive segment, and then drop it after copying. I guess that would probably have to involve assembly or compiler builtins, but might be worth it.
  2. Put the data into a custom section instead of the data section. This has the downside that we'd have to go out to JS to get it (and I guess copy it temporarily into wasm memory during initialization). Also I'm not actually clear whether the VM would retain the custom section in its memory or not...

I'm not familiar with how this was done before. Did the old way of embedding avoid this problem somehow? Could we get the data into a JS arraybuffer which could eventually be dropped and GC'ed. I assume that downloading the data from the server rather than having it be embedded isn't an option?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this has overhead - a copy, as you said.

The old method copies the data from JS base64 encoded text during startup, so is even worse...

A passive segment or a custom section is an interesting idea. But I was actually hoping we can use an internal WasmFS API, basically giving it a pointer and telling it that it "owns" the data. Then the data would simply stay where it is. That's a little odd if it's written to (modifying "static" data) but it is as efficient as it gets, and for this purpose it is safe since the data corresponds 1:1 to a file. But maybe there's a problem with that idea?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need any kind of passive segment or special segment do we? Can't we just use an internal API and pass a pointer and size?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right, avoiding copying at all would be ideal. I guess that would work up until there's any kind of append, then you'd have to allocate more space and copy the whole thing. But for read-only it would be great.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, and wrt memory overhead: The data section is part of the wasm file (some of which the engine keeps... somewhere? maybe in memory?), and it's obviously initialized into wasm memory in the data area, and then into wherever it's copied if applicable. Coming from JS, it would be in a JS file (which... again, we don't really know whether the VM keeps that), and once wherever it was copied if applicable (but no extra copy in the wasm memory). So it could actually be less overhead to have it come from JS if we aren't able to re-use the static data as the single copy (e.g. if we think it will be written, or if we are writing into a persistent FS rather than a memory FS).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should probably consider writable preloaded files to be out of scope, no? I can't imagine many folks wanting that... and if they did then it seems reasonable to make a copy of first write.. but aside from that I think a single copy if wasm memory sounds like a good goal.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I was actually hoping we can use an internal WasmFS API, basically giving it a pointer and telling it that it "owns" the data. Then the data would simply stay where it is.

It would be pretty easy to have a StaticMemoryFile that just contains a pointer to some static memory in the data section. It could easily be mutable, too, just like any non-const global byte array in C. Growing it could allocate the additional data in a vector without copying the original data; there's no need for the data to be contiguous.

Regarding the delivery mechanism; all our data segments (both passive and active) are dropped on startup, either manually in the case of passive segments or automatically in the case of active segments. So there shouldn't be any extra copies of the data sitting around. Using custom sections, OTOH, would require an extra copy to be kept somewhere on the JS heap since the Wasm API exposes their contents.

fclose(file{counter});
'''

counter += 1
elif file_['mode'] == 'preload':
# Preload
assert not wasmfs_c, 'wasmfs-c mode only supports embedding'

counter += 1

metadata_el = {
Expand Down Expand Up @@ -920,6 +954,18 @@ def was_seen(name):
})();
''' % _metadata_template

if wasmfs_c:
ret = r'''
#include <stdio.h>
#include <sys/stat.h>

void __wasmfs_load_embedded() {

%s

}
''' % c_output

if force or len(data_files):
if jsoutput is None:
print(ret)
Expand Down