
APIs for libraries/frameworks/tools to control on-disk compilation cache (NODE_COMPILE_CACHE) #53639

Closed
@joyeecheung

Description

Spinning from #52535 (comment)

Currently, the built-in on-disk compilation cache can only be enabled by NODE_COMPILE_CACHE. It's possible for the end user to control where the NODE_COMPILE_CACHE is stored and so that it's also possible for them to find the cache and clean it up when necessary. That's the simplest enabling mechanism for sure, but from the use cases of v8-compile-cache (a package that monkey-patches the CJS loader, which is a capability that we want to sunset, see #47472). It's also common for library/framework authors to want to enable this in a more flexible manner. So this issue is opened to discuss what an API for this should look like and what the directory structure of the cache should look like.

With the global NODE_COMPILE_CACHE the current cache directory structure looks like this:

- Compile cache directory (from NODE_COMPILE_CACHE)
  - $version_hash1: CRC32 hash of CachedDataVersionTag + NODE_VERSION (maybe we need to add UID too)
  - $version_hash2:
    - $module_hash_1: CRC32 hash of filename + module type.      <--- cache files
    - $module_hash_2: ...
...
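
To illustrate how keys like $version_hash and $module_hash above could be derived, here is a minimal sketch using the standard CRC-32 (IEEE polynomial). The exact input strings Node.js hashes are internal details; the inputs below (the `12345` tag value, the example path, and the `:`-joined format) are illustrative assumptions, not the actual implementation.

```javascript
// Standard CRC-32 (reflected, polynomial 0xEDB88320), bit-by-bit.
function crc32(str) {
  let crc = 0xFFFFFFFF;
  for (let i = 0; i < str.length; i++) {
    crc ^= str.charCodeAt(i);
    for (let b = 0; b < 8; b++) {
      // XOR in the polynomial when the low bit is set.
      crc = (crc >>> 1) ^ (0xEDB88320 & -(crc & 1));
    }
  }
  return (crc ^ 0xFFFFFFFF) >>> 0;
}

// Hypothetical version key from a cached-data version tag plus the
// Node.js version:
const versionHash = crc32(`12345:${process.version}`).toString(16);
// Hypothetical per-module key from resolved filename + module type:
const moduleHash = crc32('/path/to/foo.js:commonjs').toString(16);
console.log(versionHash, moduleHash);
```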

For reference v8-compile-cache's cache directory looks like this

- $tmpdir/v8-compile-cache-$uid-$arch-version
  - $main_name.BLOB: named after the module that `require`s `v8-compile-cache`, or process.cwd() if it's not required from a file
  - $main_name.MAP:
  - $main_name.LOCK

And inside the .BLOB files it maintains a module_filename + SHA-1 checksum -> cache_data mapping. Its documentation explains:

The cache is entry module specific because it is faster to load the entire code cache into memory at once, than it is to read it from disk on a file-by-file basis.

In my investigation when implementing NODE_COMPILE_CACHE, though, there's actually not much of a performance difference in reading on a file-by-file basis, at least when it's implemented using native FS calls and when each cache file is only loaded when the corresponding module is about to be compiled (so we avoid loading all of the cache into the process at once even though some modules might not be needed by the application at all, which is what v8-compile-cache does).

For third-party tooling (e.g. transpilers, package managers) I think a layout that doesn't distinguish between entry points would still be beneficial - as long as the final resolved file path remains the same, its content matches the checksum, and it's still being loaded by the same Node.js version etc., the cache is going to hit. Then if multiple dependencies in the same project try to enable it, we wouldn't be saving multiple caches on disk even though they are effectively caching the code for the same files (e.g. the end user's code needs package foo that resolves to /path/to/foo.js, whose cache would otherwise be stored once in the cache enabled by a transpiler and again in the cache enabled by a package manager that executes a run command).

I wonder if we should just provide the following APIs:

const module = require('node:module');  // Or import it

/**
 * Enable the on-disk compile cache for all user modules compiled in the current Node.js instance
 * after this method is called.
 * If cacheDir is undefined, defaults to the NODE_COMPILE_CACHE environment variable.
 * If NODE_COMPILE_CACHE isn't set, default to `$TMPDIR/node_compile_cache`.
 * @param {string|undefined} cacheDir
 * @returns {string} The path to the resolved cache directory.
 */
module.enableCompileCache(cacheDir);

/**
 * @returns {string|undefined} The resolved cache directory, if the on-disk compile cache is configured.
 *   Otherwise returns undefined.
 */
module.getCompileCacheDir();

module.getCompileCacheDir() would still allow end users to find and clean up stale caches to release disk space. We could probably also add a file with an easy-to-find name to the designated directory (e.g. $CACHE_DIR/node_compile_cache_mark) to facilitate this too.

In most use cases, tooling and libraries should simply call module.enableCompileCache() without passing in an argument, so that the cache is stored in tmpdir and can be shared with other dependencies by default, and end users can override the default cache directory location with NODE_COMPILE_CACHE. Some more advanced tooling/frameworks might want further customization and their own cache directory, in which case they can specify one.

Some more powerful APIs are probably needed to allow advanced configuration of the cache storage, but at least the APIs mentioned above would address the use cases of existing v8-compile-cache users. For the more powerful APIs, it would be difficult to design something that works well without some collaboration with adopters, so ideas are welcome on what those should look like :)
