Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fs: improve cpSync performance #53541

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

anonrig
Copy link
Member

@anonrig anonrig commented Jun 22, 2024

By moving the function to C++, we should get a significant improvement in performance.

I'm extremely open to suggestions on improving the benchmarks, but with the current state, here are the results:

local benchmarks

macOS M1 Max

                           confidence improvement accuracy (*)    (**)   (***)
fs/bench-cpSync.js n=1            ***     63.75 %      ±10.38% ±13.83% ±18.03%
fs/bench-cpSync.js n=100          ***    607.18 %      ±46.07% ±62.03% ±82.22%
fs/bench-cpSync.js n=10000        ***    690.84 %      ±30.74% ±41.37% ±54.83%

Be aware that when doing many comparisons the risk of a false-positive result increases.
In this case, there are 3 comparisons, you can thus expect the following amount of false-positive results:
  0.15 false positives, when considering a   5% risk acceptance (*, **, ***),
  0.03 false positives, when considering a   1% risk acceptance (**, ***),
  0.00 false positives, when considering a 0.1% risk acceptance (***)

benchmark ci

19:04:42                            confidence improvement accuracy (*)    (**)   (***)
19:04:42 fs/bench-cpSync.js n=1            ***    154.37 %       ±6.67%  ±8.88% ±11.57%
19:04:42 fs/bench-cpSync.js n=100          ***    163.66 %       ±9.09% ±12.22% ±16.16%
19:04:42 fs/bench-cpSync.js n=10000        ***    103.21 %       ±0.96%  ±1.28%  ±1.67%

benchmark ci: https://ci.nodejs.org/view/Node.js%20benchmark/job/benchmark-node-micro-benchmarks/1572/

@nodejs-github-bot nodejs-github-bot added c++ Issues and PRs that require attention from people who are familiar with C++. fs Issues and PRs related to the fs subsystem / file system. needs-ci PRs that need a full CI run. typings labels Jun 22, 2024
@anonrig anonrig force-pushed the improve-fs-cp-sync branch 2 times, most recently from f7f3d9b to aab2a36 Compare June 22, 2024 01:48
src/node_file.cc Outdated Show resolved Hide resolved
src/node_file.cc Outdated Show resolved Hide resolved
@anonrig
Copy link
Member Author

anonrig commented Jun 25, 2024

cc @nodejs/performance @nodejs/fs @nodejs/cpp-reviewers

@anonrig anonrig requested a review from lemire June 25, 2024 18:45
@anonrig anonrig force-pushed the improve-fs-cp-sync branch 2 times, most recently from e3e4afc to ea737f7 Compare June 25, 2024 20:15
@anonrig
Copy link
Member Author

anonrig commented Jun 25, 2024

cc @nodejs/tsc

this pull-request is now a semver-major pull-request due to the following changes, and will not be ported to previous versions if we don't make an exception:

  • For the test case of It throws error if symlink in src points to location in dest., we were throwing error code ERR_FS_CP_EINVAL. With std::filesystem it's not possible to get this edge case without a significant performance penalty. We are now throwing ERR_FS_CP_EEXIST which is also correct. (File exists)
  • For the test case of It throws error if symlink in dest points to location in src., we were throwing error code ERR_FS_CP_SYMLINK_TO_SUBDIRECTORY, which checks for similar behavior in previous bullet point but unfortunately it was returning a different error code. Now, we are returning ERR_FS_CP_EEXIST.
  • For the test case of It throws EEXIST error if attempt is made to copy symlink over file. we were throwing error EEXIST and right now we are returning ERR_FS_CP_EEXIST.
  • For the test case of It copies link if it does not point to folder in src., previously it succeeded. Right now, it throws an error. I think the new behavior is more correct. If there is agreement, we have to ditch this pull-request, or go with copying files one by one.

I appreciate if you could share your comments and thoughts on this.

@anonrig anonrig force-pushed the improve-fs-cp-sync branch from ea737f7 to a859abd Compare June 25, 2024 20:23
@anonrig anonrig added the semver-major PRs that contain breaking changes and should be released in the next major version. label Jun 25, 2024
@anonrig anonrig force-pushed the improve-fs-cp-sync branch from a859abd to 4f7d9d4 Compare June 25, 2024 20:31
@richardlau
Copy link
Member

I'm okay with those changes as semver-major. I would be -1 on any sort of exception.

@benjamingr
Copy link
Member

I'm not sure, can't we re-map the thrown errors to the existing codes (e.g. in a catch, even if it's hard/degrates performance in C++ land (why?))?

Wouldn't that only hurt the performance in the error case which isn't the interesting one?

I'm wondering if our users for APIs like cpSync that are less performance sensitive (since they're blocking) don't care more breaking changes than a perf gain.

(Just to be extra explicit - I'm not blocking this PR)

@benjamingr
Copy link
Member

Wait actually reading the code I think I understand since std::filesystem doesn't distinguish between those cases.

@anonrig
Copy link
Member Author

anonrig commented Jun 25, 2024

I'm not sure, can't we re-map the thrown errors to the existing codes (e.g. in a catch, even if it's hard/degrates performance in C++ land (why?))?

These specific errors can only be caught if we traverse the filesystem one by one, and make the necessary copy operations. Since we don't do it, there is no way to know if the EEXIST was thrown from a symlink in src, or a symlink in destination that looks into a path in destination folder. The test cases are same, but they throw different errors.

Wouldn't that only hurt the performance in the error case which isn't the interesting one?

We can catch it, but the problem is, we don't know which file is causing this error.

@benjamingr
Copy link
Member

Thanks, I'm +1 regarding the change and the breakage since I doubt people using cpSync are checking error codes this granularly. We may want to be prudent and trigger a CITGM run but I don't expect it'll find much.

(It should still be semver major though)

@anonrig
Copy link
Member Author

anonrig commented Jun 25, 2024

Wait actually reading the code I think I understand since std::filesystem doesn't distinguish between those cases.

@benjamingr Actually, there is a std::filesystem::filesystem_error which I'm not sure how to catch it without try/catch. https://en.cppreference.com/w/cpp/filesystem/filesystem_error

@anonrig
Copy link
Member Author

anonrig commented Jun 25, 2024

Wait actually reading the code I think I understand since std::filesystem doesn't distinguish between those cases.

@benjamingr Actually, there is a std::filesystem::filesystem_error which I'm not sure how to catch it without try/catch. https://en.cppreference.com/w/cpp/filesystem/filesystem_error

I love systems engineering. We can't use try/catch because we are passing -no-exceptions flag to GCC.

../../src/node_file.cc:2264:3: error: cannot use 'try' with exceptions disabled
  try {
  ^
1 error generated.

@benjamingr
Copy link
Member

That's for the overload that doesn't contain the error_code - and has the same info as the error_code so if it isn't there it doesn't distinguish between those cases. (Also CI should tell us that std::filesystem::copy gives these errors in a cross-platform friendly way)

@anonrig
Copy link
Member Author

anonrig commented Jun 25, 2024

That's for the overload that doesn't contain the error_code - and has the same info as the error_code so if it isn't there it doesn't distinguish between those cases. (Also CI should tell us that std::filesystem::copy gives these errors in a cross-platform friendly way)

Even with that overloaded function, the following code does not compile due to disabled exception handling @benjamingr

  try {
    std::filesystem::copy(src_path, dest_path, options);
  } catch (const std::filesystem::filesystem_error& error) {
    if (error.code() == std::errc::file_exists) {
      std::string message = "File already exists";
      return THROW_ERR_FS_CP_EEXIST(env, message.c_str());
    }

     std::string message = "Unhandled error " +
                          std::to_string(error_code.value()) + ": " +
                          error_code.message();
    return THROW_ERR_FS_CP_EINVAL(env, message.c_str());
  }

@benjamingr
Copy link
Member

benjamingr commented Jun 25, 2024

@anonrig you misunderstand me, sorry - I mean filesystem::copy has several overloads. The one you used in this PR already takes an error code reference here).

If you do that, it doesn't throw and returns the error code through the parameter instead. If you remove that parameter it will throw.

(There are valid important reasons why exceptions are disabled in the codebase, but feel free to disable that for testing)

@anonrig
Copy link
Member Author

anonrig commented Jun 25, 2024

@anonrig you misunderstand me, sorry - I mean filesystem::copy has several overloads. The one you used in this PR already takes an error code reference here).

If you do that, it doesn't throw and returns the error code through the parameter instead. If you remove that parameter it will throw.

(There are valid important reasons why exceptions are disabled in the codebase, but feel free to disable that for testing)

Yes, unfortunately std::error_code does not have path() property, but std::filesystem::filesystem_error does, and with exception handling disabled, it's not possible to get those paths, and make this a non-semver major pull-request.

GGYry8sXEAATdHB

Copy link
Contributor

@Ethan-Arrowood Ethan-Arrowood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is excellent. Im in favor of this remaining a major change. Great work!

src/node_file.cc Outdated Show resolved Hide resolved
src/node_file.cc Outdated Show resolved Hide resolved
Copy link
Member

@mcollina mcollina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Copy link
Contributor

@LiviaMedeiros LiviaMedeiros left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with green CI. Since the implementation was adjusted, benchmark rerun would be appreciated.

src/node_file.cc Outdated Show resolved Hide resolved
@@ -218,7 +218,7 @@ function nextdir() {
assert.throws(
() => cpSync(src, dest, mustNotMutateObjectDeep({ recursive: true })),
{
code: 'ERR_FS_CP_EINVAL'
code: 'ERR_FS_CP_EEXIST'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
code: 'ERR_FS_CP_EEXIST'
code: 'ERR_FS_CP_EINVAL'

AFAICT if symlink in src points to location in dest, cp will throw EINVAL due to presumably attempting of copying directory to its subdirectory. This might be a bug in the subdirectory checking implementation (isSrcSubdir(resolvedSrc, resolvedDest)).

Directory structure before copying:

├── copy_15
│   └── link -> node/test/.tmp.0/copy_16
├── copy_16
│   └── link -> node/test/.tmp.0/copy_16

Directory structure after performing regular cp -r node/test/.tmp.0/copy_15 node/test/.tmp.0/copy_16:

├── copy_15
│   └── link -> node/test/.tmp.0/copy_16
├── copy_16
│   ├── copy_15
│   │   └── link -> node/test/.tmp.0/copy_16
│   └── link -> node/test/.tmp.0/copy_16

It's not really related to this particular PR, but i guess creating cyclic symlink like this should be allowed.

Copy link
Member

@lemire lemire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this is nice enough. More detailed errors could be provided as part of a follow-up PR.

@anonrig anonrig force-pushed the improve-fs-cp-sync branch from 4f7d9d4 to 8b0f25f Compare June 25, 2024 22:59
@anonrig
Copy link
Member Author

anonrig commented Jun 25, 2024

LGTM with green CI. Since the implementation was adjusted, benchmark rerun would be appreciated.

@LiviaMedeiros benchmark ci: https://ci.nodejs.org/view/Node.js%20benchmark/job/benchmark-node-micro-benchmarks/1572/

@anonrig anonrig added the request-ci Add this label to start a Jenkins CI on a PR. label Jun 25, 2024
@github-actions github-actions bot removed the request-ci Add this label to start a Jenkins CI on a PR. label Jun 25, 2024
@nodejs-github-bot
Copy link
Collaborator

Copy link
Member

@jasnell jasnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm good with these changes as semver major. -1 on back porting.

@targos
Copy link
Member

targos commented Jun 26, 2024

I'm a little afraid of having a branch with two completely separate implementations of the feature depending on the options passed by the user. This LGTM if the end goal is to get rid of the JS one.

copy_options options = copy_options::copy_symlinks;

// When true timestamps from src will be preserved.
if (preserveTimestamps) options |= copy_options::create_hard_links;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this correct?

afaict, this will create hard links in the destination directory to the original files. so, while it will appear upon inspecting the filesystem that the files have been copied and the timestamps have been preserved, the files will not have actually been copied, just linked as directory entries in the filesystem to the same file.

this means if you change any of the files in the source directory subsequently, the "files" in the destination will also change. i haven't looked at existing code in detail but am pretty sure this is not how it works with this flag.

i am also guessing without checking that this will only work if source and destination are on same filesystem? 🤔

if i am indeed correct, i guess simplest thing to do for now would be to use the slow path if "preserveTimestamps" is set?

from a quick google, i'm not sure there is an efficient way to copy and preserve timestamps without having to touch each file in a syscall and it just doesn't seem that there is an option rn in std::filesystem to emulate the current behaviour. happy to be wrong though - not a C++ expert.

https://en.cppreference.com/w/cpp/filesystem/copy_options
https://en.wikipedia.org/wiki/Hard_link

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i put some benches using the test/fixtures/copy directory from node.js and the same options as above here if anyone wants to try them out - should be easy to get them working on linux or macos.
https://github.com/just-js/lo-bench/blob/main/fs/README.md

some things to note:

on linux core i5, using /dev/shm to reduce filesystem/disk overhead

  • bun is roughly 4x the performance of current node.js 22.3
  • a "zero-overhead" implementation on v8, using lo runtime and v8 fastcalls to call std::filesystem::copy is ~2.5x the performance of current node - can assume this would be the theoretical max of node.js with this change merged i think.
  • bun is still roughly 1.5x the optimal v8 implementation using filesystem::copy

on macos m1 mini, using a ram disk

  • bun is rougly 2.5x performance of current node.js 22.3
  • the "optimal" v8 implementation using fileystem::copy is ~1.3x node.js 22.3
  • bun is still roughly 2x better than "optimal" v8 using filesystem::copy

also, for the same workload:

  • node does ~90k syscalls
  • bun does ~55k
  • std::filesystem::copy does ~65k

so, without testing with other directory structures and sizes/depths but assuming we see similar results, we could assert the following:

  • this change will still be significantly slower than bun
  • there is still no good solution for getting individual file errors back, which seems necessary
  • it can't be used when we need to preserve timestamps (see above)
  • imo, the existing implementation could probably be changed to not make as many unnecessary syscalls, remain in JS and be within a couple of percentage points of throughput of the bun implementation. this might be a better long term approach?

another couple of things to note

  • if we use force: false then the difference between node.js and bun is much smaller
  • the bigger/deeper the directory is, the smaller the difference will be between node and bun as the throughput is syscall heavy.
  • the difference will probably be less again on an actual filesystem. i just used ramdisk in the benches to eliminate that overhead and get a clearer picture of runtime overhead.

let me know if i got anything wrong - i ran through this pretty quickly, but it should be easy to reproduce those results using the link above. 🙏

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.

For now, I'll split this pull-request into multiple changes, and try to optimize the existing implementation before moving it into full C++.

#53612

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's a different approach: #53614

@nodejs-github-bot
Copy link
Collaborator

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c++ Issues and PRs that require attention from people who are familiar with C++. fs Issues and PRs related to the fs subsystem / file system. needs-ci PRs that need a full CI run. semver-major PRs that contain breaking changes and should be released in the next major version. typings
Projects
None yet
Development

Successfully merging this pull request may close these issues.