Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test,fs: delay unlink in test-regress-GH-4027.js #14010

Closed

Conversation

jaimecbernardo
Copy link
Contributor

The sequential/test-regress-GH-4027 test is flaky on Windows CI with an increased system load, failing when the watched file is unlinked before the first state of the watched file is retrieved.

After increasing the delay before unlinking and calling setTimeout after watchFile, the flakiness stopped reproducing.

Fixes: #13800

/cc @nodejs/testing

Checklist
  • make -j4 test (UNIX), or vcbuild test (Windows) passes
  • commit message follows commit guidelines
Affected core subsystem(s)

test

The sequential/test-regress-GH-4027 test is flaky with an increased
system load, failing when the watched file is unlinked before the
first state of the watched file is retrieved.

After increasing the delay before unlinking and calling setTimeout
after watchFile, the flakiness stopped reproducing.

Fixes: nodejs#13800
@nodejs-github-bot nodejs-github-bot added the test Issues and PRs related to the tests. label Jun 30, 2017
Copy link
Member

@Trott Trott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM if CI is green

I confirmed that changing the test this way (and making a few other modifications so as to run in Node.js 0.8) results in a test that fails (as expected) in 0.8.9 and passes in 0.8.10 (where the bug was fixed). So that's good.

I'd prefer something that eliminates the race condition rather than accommodates it, but that may not be possible. And if it is possible, someone can do it in a subsequent PR.

@Trott
Copy link
Member

Trott commented Jun 30, 2017


fs.watchFile(filename, { interval: 50 }, common.mustCall(function(curr, prev) {
assert.strictEqual(prev.nlink, 1);
assert.strictEqual(curr.nlink, 0);
fs.unwatchFile(filename);
}));

setTimeout(fs.unlinkSync, 300, filename);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

common.platformTimeout?

@mscdex mscdex added the fs Issues and PRs related to the fs subsystem / file system. label Jul 1, 2017
@jaimecbernardo
Copy link
Contributor Author

@jasnell thanks for your review. Added the changes you requested in 0de3034 PTAL

@Trott thanks for also testing the original regression and putting the CI to run. From what I see from the CI results, the failures are unrelated to the test changed in the commit.

@Trott
Copy link
Member

Trott commented Jul 1, 2017

CI: https://ci.nodejs.org/job/node-test-pull-request/8922/

@refack
Copy link
Contributor

refack commented Jul 2, 2017

Asking for context: AFAICT the test needs to unlink after two watch intervals. I was not sure if it's exactly 2 or at least 2...
I guess @Trott empirical validation means it breaks with >=2 intervals, but can anyone reason way?

@Trott
Copy link
Member

Trott commented Jul 2, 2017

@refack I'm not sure I understand your question, but I think this is what's going on so maybe this helps?: fs.watchFile() returns asynchronously, at least on some operating systems. In other words, a Watcher object is returned, but it may not be actively watching the filesystem yet. That happens whenever it happens. So by making the setTimeout() longer, it makes it less likely that the unlink will happen before the Watcher is actually receiving events.

@Trott
Copy link
Member

Trott commented Jul 2, 2017

(In other words, I don't think the interval is particularly relevant here.)

@refack
Copy link
Contributor

refack commented Jul 2, 2017

@refack I'm not sure I understand your question, but I think this is what's going on so maybe this helps?: fs.watchFile() returns asynchronously, at least on some operating systems. In other words, a Watcher object is returned, but it may not be actively watching the filesystem yet. That happens whenever it happens. So by making the setTimeout() longer, it makes it less likely that the unlink will happen before the Watcher is actually receiving events.

AFAICT the original bug would happen if fs.watchFile would poll twice (i.e. { interval: 50 }) before the file was unlinked and erroneously give prev.nlink === curr.nlink.
I was not be sure if it took exactly two intervals to cause the error (hence interval=50 timeout=100 == 50*2), or at least two intervals (interval=50, timeout>100).
You showed empeericaly that timeout=300 manifests the bug in 0.8.9, I was wondering if anyone has free brain cycles to reason about the code and prove it analyticaly... (I don't I'm 🍳)

@jaimecbernardo
Copy link
Contributor Author

@refack From my understanding, the test is a regression test that is there to verify that there isn't a regression in this issue nodejs/node-v0.x-archive#4027 which is fixed in nodejs/node-v0.x-archive@39a0836, which landed in https://github.com/nodejs/node-v0.x-archive/tree/v0.8.10-release . So, @Trott ran the test in v0.8.9, where the test is supposed to fail, and in v0.8.10, where the issue is supposed to be fixed and the test is supposed to pass. This is to make sure that the test is still able to detect if the bug from the original issue is reintroduced. So, it's good that the test fails in v0.8.9, as it should fail with or without the change to 300ms.

@Trott
Copy link
Member

Trott commented Jul 3, 2017

@jaimecbernardo I think @refack groks all that. He's asking about the original bug and requirements for triggering it. I didn't look closely at it, just empirically tested to see that the test still functions as expected. So I don't have an answer for him. But that shouldn't stop this from landing. He's looking for a deep dive into the original bug, not how the test works. (And he'll correct me if I'm wrong about any of this! :-D )

@Trott
Copy link
Member

Trott commented Jul 3, 2017

Landed in f1c890a.

Thanks for the contribution! 🎉

@Trott Trott closed this Jul 3, 2017
Trott pushed a commit that referenced this pull request Jul 3, 2017
The sequential/test-regress-GH-4027 test is flaky with an increased
system load, failing when the watched file is unlinked before the
first state of the watched file is retrieved.

After increasing the delay before unlinking and calling setTimeout
after watchFile, the flakiness stopped reproducing.

PR-URL: #14010
Fixes: #13800
Reviewed-By: Rich Trott <rtrott@gmail.com>
Reviewed-By: Refael Ackermann <refack@gmail.com>
addaleax pushed a commit that referenced this pull request Jul 11, 2017
The sequential/test-regress-GH-4027 test is flaky with an increased
system load, failing when the watched file is unlinked before the
first state of the watched file is retrieved.

After increasing the delay before unlinking and calling setTimeout
after watchFile, the flakiness stopped reproducing.

PR-URL: #14010
Fixes: #13800
Reviewed-By: Rich Trott <rtrott@gmail.com>
Reviewed-By: Refael Ackermann <refack@gmail.com>
@addaleax addaleax mentioned this pull request Jul 11, 2017
addaleax pushed a commit that referenced this pull request Jul 18, 2017
The sequential/test-regress-GH-4027 test is flaky with an increased
system load, failing when the watched file is unlinked before the
first state of the watched file is retrieved.

After increasing the delay before unlinking and calling setTimeout
after watchFile, the flakiness stopped reproducing.

PR-URL: #14010
Fixes: #13800
Reviewed-By: Rich Trott <rtrott@gmail.com>
Reviewed-By: Refael Ackermann <refack@gmail.com>
Fishrock123 pushed a commit that referenced this pull request Jul 19, 2017
The sequential/test-regress-GH-4027 test is flaky with an increased
system load, failing when the watched file is unlinked before the
first state of the watched file is retrieved.

After increasing the delay before unlinking and calling setTimeout
after watchFile, the flakiness stopped reproducing.

PR-URL: #14010
Fixes: #13800
Reviewed-By: Rich Trott <rtrott@gmail.com>
Reviewed-By: Refael Ackermann <refack@gmail.com>
MylesBorins pushed a commit that referenced this pull request Aug 15, 2017
The sequential/test-regress-GH-4027 test is flaky with an increased
system load, failing when the watched file is unlinked before the
first state of the watched file is retrieved.

After increasing the delay before unlinking and calling setTimeout
after watchFile, the flakiness stopped reproducing.

PR-URL: #14010
Fixes: #13800
Reviewed-By: Rich Trott <rtrott@gmail.com>
Reviewed-By: Refael Ackermann <refack@gmail.com>
MylesBorins pushed a commit that referenced this pull request Aug 15, 2017
The sequential/test-regress-GH-4027 test is flaky with an increased
system load, failing when the watched file is unlinked before the
first state of the watched file is retrieved.

After increasing the delay before unlinking and calling setTimeout
after watchFile, the flakiness stopped reproducing.

PR-URL: #14010
Fixes: #13800
Reviewed-By: Rich Trott <rtrott@gmail.com>
Reviewed-By: Refael Ackermann <refack@gmail.com>
MylesBorins pushed a commit that referenced this pull request Aug 16, 2017
The sequential/test-regress-GH-4027 test is flaky with an increased
system load, failing when the watched file is unlinked before the
first state of the watched file is retrieved.

After increasing the delay before unlinking and calling setTimeout
after watchFile, the flakiness stopped reproducing.

PR-URL: #14010
Fixes: #13800
Reviewed-By: Rich Trott <rtrott@gmail.com>
Reviewed-By: Refael Ackermann <refack@gmail.com>
@MylesBorins MylesBorins mentioned this pull request Aug 16, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fs Issues and PRs related to the fs subsystem / file system. test Issues and PRs related to the tests.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants