
bpo-35189: Retry fnctl calls on EINTR #10413

Merged: 1 commit, Nov 23, 2018

Conversation

@nierob (Contributor) commented Nov 8, 2018:

According to PEP 475, Python should retry system calls when they are interrupted by a signal (EINTR).

https://bugs.python.org/issue35189
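The behavior PEP 475 prescribes can be sketched at the Python level. Note this is purely illustrative: the actual fix in this PR lives in C, and CPython runs pending signal handlers before retrying; the helper name here is invented:

```python
def retry_on_eintr(syscall, *args):
    """Retry a system call until it stops failing with EINTR.

    Illustrative sketch of the PEP 475 retry loop; CPython implements
    this internally in C, after running any pending signal handlers.
    """
    while True:
        try:
            return syscall(*args)
        except InterruptedError:
            # A signal interrupted the call (errno == EINTR); retry.
            continue
```

Before PEP 475, callers had to write such loops themselves; this PR moves the retry into the module's C implementation.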

@the-knights-who-say-ni (bot) commented:

Hello, and thanks for your contribution!

I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA).

Our records indicate we have not received your CLA. For legal reasons we need you to sign this before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue.

If you have recently signed the CLA, please wait at least one business day for our records to be updated.

You can check yourself to see if the CLA has been received.

Thanks again for your contribution, we look forward to reviewing it!

@vstinner (Member) left a comment:

Did you think about designing a test that triggers the bug in Lib/test/eintrdata/eintr_tester.py? A test that blocks on an operation while being spammed with SIGALRM signals every N milliseconds, and that somehow unblocks the function call later. You might use two processes for that. See eintr_tester.py, which has multiple examples doing exactly that.
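The signal-spamming setup described above can be sketched as a standalone POSIX-only snippet (an illustration of the technique, not the actual eintr_tester.py code; the period value is arbitrary):

```python
import signal
import time

signal_period = 0.01          # fire SIGALRM every 10 ms (arbitrary value)
received = []

def handler(signum, frame):
    received.append(signum)   # record each delivered signal

signal.signal(signal.SIGALRM, handler)
# First delivery after signal_period seconds, then repeating every signal_period.
signal.setitimer(signal.ITIMER_REAL, signal_period, signal_period)

# A blocking call; thanks to PEP 475 it is transparently retried on EINTR,
# so it still sleeps for the full duration despite the interruptions.
time.sleep(0.1)

signal.setitimer(signal.ITIMER_REAL, 0, 0)   # disarm the timer
print("signals received:", len(received))
```

The real test replaces `time.sleep()` with the blocking call under test and checks that it survives the interruptions.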

@vstinner (Member) left a comment:

Can you please try to add a NEWS entry using the "blurb" command? (python3 -m pip install [--user] blurb)

For the commit message and NEWS entry, I would suggest listing the modified functions. And instead of saying "Python should", just describe what the change does, like "Modify the following fnctl functions to retry if interrupted by a signal (EINTR error): ...".

@nierob (author) commented Nov 9, 2018:

@vstinner OK, I will update the message and try to write tests.

@nierob (author) commented Nov 9, 2018:

OK, I have managed to write some tests, but I have a feeling that the way this is tested is flaky by design (sleep-based synchronization).

@nierob changed the title from "Issue35189: Retry fnctl calls on EINTR" to "bpo-35189: Retry fnctl calls on EINTR" on Nov 9, 2018
if (ret < 0) {
    PyErr_SetFromErrno(PyExc_OSError);
    return NULL;
@nierob (author) commented:

That is something I do not understand: why the result of PyErr_SetFromErrno(PyExc_OSError) was not returned before. The new code is now in sync with the other modules.


    lock_func(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    lock_func(f, fcntl.LOCK_UN)
except BlockingIOError:
    break
@vstinner (Member) commented:

This loop doesn't seem reliable. Is it possible to acquire the lock twice here? Maybe unblock the lock in the child, or wait until the child has started, something like that. Or maybe use a pipe. I don't know.

@nierob (author) commented:

In what way? As far as I understand, the exception can only be raised if the other process is in the right state (has managed to acquire the lock). I do see that it could yield after unlocking, to improve performance.

@vstinner (Member) commented:

Sorry, I misunderstood how file locks work. They are reentrant. Ignore my comment.
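The behavior being referred to can be checked directly: flock-style locks belong to the open file description, so taking the lock again through the same file object simply succeeds instead of blocking (a POSIX-only sketch, not part of the PR):

```python
import fcntl
import tempfile

with tempfile.TemporaryFile() as f:
    fcntl.flock(f, fcntl.LOCK_EX)
    # Re-acquiring through the same file object neither blocks nor raises
    # BlockingIOError: the lock is tied to the open file description.
    fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    fcntl.flock(f, fcntl.LOCK_UN)
    relock_ok = True
print("re-lock on the same file object succeeded:", relock_ok)
```

This is why the test needs a second process: only a lock held by a different process makes the non-blocking attempt raise BlockingIOError.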

@nierob (author) commented Nov 13, 2018:

I'm sorry, I fail to see why this loop seems unreliable to you; could you give me a bit more detail? There are two ways to get out of the loop: through interruption, or because of BlockingIOError. The first one is a test failure anyway, so it can be ignored; the other one means that the subprocess managed to acquire the lock, so we can start the real testing.

    break
with kill_on_error(proc):
    t0 = time.monotonic()
    lock_func(f, fcntl.LOCK_EX)
@nierob (author) commented:

Here is a race condition that is shared with all the other tests in this file: we cannot make sure that the function really blocked for at least some time. The race condition would be exposed under heavy load, if the main process were not given resources for at least self.sleep_time.

@vstinner (Member) commented:

Tests are run sequentially, so it's not an issue. Moreover, regrtest ensures that TESTFN is unique between two worker processes running Python tests.


self.assertGreaterEqual(dt, self.sleep_time)
self.stop_alarm()
support.unlink(support.TESTFN)
proc.wait()
@vstinner (Member) commented:

I proposed a better version of the test:

    def _lock(self, lock_func, lock_name):
        self.addCleanup(support.unlink, support.TESTFN)

        # Lock TESTFN for 'sleep_time' seconds
        code = '\n'.join((
            "import fcntl, time",
            "with open('%s', 'wb') as f:" % support.TESTFN,
            "   fcntl.%s(f, fcntl.LOCK_EX)" % lock_name,
            "   time.sleep(%s)" % self.sleep_time,
            "   fcntl.%s(f, fcntl.LOCK_UN)" % lock_name,
        ))

        start_time = time.monotonic()
        proc = self.subprocess(code)
        with kill_on_error(proc):
            with open(support.TESTFN, 'wb') as f:
                # wait until the child locked the file
                while True:  # synchronize the subprocess
                    dt = time.monotonic() - start_time
                    if dt > 60.0:
                        raise Exception("failed to sync child in %.1f sec" % dt)
                    try:
                        lock_func(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
                        lock_func(f, fcntl.LOCK_UN)
                        time.sleep(0.001)
                    except BlockingIOError:
                        break

                # the child locks the file for 'sleep_time' seconds
                lock_func(f, fcntl.LOCK_EX)
                dt = time.monotonic() - start_time
                self.assertGreaterEqual(dt, self.sleep_time)

                self.stop_alarm()
                proc.wait()
  • Ensure that TESTFN is always removed
  • Use kill_on_error(proc) around more code
  • Put a deadline of 60 seconds on the synchronization with the child
  • Fix the computation of the time delta

@nierob (author) commented:

I will apply 1-3, thanks!

I believe the 4th is a regression: only after BlockingIOError can we assume that the subprocess acquired the lock, so start_time should be reset there.

@nierob (author) commented:

Hmm, I have also realized that self.assertGreaterEqual(dt, self.sleep_time) is a bit too aggressive.

        time.sleep(0.001)
    except BlockingIOError:
        break
start_time = time.monotonic()
@vstinner (Member) commented:

This is wrong. You really need to start counting time before spawning the child process. Try adding "sleep(1)" before start_time to simulate a slow buildbot. Please don't add a test that requires perfect timing and perfect scheduling; that never happens on our slow buildbots...

@nierob (author) commented:

Hmm, right, it is wrong both ways.

If we reset the start time, we risk the reset happening a bit too late. If we do not reset it, we risk accepting a regression, because we add the "synchronization" time.

@nierob (author) commented:

We would need a way to confirm that the signal was handled while the lock call itself was not interrupted.

@nierob (author) commented:

I am starting to think that this can only be tested with a private API or some other hook in the library.

@vstinner requested a review from pablogsal, November 13, 2018 11:38
@vstinner (Member) commented:

Don't try to overengineer the test. It's fine to have a "best effort" test. We ensure that LOCK_EX was stuck for a few milliseconds while the process received a lot of SIGALRM signals: that's enough.

If you write a "perfect" unit test, it will fail on the buildbots, and I will not merge such a test.

@nierob (author) commented Nov 13, 2018:

I do not mind "best effort"; my intention was not to rely on perfect scheduling or timing. I have just focused on the other race condition :-)

We ensure that LOCK_EX has been stuck a few milliseconds

That is part of the problem: how do you do it? We cannot do it reliably, as we would have to ensure that the subprocess was holding the lock while the main process was trying to acquire it. Since I do not think we can enforce that without internals or querying the OS, best effort means accepting false negatives. Fine by me :-)

while the process received a lot of SIGALRM signals

A lot in this case most likely means 1, maybe 2, because signal_period is 0.1 while sleep_time is 0.2.


Have you seen that I added a 0.9 factor in self.assertGreaterEqual(dt, self.sleep_time * 0.9)?

The solution you proposed (without resetting start_time) increases dt, while the other one decreases self.sleep_time. Conceptually these are the same. I slightly prefer mine (surprise!) because it is deterministic (0.02 instead of the execution time of a random, somewhat unrelated loop) and explicit, but I'm fine with changing it back; it is really minor from my perspective.

@vstinner (Member) left a comment:

Have you seen that I added a 0.9 factor in self.assertGreaterEqual(dt, self.sleep_time * 0.9)?

Don't do that. In my experience, such magic numbers always make the test fail randomly on random buildbots. Put start_time where I proposed.

@bedevere-bot commented:

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

@vstinner (Member) commented:

We ensure that LOCK_EX has been stuck a few milliseconds

That is part of problem, how do you do it? We can not do it reliably as we would have to ensure that subprocess was holding the lock while the main process was trying to acquire it (...)

This is what I call "best effort". It's not possible to write perfect synchronization code. Good news: we don't need it.

while the process received a lot of SIGALRM signals

A lot in this case means most likely 1 maybe 2. That is because signal_period is 0.1, while sleep_time is 0.2.

Yep, that's enough. If you are not sure, unapply your C fix and make sure that the test fails.

@vstinner (Member) commented:

test_eintr is very fragile. There are known race conditions. I would prefer to remove the whole file rather than add a new test with a known race condition.

@vstinner (Member) commented:

Have you seen that I added a 0.9 factor in self.assertGreaterEqual(dt, self.sleep_time * 0.9)?

Don't do that. In my experience, such magic numbers always make the test fail randomly on random buildbots. Put start_time where I proposed.

Let me elaborate. I have been maintaining the Python CIs for 5 years, and I have tried such workarounds many times. They always failed in the long term. There is always a buildbot with "bad timing", and the test fails randomly.

@nierob (author) commented Nov 16, 2018:

I have made the requested changes; please review again

@bedevere-bot commented:

Thanks for making the requested changes!

@vstinner: please review the changes made to this pull request.

@nierob (author) commented Nov 21, 2018:

I have made the requested changes; please review again

@bedevere-bot commented:

Thanks for making the requested changes!

@vstinner: please review the changes made to this pull request.

@vstinner (Member) commented:

When I tested the test manually, I noticed that the second test method didn't get any signal :-( There is a bug in the base test case! I wrote PR #10637 to fix it. I would prefer to merge PR #10637 first, and then rebase this PR on top of master.

@vstinner (Member) commented:

When I tested the test manually, I noticed that the second test method didn't get any signal :-( There is a bug in the base test case! I wrote PR #10637 to fix it, and I just merged it.

Would you mind rebasing your PR on master (git merge or git rebase, up to you) to pick up this fix?

@nierob (author) commented Nov 23, 2018:

Sorry for the delay. No problem.

Modify the following fnctl function to retry if interrupted by a signal
(EINTR): flock, lockf, fnctl
@vstinner (Member) commented:

I will backport the change to 3.6 and 3.7. First, PR for 3.7: PR #10678.

vstinner added a commit that referenced this pull request Nov 23, 2018
* bpo-35189: Fix eintr_tester.py (GH-10637)

Call setitimer() before each test method, instead of once per test
case, to ensure that signals are sent in each test method.
Previously, only the first method of a testcase class got signals.

Changes:

* Replace setUpClass() with setUp() and replace tearDownClass() with
  tearDown().
* tearDown() now ensures that at least one signal has been sent.
* Replace support.run_unittest() with unittest.main() which has
  a nicer CLI and automatically discover test cases.

(cherry picked from commit aac1f81)

* bpo-35189: Retry fnctl calls on EINTR (GH-10413)

Modify the following fnctl function to retry if interrupted by a signal
(EINTR): flock, lockf, fnctl.

(cherry picked from commit b409ffa)
Co-Authored-By: nierob <nierob@users.noreply.github.com>
vstinner added a commit that referenced this pull request Nov 23, 2018
…-10685)

* bpo-35189: Fix eintr_tester.py (GH-10637)

Call setitimer() before each test method, instead of once per test
case, to ensure that signals are sent in each test method.
Previously, only the first method of a testcase class got signals.

Changes:

* Replace setUpClass() with setUp() and replace tearDownClass() with
  tearDown().
* tearDown() now ensures that at least one signal has been sent.
* Replace support.run_unittest() with unittest.main() which has
  a nicer CLI and automatically discover test cases.

(cherry picked from commit aac1f81)

* bpo-35189: Retry fnctl calls on EINTR (GH-10413)

Modify the following fnctl function to retry if interrupted by a signal
(EINTR): flock, lockf, fnctl.

(cherry picked from commit b409ffa)
Co-Authored-By: nierob <nierob@users.noreply.github.com>
(cherry picked from commit 56742f1)
@nierob (author) commented Nov 23, 2018:

Great, thank you!
