
bpo-35189: Retry fnctl calls on EINTR #10413

Merged: 1 commit, Nov 23, 2018

Conversation

@nierob (Contributor) commented Nov 8, 2018:

According to PEP 475, Python should retry system calls when they are interrupted by a signal (EINTR).

https://bugs.python.org/issue35189
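The behavior PEP 475 prescribes can be sketched at the Python level. Note this is purely illustrative: the actual fix in this PR lives in C, and CPython runs pending signal handlers before retrying; the helper name here is invented:

```python
def retry_on_eintr(syscall, *args):
    """Retry a system call until it stops failing with EINTR.

    Illustrative sketch of the PEP 475 retry loop; CPython implements
    this internally in C, after running any pending signal handlers.
    """
    while True:
        try:
            return syscall(*args)
        except InterruptedError:
            # A signal interrupted the call (errno == EINTR); retry.
            continue
```

Before PEP 475, callers had to write such loops themselves; this PR moves the retry into the module's C implementation.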

@the-knights-who-say-ni (bot) commented:

Hello, and thanks for your contribution!

I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA).

Our records indicate we have not received your CLA. For legal reasons we need you to sign this before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue.

If you have recently signed the CLA, please wait at least one business day for our records to be updated.

You can check yourself to see if the CLA has been received.

Thanks again for your contribution, we look forward to reviewing it!

@vstinner (Member) left a comment:

Did you think about designing a test that triggers the bug in Lib/test/eintrdata/eintr_tester.py? A test that blocks on an operation while being spammed with SIGALRM signals every N milliseconds, and that somehow unblocks the function call later. You might use two processes for that. See eintr_tester.py, which has multiple examples doing exactly that.
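The signal-spamming setup described above can be sketched as a standalone POSIX-only snippet (an illustration of the technique, not the actual eintr_tester.py code; the period value is arbitrary):

```python
import signal
import time

signal_period = 0.01          # fire SIGALRM every 10 ms (arbitrary value)
received = []

def handler(signum, frame):
    received.append(signum)   # record each delivered signal

signal.signal(signal.SIGALRM, handler)
# First delivery after signal_period seconds, then repeating every signal_period.
signal.setitimer(signal.ITIMER_REAL, signal_period, signal_period)

# A blocking call; thanks to PEP 475 it is transparently retried on EINTR,
# so it still sleeps for the full duration despite the interruptions.
time.sleep(0.1)

signal.setitimer(signal.ITIMER_REAL, 0, 0)   # disarm the timer
print("signals received:", len(received))
```

The real test replaces `time.sleep()` with the blocking call under test and checks that it survives the interruptions.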

@vstinner (Member) left a comment:

Can you please try to add a NEWS entry using the "blurb" command? (python3 -m pip install [--user] blurb)

For the commit message and NEWS entry, I would suggest listing the modified functions. And instead of saying "Python should", just describe what the change does, like "Modify the following fnctl functions to retry if interrupted by a signal (EINTR error): ...".

@nierob (author) commented Nov 9, 2018:

@vstinner OK, I will update the message and try to write tests.

@nierob (author) commented Nov 9, 2018:

OK, I have managed to write some tests, but I have a feeling that the way this is tested is flaky by design (sleep-based synchronization).

@nierob changed the title from "Issue35189: Retry fnctl calls on EINTR" to "bpo-35189: Retry fnctl calls on EINTR" on Nov 9, 2018
if (ret < 0) {
    PyErr_SetFromErrno(PyExc_OSError);
    return NULL;
@nierob (author) commented:

That is something I do not understand: why the result of PyErr_SetFromErrno(PyExc_OSError) was not returned before. The new code is now in sync with the other modules.


    lock_func(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    lock_func(f, fcntl.LOCK_UN)
except BlockingIOError:
    break
@vstinner (Member) commented:

This loop doesn't seem reliable. Is it possible to acquire the lock twice here? Maybe unblock the lock in the child, or wait until the child has started, something like that. Or maybe use a pipe. I don't know.

@nierob (author) commented:

In what way? As far as I understand, the exception can only be raised if the other process is in the right state (has managed to acquire the lock). I do see that it could yield after unlocking, to improve performance.

@vstinner (Member) commented:

Sorry, I misunderstood how file locks work. They are reentrant. Ignore my comment.
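The behavior being referred to can be checked directly: flock-style locks belong to the open file description, so taking the lock again through the same file object simply succeeds instead of blocking (a POSIX-only sketch, not part of the PR):

```python
import fcntl
import tempfile

with tempfile.TemporaryFile() as f:
    fcntl.flock(f, fcntl.LOCK_EX)
    # Re-acquiring through the same file object neither blocks nor raises
    # BlockingIOError: the lock is tied to the open file description.
    fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    fcntl.flock(f, fcntl.LOCK_UN)
    relock_ok = True
print("re-lock on the same file object succeeded:", relock_ok)
```

This is why the test needs a second process: only a lock held by a different process makes the non-blocking attempt raise BlockingIOError.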

@nierob (author) commented Nov 13, 2018:

I'm sorry, I fail to see why this loop seems unreliable to you; could you give me a bit more detail? There are two ways to get out of the loop: through interruption, or because of BlockingIOError. The first one is a test failure anyway, so it can be ignored; the other one means that the subprocess managed to acquire the lock, so we can start the real testing.

    break
with kill_on_error(proc):
    t0 = time.monotonic()
    lock_func(f, fcntl.LOCK_EX)
@nierob (author) commented:

Here is a race condition that is shared with all the other tests in this file: we cannot make sure that the function really blocked for at least some time. The race condition would be exposed under heavy load, if the main process were not given resources for at least self.sleep_time.

@vstinner (Member) commented:

Tests are run sequentially, so it's not an issue. Moreover, regrtest ensures that TESTFN is unique between two worker processes running Python tests.


self.assertGreaterEqual(dt, self.sleep_time)
self.stop_alarm()
support.unlink(support.TESTFN)
proc.wait()
@vstinner (Member) commented:

I proposed a better version of the test:

    def _lock(self, lock_func, lock_name):
        self.addCleanup(support.unlink, support.TESTFN)

        # Lock TESTFN for 'sleep_time' seconds
        code = '\n'.join((
            "import fcntl, time",
            "with open('%s', 'wb') as f:" % support.TESTFN,
            "   fcntl.%s(f, fcntl.LOCK_EX)" % lock_name,
            "   time.sleep(%s)" % self.sleep_time,
            "   fcntl.%s(f, fcntl.LOCK_UN)" % lock_name,
        ))

        start_time = time.monotonic()
        proc = self.subprocess(code)
        with kill_on_error(proc):
            with open(support.TESTFN, 'wb') as f:
                # wait until the child locked the file
                while True:  # synchronize the subprocess
                    dt = time.monotonic() - start_time
                    if dt > 60.0:
                        raise Exception("failed to sync child in %.1f sec" % dt)
                    try:
                        lock_func(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
                        lock_func(f, fcntl.LOCK_UN)
                        time.sleep(0.001)
                    except BlockingIOError:
                        break

                # the child locks the file for 'sleep_time' seconds
                lock_func(f, fcntl.LOCK_EX)
                dt = time.monotonic() - start_time
                self.assertGreaterEqual(dt, self.sleep_time)

                self.stop_alarm()
                proc.wait()
  • Ensure that TESTFN is always removed
  • Use kill_on_error(proc) around more code
  • Put a deadline of 60 seconds on the synchronization with the child
  • Fix the computation of the time delta

@nierob (author) commented:

I will apply 1-3, thanks!

I believe the 4th is a regression: only after BlockingIOError can we assume that the subprocess acquired the lock, so start_time should be reset there.

@nierob (author) commented:

Hmm, I have also realized that self.assertGreaterEqual(dt, self.sleep_time) is a bit too aggressive.

        time.sleep(0.001)
    except BlockingIOError:
        break
start_time = time.monotonic()
@vstinner (Member) commented:

This is wrong. You really need to start counting time before spawning the child process. Try adding "sleep(1)" before start_time to simulate a slow buildbot. Please don't add a test that requires perfect timing and perfect scheduling; that never happens on our slow buildbots...

@nierob (author) commented:

Hmm, right, it is wrong both ways.

If we reset the start time, we risk the reset happening a bit too late. If we do not reset it, we risk accepting a regression, because we add the "synchronization" time.

@nierob (author) commented:

We would need a way to confirm that the signal was handled while the lock call itself was not interrupted.

@nierob (author) commented:

I am starting to think that this can only be tested with a private API or some other hook in the library.

@vstinner requested a review from pablogsal, November 13, 2018 11:38
@vstinner (Member) commented:

Don't try to overengineer the test. It's fine to have a "best effort" test. We ensure that LOCK_EX was stuck for a few milliseconds while the process received a lot of SIGALRM signals: that's enough.

If you write a "perfect" unit test, it will fail on the buildbots, and I will not merge such a test.

@nierob (author) commented Nov 13, 2018:

I do not mind "best effort"; my intention was not to rely on perfect scheduling or timing. I have just focused on the other race condition :-)

We ensure that LOCK_EX has been stuck a few milliseconds

That is part of the problem: how do you do it? We cannot do it reliably, as we would have to ensure that the subprocess was holding the lock while the main process was trying to acquire it. Since I do not think we can enforce that without internals or querying the OS, best effort means accepting false negatives. Fine by me :-)

while the process received a lot of SIGALRM signals

A lot in this case most likely means 1, maybe 2, because signal_period is 0.1 while sleep_time is 0.2.


Have you seen that I added a 0.9 factor in self.assertGreaterEqual(dt, self.sleep_time * 0.9)?

The solution you proposed (without resetting start_time) increases dt, while the other one decreases self.sleep_time. Conceptually these are the same. I slightly prefer mine (surprise!) because it is deterministic (0.02 instead of the execution time of a random, somewhat unrelated loop) and explicit, but I'm fine with changing it back; it is really minor from my perspective.

@vstinner (Member) left a comment:

Have you seen that I added a 0.9 factor in self.assertGreaterEqual(dt, self.sleep_time * 0.9)?

Don't do that. In my experience, such magic numbers always make the test fail randomly on random buildbots. Put start_time where I proposed.

@bedevere-bot commented:

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

@vstinner (Member) commented:

We ensure that LOCK_EX has been stuck a few milliseconds

That is part of problem, how do you do it? We can not do it reliably as we would have to ensure that subprocess was holding the lock while the main process was trying to acquire it (...)

This is what I call "best effort". It's not possible to write perfect synchronization code. Good news: we don't need it.

while the process received a lot of SIGALRM signals

A lot in this case means most likely 1 maybe 2. That is because signal_period is 0.1, while sleep_time is 0.2.

Yep, that's enough. If you are not sure, unapply your C fix and make sure that the test fails.

@vstinner (Member) commented:

test_eintr is very fragile. There are known race conditions. I would prefer to remove the whole file rather than add a new test with a known race condition.

@vstinner (Member) commented:

Have you seen that I added a 0.9 factor in self.assertGreaterEqual(dt, self.sleep_time * 0.9)?

Don't do that. In my experience, such magic numbers always make the test fail randomly on random buildbots. Put start_time where I proposed.

Let me elaborate. I have been maintaining the Python CIs for 5 years, and I have tried such workarounds many times. They always failed in the long term. There is always a buildbot with "bad timing", and the test fails randomly.

@nierob (author) commented Nov 16, 2018:

I have made the requested changes; please review again

@bedevere-bot commented:

Thanks for making the requested changes!

@vstinner: please review the changes made to this pull request.

@nierob (author) commented Nov 21, 2018:

I have made the requested changes; please review again

@bedevere-bot commented:

Thanks for making the requested changes!

@vstinner: please review the changes made to this pull request.

@vstinner (Member) commented:

When I tested the test manually, I noticed that the second test method didn't get any signal :-( There is a bug in the base test case! I wrote PR #10637 to fix it. I would prefer to merge PR #10637 first, and then rebase this PR on top of master.

@vstinner (Member) commented:

When I tested the test manually, I noticed that the second test method didn't get any signal :-( There is a bug in the base test case! I wrote PR #10637 to fix it, and I just merged it.

Would you mind rebasing your PR on master (git merge or git rebase, up to you) to pick up this fix?

@nierob (author) commented Nov 23, 2018:

Sorry for the delay. No problem.

Modify the following fnctl function to retry if interrupted by a signal
(EINTR): flock, lockf, fnctl
@vstinner (Member) commented:

I will backport the change to 3.6 and 3.7. First, PR for 3.7: PR #10678.

vstinner added a commit that referenced this pull request Nov 23, 2018
* bpo-35189: Fix eintr_tester.py (GH-10637)

Call setitimer() before each test method, instead of once per test
case, to ensure that signals are sent in each test method.
Previously, only the first method of a testcase class got signals.

Changes:

* Replace setUpClass() with setUp() and replace tearDownClass() with
  tearDown().
* tearDown() now ensures that at least one signal has been sent.
* Replace support.run_unittest() with unittest.main() which has
  a nicer CLI and automatically discover test cases.

(cherry picked from commit aac1f81)

* bpo-35189: Retry fnctl calls on EINTR (GH-10413)

Modify the following fnctl function to retry if interrupted by a signal
(EINTR): flock, lockf, fnctl.

(cherry picked from commit b409ffa)
Co-Authored-By: nierob <nierob@users.noreply.github.com>
vstinner added a commit that referenced this pull request Nov 23, 2018
…-10685)

* bpo-35189: Fix eintr_tester.py (GH-10637)

Call setitimer() before each test method, instead of once per test
case, to ensure that signals are sent in each test method.
Previously, only the first method of a testcase class got signals.

Changes:

* Replace setUpClass() with setUp() and replace tearDownClass() with
  tearDown().
* tearDown() now ensures that at least one signal has been sent.
* Replace support.run_unittest() with unittest.main() which has
  a nicer CLI and automatically discover test cases.

(cherry picked from commit aac1f81)

* bpo-35189: Retry fnctl calls on EINTR (GH-10413)

Modify the following fnctl function to retry if interrupted by a signal
(EINTR): flock, lockf, fnctl.

(cherry picked from commit b409ffa)
Co-Authored-By: nierob <nierob@users.noreply.github.com>
(cherry picked from commit 56742f1)
@nierob (author) commented Nov 23, 2018:

Great, thank you!
