Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix another hang under ASAN #391

Merged
merged 1 commit into from
May 9, 2022
Merged

Fix another hang under ASAN #391

merged 1 commit into from
May 9, 2022

Conversation

psychon
Copy link
Contributor

@psychon psychon commented May 9, 2022

We have a long-running program that we want to run under sanitizers for
extra error checking and under faketime to speed up the clock. This
program hangs after a while. Backtraces suggest that the hangs occur
because of recursion in the memory allocator, which apparently locks a
non-recursive mutex.

Specifically, what we see is that due to our use of FAKETIME_NO_CACHE=1,
libfaketime wants to reload the config file inside a (random) call to
clock_gettime(). libfaketime then uses fopen() / fclose() for reading
the config files. These function allocate / free a buffer for reading
data and specifically the call to free() that happens inside fclose()
ends up calling clock_gettime() again. At this point, libfaketime locks
up because it has a time_mutex that is locked and none-recursive.

Sadly, I did not manage to come up with a stand-alone reproducer for
this problem. Also, the above description is from memory after half a
week of vacation, so it might be (partially) wrong.

More information can be found here:

This commit first adds a test case. This new test uses the already
existing libmallocintercept.so to cause calls to clock_gettime() during
memory allocation routines. The difference to the already existing
version is that additionally FAKETIME_NO_CACHE and
FAKETIME_TIMESTAMP_FILE are set. This new test hung with its last output
suggesting a recursive call to malloc:

Called malloc() from libmallocintercept...successfully
Called free() on from libmallocintercept...successfully
Called malloc() from libmallocintercept...Called malloc() from libmallocintercept...

Sadly, gdb was unable to provide a meaningful backtrace for this hang.

Next, I replaced the use of fopen()/fgets()/fgets() with
open()/read()/close(). This code no longer reads the config file
line-by-line, but instead it reads all of it at once and then "filters
out" the result (ignores comment lines, removes end of line markers).

I tried to keep the behaviour of the code the same, but I know at least
one difference: Previously, the config file was read line-by-line and
lines that began with a comment character were immediately ignored. The
new code only reads the config once and then removes comment lines.
Since the buffer that is used contains only 256 characters, it is
possible that config files that were previously parsed correctly now
longer parse. A specific example: if a file begins with 500 '#'
characters in its first line and then a timestamp in the second line,
the old code was able to parse this file while the new code would only
see an empty file.

After this change, the new test no longer hangs. Sadly, I do not
actually know its effect on the "actual bug" that I wanted to fix, but
since there are no longer any calls to fclose(), there cannot be any
hangs inside fclose().

Signed-off-by: Uli Schlachter psychon@znc.in


CC @LtdSauce FYI

We have a long-running program that we want to run under sanitizers for
extra error checking and under faketime to speed up the clock. This
program hangs after a while. Backtraces suggest that the hangs occur
because of recursion in the memory allocator, which apparently locks a
non-recursive mutex.

Specifically, what we see is that due to our use of FAKETIME_NO_CACHE=1,
libfaketime wants to reload the config file inside a (random) call to
clock_gettime(). libfaketime then uses fopen() / fclose() for reading
the config files. These function allocate / free a buffer for reading
data and specifically the call to free() that happens inside fclose()
ends up calling clock_gettime() again. At this point, libfaketime locks
up because it has a time_mutex that is locked and none-recursive.

Sadly, I did not manage to come up with a stand-alone reproducer for
this problem. Also, the above description is from memory after half a
week of vacation, so it might be (partially) wrong.

More information can be found here:

- #365 (comment)
- #365 (comment)

This commit first adds a test case. This new test uses the already
existing libmallocintercept.so to cause calls to clock_gettime() during
memory allocation routines. The difference to the already existing
version is that additionally FAKETIME_NO_CACHE and
FAKETIME_TIMESTAMP_FILE are set. This new test hung with its last output
suggesting a recursive call to malloc:

Called malloc() from libmallocintercept...successfully
Called free() on from libmallocintercept...successfully
Called malloc() from libmallocintercept...Called malloc() from libmallocintercept...

Sadly, gdb was unable to provide a meaningful backtrace for this hang.

Next, I replaced the use of fopen()/fgets()/fgets() with
open()/read()/close(). This code no longer reads the config file
line-by-line, but instead it reads all of it at once and then "filters
out" the result (ignores comment lines, removes end of line markers).

I tried to keep the behaviour of the code the same, but I know at least
one difference: Previously, the config file was read line-by-line and
lines that began with a comment character were immediately ignored. The
new code only reads the config once and then removes comment lines.
Since the buffer that is used contains only 256 characters, it is
possible that config files that were previously parsed correctly now
longer parse. A specific example: if a file begins with 500 '#'
characters in its first line and then a timestamp in the second line,
the old code was able to parse this file while the new code would only
see an empty file.

After this change, the new test no longer hangs. Sadly, I do not
actually know its effect on the "actual bug" that I wanted to fix, but
since there are no longer any calls to fclose(), there cannot be any
hangs inside fclose().

Signed-off-by: Uli Schlachter <psychon@znc.in>
@wolfcw
Copy link
Owner

wolfcw commented May 9, 2022

Thanks! I think the new way how the file is read resembles the previous behaviour close enough, so let's hope this also solves your "actual bug". :-)

@wolfcw wolfcw merged commit e7ca837 into wolfcw:master May 9, 2022
@psychon psychon deleted the asan2 branch May 9, 2022 19:11
@abrudin
Copy link

abrudin commented May 10, 2022

@psychon @wolfcw Just wanted to inform that we have successfully been running libfaketime in our integration tests for years, where we start and restart applications which reads and writes to faketimerc.

After this commit we suddenly started to get errors, where libfaketime crashes when reading from faketimerc with the following message:

libfaketime: In parse_ft_string(), failed to parse FAKETIME timestamp.
Please check specification  with format %Y-%m-%d %T

Using a build from the previous commit (2bfbe19) works fine. I don't know exactly what is going on, if there is some race condition or tearing happening when it tries to read the content of the file, but we don't write anything strange and have not changed anything.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants