# add -z/--null-data flag for reading large binary files #993
This is wrong, because lines are variable length. This is one reason why ripgrep (and GNU grep) do binary data detection. If a NUL byte is found, then by default, ripgrep gives up, because binary data printed to terminals is bad but also because binary data doesn't have a newline distribution similar to plain text files, which can result in allocating a lot of memory.

When crawling a large directory, ripgrep should be using a fixed size buffer with standard `read` calls. There are only two ways that this buffer can increase beyond its default 8KB capacity: 1) it needs to make room for contextual lines, which only happens when the `-A/-B/-C` flags are used or 2) there is a line longer than 8KB.

With that said, ripgrep has other ways of using memory. For example, its parallel searcher may be too eager and fill up heap space with the file tree. Typically this is small enough that it doesn't matter, but you might consider using the `-j1` flag (to force single threaded mode) to see if that helps memory usage.

Another way ripgrep uses memory is for output. Namely, when using multithreading, ripgrep buffers the entire output of searching a file in memory before dumping it to stdout in order to prevent the interleaving of matches. So for example, if you're searching a file bigger than 8GB with a pattern that matches everything, then you're going to exhaust available memory.

That's not how OOM conditions happen on Linux, and at the very minimum, depends on your system settings for overcommit. Moreover, your suggested error messages are a bit weird. Firstly, "file too big" isn't really related at all; what matters is its distribution of bytes, which can't be known until it's read. So the only thing to do is to attempt to search it. Secondly, "your pattern is too complex" is already accounted for. You'll get an error if you try to search with a pattern that is too big.

I don't foresee ripgrep doing anything to make the OOM killer failure mode any better. By the time the OOM killer has ripgrep in its sights, it's too late. It is the user's responsibility to become familiar with the failure modes of OOM killers on systems with OOM killers enabled. Moreover, that ripgrep is the target of an OOM killer does not necessarily imply that ripgrep is the agitator. I note that no part of your issue mentions the observed peak memory usage of ripgrep, which is a critical piece of information when reporting a bug about an OOM condition.

If you don't have an OOM killer and your system allocators don't lie, then ripgrep should indeed exit with a more descriptive error. Here are some things to try:

1. Use `-j1` to force single threaded mode.
2. Compare `--mmap` with `--no-mmap`.
3. Investigate the memory usage of GNU grep on the same directory.
4. Investigate the memory usage of ripgrep and GNU grep with the `-a` flag set.
On 07/26/2018 02:51 AM, Andrew Gallant wrote:
> > so afaiu theoretically the search could be done with very limited memory use in 3 passes for the output, forwards until the pattern, then backwards until a newline, then forwards again while outputting until the next newline
>
> This is wrong, because lines are variable length. This is one reason why ripgrep (and GNU grep) do binary data detection. If a NUL byte is found, then by default, ripgrep gives up, because binary data printed to terminals is bad but also because binary data doesn't have a newline distribution similar to plain text files, which can result in allocating a lot of memory.
Well, even with lines of variable lengths, I'm pretty sure it's possible
to do the job in O(1) memory regardless of the file or contents of the
file. For instance, an algorithm (pseudo-code, obviously not optimized,
nor checked for its likely multiple bugs, just for the idea):
```
fn handle_file(f: File, r: Regex) {
    loop {
        // Find the next occurrence of the pattern
        loop {
            if let Some(c) = read_1_char_from_file(f) {
                match r.feed_1_char_to_FSA(c) {
                    FoundPattern => break,
                    NotFoundYet => (),
                }
            } else {
                return;
            }
        }
        // Figure out the beginning of the line
        loop {
            if let Some(c) = read_1_char_backwards_from_file(f) {
                if c == '\n' {
                    break;
                }
            } else {
                break;
            }
        }
        // Output until the end of the line
        loop {
            if let Some(c) = read_1_char_from_file(f) {
                putchar(c);
                if c == '\n' {
                    break;
                }
            } else {
                break;
            }
        }
    }
}
```
This requires the ability to seek a file backwards, but for files on
disk this shouldn't be an issue? (apart from the added code complexity,
of course)
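To make the sketch concrete, here is a hedged, runnable version of the same three-pass idea (an illustration only, not ripgrep code; it substitutes a naive byte-by-byte substring scan for a real regex engine, which is exactly the part that doesn't generalize). It works over any `Read + Seek` source using constant heap space apart from the final collected line:

```rust
use std::io::{Cursor, Read, Seek, SeekFrom};

// Sketch of a three-pass, seek-based search for a fixed byte pattern.
// Pass 1 scans forward for the pattern, pass 2 seeks backwards to the
// start of the matching line, pass 3 re-reads that line. The line is
// collected into a Vec here for demonstration; streaming its bytes to
// stdout instead would keep memory usage O(1).
fn find_line_containing<R: Read + Seek>(
    src: &mut R,
    pat: &[u8],
) -> std::io::Result<Option<Vec<u8>>> {
    let mut buf = [0u8; 1];
    // Pass 1: naive O(n*m) forward scan, one byte at a time.
    let mut start = src.seek(SeekFrom::Current(0))?;
    'outer: loop {
        src.seek(SeekFrom::Start(start))?;
        for &p in pat {
            if src.read(&mut buf)? == 0 {
                return Ok(None); // EOF without a match
            }
            if buf[0] != p {
                start += 1;
                continue 'outer;
            }
        }
        break; // pattern matched at offset `start`
    }
    // Pass 2: walk backwards from the match to the preceding newline.
    let mut pos = start;
    while pos > 0 {
        src.seek(SeekFrom::Start(pos - 1))?;
        src.read_exact(&mut buf)?;
        if buf[0] == b'\n' {
            break;
        }
        pos -= 1;
    }
    // Pass 3: re-read forward, collecting the matching line.
    src.seek(SeekFrom::Start(pos))?;
    let mut line = Vec::new();
    loop {
        if src.read(&mut buf)? == 0 || buf[0] == b'\n' {
            break;
        }
        line.push(buf[0]);
    }
    Ok(Some(line))
}

fn main() -> std::io::Result<()> {
    let mut cur = Cursor::new(&b"alpha\nneedle here\nomega\n"[..]);
    let line = find_line_containing(&mut cur, b"needle")?.unwrap();
    assert_eq!(line, b"needle here");
    println!("{}", String::from_utf8_lossy(&line));
    Ok(())
}
```

Note that this only handles a literal pattern; doing the same for a general regex in streaming mode is the hard research problem discussed below.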
> When crawling a large directory, ripgrep should be using a fixed size buffer with standard `read` calls. There are only two ways that this buffer can increase beyond its default 8KB capacity: 1) it needs to make room for contextual lines, which only happens when the `-A/-B/-C` flags are used or 2) there is a line longer than 8KB.
>
> With that said, ripgrep has other ways of using memory. For example, its parallel searcher may be too eager and fill up heap space with the file tree. Typically this is small enough that it doesn't matter, but you might consider using the `-j1` flag (to force single threaded mode) to see if that helps memory usage.
Here I'm grepping through a single 100G-long partition pseudo-file,
`/dev/nvme0n1p3`, so I guess that's not the issue
> > Or ripgrep could display an error message saying “Sorry we couldn't allocate enough memory because {your file is too big, your pattern is too complex, …}”, so that the reason why ripgrep is OOM'ing is clear.
>
> That's not how OOM conditions happen on Linux, and at the very minimum, depends on your system settings for overcommit.
>
> I don't foresee ripgrep doing anything to make the OOM killer failure mode any better. By the time the OOM killer has ripgrep in its sights, it's too late. It is the user's responsibility to become familiar with the failure modes of OOM killers on systems with OOM killers enabled. Moreover, that ripgrep is the target of an OOM killer does not necessarily imply that ripgrep is the agitator. I note that no part of your issue mentions the observed peak memory usage of ripgrep, which is a critical piece of information when reporting a bug about an OOM condition.
Indeed, it was buried in the dmesg output paste of the OOM killer;
sorry for not having thought of making it easier to see!
The line relevant to ripgrep there is:
```
uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
0 30187 4206947 1573363 5151 14 0 0 rg
```
> If you don't have an OOM killer and your system allocators don't lie, then ripgrep should indeed exit with a more descriptive error.
> Here are some things to try:
> 1. Use `-j1` to force single threaded mode.
> 2. Compare `--mmap` with `--no-mmap`.
> 3. Investigate the memory usage of GNU grep on the same directory.
> 4. Investigate the memory usage of ripgrep and GNU grep with the `-a` flag set.
Oh sorry it looks like I forgot the `-a` in my previous message's command!
# Detailed analysis
The command run is searching for `qwertyuiop` on `/dev/nvme0n1p3` (i.e. a
single 100G-long file). Each section below will be named
`exec-and-options`, which will correspond to a run of `exec-and-options
qwertyuiop /dev/nvme0n1p3`. Measurements made with the `psrecord` Python
package.
## `grep`
Returns with no output. Memory and CPU usage: http://xelpaste.net/1uV0rQ
## `rg`
Returns with no output in less than 1 second, so no graph.
## `rg -a`
OOM-killed, displayed on console: “Killed”. http://xelpaste.net/u3nMpg
## `rg -j1 -a`
OOM-killed, displayed on console: “Killed”. http://xelpaste.net/69UhjH
BTW, I don't really get why it's still at ~200% CPU, but anyway.
## `rg --mmap -a`
OOM-killed, same message. http://xelpaste.net/Udv1jV
## `rg --no-mmap -a`
OOM-killed, same message. http://xelpaste.net/7dqK8w
## `grep -a`
OOM-killed, displayed on console “grep: memory exhausted”. Note I think
this is already a much better error message than “Killed” :)
http://xelpaste.net/NqHnr1
## `grep --only-matching --byte-offset --binary --text`
OOM-killed, same message. http://xelpaste.net/9L87gY
## `grep --only-matching --byte-offset --binary -z --text`
Works! But not reproducible with `rg` due to no `--byte-offset` or `-z`
parameter I could easily find.
http://xelpaste.net/uVToz4
## `grep --byte-offset --binary -z --text`
Works too, gives a bit of context. http://xelpaste.net/3Xqfiz
Given the fact that the memory allocation seems to be done by blocks of
4G (I guess the size of the line, for a line buffer), I guess the above
(slower due to being 3-pass) algorithm for using only a fixed amount of
memory could be triggered when trying to allocate more than eg. 1G of
memory? Either that, or just refusing to allocate a 1G line buffer
without a special flag, given it's quite sure that's not what the user
actually wants.
What do you think about that?
Ugh. Looks like GitHub doesn't like Markdown by email. So some text is hidden under the “signature” marker, I'll paste it here for ease of reading:

Given the fact that the memory allocation seems to be done by blocks of 4G (I guess the size of the line, for a line buffer), I guess the above (slower due to being 3-pass) algorithm for using only a fixed amount of memory could be triggered when trying to allocate more than eg. 1G of memory? Either that, or just refusing to allocate a 1G line buffer without a special flag, given it's quite sure that's not what the user actually wants. What do you think about that?
Let me stop you right there and finish that thought: "I'm pretty sure it's possible... after years of work and research to rewrite ripgrep and its regex engine from the ground up." By "research" I mean, "figure out how to efficiently use a finite state machine in streaming mode." Your example code doesn't cut it unfortunately. I don't know of any extant production grade regex engine that uses finite state machines and satisfies the criterion of being able to handle arbitrarily sized inputs in constant space while still meeting the requirements of ripgrep.

I have an extensive write up here: rust-lang/regex#425 (comment)

Can we drop the elementary lessons on finite state machines? Thanks. Let's please focus on the problem.
Yes, that's critical. When the
Weird. `-j1` on my machine shows it is using 100%. This actually sounds like a bug, but probably unrelated to this issue.
I would agree, and I find the behavior difference here interesting. If GNU grep can detect a failed `malloc`, then I'd expect ripgrep to be able to do the same, and subsequently abort on its own. One possible difference is that GNU grep is using the system allocator while ripgrep uses jemalloc. I'm not sure off hand if that would make a difference.

With that said, the fact that GNU grep is exhausting memory here to me means that ripgrep is working as intended.
OK, now we're getting somewhere. This basically confirms my diagnosis. Namely, the `-z` flag is the key ingredient, because it effectively changes GNU grep's line terminator from `\n` to `\x00`. Given a raw data dump, the distribution of `\x00` is probably quite common, which means GNU grep _just happens_ to work well here. If you had a data dump where both `\n` and `\x00` were rare, then you'd be witnessing the same sort of memory exhaustion.

This is good news, because my work on libripgrep should make it much easier to support the same `-z/--null-data` flag that you're using in GNU grep, so I'll accept this as a bug that should hopefully get fixed when I pull in libripgrep.

I can't think of any simple work around until then unfortunately. (Other than, of course, just using grep, which is probably going to perform similarly as ripgrep for this sort of task given that it's assuredly I/O bound.)

Note that I suspect the `--byte-offset` and `--binary` flags are superfluous here. You should be able to just use `-z -a` to get a working grep.
> > I'm pretty sure it's possible
>
> Let me stop you right there and finish that thought:
>
> "I'm pretty sure it's possible... after years of work and research to rewrite ripgrep and its regex engine from the ground up." By "research" I mean, "figure out how to efficiently use a finite state machine in streaming mode." Your example code doesn't cut it unfortunately. I don't know of any extant production grade regex engine that uses finite state machines and satisfies the criterion of being able to handle arbitrarily sized inputs in constant space while still meeting the requirements of ripgrep.
>
> I have an extensive write up here: rust-lang/regex#425 (comment)
I don't want to argue with that in the general case. However, for search
parameters as simple as the `qwertyuiop` given in this example, I'm
positive it can be done in 3-pass with a constant buffer, with the
stupid algorithm I gave above. Not saying the code I'd write would be as
fast as ripgrep's, but at least it would eventually give an output.
Actually, I'm almost sure that what I'm saying is also what you wrote in
your write-up.
Also, I'd like to mention that in my first post I already had the “which
would likely be best, but quite potentially require large changes to the
codebase”.
> Can we drop the elementary lessons on finite state machines? Thanks. Let's please focus on the problem.
Can we drop aggressiveness too? I must say I've felt like the first
three words of your first answer being “This is wrong” didn't set the
right tone, but I hoped it was due to my mis-using English in the top
post, given you also appeared to think I was searching in a big
directory while I thought I had written that it was on a partition file.
Now… well, let's say I'd hope we could talk in a friendly manner.
Anyway, I must say I likely won't answer again if I feel once more like
we're just misunderstanding each other and I'm just bothering you :)
> > BTW, I don't really get why it's still at ~200% CPU, but anyway.
> Weird. `-j1` on my machine shows it is using 100%. This actually sounds like a bug, but probably unrelated to this issue.
I thought of reporting it as a bug, but given it was reported by
`psrecord`, a software which I don't know at all, I thought it may just
as well be a bug of the tool. I'll try to have a look at htop next time
I get to that computer :)
> > OOM-killed, displayed on console “grep: memory exhausted”. Note I think
> > this is already a much better error message than “Killed” :)
> I would agree, and I find the behavior difference here interesting. If GNU grep can detect a failed `malloc`, then I'd expect ripgrep to be able to do the same, and subsequently abort on its own. One possible difference is that GNU grep is using the system allocator while ripgrep uses jemalloc. I'm not sure off hand if that would make a difference.
>
> With that said, the fact that GNU grep is exhausting memory here to me means that ripgrep is working as intended.
Well, it depends on the goal of ripgrep. If it's to be “grep, but
faster”, then indeed. However, the readme also mentions “fewer bugs”,
and I'd argue handling OOM failures better than grep would be something
going in this direction.
By “handling OOM failures better than grep”, I'm thinking mostly of
dropping back to the slower regex engine when it could be detected that
the faster regex engine would likely trigger an OOM (eg. when it tries
to allocate >= 1GB RAM), if I understood correctly and your extensive
write-up mentioned that the slower regex engine was less memory-consuming.
> > `grep --only-matching --byte-offset --binary -z --text`
> >
> > Works! But not reproducible with `rg` due to no `--byte-offset` or `-z`
> > parameter I could easily find.
> >
> > http://xelpaste.net/uVToz4
> >
> > `grep --byte-offset --binary -z --text`
> >
> > Works too, gives a bit of context. http://xelpaste.net/3Xqfiz
> OK, now we're getting somewhere. This basically confirms my diagnosis. Namely, the `-z` flag is the key ingredient, because it effectively changes GNU grep's line terminator from `\n` to `\x00`. Given a raw data dump, the distribution of `\x00` is probably quite common, which means GNU grep _just happens_ to work well here. If you had a data dump where both `\n` and `\x00` were rare, then you'd be witnessing the same sort of memory exhaustion.
>
> This is good news, because my work on libripgrep should make it much easier to support the same `-z/--null-data` flag that you're using in GNU grep, so I'll accept this as a bug that should hopefully get fixed when I pull in libripgrep.
>
> I can't think of any simple work around until then unfortunately. (Other than, of course, just using grep, which is probably going to perform similarly as ripgrep for this sort of task given that it's assuredly I/O bound.)
>
> Note that I suspect the `--byte-offset` and `--binary` flags are superfluous here. You should be able to just use `-z -a` to get a working grep.
That's most likely it, indeed; the `--byte-offset` was useful for me to
then use `dd` and try to figure out what was around the pattern in the
partition image.
Anyway, supporting the `-z` flag sounds like a good work-around, even
though it isn't solving the underlying issue of memory consumption
(that, granted, would be really hard to solve) :)
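As a side note, the buffering intuition behind `-z` can be sketched in a few lines (an illustration of the explanation above, not ripgrep's or grep's actual code): a line-oriented reader accumulates bytes until it sees the terminator, so peak buffer size tracks the longest "line"; with NUL as the terminator, records in a raw binary dump stay short.

```rust
use std::io::{BufRead, Cursor};

// Measure the longest "record" when splitting `data` on `terminator`.
// This mimics how a line-oriented searcher's buffer grows: it must hold
// a full record before it can process it.
fn longest_record(data: &[u8], terminator: u8) -> usize {
    let mut cur = Cursor::new(data);
    let mut buf = Vec::new();
    let mut longest = 0;
    loop {
        buf.clear();
        // read_until pulls bytes into `buf` up to and including `terminator`
        let n = cur.read_until(terminator, &mut buf).unwrap();
        if n == 0 {
            break; // EOF
        }
        longest = longest.max(n);
    }
    longest
}

fn main() {
    // Binary-ish data: NULs are common, newlines absent.
    let data = b"abc\0defg\0";
    // With '\n' as terminator, the entire input is one 9-byte "line".
    assert_eq!(longest_record(data, b'\n'), 9);
    // With NUL as terminator (grep -z), records stay short.
    assert_eq!(longest_record(data, 0), 5);
    println!("ok");
}
```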
Again, thank you for your work on ripgrep!
Apologies for the aggression, my frustration got the better of me. Your comments have not been easy to read or respond to. As such, I will try to be brief and to the point.

* I will not be spending my time improving the failure mode of OOM conditions.
* I agree that GNU grep's failure mode is better.
* I will not be beholden to nitpicks about particular wordings of project goals. (The word "goal" doesn't appear once in the project README, largely to avoid that style of argumentation, which I find incredibly frustrating as a maintainer.)
* I would be happy if someone else wanted to improve this failure mode, so long as it didn't impose any significant complications on the code.
* It is technically true that some special cases could be implemented in constant space. We don't need to go over how this could be done in a simplistic scenario, because that's not what's interesting. What is interesting are the practical ramifications of implementing those special cases along with maintaining the general case implementation. This is why I jumped to the general case; if it can be solved then all searches can benefit regardless of whether they are special or not. As such, I will not be investing any of my time in the near future to fixing special cases.
* My long term goal is to attempt to address this in the regex engine, if it's possible. If that effort succeeds, then it may bear fruit that ripgrep can benefit from.
* I don't see any reason to track this particular issue, as the next immediate work here comes in the regex engine, which is already being tracked.
I compiled a version of ripgrep that uses the system allocator, and my guess seems to be right: it yields a better error message. With jemalloc (default):
With system allocator:
It will be a while before ripgrep can move to the system allocator by default, but when that's possible, we might pursue that option. Playing around with it a little bit does actually seem to cause some small performance regressions in cases, but we can cross that bridge when we get there.
OK, I decided to do an experiment and actually look at what's going on in terms of allocation, since I wanted to know whether moving to the system allocator would actually be a plausible solution to this problem. The short answer is, no, it wouldn't be. Here is the strace with the system allocator:
Notice that we get an explicit error from the allocator. Now here's the strace for jemalloc:
This is where we get OOM killed. In order to make sense of what's happening here, we actually need to inspect our system's overcommit setting.
In my case, my system is using the default. So, does this mean switching to the system allocator gives us a nice error message? The answer here is, "probably, if overcommit is set to the default, but otherwise, no, it does not." For example, if we set overcommit to a different mode:
Which means we're back to bad error messages, even with the system allocator. Moreover, GNU grep exhibits exactly the same behavior as ripgrep in this configuration. Interestingly, it is possible to get a good error message even with jemalloc, and this can be done by setting the overcommit mode appropriately:
Interestingly, jemalloc actually seems to react to the change in the overcommit setting. So... I don't think there is much to be done for the OOM condition specifically. If you want better out-of-memory errors, then it seems like you just need to disable overcommit.
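For what it's worth, an application can also surface a failed large reservation itself, independent of which allocator is in use. The sketch below uses Rust's fallible `Vec::try_reserve` (a std API stabilized well after this thread; shown purely as an illustration, not something ripgrep does). Note that it cannot help when overcommit makes the reservation appear to succeed and the process is only killed later, on first touch:

```rust
// Attempt to allocate a line buffer of `len` bytes, turning allocation
// failure into a descriptive error instead of an abort or an OOM kill.
fn alloc_line_buffer(len: usize) -> Result<Vec<u8>, String> {
    let mut buf = Vec::new();
    // try_reserve reports failure (capacity overflow or allocator error)
    // as a Result instead of aborting the process.
    buf.try_reserve(len)
        .map_err(|e| format!("failed to allocate {} byte line buffer: {}", len, e))?;
    buf.resize(len, 0);
    Ok(buf)
}

fn main() {
    // A reasonable allocation succeeds.
    assert!(alloc_line_buffer(8 * 1024).is_ok());
    // An absurd one fails cleanly with a message instead of killing the
    // process (the request exceeds any plausible address space).
    assert!(alloc_line_buffer(usize::MAX / 2).is_err());
    println!("ok");
}
```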
# Meta-talk
Sorry about that! I re-read my first post and indeed I at the same time left parts of the issue template unfilled and forgot to mention the
Sorry also if I appeared to be arguing on the wording of the README too, I only wanted to point out that if the goal of ripgrep was also to have fewer failing conditions, then such a change would be a step in the right direction -- at the cost of much more complex code indeed. Most “improvements” are trade-offs between code complexity and the usefulness of the improvement, and I can understand that this one doesn't pass your threshold; I was just surprised to read that an OOM was “working as intended” :)

# Long-term goal
Oh I didn't know that! This sounds great indeed :)

# Medium-term work-around

(short-term work-around being setting the system to no overcommit)
Hmm… my reading of your post sounds like it would be a solution for the default case, which should be the most frequent one? Documentation about overcommit seems (to me) to imply that it should not be set to 1 by default in any non-specialist distribution, and people manually changing their overcommit settings should expect the effects of these settings (ie. the kernel lying to applications requiring memory) to potentially change the behaviour of applications.
Apart from using the system allocator (great finding!), I can think of two potential other options:
In both cases, I must say I have no idea of how hard it would be to integrate in jemalloc or the system allocator (even though idea 1 could likely be done with a wrapper around
Thanks for the follow up. I don't have the energy to debate the various details or your proposed solutions (none of which I'd be comfortable with), but I think ripgrep's current behavior on OOM is consistent with your
… wait, you brought up
Did I completely misunderstand your post on this matter? Anyway, this changes nothing about the fact that you feel uncomfortable with the two proposed solutions, and this ends the debate, so let's have this issue track only
This commit updates the CHANGELOG to reflect all the work done to make libripgrep a reality.

* Closes #162 (libripgrep)
* Closes #176 (multiline search)
* Closes #188 (opt-in PCRE2 support)
* Closes #244 (JSON output)
* Closes #416 (Windows CRLF support)
* Closes #917 (trim prefix whitespace)
* Closes #993 (add --null-data flag)
* Closes #997 (--passthru works with --replace)
* Fixes #2 (memory maps and context handling work)
* Fixes #200 (ripgrep stops when pipe is closed)
* Fixes #389 (more intuitive `-w/--word-regexp`)
* Fixes #643 (detection of stdin on Windows is better)
* Fixes #441, Fixes #690, Fixes #980 (empty matching lines are weird)
* Fixes #764 (coalesce color escapes)
* Fixes #922 (memory maps failing is no big deal)
* Fixes #937 (color escapes no longer used for empty matches)
* Fixes #940 (--passthru does not impact exit status)
* Fixes #1013 (show runtime CPU features in --version output)
@BurntSushi, thank you for adding
What version of ripgrep are you using?
Version 0.8.1 as built by NixOS
Describe your question, feature request, or bug.
Hmm, so I think I just hit an issue similar (but not identical) to #458, when running `rg [a string not on the disk] /dev/nvme0n1p3` (this being a ~100G partition, on an 8G-RAM machine). The difference being that, here, rg gets OOM-killed in around 1 minute. Dmesg output of the OOM killer: http://xelpaste.net/1P14FN
If this is a bug, what are the steps to reproduce the behavior?
Run `rg [some search pattern] [some big partition]`. That said, I can't reproduce on the computer on which I'm typing this, which has `sda*` partitions, and for which the search finishes in a matter of seconds, so there may be something special about `nvme0n1p*`?
If this is a bug, what is the actual behavior?
Ripgrep gets OOM-killed
If this is a bug, what is the expected behavior?
Either ripgrep could have managed to actually perform the search (the pattern was something very simple, no regex or anything, so afaiu theoretically the search could be done with very limited memory use in 3 passes for the output, forwards until the pattern, then backwards until a newline, then forwards again while outputting until the next newline), which would likely be best, but quite potentially require large changes to the codebase.
Or ripgrep could display an error message saying “Sorry we couldn't allocate enough memory because {your file is too big, your pattern is too complex, …}”, so that the reason why ripgrep is OOM'ing is clear. And if not possible, even a “OOM” message would still be better than the SIGKILL rg currently receives through the OOM.
Anyway, these are just nitpicks, thank you for ripgrep!