Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Byte pattern detection works inconsistently #1838

Closed
bdlmt opened this issue Mar 31, 2021 · 12 comments
Closed

Byte pattern detection works inconsistently #1838

bdlmt opened this issue Mar 31, 2021 · 12 comments
Labels
doc An issue with or an improvement to documentation. question An issue that is lacking clarity on one or more points. rollup A PR that has been merged with many others in a rollup.

Comments

@bdlmt
Copy link

bdlmt commented Mar 31, 2021

What version of ripgrep are you using?

ripgrep 12.1.1 (rev 7cb2113)
-SIMD -AVX (compiled)
+SIMD +AVX (runtime)

Initially, I found the bug while using 11.0.2:

ripgrep 11.0.2
-SIMD -AVX (compiled)
+SIMD +AVX (runtime)

How did you install ripgrep?

sudo dpkg -i ./ripgrep_12.1.1_amd64.deb

What operating system are you using ripgrep on?

Ubuntu 20.04.2 LTS

Linux local 5.8.0-45-generic #51~20.04.1-Ubuntu SMP Tue Feb 23 13:46:31 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Describe your bug.

Using command line mode, escaping from Unicode mode in order to scan for bytes doesn't seem to work as documented.

What are the steps to reproduce the behavior?

Simple to reproduce. A scan of /usr/bin will suffice.

What is the actual behavior?

An rg scan of /usr/bin for ELF magic values returns 0 results, while there should be over 1000 matching files:

user@local:~$ rg -uuu '(?-u)\x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00' /usr/bin/
user@local:~$ rg -uuu '(?-u)\x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00' /usr/bin/*
user@local:~$

Using the same byte pattern, a direct scan of individual files works as expected:

user@local:~$ rg -uuu '(?-u)\x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00' /usr/bin/ls
Binary file matches (found "\u{0}" byte around offset 7)
user@local:~$ rg -uuu '(?-u)\x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00' /usr/bin/ed
Binary file matches (found "\u{0}" byte around offset 7)
user@local:~$ rg -uuu '(?-u)\x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00' /usr/bin/file
Binary file matches (found "\u{0}" byte around offset 7)
user@local:~$ rg -uuu '(?-u)\x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00' /usr/bin/find
Binary file matches (found "\u{0}" byte around offset 7)

Reducing the set of files scanned seems to change the results when using a filename wildcard, but not when just just specifying the directory:

user@local:~$ mkdir test
user@local:~/test$ cd test
user@local:~/test$ cp /usr/bin/ls .
user@local:~/test$ cp /usr/bin/ed . 
user@local:~/test$ cp /usr/bin/file .
user@local:~/test$ cp /usr/bin/find .
user@local:~/test$ rg -uuu '(?-u)\x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00' ./*
Binary file ./ls matches (found "\u{0}" byte around offset 7)

Binary file ./find matches (found "\u{0}" byte around offset 7)

Binary file ./file matches (found "\u{0}" byte around offset 7)

Binary file ./ed matches (found "\u{0}" byte around offset 7)
user@local:~/test$ rg -uuu '(?-u)\x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00' ./
user@local:~/test$ rg -uuu '(?-u)\x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00' ./ls 
Binary file matches (found "\u{0}" byte around offset 7)
user@local:~/test$ rg -uuu '(?-u)\x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00' ./ed
Binary file matches (found "\u{0}" byte around offset 7)
user@local:~/test$ rg -uuu '(?-u)\x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00' ./file 
Binary file matches (found "\u{0}" byte around offset 7)
user@local:~/test$ rg -uuu '(?-u)\x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00' ./find 
Binary file matches (found "\u{0}" byte around offset 7)

Removing the \x00 bytes from the tail of the pattern changes ripgrep's behavior, and seems to work as expected:

user@local:~/test$ rg -uuu '(?-u)\x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00' /usr/bin/ | wc -l
0
user@local:~/test$ rg -uuu '(?-u)\x7f\x45\x4c\x46\x02\x01\x01' /usr/bin/ | wc -l
1092

Here are some debug outputs from the 4-file ~/test directory:

user@local:~/test$ rg -uuu '(?-u)\x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00' ./ --debug
DEBUG|grep_regex::literal|crates/regex/src/literal.rs:58: literal prefixes detected: Literals { lits: [Complete(ELF)], limit_size: 250, limit_class: 10 }
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
user@local:~/test$ rg -uuu '(?-u)\x7f\x45\x4c\x46\x02\x01\x01' ./* --debug
DEBUG|grep_regex::literal|crates/regex/src/literal.rs:58: literal prefixes detected: Literals { lits: [Complete(ELF)], limit_size: 250, limit_class: 10 }
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
Binary file ./ls matches (found "\u{0}" byte around offset 7)
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes

Binary file ./find matches (found "\u{0}" byte around offset 7)

Binary file ./file matches (found "\u{0}" byte around offset 7)

Binary file ./ed matches (found "\u{0}" byte around offset 7)
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
user@local:~/test$ rg -uuu '(?-u)\x7f\x45\x4c\x46\x02\x01\x01' ./ --debug
DEBUG|grep_regex::literal|crates/regex/src/literal.rs:58: literal prefixes detected: Literals { lits: [Complete(ELF)], limit_size: 250, limit_class: 10 }
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
Binary file ./ls matches (found "\u{0}" byte around offset 7)
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes

Binary file ./find matches (found "\u{0}" byte around offset 7)

Binary file ./file matches (found "\u{0}" byte around offset 7)

Binary file ./ed matches (found "\u{0}" byte around offset 7)
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
user@local:~/test$ rg -uuu '(?-u)\x7f\x45\x4c\x46\x02\x01\x01' ./ls --debug
DEBUG|grep_regex::literal|crates/regex/src/literal.rs:58: literal prefixes detected: Literals { lits: [Complete(ELF)], limit_size: 250, limit_class: 10 }
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
Binary file matches (found "\u{0}" byte around offset 7)

What is the expected behavior?

ripgrep should consistently match on files in a search path which contain a specified byte pattern, using the documented method.

@bdlmt
Copy link
Author

bdlmt commented Mar 31, 2021

Further test results, when copying all of /usr/bin into a local test directory.

Somehow, there are more pattern hits after a copy:

user@local:~$ mkdir test2
user@local:~$ cd test2/
user@local:~/test2$ cp /usr/bin/* .
cp: -r not specified; omitting directory '/usr/bin/X11'
user@local:~/test2$ rg -uuu '(?-u)\x7f\x45\x4c\x46\x02\x01\x01' /usr/bin | wc -l
1092
user@local:~/test2$ rg -uuu '(?-u)\x7f\x45\x4c\x46\x02\x01\x01' | wc -l
1468

The full byte pattern search still doesn't match any files, unless individual files are targeted:

user@local:~/test2$ rg -uuu '(?-u)\x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00' | wc -l
0
user@local:~/test2$ rg -uuu '(?-u)\x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00' ./ls | wc -l
1

And here's some more --debug output:

user@local:~/test2$ rg -uuu '(?-u)\x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00' --debug
DEBUG|grep_regex::literal|crates/regex/src/literal.rs:58: literal prefixes detected: Literals { lits: [Complete(ELF)], limit_size: 250, limit_class: 10 }
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes

@BurntSushi
Copy link
Owner

BurntSushi commented Mar 31, 2021

At first, I thought this might just be some confusion at how ripgrep handles binary files. The guide's section on binary data tries to explain the different modes. But since you are enabling binary mode via -uuu in all of your searches, that should cause things to behave consistently.

On mobile at the moment, but I suspect there may be a bug in how ripgrep is detecting and handling binary data specifically when a match occurs. I'll take a look when I'm back on my workstation.

Out of curiosity, do your results change if you add the --no-mmap flag? What about --mmap?

@BurntSushi
Copy link
Owner

(I believe there were changes in this area between the 11 and 12 releases, so please make sure you're doing all your testing with the latest release.)

@bdlmt
Copy link
Author

bdlmt commented Mar 31, 2021

(I believe there were changes in this area between the 11 and 12 releases, so please make sure you're doing all your testing with the latest release.)

All reported test results have been with latest.
(I was using 11.0.2 initially, but replicated all results in 12.1.1 for the bug report.)

At first, I thought this might just be some confusion at how ripgrep handles binary files. The guide's section on binary data tries to explain the different modes. But since you are enabling binary mode via -uuu in all of your searches, that should cause things to behave consistently.

On mobile at the moment, but I suspect there may be a bug in how ripgrep is detecting and handling binary data specifically when a match occurs. I'll take a look when I'm back on my workstation.

Out of curiosity, do your results change if you add the --no-mmap flag? What about --mmap?

The issue seems unchanged by '--no-mmap':

user@local:~/test2$ rg -uuu '(?-u)\x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00' --no-mmap | wc -l
0

But the results do change with '--mmap':

user@local:~/test2$ rg -uuu '(?-u)\x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00' --mmap | wc -l
1447

1447 seems to be the correct number of matches for that directory.

I tested using '--mmap' with a much larger set of files as well, and again got the expected number of matches.
(Though, slower than I expected, but maybe that's a side effect of escaping out of Unicode mode to do byte searches with regex.)

Here's a debug run with '--mmap' over the test dir with only 4 files:

$ rg -uuu '(?-u)\x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00' --mmap --debug
DEBUG|grep_regex::literal|crates/regex/src/literal.rs:58: literal prefixes detected: Literals { lits: [Complete(ELF)], limit_size: 250, limit_class: 10 }
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexesBinary file ls matches (found "\u{0}" byte around offset 7)


Binary file file matches (found "\u{0}" byte around offset 7)

Binary file find matches (found "\u{0}" byte around offset 7)

Binary file ed matches (found "\u{0}" byte around offset 7)
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes

@BurntSushi
Copy link
Owner

(Though, slower than I expected, but maybe that's a side effect of escaping out of Unicode mode to do byte searches with regex.)

Yeah, forcing memory maps when searching a bunch of tiny files actually results in some pretty serious overhead that slows down the entire enterprise.

In any case, I am able to reproduce this on my own /usr/bin directory, so I'll investigate this as soon as I can. Thanks for the great bug report!

@bdlmt
Copy link
Author

bdlmt commented Mar 31, 2021

In any case, I am able to reproduce this on my own /usr/bin directory, so I'll investigate this as soon as I can.

If I were to attempt to investigate and fix the issue, where's your best guess to start looking?
I'm still new to Rust, but willing to dig in.

Thanks for the great bug report!

No problem; thank you for creating and maintaining a great project.

@BurntSushi
Copy link
Owner

BurntSushi commented Mar 31, 2021

OK, so I took a look at this, and yeesh, this is actually behaving as intended. Which is kind of scary, because the UX here is just totally abysmal. This is a perfect storm, basically.

TL;DR - If you put --no-mmap in your ripgrep config file (or alias or whatever), then you'll get consistent behavior. For the specific case of searching for explicit binary data, you almost always want the -a flag. Otherwise, a \x00 byte will never match anything, by construction.

So the top-level issue here is that binary detection/handling fundamentally works differently depending on whether memory maps are used or not.

The underlying motivation for binary handling is a desire for ripgrep to treat files as "text." Of course, ripgrep, like any grep, can also search binary files. The main problem is that you generally don't want to accidentally print out results from a binary file because it can muck with your terminal via escape sequences and what not. On top of that, there is the issue that "text" files are typically line oriented where as binary data isn't. That means a single "line" in a binary data (typically a meaningless concept) can actually be quite large. Since ripgrep requires that each line be able to fit in memory, it uses a trick: it converts \x00 bytes into line terminators in an effort to make binary data "look" like text. When this happens, the file is also marked as "binary." If a match occurs, you get the "match found but it's a binary file so we aren't going to print anything" message.

Only looking at the above, we can explain why rg -uuu 'foo\x00\x00' finds nothing and why rg -uuu 'foo' finds something:

  • The third -u flag is synonymous with --binary. This causes ripgrep to disable auto-filtering of binary data. The main UX concern here is that if no match is found, then one should be able to conclude that no matches exist.
  • Since the -a/--text flag is not enabled, ripgrep does not treat everything as if it were "text." Instead, it tries to do binary detection. As stated above, this has a dual purpose: to prevent dumping arbitrary binary data to your terminal, but also to avoid using gratuitous amounts of memory for long lines in binary files. Thus, all \x00 bytes in every file searched are rewritten to \n bytes. This is what explains why foo\x00 won't match: \x00 can never match because all such bytes are re-mapped to line terminators. But foo contains no NUL bytes, and thus can match.

All of that behavior was lifted exactly from GNU grep. Including the NUL byte rewriting. However, GNU grep makes it much harder to search for NUL bytes. It doesn't recognize things like \x00, and inserting a NUL byte into a shell string is weird.

If that weren't confusing enough as it is, there's another catch here. When you search a single file directly, ripgrep will typically switch over to using memory maps to read the file since it tends to be faster in that case. The issue with memory maps is that we can't apply the above technique in the same way without destroying the performance benefit gained by using memory maps in the first place. In particular:

  1. We can't search the entire file for \x00 without adverse consequences. e.g., In particularly big files, in the worst case, this could require reading the entire file twice. When it comes to memory maps, they aren't searched in blocks, but rather, as a single contiguous slice. Thus binary detection is itself different for memory maps.
  2. When using memory maps, we never copy the data from the actual file just for searching. Therefore, there is no real opportunity to rewrite \x00 bytes as line terminators, even if we could find all such \x00 bytes.

So, with memory maps, we do binary detection slightly differently:

  1. We look for a \x00 byte in the first N bytes of a file.
  2. When printing matches or contextual lines, we search only those specific lines for a \x00 byte. In this way, we prevent dumping binary data to the terminal.

Thus, when you search a file directly, you get matches because ripgrep is using memory maps for searching and thus doesn't rewrite \x00 bytes as line terminators. So your pattern finds a hit.

So what could we do differently here? I don't think documentation is going to really help much here. It's such an oddball case. We could add to the relevant section on the guide with this particular case study. And in particular, we could say something like, "add --no-mmap to your config file to make binary data handling consistent." The performance improvement from memory maps isn't that great, after all.

In terms of behavior changes, here are some things I can think of in no particular order. But I also include reasons why they problematic:

  1. We could force the --no-mmap strategy to have the same binary handling logic as the --mmap strategy. e.g., By only looking for binary data in the first N bytes and otherwise only searching matching lines. This would at least avoid dumping binary data to your terminal. However, this gives up on the heuristic for avoiding large heap usages due to long "lines" in binary data. Memory maps don't have this problem since the OS handles paging things in and out of memory transparently. Having the heap explode when searching binary data seems unfortunate.
  2. We could get rid of memory maps entirely. The problem here is that they do have some nice perf benefits. They also enable certain configurations to run very very fast that wouldn't otherwise be possible without memory maps.
  3. We could somehow force the memory map binary handling to match the non-memory map handling. The only way I can think to do this would be to chunk up the memory map, copy data from it and rewrite its bytes like we do in the non-mmap case. But this is adding a lot of extra costs to the memory map handling that we can safely ignore precisely because of the nature of memory maps.
  4. We could forcefully disable the use of memory maps whenever --binary mode is enabled. But this is enabled by default when searching a specific file, so this would effectively shut off memory maps completely.

@BurntSushi BurntSushi added doc An issue with or an improvement to documentation. question An issue that is lacking clarity on one or more points. labels Mar 31, 2021
BurntSushi added a commit that referenced this issue Mar 31, 2021
This message will emit the binary detection mechanism being used for
each file.

This does not noticeably increases the number of log messages, as the
'trace' level is already used for emitting messages for every file
searched.

This trace message was added in the course of investigating #1838.
@bdlmt
Copy link
Author

bdlmt commented Mar 31, 2021

Thank you for your detailed explanation!

For the specific case of searching for explicit binary data, you almost always want the -a flag. Otherwise, a \x00 byte will never match anything, by construction.

Aha! I missed that in the docs.

That means a single "line" in a binary data (typically a meaningless concept) can actually be quite large. Since ripgrep requires that each line be able to fit in memory, it uses a trick: it converts \x00 bytes into line terminators in an effort to make binary data "look" like text.

Okay, got it.

The third -u flag is synonymous with --binary.

After reading through your response and the man page, I'm not able to rectify this behavior:

user@local:~$ rg -a --binary '(?-u)\x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00' /usr/bin | wc -l
0
user@local:~$ rg -a -uuu '(?-u)\x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00' /usr/bin | wc -l
1081
user@local:~$ rg -a --binary --mmap '(?-u)\x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00' /usr/bin | wc -l
1080
user@local:~$ rg -a -uuu --mmap '(?-u)\x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00' /usr/bin | wc -l
1081
user@local:~$ rg -a --binary --no-mmap '(?-u)\x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00' /usr/bin | wc -l
0
user@local:~$ rg -a -uuu --no-mmap '(?-u)\x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00' /usr/bin | wc -l
1081

-uuu and --binary seem to have different behavior when combined with -a, --mmap, and --no-mmap.

Assuming I understand your explanation correctly, the -uuu cases are behaving as expected, while the --binary cases don't seem to be.

(Edit: rearranged the test cases to make more sense side-by-side.)

@bdlmt
Copy link
Author

bdlmt commented Mar 31, 2021

In terms of behavior changes, here are some things I can think of in no particular order. But I also include reasons why they problematic:

  1. We could force the --no-mmap strategy to have the same binary handling logic as the --mmap strategy. e.g., By only looking for binary data in the first N bytes and otherwise only searching matching lines. This would at least avoid dumping binary data to your terminal. However, this gives up on the heuristic for avoiding large heap usages due to long "lines" in binary data. Memory maps don't have this problem since the OS handles paging things in and out of memory transparently. Having the heap explode when searching binary data seems unfortunate.
  2. We could get rid of memory maps entirely. The problem here is that they do have some nice perf benefits. They also enable certain configurations to run very very fast that wouldn't otherwise be possible without memory maps.
  3. We could somehow force the memory map binary handling to match the non-memory map handling. The only way I can think to do this would be to chunk up the memory map, copy data from it and rewrite its bytes like we do in the non-mmap case. But this is adding a lot of extra costs to the memory map handling that we can safely ignore precisely because of the nature of memory maps.
  4. We could forcefully disable the use of memory maps whenever --binary mode is enabled. But this is enabled by default when searching a specific file, so this would effectively shut off memory maps completely.

Would it make sense to have this as an option as well?
5. Use memory map mode whenever --binary mode is enabled. Treat a binary file as a long line and let the OS handle paging. No \x00 replacement overhead would be required, and the heap wouldn't explode on large binaries.

@BurntSushi
Copy link
Owner

@bdlmt The -a/--text and --binary flag are mutually exclusive things. They actually override each other. So -a --binary enables binary searching while --binary -a makes everything searched as if it were text. Of course, the former is equivalent to --binary without -a and the latter is equivalent to -a without --binary.

  1. Use memory map mode whenever --binary mode is enabled. Treat a binary file as a long line and let the OS handle paging. No \x00 replacement overhead would be required, and the heap wouldn't explode on large binaries.

It wouldn't. Aside from the downside of this seriously regressing performance when --binary is enabled, it's also just plain impossible. Memory maps can only be used in a subset of cases. They can't be used on streams and certain files (like /proc/cpuinfo) can't be memory mapped. If it was just a matter of performance regressing, then it would make sense to list it for completeness. But it's actually impossible. Standard read syscalls are the only universal way to search data.

@bdlmt
Copy link
Author

bdlmt commented Mar 31, 2021

The -a/--text and --binary flag are mutually exclusive things. They actually override each other. So -a --binary enables binary searching while --binary -a makes everything searched as if it were text. Of course, the former is equivalent to --binary without -a and the latter is equivalent to -a without --binary.

I see now. I read the man page incorrectly. I misunderstood this line under --binary:

This flag can be disabled with --no-binary. It overrides the -a/--text flag.

I thought "It" referred to --no-binary, not --binary. I should have realized I had it wrong, based on the intended nature of -a/--text and --binary.

They can't be used on streams and certain files (like /proc/cpuinfo) can't be memory mapped.

Good point; I wasn't thinking far enough outside my recent use case.

@bdlmt
Copy link
Author

bdlmt commented Apr 1, 2021

  1. We could force the --no-mmap strategy to have the same binary handling logic as the --mmap strategy. e.g., By only looking for binary data in the first N bytes and otherwise only searching matching lines. This would at least avoid dumping binary data to your terminal. However, this gives up on the heuristic for avoiding large heap usages due to long "lines" in binary data. Memory maps don't have this problem since the OS handles paging things in and out of memory transparently. Having the heap explode when searching binary data seems unfortunate.

From a UX point of view, behavior change option 1 seems best. Mixing -a/--text with -uuu to enable text mode during a binary search in order to successfully match on raw \x00 bytes is a bit awkward and not intuitive. The change would also address the unexpected disparity between using -a/--text with --binary (mutually exclusive) vs. using -a/--text with -uuu (text mode enabled on binaries), because -a/--text would no longer be necessary when matching on \x00 bytes.

From a performance point of view, it seems like leaving it as-is would be best, unless there's a way to split a binary file into multiple "lines" without replacing bytes.

As a user, I like the sound of improving the UX, but obviously not if there's a significant performance hit.

BurntSushi added a commit that referenced this issue Nov 25, 2023
Basically, unless the -a/--text flag is given, it is generally always an
error to search for an explicit NUL byte because the binary detection
will prevent it from matching.

Fixes #1838
@BurntSushi BurntSushi added the rollup A PR that has been merged with many others in a rollup. label Nov 25, 2023
netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this issue Nov 28, 2023
14.0.2 (2023-11-27)
===================
This is a patch release with a few small bug fixes.

Bug fixes:

* [BUG #2654](BurntSushi/ripgrep#2654):
  Fix `deb` release sha256 sum file.
* [BUG #2658](BurntSushi/ripgrep#2658):
  Fix partial regression in the behavior of `--null-data --line-regexp`.
* [BUG #2659](BurntSushi/ripgrep#2659):
  Fix Fish shell completions.
* [BUG #2662](BurntSushi/ripgrep#2662):
  Fix typo in documentation for `-i/--ignore-case`.


14.0.1 (2023-11-26)
===================
This a patch release meant to fix `cargo install ripgrep` on Windows.

Bug fixes:

* [BUG #2653](BurntSushi/ripgrep#2653):
  Include `pkg/windows/Manifest.xml` in crate package.


14.0.0 (2023-11-26)
===================
ripgrep 14 is a new major version release of ripgrep that has some new
features, performance improvements and a lot of bug fixes.

The headlining feature in this release is hyperlink support. In this release,
they are an opt-in feature but may change to an opt-out feature in the future.
To enable them, try passing `--hyperlink-format default`. If you use [VS Code],
then try passing `--hyperlink-format vscode`. Please [report your experience
with hyperlinks][report-hyperlinks], positive or negative.

[VS Code]: https://code.visualstudio.com/
[report-hyperlinks]: BurntSushi/ripgrep#2611

Another headlining development in this release is that it contains a rewrite
of its regex engine. You generally shouldn't notice any changes, except for
some searches may get faster. You can read more about the [regex engine rewrite
on my blog][regex-internals]. Please [report your performance improvements or
regressions that you notice][report-perf].

[report-perf]: BurntSushi/ripgrep#2652

Finally, ripgrep switched the library it uses for argument parsing. Users
should not notice a difference in most cases (error messages have changed
somewhat), but flag overrides should generally be more consistent. For example,
things like `--no-ignore --ignore-vcs` work as one would expect (disables all
filtering related to ignore rules except for rules found in version control
systems such as `git`).

[regex-internals]: https://blog.burntsushi.net/regex-internals/

**BREAKING CHANGES**:

* `rg -C1 -A2` used to be equivalent to `rg -A2`, but now it is equivalent to
  `rg -B1 -A2`. That is, `-A` and `-B` no longer completely override `-C`.
  Instead, they only partially override `-C`.

Build process changes:

* ripgrep's shell completions and man page are now created by running ripgrep
with a new `--generate` flag. For example, `rg --generate man` will write a
man page in `roff` format on stdout. The release archives have not changed.
* The optional build dependency on `asciidoc` or `asciidoctor` has been
dropped. Previously, it was used to produce ripgrep's man page. ripgrep now
owns this process itself by writing `roff` directly.

Performance improvements:

* [PERF #1746](BurntSushi/ripgrep#1746):
  Make some cases with inner literals faster.
* [PERF #1760](BurntSushi/ripgrep#1760):
  Make most searches with `\b` look-arounds (among others) much faster.
* [PERF #2591](BurntSushi/ripgrep#2591):
  Parallel directory traversal now uses work stealing for faster searches.
* [PERF #2642](BurntSushi/ripgrep#2642):
  Parallel directory traversal has some contention reduced.

Feature enhancements:

* Added or improved file type filtering for Ada, DITA, Elixir, Fuchsia, Gentoo,
  Gradle, GraphQL, Markdown, Prolog, Raku, TypeScript, USD, V
* [FEATURE #665](BurntSushi/ripgrep#665):
  Add a new `--hyperlink-format` flag that turns file paths into hyperlinks.
* [FEATURE #1709](BurntSushi/ripgrep#1709):
  Improve documentation of ripgrep's behavior when stdout is a tty.
* [FEATURE #1737](BurntSushi/ripgrep#1737):
  Provide binaries for Apple silicon.
* [FEATURE #1790](BurntSushi/ripgrep#1790):
  Add new `--stop-on-nonmatch` flag.
* [FEATURE #1814](BurntSushi/ripgrep#1814):
  Flags are now categorized in `-h/--help` output and ripgrep's man page.
* [FEATURE #1838](BurntSushi/ripgrep#1838):
  An error is shown when searching for NUL bytes with binary detection enabled.
* [FEATURE #2195](BurntSushi/ripgrep#2195):
  When `extra-verbose` mode is enabled in zsh, show extra file type info.
* [FEATURE #2298](BurntSushi/ripgrep#2298):
  Add instructions for installing ripgrep using `cargo binstall`.
* [FEATURE #2409](BurntSushi/ripgrep#2409):
  Added installation instructions for `winget`.
* [FEATURE #2425](BurntSushi/ripgrep#2425):
  Shell completions (and man page) can be created via `rg --generate`.
* [FEATURE #2524](BurntSushi/ripgrep#2524):
  The `--debug` flag now indicates whether stdin or `./` is being searched.
* [FEATURE #2643](BurntSushi/ripgrep#2643):
  Make `-d` a short flag for `--max-depth`.
* [FEATURE #2645](BurntSushi/ripgrep#2645):
  The `--version` output will now also contain PCRE2 availability information.

Bug fixes:

* [BUG #884](BurntSushi/ripgrep#884):
  Don't error when `-v/--invert-match` is used multiple times.
* [BUG #1275](BurntSushi/ripgrep#1275):
  Fix bug with `\b` assertion in the regex engine.
* [BUG #1376](BurntSushi/ripgrep#1376):
  Using `--no-ignore --ignore-vcs` now works as one would expect.
* [BUG #1622](BurntSushi/ripgrep#1622):
  Add note about error messages to `-z/--search-zip` documentation.
* [BUG #1648](BurntSushi/ripgrep#1648):
  Fix bug where sometimes short flags with values, e.g., `-M 900`, would fail.
* [BUG #1701](BurntSushi/ripgrep#1701):
  Fix bug where some flags could not be repeated.
* [BUG #1757](BurntSushi/ripgrep#1757):
  Fix bug when searching a sub-directory didn't have ignores applied correctly.
* [BUG #1891](BurntSushi/ripgrep#1891):
  Fix bug when using `-w` with a regex that can match the empty string.
* [BUG #1911](BurntSushi/ripgrep#1911):
  Disable mmap searching in all non-64-bit environments.
* [BUG #1966](BurntSushi/ripgrep#1966):
  Fix bug where ripgrep can panic when printing to stderr.
* [BUG #2046](BurntSushi/ripgrep#2046):
  Clarify that `--pre` can accept any kind of path in the documentation.
* [BUG #2108](BurntSushi/ripgrep#2108):
  Improve docs for `-r/--replace` syntax.
* [BUG #2198](BurntSushi/ripgrep#2198):
  Fix bug where `--no-ignore-dot` would not ignore `.rgignore`.
* [BUG #2201](BurntSushi/ripgrep#2201):
  Improve docs for `-r/--replace` flag.
* [BUG #2288](BurntSushi/ripgrep#2288):
  `-A` and `-B` now only each partially override `-C`.
* [BUG #2236](BurntSushi/ripgrep#2236):
  Fix gitignore parsing bug where a trailing `\/` resulted in an error.
* [BUG #2243](BurntSushi/ripgrep#2243):
  Fix `--sort` flag for values other than `path`.
* [BUG #2246](BurntSushi/ripgrep#2246):
  Add note in `--debug` logs when binary files are ignored.
* [BUG #2337](BurntSushi/ripgrep#2337):
  Improve docs to mention that `--stats` is always implied by `--json`.
* [BUG #2381](BurntSushi/ripgrep#2381):
  Make `-p/--pretty` override flags like `--no-line-number`.
* [BUG #2392](BurntSushi/ripgrep#2392):
  Improve global git config parsing of the `excludesFile` field.
* [BUG #2418](BurntSushi/ripgrep#2418):
  Clarify sorting semantics of `--sort=path`.
* [BUG #2458](BurntSushi/ripgrep#2458):
  Make `--trim` run before `-M/--max-columns` takes effect.
* [BUG #2479](BurntSushi/ripgrep#2479):
  Add documentation about `.ignore`/`.rgignore` files in parent directories.
* [BUG #2480](BurntSushi/ripgrep#2480):
  Fix bug when using inline regex flags with `-e/--regexp`.
* [BUG #2505](BurntSushi/ripgrep#2505):
  Improve docs for `--vimgrep` by mentioning footguns and some work-arounds.
* [BUG #2519](BurntSushi/ripgrep#2519):
  Fix incorrect default value in documentation for `--field-match-separator`.
* [BUG #2523](BurntSushi/ripgrep#2523):
  Make executable searching take `.com` into account on Windows.
* [BUG #2574](BurntSushi/ripgrep#2574):
  Fix bug in `-w/--word-regexp` that would result in incorrect match offsets.
* [BUG #2623](BurntSushi/ripgrep#2623):
  Fix a number of bugs with the `-w/--word-regexp` flag.
* [BUG #2636](BurntSushi/ripgrep#2636):
  Strip release binaries for macOS.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
doc An issue with or an improvement to documentation. question An issue that is lacking clarity on one or more points. rollup A PR that has been merged with many others in a rollup.
Projects
None yet
Development

No branches or pull requests

2 participants