RzShell: refactor string, regex and byte search #4919

Rot127 · 2025-02-20T19:47:07Z

Your checklist for this pull request

I've read the guidelines for contributing to this repository
I made sure to follow the project's coding style
I've documented or updated the documentation of every function and struct this PR changes. If not so I've explained why.
I've added tests that prove my fix is effective or that my feature works (if possible)
I've updated the rizin book with the relevant information (if needed)

Supersedes #4762

Detailed description

Changes made

Moves all legacy search commands to RzShell (only the commands; internally they still parse their argument strings themselves).
Refactor string and byte search
Move to RzShell
Moves: / to /z.
Add support for Unicode and EBCDIC string search.
Add support for (Unicode) regex string search.
Add support for byte string regex search /xr.
Add more details to the search help messages.
Offsets of the search hits align with the actual encoding, not with the UTF-8 encoding.
Dispatches memory chunks for search into threads.
Changes to ps
- Adds extra arguments to specify encoding (also EBCDIC).
- Add additional delimiter argument (stop at first non-printable).
- Document it more.
- Add psu alias for ps utf8
Changes to Settings
- Remove str.search.max_uni_blocks - Effectively a metric the user should not know about; adds too much complexity.
- str.search.max_threads -> search.max_threads - This is a general setting for the search now.
- str.search.raw_alignment -> search.str.raw_alignment - Unify settings (only used for RzBin search.).
- str.search.encoding -> str.encoding - Valid for all string interpretations.
- str.search.min_length -> search.str.min_length - Unify settings.
- str.search.buffer_size -> search.str.max_length - Unify settings.
- str.search.max_region_size -> search.str.max_region_size - Unify settings.
- str.search.check_ascii_freq -> search.str.check_ascii_freq - Unify settings.
Removed commands
- /! - Because the command modifiers are not properly handled in RzShell yet and the advantage of this one is dubious (IMHO).
- /f - Modifier and obsolete, because search is dispatched into threads.
- /b - Modifier and obsolete, because search is dispatched into threads.
- /+ - Because it is unclear what it does. It does not seem particularly useful.
- /e - Replaced with regex search in bytes and string search.
- /w - All Unicode is searched now properly with /z.
Make some changes to the string escaping, so it works reliably with Unicode characters.
- The RzStrEscOptions were inconsistently used.
  E.g. show_asciidot (replace non-printable ascii with dot) was ignored for \n, \t etc.
- Defined Unicode code points are escaped now with \U00hhhhhh. All other non-printable bytes are escaped with \xhh. There are still some exceptions (when legacy escape functions are used) but most places are ok now.
General
- Fix inconsistencies in Unicode decoders/encoders and checkers. They now either return 0 on an invalid decode or the number of bytes the code point requires.
- Add many unit tests for Unicode related logic.
- Update Unicode tables to Version 16.
- Add helper to check code points.
- Escaped strings now escape valid Unicode code points to \Uhhhhhh (if not requested otherwise by the user) and invalid code points to \xhh.
- Add helper functions for hexadecimal strings and bits.
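The decoder contract and the escaping rule from the list above can be illustrated with a minimal sketch. It is not the code of this PR: `sketch_utf8_decode` and `sketch_escape` are hypothetical names, only UTF-8 is handled, and as a simplification every code point outside printable ASCII is escaped. The decoder returns the number of bytes the code point requires, or 0 on an invalid decode. The escaper prints valid code points as `\U` plus eight hex digits and invalid bytes as `\xhh`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not Rizin's API): decodes one UTF-8 code point.
 * Returns the number of bytes consumed, or 0 if the sequence is invalid. */
static size_t sketch_utf8_decode(const uint8_t *buf, size_t len, uint32_t *cp) {
	assert(buf && cp);
	if (len == 0) {
		return 0;
	}
	uint8_t b0 = buf[0];
	size_t n;
	uint32_t v, min;
	if (b0 < 0x80) {
		*cp = b0;
		return 1;
	} else if ((b0 & 0xe0) == 0xc0) {
		n = 2; v = b0 & 0x1f; min = 0x80;
	} else if ((b0 & 0xf0) == 0xe0) {
		n = 3; v = b0 & 0x0f; min = 0x800;
	} else if ((b0 & 0xf8) == 0xf0) {
		n = 4; v = b0 & 0x07; min = 0x10000;
	} else {
		return 0; /* stray continuation byte or invalid lead byte */
	}
	if (len < n) {
		return 0; /* truncated sequence */
	}
	for (size_t i = 1; i < n; i++) {
		if ((buf[i] & 0xc0) != 0x80) {
			return 0; /* not a continuation byte */
		}
		v = (v << 6) | (buf[i] & 0x3f);
	}
	/* Reject overlong forms, surrogates and values past U+10FFFF. */
	if (v < min || v > 0x10ffff || (v >= 0xd800 && v <= 0xdfff)) {
		return 0;
	}
	*cp = v;
	return n;
}

/* Hypothetical helper: escapes `in` into the NUL-terminated buffer `out`.
 * Returns the number of characters written (excluding the NUL).
 * Output is truncated at a whole escape if `out` is too small. */
static size_t sketch_escape(const uint8_t *in, size_t in_len, char *out, size_t out_size) {
	assert(in && out && out_size > 0);
	size_t o = 0, i = 0;
	out[0] = '\0';
	while (i < in_len) {
		uint32_t cp = 0;
		char tmp[16];
		int w;
		size_t n = sketch_utf8_decode(in + i, in_len - i, &cp);
		if (n == 0) {
			/* Invalid byte: escape it on its own and resynchronize. */
			w = snprintf(tmp, sizeof(tmp), "\\x%02x", (unsigned int)in[i]);
			n = 1;
		} else if (cp >= 0x20 && cp < 0x7f) {
			w = snprintf(tmp, sizeof(tmp), "%c", (char)cp);
		} else {
			w = snprintf(tmp, sizeof(tmp), "\\U%08x", (unsigned int)cp);
		}
		if (w < 0 || o + (size_t)w >= out_size) {
			break;
		}
		memcpy(out + o, tmp, (size_t)w + 1);
		o += (size_t)w;
		i += n;
	}
	return o;
}
```

Because a failed decode returns 0 instead of a negative value or a guessed length, the caller decides how far to advance; here it skips exactly one byte.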

TODO Overview

Open issues

GCC-12 flag: 6109a0b
Search hits of strings should contain length in characters/code points and graphemes. cc @kazarmy
Add option to define search strategy (graph algos, address window sizes etc.).
Add option to define what are "unprintable character" exceptions (e.g. \t, \n etc.).

Documentation

Update book about search:
- API
- commands
- examples
- edge cases (e.g. diacritics, regex search PCRE2)

Test plan

Tests were added

Closing issues

closes #4910

…pe annotations. Part 1/9. Likely won't build in between parts. Co-authored-by: wargio <deroad@kumo.xn--q9jyb4c>

This commit adds several improvements, updates and fixes to Unicode related logic.
- Update Unicode tables to version 16.
- Escaped strings now escape valid Unicode code points to \Uhhhhhh and invalid code points to \xhh.
- Generally applies RzStrEscOptions far more consistently. The legacy escape is still used in some places though.
- Fix inconsistencies in Unicode decoders/encoders and checkers. They now either return 0 on an invalid decode or the number of bytes the code point requires.
- Add many unit tests for Unicode related logic.
- Add helpers to check code points.

Part 2/9. Likely won't build in between parts.

Co-authored-by: wargio <deroad@kumo.xn--q9jyb4c>

This commit changes several settings. The main reason is to have them contained in one search group instead of spread over the search and string groups. This becomes important with the search refactor, since the search is now more contained in a single module and can make use of more settings.
- Remove str.search.max_uni_blocks - Effectively a metric the user should not know about; adds too much complexity. Also not documented.
- str.search.encoding -> str.encoding - Valid for all string interpretations.
- str.search.max_threads -> search.max_threads - This is a general setting for the search now.
- str.search.raw_alignment -> search.str.raw_alignment - Unify settings (only used for RzBin search).
- str.search.min_length -> search.str.min_length - Unify settings.
- str.search.buffer_size -> search.str.max_length - Unify settings.
- str.search.max_region_size -> search.str.max_region_size - Unify settings.
- str.search.check_ascii_freq -> search.str.check_ascii_freq - Unify settings.

Part 3/9. Likely won't build in between parts.

Co-authored-by: wargio <deroad@kumo.xn--q9jyb4c>
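For scripts that still use the old setting names, the renames listed in this commit can be captured in a small lookup table. The following is only a sketch: `SketchRename` and `sketch_migrate_setting` are hypothetical names and not part of Rizin; the setting names themselves are taken from the list above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
	const char *old_name;
	const char *new_name; /* NULL: the setting was removed */
} SketchRename;

/* Old -> new names as listed in the commit message. */
static const SketchRename sketch_renames[] = {
	{ "str.search.max_uni_blocks", NULL },
	{ "str.search.encoding", "str.encoding" },
	{ "str.search.max_threads", "search.max_threads" },
	{ "str.search.raw_alignment", "search.str.raw_alignment" },
	{ "str.search.min_length", "search.str.min_length" },
	{ "str.search.buffer_size", "search.str.max_length" },
	{ "str.search.max_region_size", "search.str.max_region_size" },
	{ "str.search.check_ascii_freq", "search.str.check_ascii_freq" },
};

/* Hypothetical helper: returns the new name for a renamed setting,
 * NULL for a removed one, and `name` itself if it is not in the table. */
static const char *sketch_migrate_setting(const char *name) {
	assert(name);
	for (size_t i = 0; i < sizeof(sketch_renames) / sizeof(sketch_renames[0]); i++) {
		if (!strcmp(name, sketch_renames[i].old_name)) {
			return sketch_renames[i].new_name;
		}
	}
	return name;
}
```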

- This commit adds the ability to print any supported string encoding with 'ps' (also EBCDIC).
- Adds alias 'psu' for 'ps utf8'.
- It also allows selecting unprintable characters as the string delimiter.

Part 4/9. Likely won't build in between parts.

Co-authored-by: wargio <deroad@kumo.xn--q9jyb4c>

The commit moves the new and legacy search commands to RzShell, adds more details to the search help, deletes some undocumented or unnecessary commands and adds the stubs for the new search handler implementations. Legacy commands still do their string parsing on arguments and are not touched. The new searches (string and bytes) have their actual implementation in the following commits.

Renamed and replaced commands:
- Renamed '/' -> '/z'
- Replaced '/e' -> '/z' or '/xr'
- Replaced '/w' - All Unicode is searched now properly with '/z'.

Removed commands:
- '/!' - Because the command modifiers are not properly handled in RzShell yet and the advantage of this one is dubious.
- '/f' - Modifiers are obsolete, because search is dispatched into threads.
- '/b' - Modifiers are obsolete, because search is dispatched into threads.
- '/+' - Because it is unclear what it does. It does not seem particularly useful.

Part 5/9. Likely won't build in between parts.

Co-authored-by: wargio <deroad@kumo.xn--q9jyb4c>

…ad). Adds the core implementation of the new search. The rough architecture is the following:

A search for a certain type of information (strings, bytes, keys etc.) creates a collection of items to search for (byte patterns, regular expressions etc.). It then specifies settings for how the search is performed (number of threads, maximum hits...) and how findings are matched (string length, inverse match etc.). It also defines a search space, which is currently only the IO buffer, but could be anything in the future, such as a graph or the knowledge base.

The search splits up the search space into windows (for IO: address ranges) and dispatches each search window into a 'find()' thread. The 'find()' handler (provided by a specific search implementation) checks the given window and produces search hits matching the elements in the search collection. The main search handler collects the hits of the dispatched workers and returns them to the user.

Note: The byte and string search implementations are added in the next two commits.

Part 6/9. Likely won't build in between parts.

Co-authored-by: wargio <deroad@kumo.xn--q9jyb4c>
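The window dispatch described in this commit message can be sketched with POSIX threads. This is an illustration under assumptions, not the RzSearch implementation: all `sketch_*` names, the fixed-size hit arrays and the single literal byte pattern are hypothetical. In particular, the commit message does not say how hits spanning a window boundary are handled; the sketch assumes each window owns the start offsets in its range and lets a comparison read past the window end.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MAX_HITS 64
#define SKETCH_MAX_WINDOWS 16

typedef struct {
	const uint8_t *buf; /* whole search space */
	size_t buf_len;
	size_t start, end; /* this window owns hits starting in [start, end) */
	const uint8_t *pat;
	size_t pat_len;
	size_t hits[SKETCH_MAX_HITS];
	size_t n_hits;
} SketchWindow;

/* The 'find()' handler: scans one window for the byte pattern. */
static void *sketch_find(void *arg) {
	SketchWindow *w = arg;
	for (size_t off = w->start; off < w->end && off + w->pat_len <= w->buf_len; off++) {
		if (!memcmp(w->buf + off, w->pat, w->pat_len) && w->n_hits < SKETCH_MAX_HITS) {
			w->hits[w->n_hits++] = off;
		}
	}
	return NULL;
}

/* Splits `buf` into `n_windows` windows, runs one worker per window and
 * collects the hit offsets in ascending order. Returns the hit count. */
static size_t sketch_search(const uint8_t *buf, size_t len, const uint8_t *pat, size_t pat_len,
	size_t n_windows, size_t *hits, size_t max_hits) {
	assert(buf && pat && hits && pat_len > 0);
	assert(n_windows > 0 && n_windows <= SKETCH_MAX_WINDOWS);
	SketchWindow win[SKETCH_MAX_WINDOWS];
	pthread_t tid[SKETCH_MAX_WINDOWS];
	bool started[SKETCH_MAX_WINDOWS];
	size_t step = (len + n_windows - 1) / n_windows;
	for (size_t i = 0; i < n_windows; i++) {
		size_t start = i * step < len ? i * step : len;
		size_t end = start + step < len ? start + step : len;
		win[i] = (SketchWindow){ .buf = buf, .buf_len = len, .start = start, .end = end,
			.pat = pat, .pat_len = pat_len };
		started[i] = pthread_create(&tid[i], NULL, sketch_find, &win[i]) == 0;
		if (!started[i]) {
			sketch_find(&win[i]); /* fall back to searching inline */
		}
	}
	size_t total = 0;
	for (size_t i = 0; i < n_windows; i++) {
		if (started[i]) {
			pthread_join(tid[i], NULL);
		}
		for (size_t k = 0; k < win[i].n_hits && total < max_hits; k++) {
			hits[total++] = win[i].hits[k];
		}
	}
	return total;
}
```

Each worker writes only to its own `SketchWindow`, so no locking is needed; the main handler merges the per-window results after joining.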

Adds the byte search implementation of the new search. The normal byte search works just as before, but gains many more examples in the help message and more test cases. Additionally, it adds a regex byte search.

Part 7/9. Likely won't build in between parts.

Co-authored-by: wargio <deroad@kumo.xn--q9jyb4c>

Adds the new string search implementation, fixes many bugs and makes performance improvements.
- Adds support to search reliably for all supported encodings (fixes non-ASCII string search).
- Fixes some wrong assumptions about which code points are valid (e.g. 0x000000ff is a valid code point in UTF-32/UTF-16 BE).
- Adds several '/z' command options for how to perform the string search (literal, regex, extended regex, caseless).
- Checks every decoded code point for validity to improve correctness.
- Improves performance of string decoding by not writing to the heap in all cases.

Part 8/9. Likely won't build in between parts.

Co-authored-by: wargio <deroad@kumo.xn--q9jyb4c>
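As an illustration of the validity checks mentioned above, here is a minimal UTF-16 decoder sketch. `sketch_utf16_decode` is a hypothetical name and not the function used in this PR. It follows the same return convention as the Unicode helpers (bytes consumed, or 0 on an invalid decode) and accepts the big-endian byte pair 00 ff as the valid code point U+00FF.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Reads one 16-bit code unit in the requested byte order. */
static uint32_t sketch_read16(const uint8_t *p, bool big_endian) {
	return big_endian ? ((uint32_t)p[0] << 8 | p[1]) : ((uint32_t)p[1] << 8 | p[0]);
}

/* Hypothetical helper: decodes one UTF-16 code point.
 * Returns the number of bytes consumed (2 or 4), or 0 if invalid. */
static size_t sketch_utf16_decode(const uint8_t *buf, size_t len, bool big_endian, uint32_t *cp) {
	assert(buf && cp);
	if (len < 2) {
		return 0;
	}
	uint32_t hi = sketch_read16(buf, big_endian);
	if (hi < 0xd800 || hi > 0xdfff) {
		*cp = hi; /* plain BMP code point, e.g. U+00FF */
		return 2;
	}
	if (hi >= 0xdc00 || len < 4) {
		return 0; /* lone low surrogate or truncated pair */
	}
	uint32_t lo = sketch_read16(buf + 2, big_endian);
	if (lo < 0xdc00 || lo > 0xdfff) {
		return 0; /* high surrogate not followed by a low surrogate */
	}
	*cp = 0x10000 + ((hi - 0xd800) << 10) + (lo - 0xdc00);
	return 4;
}
```

Since the return value is the encoded length, a caller walking a buffer can keep hit offsets in terms of the original encoding rather than of a UTF-8 transcoding.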

…search refactor. Co-authored-by: wargio <deroad@kumo.xn--q9jyb4c>

notxvilka

LGTM.

Rot127 requested review from ret2libc, thestr4ng3r, yossizap, wargio and kazarmy as code owners February 20, 2025 19:47

github-actions bot added infrastructure rz-asm rz-bin rz-test RzBin RzAnalysis RzDebug API RzCore RzType DWARF RzUtil RzCons RzSearch RzSocket labels Feb 20, 2025

Rot127 mentioned this pull request Feb 20, 2025

RzShell: refactor string, regex and byte search #4762

Closed

39 tasks

Rot127 and others added 6 commits February 21, 2025 12:51

String/Hex-Search 1/9: Add hexadecimal number helpers, doxygen and ty…

67ca6b0

…pe annotations. Part 1/9. Likely won't build in between parts. Co-authored-by: wargio <deroad@kumo.xn--q9jyb4c>

Rot127 and others added 2 commits February 21, 2025 12:53

Rot127 force-pushed the dist-fuzz-rz-search branch from de92480 to e1c4129 Compare February 21, 2025 17:53

Rot127 force-pushed the dist-fuzz-rz-search branch from e1c4129 to 4d49f17 Compare February 21, 2025 18:04

Rot127 and others added 2 commits February 21, 2025 13:12

String/Hex-Search 9/9: Fix all tests affected by the string and byte …

754f152

…search refactor. Co-authored-by: wargio <deroad@kumo.xn--q9jyb4c>

Fix segfault if sorted or unsorted lines are empty.

37a85a5

Rot127 force-pushed the dist-fuzz-rz-search branch from 4d49f17 to 37a85a5 Compare February 21, 2025 18:12

notxvilka approved these changes Feb 21, 2025

notxvilka added merge-when-green ready Ready to be merged labels Feb 21, 2025
