RzShell: refactor string, regex and byte search #4919

Open
wants to merge 10 commits into
base: dev
from


Rot127
Member

@Rot127 Rot127 commented Feb 20, 2025

Your checklist for this pull request

  • I've read the guidelines for contributing to this repository
  • I made sure to follow the project's coding style
  • I've documented or updated the documentation of every function and struct this PR changes. If not so I've explained why.
  • I've added tests that prove my fix is effective or that my feature works (if possible)
  • I've updated the rizin book with the relevant information (if needed)

Supersedes #4762

Detailed description

Changes made

  • Moves all legacy search commands to RzShell (only the commands themselves; internally they still parse their argument strings on their own).
  • Refactor string and byte search
  • Move to RzShell
  • Moves: / to /z.
  • Add support for Unicode and EBCDIC string search.
  • Add support for (Unicode) regex string search.
  • Add support for byte string regex search /xr.
  • Add more details to the search help messages.
  • Offsets of the search hits align with the actual encoding, not with the UTF-8 encoding.
  • Dispatches memory chunks for search into threads.
  • Changes to ps
    • Adds extra arguments to specify encoding (also EBCDIC).
    • Add additional delimiter argument (stop at first non-printable).
    • Document it more.
    • Add psu alias for ps utf8
  • Changes to Settings
    • Remove str.search.max_uni_blocks - Effectively a metric the user should not know about; adds too much complexity.
    • str.search.max_threads -> search.max_threads - This is a general setting for the search now.
    • str.search.raw_alignment -> search.str.raw_alignment - Unify settings (only used for RzBin search.).
    • str.search.encoding -> str.encoding - Valid for all string interpretations.
    • str.search.min_length -> search.str.min_length - Unify settings.
    • str.search.buffer_size -> search.str.max_length - Unify settings.
    • str.search.max_region_size -> search.str.max_region_size - Unify settings.
    • str.search.check_ascii_freq -> search.str.check_ascii_freq - Unify settings.
  • Removed commands
    • /! - Because the command modifiers are not properly handled in RzShell yet and the advantage of this one is dubious (IMHO).
    • /f - Modifier and obsolete, because search is dispatched into threads.
    • /b - Modifier and obsolete, because search is dispatched into threads.
    • /+ - Because it is unclear what it does. It does not seem particularly useful.
    • /e - Replaced with regex search in bytes and string search.
    • /w - All Unicode is searched now properly with /z.
  • Make some changes to the string escaping, so it works reliably with Unicode characters.
    • The RzStrEscOptions were inconsistently used.
      E.g. show_asciidot (replace non-printable ascii with dot) was ignored for \n, \t etc.
    • Defined Unicode code points are escaped now with \U00hhhhhh. All other non-printable bytes are escaped with \xhh. There are still some exceptions (when legacy escape functions are used) but most places are ok now.
  • General
    • Fix inconsistencies in Unicode decoders/encoders and checkers. They now either return 0 on an invalid decode or the number of bytes the code point requires.
    • Add many unit tests for Unicode related logic.
    • Update Unicode tables to Version 16.
    • Add helper to check code points.
    • Escaped strings now escape valid Unicode code points to \Uhhhhhh (if not requested otherwise by the user) and invalid code points to \xhh.
    • Add helper functions for hexadecimal strings and bits.
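
The escaping rule from the list above can be sketched roughly as follows. This is a Python illustration, not rizin's C implementation: `escape_bytes` is a hypothetical name, the input is assumed to be UTF-8, printable characters are assumed to pass through unchanged, and the 8-digit form follows the `\U00hhhhhh` notation used above.

```python
def escape_bytes(data: bytes) -> str:
    """Escape a UTF-8 buffer (hypothetical helper, not a rizin API).

    Printable characters are kept, valid but non-printable code points
    become \\UXXXXXXXX, and bytes that are not valid UTF-8 become \\xhh.
    """
    # surrogateescape maps each undecodable byte 0xNN to U+DCNN, which
    # lets us tell "invalid byte" apart from "valid code point" below.
    text = data.decode("utf-8", errors="surrogateescape")
    out = []
    for ch in text:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            out.append("\\x%02x" % (cp - 0xDC00))  # invalid byte
        elif ch.isprintable():
            out.append(ch)                          # keep as-is
        else:
            out.append("\\U%08x" % cp)              # valid, non-printable
    return "".join(out)
```

For example, `escape_bytes(b"ok\n\xff")` keeps `ok`, escapes the newline as a code point and escapes the stray `0xff` as a raw byte.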

TODO Overview

Open issues

  • GCC-12 flag: 6109a0b
  • Search hits of strings should contain length in characters/code points and graphemes. cc @kazarmy
  • Add option to define search strategy (graph algos, address window sizes etc.).
  • Add option to define what are "unprintable character" exceptions (e.g. \t, \n etc.).

Documentation

  • Update book about search:
    • API
    • commands
    • examples
    • edge cases (e.g. diacritics, regex search PCRE2)

Test plan

Tests were added

Closing issues

closes #4910

notxvilka

This comment was marked as resolved.

@notxvilka

This comment was marked as resolved.

@notxvilka

This comment was marked as resolved.

Rot127 and others added 6 commits February 21, 2025 12:51
…pe annotations.

Part 1/9. Likely won't build in between parts.

Co-authored-by: wargio <deroad@kumo.xn--q9jyb4c>
This commit adds several improvements, updates and fixes to Unicode related logic.

- Update Unicode tables to version 16.
- Escaped strings now escape valid Unicode code points to \Uhhhhhh and invalid code points to \xhh.
- Generally applies RzStrEscOptions way more consistently. The legacy escape is still used at some places though.
- Fix inconsistencies in Unicode decoders/encoders and checkers. They now either return 0 on an invalid decode or the number of bytes the code point requires.
- Add many unit tests for Unicode related logic.
- Add helpers to check code points.
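
The return convention for the decoders (0 on an invalid decode, otherwise the number of bytes the code point requires) can be illustrated with a small UTF-8 decoder. This is a Python sketch under that assumption; `utf8_decode` is a hypothetical name and does not mirror the actual rizin function signatures.

```python
from typing import Tuple


def utf8_decode(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode one code point at buf[pos:] (hypothetical helper).

    Returns (code_point, length). length is 0 if the bytes at pos are
    not a valid UTF-8 sequence, otherwise the number of bytes consumed.
    """
    if pos >= len(buf):
        return 0, 0
    b0 = buf[pos]
    if b0 < 0x80:
        return b0, 1
    if 0xC2 <= b0 <= 0xDF:
        need, cp, lowest = 1, b0 & 0x1F, 0x80
    elif 0xE0 <= b0 <= 0xEF:
        need, cp, lowest = 2, b0 & 0x0F, 0x800
    elif 0xF0 <= b0 <= 0xF4:
        need, cp, lowest = 3, b0 & 0x07, 0x10000
    else:
        return 0, 0  # stray continuation byte or invalid lead byte
    tail = buf[pos + 1 : pos + 1 + need]
    if len(tail) < need or any((b & 0xC0) != 0x80 for b in tail):
        return 0, 0  # truncated or malformed sequence
    for b in tail:
        cp = (cp << 6) | (b & 0x3F)
    if cp < lowest or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        return 0, 0  # overlong encoding, out of range, or surrogate
    return cp, need + 1
```

A caller can then treat a length of 0 uniformly as "not a valid code point here" and advance by one byte, instead of handling a different error value per decoder.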

Part 2/9. Likely won't build in between parts.

Co-authored-by: wargio <deroad@kumo.xn--q9jyb4c>
This commit changes several settings. The main reason is to have them
contained in one search group, and not spread over the search and string group.

This becomes important with the search refactor, since the search is now also
more contained in a single module and can make use of more settings.

- Remove str.search.max_uni_blocks - Effectively a metric the user should not know about; adds too much complexity. Also not documented.

- str.search.encoding         -> str.encoding - Valid for all string interpretations.
- str.search.max_threads      -> search.max_threads - This is a general setting for the search now.
- str.search.raw_alignment    -> search.str.raw_alignment - Unify settings (only used for RzBin search.).
- str.search.min_length       -> search.str.min_length - Unify settings.
- str.search.buffer_size      -> search.str.max_length - Unify settings.
- str.search.max_region_size  -> search.str.max_region_size - Unify settings.
- str.search.check_ascii_freq -> search.str.check_ascii_freq - Unify settings.
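
For scripts that still use the old names, the renames above can be captured in a lookup table. The following is a hypothetical Python migration helper (`migrate_setting` is not part of rizin); only the setting names themselves are taken from the list above.

```python
from typing import Optional

# Old name -> new name, as listed in this commit message.
RENAMED_SETTINGS = {
    "str.search.encoding": "str.encoding",
    "str.search.max_threads": "search.max_threads",
    "str.search.raw_alignment": "search.str.raw_alignment",
    "str.search.min_length": "search.str.min_length",
    "str.search.buffer_size": "search.str.max_length",
    "str.search.max_region_size": "search.str.max_region_size",
    "str.search.check_ascii_freq": "search.str.check_ascii_freq",
}

# Settings that were removed without a replacement.
REMOVED_SETTINGS = {"str.search.max_uni_blocks"}


def migrate_setting(name: str) -> Optional[str]:
    """Return the new setting name, None if it was removed,
    or the name unchanged if this commit does not touch it."""
    if name in REMOVED_SETTINGS:
        return None
    return RENAMED_SETTINGS.get(name, name)
```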

Part 3/9. Likely won't build in between parts.

Co-authored-by: wargio <deroad@kumo.xn--q9jyb4c>
- This commit adds the ability to print any supported string encoding
with 'ps' (also EBCDIC).
- Adds alias 'psu' for 'ps utf8'
- It also allows selecting unprintable characters as the string delimiter.
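
The behaviour described here (decode with a chosen encoding, optionally stop at the first non-printable character) can be sketched in Python. This does not reproduce the `ps` argument syntax; `read_string` and its parameters are hypothetical, and Python's `cp037` codec stands in for EBCDIC as an assumption.

```python
def read_string(buf: bytes, encoding: str = "utf-8",
                stop_at_unprintable: bool = False) -> str:
    """Decode buf up to the first NUL, or, if requested, up to the
    first non-printable character (hypothetical helper)."""
    text = buf.decode(encoding, errors="replace")
    out = []
    for ch in text:
        if ch == "\0":
            break  # NUL always terminates the string
        if stop_at_unprintable and not ch.isprintable():
            break  # extra delimiter: first non-printable character
        out.append(ch)
    return "".join(out)
```

With `stop_at_unprintable=True`, a UTF-16LE buffer holding `"añb\ncd"` yields `"añb"`, because the newline counts as non-printable.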

Part 4/9. Likely won't build in between parts.

Co-authored-by: wargio <deroad@kumo.xn--q9jyb4c>
The commit moves the new and legacy search commands to RzShell,
adds more details to the search help,
deletes some undocumented or unnecessary commands and adds the stubs for
the new search handler implementations.

Legacy commands still do their string parsing on arguments and are not touched.
The new searches (string and bytes) have their actual implementation in the following commits.

Renamed and replaced commands:

- Renamed '/'  -> '/z'
- Replaced '/e' -> '/z' or '/xr'
- Replaced '/w' - All Unicode is searched now properly with '/z'.

Removed commands:

- '/!' - Because the command modifiers are not properly handled in RzShell yet and the advantage of this one is dubious.
- '/f' - Modifiers are obsolete, because search is dispatched into threads.
- '/b' - Modifiers are obsolete, because search is dispatched into threads.
- '/+' - Because it is unclear what it does. It does not seem particularly useful.

Part 5/9. Likely won't build in between parts.

Co-authored-by: wargio <deroad@kumo.xn--q9jyb4c>
…ad).

Adds the core implementation of the new search.

The rough architecture is the following:

A search for a certain type of information (strings, bytes, keys etc.)
creates a collection of items to search for (byte patterns, regular expressions etc.).

It then specifies settings for how the search (number of threads, maximum hits...)
and the finding (string length, inverse match etc.) are performed.

It also defines a search space, which is currently only the IO buffer,
but can be anything in the future, like a graph or the knowledge base.

The search splits up the search space into windows (for IO: address ranges)
and dispatches each search window into a 'find()' thread.

The 'find()' handler (provided by a specific search implementation)
checks the given window and produces search hits matching the elements in the search collection.

The main search handler collects the hits of the dispatched workers
and returns them to the user.
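
The flow described above (split the search space into windows, dispatch each window to a `find()` worker, collect the hits) can be sketched in Python. All names here are hypothetical and the byte-pattern `find()` is only a stand-in; the overlap between windows is an assumption added so that matches crossing a window boundary are not lost, since the commit message does not say how this is handled.

```python
from concurrent.futures import ThreadPoolExecutor


def split_windows(start, end, window_size, overlap):
    """Yield (lo, hi) ranges covering [start, end); each window is
    extended by `overlap` bytes so boundary-crossing matches fit."""
    lo = start
    while lo < end:
        yield lo, min(lo + window_size + overlap, end)
        lo += window_size


def find(buf, window, patterns):
    """Worker: return (offset, pattern) hits inside one window."""
    lo, hi = window
    hits = []
    for pat in patterns:
        pos = buf.find(pat, lo, hi)
        while pos != -1:
            hits.append((pos, pat))
            pos = buf.find(pat, pos + 1, hi)
    return hits


def search(buf, patterns, window_size=4096, max_threads=4):
    """Dispatch every window to a worker thread and collect the hits."""
    overlap = max((len(p) for p in patterns), default=1) - 1
    windows = list(split_windows(0, len(buf), window_size, overlap))
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        results = list(pool.map(lambda w: find(buf, w, patterns), windows))
    hits = set()  # overlapping windows may report the same hit twice
    for window_hits in results:
        hits.update(window_hits)
    return sorted(hits)
```

Keeping `find()` as a separate function mirrors the idea that each search implementation supplies its own handler while the dispatcher stays the same.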

Note: The byte and string search implementations are added in the next two commits.

Part 6/9. Likely won't build in between parts.

Co-authored-by: wargio <deroad@kumo.xn--q9jyb4c>
Rot127 and others added 2 commits February 21, 2025 12:53
Adds the byte search implementation of the new search.

The normal byte search works just as before,
but this commit adds many more examples to the help message and more test cases.

Additionally, it adds a regex byte search.
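
As a rough illustration of what a regex search over raw bytes means, here is a Python sketch using the standard `re` module on a `bytes` buffer. It does not show the `/xr` syntax or the regex engine rizin uses; `regex_byte_search` is a hypothetical name.

```python
import re


def regex_byte_search(buf: bytes, pattern: bytes):
    """Return (offset, matched_bytes) for each non-overlapping match.

    DOTALL makes '.' match every byte value, including 0x0a.
    """
    rx = re.compile(pattern, re.DOTALL)
    return [(m.start(), m.group()) for m in rx.finditer(buf)]
```

For instance, the pattern `rb"\x7fELF[\x01\x02]"` matches the four magic bytes only when they are followed by a byte of value 1 or 2.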

Part 7/9. Likely won't build in between parts.

Co-authored-by: wargio <deroad@kumo.xn--q9jyb4c>
Adds the new string search implementation, fixes many bugs and makes performance improvements.

- Adds support to search reliably for all supported encodings (fixes non-ASCII string search).
- Fixes some wrong assumptions about which code points are valid (e.g. 0x000000ff is a valid code point in UTF-32/UTF-16 BE).
- Adds several '/z' command options for how to perform the string search (literal, regex, extended regex, caseless).
- Checks every decoded code point for validity to improve correctness.
- Improves performance of string decoding by not writing to the heap in all cases.
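
To illustrate why hit offsets have to follow the actual encoding of the data rather than UTF-8, here is a minimal Python sketch of a literal search in a given encoding. `find_string` is a hypothetical helper; it simply encodes the needle and searches byte-wise, and it ignores code unit alignment, which a real implementation would have to respect.

```python
def find_string(buf: bytes, needle: str, encoding: str):
    """Return byte offsets of `needle` in `buf`, measured in the
    buffer's own encoding (hypothetical helper)."""
    raw = needle.encode(encoding)
    hits = []
    pos = buf.find(raw)
    while pos != -1:
        hits.append(pos)
        pos = buf.find(raw, pos + 1)
    return hits
```

With a buffer of two NUL bytes followed by `"grüße"` in UTF-16LE, searching for `"üße"` reports offset 6 (2 padding bytes plus 2 code units of 2 bytes each). The same text stored as UTF-8 would place the match at offset 4.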

Part 8/9. Likely won't build in between parts.

Co-authored-by: wargio <deroad@kumo.xn--q9jyb4c>
@Rot127 Rot127 force-pushed the dist-fuzz-rz-search branch from de92480 to e1c4129 Compare February 21, 2025 17:53
@Rot127

This comment was marked as resolved.

@Rot127 Rot127 force-pushed the dist-fuzz-rz-search branch from e1c4129 to 4d49f17 Compare February 21, 2025 18:04
@Rot127 Rot127 force-pushed the dist-fuzz-rz-search branch from 4d49f17 to 37a85a5 Compare February 21, 2025 18:12

@notxvilka notxvilka left a comment


LGTM.

@notxvilka notxvilka added merge-when-green ready Ready to be merged labels Feb 21, 2025
Development

Successfully merging this pull request may close these issues.

rz-bin heap-buffer-overflow
2 participants