scripts: add rule-prevalence.py to detect FP-prone rules#2898
scripts: add rule-prevalence.py to detect FP-prone rules#2898kami922 wants to merge 1 commit intomandiant:masterfrom
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a valuable new tool for Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
|
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request. |
There was a problem hiding this comment.
Code Review
This pull request introduces a new script, rule-prevalence.py, to analyze the frequency of capa rule matches across a set of freeze files. This is a useful utility for identifying potentially false-positive-prone rules. The script is well-structured and uses rich for clear output.
My review includes two main points for improvement:
- Implementing the
--thresholdcommand-line option, which is mentioned in the pull request description but is currently missing. - A suggestion to use
pathlib.Pathdirectly inargparsefor cleaner path handling.
Overall, this is a valuable addition to the project's tooling.
|
|
||
| rate = (hit_count / total * 100) if total > 0 else 0 | ||
| rate_str = f"{rate:.0f}%" | ||
| row_style = "red" if rate >= 50 else "" |
There was a problem hiding this comment.
The PR description mentions a configurable --threshold, but a hardcoded value of 50 is used here. To implement this feature, you should:
- Add a
--thresholdargument inmain():
parser.add_argument("--threshold", type=int, default=50, help="threshold percentage to highlight rules (default: 50)")- Update
render_table's signature to accept the threshold:
def render_table(counts: dict[str, int], rules: capa.rules.RuleSet, total: int, quiet: bool, threshold: int) -> None:- Pass the threshold from
main()torender_table():
render_table(counts, rules, total=len(frz_paths), quiet=args.quiet, threshold=args.threshold)- Use the
thresholdparameter here.
| row_style = "red" if rate >= 50 else "" | |
| row_style = "red" if rate >= threshold else "" |
| parser.add_argument("input", type=str, help="path to directory containing .frz files") | ||
| parser.add_argument( | ||
| "-r", "--rules", type=str, default=None, help="path to rules directory (uses ./rules if not set)" | ||
| ) |
There was a problem hiding this comment.
Since this project uses Python 3.10+, you can use pathlib.Path directly as a type for argparse arguments. This simplifies path handling, as you would no longer need to explicitly convert the string arguments to Path objects later in the code (e.g., Path(args.input) on line 159 and Path(args.rules) on line 173 would become args.input and args.rules respectively).
| parser.add_argument("input", type=str, help="path to directory containing .frz files") | |
| parser.add_argument( | |
| "-r", "--rules", type=str, default=None, help="path to rules directory (uses ./rules if not set)" | |
| ) | |
| parser.add_argument("input", type=Path, help="path to directory containing .frz files") | |
| parser.add_argument( | |
| "-r", "--rules", type=Path, default=None, help="path to rules directory (uses ./rules if not set)" | |
| ) |
Adds scripts/rule-prevalence.py which runs capa rules against a directory of .frz freeze files and reports how often each rule matches. Rules matching >= 50% of reference files are highlighted red as potential false-positive warnings. Resolves: mandiant#424
9ffad94 to
31ce4a8
Compare
|
@mr-tz , @mike-hunhoff @williballenthin Hello I would like to ask question here.
|
|
since the recommendations are for two different things, using separate commits from the GitHub UI (with appropriate commit messages) is probably a better approach. |
|
i'd propose that we close this PR for now. writing the script to identify the prevalence is not the blocker to #424. rather, its the collection and maintenance of a representative sample set that we evaluate in CI. |
closes #424
Added
scripts/rule-prevalence.py,a small utility that runs capa rules against a folder of.frzfreeze files and reports how often each rule fires. The idea is to help rule reviewers spot rules that match too broadly across clean reference samples (a signal for potential false positives). Rules hitting >= 50% of samples are highlighted in red. Supports--quietto hide zero-hit rules and--thresholdto tune the cutoff.Usage:
Tests:
pytests passes locally.
Checklist