Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: refactor LLM model selection and attack surface analysis #233

Open
wants to merge 21 commits into
base: release/2.2.0
Choose a base branch
from

Conversation

psyray
Copy link
Contributor

@psyray psyray commented Nov 11, 2024

Summary

Fixes #232

Enhance the LLM model selection process and attack surface analysis by introducing a new modal for model selection and implementing a new API endpoint. Refactor existing functions to support these changes and improve error handling. Update tests to reflect the new functionalities and remove outdated tests.

New Features:

  • Introduce a new modal for selecting LLM models for attack surface analysis, allowing users to choose from a list of available models.
    image

  • Implement a new API endpoint to fetch available LLM models and the currently selected model.
    image

  • Display a recommended list of models in the LLM Toolkit
    image

  • Add a download progress bar during model download
    image

Enhancements:

  • Refactor the show_attack_surface_modal function to include model selection and handle errors more robustly.
  • Remove hardcoded LLM definitions and prompts from definitions.py and move them to a new configuration file.
  • Add new classes and methods for managing LLM configurations and generating vulnerability reports and attack suggestions.

Tests:

  • Update tests to cover new functionalities and remove obsolete tests related to LLM attack suggestions.

Todo

  • Test Attack Surface
  •  Test Report Generation

The changes involve refactoring the codebase to replace references to "GPT" with "LLM" for generating vulnerability reports and related functionalities. This includes renaming functions, classes, variables, and configuration settings to reflect the new terminology. The update affects various components such as tasks, views, models, templates, and configuration files across the application.
- Introduced a new modal for selecting LLM models for attack surface analysis, allowing users to choose from a list of available models.
- Added functionality to update the selected model in the database before proceeding with analysis.
- Implemented a new API endpoint to fetch available LLM models and the currently selected model.
- Refactored the show_attack_surface_modal function to include model selection and handle errors more robustly.
- Removed hardcoded LLM definitions and prompts from definitions.py and moved them to a new configuration file.
- Added new classes and methods for managing LLM configurations and generating vulnerability reports and attack suggestions.
- Updated tests to cover new functionalities and removed obsolete tests related to LLM attack suggestions.
Copy link
Contributor

sourcery-ai bot commented Nov 11, 2024

Reviewer's Guide by Sourcery

This PR enhances the LLM (Language Learning Model) integration by introducing a new model selection modal and implementing a more robust API endpoint. The changes include refactoring the GPT-specific code into a more generic LLM framework, improving error handling, and adding support for multiple LLM providers including OpenAI and Ollama.

Updated Class Diagram for LLM Vulnerability Report

classDiagram
    class LLMVulnerabilityReport {
        +String url_path
        +String title
        +Text description
        +Text impact
        +Text remediation
        +Text references
        +String formatted_description()
        +String formatted_impact()
        +String formatted_remediation()
        +String formatted_references()
    }
    class Vulnerability {
        +String name
        +String http_url
        +Text description
        +Text impact
        +Text remediation
        +Text references
        +boolean is_llm_used
        +String formatted_description()
        +String formatted_impact()
        +String formatted_remediation()
        +String formatted_references()
    }
    LLMVulnerabilityReport --|> Vulnerability : updates
    note for LLMVulnerabilityReport "Replaces GPTVulnerabilityReport with LLM support"
Loading

File-Level Changes

Change Details Files
Refactored GPT-specific code into a generic LLM framework
  • Created new LLM configuration module with model definitions and system prompts
  • Implemented base LLM generator class with common functionality
  • Added specific generators for vulnerability reports and attack suggestions
  • Created validators for LLM input/output data
web/reNgine/llm/config.py
web/reNgine/llm/llm.py
web/reNgine/llm/validators.py
web/reNgine/llm/utils.py
Enhanced model selection with new modal interface
  • Added new modal for LLM model selection with detailed model information
  • Implemented model capabilities display and selection UI
  • Added support for regenerating and deleting attack surface analysis
web/static/custom/custom.js
web/scanEngine/templates/scanEngine/settings/llm_toolkit.html
Improved vulnerability reporting and display
  • Added Markdown to HTML conversion for vulnerability reports
  • Enhanced report formatting with Bootstrap styling
  • Updated vulnerability model to use text field for references
web/reNgine/llm/utils.py
web/templates/report/template.html
web/startScan/models.py
Updated API endpoints and handlers
  • Renamed GPT-specific endpoints to LLM endpoints
  • Added new endpoint for LLM model management
  • Improved error handling and response formatting
web/api/views.py
web/api/urls.py

Assessment against linked issues

Issue Objective Addressed Explanation
#232 Allow using the selected Ollama LLM model for attack surface detection instead of hardcoding llama2-uncensored

Possibly linked issues


Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time. You can also use
    this command to specify where the summary should be inserted.

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

@psyray psyray changed the base branch from master to release/2.1.1 November 11, 2024 12:23
web/api/views.py Fixed Show fixed Hide fixed
web/reNgine/llm/validators.py Fixed Show fixed Hide fixed
web/reNgine/tasks.py Fixed Show fixed Hide fixed
web/reNgine/tasks.py Fixed Show fixed Hide fixed
web/reNgine/tasks.py Fixed Show fixed Hide fixed
web/tests/test_llm.py Fixed Show fixed Hide fixed
web/static/custom/custom.js Fixed Show fixed Hide fixed
web/static/custom/custom.js Fixed Show fixed Hide fixed
@psyray psyray self-assigned this Nov 11, 2024
@psyray psyray added bug Something isn't working enhancement New feature or request labels Nov 11, 2024
@psyray psyray linked an issue Nov 11, 2024 that may be closed by this pull request
3 tasks
- Enhanced logging in tasks.py to include both the title and path of vulnerabilities when exceptions occur.
- Updated variable naming from llm to vuln for clarity in vulnerability handling.
- Minor text punctuation corrections in custom.js to ensure consistency in user messages.
- Remove unused import
- Improved the UI of the LLM toolkit by updating the layout and adding badges for model status and capabilities.
- Refactored the model management logic to use a centralized API call for fetching model data, improving error handling and reducing code duplication.
- Updated the model requirements configuration to enhance readability and consistency in the description of model capabilities.
- Adjusted the modal size for displaying model options to provide a better user experience.
- Introduced a method to convert markdown to HTML with added Bootstrap classes in the LLMAttackSuggestion class.
- Implemented HTML sanitization using DOMPurify in various JavaScript functions to prevent XSS vulnerabilities.
…on features

- Updated the URL validation function to correctly escape backslashes in the regex pattern.
- Enhanced the send_llm__attack_surface_api_request function to support additional parameters for model selection and analysis options.
- Introduced new functions regenerateAttackSurface, deleteAttackSurfaceAnalysis, and showAttackSurfaceModal to manage attack surface analysis lifecycle, including regeneration and deletion.
- Refactored the modal handling logic to improve user interaction for model selection and analysis display.
- Added input validation for model names in the LLMAttackSuggestionGenerator class and updated the attack suggestion generation process to accommodate model-specific requests.
- Enhanced markdown rendering by adding new extensions for better list handling and definition lists support.
- Improved HTML formatting by converting ordered lists to unordered and cleaning up line breaks.
- Removed "Beta" label from the LLM Toolkit page title and modal dialog title in the UI.
- Replaced the get_vulnerability_llm_report function with llm_vulnerability_report for generating and storing vulnerability reports using LLM.
- Enhanced the LLM vulnerability report generation process by splitting it into distinct sections: technical description, business impact, remediation steps, and references.
- Updated the data model to store references as text fields instead of using a separate VulnerabilityReference model.
- Improved the HTML rendering of vulnerability descriptions, impacts, and remediations by converting markdown to HTML with proper styling.
- Refactored the LLM response handling to use a dictionary format for easier manipulation and storage.
- Removed redundant code and streamlined the process of updating vulnerabilities with LLM-generated data.
- Adjusted the configuration and prompts for LLM to support more detailed and structured report generation.
@psyray psyray requested a review from AnonymousWP November 12, 2024 22:41
@psyray psyray marked this pull request as ready for review November 12, 2024 22:41
web/reNgine/llm/llm.py Fixed Show fixed Hide fixed
web/reNgine/llm/llm.py Fixed Show fixed Hide fixed
Copy link
Contributor

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @psyray - I've reviewed your changes and found some issues that need to be addressed.

Blocking issues:

  • Avoid exposing raw exception messages in API responses (link)

Overall Comments:

  • Consider adding more comprehensive test coverage for the new LLM functionality, particularly around error handling and edge cases.
  • The key LLM interface methods would benefit from more detailed docstrings documenting expected inputs, outputs, and error scenarios.
Here's what I looked at during the review
  • 🟡 General issues: 3 issues found
  • 🔴 Security: 1 blocking issue, 1 other issue
  • 🟢 Testing: all looks good
  • 🟢 Complexity: all looks good
  • 🟢 Documentation: all looks good

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

web/static/custom/custom.js Show resolved Hide resolved
web/api/views.py Show resolved Hide resolved
web/api/views.py Show resolved Hide resolved
web/api/views.py Show resolved Hide resolved
if query.strip():
qs = qs & self.special_lookup(query.strip())
elif '|' in search_value:
qs = Subdomain.objects.none()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (bug_risk): Incorrect model class used for empty queryset initialization

This should be EndPoint.objects.none() to maintain type consistency with the rest of the method.

return Response({"dorks": serializer.data})
def get(self, request, format=None):
req = self.request
scan_id = safe_int_cast(req.query_params.get('scan_id'))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (code-quality): We've found these issues:

Comment on lines +1597 to +1598
scan_id = safe_int_cast(req.query_params.get('scan_id'))
if scan_id:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (code-quality): Use named expression to simplify assignment and conditional (use-named-expression)

Suggested change
scan_id = safe_int_cast(req.query_params.get('scan_id'))
if scan_id:
if scan_id := safe_int_cast(req.query_params.get('scan_id')):

def get(self, request, format=None):
req = self.request
scan_id = safe_int_cast(req.query_params.get('scan_id'))
type = req.query_params.get('type')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (code-quality): Don't assign to builtin variable type (avoid-builtin-shadow)


ExplanationPython has a number of builtin variables: functions and constants that
form a part of the language, such as list, getattr, and type
(See https://docs.python.org/3/library/functions.html).
It is valid, in the language, to re-bind such variables:

list = [1, 2, 3]

However, this is considered poor practice.

  • It will confuse other developers.
  • It will confuse syntax highlighters and linters.
  • It means you can no longer use that builtin for its original purpose.

How can you solve this?

Rename the variable something more specific, such as integers.
In a pinch, my_list and similar names are colloquially-recognized
placeholders.

Comment on lines +1631 to +1632
scan_id = safe_int_cast(req.query_params.get('scan_id'))
if scan_id:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (code-quality): Use named expression to simplify assignment and conditional (use-named-expression)

Suggested change
scan_id = safe_int_cast(req.query_params.get('scan_id'))
if scan_id:
if scan_id := safe_int_cast(req.query_params.get('scan_id')):

Comment on lines +1715 to +1720
subdomain_names = []

for id in subdomain_ids:
subdomain_names.append(Subdomain.objects.get(id=id).name)
for id in subdomain_ids:
subdomain_names.append(Subdomain.objects.get(id=id).name)

if subdomain_names:
return Response({'status': True, "results": subdomain_names})
if subdomain_names:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (code-quality): We've found these issues:

Suggested change
subdomain_names = []
for id in subdomain_ids:
subdomain_names.append(Subdomain.objects.get(id=id).name)
for id in subdomain_ids:
subdomain_names.append(Subdomain.objects.get(id=id).name)
if subdomain_names:
return Response({'status': True, "results": subdomain_names})
if subdomain_names:
if subdomain_names := [
Subdomain.objects.get(id=id).name for id in subdomain_ids
]:

@psyray
Copy link
Contributor Author

psyray commented Nov 12, 2024

Too many sourcery review here.
Because it scan all the modified files.
It's not related to this PR.
Will fix this later in a security review

@psyray psyray changed the title feat: enhance LLM model selection and attack surface analysis fix: refactor LLM model selection and attack surface analysis Nov 12, 2024
Uncommented the code responsible for generating technical, impact, and remediation sections in the LLMVulnerabilityReportGenerator class.
web/reNgine/llm/llm.py Fixed Show fixed Hide fixed
web/reNgine/llm/llm.py Fixed Show fixed Hide fixed
web/reNgine/llm/llm.py Fixed Show fixed Hide fixed
- Removed extensive data from targetApp.json, including historical IPs, related domains, registrars, domain registrations, WHOIS status, nameservers, DNS records, and domain info.
- Updated auth.json to modify permission names and codenames, and removed several user and group entries.
- Added new todo notes in recon_note.json and updated existing ones.
- Updated dashboard.json to modify project details and remove unused settings.
- Modified scanEngine.json to update YAML configurations and add a new interesting lookup model.
- Enabled remote debugging in docker-compose.dev.yml.
- Updated a test in test_vulnerability.py to patch a different method for generating vulnerability reports
Overview
- Improved handling of CVE references by parsing string representations of arrays and generating appropriate HTML content.
- Updated the template to handle references more flexibly, displaying them as a list or paragraph based on their format.
- Converted markdown to HTML for various sections in the LLM vulnerability report.
- Removed redundant code in the LLM vulnerability report generator.
- Simplified the get_refs_str method in the Vulnerability model to return references directly.
Modified the llm_vulnerability_report function in tasks.py to convert the references field from a list to a single string before converting it to HTML.
Copy link
Member

@AnonymousWP AnonymousWP left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't have time yet to do an extensive test, but this is what I initially found. Will continue tomorrow.

Adjusted the logic for displaying the dropdown menu in the LLM Toolkit settings to only show when a model is not selected.
- Introduced a new modal for adding and managing models in the LLM Toolkit, allowing users to view and select recommended models with detailed information on RAM requirements.
- Enhanced the model download process with progress tracking and error handling, including the ability to cancel downloads.
- Implemented a new API endpoint to fetch available models from the Ollama library, caching the results for improved performance.
- Updated the server configuration to support streaming responses for model downloads, improving user feedback during long operations.
- Added new recommended models to the configuration, providing descriptions and size options for each model.
- Introduced WebSocket support for real-time updates during model operations, including download and deletion processes.
- Refactored JavaScript to enhance user interaction with model operations using SweetAlert2 for better UI feedback.
- Added new API endpoints and views to handle WebSocket connections and model operations.
- Updated server configuration to support ASGI and WebSocket connections, including changes to the entrypoint script and Nginx configuration.
- Enhanced the Django application with Channels for WebSocket communication, including necessary package updates and settings adjustments.
- Added setup for OllamaSettings in TestOllamaManager to initialize test environment.
- Enhanced test cases in TestOllamaManager to include additional mock setups and assertions for model management.
- Refactored TestLLMVulnerabilityReport and TestLLMAttackSuggestion to improve test coverage and clarity, including renaming methods and updating mock responses.
- Made minor adjustments to HTML templates and JavaScript comments for clarity and consistency.
- Removed unnecessary selected field from OllamaDetailManager view defaults.
- Removed the "Cancel Download" button and its associated functionality from the llm_toolkit.html template.
- Added a newline in the OllamaManager class in views.py for formatting consistency.
@psyray psyray changed the base branch from release/2.1.1 to release/2.2.0 November 17, 2024 22:54
Updated the UI text to include the model name in the download progress popup.
Modified API endpoint usage in JavaScript to include the selected model in the URL path.
- Improved URL handling by encoding model names in API requests across multiple JavaScript files.
- Modified the URL pattern in the Django application to support path parameters for model names.
@psyray psyray changed the title fix: refactor LLM model selection and attack surface analysis feat: refactor LLM model selection and attack surface analysis Nov 18, 2024
@psyray psyray requested a review from 0b3ud November 25, 2024 14:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working enhancement New feature or request
Projects
None yet
2 participants