@PStarH commented May 26, 2025

Email Error Reporting System Implementation

Overview

This PR adds an email error reporting system for the training process. It monitors the model download, weight merging, and model conversion stages and sends a detailed notification when one of them fails, so problems can be detected and resolved quickly.

System Architecture

Email Service Implementation

  • Created a new EmailService class to handle all email notifications
  • Implemented HTML email template system for formatted error messages
  • Added support for log file attachments and system resource information
  • Configured SMTP settings for reliable email delivery
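As a rough illustration of the pieces listed above, the following is a minimal sketch of what such a service could look like using only the Python standard library. It is not the PR's actual code: `SmtpConfig`, `build_error_email`, `send`, and all field names are assumptions made for the example, and only the class name `EmailService` comes from the description.

```python
import smtplib
from dataclasses import dataclass
from email.message import EmailMessage
from html import escape
from pathlib import Path
from typing import Optional


@dataclass
class SmtpConfig:
    # Hypothetical settings container; the PR's real configuration keys are not shown.
    host: str
    port: int
    username: str
    password: str
    sender: str
    recipient: str


class EmailService:
    """Illustrative stand-in for the PR's EmailService, not its actual code."""

    def __init__(self, config: SmtpConfig) -> None:
        self.config = config

    def build_error_email(self, model_name: str, step: str, error: str,
                          log_path: Optional[Path] = None) -> EmailMessage:
        msg = EmailMessage()
        msg["Subject"] = f"[Training error] {model_name}: {step}"
        msg["From"] = self.config.sender
        msg["To"] = self.config.recipient
        # Plain-text body first, then an HTML alternative for formatted display.
        msg.set_content(f"Model: {model_name}\nStep: {step}\nError: {error}")
        html = (
            "<html><body><h2>Training error</h2>"
            f"<p><b>Model:</b> {escape(model_name)}</p>"
            f"<p><b>Step:</b> {escape(step)}</p>"
            f"<pre>{escape(error)}</pre></body></html>"
        )
        msg.add_alternative(html, subtype="html")
        # Attach the log file only when it actually exists.
        if log_path is not None and log_path.is_file():
            msg.add_attachment(log_path.read_bytes(), maintype="application",
                               subtype="octet-stream", filename=log_path.name)
        return msg

    def send(self, msg: EmailMessage) -> None:
        # Assumes the server supports STARTTLS and password login.
        with smtplib.SMTP(self.config.host, self.config.port, timeout=30) as server:
            server.starttls()
            server.login(self.config.username, self.config.password)
            server.send_message(msg)
```

Separating message construction from sending keeps the template logic testable without a live SMTP server.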

Major Features

1. Model Download Stage

  • Added error notifications in the _download_model method:
    • When the model path doesn't exist after download
    • When exceptions occur during the download process
    • Each notification includes the model name, error message, and the specific step context
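The two download failure cases above can be sketched as follows. This is an assumption-laden illustration, not the body of `_download_model`: the `download` and `notify` callables and the function name `download_with_notification` are hypothetical placeholders for whatever the PR actually uses.

```python
import os
from typing import Callable


def download_with_notification(model_name: str, model_dir: str,
                               download: Callable[[str, str], None],
                               notify: Callable[..., None]) -> str:
    """Run a download and report either failure mode exactly once."""
    step = "model download"
    try:
        download(model_name, model_dir)
    except Exception as exc:
        # Case 1: the download itself raised.
        notify(model_name=model_name, error=str(exc), step=step)
        raise
    if not os.path.exists(model_dir):
        # Case 2: the download returned, but the model path is missing.
        message = f"Model path does not exist after download: {model_dir}"
        notify(model_name=model_name, error=message, step=step)
        raise FileNotFoundError(message)
    return model_dir
```

The path check sits outside the `try` block so that a missing path does not trigger a second notification through the exception handler.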

2. Weight Merging Stage

  • Added error notifications in the merge_weights method:
    • When the base model path doesn't exist
    • When the personal training model output path doesn't exist
    • When merged model files are not found
    • Each notification includes the model name, error message, process step context, and the log file (if available)
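One way to express the path checks above is a small preflight helper that reports every missing path at once and only attaches a log when it exists. The helper names `missing_paths` and `existing_log` are hypothetical and not taken from the PR; this is a sketch of the pattern, not the `merge_weights` implementation.

```python
from pathlib import Path
from typing import Dict, List, Optional


def missing_paths(required: Dict[str, Path]) -> List[str]:
    """Return one human-readable message per required path that is absent."""
    return [
        f"{label} not found: {path}"
        for label, path in required.items()
        if not path.exists()
    ]


def existing_log(log_path: Optional[Path]) -> Optional[Path]:
    """Return the log path only if it points at a real file ("if available")."""
    if log_path is not None and log_path.is_file():
        return log_path
    return None
```

A caller could pass the base model path and the training output path in one dictionary, then send a single notification whose error message joins the returned strings.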

3. Model Conversion Stage

  • Added error notifications in the convert_model method:
    • When merged model output doesn't exist
    • When model conversion script execution fails
    • When GGUF model file is not found
    • Each notification includes the model name, error message, process step context, and the log file (if available)
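The script failure and missing GGUF cases could be handled along these lines. This is a hedged sketch: `run_conversion`, its `command` argument, and the two-argument `notify` callback are assumptions for illustration, and the actual conversion script and its arguments are not shown in this description.

```python
import subprocess
from pathlib import Path
from typing import Callable, List


def run_conversion(command: List[str], gguf_path: Path,
                   notify: Callable[[str, str], None]) -> Path:
    """Run a conversion command and verify that the GGUF file was produced."""
    step = "model conversion"
    # capture_output collects stderr so it can be included in the notification.
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        error = (f"Conversion script exited with code {result.returncode}: "
                 f"{result.stderr.strip()}")
        notify(step, error)
        raise RuntimeError(error)
    if not gguf_path.is_file():
        # The script reported success but did not write the expected output.
        error = f"GGUF model file not found: {gguf_path}"
        notify(step, error)
        raise FileNotFoundError(error)
    return gguf_path
```

Checking for the output file separately matters because a script can exit with code 0 without having written anything.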

Error Notification Features

Content Structure

Each error notification email includes:

  • Model name and version information
  • Detailed error message with stack trace
  • Specific step and context where the error occurred
  • System resource information (e.g., memory usage, disk space)
  • Related log files (when available)
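To make the content structure above concrete, here is a minimal sketch of collecting those fields and rendering them as HTML. The function names and dictionary keys are hypothetical. Only disk space is gathered, because the standard library has no portable memory-usage call; the PR may well use a different source (for example a third-party package) for memory figures.

```python
import shutil
import traceback
from html import escape
from typing import Dict


def collect_error_context(model_name: str, step: str, exc: BaseException,
                          disk_path: str = ".") -> Dict[str, str]:
    """Gather the fields an error email would report for one failure."""
    usage = shutil.disk_usage(disk_path)
    gib = 1024 ** 3
    return {
        "Model": model_name,
        "Step": step,
        "Error": str(exc),
        # Full stack trace of the exception, including its type and message.
        "Stack trace": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
        "Disk free": f"{usage.free / gib:.1f} GiB of {usage.total / gib:.1f} GiB",
    }


def render_html(context: Dict[str, str]) -> str:
    """Render the context as a simple HTML table, escaping every value."""
    rows = "".join(
        f"<tr><th>{escape(key)}</th><td><pre>{escape(value)}</pre></td></tr>"
        for key, value in context.items()
    )
    return f"<table>{rows}</table>"
```

Escaping each value keeps error messages that contain angle brackets or ampersands from breaking the email markup.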

Smart Log Handling

  • Automatic extraction of relevant log sections
  • Intelligent truncation of large log files
  • Highlighting of error-related log entries
  • Support for multiple log file attachments
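The log handling described above might look roughly like the sketch below. The keyword pattern, the context window sizes, the `>> ` marker, and the truncation limit are all assumptions chosen for the example; the PR does not state which rules it uses to pick or truncate log sections.

```python
import re
from typing import List

# Assumed keywords for spotting error-related lines.
ERROR_PATTERN = re.compile(r"error|exception|traceback", re.IGNORECASE)


def extract_error_context(lines: List[str], before: int = 2,
                          after: int = 2) -> List[str]:
    """Keep error lines plus nearby context, marking the error lines."""
    keep = set()
    for index, line in enumerate(lines):
        if ERROR_PATTERN.search(line):
            start = max(0, index - before)
            stop = min(len(lines), index + after + 1)
            keep.update(range(start, stop))
    return [
        (">> " if ERROR_PATTERN.search(lines[i]) else "   ") + lines[i]
        for i in sorted(keep)
    ]


def truncate_tail(text: str, max_chars: int = 20000) -> str:
    """Keep only the end of a large log, where failures usually appear."""
    if len(text) <= max_chars:
        return text
    return "[... log truncated ...]\n" + text[-max_chars:]
```

For example, `extract_error_context(["a", "b", "ERROR x", "c", "d"], before=1, after=1)` returns the three lines around the error, with the error line prefixed by `>> `.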

Testing

Automated Tests

Implemented unit tests for:

  • Email service configuration
  • Template rendering
  • Log file handling
  • Error message formatting
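A unit test for the sending path can avoid any network access by patching `smtplib.SMTP`. The sketch below shows the idea under stated assumptions: `send_message` is a hypothetical minimal sender written for this example, not the PR's code, and the host name is a placeholder.

```python
import smtplib
import unittest
from email.message import EmailMessage
from unittest import mock


def send_message(host: str, port: int, msg: EmailMessage) -> None:
    # Hypothetical minimal sender used only to demonstrate the test pattern.
    with smtplib.SMTP(host, port) as server:
        server.send_message(msg)


class SendMessageTest(unittest.TestCase):
    def test_send_uses_smtp_without_network(self) -> None:
        msg = EmailMessage()
        msg["Subject"] = "Training error"
        msg.set_content("Model conversion failed")

        # Replace the SMTP class so no real connection is opened.
        with mock.patch("smtplib.SMTP") as smtp_cls:
            send_message("smtp.example.com", 587, msg)

        smtp_cls.assert_called_once_with("smtp.example.com", 587)
        # The object yielded by the `with` statement is the mocked server.
        server = smtp_cls.return_value.__enter__.return_value
        server.send_message.assert_called_once_with(msg)
```

Saved in a test module, this runs with `python -m unittest`.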

Scenario Testing

The following scenarios have been tested:

  1. Model download failure
  2. Base model or training output missing
  3. Weight merging failure
  4. Model conversion failure
  5. GGUF file generation failure

All scenarios correctly send error notification emails with complete information.

Impact

This new email notification system will:

  • Provide immediate alerts for training process issues
  • Deliver comprehensive error context for faster debugging
  • Reduce system monitoring overhead
  • Enable proactive problem resolution
  • Improve overall training process reliability and maintainability

Future Enhancements

Potential future improvements:

  • Customizable notification templates
  • Error categorization and priority levels
  • Dashboard integration for error tracking
  • Advanced log analysis features

kevin-mindverse and others added 16 commits May 13, 2025 14:20
…rse#353)

* Join AI Network -> Export your Second Me

* Default Synthesis Mode -> high
Default Epoch ->  3

* Set Thinking-Mode Default Value

* Better Display Of ReadMe

* default value of thinking mode

* Set Default value of enableL0Retrival to false
* optimize doc

* better code

---------

Co-authored-by: kevinaimonster <kevinaimonster@gmail.com>
* feat: file display and deletion related

* fix: file saving logic
Co-authored-by: kevinaimonster <kevinaimonster@gmail.com>
…erse#360)

* Provide a selection of Chinese mirror websites for Huggingface

* Remove default values
* feat:add current file to model download progress

* git ignore yarn

* feat: show current downloading

---------

Co-authored-by: Ye Xiangle <yexiangle@mail.mindverse.ai>
Co-authored-by: kevinaimonster <kevinaimonster@gmail.com>
* feature: add China image configuration options

* feature: update Python and Poetry mirror configurations to use Tsinghua and Aliyun

* feature: improve Poetry mirror configuration to check for existing sources

* feature: rename setup-china to setup-cn and update pip mirror to Aliyun

* Add MD file

* Modify GitHub download

* Change to Chinese Mainland

---------

Co-authored-by: yanmuyuan <2216646664@qq.com>