Conversation

@yashuatla (Owner) commented Jun 23, 2025

PR Summary

Refactor Model Configuration and Training System

Overview

This PR refactors the model configuration system to use a unified approach for base and thinking models, adds CUDA availability checking, and updates the data pipeline to use configuration objects instead of environment variables.
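
As orientation, here is a minimal sketch of what the unified configuration shape could look like, assuming the PR folds base and thinking model settings into shared interfaces. The names model_name, is_cot, and user_llm_config appear in this PR; every other field and type name below is illustrative.

```typescript
// Hypothetical sketch only: apart from model_name, is_cot, and
// user_llm_config, the names here are illustrative, not from the diff.
export interface ModelConfig {
  model_name: string; // renamed from `baseModel` in this PR
  api_key?: string;   // assumed field
  endpoint?: string;  // assumed field
}

export interface ThinkingModelConfig extends ModelConfig {
  is_cot: boolean; // chain-of-thought flag added in this PR
}

// Combined object replacing per-variable .env lookups in the data pipeline.
export interface UserLlmConfig {
  base: ModelConfig;
  thinking?: ThinkingModelConfig;
}
```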

Change Types

| Type        | Description                                                    |
| ----------- | -------------------------------------------------------------- |
| Enhancement | Added CUDA availability checking functionality                  |
| Refactor    | Unified model configuration interfaces and updated stores       |
| Refactor    | Removed environment-variable dependencies in the data pipeline  |

Affected Modules

| Module / File                | Change Description                                                                 |
| ---------------------------- | ---------------------------------------------------------------------------------- |
| .eslintrc.js                 | Added TypeScript and Node.js import-resolver configuration                          |
| src/service/modelConfig.ts   | Refactored model config interfaces and added a thinking-config update function      |
| src/service/train.ts         | Added a CUDA availability check and refactored training interfaces (sketch below)  |
| src/store/*Store.ts          | Updated model config and training stores with new properties and methods           |
| L2/data_pipeline/data_prep/* | Removed .env dependencies and switched to the user_llm_config object               |
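
Below is a minimal sketch of how the CUDA availability check in src/service/train.ts might be surfaced to the frontend; the route, response shape, and function name are assumptions, not code from the diff.

```typescript
// Hypothetical response shape and route; the actual PR may expose
// CUDA availability differently.
export interface CudaStatus {
  cuda_available: boolean;
}

export async function checkCudaAvailability(): Promise<boolean> {
  const res = await fetch('/api/cuda/status'); // assumed endpoint
  if (!res.ok) {
    throw new Error(`CUDA status request failed: ${res.status}`);
  }
  const data = (await res.json()) as CudaStatus;
  return data.cuda_available;
}
```

Callers can then disable GPU-only training options when this resolves to false.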

Notes for Reviewers

  • The localStorage key changed from 'trainingConfig' to 'trainingParams' (see the migration sketch after this list)
  • The model name property changed from 'baseModel' to 'model_name'
  • Training status handling has been updated with new suspended/failed states
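
A hedged sketch of a one-time migration covering the first two notes; the helper name and the stored payload's shape are hypothetical.

```typescript
// Hypothetical migration helper; assumes the old 'trainingConfig'
// payload stored the model name under `baseModel`.
export function migrateTrainingParams(): void {
  const legacy = localStorage.getItem('trainingConfig');
  if (legacy === null) return;

  try {
    const parsed = JSON.parse(legacy) as Record<string, unknown>;
    const { baseModel, ...rest } = parsed;
    // Map the renamed property; keep everything else as-is.
    const migrated = { ...rest, model_name: parsed['model_name'] ?? baseModel };
    localStorage.setItem('trainingParams', JSON.stringify(migrated));
  } finally {
    // Drop the stale key either way so old state cannot resurface.
    localStorage.removeItem('trainingConfig');
  }
}
```

Running this once at store initialization would keep clients upgrading from older builds from losing their saved parameters.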

yexiangle and others added 24 commits April 23, 2025 20:46
* fix: fix L1 save problem

* fix:simplify the code

* fix: delete unused import

* fix: delete unused data
* fix: password update logic when there is more than one load

* update fix
* fix: modify thinking_model loading configuration

* feat: realize thinkModel ui

* feat: store

* feat: add combined_llm_config_dto

* add thinking_model_config & database migration

* directly add thinking model to user_llm_config

* delete thinking model repo dto service

* delete thinkingmodel table migration

* add is_cot config

* feat: allow defining is_cot

* feat: simplify logs info

* feat: add training model

* feat: fix is_cot problem

* fix: fix chat message

* fix: fix progress error

* fix: disable no settings thinking

* feat: add thinking warning

* fix: fix start service error

* feat: fix trainparams init problem

* feat: change playGround prompt

* feat: Add Dimension Mismatch Handling for ChromaDB (mindverse#157) (mindverse#207)

* Fix Issue mindverse#157

Added chroma_utils.py to manage ChromaDB and added explanatory docs

* Add logging and debugging process

- Enhanced the `reinitialize_chroma_collections` function in `chroma_utils.py` to properly check whether collections exist before attempting to delete them, preventing potential errors when collections don't exist.
- Improved error handling in the `_handle_dimension_mismatch` method in `embedding_service.py` by adding more robust exception handling and verification steps after reinitialization.
- Enhanced the collection initialization process in `embedding_service.py` to provide more detailed error messages and better handle cases where collections still have incorrect dimensions after reinitialization.
- Added additional verification steps to ensure that collection dimensions match the expected dimension after creation or retrieval.
- Improved logging throughout the code to provide more context in error messages, making debugging easier. (A schematic sketch of this flow follows.)
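
A schematic sketch (in TypeScript, to match the other snippets in this thread) of the check-before-delete and verify-after-reinitialize flow described above; VectorStore is a hypothetical interface, not the real ChromaDB client API.

```typescript
// Hypothetical store interface standing in for the ChromaDB client.
interface Collection {
  name: string;
  dimension: number;
}

interface VectorStore {
  listCollections(): Promise<Collection[]>;
  deleteCollection(name: string): Promise<void>;
  createCollection(name: string, dimension: number): Promise<Collection>;
}

async function reinitializeCollection(
  store: VectorStore,
  name: string,
  expectedDim: number,
): Promise<Collection> {
  // Check existence before deleting, so a missing collection is not an error.
  const existing = await store.listCollections();
  if (existing.some((c) => c.name === name)) {
    await store.deleteCollection(name);
  }

  const created = await store.createCollection(name, expectedDim);

  // Verify the dimension after reinitialization, as the commit describes.
  if (created.dimension !== expectedDim) {
    throw new Error(
      `Collection ${name} has dimension ${created.dimension}, expected ${expectedDim}`,
    );
  }
  return created;
}
```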

* Change topics_generator timeout to 30 (mindverse#263)

* quick fix

* fix: shade -> shade_merge_info (mindverse#265)

* fix: shade -> shade_merge_info

* add convert array

* quick fix import error

* add log

* add heartbeat

* new strategy

* sse version

* add heartbeat

* zh to en

* optimize code

* quick fix convert function

* Feat/new branch management (mindverse#267)

* feat: new branch management

* feat: fix multi-upload

* optimize contribute management

---------

Co-authored-by: Crabboss Mr <1123357821@qq.com>
Co-authored-by: Ye Xiangle <yexiangle@mail.mindverse.ai>
Co-authored-by: Xinghan Pan <sampan090611@gmail.com>
Co-authored-by: doubleBlack2 <108928143+doubleBlack2@users.noreply.github.com>
Co-authored-by: kevin-mindverse <kevin@mindverse.ai>
Co-authored-by: KKKKKKKevin <115385420+kevin-mindverse@users.noreply.github.com>
* feat: replace tutorial link

* replace video link

---------

Co-authored-by: kevin-mindverse <kevin@mindverse.ai>
* Add CUDA support

- CUDA detection
- Memory handling
- Ollama model release after training

* Fix logging issue

Added a CUDA support flag so the log accurately reflects the CUDA toggle

* Update llama.cpp rebuild

Changed llama.cpp to rebuild only during the first build when CUDA support is enabled, rather than on every run

* Improved VRAM management

Enabled memory pinning and optimizer state offload

* Fix CUDA check

Rewrote the llama.cpp rebuild logic and added a manual y/n toggle for users who want to enable CUDA support

* Added fast restart and fixed CUDA check command

Added make docker-restart-backend-fast to restart the backend and pick up code changes without triggering a full llama.cpp rebuild

Fixed the make docker-check-cuda command to correctly report CUDA support

* Added docker-compose.gpu.yml

Added docker-compose.gpu.yml to fix an error on machines without an NVIDIA GPU, and made sure "\n" is appended before modifying .env

* Fixed cuda toggle

The last push accidentally broke the CUDA toggle

* Code review fixes

Fixed errors resulting from removed code:
- Added `return save_path` to the end of the save_hf_model function
- Rolled back the download_file_with_progress function

* Update Makefile

Use CUDA by default when using docker-restart-backend-fast

* Minor cleanup

Removed an unnecessary Makefile command and fixed GPU logging

* Delete .gpu_selected

* Simplified cuda training code

- Removed the dtype setting to let torch handle it automatically
- Removed VRAM logging
- Removed unnecessary/old comments

* Fixed gpu/cpu selection

Made "make docker-use-gpu/cpu" command work with .gpu_selected flag and changed "make docker-restart-backend-fast" command to respect flag instead of always using gpu

* Fix Ollama embedding error

Added a custom exception class for Ollama embeddings, which seemed to return keyword arguments while the Python exception class only accepts positional ones

* Fixed model selection & memory error

Fixed training defaulting to the 0.5B model regardless of selection, and fixed the "free(): double free detected in tcache 2" error caused by the CUDA flag being passed incorrectly
…rse#279)

* feature: use uv to set up the Python environment

* TrainProcessService: add singleton method get_instance

* feat: fix code

* Added CUDA support (mindverse#228)

* fix: train service singleton

---------

Co-authored-by: Zachary Pitroda <30330004+zpitroda@users.noreply.github.com>
* fix: adjust status order

* fix: adjust train status

* fix: split service status from train status
* Update README.md

Changed to the updated tutorial link

* Update README.md with FAQ

New section for FAQ doc

* feat: adjust the train rule
* feat: what? no llama.cpp

* add cache
      });
    })
    .catch((error) => {
      console.error(error.message || 'Failed to fetch model config');

🐛 Correctness Issue

Silent API Error Handling.

The fetchModelConfig function fails silently: it only logs errors to the console without propagating them, which could hide API failures from the UI.

Current Code (Diff):

-         console.error(error.message || 'Failed to fetch model config');
+         console.error(error.message || 'Failed to fetch model config');
+         throw error; // Propagate error to caller
📝 Committable suggestion


Suggested change
- console.error(error.message || 'Failed to fetch model config');
+ console.error(error.message || 'Failed to fetch model config');
+ throw error; // Propagate error to caller
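
For context, here is the suggested fix shown as a complete function; the surrounding code is reconstructed from the excerpt above, and the route and ModelConfig shape are hypothetical.

```typescript
// Reconstructed sketch, not the exact source being reviewed.
interface ModelConfig {
  model_name: string;
}

export async function fetchModelConfig(): Promise<ModelConfig> {
  try {
    const res = await fetch('/api/model-config'); // assumed endpoint
    if (!res.ok) throw new Error(`HTTP ${res.status}`);
    return (await res.json()) as ModelConfig;
  } catch (error) {
    console.error((error as Error).message || 'Failed to fetch model config');
    throw error; // propagate so callers and the UI can react
  }
}
```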

const preStatus = get().status;

// Only trained and running can be interchanged.
if (statusRankMap[status] < statusRankMap[preStatus]) {

🐛 Correctness Issue

Potential undefined property access.

Accessing statusRankMap[status] without checking that status exists in the map could cause runtime errors if an invalid status is provided.

Current Code (Diff):

-     if (statusRankMap[status] < statusRankMap[preStatus]) {
+     if (status in statusRankMap && preStatus in statusRankMap && statusRankMap[status] < statusRankMap[preStatus]) {
📝 Committable suggestion


Suggested change
- if (statusRankMap[status] < statusRankMap[preStatus]) {
+ if (status in statusRankMap && preStatus in statusRankMap && statusRankMap[status] < statusRankMap[preStatus]) {
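
The guarded comparison in a self-contained form; the status names beyond the suspended/failed states mentioned in the PR notes, and their relative ranks, are assumptions.

```typescript
type TrainStatus = 'failed' | 'suspended' | 'trained' | 'running';

// Assumed ranks for illustration; the real map lives in the store.
const statusRankMap: Record<TrainStatus, number> = {
  failed: 0,
  suspended: 1,
  trained: 2,
  running: 3,
};

function shouldRejectTransition(status: string, preStatus: string): boolean {
  // Guard against statuses missing from the map before comparing ranks,
  // as the suggestion above recommends.
  if (!(status in statusRankMap) || !(preStatus in statusRankMap)) {
    return true; // unknown status: reject rather than crash
  }
  return (
    statusRankMap[status as TrainStatus] <
    statusRankMap[preStatus as TrainStatus]
  );
}
```

Rejecting unknown statuses outright, rather than defaulting them to a rank, keeps an invalid update from silently overriding a valid state.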
