
Conversation

Contributor

Copilot AI commented Dec 24, 2025

Two issues causing analysis instability: (1) PostgreSQL rejecting artist names containing NULL bytes from corrupted metadata, (2) ONNX Runtime BFCArena allocation failures from GPU memory fragmentation accumulating across track analyses.

Changes

New utility module tasks/memory_utils.py:

  • sanitize_string_for_db() - strips NULL bytes and control characters before database writes
  • cleanup_cuda_memory() - forces CUDA cache clearing and garbage collection
  • cleanup_onnx_session() - explicit session disposal with immediate GC
  • handle_onnx_memory_error() - detects allocation errors, triggers cleanup, enables retry
  • SessionRecycler - recreates sessions every N tracks to prevent cumulative leaks
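A minimal sketch of how the first two helpers might look. The function names match the list above, but the bodies here are illustrative only, not the actual implementations in tasks/memory_utils.py:

```python
import gc
import re

# Control characters (including NUL) that PostgreSQL text literals reject;
# \t and \n are kept as ordinary whitespace.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def sanitize_string_for_db(value):
    """Strip NUL bytes and other control characters before a DB write."""
    if value is None:
        return None
    return _CONTROL_CHARS.sub("", value)


def cleanup_cuda_memory(force=False):
    """Run Python GC, then release cached CUDA blocks if torch is available."""
    gc.collect()
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
            if force:
                torch.cuda.synchronize()
    except ImportError:
        pass  # CPU-only environment: GC alone is enough
```

Note that stripping control characters is deliberately lossless otherwise: legitimate Unicode in artist names passes through untouched.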

Database string sanitization:

  • Applied to app_helper_artist.py before all artist name insertions
  • Example: "Tyler\x00YoungBoy\x00Ty Dolla" → "TylerYoungBoyTy Dolla"
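At the call site, the fix is a one-line guard before the upsert. The sketch below is hypothetical: the real SQL in app_helper_artist.py is not shown in this summary, so the table and column names are placeholders, and the sanitizer is inlined to keep the example self-contained:

```python
def sanitize_string_for_db(value):
    """Strip NUL and other control characters (mirrors the helper in
    tasks/memory_utils.py; inlined here for a self-contained sketch)."""
    if value is None:
        return None
    return "".join(ch for ch in value if ch in "\t\n" or 0x20 <= ord(ch) != 0x7f)


def upsert_artist_mapping(cursor, artist_name, artist_id):
    # Sanitize before the string ever reaches the SQL layer, so the
    # "string literal cannot contain NUL (0x00)" error cannot occur.
    clean = sanitize_string_for_db(artist_name)
    cursor.execute(
        "INSERT INTO artist_map (name, artist_id) VALUES (%s, %s) "
        "ON CONFLICT (name) DO UPDATE SET artist_id = EXCLUDED.artist_id",
        (clean, artist_id),
    )
```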

Memory management in analyzers:

  • tasks/clap_analyzer.py, tasks/mulan_analyzer.py - error detection with single retry after cleanup
  • tasks/analysis.py - session recycling every 20 tracks, CUDA cleanup after albums
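The recycling pattern itself is simple: count uses, and rebuild the session once the counter hits the threshold. A hypothetical sketch of SessionRecycler (the real class wraps an onnxruntime.InferenceSession; here the factory is any zero-argument callable):

```python
import gc


class SessionRecycler:
    """Recreate an expensive session every N uses to cap cumulative leaks."""

    def __init__(self, factory, recycle_every=20):
        self._factory = factory          # zero-arg callable building a session
        self._recycle_every = recycle_every
        self._session = None
        self._uses = 0

    def get(self):
        """Return the current session, recreating it when the budget is spent."""
        if self._session is None or self._uses >= self._recycle_every:
            self._session = None         # drop the only reference...
            gc.collect()                 # ...and collect so GPU buffers free now
            self._session = self._factory()
            self._uses = 0
        self._uses += 1
        return self._session
```

In tasks/analysis.py the threshold of 20 tracks is a trade-off: low enough to keep fragmentation bounded, high enough that session construction cost stays negligible per track.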

Usage:

from tasks.memory_utils import handle_onnx_memory_error, cleanup_cuda_memory

try:
    outputs = session.run(None, inputs)
except Exception as e:
    if handle_onnx_memory_error(e, "segment_processing"):
        cleanup_cuda_memory(force=True)
        outputs = session.run(None, inputs)  # Retry once
    else:
        raise

Root cause: GPU memory fragmenting from incomplete cleanup between inferences. BFCArena allocator fails to find contiguous blocks despite sufficient total memory. Python GC timing doesn't align with CUDA memory lifecycle. Solution: explicit cleanup at strategic points + periodic session recreation.

Documentation: See docs/MEMORY_FIXES.md for implementation details and monitoring guidance.

Original prompt

In @NeptuneHub/AudioMuse-AI, on the devel branch, we added some fixes to try to resolve a memory allocation error in the CLAP model when running on GPU, but different users are still reporting that it keeps happening randomly, and even though the analysis of other songs continues afterwards, it is annoying. One user shared this just today:

Seems to be working. These errors appear every so often but don't break the analysis process:

Failed to upsert artist mapping for 'Tyler, The CreatorYoungBoy Never Broke AgainTy Dolla $ign': A string literal cannot contain NUL (0x00) characters.

2025-12-23 18:57:48.061508042 [E:onnxruntime:, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running FusedConv node. Name:'Conv__132' Status Message: /onnxruntime_src/onnxruntime/core/framework/bfc_arena.cc:376 void onnxruntime::BFCArena::AllocateRawInternal(size_t, bool, onnxruntime::Stream, bool, onnxruntime::WaitNotificationFn) Failed to allocate memory for requested buffer of size 1125366016

DO RESEARCH ON THE INTERNET.
Analyze the error deeply! Try to think differently and understand WHY this error keeps happening. Are there multiple reasons? Is the file name or some metadata causing a crash, after which the memory is not freed up? Probably we are focusing on the effect and not on the cause!



Copilot AI self-assigned this Dec 24, 2025
Copilot AI and others added 4 commits December 24, 2025 21:57
- Created memory_utils.py with CUDA cleanup and session recycling
- Added string sanitization to prevent NUL byte errors in PostgreSQL
- Updated app_helper_artist.py to sanitize artist names
- Enhanced CLAP analyzer with memory error handling and retry logic
- Added explicit CUDA memory cleanup after operations
- Implemented session recycling to prevent memory accumulation
- All changes tested and verified

Co-authored-by: NeptuneHub <171392533+NeptuneHub@users.noreply.github.com>
- Enhanced mulan_analyzer.py with memory error handling
- Added explicit CUDA cleanup after operations
- Improved session cleanup in unload_mulan_model()
- Added retry logic for memory allocation errors
- Cleanup intermediate tensors during processing
- All syntax validated and tests pass

Co-authored-by: NeptuneHub <171392533+NeptuneHub@users.noreply.github.com>
- Created docs/MEMORY_FIXES.md with detailed explanations
- Documented root causes of both issues
- Provided usage examples and testing procedures
- Added monitoring guidance and future improvements
- Complete reference for understanding the fixes

Co-authored-by: NeptuneHub <171392533+NeptuneHub@users.noreply.github.com>
- Created PR_SUMMARY.md with quick overview
- Includes code examples and usage
- Testing status and expected impact
- Review checklist for maintainers
- Ready for final review and testing

Co-authored-by: NeptuneHub <171392533+NeptuneHub@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix memory allocation errors on clap model for GPU Fix PostgreSQL NUL byte errors and ONNX GPU memory allocation failures Dec 24, 2025
Copilot AI requested a review from NeptuneHub December 24, 2025 22:04
@NeptuneHub
Owner

@copilot I would address this PR against devel branch

Contributor Author

Copilot AI commented Dec 24, 2025

@copilot I would address this PR against devel branch

I understand you'd like this PR targeted at the devel branch. However, I can't modify the PR's target branch directly; that needs to be changed through the GitHub UI.

The current branch is based on commit 0e675db (tag v0.8.3) which shows "Merge pull request #241 from NeptuneHub/devel".

Could you please:

  1. Update the PR target branch to devel through GitHub, OR
  2. Let me know if you'd like me to create a new branch based on devel, if that branch exists in the repository.

Note: The devel branch is not currently available in this clone - only origin/copilot/fix-memory-allocation-errors is present.
