
Conversation


@Alex-Wengg Alex-Wengg commented Nov 3, 2025


================================================================================
BENCHMARK RESULTS
================================================================================
   Dataset: librispeech test-clean
   Model: parakeet-v3
   Device: NPU
   Files processed: 2620
   fluidAudio v3 (multi) Average WER: 2.6%
   Average WER: 3.7%
   Median WER: 0.0%
   Average CER: 1.9%
   Median CER: 0.0%
   Median RTFx: 23.5x
   Overall RTFx: 25.7x (19452.5s / 756.4s)
   Benchmark runtime: 789.8s
Results saved to: v3.json
   Normalization: OpenAI Whisper English
================================================================================

================================================================================
BENCHMARK RESULTS
================================================================================
   Dataset: librispeech test-clean
   Model: parakeet-v2
   Device: NPU
   Files processed: 2620
   fluidAudio v2 (EN) Average WER: 2.2%  
   Average WER: 2.8%
   Median WER: 0.0%
   Average CER: 1.1%
   Median CER: 0.0%
   Median RTFx: 25.7x
   Overall RTFx: 27.4x (19452.5s / 711.2s)
   Benchmark runtime: 764.2s
Results saved to: v2_100.json
   Normalization: OpenAI Whisper English
================================================================================


claude bot commented Nov 3, 2025

Pull Request Review: Add Parakeet v3 Support

Overview

This PR adds support for the Parakeet v3 model alongside the existing v2 model, with improved benchmarking metrics (WER and CER). The implementation is generally solid with good performance results, but there are several areas that need attention.


Code Quality & Best Practices

Strengths:

  • Clean separation of model configurations in model_configs.hpp
  • Good use of const-correctness and inline constants
  • Proper use of structured bindings and modern C++ features
  • Consistent naming conventions
  • Helpful comments explaining format differences

Issues:

1. Magic Number Without Documentation (src/eddy_c.cpp:379)
The value 8192 is an undocumented magic number. It should be a documented named constant; consider extracting both blank token IDs, v2 (1024) and v3 (8192), to model_configs.hpp.

2. Silent Fallback Logic (src/eddy_c.cpp:379)
All non-8192 values default to v2, which could hide configuration errors. Should validate that blank_token_id matches expected values and fail explicitly on mismatch.

3. Windows-Only Legacy Path Handling (src/eddy_c.cpp:401-406)
Only checks for v2 legacy path, not v3. Should either support both or be documented as migration code with a deprecation plan.


Potential Bugs

1. CRITICAL: Vocabulary Size Validation Missing (src/models/parakeet-v2/tokenizer.cpp:24-34)
The v3 format includes vocab_size metadata, but it is never validated against the actual loaded vocabulary. If the vocab file is corrupted or truncated, the mismatch will not be detected. Add validation after loading to check vocab_.size() matches expected vocab_size.

2. Integer Overflow Risk (src/models/parakeet-v2/tokenizer.cpp:39)
std::stoul throws std::out_of_range for values larger than ULONG_MAX (and std::invalid_argument for non-numeric input). Wrap the call in a try-catch or validate the string before conversion.

3. Benchmark Metric Calculation Changed (benchmarks/benchmark.py:401-404)
The PR changes from corpus-level WER to per-file average WER. These are mathematically different metrics:

  • Per-file average weights all files equally
  • Corpus-level WER weights by word count (standard in ASR)
  • Impact: Benchmark results are not comparable to previous versions
  • Fix: Either keep corpus-level WER as primary metric or document this breaking change prominently

Performance Considerations

Positive:

  • Excellent benchmark results: v3 shows 23.5x median RTFx
  • Good WER/CER metrics (3.7% / 1.9% for v3)

Concerns:

1. Redundant String Construction (src/eddy_c.cpp:388-394)
Creates unnecessary string copy for comparison. Consider using std::string_view.

2. Double Iteration Over Vocabulary
First loop finds max_id, second loop populates vocab. Could be optimized to single pass if vocab loading is on critical path.


Security Concerns

1. Unchecked File Path Concatenation (src/eddy_c.cpp:412-416)
If model_dir comes from user input (which it can via config.model_dir), there is no path traversal validation. User could pass ../../sensitive_dir as model_dir. Fix: Validate that model_dir does not contain ../ or canonicalize the path before use.

2. No Input Validation on blank_token_id
C++ side (eddy_c.cpp:421) allows arbitrary positive values without validation.


Test Coverage

CRITICAL ISSUE: No automated tests for this functionality

The PR adds significant new functionality but includes:

  • No unit tests for v3 tokenizer format parsing
  • No integration tests for v3 model loading
  • No tests for blank_token_id inference logic
  • No tests for model_dir resolution with cache special value
  • No tests for vocabulary size validation
  • Only manual benchmark results in PR description

Recommendations:

  1. Add unit tests for Tokenizer::load() with both v2 and v3 format fixtures
  2. Add tests for edge cases: empty vocab, malformed JSON, wrong vocab_size
  3. Add tests for eddy_parakeet_create() with different blank_token_id values
  4. Add CI job to run benchmarks on subset of data to catch regressions

Additional Issues

1. Inconsistent Default Behavior
The C++ side silently falls back to v2 when blank_token_id is unrecognized, while the Python side defaults to v2 explicitly. These behaviors should align.

2. Missing Documentation

  • No updates to README.md about v3 support
  • No mention of how to use v3 in docs/

3. Breaking Change Not Highlighted
The WER calculation change (corpus-level to per-file average) is a breaking change that makes old benchmarks incomparable. This should be documented in the PR description and commit message.


Recommendations Summary

Must Fix (Blocking):

  1. Add validation for blank_token_id values (reject invalid values)
  2. Add vocabulary size validation for v3 format
  3. Fix or document the WER metric change
  4. Add basic unit tests for new functionality

Should Fix:

  1. Extract magic numbers (8192, 1024) to named constants
  2. Add path validation for user-provided model_dir
  3. Handle v3 legacy path on Windows or document v2-only behavior
  4. Update README.md with v3 usage instructions

Nice to Have:

  1. Optimize vocabulary loading (single-pass iteration)
  2. Use std::string_view to avoid unnecessary copies
  3. Add CI integration tests

Conclusion

This is a solid PR that successfully adds v3 support with good performance results. However, it has several security and correctness issues that should be addressed before merging. The lack of automated tests is concerning for a core feature addition.

Recommendation: Request changes - address critical issues (validation, tests, breaking change documentation) before merging.

Great work on the implementation! The code is clean and the benchmarks show excellent performance. With the suggested fixes, this will be a strong addition to the codebase.


Review generated with assistance from Claude Code

Member

@BrandonWeng BrandonWeng left a comment


nice

@Alex-Wengg Alex-Wengg marked this pull request as ready for review November 3, 2025 02:10
@Alex-Wengg Alex-Wengg merged commit 5bdbbb5 into main Nov 3, 2025
2 checks passed
@BrandonWeng BrandonWeng deleted the parakeet-v3 branch November 3, 2025 02:14