@Alex-Wengg Alex-Wengg commented Nov 3, 2025

FLEURS Multilingual ASR Benchmark Results
Language                  | WER     | Ref WER  | CER     | RTFx    | Samples
────────────────────────────────────────────────────────────────────────────────────────────────
Bulgarian (Bulgaria)      | 16.8%   | 12.6%    | 4.7%    | 41.7x   | 350
Czech (Czechia)           | 18.5%   | 11.0%    | 5.3%    | 43.1x   | 350
Danish (Denmark)          | 25.4%   | 18.4%    | 9.3%    | 44.0x   | 930
German (Germany)          | 7.4%    | 5.0%     | 2.9%    | 42.8x   | 350
Greek (Greece)            | 42.7%   | 20.7%    | 15.0%   | 37.2x   | 650
English (US)              | 6.1%    | 4.9%     | 3.0%    | 41.9x   | 350
Spanish (Spain)           | 5.4%    | 3.5%     | 2.8%    | 43.1x   | 350
Estonian (Estonia)        | 20.8%   | 17.7%    | 4.9%    | 43.4x   | 893
Finnish (Finland)         | 16.8%   | 13.2%    | 3.7%    | 41.5x   | 918
French (France)           | 7.7%    | 5.2%     | 3.2%    | 40.6x   | 350
Croatian (Croatia)        | 17.8%   | 12.5%    | 5.8%    | 41.0x   | 350
Hungarian (Hungary)       | 20.7%   | 15.7%    | 6.4%    | 41.1x   | 905
Italian (Italy)           | 4.3%    | 3.0%     | 2.1%    | 43.6x   | 350
Lithuanian (Lithuania)    | 24.6%   | 20.4%    | 6.7%    | 40.4x   | 986
Latvian (Latvia)          | 30.6%   | 22.8%    | 8.1%    | 42.6x   | 851
Maltese (Malta)           | 25.3%   | 20.5%    | 9.2%    | 41.3x   | 926
Dutch (Netherlands)       | 9.8%    | 7.5%     | 3.3%    | 37.5x   | 350
Polish (Poland)           | 10.5%   | 7.3%     | 3.1%    | 37.3x   | 350
Romanian (Romania)        | 17.5%   | 12.4%    | 5.9%    | 38.9x   | 883
Russian (Russia)          | 9.9%    | 5.5%     | 2.5%    | 39.7x   | 350
Slovak (Slovakia)         | 11.1%   | 8.8%     | 3.5%    | 43.7x   | 350
Slovenian (Slovenia)      | 28.1%   | 24.0%    | 9.4%    | 38.7x   | 834
Swedish (Sweden)          | 18.9%   | 15.1%    | 5.6%    | 41.5x   | 759
Ukrainian (Ukraine)       | 10.7%   | 6.8%     | 2.9%    | 39.3x   | 350
────────────────────────────────────────────────────────────────────────────────────────────────
AVERAGE                   | 17.0%   | 12.7%    | 5.4%    | 41.1x   | 24 langs
────────────────────────────────────────────────────────────────────────────────────────────────


claude bot commented Nov 3, 2025

Code Review: FLEURS Multilingual ASR Benchmark

Thank you for this comprehensive benchmark implementation! This is a well-structured addition that brings multilingual testing capabilities to the project. Here's my detailed review:


Strengths

Code Quality & Structure

  • Excellent documentation: Clear module docstrings, comprehensive comments referencing the Swift implementation source
  • Well-organized architecture: Clean separation between Python orchestration and C++ native implementation
  • Consistent with codebase style: Follows the existing benchmark patterns established in benchmarks/benchmark.py
  • Type hints: Good use of typing throughout the Python code

Implementation Quality

  • Dual implementation approach: Smart to provide both Python and C++ (2-3x faster) versions
  • Proper error handling: Good exception handling with informative error messages
  • Resource management: Proper cleanup in finally block (benchmark_fleurs.py:872-874)
  • Incremental download: Smart validation and resume logic for corrupted/partial downloads
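The resume logic called out above can be sketched roughly as follows; the function name, arguments, and use of an HTTP Range request are illustrative assumptions, not the PR's actual helpers:

```python
import os
import urllib.request

def download_with_resume(url, dest_path, expected_size=None):
    """Resume a partial download by requesting only the missing byte range."""
    start = os.path.getsize(dest_path) if os.path.exists(dest_path) else 0
    if expected_size is not None and start >= expected_size:
        return  # file already complete, skip the network round trip
    req = urllib.request.Request(url)
    if start > 0:
        req.add_header("Range", f"bytes={start}-")  # ask for the tail only
    with urllib.request.urlopen(req) as resp, open(dest_path, "ab") as f:
        while True:
            chunk = resp.read(8192)
            if not chunk:
                break
            f.write(chunk)
```

A server that ignores Range would re-send the whole file, so real code should also check for a 206 status before appending.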

🔍 Issues Found

🔴 Critical Issues

1. Security: Unvalidated Network Downloads (benchmark_fleurs.py:202-250)

Severity: HIGH

The code downloads files from HuggingFace without proper validation:

  • No SSL certificate verification options exposed
  • No hash verification for downloaded files
  • Downloads are trusted implicitly

Recommendation:

# Add checksum validation
EXPECTED_CHECKSUMS = {
    "en_us.trans.txt": "sha256:...",
    # ...
}

# Verify after download
import hashlib

def verify_file(path, expected_hash):
    # Accept hashes stored with an algorithm prefix, e.g. "sha256:<hex>"
    expected_hex = expected_hash.split(":", 1)[-1]
    sha256 = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(8192), b''):
            sha256.update(chunk)
    if sha256.hexdigest() != expected_hex:
        raise ValueError(f"Checksum mismatch for {path}")

2. Bug: Naive Audio Resampling (benchmark_fleurs.py:382-391)

Severity: MEDIUM

Linear interpolation for audio resampling introduces artifacts:

# This is not production-quality resampling
audio_data = np.interp(
    np.linspace(0, len(audio_data), new_length),
    np.arange(len(audio_data)),
    audio_data
)

Impact: May degrade benchmark accuracy for non-16kHz samples

Recommendation: Use proper resampling:

# Add to dependencies
import librosa
# or
from scipy.signal import resample_poly

# Proper resampling
if sample_rate != 16000:
    audio_data = librosa.resample(audio_data, orig_sr=sample_rate, target_sr=16000)
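If adding librosa as a dependency is undesirable, a scipy-only sketch works too; `resample_audio` and its defaults are illustrative names, not code from the PR:

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

def resample_audio(audio_data, sample_rate, target_rate=16000):
    """Polyphase resampling; reducing the ratio by gcd keeps filters small."""
    if sample_rate == target_rate:
        return audio_data
    g = gcd(sample_rate, target_rate)
    return resample_poly(audio_data, target_rate // g, sample_rate // g)
```

`resample_poly` applies a proper anti-aliasing FIR filter, avoiding the artifacts that linear interpolation introduces.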

3. Bug: Missing Dependency Declaration

The Python script imports soundfile and numpy but these may not be in project dependencies. Verify they're in pyproject.toml or similar.

🟡 Medium Issues

4. Performance: Inefficient Text Processing (C++: benchmark_fleurs.cpp:94-115)

The normalize_text function processes character by character. For production use at scale:

// Consider using whisper-normalizer or a regex-based approach
// Current approach is O(n) but with many branches
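For illustration, here is what a regex-based normalization looks like in Python; the exact rules of the PR's `normalize_text` (Unicode handling, punctuation set) are assumed, so treat this as a sketch of the approach rather than a drop-in replacement:

```python
import re
import unicodedata

def normalize_text(text):
    # Canonicalize Unicode forms, then lowercase
    text = unicodedata.normalize("NFKC", text).lower()
    # Replace punctuation with spaces (regex engine handles the branching)
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace
    return " ".join(text.split())
```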

5. Error Handling: Silent Failures (benchmark_fleurs.py:461-463)

Errors during processing are caught and counted as "skipped" but:

  • Stack traces aren't logged (only available with --debug)
  • No way to retry or investigate failures later

Recommendation:

except Exception as e:
    error_log = self.cache_dir / "errors.log"
    with open(error_log, 'a') as f:
        f.write(f"{sample.sample_id}: {e}\n")
    if self.debug:
        import traceback
        traceback.print_exc()
    skipped_count += 1

6. Code Quality: Hardcoded Model Version (benchmark_fleurs.py:825)

blank_token_id=8192,  # Parakeet v3

This is hardcoded for v3 but the docstring says "Parakeet v2". Ensure consistency or make configurable.

🟢 Minor Issues

7. Path Inconsistency (benchmark_fleurs.py:142-145)

Windows uses %LOCALAPPDATA%/eddy/datasets/FLEURS but non-Windows defaults to ~/Library/Application Support/FluidAudio/FLEURS (different base path). Should this be ~/Library/Application Support/eddy/FLEURS for consistency?
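A consistent per-platform cache root could look like the sketch below; the "eddy" app name is taken from the Windows path above, and the function name and Linux fallback are assumptions:

```python
import os
import sys
from pathlib import Path

def fleurs_cache_dir(app_name="eddy"):
    """One cache layout, <base>/<app_name>/FLEURS, on every platform."""
    if sys.platform == "win32":
        base = Path(os.environ.get("LOCALAPPDATA",
                                   Path.home() / "AppData" / "Local"))
    elif sys.platform == "darwin":
        base = Path.home() / "Library" / "Application Support"
    else:
        base = Path(os.environ.get("XDG_CACHE_HOME", Path.home() / ".cache"))
    return base / app_name / "FLEURS"
```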

8. Documentation: Missing Requirements

The script shows example usage but doesn't document:

  • Python version requirements
  • Required dependencies (soundfile, numpy, etc.)
  • How to install dependencies (uv run? pip install?)

9. Code Smell: Unused Import (benchmark_fleurs.py:22)

import time  # Not used in the code

10. CMakeLists.txt: Missing Comments (examples/cpp/CMakeLists.txt:13-14)

The new executable is added without a comment explaining its purpose (while others have context).


🎯 Performance Considerations

Good:

  • ✅ C++ native implementation option (2-3x faster)
  • ✅ Batch processing with proper cleanup
  • ✅ Caching mechanism to avoid re-downloads

Suggestions:

  • Consider adding progress bars for downloads (use tqdm)
  • For the C++ version, consider parallel processing of samples
  • Memory usage could be optimized by processing samples in smaller batches rather than loading all
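The batching suggestion above can be as simple as a generator that yields fixed-size chunks, so only one batch of decoded audio is resident at a time (names illustrative):

```python
from itertools import islice

def batched(iterable, batch_size):
    """Yield successive lists of up to batch_size items from iterable."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch
```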

🔒 Security Assessment

Concerns:

  1. Network downloads without integrity checks (see Critical Issue #1)
  2. Path traversal: Uses user input for file paths (mitigated by using Path but still worth noting)
  3. Resource exhaustion: No limit on download size (could download unexpectedly large files)
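A download-size guard for the resource-exhaustion point is a few lines; the cap value and function name here are assumptions, not constants from the PR:

```python
import urllib.request

MAX_DOWNLOAD_BYTES = 2 * 1024**3  # assumed 2 GiB cap

def download_capped(url, dest_path, max_bytes=MAX_DOWNLOAD_BYTES):
    """Stream a download to disk, aborting once it exceeds max_bytes."""
    written = 0
    with urllib.request.urlopen(url) as resp, open(dest_path, "wb") as f:
        while True:
            chunk = resp.read(8192)
            if not chunk:
                break
            written += len(chunk)
            if written > max_bytes:
                raise RuntimeError(f"Download exceeds {max_bytes} bytes: {url}")
            f.write(chunk)
```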

Good:

  • ✅ No shell command injection risks
  • ✅ No obvious SQL injection vectors
  • ✅ Uses encoding='utf-8' consistently

🧪 Test Coverage

⚠️ Major Gap:

No automated tests provided. For a benchmark of this size (1,486 additions), tests are essential.

Recommendations:

  1. Unit tests for:

    • normalize_text() function
    • WER/CER calculation
    • Download validation logic
  2. Integration tests for:

    • End-to-end benchmark with mock data
    • C++ <-> Python interop
  3. Example test structure:

# tests/test_fleurs_benchmark.py
import pytest
from benchmark_fleurs import normalize_text, SUPPORTED_LANGUAGES

def test_normalize_text():
    assert normalize_text("Hello, World!") == "hello world"
    
def test_supported_languages():
    assert "en_us" in SUPPORTED_LANGUAGES
    assert len(SUPPORTED_LANGUAGES) == 24

📝 Best Practices & Recommendations

Do:

  1. ✅ Add checksums for downloaded files
  2. ✅ Use proper audio resampling library
  3. ✅ Add automated tests
  4. ✅ Document dependencies clearly
  5. ✅ Add error logging
  6. ✅ Standardize cache directory paths across platforms

Consider:

  • Adding a requirements.txt or updating pyproject.toml
  • Using tqdm for download progress
  • Adding retry logic for network failures
  • Implementing concurrent downloads for multiple languages
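Retry logic for network failures can wrap the download call with exponential backoff; the attempt count and delay are illustrative defaults:

```python
import time
import urllib.error

def with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn(), retrying transient network errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except (urllib.error.URLError, OSError):
            if attempt == attempts - 1:
                raise  # out of retries, surface the error
            time.sleep(base_delay * 2 ** attempt)
```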

🎖️ Overall Assessment

Rating: 7/10

This is a solid, well-thought-out implementation that follows the existing codebase patterns and provides valuable multilingual testing capabilities. The dual Python/C++ approach is smart, and the code quality is generally good.

Primary concerns:

  1. Security (unvalidated downloads)
  2. Audio resampling quality
  3. Missing test coverage

Recommendation: Approve with minor revisions

Please address the critical issues (security validation, audio resampling) before merging. The medium/minor issues can be addressed in follow-up PRs if needed.


📚 Additional Notes

Great job documenting the Swift source as the reference implementation! This makes the code much easier to review and maintain. The benchmark results in the PR description look promising and consistent with expectations for multilingual ASR.

Example benchmark run to verify:

uv run python benchmark_fleurs.py --languages en_us --samples 10 --device NPU

Review conducted following CLAUDE.md guidelines. Note: Per CLAUDE.md, I have not committed any changes.


@BrandonWeng BrandonWeng left a comment


🔥

@Alex-Wengg Alex-Wengg marked this pull request as ready for review November 3, 2025 20:13
@Alex-Wengg Alex-Wengg merged commit 6100f7a into main Nov 3, 2025
2 checks passed