@Alex-Wengg Alex-Wengg commented Nov 3, 2025

FLEURS Multilingual ASR Benchmark Results
Language                  | WER     | Ref WER  | CER     | RTFx    | Samples
────────────────────────────────────────────────────────────────────────────────────────────────
Bulgarian (Bulgaria)      | 16.8%   | 12.6%    | 4.7%    | 41.7x   | 350
Czech (Czechia)           | 18.5%   | 11.0%    | 5.3%    | 43.1x   | 350
Danish (Denmark)          | 25.4%   | 18.4%    | 9.3%    | 44.0x   | 930
German (Germany)          | 7.4%    | 5.0%     | 2.9%    | 42.8x   | 350
Greek (Greece)            | 42.7%   | 20.7%    | 15.0%   | 37.2x   | 650
English (US)              | 6.1%    | 4.9%     | 3.0%    | 41.9x   | 350
Spanish (Spain)           | 5.4%    | 3.5%     | 2.8%    | 43.1x   | 350
Estonian (Estonia)        | 20.8%   | 17.7%    | 4.9%    | 43.4x   | 893
Finnish (Finland)         | 16.8%   | 13.2%    | 3.7%    | 41.5x   | 918
French (France)           | 7.7%    | 5.2%     | 3.2%    | 40.6x   | 350
Croatian (Croatia)        | 17.8%   | 12.5%    | 5.8%    | 41.0x   | 350
Hungarian (Hungary)       | 20.7%   | 15.7%    | 6.4%    | 41.1x   | 905
Italian (Italy)           | 4.3%    | 3.0%     | 2.1%    | 43.6x   | 350
Lithuanian (Lithuania)    | 24.6%   | 20.4%    | 6.7%    | 40.4x   | 986
Latvian (Latvia)          | 30.6%   | 22.8%    | 8.1%    | 42.6x   | 851
Maltese (Malta)           | 25.3%   | 20.5%    | 9.2%    | 41.3x   | 926
Dutch (Netherlands)       | 9.8%    | 7.5%     | 3.3%    | 37.5x   | 350
Polish (Poland)           | 10.5%   | 7.3%     | 3.1%    | 37.3x   | 350
Romanian (Romania)        | 17.5%   | 12.4%    | 5.9%    | 38.9x   | 883
Russian (Russia)          | 9.9%    | 5.5%     | 2.5%    | 39.7x   | 350
Slovak (Slovakia)         | 11.1%   | 8.8%     | 3.5%    | 43.7x   | 350
Slovenian (Slovenia)      | 28.1%   | 24.0%    | 9.4%    | 38.7x   | 834
Swedish (Sweden)          | 18.9%   | 15.1%    | 5.6%    | 41.5x   | 759
Ukrainian (Ukraine)       | 10.7%   | 6.8%     | 2.9%    | 39.3x   | 350
────────────────────────────────────────────────────────────────────────────────────────────────
AVERAGE                   | 17.0%   | 12.7%    | 5.4%    | 41.1x   | 24 langs
────────────────────────────────────────────────────────────────────────────────────────────────


claude bot commented Nov 3, 2025

Code Review: FLEURS Multilingual ASR Benchmark

Thank you for this comprehensive benchmark implementation! This is a well-structured addition that brings multilingual testing capabilities to the project. Here's my detailed review:


Strengths

Code Quality & Structure

  • Excellent documentation: Clear module docstrings, comprehensive comments referencing the Swift implementation source
  • Well-organized architecture: Clean separation between Python orchestration and C++ native implementation
  • Consistent with codebase style: Follows the existing benchmark patterns established in benchmarks/benchmark.py
  • Type hints: Good use of typing throughout the Python code

Implementation Quality

  • Dual implementation approach: Smart to provide both Python and C++ (2-3x faster) versions
  • Proper error handling: Good exception handling with informative error messages
  • Resource management: Proper cleanup in finally block (benchmark_fleurs.py:872-874)
  • Incremental download: Smart validation and resume logic for corrupted/partial downloads
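The resume logic called out above can be sketched roughly as follows; the function name, arguments, and use of an HTTP Range request are illustrative assumptions, not the PR's actual helpers:

```python
import os
import urllib.request

def download_with_resume(url, dest_path, expected_size=None):
    """Resume a partial download by requesting only the missing byte range."""
    start = os.path.getsize(dest_path) if os.path.exists(dest_path) else 0
    if expected_size is not None and start >= expected_size:
        return  # file already complete, skip the network round trip
    req = urllib.request.Request(url)
    if start > 0:
        req.add_header("Range", f"bytes={start}-")  # ask for the tail only
    with urllib.request.urlopen(req) as resp, open(dest_path, "ab") as f:
        while True:
            chunk = resp.read(8192)
            if not chunk:
                break
            f.write(chunk)
```

A server that ignores Range would re-send the whole file, so real code should also check for a 206 status before appending.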

🔍 Issues Found

🔴 Critical Issues

1. Security: Unvalidated Network Downloads (benchmark_fleurs.py:202-250)

Severity: HIGH

The code downloads files from HuggingFace without proper validation:

  • No SSL certificate verification options exposed
  • No hash verification for downloaded files
  • Downloads are trusted implicitly

Recommendation:

# Add checksum validation
EXPECTED_CHECKSUMS = {
    "en_us.trans.txt": "sha256:...",
    # ...
}

# Verify after download
import hashlib

def verify_file(path, expected_hash):
    # Accept hashes stored with an algorithm prefix, e.g. "sha256:<hex>"
    expected_hex = expected_hash.split(":", 1)[-1]
    sha256 = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(8192), b''):
            sha256.update(chunk)
    if sha256.hexdigest() != expected_hex:
        raise ValueError(f"Checksum mismatch for {path}")

2. Bug: Naive Audio Resampling (benchmark_fleurs.py:382-391)

Severity: MEDIUM

Linear interpolation for audio resampling introduces artifacts:

# This is not production-quality resampling
audio_data = np.interp(
    np.linspace(0, len(audio_data), new_length),
    np.arange(len(audio_data)),
    audio_data
)

Impact: May degrade benchmark accuracy for non-16kHz samples

Recommendation: Use proper resampling:

# Add to dependencies
import librosa
# or
from scipy.signal import resample_poly

# Proper resampling
if sample_rate != 16000:
    audio_data = librosa.resample(audio_data, orig_sr=sample_rate, target_sr=16000)
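If adding librosa as a dependency is undesirable, a scipy-only sketch works too; `resample_audio` and its defaults are illustrative names, not code from the PR:

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

def resample_audio(audio_data, sample_rate, target_rate=16000):
    """Polyphase resampling; reducing the ratio by gcd keeps filters small."""
    if sample_rate == target_rate:
        return audio_data
    g = gcd(sample_rate, target_rate)
    return resample_poly(audio_data, target_rate // g, sample_rate // g)
```

`resample_poly` applies a proper anti-aliasing FIR filter, avoiding the artifacts that linear interpolation introduces.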

3. Bug: Missing Dependency Declaration

The Python script imports soundfile and numpy but these may not be in project dependencies. Verify they're in pyproject.toml or similar.

🟡 Medium Issues

4. Performance: Inefficient Text Processing (C++: benchmark_fleurs.cpp:94-115)

The normalize_text function processes character by character. For production use at scale:

// Consider using whisper-normalizer or a regex-based approach
// Current approach is O(n) but with many branches
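For illustration, here is what a regex-based normalization looks like in Python; the exact rules of the PR's `normalize_text` (Unicode handling, punctuation set) are assumed, so treat this as a sketch of the approach rather than a drop-in replacement:

```python
import re
import unicodedata

def normalize_text(text):
    # Canonicalize Unicode forms, then lowercase
    text = unicodedata.normalize("NFKC", text).lower()
    # Replace punctuation with spaces (regex engine handles the branching)
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace
    return " ".join(text.split())
```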

5. Error Handling: Silent Failures (benchmark_fleurs.py:461-463)

Errors during processing are caught and counted as "skipped" but:

  • Stack traces aren't logged (only available with --debug)
  • No way to retry or investigate failures later

Recommendation:

except Exception as e:
    error_log = self.cache_dir / "errors.log"
    with open(error_log, 'a') as f:
        f.write(f"{sample.sample_id}: {e}\n")
    if self.debug:
        import traceback
        traceback.print_exc()
    skipped_count += 1

6. Code Quality: Hardcoded Model Version (benchmark_fleurs.py:825)

blank_token_id=8192,  # Parakeet v3

This is hardcoded for v3 but the docstring says "Parakeet v2". Ensure consistency or make configurable.

🟢 Minor Issues

7. Path Inconsistency (benchmark_fleurs.py:142-145)

Windows uses %LOCALAPPDATA%/eddy/datasets/FLEURS but non-Windows defaults to ~/Library/Application Support/FluidAudio/FLEURS (different base path). Should this be ~/Library/Application Support/eddy/FLEURS for consistency?
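A consistent per-platform cache root could look like the sketch below; the "eddy" app name is taken from the Windows path above, and the function name and Linux fallback are assumptions:

```python
import os
import sys
from pathlib import Path

def fleurs_cache_dir(app_name="eddy"):
    """One cache layout, <base>/<app_name>/FLEURS, on every platform."""
    if sys.platform == "win32":
        base = Path(os.environ.get("LOCALAPPDATA",
                                   Path.home() / "AppData" / "Local"))
    elif sys.platform == "darwin":
        base = Path.home() / "Library" / "Application Support"
    else:
        base = Path(os.environ.get("XDG_CACHE_HOME", Path.home() / ".cache"))
    return base / app_name / "FLEURS"
```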

8. Documentation: Missing Requirements

The script shows example usage but doesn't document:

  • Python version requirements
  • Required dependencies (soundfile, numpy, etc.)
  • How to install dependencies (uv run? pip install?)

9. Code Smell: Unused Import (benchmark_fleurs.py:22)

import time  # Not used in the code

10. CMakeLists.txt: Missing Comments (examples/cpp/CMakeLists.txt:13-14)

The new executable is added without a comment explaining its purpose (while others have context).


🎯 Performance Considerations

Good:

  • ✅ C++ native implementation option (2-3x faster)
  • ✅ Batch processing with proper cleanup
  • ✅ Caching mechanism to avoid re-downloads

Suggestions:

  • Consider adding progress bars for downloads (use tqdm)
  • For the C++ version, consider parallel processing of samples
  • Memory usage could be optimized by processing samples in smaller batches rather than loading all
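The batching suggestion above can be as simple as a generator that yields fixed-size chunks, so only one batch of decoded audio is resident at a time (names illustrative):

```python
from itertools import islice

def batched(iterable, batch_size):
    """Yield successive lists of up to batch_size items from iterable."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch
```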

🔒 Security Assessment

Concerns:

  1. Network downloads without integrity checks (see Critical Issue #1)
  2. Path traversal: Uses user input for file paths (mitigated by using Path but still worth noting)
  3. Resource exhaustion: No limit on download size (could download unexpectedly large files)
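A download-size guard for the resource-exhaustion point is a few lines; the cap value and function name here are assumptions, not constants from the PR:

```python
import urllib.request

MAX_DOWNLOAD_BYTES = 2 * 1024**3  # assumed 2 GiB cap

def download_capped(url, dest_path, max_bytes=MAX_DOWNLOAD_BYTES):
    """Stream a download to disk, aborting once it exceeds max_bytes."""
    written = 0
    with urllib.request.urlopen(url) as resp, open(dest_path, "wb") as f:
        while True:
            chunk = resp.read(8192)
            if not chunk:
                break
            written += len(chunk)
            if written > max_bytes:
                raise RuntimeError(f"Download exceeds {max_bytes} bytes: {url}")
            f.write(chunk)
```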

Good:

  • ✅ No shell command injection risks
  • ✅ No obvious SQL injection vectors
  • ✅ Uses encoding='utf-8' consistently

🧪 Test Coverage

⚠️ Major Gap:

No automated tests provided. For a benchmark of this size (1,486 additions), tests are essential.

Recommendations:

  1. Unit tests for:

    • normalize_text() function
    • WER/CER calculation
    • Download validation logic
  2. Integration tests for:

    • End-to-end benchmark with mock data
    • C++ <-> Python interop
  3. Example test structure:

# tests/test_fleurs_benchmark.py
import pytest
from benchmark_fleurs import normalize_text, SUPPORTED_LANGUAGES

def test_normalize_text():
    assert normalize_text("Hello, World!") == "hello world"
    
def test_supported_languages():
    assert "en_us" in SUPPORTED_LANGUAGES
    assert len(SUPPORTED_LANGUAGES) == 24

📝 Best Practices & Recommendations

Do:

  1. ✅ Add checksums for downloaded files
  2. ✅ Use proper audio resampling library
  3. ✅ Add automated tests
  4. ✅ Document dependencies clearly
  5. ✅ Add error logging
  6. ✅ Standardize cache directory paths across platforms

Consider:

  • Adding a requirements.txt or updating pyproject.toml
  • Using tqdm for download progress
  • Adding retry logic for network failures
  • Implementing concurrent downloads for multiple languages
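Retry logic for network failures can wrap the download call with exponential backoff; the attempt count and delay are illustrative defaults:

```python
import time
import urllib.error

def with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn(), retrying transient network errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except (urllib.error.URLError, OSError):
            if attempt == attempts - 1:
                raise  # out of retries, surface the error
            time.sleep(base_delay * 2 ** attempt)
```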

🎖️ Overall Assessment

Rating: 7/10

This is a solid, well-thought-out implementation that follows the existing codebase patterns and provides valuable multilingual testing capabilities. The dual Python/C++ approach is smart, and the code quality is generally good.

Primary concerns:

  1. Security (unvalidated downloads)
  2. Audio resampling quality
  3. Missing test coverage

Recommendation: Approve with minor revisions

Please address the critical issues (security validation, audio resampling) before merging. The medium/minor issues can be addressed in follow-up PRs if needed.


📚 Additional Notes

Great job documenting the Swift source as the reference implementation! This makes the code much easier to review and maintain. The benchmark results in the PR description look promising and consistent with expectations for multilingual ASR.

Example benchmark run to verify:

uv run python benchmark_fleurs.py --languages en_us --samples 10 --device NPU

Review conducted following CLAUDE.md guidelines. Note: Per CLAUDE.md, I have not committed any changes.


@BrandonWeng BrandonWeng left a comment


🔥

@Alex-Wengg Alex-Wengg marked this pull request as ready for review November 3, 2025 20:13
@Alex-Wengg Alex-Wengg merged commit 6100f7a into main Nov 3, 2025
2 checks passed