-
Notifications
You must be signed in to change notification settings - Fork 0
Parakeet TDT v2 implementation with chunking, #1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
… JSON; docs + setup
- Fix (preproc): correct windowed mel concatenation to append frames (no overwrite)
- Prevents “tail‑only” hypotheses on long clips
- Dedup (chunk join): port FluidAudio Swift overlap logic
- Punctuation guard → exact suffix→prefix → boundary‑window partial search
- Remove tail‑dedup safety net (by request)
- Tunables: EDDY_BOUNDARY_SEARCH_FRAMES (default 20), EDDY_MAX_OVERLAP_TOKENS
- Type safety (OpenVINO):
- Match i32/i64 for preproc/encoder lengths; named port APIs
- Decoder targets built with port’s element type
- Add read_length_scalar() to safely read i32/i64
- Chunking geometry:
- Add EDDY_CONTEXT_FRAMES (default 20) → overlap = 2 * context_frames
- Logging:
- Remove per‑file [PROFILE]/[TDT Stats] noise; gate detailed logs behind EDDY_DEBUG
- Keep final benchmark summary only by default
- Benchmark JSON:
- Add per‑chunk metadata only when multi‑chunk (index/offset/size/skip/holdback/tokens + appended chunk text)
- Always include chunk_sizes_frames
- Add --min-wer <percent> to filter output JSON (e.g., ≥10%)
- Scripts/Docs:
- Add scripts/setup_parakeet_models.ps1 (fetch models + rebuild helper)
- Add docs/Benchmark-Troubleshooting.md (setupvars, quoting, filters, chunk logs)
- Update AGENTS.md (cache clear, NPU wrapper, pointers)
- Create run_bench_npu.bat wrapper (ensures setupvars.bat before running)
- Misc:
- Gated dedup/holdback diagnostic prints (EDDY_DEBUG)
- Minor cleanups around time‑gate and head layout logs
BrandonWeng
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some initial comments.
|
@claude code review the overall changes. note that we will have to support other arch like arm64 parakeet model as well. keep that in mind |
did claude even ran |
This comment was marked as off-topic.
This comment was marked as off-topic.
BrandonWeng
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did a much closerl ook this time around. pushed some of the nit changes myself but still have some confusion around naming, and nits aronnd code spacing,
some of the functions are just one big paragraph without any spacing and its hard to read.
also a lot of "ifs this that" and "fallbacks" that smells like AI agent dbeugging relics that should be cleaned up
tbh not too sure about C++ best practices here so will have to trust you (i guess moreso codex/claude?) on this
Replace dr_wav with libsndfile + libsamplerate; mel spec on CPU Audio Library Refactoring: - Replaced dr_wav with libsndfile for audio I/O - Added libsamplerate for high-quality resampling (sinc interpolation) - Added vcpkg.json manifest for dependency management - Updated CMake to use SndFile and SampleRate packages Parakeet Model Improvements: - Optimized encoder/decoder pipeline with better tensor handling - Improved chunking algorithm for long audio files - Renamed cache.hpp/cpp to app_dir.hpp/cpp for clarity - Supports FLAC, MP3, OGG, and other formats (not just WAV)
This comment was marked as outdated.
This comment was marked as outdated.
Tokenizer Changes: - Move tokenizer.cpp/hpp from src/utils to src/models/parakeet-v2 (tokenizer is Parakeet-specific, not a generic utility) - Improve decode_span() readability with spacing and comments - Update CMakeLists.txt and all include paths accordingly ensure_models Refactoring: - Remove system() call to hf_fetch_models (security risk for embedded apps) - Simplify to only check file existence, not auto-download - Library code should not spawn external processes - Users must manually download models via hf_fetch_models or HuggingFace - Reduced from 118 lines to 45 lines (-62% code) Preprocessor Simplification: - Remove unnecessary rank.is_static() check (always static for Parakeet) - Query element type once instead of per-window (performance optimization) - Model types are fixed at export time and won't change at runtime - Clarify comments: no dynamic input support, just short vs long audio - Parakeet mel spec model requires exactly 160000 samples (10 seconds) - Fail fast if shape is not static (as expected) Benchmark Cleanup: - Remove unnecessary try/except in benchmark.py (uv automatically manages dependencies via pyproject.toml) Overall: -64 lines of code, better performance, clearer semantics Addresses code review feedback from @BrandonWeng: - "I dont think we can just run CLIs in the code base, if a user decides to embed into their apps especially" - "this seems a bit redundant" (auto-fetch logic) - "whhy do we need this check? its always static no?" (preprocessor shape) - "the type won't change after we export?" (element type query) - "the static/dynamic checks here seem a bit unncessary as the models shouldn't be changing" - "pls add some spacings for some of these loops and group them" - "is this specific for parakeet? if so prob better to be under that folder" 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Add clear comment explaining this is a PRIVATE implementation header: - Contains internal OpenVINO types and state (not part of public API) - Follows pImpl pattern to hide implementation details from users - Kept in src/ so it won't be installed with public headers - Only parakeet_*.cpp implementation files should include it This is good practice - allows changing implementation without breaking API. Addresses code review question from @BrandonWeng: "why is this header file here but the other ones are in include/eddy" 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
…udio Improve comments to make it clear: - Parakeet model requires exactly 160000 samples (10 seconds), not dynamic - Code handles short audio (≤10s) vs long audio (>10s), not dynamic vs static - Previous comments were confusing - implied dynamic input support that doesn't exist Addresses code review feedback from @BrandonWeng: "the static/dynamic checks here seem a bit unncessary as the models shouldn't be changing. Im not quite following, unless openvino supports dynmic sized inputs?" 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Improve code organization with spacing and section comments: - Add blank lines between logical sections - Group related operations together - Add comments explaining each section's purpose - Makes the code easier to review and understand This is a partial improvement - the file has 491 lines and needs more spacing work throughout, but this addresses the most dense sections in initialize_decoder_state() and finalize_chunk_decoding(). Addresses code review feedback from @BrandonWeng: "add some spacing pls and group the code, its all one big paragraph in the files and its hard to review" 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
This commit addresses all code review feedback from @BrandonWeng: 1. Removed debugging relics from parakeet_openvino.cpp - Removed metadata JSON loading (parakeet_metadata.json doesn't exist) - Hardcoded port names instead of configurable variables - Removed unused member variables (enc_*_name) from impl struct - Inlined required_length() helper function 2. Simplified error handling in encoder port detection - Changed from boolean flags + post-loop checks to throwing directly in catch blocks - More direct error reporting with exception chaining 3. Extracted pipeline helper function - Created process_mel_features() to handle both short and long audio - Reduced main inference function from 75+ lines to a single call - Eliminated code duplication between single-chunk and multi-chunk paths 4. Moved OpenVINO compile utilities to utils/ - Created openvino_utils.hpp/cpp for reusable OpenVINO helpers - Extracted compile_component() and make_compile_cfg_from_env() - Updated CMakeLists.txt to include new utility file 5. Completed "cache" to "app_dir" rename - Renamed ensure_cache_dir() → ensure_directory() - Updated comments from "cache directory" to "app data directory" 6. Removed redundant benchmark script - Deleted scripts/bench_parakeet_ctypes.py - benchmarks/benchmark.py is now the canonical benchmark 7. Fixed audio conversion implementation - Fixed stereo mixing bug (was dividing by 65536 instead of 32768) - Uses libsndfile-consistent normalization (1.0f / 32768.0f) - Properly converts each channel before mixing 8. Improved preprocessor shape handling - Restored defensive logic for dynamic shapes with fallback to default Impact: - Reduced parakeet_openvino.cpp from 567 lines to 427 lines (-140 lines, -25%) - Improved code organization and reusability - Fixed audio normalization bug - Eliminated code duplication 🤖 Generated with Claude Code (https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
This comment was marked as outdated.
This comment was marked as outdated.
Fixes regression introduced in 45ddb87 where preprocessor started failing with 34% WER on both CPU and NPU. Root cause: Changed window_samples from 0 (dynamic support) to 160000 (hardcoded), breaking models compiled with dynamic shapes. Fix: Restore working logic from a339223: - Use window_samples = 0 to indicate dynamic length support - Create tensor size dynamically based on actual audio length - Move length type detection into lambda with try/catch fallback - Allow both static (windowed) and dynamic (single-shot) modes Test results after fix: - CPU (50 files): 1.12% WER (was 34%) - NPU (10 files): 0.00% WER (was 34%) - CLI (15s audio): Working correctly, 6.3x RTF 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
This comment was marked as outdated.
This comment was marked as outdated.
Removed excessive try/catch fallbacks and debugging relics: - Preprocessor: Query element type once instead of in lambda - Preprocessor: Removed catch-all fallbacks, fail properly on errors - parakeet_openvino: Renamed compile_with_fallback -> compile_with_npu_fallback - parakeet_openvino: Move verbose warnings behind EDDY_DEBUG flag - parakeet_openvino: Simplify encoder port resolution (combine try blocks) - parakeet_openvino: Simplify encoder output index detection - Removed unnecessary comments explaining "if this that fallback" Improved code spacing and readability: - Added visual separators between logical sections - Grouped related operations together - Reduced function density (was one big paragraph) Changes: - preprocessor.cpp: -31 lines, cleaner shape detection logic - parakeet_openvino.cpp: -30 lines, better sectioning Test results: 0.00% WER on 5-file benchmark (no regressions) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
|
Pull Request Review: Parakeet TDT v2 ImplementationOverall AssessmentImpressive implementation of Parakeet TDT v2 ASR in C++/OpenVINO. 8,023 lines with excellent performance (2.87% WER, 37.8x RTFx). Well-designed architecture. Strengths
Critical Security Issues1. Command Injection VulnerabilitiesFiles: examples/cpp/hf_fetch_models.cpp:26, benchmark_librispeech.cpp:208,225,315 Using std::system() with unsanitized URLs creates injection risk. Fix: Use C++ HTTP library (libcurl, cpr, cpp-httplib) instead 2. Unsafe String CopyFile: src/eddy_c.cpp:25 Using strcpy instead of safe alternatives. Fix: Use std::memcpy Major Issues3. Missing Input ValidationPublic APIs lack null checks and buffer validation. Fix: Add defensive checks at API boundaries 4. Missing Unit Tests (CRITICAL)No unit tests in this PR. Fix: Add tests for audio loading, dedup, decoder state, overflow, errors 5. vcpkg Version ConstraintsNo version pinning. Fix: Pin dependency versions 6. Debug Logging InconsistencyDirect std::getenv calls instead of helper. Fix: Use is_debug_enabled() consistently Other Issues
RecommendationsMust Fix Before Merge
Should Fix
Future
ConclusionHigh-quality work with excellent performance. Architecture is sound. Main concerns: Security (command injection) and testing (no unit tests). Recommendation: Request changes for security and testing before merging. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
examples/cpp/hf_fetch_models.cpp
Outdated
| std::string curl_cmd = "curl -L --progress-bar \"" + url + "\" -o \"" + output_path.string() + "\""; | ||
|
|
||
| std::cout << "Downloading: " << output_path.filename().string() << "\n"; | ||
| int ret = std::system(curl_cmd.c_str()); | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Escape user input before invoking curl
The HuggingFace downloader builds a shell command by concatenating url and output_path directly into curl_cmd and then executes it via std::system. Both the repository ID (--repo) and each file name are controllable from the command line and are inserted without any escaping beyond surrounding double quotes. Supplying a value containing " or shell metacharacters allows the constructed command to break out of the quoted segment and run arbitrary commands with the user's privileges. Because this tool will typically be executed in developer environments to fetch models, the command injection is a high‑risk security issue. Consider either avoiding system() entirely (use libcurl/HTTP client APIs) or at minimum rejecting/escaping characters that have shell meaning before composing the command.
Useful? React with 👍 / 👎.
BrandonWeng
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| vcpkg install | ||
|
|
||
| # Or manually install specific packages | ||
| vcpkg install openvino libsndfile libsamplerate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACcording to GPT - precompiling the libraries and .d files are preferred over users manually installing deps for C++ SDKs but do push abck here
Excellent question — this is a subtle but crucial decision when distributing a C++ SDK. The short answer is:
✅ Most modern C++ SDKs ship dependencies as statically linked libraries, or via embedded source / vendored builds (CMake ExternalProject, FetchContent, etc.) — not DLLs.
Let’s break down the common patterns and tradeoffs so you can choose the right one for your SDK.
🧩 1. Static Linking (Most Common for SDKs)
✅ Recommended for SDKs / developer libraries.
How it works
You compile dependencies (.lib or .a) into your SDK’s .lib / .dll or .so.
Consumers don’t need to install those libs separately.
Why it’s preferred
Zero runtime DLL hassles (no “missing .dll” errors).
Version safety — your SDK controls which version of libsndfile/libsamplerate it uses.
Easier for cross-platform builds and packaging (one .lib or .a per platform).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i gave it a try and it created like 25 extra files. maybe not in this pr
| #include "eddy/backends/openvino_backend.hpp" | ||
| #include "eddy/core/app_dir.hpp" | ||
| #include "eddy/models/parakeet-v2/parakeet.hpp" | ||
| #include "eddy/models/parakeet-v2/parakeet_openvino.hpp" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a normal inference is more than enough, no need for the benchmark_librispeech or the text normalizer IMO
|
|
||
| #include "eddy/eddy_c.h" | ||
|
|
||
| #if defined(EDDY_WITH_OPENVINO_GENAI) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not a blocking comment but we will want to split this into its odwn folder later and break down into smaller files
|
|
||
| // Create length tensor with correct type | ||
| ov::Tensor audio_length; | ||
| if (use_i64) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when would we use one type or the pother?
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
Code Review: PR #1 - Parakeet TDT v2 ImplementationSummaryThis PR introduces a comprehensive C++/OpenVINO implementation of Parakeet TDT v2 ASR with impressive benchmark results (2.87% WER, 37.8x RTFx on CPU). The implementation is well-structured with good safety practices, but I've identified several critical issues that should be addressed before merging, along with recommendations for improved test coverage. Overall Assessment: 🟡 Approve with Changes Required 🔴 Critical Issues (Must Fix)1. Infinite Loop Risk in Chunking
|
Overview
Complete C++/OpenVINO implementation of Parakeet TDT v2 ASR matching FluidAudio's architecture
with production-ready features including token-level timestamps, confidence scores, buffered
streaming, and comprehensive benchmarking infrastructure.
Core ASR Pipeline
Chunking & Deduplication
Model Management