Release notes

Releases after `1.0.0`

For releases after 1.0.0, see the GitHub releases page.

`1.0.0` (April 12, 2024)

New features:

Add support for whisper.cpp, a C++ port of OpenAI's Whisper speech recognition architecture. It is faster than the integrated whisper engine and supports large models and GPU processing. It can now be used for recognition, speech translation and alignment (via dtw-ra), though its word timestamps are less accurate than those of the integrated whisper engine.
Add the MDX-NET source separation model, enabling vocal tracks to be extracted from music and speech audio using the new isolate operation
Add support for OpenAI cloud platform's speech recognition and translation services
Add support for OpenAI cloud platform's speech synthesis service
Integrate optional vocal isolation into speech recognition, alignment and translation operations using the new --isolate option, allowing for higher accuracy in difficult cases like achieving word-level lyrics alignment
Add the new adaptive-gate VAD engine using a custom bandlimited adaptive gate. Fast and robust. Works well for relatively clean tracks or tracks that have already been processed using vocal isolation
Add optional token-level repetition suppression to Whisper engine
Expose several new configuration options for the Whisper engine: maximum tokens per part, enabling or disabling repetition suppression, a custom random seed, and enabling or disabling decoding of timestamp tokens
Expose more options for the Elevenlabs engine

Enhancements:

detect-voice-activity now also outputs the cropped voice when an audio file is given as output, with a .cropped suffix added
Improved word segmentation of whisper engine. Words spanning multiple recognized time segments are now split. Word timestamp ranges don't overlap non-speech sections.
Whisper and Whisper.cpp outputs now include timestamps for individual recognized tokens

Behavioral and breaking changes:

Minimum required Node.js version changed to 18.0.0
All recognition, alignment, translation and language recognition operations first apply the new adaptive gate VAD (can be changed to any other VAD engine via the vad. option prefix) and remove any sections that are not identified as containing voice, before starting processing. This should improve results in most cases, and reduce processing time
To reduce Whisper hallucinations and repetition loops, these changes were made:
- Pre-cropping by default can significantly help with reducing hallucinations, giving the model less "empty space" to hallucinate on
- Enable new token-level repetition suppression (suppressRepetition = true) during decoding
- Disable decoding of timestamp tokens by default (decodeTimestampTokens = false), since more accurate timing is already extracted via cross-attention weight alignment. For unclear reasons, this can significantly reduce the occurrence of token repetition loops and increase word timestamp accuracy. However, there are cases where this causes the model to end a part prematurely, especially in singing and less speech-like voice segments. In those cases, it can be enabled with decodeTimestampTokens = true
When transcribe, align or translate-speech operations are run with --isolate enabled, they will output the isolated part in some-output-file.isolated.wav and background part (isolated subtracted from original) in some-output-file.background.wav (any supported codec other than wav can be used - this is just an example)
VAD operations now return a timeline including only the active sections, labeled as active
When the specified language is not English but an .en Whisper model was specified, a warning is now shown and the model is automatically switched to the corresponding multilingual model (omitting the .en), instead of producing an error
Default speech language detection engine is now changed to whisper
Default voice activity detection engine is now changed to silero
CLI: duplicate output file names now get a suffix following the _001 pattern instead of (1). This change is meant to simplify sorting and typing the resulting file names, and removes the space and parenthesis characters to ensure compatibility with all operating systems
targetPeakDb post-processing and denoising option renamed to targetPeak
maxIncreaseDb post-processing and denoising option renamed to maxGainIncrease
dryMixGainDb denoising option renamed to dryMixGain
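
The new duplicate file naming can be illustrated with a minimal sketch. It assumes that the suffix is a zero-padded, three-digit counter placed before the file extension; `resolveDuplicatePath` and its `exists` callback are hypothetical names used for illustration only and are not part of the Echogarden codebase or API.

```typescript
// Hypothetical helper (not Echogarden's actual implementation): pick a
// non-colliding output path by appending _001, _002, ... before the extension.
function resolveDuplicatePath(
	desiredPath: string,
	exists: (path: string) => boolean
): string {
	if (!exists(desiredPath)) {
		return desiredPath
	}

	// Split "dir/name.ext" into "dir/name" and ".ext". A dot that belongs to a
	// directory name, or that starts the file name, is not treated as an extension.
	const lastSeparator = Math.max(desiredPath.lastIndexOf('/'), desiredPath.lastIndexOf('\\'))
	const dotIndex = desiredPath.lastIndexOf('.')
	const hasExtension = dotIndex > lastSeparator + 1
	const base = hasExtension ? desiredPath.slice(0, dotIndex) : desiredPath
	const extension = hasExtension ? desiredPath.slice(dotIndex) : ''

	// Try increasing counters until a free name is found
	for (let index = 1; ; index++) {
		const candidate = `${base}_${String(index).padStart(3, '0')}${extension}`

		if (!exists(candidate)) {
			return candidate
		}
	}
}

// Usage: both 'out.wav' and 'out_001.wav' are taken, so the next free name is chosen
const taken = new Set(['out.wav', 'out_001.wav'])
const resolved = resolveDuplicatePath('out.wav', (path) => taken.has(path))
console.log(resolved)
```

Zero-padded counters sort correctly in plain alphabetical listings, and the resulting names contain no spaces or parentheses that would need quoting in a shell.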

Fixes:

Integrated whisper engine now uses tiktoken to tokenize text, which produces near identical tokenization compared to the official Python implementation. This fixes issues with encoding prompts correctly, and with decoding Chinese and several other languages whose characters span multiple tokens
Fix voice language lists for multilingual voices in Elevenlabs TTS engine
Fix runtime error with rnnoise when used as VAD engine
Fix issue in CLI where supported output media formats weren't reported correctly
Fix issue with identifying legacy IEEE Float wave format
Fix issue with language detection failing with empty inputs
Fix issue with Whisper model failing when no tokens are detected in a part
Fix issue with phone alignment not getting the right DTW window size, producing warnings like "all cost directions are equal to infinity"
Fix phone timelines in DTW-RA
Add more workarounds for eSpeak tokenization bugs
Various other fixes

Documentation:

Add new release notes page. Releases before 1.0.x were retroactively documented based on commit history (may not exactly detail all historical changes)
Add new development page with a quick guide for setting up a development environment to make changes to the Echogarden codebase
Options reference page reorganized and updated with some missing information

`0.12.x` (March 16, 2024)

New features:

Add global options support to API
Add updated ffmpeg internal packages for many platforms, including pre-signed binaries for macOS

Enhancements:

Fixes:

Fix issue with Polish TTS in eSpeak NG
Fix warning about punycode module in the CLI

Other:

Include package-lock.json in repository

`0.11.x` (August 17, 2023)

Many features, enhancements, and fixes were incrementally added over the span of 7 months, up to March 2024.

New features:

Partial rewrite of subtitle generation methods. Adds many features and options
Expose subtitle methods to API
Add support for multi-pass (AKA hierarchical) DTW alignment
Auto-prompt parts in whisper STT engine
Add support for Klatt synthesis in espeak engine

Enhancements:

Adds a total of 14 new VITS voices
Retry on failure for microsoft-edge and google-translate TTS engines

Fixes:

Don't error when empty audio is returned in microsoft-edge response
Fix audio playback in macOS
Many other fixes

`0.10.x` (August 2, 2023)

New features:

Add developer and server APIs, allowing the package to be used as a library or server
Add granularity options to DTW
Add Linux SoX package

Enhancements:

Adds 7 new VITS voices

Fixes:

Many fixes

`0.9.x` (July 29, 2023)

New features:

Add text offsets to timeline
Include segments and sentences in recognition and speech translation timelines
Improve voice information in Elevenlabs engine

Behavioral changes:

Set en_GB-alan-low as default en-GB VITS voice, since danny was trained to accept en-US pronunciations, which may confuse some people

Fixes:

Work around several eSpeak bugs
Many fixes

`0.8.x` (July 25, 2023)

New features:

Add confidence to some speech recognition timelines
Add plaintext options
Add punctuation thresholds to Whisper decoder
Expose subtitle configuration options to CLI

Enhancements:

Improve speech language detection to work with arbitrary-length audio. The audio is split into overlapping parts, each part is detected individually, and the results are then averaged
Add more heteronyms
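
The part-wise language detection described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the part duration, the overlap, the unweighted average, and the names `splitToOverlappingParts` and `averageProbabilities` are all hypothetical, and the per-part detector itself is left out.

```typescript
type LanguageProbabilities = Record<string, number>

interface AudioPart {
	start: number // seconds
	end: number // seconds
}

// Split the range [0, totalDuration) into overlapping parts (all values in seconds)
function splitToOverlappingParts(totalDuration: number, partDuration: number, overlap: number): AudioPart[] {
	if (partDuration <= 0 || overlap < 0 || overlap >= partDuration) {
		throw new Error('Expected partDuration > 0 and 0 <= overlap < partDuration')
	}

	const parts: AudioPart[] = []
	const hop = partDuration - overlap

	for (let start = 0; start < totalDuration; start += hop) {
		const end = Math.min(start + partDuration, totalDuration)
		parts.push({ start, end })

		// Stop once a part reaches the end of the audio
		if (end === totalDuration) {
			break
		}
	}

	return parts
}

// Average the per-part probabilities, giving every part equal weight
function averageProbabilities(results: LanguageProbabilities[]): LanguageProbabilities {
	const averaged: LanguageProbabilities = {}

	for (const result of results) {
		for (const [language, probability] of Object.entries(result)) {
			averaged[language] = (averaged[language] ?? 0) + probability / results.length
		}
	}

	return averaged
}

// Usage: 70 seconds of audio, 30 second parts, 5 seconds of overlap
const parts = splitToOverlappingParts(70, 30, 5)

// Placeholder per-part results, standing in for a real detector's output
const combined = averageProbabilities([
	{ en: 0.8, de: 0.2 },
	{ en: 0.6, de: 0.4 },
])
console.log(parts, combined)
```

A language missing from one part's results is treated as having probability 0 for that part, which keeps the average well defined when detectors return truncated lists.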

Behavioral changes:

Rename subtitles.minWords to subtitles.minWordsInLine

Fixes:

Improve numerical stability of softmax function
Work around eSpeak bug with markers in long inputs
Various fixes
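
For context on the softmax fix, a common way to make softmax numerically stable is to subtract the maximum logit before exponentiating. The sketch below shows that standard technique; the notes do not say which change was actually made, so this should not be read as the project's implementation, and `stableSoftmax` is a hypothetical name.

```typescript
// Softmax with max-subtraction. Shifting all logits by the same constant does
// not change the result mathematically, but keeps Math.exp from overflowing.
function stableSoftmax(logits: number[]): number[] {
	if (logits.length === 0) {
		return []
	}

	// Find the maximum with a loop (avoids spreading very large arrays into Math.max)
	let maxLogit = -Infinity

	for (const logit of logits) {
		if (logit > maxLogit) {
			maxLogit = logit
		}
	}

	const exponents = logits.map((logit) => Math.exp(logit - maxLogit))
	const sum = exponents.reduce((total, value) => total + value, 0)

	return exponents.map((value) => value / sum)
}

// Usage: a naive implementation would compute Math.exp(1000) = Infinity here
// and produce NaN; the shifted version yields equal probabilities.
const probabilities = stableSoftmax([1000, 1000])
console.log(probabilities)
```

After the shift, the largest exponent is always `Math.exp(0) = 1`, so the sum is at least 1 and the division cannot produce `NaN` for finite inputs.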

`0.7.x` (July 23, 2023)

New features:

Alignment: Add option to accept custom lexicons
Implement and add decoder temperature option to Whisper model

Enhancements:

Alignment: use preprocessing and lexicons
Add language detection to speech translation
Add language detection options to synthesis, and set empty default options for detection in several APIs
Accept speech language detection options in recognition
Show warning only when DTW window is smaller than 25% of audio duration
Add colors to log messages
Warn when maximum DTW window duration is smaller than source audio duration.

Behavioral changes:

Fixes:

Fix support for SSML input in eSpeak engine
Various fixes

`0.6.x` (July 20, 2023)

New features:

Add support for custom lexicons. Change lexicon object structure to include language code and allow for multiple languages in a single lexicon
Add support for SSML inputs (currently only supported by Google, Microsoft and Amazon cloud engines). Ensure they are not split to segments or sentences
CLI: Add flag to enable or disable file overwriting
CLI: Add operation to list engines
Add arguments to customize paragraph parsing
Change whitespace option to include option for collapsing all whitespace

Enhancements:

Add awareness of guillemets
Log full language of selected voice

Behavioral changes:

Set Elevenlabs defaults to mid-values
Split plain text to paragraphs using double line breaks by default.
Rename awsPolly options to amazonPolly to be more consistent with documentation
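
As a rough illustration of the new default paragraph splitting, the sketch below breaks plain text on double line breaks. It is a simplified assumption of the behavior: `splitToParagraphs` is a hypothetical name, and the customization arguments mentioned under new features are not modeled.

```typescript
// Split plain text into paragraphs wherever one or more blank lines occur.
// Single line breaks are kept inside the paragraph they belong to.
function splitToParagraphs(text: string): string[] {
	return text
		.replace(/\r\n?/g, '\n') // Normalize Windows and old Mac line endings
		.split(/\n(?:[ \t]*\n)+/) // A blank line may contain spaces or tabs
		.map((paragraph) => paragraph.trim())
		.filter((paragraph) => paragraph.length > 0)
}

// Usage: the single line break stays within the first paragraph
const paragraphs = splitToParagraphs('First line\nstill first.\n\nSecond.\r\n\r\n\r\nThird.')
console.log(paragraphs)
```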

Fixes:

Fix and update Elevenlabs engine
Fix incorrect gender properties for some VITS voices
Convert to plaintext before detecting language when input is SSML.
Various fixes

Other:

Remove package-lock.json from the repository

`0.5.x` (July 19, 2023)

New features:

Enhancements:

Add large numbers of VITS voices

Behavioral changes:

Move to new package system supporting version tags, and hosted in a Hugging Face repository

Fixes:

Various fixes

`0.4.x` (July 9, 2023)

New features:

Enhancements:

Improve text normalization and add support for currencies
Improve heteronym lexicons
Extend year patterns
Improve logic for decade normalization
Add some British English (RP) pronunciations to heteronym lexicon
Add large numbers of VITS voices

Behavioral changes:

Remove support for afplay and aplay for playback. Only SoX is used now.

Fixes:

Various fixes

`0.3.x` (July 1, 2023)

New features:

Enhancements:

Upgrade heteronym disambiguation to an improved, rule-based approach, which doesn't use POS tagging
Extend decade normalization
Improve text normalization and add support for currencies
Update the heteronym lexicon

Behavioral changes:

Fixes:

Remove sentences containing only whitespace when synthesizing
Many fixes

`0.2.x` (May 10, 2023)

New features:

Enhancements:

Show current sentence and segment in synthesis log
Add check for cancellation flag
Add support for additional VITS voices
Improve auto TTS engine selection
Many enhancements

Behavioral changes:

Remove dependency on xregexp package. Use Unicode RegExp instead.

Fixes:

Many fixes

`0.1.x` (April 24, 2023)

Initial release

Enhancements:

Add support for skipping audio playback with the Enter key.
Run CLI in a worker thread by default.
Many enhancements

Fixes:

Many fixes
