All CLI command lines have the general structure:
echogarden [operation] [one or more inputs..] [one or more outputs...] [options...]
Each operation can accept one or more options, in the form --[optionName]=[value]
(The =
is required).
Keyboard shortcuts:
- While the program is running, you can press
esc
to exit immediately - When audio is playing, you can press
enter
to skip it,space
to pause/resume,right
to skip 1 second forward, andleft
to skip 1 second backwards
Task: Given a text file, synthesize spoken audio for it.
This would synthesize "Hello World" and play the result in the terminal:
echogarden speak "Hello world!"
If no language is specified, it would attempt to detect it. This usually works better for longer texts, and may misidentify shorter ones. To ensure the right language is selected, you can specify the language explicitly:
echogarden speak "Hello world!" --language=en
This would save the resulting audio to result.mp3
:
echogarden speak "Hello world!" result.mp3 --language=en
speak-file
synthesizes text loaded from a textual file, which can have the extensions txt
, html
, xml
, ssml
, srt
, vtt
:
echogarden speak-file text.txt result.mp3 --language=en
You can specify an engine using the --engine
option (a full list of engines can be found here). This would set the synthesis engine to pico
(SVOX Pico):
echogarden speak-file text.txt result.mp3 --language=en --engine=pico
The CLI supports multiple output files. This would synthesize a text file, and save the resulting audio in both result.mp3
and result.wav
, as well as subtitles in result.srt
:
echogarden speak-file text.txt result.mp3 result.wav result.srt --engine=vits --speed=1.1
Synthesize a web page (it will try to extract its main article parts and omit the rest):
echogarden speak-url https://example.com/hola
Synthesize a Wikipedia article, in any of its language editions:
echogarden speak-wikipedia "Psychologie" --language=fr
Task: Given an audio recording containing speech, find a textual transcription that best matches it.
This would transcribe the audio file speech.mp3
, and then play the audio, along with the recognized text, in the terminal:
echogarden transcribe speech.mp3
This would transcribe the audio file speech.mp3
and store the resulting transcription in result.txt
, subtitles in result.srt
, and a full timeline tree in result.json
:
echogarden transcribe speech.mp3 result.txt result.srt result.json
Task: Given an audio file and its transcript, try to approximate the timing of the start and end of each spoken word (and its subparts).
This would align the audio file speech.mp3
with the transcript provided in transcript.txt
, and would play the synchronized result in the terminal:
echogarden align speech.mp3 transcript.txt
This would align the audio file speech.mp3
with the transcript provided in transcript.txt
, and store the resulting subtitles in result.srt
, and a full timeline tree in result.json
:
echogarden align speech.mp3 transcript.txt result.srt result.json
Task: Given an audio file containing speech in one language, transcribe it to a second language. The translated transcript should be generated directly from the speech itself, without an intermediate textual translation step.
This will detect the spoken language, apply speech translation to English, and play the original audio, synced with the translated transcript:
echogarden translate-speech speech.mp3
To specify the source and target languages explicitly, use the sourceLanguage
and targetLanguage
options:
echogarden translate-speech speech.mp3 translation.txt --sourceLanguage=es --targetLanguage=en
Note: currently, only English is supported as target language. This is a limitation of the whisper
Engine, which is the only one used for speech translation, at this time.
Task: Given a spoken audio file and its English translated transcript, try to approximate the timing of the start and end of each translated word.
This would align the audio file dutch-speech.mp3
with the translated transcript provided in english-translation.txt
, and would play the synchronized result in the terminal:
echogarden align-translation dutch-speech.mp3 english-translation.txt
This would align the audio file dutch-speech.mp3
with the translated transcript provided in english-translation.txt
, and store the resulting subtitles in result.srt
, and a full timeline tree in result.json
:
echogarden align-translation dutch-speech.mp3 english-translation.txt result.srt result.json
Task: Given a spoken audio file, its transcript, and its translated transcript, try to approximate the timing of the start and end of each translated word.
This would align the audio file dutch-speech.mp3
with the Dutch (native language) transcript provided in dutch-transcript.txt
and the translated transcript provided in russian-translation.txt
, and would play the synchronized result in the terminal:
echogarden align-transcript-and-translation dutch-speech.mp3 dutch-transcript.txt russian-translation.txt
This would perform the same operation but write the results to disk:
echogarden align-transcript-and-translation dutch-speech.mp3 dutch-transcript.txt russian-translation.txt out.json out.srt
The output would include separate files for the native language and the translation language:
out.json
out.srt
out.translated.json
out.translated.srt
Task: Given an audio file, its transcript, and its translated transcript, try to approximate the timing of the start and end of each translated word. Do this in two, separate stages.
This manual two-step approach allows to reuse the already-aligned transcript in the next stage, possibly for several different translation languages. The method used for alignment is otherwise identical to align-transcript-and-translation
.
Stage 1:
Align the audio with its native language transcript, to produce a timeline in the native language:
echogarden align dutch-speech.mp3 dutch-transcript.txt dutch-timeline.json
Stage 2:
Align the resulting timeline with the target translation, and play the synchronized result in the terminal.
echogarden align-timeline-translation dutch-timeline.json russian-transcript.txt --audio=dutch-speech.mp3
(--audio
is only used for previewing the result in the terminal. Otherwise, it is not necessary)
Task: Given audio or textual input, try to identify which language it is spoken or written in.
Try to identify the language of an audio file containing speech, and print the probabilities to the terminal:
echogarden detect-speech-language speech.mp3
Try to identify the language of a text file, and print the probabilities to the terminal:
echogarden detect-text-language story.txt
Try to identify the language of a text file, and store the detailed probabilities in a JSON file:
echogarden detect-text-language story.txt detection-results.json
Task: Given an audio file, try to classify which parts of the audio contain speech, and which don't.
This would apply VAD and play the audio, synchronized with speech
and nonspeech
indicators, printed to the terminal.
echogarden detect-voice-activity speech.mp3
This would apply VAD and store the results in a timeline JSON file.
echogarden detect-voice-activity speech.mp3 timeline.json
Task: Attempt to reduce the amount of background noise in a spoken recording.
This would apply denoising and play the denoised audio:
echogarden denoise speech.mp3
This would apply denoising, and save the denoised audio to a file:
echogarden denoise speech.mp3 denoised-speech.mp3
Task: Try to isolate a vocal track (or other type of track, depending on model used), from the audio.
This would apply source separation and play the isolated audio:
echogarden isolate voice-with-music.mp3
This would apply source separation, and save both the isolated and background audio:
echogarden isolate voice-with-music.mp3 voice-isolated.mp3
Written files would be:
voice-isolated.mp3
voice-isolated.background.mp3
Echogarden can split the output to multiple parts based on the segment boundaries detected. For example:
echogarden speak text.txt parts/[segment].opus
The [segment]
placeholder would cause multiple files to be created, one for each text segment (segments would be determined according to paragraph or line breaks, in this case). The placeholder would be replaced by the index and initial text of the segment, producing an output file with a name like parts/001 Hello world how are you doing ... .opus
.
Templates can also be used for multiple output files. For instance, the following would align speech.mp3
with transcript.txt
and then split the audio according to the segments found in the transcript, and store separate audio and subtitle files for each part.
echogarden align speech.mp3 transcript.txt parts/[segment].m4a parts/[segment].srt
Splitting based on sentences, using a [sentence]
placeholder, is currently on the to-do list. Please let me know if you find this feature important, and I'll prioritize it.
By default, audio isn't played in the terminal when an output file is specified, you can override this behavior by adding --play
:
echogarden speak-file text.txt result.mp3 --play
Or similarly prevent playback using --no-play
:
echogarden transcribe speech.mp3 --no-play
By default, the CLI doesn't overwrite existing files. If an output file out.mp3
already exists, it will save it as out_00x.mp3
.
To have existing files be overwritten, you can pass the --overwrite
option.
Since there are many possible configuration options, it may be more convenient to store them in a configuration file.
When a file named echogarden.config
is found at the current directory, it will be loaded automatically and its content would be used as default options. You can also specify a particular configuration file path with the option --config=path/to/your-config-file.config
.
The configuration file format is simple and has a dedicated section for each command (all speak-
commands are grouped together under speak
), global
section for global API options, and cli
for common CLI options. #
is used as a comment character.
Example:
[global]
# Custom remote packages base URL:
packageBaseURL = https://hf-mirror.com/echogarden/echogarden-packages/resolve/main/
# Log level:
logLevel = info
[cli]
# Should play audio in the terminal:
play = true
# Overwrite existing files:
overwrite = true
[speak]
# Engine for synthesis:
engine = vits
# Voice for synthesis (case-insensitive, can be a search pattern):
voice = amy
# Custom lexicon paths:
customLexiconPaths = ["lexicon1.json", "lexicon2.json"]
[transcribe]
# Engine for recognition:
engine = whisper
# Whisper options:
whisper.model = tiny
whisper.temperature = 0.15
You can also use a JSON configuration file format instead, if preferred.
Name your file echogarden.config.json
:
{
"speak": {
"engine": "vits",
"voice": "amy",
"customLexiconPaths": ["lexicon1.json", "lexicon2.json"]
},
"transcribe": {
"engine": "whisper",
"whisper": {
"model": "tiny"
}
}
}
Flattened property names are also accepted:
{
"transcribe": {
"engine": "whisper",
"whisper.model": "tiny"
}
}
Shows a list of available engines for a given command:
echogarden list-engines speak
Shows a list of available TTS voices for a given engine:
echogarden list-tts-voices google-cloud
Saves the voice list in a JSON file:
echogarden list-tts-voices google-cloud google-cloud-voices.json
Manage the Echogarden packages that are locally installed.
Install one or more packages
Uninstall one or more packages
Show a list of installed packages