Closed
Describe the bug
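The captioned speech example returns duplicate word-level timestamps: every word appears twice in the x-word-timestamps response header, with identical start and end times.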
python kokoro-docker.py
Generating captioned speech for example texts...
Example 1:
Input text: Hello world! Welcome to the captioned speech system.
Response status: 200
Response headers: {'date': 'Sat, 08 Feb 2025 16:40:28 GMT', 'server': 'uvicorn', 'content-disposition': 'attachment; filename=speech.wav', 'x-accel-buffering': 'no', 'cache-control': 'no-cache', 'x-word-timestamps': '[{"word": "Hello", "start_time": 0.175, "end_time": 0.525}, {"word": "Hello", "start_time": 0.175, "end_time": 0.525}, {"word": "world", "start_time": 0.525, "end_time": 0.9}, {"word": "world", "start_time": 0.525, "end_time": 0.9}, {"word": "!", "start_time": 0.9, "end_time": 0.9875}, {"word": "!", "start_time": 0.9, "end_time": 0.9875}, {"word": "Welcome", "start_time": 0.9875, "end_time": 1.45}, {"word": "Welcome", "start_time": 0.9875, "end_time": 1.45}, {"word": "to", "start_time": 1.45, "end_time": 1.5375}, {"word": "to", "start_time": 1.45, "end_time": 1.5375}, {"word": "the", "start_time": 1.5375, "end_time": 1.625}, {"word": "the", "start_time": 1.5375, "end_time": 1.625}, {"word": "captioned", "start_time": 1.625, "end_time": 2.075}, {"word": "captioned", "start_time": 1.625, "end_time": 2.075}, {"word": "speech", "start_time": 2.075, "end_time": 2.4}, {"word": "speech", "start_time": 2.075, "end_time": 2.4}, {"word": "system", "start_time": 2.4, "end_time": 3.1}, {"word": "system", "start_time": 2.4, "end_time": 3.1}, {"word": ".", "start_time": 3.1, "end_time": 3.25}, {"word": ".", "start_time": 3.1, "end_time": 3.25}]', 'content-type': 'audio/wav', 'Transfer-Encoding': 'chunked'}
Audio saved to: [REMOVED]
Timestamps saved to: [REMOVED]
Word-level timestamps:
Hello: 0.175s - 0.525s
Hello: 0.175s - 0.525s
world: 0.525s - 0.900s
world: 0.525s - 0.900s
!: 0.900s - 0.988s
!: 0.900s - 0.988s
Welcome: 0.988s - 1.450s
Welcome: 0.988s - 1.450s
to: 1.450s - 1.538s
to: 1.450s - 1.538s
the: 1.538s - 1.625s
the: 1.538s - 1.625s
captioned: 1.625s - 2.075s
captioned: 1.625s - 2.075s
speech: 2.075s - 2.400s
speech: 2.075s - 2.400s
system: 2.400s - 3.100s
system: 2.400s - 3.100s
.: 3.100s - 3.250s
.: 3.100s - 3.250s
Example 2:
Input text: The quick brown fox jumps over the lazy dog.
Response status: 200
Response headers: {'date': 'Sat, 08 Feb 2025 16:40:28 GMT', 'server': 'uvicorn', 'content-disposition': 'attachment; filename=speech.wav', 'x-accel-buffering': 'no', 'cache-control': 'no-cache', 'x-word-timestamps': '[{"word": "The", "start_time": 0.175, "end_time": 0.25}, {"word": "The", "start_time": 0.175, "end_time": 0.25}, {"word": "quick", "start_time": 0.25, "end_time": 0.5}, {"word": "quick", "start_time": 0.25, "end_time": 0.5}, {"word": "brown", "start_time": 0.5, "end_time": 0.8375}, {"word": "brown", "start_time": 0.5, "end_time": 0.8375}, {"word": "fox", "start_time": 0.8375, "end_time": 1.2375}, {"word": "fox", "start_time": 0.8375, "end_time": 1.2375}, {"word": "jumps", "start_time": 1.2375, "end_time": 1.5375}, {"word": "jumps", "start_time": 1.2375, "end_time": 1.5375}, {"word": "over", "start_time": 1.5375, "end_time": 1.7375}, {"word": "over", "start_time": 1.5375, "end_time": 1.7375}, {"word": "the", "start_time": 1.7375, "end_time": 1.825}, {"word": "the", "start_time": 1.7375, "end_time": 1.825}, {"word": "lazy", "start_time": 1.825, "end_time": 2.2}, {"word": "lazy", "start_time": 1.825, "end_time": 2.2}, {"word": "dog", "start_time": 2.2, "end_time": 2.85}, {"word": "dog", "start_time": 2.2, "end_time": 2.85}, {"word": ".", "start_time": 2.85, "end_time": 3.025}, {"word": ".", "start_time": 2.85, "end_time": 3.025}]', 'content-type': 'audio/wav', 'Transfer-Encoding': 'chunked'}
Audio saved to: [REMOVED]
Timestamps saved to: [REMOVED]
Word-level timestamps:
The: 0.175s - 0.250s
The: 0.175s - 0.250s
quick: 0.250s - 0.500s
quick: 0.250s - 0.500s
brown: 0.500s - 0.838s
brown: 0.500s - 0.838s
fox: 0.838s - 1.238s
fox: 0.838s - 1.238s
jumps: 1.238s - 1.538s
jumps: 1.238s - 1.538s
over: 1.538s - 1.738s
over: 1.538s - 1.738s
the: 1.738s - 1.825s
the: 1.738s - 1.825s
lazy: 1.825s - 2.200s
lazy: 1.825s - 2.200s
dog: 2.200s - 2.850s
dog: 2.200s - 2.850s
.: 2.850s - 3.025s
.: 2.850s - 3.025s
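As a possible client-side workaround until the duplication is fixed in the server, the parsed header could be filtered before use. This is only a minimal sketch, assuming the duplicates always arrive as consecutive, fully identical entries, as they do in the output above. `dedupe_timestamps` is a hypothetical helper name, not part of Kokoro-FastAPI, and the header value below is a shortened copy of the one from Example 1.

```python
import json


def dedupe_timestamps(entries):
    """Drop any entry that exactly repeats the entry immediately before it.

    Hypothetical helper (not part of Kokoro-FastAPI). Only consecutive,
    fully identical entries are removed, so a word that is genuinely spoken
    twice at different times is kept.
    """
    result = []
    for entry in entries:
        if result and result[-1] == entry:
            continue
        result.append(entry)
    return result


# Shortened copy of the x-word-timestamps header value from Example 1.
header_value = (
    '[{"word": "Hello", "start_time": 0.175, "end_time": 0.525}, '
    '{"word": "Hello", "start_time": 0.175, "end_time": 0.525}, '
    '{"word": "world", "start_time": 0.525, "end_time": 0.9}, '
    '{"word": "world", "start_time": 0.525, "end_time": 0.9}]'
)

words = dedupe_timestamps(json.loads(header_value))
for w in words:
    print(f"{w['word']}: {w['start_time']:.3f}s - {w['end_time']:.3f}s")
```

Comparing only adjacent entries is a deliberate choice here: it avoids dropping legitimate repeats such as the two occurrences of "the" in Example 2, which have different timestamps.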
Screenshots or console output
Branch / Deployment used
Using the docker run command for the CPU image:
docker run -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-cpu:v0.2.0post3
Operating System
Linux (Pop!_OS, based on Ubuntu 22.04)