
Long audio with OpenAI Whisper produces repeated nonsense text when Memo uploads compressed MP3 instead of WAV #389

@fish-wjj


### Describe the bug

When using MemoAI 1.6.7 on Windows 11 with the OpenAI Whisper (cloud) provider and my own OpenAI API key, long recordings (~95 minutes) are reported as successfully transcribed, but the transcript content is almost entirely nonsense: the same short phrase is repeated for most of the timeline.

However, when I send the WAV version of the same recording directly to whisper-1 via the OpenAI API, the transcription is perfectly reasonable Chinese meeting speech.

This strongly suggests that Memo is actually uploading the compressed transcribe.mp3 to OpenAI (which appears to be heavily compressed or encoded in a way that degrades the audio for Whisper), while keeping a higher-quality transcribe.wav only for local use.


### Environment

  • OS: Windows 11
  • MemoAI version: 1.6.7
  • Transcription provider: OpenAI Whisper (cloud) with my own API key
  • Language: Simplified Chinese
  • Recording:
    • Length: ~95.4 minutes
    • Format (original): MP3, mono, 16 kHz, 128 kbps
    • Content: internal business meeting (Chinese)

### What I did

  1. Imported a ~95 min MP3 recording into Memo.

  2. Chose transcript type `cloud`, provider **OpenAI Whisper**, language `zh_s` (Simplified Chinese).

  3. Let Memo run the transcription until status became success.

  4. The resulting transcript (and the convertResult saved into project.json) is almost entirely the same short phrase repeated over and over (e.g. "迷你版迷你曲迷你合拍" for almost the whole 1.5 hours).

  5. I then debugged locally using the files Memo stores under:

     ```json
     "audioPath": "...\\resources\\transcribe.wav",
     "mp3Path":   "...\\resources\\transcribe.mp3"
     ```

  6. I called the OpenAI API myself from Python using:

     * the same API key,
     * the same recording, but once via `transcribe.mp3` and once via `transcribe.wav`,
     * standard `model="whisper-1"` and `language="zh"`.

---

### Expected behavior

Using the **OpenAI Whisper** provider with a valid API key on a reasonably clear meeting recording should produce a meaningful Chinese transcript (even if not perfect), similar to what I get when I call `whisper-1` directly on the WAV.

In particular: if Memo shows `status: "success"` for a transcription, the transcript should not collapse into a single nonsense phrase repeated for almost the entire recording.

---

### Actual behavior

* Memo reports the transcription as **`status: "success"`**.

* `project.json` shows `convertResult` as a large JSON array of segments, but almost all segments have **the *same* `text` value**, for example:

  ```json
  {
    "st": "00:11:26.020",
    "et": "00:11:38.500",
    "text": "迷你版迷你曲迷你合拍"
  },
  {
    "st": "00:11:46.659",
    "et": "00:12:08.059",
    "text": "迷你版迷你曲迷你合拍"
  }
  ```

* This pattern continues from around the beginning to the end (~01:35:xx), which makes the transcript unusable.
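For reference, the degree of repetition can be quantified directly from `convertResult`. This is only a diagnostic sketch; the segment fields `st`/`et`/`text` are taken from the excerpt above, and `convertResult` is assumed to be the JSON-encoded string shown in `project.json`:

```python
import json
from collections import Counter

def repetition_ratio(convert_result: str) -> float:
    """Fraction of segments whose `text` equals the single most common value."""
    segments = json.loads(convert_result)
    if not segments:
        return 0.0
    counts = Counter(seg["text"] for seg in segments)
    return counts.most_common(1)[0][1] / len(segments)

# Tiny synthetic example mirroring the failure mode above
# (two of three segments share the same text):
demo = json.dumps([
    {"st": "00:00:01.000", "et": "00:00:05.000", "text": "迷你版迷你曲迷你合拍"},
    {"st": "00:00:06.000", "et": "00:00:10.000", "text": "迷你版迷你曲迷你合拍"},
    {"st": "00:00:11.000", "et": "00:00:15.000", "text": "normal segment"},
])
print(repetition_ratio(demo))
```

On the failing recording's `convertResult`, this ratio is close to 1.0 for roughly 95 minutes of audio, which is far outside anything a real meeting transcript would produce.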

---

### Investigation details

#### 1. Files created by Memo

In `project.json` for the offending recording I see (paths shortened):

```json
"audioPath": "...\\resources\\transcribe.wav",
"mp3Path":   "...\\resources\\transcribe.mp3",
"status": "success",
"convertResult": "[{ \"st\": ..., \"et\": ..., \"text\": \"迷你版迷你曲迷你合拍\" }, ...]"
```

* `transcribe.wav` is a long WAV file (~95.4 min), and when I listen to it, the audio sounds **normal and intelligible**.
* `transcribe.mp3` is a compressed MP3 created by Memo.

#### 2. Calling OpenAI Whisper on `transcribe.mp3` → nonsense

I wrote a small Python script to segment `transcribe.mp3` into 10-minute chunks and send each chunk to `whisper-1`:

```python
from openai import OpenAI
from pathlib import Path
import subprocess, shutil
from tempfile import TemporaryDirectory

client = OpenAI(api_key="<YOUR_OPENAI_API_KEY>")
AUDIO_PATH = Path(r"...\\resources\\transcribe.mp3")

def _require_tool(name: str) -> None:
    if shutil.which(name) is None:
        raise SystemExit(f"Missing required binary: {name}")

def _probe_duration_seconds(path: Path) -> float:
    _require_tool("ffprobe")
    result = subprocess.run(
        ["ffprobe", "-v", "error",
         "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1",
         str(path)],
        capture_output=True, text=True, check=True,
    )
    return float(result.stdout.strip())

def transcribe(path: Path) -> str:
    duration = _probe_duration_seconds(path)
    print(f"Audio duration: {duration/60:.1f} minutes, segmenting into 10-min chunks...")

    _require_tool("ffmpeg")
    texts = []
    with TemporaryDirectory(prefix="whisper_segments_") as tmpdir:
        out_pattern = str(Path(tmpdir) / "chunk_%03d.mp3")
        subprocess.run(
            [
                "ffmpeg",
                "-hide_banner",
                "-loglevel", "error",
                "-y",
                "-i", str(path),
                "-f", "segment",
                "-segment_time", "600",
                "-c", "copy",
                "-reset_timestamps", "1",
                out_pattern,
            ],
            check=True,
        )

        for idx, chunk in enumerate(sorted(Path(tmpdir).glob("chunk_*.mp3")), 1):
            with chunk.open("rb") as f:
                resp = client.audio.transcriptions.create(
                    model="whisper-1",
                    file=f,
                    language="zh",
                    temperature=0,
                )
            text = (resp.text or "").strip()
            texts.append(text)
            print(f"[{idx}] {chunk.name} length: {len(text)} chars")

    return "\n".join(texts)

full_text = transcribe(AUDIO_PATH)
print("== First 500 chars ==")
print(full_text[:500])
```

Console output (simplified):

```text
Audio duration: 95.4 minutes, segmenting into 10-min chunks...
[1] chunk_000.mp3 length: 1394
[2] chunk_001.mp3 length: 2617
...
[10] chunk_009.mp3 length: 1110

== First 500 chars ==
继续练习 继续练习 继续练习 继续练习 继续练习 继续练习 继续练习 ...
```

So:

* Every segment returns **lots of characters**, but the *content* is basically the same phrase repeated (`"继续练习"`, roughly "keep practicing", over and over), which matches the “repeated garbage” behavior seen in Memo’s transcript.

#### 3. Calling OpenAI Whisper on `transcribe.wav` → good transcript

Then I ran a similar script on `transcribe.wav`, this time sampling only 3 × 60-second snippets (to save cost):

```python
from openai import OpenAI
from pathlib import Path
from tempfile import TemporaryDirectory
from contextlib import contextmanager
import subprocess, shutil

client = OpenAI(api_key="<YOUR_OPENAI_API_KEY>")
AUDIO_PATH = Path(r"...\\resources\\transcribe.wav")
SAMPLE_OFFSETS_SECONDS = [5 * 60, 30 * 60, 60 * 60]  # 5, 30, 60 min
SAMPLE_DURATION_SECONDS = 60
MIN_CHUNK_SECONDS = 1.0

def _require_tool(name: str) -> None:
    if shutil.which(name) is None:
        raise SystemExit(f"Missing required binary: {name}")

def _probe_duration_seconds(path: Path) -> float:
    _require_tool("ffprobe")
    result = subprocess.run(
        ["ffprobe", "-v", "error",
         "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1",
         str(path)],
        capture_output=True, text=True, check=True,
    )
    return float(result.stdout.strip())

@contextmanager
def _extracted_samples(path: Path, offsets, sample_duration, min_duration):
    total_duration = _probe_duration_seconds(path)
    _require_tool("ffmpeg")
    print(
        "Sampling at offsets: "
        + ", ".join(f"{sec/60:.1f} min" for sec in offsets)
        + f", each {sample_duration} seconds."
    )

    with TemporaryDirectory(prefix="whisper_samples_") as tmpdir:
        tmp_dir = Path(tmpdir)
        samples = []
        for idx, offset in enumerate(offsets, 1):
            if offset >= total_duration:
                print(f"Skip sample {idx}: offset {offset/60:.1f} exceeds duration.")
                continue
            actual_length = min(sample_duration, total_duration - offset)
            if actual_length < min_duration:
                print(f"Skip sample {idx}: remaining length {actual_length:.2f}s too short.")
                continue

            sample_path = tmp_dir / f"sample_{idx:02d}.wav"
            subprocess.run(
                [
                    "ffmpeg",
                    "-hide_banner",
                    "-loglevel", "error",
                    "-y",
                    "-ss", str(offset),
                    "-i", str(path),
                    "-t", str(actual_length),
                    "-ac", "1",
                    "-ar", "16000",
                    "-c:a", "pcm_s16le",
                    str(sample_path),
                ],
                check=True,
            )
            samples.append(sample_path)

        if not samples:
            raise RuntimeError("No samples extracted.")
        yield samples

def _transcribe_samples(audio_path: Path) -> str:
    combined = []
    with _extracted_samples(audio_path, SAMPLE_OFFSETS_SECONDS,
                            SAMPLE_DURATION_SECONDS, MIN_CHUNK_SECONDS) as samples:
        total = len(samples)
        for idx, chunk in enumerate(samples, 1):
            with chunk.open("rb") as f:
                resp = client.audio.transcriptions.create(
                    model="whisper-1",
                    file=f,
                    language="zh",
                    temperature=0,
                )
            text = (resp.text or "").strip()
            combined.append(text)
            print(f"[{idx}/{total}] {chunk.name} done, length: {len(text)} chars")
    return "\n\n".join(filter(None, combined))

if __name__ == "__main__":
    transcript = _transcribe_samples(AUDIO_PATH)
    print("== Sample transcript ==")
    print(transcript)
```

Console output:

```text
Sampling at offsets: 5.0 min, 30.0 min, 60.0 min, each 60 seconds.
[1/3] sample_01.wav done, length: 259 chars
[2/3] sample_02.wav done, length: 187 chars
[3/3] sample_03.wav done, length: 312 chars

== Sample transcript ==
自然是在百分之四点多,还没有达到之前的38.6%的时候,比较热闹的水平,然后往下是自然渠道的改变,然后这个只能是自然渠道规模,这个没有太大变化,主要还是苹果和其他一些微软公司的这些渠道在涨,然后冯农文化目前保持在区间左右,也没有太大变化。刘腾的话,上周的那个图,看上去像是两条线,但是其实我是标了一个,这些所有都是冯农文化的流程,冯农流程目前就是在百分之三十左右。刘腾是吧? 刘腾渠道收入百分之三十。冯农自然百分之三十。那跟那个,百分之三十一减去百分之六,百分之二十八。这个好像之前跟徐老师他们的体感不一样。
...

那个没办法,拿他们的长周期来看。9月份122块钱是实际已经产生的。我的意思是这块的价格体系,它互相开放的那种一个月,就比如说自然的那个250块钱,然后呢商业化的那个80块钱,这个是有共识的。它这个只能按照实际产生的来看。对,它这个没有那么多的共识。所以我觉得这一块应该来严谨一下。肯定这个数据应该也没问题。不是,它现在其实那个,加上他们在做商业化的时候,应该直接就采用了那个商业化的数据。哦,它直接返回的那个应该是没问题的。对,那个是确认过的,实际产生。对,那是实际产生确认的那个肯定没问题。如果说这是运营那边某一个同学给你预估的,那后来我们这个地方一下用那个实际产生的。所以实际产生的时候它会重启。
```

This is **perfectly intelligible Chinese meeting speech**, matching what I hear in the recording. So:

* `transcribe.wav` + `whisper-1` works fine (even on long audio, when sampled).
* `transcribe.mp3` + `whisper-1` collapses into repeated nonsense.
* Memo’s in-app transcript and `convertResult` match the **MP3 failure mode**, not the WAV success mode.

This makes it very likely that **Memo is uploading `transcribe.mp3` (the compressed version) to OpenAI**, instead of the higher-quality `transcribe.wav`.

---

### Summary / Hypothesis

* The original recording (and Memo’s `transcribe.wav`) is of sufficient quality for Whisper to transcribe correctly.
* The `transcribe.mp3` that Memo generates is most likely one of the following:

  * over-compressed,
  * encoded with parameters (sample rate/bitrate/etc.) that make Whisper struggle badly, or
  * otherwise corrupted in a way that leads to highly repetitive hallucinations.
* Memo seems to use `transcribe.mp3` as the source when calling OpenAI Whisper with a personal API key.
* As a result, long recordings sent via Memo + OpenAI Whisper can look "successful" but produce essentially unusable transcripts.

---

### Suggestions

* For long recordings, consider **explicitly segmenting the audio into shorter chunks** (e.g. 5–10 minutes each) on the client side, uploading each chunk separately to OpenAI Whisper, and then **merging the segment transcripts** on Memo’s side in time order. This would:

  * avoid hitting model/file-size limits,
  * make error handling per-chunk easier,
  * and allow partial progress/partial retries for long sessions.
* When doing such segmentation, it would be safer to use **`transcribe.wav` or a high-quality derivative** as the source for each chunk, rather than a heavily compressed MP3, to preserve enough acoustic detail for Whisper to work reliably.
* Additionally, consider adding some basic quality checks:

  * If a very high percentage of segments contain **identical text**, treat the transcription as failed or suspicious and warn the user instead of marking it as `success`.
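To illustrate the chunk-and-merge suggestion, here is a minimal sketch. The function name, the `(offset, segments)` input shape, and the use of float seconds are all hypothetical, not Memo's internals; each chunk would be transcribed independently, then its segment timestamps shifted by the chunk's offset and the results merged in time order:

```python
# Hypothetical sketch of client-side merging for chunked transcription.
# Segment times here are floats in seconds relative to their chunk; Memo's
# real "st"/"et" timestamp strings would be formatted from these afterwards.

def merge_chunk_segments(chunks):
    """chunks: iterable of (offset_seconds, segments), where each segment is
    {"st": float, "et": float, "text": str} relative to its own chunk.
    Returns absolute-time segments sorted by start time."""
    merged = []
    for offset, segments in chunks:
        for seg in segments:
            merged.append({
                "st": seg["st"] + offset,
                "et": seg["et"] + offset,
                "text": seg["text"],
            })
    merged.sort(key=lambda s: s["st"])
    return merged

# Two 600 s (10-minute) chunks, as in the segmentation used above:
chunk_a = [{"st": 0.0, "et": 4.2, "text": "first chunk text"}]
chunk_b = [{"st": 1.5, "et": 6.0, "text": "second chunk text"}]
result = merge_chunk_segments([(0, chunk_a), (600, chunk_b)])
# the second chunk's segment now starts at 601.5 s absolute time
```

A per-chunk structure like this also makes the suggested retry and quality-check logic natural: a failed or suspicious chunk can be retried or flagged individually without discarding the rest of the session.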

If you need any additional logs, a sanitized `project.json`, or more targeted test scripts, I’m happy to provide them.
