Replies: 6 comments 1 reply
---
Update (November 3, 2025): Received an initial response from OpenAI Support that appears to be a generic reply. I have requested proper escalation to the Whisper engineering/technical team.
---
This is not a bug; it's training material leaking. It's both expected behavior (albeit unwanted) and not something (generally) fixable with code.
---
@MarktHart - I appreciate your technical perspective, but I respectfully and strongly disagree with your conclusion that this is "expected behavior" and "not fixable."

Consider this thought experiment: suppose Whisper output "李宗盛 was arrested for armed robbery" ("Li Zongsheng was arrested for armed robbery"). Would OpenAI respond with "that's just training data leaking, expected behavior, not fixable"? No. They would fix it immediately. So why the double standard?

Why copyright fabrication is equally serious: Legal implications. These false attributions create competing copyright claims that could be used against creators in disputes or licensing negotiations.

Regarding your claim that it's "not fixable with code": in my testing, Rev.com transcribed my file cleanly with zero false attribution. This proves technical solutions exist, for example post-processing filters to detect and remove hallucinated metadata. The real question isn't "can it be fixed?" The question is "will OpenAI prioritize fixing it?"

What Whisper is actually doing: presenting fabricated attribution with no disclaimer that it might be hallucinated. This is fundamentally different from transcription errors or missing words. This is inventing legal attribution that doesn't exist.

Why "expected behavior" is not an acceptable response: if a medical AI hallucinated false diagnoses, would "expected behavior, not fixable" be acceptable? No. When AI systems produce outputs with legal weight, the bar for accuracy is higher. Copyright attribution has legal consequences. It's not just "metadata"; it's a legal claim.

OpenAI's responsibility: OpenAI has a duty of care to users. This is exactly why I reported it as a critical bug: even if it's rooted in training data contamination, it produces legally and ethically unacceptable outputs that cause real harm to real people.

Bottom line: should tools that threaten creators' intellectual property rights be accepted as "expected behavior"? The answer must be no.
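The post-processing approach argued for here can be sketched in a few lines of Python. This is a minimal illustration, not a production fix: the phrase list is a small hypothetical sample drawn from the hallucination patterns reported in this thread and related discussions, and real deployments would need a much larger, curated list.

```python
import re

# Hypothetical sample of hallucinated promotional phrases reported for
# Whisper output (Chinese "like/subscribe/donate" boilerplate plus common
# English equivalents). Illustrative only, not an exhaustive list.
SPAM_PATTERNS = [
    r"请不吝?点赞.*订阅",                      # "please like ... subscribe"
    r"打赏支持",                               # "donate to support"
    r"don't forget to like and subscribe",
    r"hit the bell icon",
    r"support (us|me) on patreon",
]

def strip_promotional_spam(segments):
    """Drop transcript segments matching known promotional-spam patterns.

    `segments` is a list of dicts with a "text" key, matching the shape
    of Whisper's per-segment output.
    """
    compiled = [re.compile(p, re.IGNORECASE) for p in SPAM_PATTERNS]
    return [s for s in segments
            if not any(c.search(s["text"]) for c in compiled)]
```

A caller would pass the `segments` list from a Whisper transcription result through this filter before rendering subtitles or transcripts.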
---
Quick recap, then additional test results:

Subject: Critical Security Issue - Whisper API Fabricating False Copyright Attributions and Injecting Promotional Content into Transcriptions

EXECUTIVE SUMMARY

REPORTER INFORMATION

TESTING SCOPE AND METHODOLOGY
- Two original songs with different styles and characteristics
- Languages NOT yet tested: English, Spanish, French, Japanese, Korean, and other supported languages
- Risk factor: large social media presence in training data (YouTube, TikTok, etc.)

I strongly recommend OpenAI conduct systematic testing across all supported languages to determine the full scope of these issues.

ISSUE #1: FALSE COPYRIGHT ATTRIBUTION

Hallucinated promotional text: "请不吝点赞 订阅 转发 打赏支持明镜与点点栏目" ("Please like, subscribe, share, and donate to support the Mingjing and Diandian programs")

Audio verification: the song ends with the lyrics "遇到远方 彼随时代而变" ("When encountering distant places, they change with the times"). NO promotional text is spoken.

Root Cause Analysis
- English equivalents: "Don't forget to like and subscribe," "Hit the bell icon," "Support on Patreon"
- The mechanism is universal: social media promotional content in training data → model memorization → inappropriate insertion into unrelated transcriptions

COMPARATIVE ANALYSIS
- Different triggers: different song characteristics trigger different hallucination patterns

Extrapolation to Other Languages
- At risk: languages with a large YouTube/social media training data presence
- Testing priority recommendations: Tier 1: English, Spanish, Japanese, Korean, Hindi (large social media presence)

IMPACT ASSESSMENT
- False attributions create competing copyright claims
- Defamation and misrepresentation: associating real individuals with works they didn't create
- Liability for OpenAI and service providers: companies using the Whisper API unknowingly generating false legal records in multiple languages

Content Integrity Violations
- Original: anti-alcohol abuse message (any language)
- Example 2 - Religious/Cultural Content: Original: prayer, meditation, spiritual guidance (any language)
- Example 3 - Children's Content: Original: educational material (any language)
- Example 4 - Political Speech: Original: political statement (any language)
- Example 5 - Memorial/Tribute: Original: honoring deceased loved ones (any language)

These scenarios are language-independent; the harm potential exists wherever Whisper processes content.
- Affected: independent content creators in all languages
- Whisper deployment: used by thousands of companies and services globally

EVIDENCE PACKAGE
- Original MP3 file: "You-Are-A-Diamond-DuetChinesetesting.mp3" (3,988 KB)
- Transcripts in .txt, .json, .srt, .tsv, .vtt formats
- Rev.com clean transcription (Nov 2-3, 2025)

For Issue #2 (Spam Injection):
- Original MP3 file: "Changing-Times-Chinese.mp3"

Testing Methodology: systematic A/B testing across multiple services

TECHNICAL ANALYSIS

COMPARISON TO KNOWN ISSUES
- Multiple users reporting promotional spam injection across multiple languages
- Discussion #2244: "Bug Report for Whisper Model - Chinese Transcription Anomaly": reports of unexpected Chinese text insertion
- My contribution: first detailed documentation of copyright attribution hallucination (Issue #1) with systematic testing methodology and independent verification

RECOMMENDATIONS
1. Acknowledge and investigate: confirm receipt of this report and assign it to the Engineering team for immediate investigation
2. Language-specific testing protocol: music content from independent creators (test for false attribution); priority language for immediate testing: English (largest training data volume)
3. Technical fixes:
   - Training data sanitization: identify and remove sources with attribution headers and promotional overlays across all languages
   - Output validation: implement confidence scoring that distinguishes actual speech from hallucinated metadata for all languages
   - Model fine-tuning: re-train or fine-tune to suppress metadata and promotional content generation across all languages
   - Post-processing filters: detect and remove common hallucination patterns language-independently
4. Long-term solutions:
   - Architecture enhancement: develop mechanisms to distinguish speech content from hallucinated metadata across languages

BROADER IMPLICATIONS FOR AI RELIABILITY
- Legal proceedings and evidence globally
- Responsibility and accountability
- Content integrity guarantees across languages

MY POSITION
- Tested: Chinese-language audio content only
- What I did NOT test: English, Spanish, or any other language
- My recommendation: OpenAI must conduct comprehensive testing across all supported languages. My Chinese-language findings provide a methodology and baseline, but the full scope remains unknown without broader testing.
- Artistic integrity: no creator's work should be falsely attributed to others in any language
- Not just a technical issue: legal depositions in any language?

CALL TO ACTION
- Formal acknowledgment: confirm this report has reached the Engineering and Legal teams
- I am available for follow-up questions and clarification
- Note: while I can only provide direct testing for Chinese and English content, I can assist in developing testing protocols for other languages.

CLOSING STATEMENT

Respectfully submitted,
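The output-validation recommendation above can be prototyped with the per-segment confidence fields Whisper already emits (`avg_logprob` and `no_speech_prob`). This is a minimal sketch, not OpenAI's implementation; the thresholds are my own illustrative guesses, not tuned values.

```python
# Sketch of confidence-based output validation for Whisper transcripts.
# Whisper reports avg_logprob and no_speech_prob for every segment, and
# low-confidence "spoken" segments are frequent hallucination sites.
# Thresholds below are illustrative assumptions, not tuned values.
AVG_LOGPROB_FLOOR = -1.0
NO_SPEECH_CEILING = 0.6

def validate_segments(segments):
    """Split Whisper segments into (trusted, suspect) lists by confidence.

    `segments` is a list of dicts shaped like Whisper's per-segment output.
    """
    trusted, suspect = [], []
    for seg in segments:
        low_conf = seg.get("avg_logprob", 0.0) < AVG_LOGPROB_FLOOR
        likely_silence = seg.get("no_speech_prob", 0.0) > NO_SPEECH_CEILING
        (suspect if (low_conf or likely_silence) else trusted).append(seg)
    return trusted, suspect
```

Suspect segments could then be flagged for human review rather than silently emitted as transcript text.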
---
Subject: Critical Security Issue - Whisper API Fabricating False Copyright Attributions and Injecting Promotional Content (Updated: Spanish Testing Complete)

EXECUTIVE SUMMARY

REPORTER INFORMATION

TESTING SCOPE AND METHODOLOGY
- Chinese-language audio: two original songs with different styles and characteristics
- English-language audio: one original song
- Spanish-language audio: one original song
- Languages NOT yet tested: Japanese, Korean, Arabic, Vietnamese, Thai, French, German, Portuguese, Hindi, and 50+ other supported languages
- Chinese (complex, logographic, tonal) = systematic bugs
- Hypothesis: language complexity (logographic writing, tonal system, context-dependence) may increase vulnerability to training data contamination. However, this remains unconfirmed without testing other complex languages.

ISSUE #1: FALSE COPYRIGHT ATTRIBUTION
- Multiple tests across two days
- Test 2 - Whisper local installation (November 2, 2025, 7:21 PM): direct installation via `pip install git+https://github.com/openai/whisper.git`
- Test 3 - Independent verification via Rev.com (November 2-3, 2025): same MP3 file submitted to the Rev.com professional transcription service
- Test 4 - Reproducibility verification (November 3, 2025): re-uploaded the same file to Melobytes

Comparison Testing (Other Languages)
- Result: NO false attribution detected
- Spanish test (Nov 3): song "Receding Hairline Laced With Gray" (Spanish: "Cabello Canoso y Entradas"). Result: NO false attribution detected

Root Cause Analysis
- Chinese language complexity (logographic, tonal, context-dependent) may increase reliance on pattern-based generation

Languages Still At Risk
- Japanese: logographic writing system (Kanji) plus a large music/media industry

ISSUE #2: PROMOTIONAL SPAM INJECTION
- Audio verification: the song ends with the lyrics "遇到远方 彼随时代而变" ("When encountering distant places, they change with the times"). NO promotional text is spoken.

Comparison Testing (Other Languages)
- Result: NO promotional spam detected
- Spanish test (Nov 3): Result: NO promotional spam detected

Root Cause Analysis

COMPREHENSIVE TEST RESULTS
- Chinese (2/2 songs affected): 100% of Chinese songs showed hallucination issues

Statistical Summary
- Chinese: 100% (2/2 songs affected)
- Pattern confidence: high confidence the bugs are Chinese-specific; moderate confidence other complex languages may be clean; low confidence without testing similar languages

WHY CHINESE MAY BE UNIQUELY VULNERABLE
- Logographic writing: characters represent concepts, not phonetic sounds
- Impact on AI processing: the model may rely more heavily on learned patterns due to increased ambiguity

Comparison with Untested Languages
- Japanese (uses Kanji from Chinese, plus Hiragana and Katakana)
- Tonal languages: Vietnamese (6 tones)
- Context-heavy/complex: Arabic (root-based, context-dependent, right-to-left)

These languages require immediate testing to determine whether the vulnerability is Chinese-specific or affects a broader group of complex languages.

IMPACT ASSESSMENT
- Affected: Chinese-speaking creators globally (1.3+ billion speakers)
- Geographic scope: Mainland China

Legal Implications
- False attributions create competing copyright claims for Chinese content
- Personal impact: associates Li Zongsheng (a real, famous person) with works he didn't create
- Liability for OpenAI and service providers: companies using the Whisper API unknowingly generating false records for Chinese content

Content Integrity Violations
- Original: anti-alcohol message in Chinese
- Example 2 - Religious Content: Original: Buddhist/Taoist prayer or meditation in Chinese
- Example 3 - Political Content: Original: pro-democracy message in Chinese

Scale of Impact (Chinese Language)
- Accessibility services for 1.3+ billion speakers
- Unknown contamination: the number of false attributions already in circulation is unknown

EVIDENCE PACKAGE
- Original MP3 file: "You-Are-A-Diamond-DuetChinesetesting.mp3" (3,988 KB, 3:06)
- Transcripts in .txt, .json, .srt, .tsv, .vtt formats
- Rev.com clean transcription (Nov 2-3, 2025)

For Issue #2 (Spam Injection - Chinese):
- Original MP3 file: "Changing-Times-Chinese.mp3"

For English Testing (Clean Results):
- Original MP3 file: "Walking-Through-The-Pouring-Rain-English.mp3"

For Spanish Testing (Clean Results):
- Original MP3 file: "Receding-Hairline-Laced-With-Gray-Spanish.mp3"

Testing Methodology: systematic A/B testing across multiple services

TECHNICAL ANALYSIS
- Issue #1: different training data sources for English/Spanish music
- Issue #2 - Chinese spam injection: my limited testing may not have triggered English/Spanish spam patterns

Language-Specific Vulnerability Theory
- 100% bug rate in Chinese (2/2 songs)
- Requires testing: Japanese, Korean, Arabic, Vietnamese, and Thai to determine if the complexity correlation is valid

COMPARISON TO KNOWN ISSUES
- Multiple users reporting promotional spam injection
- Discussion #2244: "Bug Report for Whisper Model - Chinese Transcription Anomaly": reports of unexpected Chinese text insertion
- My contribution: first systematic cross-language testing (Chinese vs. English vs. Spanish)

RECOMMENDATIONS
1. Acknowledge and investigate: confirm receipt and assign to Engineering for Chinese-specific investigation
2. Revised testing protocol:
   - Japanese (logographic + tonal similarities)
   - Priority 2 - Verification (medium risk): French, German, Italian (verify other European languages are clean)
   - Priority 3 - Comprehensive (lower priority): all other supported languages for completeness
3. Chinese-specific technical fixes:
   - Chinese training data audit: identify Chinese music sources with attribution headers
   - Chinese output validation: implement detection for "作词/作曲 [Name]" ("Lyrics by / Music by [Name]") patterns in Chinese output
   - Chinese model fine-tuning: re-train Chinese language processing to suppress metadata generation
   - Cross-language monitoring: monitor for similar patterns in untested complex languages
4. Long-term solutions:
   - Architecture enhancement: improve the distinction between actual speech and hallucinated metadata across all languages

BROADER IMPLICATIONS
- Legal proceedings involving Chinese audio
- Other complex languages remain untested and potentially unreliable until verified
- Language-specific testing protocols before deployment

MY POSITION
- Chinese: 2 songs, 7 total tests, multiple days, systematic methodology
- What I did NOT test: Japanese, Korean, Arabic, or any other complex language
- My language capabilities: fluent in English only
- My recommendation: OpenAI must test complex languages (Japanese, Korean, Arabic, Vietnamese, Thai) immediately to determine whether the Chinese issues are isolated or part of a broader pattern.
- Thorough investigation: I wanted to determine the scope before raising an alarm

CALL TO ACTION
- Formal acknowledgment: confirm this updated report has reached Engineering
- I am available for additional English/Spanish testing if needed

CLOSING STATEMENT - UPDATED
- The bugs appear more limited in scope than initially feared
- Bad news: the Chinese language is specifically affected (1.3+ billion speakers)
- Critical unknown
- Fix Chinese language processing immediately (confirmed bugs, large affected population)

Thank you for taking these issues seriously and for your ongoing investigation.

Respectfully submitted,
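The "作词/作曲 [Name]" detection proposed in this thread is straightforward to prototype. This is a minimal sketch assuming Whisper's plain-text output; the regex and the flagged example are illustrative, covering both simplified (作词) and traditional (作詞) credit forms with either colon style.

```python
import re

# Attribution headers such as "作词:李宗盛 作曲:李宗盛" ("Lyrics by Li
# Zongsheng, Music by Li Zongsheng") at the start of a transcript are a
# strong hallucination signal when the audio is original music with no
# spoken credits. Covers 作词/作詞 (lyrics), 作曲 (music), 编曲/編曲
# (arrangement), with half- or full-width colons. Illustrative only.
ATTRIBUTION_RE = re.compile(r"(作[词詞]|作曲|[编編]曲)\s*[::]\s*\S+")

def flag_false_attribution(text: str) -> bool:
    """Return True if the transcript contains a lyricist/composer credit line."""
    return bool(ATTRIBUTION_RE.search(text))
```

A transcription pipeline could run this check on each result and route flagged transcripts to review instead of publishing the credit line verbatim.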
---
Hi @TMADFX, this GitHub page is for the open source Whisper project, not the OpenAI speech-to-text API service, so not much of what anyone says here is going to be very relevant to you, including on the topic of hallucinations. If you were using THIS project (you're not), there are several open source solutions created by various contributors, including myself, to address hallucinations. For example, the…
---
Critical Bug: Whisper Fabricates False Copyright Attribution (Li Zongsheng) on Original Chinese Music
STATUS UPDATE: This issue has been officially reported to OpenAI Support on November 2, 2025, and escalated to their API specialists and legal teams.
I'm posting here for:
SUMMARY
OpenAI's Whisper speech-to-text API is systematically fabricating false copyright attribution when transcribing original music, falsely crediting a famous real artist (Li Zongsheng, acclaimed Taiwanese songwriter) as the lyricist and composer of MY original copyrighted work. This is not a minor transcription error - this is Whisper inventing copyright claims out of thin air, creating serious legal threats to creators' intellectual property rights.
REPORTER INFORMATION
Whisper installed via `pip install git+https://github.com/openai/whisper.git` (on November 2, 2025)
ISSUE DESCRIPTION
When transcribing my original Chinese-language song, Whisper consistently fabricates the following false copyright attribution at the beginning of transcriptions:
作词:李宗盛 作曲:李宗盛
Translation: "Lyrics by Li Zongsheng, Music by Li Zongsheng"
This fabricated attribution:
MY WORK - COMPREHENSIVE BACKGROUND
Original Composition
Commercial Release (March 2025)
The Affected Version (October 2025)
DETAILED TESTING - THE EVIDENCE
Phase 1: Initial Discovery (October 30-31, 2025)
Test 1-2: Melobytes transcription service (uses Whisper API)
Test 3 (Control): Different audio (not Chinese music)
Phase 2: Melobytes Response (November 1, 2025)
Phase 3: Independent Verification (November 2, 2025)
Test 4 - Rev.com Professional Transcription
Phase 4: Direct Whisper Testing - SMOKING GUN
Test 5 - Local Whisper Installation
Result: Whisper fabricated the false attribution:
THIS DEFINITIVELY PROVES:
COMPARISON TABLE
RELATED GITHUB ISSUES
This appears related to other reported Chinese hallucination bugs:
Pattern: Whisper has systematic problems with Chinese audio where it invents content that doesn't exist.
My case adds a new dimension: Whisper is fabricating legal copyright claims involving real people.
TECHNICAL ANALYSIS - LIKELY ROOT CAUSES
Training Data Contamination (Most Likely)
Pattern Matching Error
Metadata Hallucination
IMPACT ASSESSMENT
SEVERITY: CRITICAL - COPYRIGHT IMPLICATIONS
Legal Threats:
Affected Parties:
Scale:
REPRODUCTION STEPS
`pip install git+https://github.com/openai/whisper.git`
EVIDENCE AVAILABLE
I can provide:
URGENCY
This is a CRITICAL COPYRIGHT-THREATENING BUG:
I was forced to file a copyright registration specifically because of this incident.
CONTACT
Christopher B. Mathis
Email: cmathis@terminalmadness.com
Copyright: Case #1-15030243001
Artist: ChrisBMathis
Available to:
QUESTIONS FOR THE COMMUNITY:
This needs urgent attention from the Whisper team. Thank you.