Replies: 6 comments 1 reply
---
Update (November 3, 2025): Received an initial response from OpenAI Support that appears to be a generic reply. I have requested proper escalation to the Whisper engineering/technical team.
---
This is not a bug; it's training material leaking. It's both expected behavior (albeit unwanted) and not something (generally) fixable with code.
---
@MarktHart - I appreciate your technical perspective, but I respectfully and strongly disagree with your conclusion that this is "expected behavior" and "not fixable."

Consider this thought experiment: suppose Whisper output "李宗盛 was arrested for armed robbery" ("Li Zongsheng was arrested for armed robbery"). Would OpenAI respond with "that's just training data leaking, expected behavior, not fixable"? No. They would fix it immediately. So why the double standard?

Why copyright fabrication is equally serious: Legal implications. These false attributions create competing copyright claims that could be used against creators in disputes or licensing negotiations.

Regarding your claim that it's "not fixable with code": in my testing, Rev.com transcribed my file cleanly with zero false attribution. This proves technical solutions exist, for example post-processing filters to detect and remove hallucinated metadata. The real question isn't "can it be fixed?" The question is "will OpenAI prioritize fixing it?"

What Whisper is actually doing: presenting fabricated attribution with no disclaimer that it might be hallucinated. This is fundamentally different from transcription errors or missing words. This is inventing legal attribution that doesn't exist.

Why "expected behavior" is not an acceptable response: if a medical AI hallucinated false diagnoses, would "expected behavior, not fixable" be acceptable? No. When AI systems produce outputs with legal weight, the bar for accuracy is higher. Copyright attribution has legal consequences. It's not just "metadata"; it's a legal claim.

OpenAI's responsibility: OpenAI has a duty of care to users. This is exactly why I reported it as a critical bug: even if it's rooted in training data contamination, it produces legally and ethically unacceptable outputs that cause real harm to real people.

Bottom line: should tools that threaten creators' intellectual property rights be accepted as "expected behavior"? The answer must be no.
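The post-processing approach argued for here can be sketched in a few lines of Python. This is a minimal illustration, not a production fix: the phrase list is a small hypothetical sample drawn from the hallucination patterns reported in this thread and related discussions, and real deployments would need a much larger, curated list.

```python
import re

# Hypothetical sample of hallucinated promotional phrases reported for
# Whisper output (Chinese "like/subscribe/donate" boilerplate plus common
# English equivalents). Illustrative only, not an exhaustive list.
SPAM_PATTERNS = [
    r"请不吝?点赞.*订阅",                      # "please like ... subscribe"
    r"打赏支持",                               # "donate to support"
    r"don't forget to like and subscribe",
    r"hit the bell icon",
    r"support (us|me) on patreon",
]

def strip_promotional_spam(segments):
    """Drop transcript segments matching known promotional-spam patterns.

    `segments` is a list of dicts with a "text" key, matching the shape
    of Whisper's per-segment output.
    """
    compiled = [re.compile(p, re.IGNORECASE) for p in SPAM_PATTERNS]
    return [s for s in segments
            if not any(c.search(s["text"]) for c in compiled)]
```

A caller would pass the `segments` list from a Whisper transcription result through this filter before rendering subtitles or transcripts.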
---
Quick recap, then additional test results:

Subject: Critical Security Issue - Whisper API Fabricating False Copyright Attributions and Injecting Promotional Content into Transcriptions

EXECUTIVE SUMMARY

REPORTER INFORMATION

TESTING SCOPE AND METHODOLOGY
- Two original songs with different styles and characteristics
- Languages NOT yet tested: English, Spanish, French, Japanese, Korean, and other supported languages
- Risk factor: large social media presence in training data (YouTube, TikTok, etc.)

I strongly recommend OpenAI conduct systematic testing across all supported languages to determine the full scope of these issues.

ISSUE #1: FALSE COPYRIGHT ATTRIBUTION

Hallucinated promotional text: "请不吝点赞 订阅 转发 打赏支持明镜与点点栏目" ("Please like, subscribe, share, and donate to support the Mingjing and Diandian programs")

Audio verification: the song ends with the lyrics "遇到远方 彼随时代而变" ("When encountering distant places, they change with the times"). NO promotional text is spoken.

Root Cause Analysis
- English equivalents: "Don't forget to like and subscribe," "Hit the bell icon," "Support on Patreon"
- The mechanism is universal: social media promotional content in training data → model memorization → inappropriate insertion into unrelated transcriptions

COMPARATIVE ANALYSIS
- Different triggers: different song characteristics trigger different hallucination patterns

Extrapolation to Other Languages
- At risk: languages with a large YouTube/social media training data presence
- Testing priority recommendations: Tier 1: English, Spanish, Japanese, Korean, Hindi (large social media presence)

IMPACT ASSESSMENT
- False attributions create competing copyright claims
- Defamation and misrepresentation: associating real individuals with works they didn't create
- Liability for OpenAI and service providers: companies using the Whisper API unknowingly generating false legal records in multiple languages

Content Integrity Violations
- Original: anti-alcohol abuse message (any language)
- Example 2 - Religious/Cultural Content: Original: prayer, meditation, spiritual guidance (any language)
- Example 3 - Children's Content: Original: educational material (any language)
- Example 4 - Political Speech: Original: political statement (any language)
- Example 5 - Memorial/Tribute: Original: honoring deceased loved ones (any language)

These scenarios are language-independent; the harm potential exists wherever Whisper processes content.
- Affected: independent content creators in all languages
- Whisper deployment: used by thousands of companies and services globally

EVIDENCE PACKAGE
- Original MP3 file: "You-Are-A-Diamond-DuetChinesetesting.mp3" (3,988 KB)
- Transcripts in .txt, .json, .srt, .tsv, .vtt formats
- Rev.com clean transcription (Nov 2-3, 2025)

For Issue #2 (Spam Injection):
- Original MP3 file: "Changing-Times-Chinese.mp3"

Testing Methodology: systematic A/B testing across multiple services

TECHNICAL ANALYSIS

COMPARISON TO KNOWN ISSUES
- Multiple users reporting promotional spam injection across multiple languages
- Discussion #2244: "Bug Report for Whisper Model - Chinese Transcription Anomaly": reports of unexpected Chinese text insertion
- My contribution: first detailed documentation of copyright attribution hallucination (Issue #1) with systematic testing methodology and independent verification

RECOMMENDATIONS
1. Acknowledge and investigate: confirm receipt of this report and assign it to the Engineering team for immediate investigation
2. Language-specific testing protocol: music content from independent creators (test for false attribution); priority language for immediate testing: English (largest training data volume)
3. Technical fixes:
   - Training data sanitization: identify and remove sources with attribution headers and promotional overlays across all languages
   - Output validation: implement confidence scoring that distinguishes actual speech from hallucinated metadata for all languages
   - Model fine-tuning: re-train or fine-tune to suppress metadata and promotional content generation across all languages
   - Post-processing filters: detect and remove common hallucination patterns language-independently
4. Long-term solutions:
   - Architecture enhancement: develop mechanisms to distinguish speech content from hallucinated metadata across languages

BROADER IMPLICATIONS FOR AI RELIABILITY
- Legal proceedings and evidence globally
- Responsibility and accountability
- Content integrity guarantees across languages

MY POSITION
- Tested: Chinese-language audio content only
- What I did NOT test: English, Spanish, or any other language
- My recommendation: OpenAI must conduct comprehensive testing across all supported languages. My Chinese-language findings provide a methodology and baseline, but the full scope remains unknown without broader testing.
- Artistic integrity: no creator's work should be falsely attributed to others in any language
- Not just a technical issue: legal depositions in any language?

CALL TO ACTION
- Formal acknowledgment: confirm this report has reached the Engineering and Legal teams
- I am available for follow-up questions and clarification
- Note: while I can only provide direct testing for Chinese and English content, I can assist in developing testing protocols for other languages.

CLOSING STATEMENT

Respectfully submitted,
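The output-validation recommendation above can be prototyped with the per-segment confidence fields Whisper already emits (`avg_logprob` and `no_speech_prob`). This is a minimal sketch, not OpenAI's implementation; the thresholds are my own illustrative guesses, not tuned values.

```python
# Sketch of confidence-based output validation for Whisper transcripts.
# Whisper reports avg_logprob and no_speech_prob for every segment, and
# low-confidence "spoken" segments are frequent hallucination sites.
# Thresholds below are illustrative assumptions, not tuned values.
AVG_LOGPROB_FLOOR = -1.0
NO_SPEECH_CEILING = 0.6

def validate_segments(segments):
    """Split Whisper segments into (trusted, suspect) lists by confidence.

    `segments` is a list of dicts shaped like Whisper's per-segment output.
    """
    trusted, suspect = [], []
    for seg in segments:
        low_conf = seg.get("avg_logprob", 0.0) < AVG_LOGPROB_FLOOR
        likely_silence = seg.get("no_speech_prob", 0.0) > NO_SPEECH_CEILING
        (suspect if (low_conf or likely_silence) else trusted).append(seg)
    return trusted, suspect
```

Suspect segments could then be flagged for human review rather than silently emitted as transcript text.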
---
Subject: Critical Security Issue - Whisper API Fabricating False Copyright Attributions and Injecting Promotional Content (Updated: Spanish Testing Complete)

EXECUTIVE SUMMARY

REPORTER INFORMATION

TESTING SCOPE AND METHODOLOGY
- Chinese-language audio: two original songs with different styles and characteristics
- English-language audio: one original song
- Spanish-language audio: one original song
- Languages NOT yet tested: Japanese, Korean, Arabic, Vietnamese, Thai, French, German, Portuguese, Hindi, and 50+ other supported languages
- Chinese (complex, logographic, tonal) = systematic bugs
- Hypothesis: language complexity (logographic writing, tonal system, context-dependence) may increase vulnerability to training data contamination. However, this remains unconfirmed without testing other complex languages.

ISSUE #1: FALSE COPYRIGHT ATTRIBUTION
- Multiple tests across two days
- Test 2 - Whisper local installation (November 2, 2025, 7:21 PM): direct installation via `pip install git+https://github.com/openai/whisper.git`
- Test 3 - Independent verification via Rev.com (November 2-3, 2025): same MP3 file submitted to the Rev.com professional transcription service
- Test 4 - Reproducibility verification (November 3, 2025): re-uploaded the same file to Melobytes

Comparison Testing (Other Languages)
- Result: NO false attribution detected
- Spanish test (Nov 3): song "Receding Hairline Laced With Gray" (Spanish: "Cabello Canoso y Entradas"). Result: NO false attribution detected

Root Cause Analysis
- Chinese language complexity (logographic, tonal, context-dependent) may increase reliance on pattern-based generation

Languages Still At Risk
- Japanese: logographic writing system (Kanji) plus a large music/media industry

ISSUE #2: PROMOTIONAL SPAM INJECTION
- Audio verification: the song ends with the lyrics "遇到远方 彼随时代而变" ("When encountering distant places, they change with the times"). NO promotional text is spoken.

Comparison Testing (Other Languages)
- Result: NO promotional spam detected
- Spanish test (Nov 3): Result: NO promotional spam detected

Root Cause Analysis

COMPREHENSIVE TEST RESULTS
- Chinese (2/2 songs affected): 100% of Chinese songs showed hallucination issues

Statistical Summary
- Chinese: 100% (2/2 songs affected)
- Pattern confidence: high confidence the bugs are Chinese-specific; moderate confidence other complex languages may be clean; low confidence without testing similar languages

WHY CHINESE MAY BE UNIQUELY VULNERABLE
- Logographic writing: characters represent concepts, not phonetic sounds
- Impact on AI processing: the model may rely more heavily on learned patterns due to increased ambiguity

Comparison with Untested Languages
- Japanese (uses Kanji from Chinese, plus Hiragana and Katakana)
- Tonal languages: Vietnamese (6 tones)
- Context-heavy/complex: Arabic (root-based, context-dependent, right-to-left)

These languages require immediate testing to determine whether the vulnerability is Chinese-specific or affects a broader group of complex languages.

IMPACT ASSESSMENT
- Affected: Chinese-speaking creators globally (1.3+ billion speakers)
- Geographic scope: Mainland China

Legal Implications
- False attributions create competing copyright claims for Chinese content
- Personal impact: associates Li Zongsheng (a real, famous person) with works he didn't create
- Liability for OpenAI and service providers: companies using the Whisper API unknowingly generating false records for Chinese content

Content Integrity Violations
- Original: anti-alcohol message in Chinese
- Example 2 - Religious Content: Original: Buddhist/Taoist prayer or meditation in Chinese
- Example 3 - Political Content: Original: pro-democracy message in Chinese

Scale of Impact (Chinese Language)
- Accessibility services for 1.3+ billion speakers
- Unknown contamination: the number of false attributions already in circulation is unknown

EVIDENCE PACKAGE
- Original MP3 file: "You-Are-A-Diamond-DuetChinesetesting.mp3" (3,988 KB, 3:06)
- Transcripts in .txt, .json, .srt, .tsv, .vtt formats
- Rev.com clean transcription (Nov 2-3, 2025)

For Issue #2 (Spam Injection - Chinese):
- Original MP3 file: "Changing-Times-Chinese.mp3"

For English Testing (Clean Results):
- Original MP3 file: "Walking-Through-The-Pouring-Rain-English.mp3"

For Spanish Testing (Clean Results):
- Original MP3 file: "Receding-Hairline-Laced-With-Gray-Spanish.mp3"

Testing Methodology: systematic A/B testing across multiple services

TECHNICAL ANALYSIS
- Issue #1: different training data sources for English/Spanish music
- Issue #2 - Chinese spam injection: my limited testing may not have triggered English/Spanish spam patterns

Language-Specific Vulnerability Theory
- 100% bug rate in Chinese (2/2 songs)
- Requires testing: Japanese, Korean, Arabic, Vietnamese, and Thai to determine if the complexity correlation is valid

COMPARISON TO KNOWN ISSUES
- Multiple users reporting promotional spam injection
- Discussion #2244: "Bug Report for Whisper Model - Chinese Transcription Anomaly": reports of unexpected Chinese text insertion
- My contribution: first systematic cross-language testing (Chinese vs. English vs. Spanish)

RECOMMENDATIONS
1. Acknowledge and investigate: confirm receipt and assign to Engineering for Chinese-specific investigation
2. Revised testing protocol:
   - Japanese (logographic + tonal similarities)
   - Priority 2 - Verification (medium risk): French, German, Italian (verify other European languages are clean)
   - Priority 3 - Comprehensive (lower priority): all other supported languages for completeness
3. Chinese-specific technical fixes:
   - Chinese training data audit: identify Chinese music sources with attribution headers
   - Chinese output validation: implement detection for "作词/作曲 [Name]" ("Lyrics by / Music by [Name]") patterns in Chinese output
   - Chinese model fine-tuning: re-train Chinese language processing to suppress metadata generation
   - Cross-language monitoring: monitor for similar patterns in untested complex languages
4. Long-term solutions:
   - Architecture enhancement: improve the distinction between actual speech and hallucinated metadata across all languages

BROADER IMPLICATIONS
- Legal proceedings involving Chinese audio
- Other complex languages remain untested and potentially unreliable until verified
- Language-specific testing protocols before deployment

MY POSITION
- Chinese: 2 songs, 7 total tests, multiple days, systematic methodology
- What I did NOT test: Japanese, Korean, Arabic, or any other complex language
- My language capabilities: fluent in English only
- My recommendation: OpenAI must test complex languages (Japanese, Korean, Arabic, Vietnamese, Thai) immediately to determine whether the Chinese issues are isolated or part of a broader pattern.
- Thorough investigation: I wanted to determine the scope before raising an alarm

CALL TO ACTION
- Formal acknowledgment: confirm this updated report has reached Engineering
- I am available for additional English/Spanish testing if needed

CLOSING STATEMENT - UPDATED
- The bugs appear more limited in scope than initially feared
- Bad news: the Chinese language is specifically affected (1.3+ billion speakers)
- Critical unknown
- Fix Chinese language processing immediately (confirmed bugs, large affected population)

Thank you for taking these issues seriously and for your ongoing investigation.

Respectfully submitted,
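The "作词/作曲 [Name]" detection proposed in this thread is straightforward to prototype. This is a minimal sketch assuming Whisper's plain-text output; the regex and the flagged example are illustrative, covering both simplified (作词) and traditional (作詞) credit forms with either colon style.

```python
import re

# Attribution headers such as "作词:李宗盛 作曲:李宗盛" ("Lyrics by Li
# Zongsheng, Music by Li Zongsheng") at the start of a transcript are a
# strong hallucination signal when the audio is original music with no
# spoken credits. Covers 作词/作詞 (lyrics), 作曲 (music), 编曲/編曲
# (arrangement), with half- or full-width colons. Illustrative only.
ATTRIBUTION_RE = re.compile(r"(作[词詞]|作曲|[编編]曲)\s*[::]\s*\S+")

def flag_false_attribution(text: str) -> bool:
    """Return True if the transcript contains a lyricist/composer credit line."""
    return bool(ATTRIBUTION_RE.search(text))
```

A transcription pipeline could run this check on each result and route flagged transcripts to review instead of publishing the credit line verbatim.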
---
Hi @TMADFX, this GitHub page is for the open source Whisper project, not the OpenAI speech-to-text API service, so not much of what anyone says here is going to be very relevant to you, including on the topic of hallucinations. If you were using THIS project (you're not), there are several open source solutions created by various contributors, including myself, to address hallucinations. For example, the…
---
Critical Bug: Whisper Fabricates False Copyright Attribution (Li Zongsheng) on Original Chinese Music
STATUS UPDATE: This issue has been officially reported to OpenAI Support on November 2, 2025, and escalated to their API specialists and legal teams.
I'm posting here for:
SUMMARY
OpenAI's Whisper speech-to-text API is systematically fabricating false copyright attribution when transcribing original music, falsely crediting a famous real artist (Li Zongsheng, acclaimed Taiwanese songwriter) as the lyricist and composer of MY original copyrighted work. This is not a minor transcription error - this is Whisper inventing copyright claims out of thin air, creating serious legal threats to creators' intellectual property rights.
REPORTER INFORMATION
Whisper installed via `pip install git+https://github.com/openai/whisper.git` (on November 2, 2025)
ISSUE DESCRIPTION
When transcribing my original Chinese-language song, Whisper consistently fabricates the following false copyright attribution at the beginning of transcriptions:
作词:李宗盛 作曲:李宗盛
Translation: "Lyrics by Li Zongsheng, Music by Li Zongsheng"
This fabricated attribution:
MY WORK - COMPREHENSIVE BACKGROUND
Original Composition
Commercial Release (March 2025)
The Affected Version (October 2025)
DETAILED TESTING - THE EVIDENCE
Phase 1: Initial Discovery (October 30-31, 2025)
Test 1-2: Melobytes transcription service (uses Whisper API)
Test 3 (Control): Different audio (not Chinese music)
Phase 2: Melobytes Response (November 1, 2025)
Phase 3: Independent Verification (November 2, 2025)
Test 4 - Rev.com Professional Transcription
Phase 4: Direct Whisper Testing - SMOKING GUN
Test 5 - Local Whisper Installation
Result: Whisper fabricated the false attribution:
THIS DEFINITIVELY PROVES:
COMPARISON TABLE
RELATED GITHUB ISSUES
This appears related to other reported Chinese hallucination bugs:
Pattern: Whisper has systematic problems with Chinese audio where it invents content that doesn't exist.
My case adds a new dimension: Whisper is fabricating legal copyright claims involving real people.
TECHNICAL ANALYSIS - LIKELY ROOT CAUSES
Training Data Contamination (Most Likely)
Pattern Matching Error
Metadata Hallucination
IMPACT ASSESSMENT
SEVERITY: CRITICAL - COPYRIGHT IMPLICATIONS
Legal Threats:
Affected Parties:
Scale:
REPRODUCTION STEPS
`pip install git+https://github.com/openai/whisper.git`
EVIDENCE AVAILABLE
I can provide:
URGENCY
This is a CRITICAL COPYRIGHT-THREATENING BUG:
I was forced to file a copyright registration specifically because of this incident.
CONTACT
Christopher B. Mathis
Email: cmathis@terminalmadness.com
Copyright: Case #1-15030243001
Artist: ChrisBMathis
Available to:
QUESTIONS FOR THE COMMUNITY:
This needs urgent attention from the Whisper team. Thank you.