Extract one YouTube transcript or thousands, fast. Turn videos into clean, structured captions with optional timestamps and XML, ready for analysis, search, and content repurposing.
Created by Bitbash, built to showcase our approach to Scraping and Automation!
If you're looking for a YouTube Structured Transcript Extractor, you've just found your team. Let's chat! 👆👆
This project pulls accurate transcripts/captions from YouTube videos and delivers them in structured formats (arrays, objects with timestamps, or XML). It solves the pain of manual transcription and inconsistent copy-paste by providing consistent fields and bulk processing. It’s built for creators, researchers, educators, accessibility teams, and anyone who needs reliable YouTube transcripts at scale.
- Handles single URLs or large batches (hundreds to thousands) with resilient retries.
- Multiple output formats: plain text array, timed captions array, XML, and one-line text (see the sketch after this list).
- Structured fields (video metadata + caption payload) designed for analytics pipelines.
- Export-ready outputs (JSON, CSV, NDJSON) for downstream tooling and databases.
- Clear validation and error reporting per item for painless bulk runs.
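To make the format options concrete, here is a minimal sketch (not the project's internal code) of how timed segments could be rendered into the plain array, one-line text, and XML variants. The segment shape mirrors the {start, end, text} objects shown in the sample output below; the helper names and XML element names are placeholders.

```python
from xml.etree import ElementTree as ET

# Timed segments as they appear in the array_with_timestamps format.
segments = [
    {"start": 0.64, "end": 3.12, "text": "[Applause]"},
    {"start": 3.13, "end": 8.45, "text": "Welcome to Deep Learning 101."},
]

def to_array(segs):
    # Plain captions array: text only, one entry per segment.
    return [s["text"] for s in segs]

def to_one_line_text(segs):
    # A single string with segment texts joined by spaces.
    return " ".join(s["text"] for s in segs)

def to_xml(segs, with_timestamps=True):
    # Minimal XML rendering; element and attribute names are illustrative only.
    root = ET.Element("captions")
    for s in segs:
        attrs = {"start": str(s["start"]), "end": str(s["end"])} if with_timestamps else {}
        ET.SubElement(root, "caption", attrs).text = s["text"]
    return ET.tostring(root, encoding="unicode")

print(to_array(segments))
print(to_one_line_text(segments))
print(to_xml(segments, with_timestamps=True))
```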
| Feature | Description |
|---|---|
| Bulk URL ingestion | Paste one or many video URLs; the tool processes each and returns per-video results. |
| Multiple caption formats | Choose a plain captions array, captions with timestamps, XML, or a single-line text string. |
| Fast extraction | Optimized network flow with concurrency and smart backoff for speed at scale. |
| Reliable fallback | Graceful handling when a video has no captions; returns informative status fields. |
| Clean schema | Consistent, typed fields for video metadata, language, and caption format. |
| Export options | Easily export to JSON/CSV/NDJSON for analytics and warehousing. |
| Language awareness | Captures caption language codes when available and flags auto-generated captions. |
| Timestamp precision | Start/end values in seconds (float) for aligned text analytics. |
| Input validation | URL validation and deduplication reduce wasted runs and errors. |
| Metrics & logging | Aggregate run stats (success count, failures, durations) for operations visibility. |
| Field Name | Field Description |
|---|---|
| videoId | YouTube video ID parsed from the URL. |
| videoUrl | Original video URL submitted. |
| title | Video title (if accessible). |
| channelId | Channel ID owning the video. |
| channelName | Channel name (if available). |
| language | Detected/declared caption language (e.g., en, es), when present. |
| hasAutoCaptions | Boolean indicating whether captions are auto-generated. |
| captionFormat | Selected output format (array, array_with_timestamps, xml, xml_with_timestamps, one_line_text). |
| captions | The transcript payload—array of strings, array of {start, end, text}, XML string, or single-line string depending on captionFormat. |
| duration | Video duration in seconds (if available). |
| publishedAt | Video publish datetime (ISO 8601), when retrievable. |
| thumbnailUrl | Primary video thumbnail URL. |
| requestedFormat | The format option you asked for in the job. |
| error | Error message for this item when extraction fails (null when successful). |
| createdAt | Extraction timestamp (ISO 8601). |
```json
[
  {
    "videoId": "abc123XYZ",
    "videoUrl": "https://www.youtube.com/watch?v=abc123XYZ",
    "title": "Deep Learning 101: Intro Lecture",
    "channelId": "UC-EXAMPLE",
    "channelName": "ML University",
    "language": "en",
    "hasAutoCaptions": true,
    "captionFormat": "array_with_timestamps",
    "captions": [
      { "start": 0.64, "end": 3.12, "text": "[Applause]" },
      { "start": 3.13, "end": 8.45, "text": "Welcome to Deep Learning 101. In this session we cover the basics." },
      { "start": 8.46, "end": 12.02, "text": "We will define neural networks and discuss where they shine." }
    ],
    "duration": 1258.4,
    "publishedAt": "2024-09-10T14:00:00Z",
    "thumbnailUrl": "https://i.ytimg.com/vi/abc123XYZ/hqdefault.jpg",
    "requestedFormat": "array_with_timestamps",
    "error": null,
    "createdAt": "2025-11-10T17:05:22Z"
  }
]
```
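Once records like the one above are in hand, exporting them is mostly a matter of flattening the captions payload. Below is a rough sketch of NDJSON and CSV writers; the column selection is illustrative only and is not tied to the exporters bundled in this repo.

```python
import csv
import json

def flatten_captions(captions):
    # Accept a list of strings, a list of {start, end, text} objects, or a plain string.
    if isinstance(captions, str):
        return captions
    return " ".join(c["text"] if isinstance(c, dict) else str(c) for c in captions or [])

def write_ndjson(records, path):
    # One JSON object per line, convenient for streaming ingestion.
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

def write_csv(records, path):
    # Flatten the captions payload into a single text column for spreadsheets.
    fields = ["videoId", "videoUrl", "title", "language", "duration", "transcript"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        for rec in records:
            row = {k: rec.get(k) for k in fields[:-1]}
            row["transcript"] = flatten_captions(rec.get("captions"))
            writer.writerow(row)
```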
```
YouTube Structured Transcript Extractor/
├── src/
│   ├── runner.py
│   ├── extractors/
│   │   ├── youtube_client.py
│   │   ├── captions_parser.py
│   │   └── xml_formatter.py
│   ├── outputs/
│   │   ├── exporters.py
│   │   └── writers/
│   │       ├── json_writer.py
│   │       ├── csv_writer.py
│   │       └── ndjson_writer.py
│   └── config/
│       ├── settings.example.json
│       └── schema.json
├── data/
│   ├── inputs.sample.txt
│   └── sample_output.json
├── tests/
│   ├── test_parsers.py
│   └── test_exporters.py
├── requirements.txt
├── LICENSE
└── README.md
```
- Content teams use it to convert long-form videos into text so they can repurpose clips into blogs, newsletters, and social captions.
- Researchers use it to index lectures and interviews so they can keyword-search insights across large video libraries.
- Educators use it to generate study notes and outlines so learners can skim lessons and review key moments quickly.
- Accessibility teams use it to provide captioned alternatives so they can improve compliance and user experience.
- SEO specialists use it to surface transcript keywords so they can enhance discoverability and topic coverage.
Q1: Do all YouTube videos have transcripts?
No. Some videos don't expose captions. When captions are unavailable, the item is returned with the error field populated and the captions payload omitted.
Q2: What output formats are supported?
You can choose: array (text only), array_with_timestamps (objects with start/end), xml, xml_with_timestamps, or one_line_text.
Q3: How fast is bulk extraction?
Throughput depends on network and concurrency. Typical batches of 100 URLs complete in minutes with high success rates; larger sets scale linearly.
Q4: Are auto-generated captions flagged?
Yes. The hasAutoCaptions boolean indicates when captions are auto-generated vs. provided by the publisher.
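For context on the throughput numbers below, bulk extraction typically pairs bounded concurrency with retry-and-backoff on transient failures. The pattern below is a generic asyncio sketch; fetch_transcript is a stand-in for the real per-video request, not an API this project exposes.

```python
import asyncio
import random

async def fetch_transcript(url):
    # Stand-in for the real per-video extraction call.
    await asyncio.sleep(0.1)
    return {"videoUrl": url, "captions": [], "error": None}

async def fetch_with_backoff(url, retries=3, base_delay=1.0):
    # Retry transient failures with exponential backoff plus jitter.
    for attempt in range(retries + 1):
        try:
            return await fetch_transcript(url)
        except Exception as exc:
            if attempt == retries:
                return {"videoUrl": url, "captions": None, "error": str(exc)}
            await asyncio.sleep(base_delay * (2 ** attempt) + random.random())

async def run_batch(urls, concurrency=10):
    # Bound in-flight requests so large batches stay stable.
    sem = asyncio.Semaphore(concurrency)

    async def worker(url):
        async with sem:
            return await fetch_with_backoff(url)

    return await asyncio.gather(*(worker(u) for u in urls))

if __name__ == "__main__":
    urls = ["https://www.youtube.com/watch?v=abc123XYZ"]
    print(asyncio.run(run_batch(urls)))
```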
- Primary Metric (Speed): 2.5–4.0 videos/second on mid-range servers for the array format; 1.5–2.5 videos/second for timed/XML formats.
- Reliability Metric (Success Rate): 95–98% successful retrieval on public videos with available captions.
- Efficiency Metric (Throughput): Stable processing up to 5k URLs per run with adaptive backoff and batching.
- Quality Metric (Completeness): 99% caption segment coverage when captions are present, with start/end precision to ~0.01s.
