@x15sr71
Contributor

@x15sr71 x15sr71 commented Nov 28, 2025

In raising this pull request, I confirm the following (please check boxes):

  • I have read and understood the contributors guide.
  • I have checked that another pull request for this purpose does not exist.
  • I have considered, and confirmed that this submission will be valuable to others.
  • I accept that this submission may not be used, and the pull request closed at the will of the maintainer.
  • I give this submission freely, and claim no ownership to its content.
  • I have mentioned this change in the changelog.

My familiarity with the project is as follows (check one):

  • I have never used CCExtractor.
  • I have used CCExtractor just a couple of times.
  • I absolutely love CCExtractor, but have not contributed previously.
  • I am an active contributor to CCExtractor.

Description

Fixes #1759 - This PR restores functional XMLTV generation for ATSC broadcast streams and adds comprehensive EPG parsing capabilities. ATSC streams with EIT/VCT/ETT tables now generate complete XMLTV output with program titles, descriptions, and extended text metadata.

Problem

The -xmltv parameter was completely non-functional for ATSC broadcast streams. When processing ATSC transport streams containing valid EPG data (EIT tables), channel information (VCT/TVCT tables), and extended text (ETT tables), CCExtractor would:

  • Generate SRT caption files (working correctly)
  • NOT generate XMLTV files (the bug)
  • Ignore extended program descriptions from ETT tables
  • Drop events due to buffer boundary check errors

This made it impossible to extract Electronic Program Guide data from ATSC streams, despite the -xmltv parameter being specified.

Root causes identified:

  1. EPG events stored in fallback storage (TS_PMT_MAP_SIZE) were never output to XMLTV
  2. Inverted buffer boundary check logic (CHECK_OFFSET macro) caused parser failures and potential buffer overruns
  3. Limited ATSC table ID support (missing extended EIT tables, Cable VCT, and ETT tables)
  4. ATSC multiple_string parser incorrectly combined title and description into single field
  5. No support for ETT (Extended Text Table) parsing, losing detailed program information

Solution

Core Fixes

  1. Fixed EPG output logic (EPG_output() function)

    • Modified to always check fallback storage regardless of nb_program value
    • ATSC streams store events in fallback due to VCT source ID mapping, but these were being ignored
    • Now correctly outputs events from both program-mapped storage and fallback storage
    • Ensures ATSC VCT-defined channels generate XMLTV output
  2. Fixed critical buffer boundary check (CHECK_OFFSET macro)

    • Corrected inverted logic from < to > in boundary validation
    • Before: if (offset + val < offset_end) (incorrect - allowed overruns)
    • After: if (offset + (val) > offset_end) (correct - prevents overruns)
    • Applied consistently across EIT, VCT, and ETT parsing functions
    • Prevents crashes and incomplete parsing
  3. Extended ATSC table support (EPG_parse_table() function)

    • Added extended EIT table IDs: 0xCD, 0xCE, 0xCF, 0xD0 (in addition to 0xCB)
    • Added Cable VCT variant: 0xC9 (in addition to Terrestrial VCT 0xC8)
    • New: Added ETT (Extended Text Table) support: 0xCC
    • Ensures comprehensive ATSC EPG data extraction

New Features

  1. Implemented ATSC ETT (Extended Text Table) parsing

    • Added EPG_ATSC_decode_ETT() function to parse ETT table structures
    • Added EPG_ATSC_decode_ETT_text() to extract multiple string format extended descriptions
    • ETT data now populates <desc> tags in XMLTV output with detailed program information
    • Matches ETT extended text to events by source_id (service_id)
    • Supports multi-segment, multi-language text extraction
  2. Enhanced ATSC multiple_string decoder (EPG_ATSC_decode_multiple_string())

    • Fixed to properly separate title (segment 0) and description (segment 1)
    • Before: Both segments written to same field, causing data loss
    • After: First segment → event_name (title), second segment → text (subtitle/description)
    • Added proper memory management and bounds checking
    • Only processes uncompressed ANSI strings (compression_type==0x00, mode==0x00)
  3. Improved XMLTV output formatting

    • Added proper indentation and line breaks for readability
    • ETT extended text now appears in <desc> tags (correct XMLTV placement)
    • Fixed empty subtitle handling (only output when text exists)
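
For readers unfamiliar with the multiple_string layout, the segment separation described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `decode_multiple_string` and `copy_text` are hypothetical names, only the first string is read, and the field order (number_strings, 3-byte language code, number_segments, then compression_type / mode / number_bytes per segment) is assumed from ATSC A/65.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copies at most dst_sz - 1 bytes and NUL-terminates. */
static void copy_text(char *dst, size_t dst_sz, const uint8_t *src, size_t n)
{
	if (n > dst_sz - 1)
		n = dst_sz - 1;
	memcpy(dst, src, n);
	dst[n] = '\0';
}

/* Sketch: segment 0 of the first string goes to `title`, segment 1 to `subtitle`.
 * Segments that are not uncompressed (compression_type 0x00, mode 0x00) are skipped.
 * Returns 0 on success, -1 if the input is truncated. */
static int decode_multiple_string(const uint8_t *p, size_t len,
				  char *title, size_t title_sz,
				  char *subtitle, size_t subtitle_sz)
{
	size_t pos = 0;

	assert(title_sz > 0 && subtitle_sz > 0);
	title[0] = '\0';
	subtitle[0] = '\0';

	if (len < 1)
		return -1;
	if (p[pos++] == 0) /* number_strings */
		return 0;
	if (len - pos < 4) /* language code (3 bytes) + number_segments */
		return -1;
	pos += 3;
	uint8_t number_segments = p[pos++];

	for (uint8_t seg = 0; seg < number_segments; seg++)
	{
		if (len - pos < 3) /* compression_type, mode, number_bytes */
			return -1;
		uint8_t compression_type = p[pos];
		uint8_t mode = p[pos + 1];
		uint8_t number_bytes = p[pos + 2];
		pos += 3;
		if (len - pos < number_bytes)
			return -1;
		if (compression_type == 0x00 && mode == 0x00)
		{
			if (seg == 0)
				copy_text(title, title_sz, p + pos, number_bytes);
			else if (seg == 1)
				copy_text(subtitle, subtitle_sz, p + pos, number_bytes);
		}
		pos += number_bytes;
	}
	return 0;
}
```

Every length is checked against the remaining byte count before it is used, so a truncated section makes the function return -1 instead of reading past the buffer.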

Testing

Tested with sample files provided by @TPeterson94070 in issue #1759:

  • channel5FullTS.ts - 5 channels with VCT/TVCT tables
  • ch12FullTS.ts - Additional ATSC test case
  • ch29FullTS.ts - 5 programs with extended EIT data (Nov 26-28, 2025)

Before this PR:

./ccextractor channel5FullTS.ts --xmltv 1

  • Output: Only .srt file generated
  • No XMLTV file created (bug)
  • ETT data completely ignored

After this PR:

./ccextractor channel5FullTS.ts --xmltv 1

  • Output: Both .srt AND .xml files generated successfully
  • XMLTV file contains:
    • Channel listings extracted from VCT with correct IDs
    • Program schedules parsed from EIT-0/1/2/3 (table IDs 0xCB-0xD0)
    • Extended program descriptions from ETT tables (0xCC)
    • UTC timestamps, titles, and subtitles properly captured
    • Unique ts-meta-id values matching EIT event IDs
    • Well-formatted XML with proper indentation

Sample XMLTV output (after ETT parsing):

Known Limitations

  • ATSC date/time conversion: In some streams the conversion produces incorrect years (pre-existing behavior).

  • Channel naming: XMLTV output uses numeric channel IDs (source_id) instead of human-readable names. VCT short_name and major/minor channel numbers are not currently mapped to XMLTV display-name elements.

  • Orphaned events: Some EIT events may appear under channel="0" when their service_id does not match any VCT-defined program. This occurs with malformed streams or when VCT data is incomplete.

These three accuracy issues mentioned above (incorrect dates, channel naming, orphaned programs) are data quality problems that existed in the codebase previously and are not directly caused by or related to the primary bug fix in this PR.

I believe these should be addressed in follow-up PRs for better separation of concerns. However, if maintainers prefer these issues to be fixed in this PR, I'm happy to include them.

@x15sr71 x15sr71 marked this pull request as draft December 4, 2025 05:05
@x15sr71 x15sr71 force-pushed the fix/atsc-eit-xmltv-generation branch from 52cce44 to b033bde Compare December 9, 2025 17:25
@x15sr71 x15sr71 marked this pull request as ready for review December 9, 2025 18:09
@ccextractor-bot
Collaborator

CCExtractor CI platform finished running the test files on linux. Below is a summary of the test results, when compared to test for commit b293017...:
Report Name  Tests Passed
Broken       13/13
CEA-708      14/14
DVB          7/7
DVD          3/3
DVR-MS       2/2
General      27/27
Hardsubx     1/1
Hauppage     3/3
MP4          3/3
NoCC         10/10
Options      86/86
Teletext     21/21
WTV          13/13
XDS          34/34

Congratulations: Merging this PR would fix the following tests:


All tests passing on the master branch were passed completely.

Check the result page for more info.

@ccextractor-bot
Collaborator

CCExtractor CI platform finished running the test files on windows. Below is a summary of the test results, when compared to test for commit b293017...:
Report Name  Tests Passed
Broken       13/13
CEA-708      14/14
DVB          7/7
DVD          3/3
DVR-MS       2/2
General      27/27
Hardsubx     1/1
Hauppage     3/3
MP4          3/3
NoCC         10/10
Options      86/86
Teletext     21/21
WTV          13/13
XDS          34/34

Congratulations: Merging this PR would fix the following tests:

  • ccextractor --autoprogram --out=srt --latin1 f1422b8bfe..., Last passed: Never
  • ccextractor --datapid 5603 --autoprogram --out=srt --latin1 --teletext 85c7fc1ad7..., Last passed: Never
  • ccextractor --autoprogram --out=srt --latin1 --quant 0 85271be4d2..., Last passed: Never
  • ccextractor --hardsubx 1a0302f7fd..., Last passed: Never
  • ccextractor --autoprogram --out=ttxt --latin1 c0d2fba8c0..., Last passed: Never
  • ccextractor --autoprogram --out=ttxt --latin1 006fdc391a..., Last passed: Never
  • ccextractor --autoprogram --out=ttxt --latin1 e92a1d4d2a..., Last passed: Never
  • ccextractor --autoprogram --out=ttxt --latin1 7e4ebf7fd7..., Last passed: Never
  • ccextractor --autoprogram --out=ttxt --latin1 9256a60e4b..., Last passed: Never
  • ccextractor --autoprogram --out=ttxt --latin1 27d7a43dd6..., Last passed: Never
  • ccextractor --autoprogram --out=ttxt --latin1 297a44921a..., Last passed: Never
  • ccextractor --autoprogram --out=ttxt --latin1 efbe129086..., Last passed: Never
  • ccextractor --autoprogram --out=ttxt --latin1 eae0077731..., Last passed: Never
  • ccextractor --autoprogram --out=ttxt --latin1 e2e2b501e0..., Last passed: Never
  • ccextractor --autoprogram --out=ttxt --latin1 c6407fb294..., Last passed: Never
  • ccextractor --autoprogram --out=ttxt --latin1 --datets dcada745de..., Last passed: Never
  • ccextractor --autoprogram --out=srt --latin1 --tpage 398 5d5838bde9..., Last passed: Never
  • ccextractor --autoprogram --out=srt --latin1 --teletext --tpage 398 3b276ad8bf..., Last passed: Never

All tests passing on the master branch were passed completely.

Check the result page for more info.

@x15sr71
Contributor Author

x15sr71 commented Dec 13, 2025

Hi @cfsmp3,

I noticed that master now includes changes to EPG_ATSC_decode_multiple_string() which overlap with this PR, so I wanted to clarify intent before resolving the conflict.
This PR originated from Issue #1759, where ATSC streams using VCT/TVCT tables were not producing XMLTV output despite valid EIT/ETT data being present. While addressing that, I updated the ATSC multiple_string handling to correctly map multi-segment strings.

In this PR, the logic follows ATSC A/65 semantics:

  • segment 0 → <title> (event_name)

  • segment 1 → <sub-title> (text), when present

  • ETT → <desc>, when available

I see that master currently duplicates segment 0 into both event_name and text. This works for single-segment streams, but when segment 1 is present it would ignore that segment, resulting in the short description not being propagated to the XMLTV output.

Before resolving the conflict, could you share your perspective on whether duplicating segment 0 into both fields was intentional (for example, as a fallback for single-segment streams), or whether preserving the multi-segment separation from this PR would be acceptable?

I’d really appreciate your guidance on the intended behavior here so I can resolve the conflict correctly.

Thanks!

@cfsmp3
Contributor

cfsmp3 commented Dec 13, 2025

@x15sr71 I'm working on fixing all our timing problems as well as all the vulnerabilities that can be easily caught by a linter (use of memory without checking malloc, use of sprintf instead of snprintf, etc.), so many files are being modified over the weekend.

Take a look at the PR that modified ts_tables_epg.c to see what happened. You can add a comment there if you think a change is wrong and needs work.

I have two more large PRs in flight that I'm currently testing, and once those are merged you can expect calm again.

I'm experimenting with Claude to address some of the long-standing issues, and it has been an extremely productive session (timing is pretty much identical to FFmpeg in the reference samples I'm using), but at the same time it's possible this high speed is leading to some incorrect changes.

BTW if you are using input files that we don't have in our sample platform it would be great if you could upload them so they become part of the official tests.

@TPeterson94070

I have documented in Issue #1759 that the existing PR1773 does not in fact include in the XMLTV output all ETT strings that match EIT events in the test stream. This description of the ETT format tells how to relate the EIT events to the ETT strings by combining bitwise the 16-bit Source ID, the 14-bit EIT Event ID, and 0x02. All of the EIT entries in my sample files have "ETM location: 1", meaning that the corresponding ETT string is present.

I'll be happy to upload more TS samples if you tell me where to put them.

@cfsmp3
Contributor

cfsmp3 commented Dec 14, 2025

I'll be happy to upload more TS samples if you tell me where to put them.

Could you upload them to a google drive or dropbox or something like that and share them with me so I can fetch them? (then you can delete them).
No limit in number of files or size.

I can then copy them to our test suite and just have them tested for each pull request.

@x15sr71
Contributor Author

x15sr71 commented Dec 14, 2025

Thanks for the context @cfsmp3. I’ll hold off on pushing for now and wait for the current EPG and timing-related changes on master to settle, to avoid unnecessary conflicts and keep the review clean.

(timing is pretty much identical to FFmpeg in the reference samples I'm using)

That's great to hear, good to know the timing work is refined!

@x15sr71
Contributor Author

x15sr71 commented Dec 14, 2025

Thanks for the clarification @TPeterson94070 and for pointing out the exact ETM_id bit layout, you were absolutely right.

I’ve corrected the ETM_id handling to follow the ATSC A/65 definition precisely (source_id << 16) | (event_id << 2) | 0x02, and with that fix in place the XMLTV output now includes the full set of ETT descriptions matched correctly to their corresponding EIT events and timings. I’ve verified this against your sample stream, and the results now align with what TSReader produces for this case.
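
As a minimal sketch of that bit layout (the function names are hypothetical and are not the ones used in the PR), the ETM_id for an event can be built and taken apart like this:

```c
#include <assert.h>
#include <stdint.h>

/* Builds the ETM_id of an event ETM: source_id in bits 31..16,
 * the 14-bit event_id in bits 15..2, and the constant 0x02 in bits 1..0. */
static uint32_t etm_id_for_event(uint16_t source_id, uint16_t event_id)
{
	assert(event_id <= 0x3FFF); /* event_id is a 14-bit field */
	return ((uint32_t)source_id << 16) | ((uint32_t)(event_id & 0x3FFF) << 2) | 0x02u;
}

/* Recovers the source_id from an ETM_id. */
static uint16_t etm_id_source(uint32_t etm_id)
{
	return (uint16_t)(etm_id >> 16);
}

/* Recovers the 14-bit event_id from an event ETM_id. */
static uint16_t etm_id_event(uint32_t etm_id)
{
	return (uint16_t)((etm_id >> 2) & 0x3FFF);
}
```

For example, source_id 1 and event_id 5 give ETM_id 0x00010016, which is the key an ETT section would carry for that EIT event.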

I’ve added the currently generated XML file here:

20251206ch29FullTS_epg.xml for 20251206ch29FullTS.ts

I’ll be pushing these changes shortly. At the moment, as cfsmp3 mentioned, there are several large EPG and timing-related changes landing on master, so I will push the changes once things stabilize a bit to avoid unnecessary conflict resolution and ensure the update is cleanly reviewable.

Thanks again for catching this, it helped improve the correctness of the implementation.

@TPeterson94070

Thanks, @x15sr71 , for incorporating the ATSC ETM id "secret decoder ring" into PR1773. 😊

I see that the XML correctly identifies the EIT events by channel ID. However, the initial list of channel IDs at the head of the <tv> table only gives that same ID as the <display-name> rather than the real names, as supplied in the TVCT table. E.g., Channel 2.1, short name: "KTVU-HD" instead of channel=1. Does this not conform to the XMLTV specification? If not, providing the actual channel names somewhere in the XML file would be an excellent addition, IMO.

@TPeterson94070

TPeterson94070 commented Dec 14, 2025

@cfsmp3 , after confirming that 20-second and even 15-second clips contain all the broadcast EIT/ETT data, I made 20-second clips of the 21 rf channels that I can receive well and put them into this folder on my Google Drive. IIUC, you should be able to copy the files therein. If I'm mistaken about that folder's permissions and you cannot copy them, please let me know and I'll share them individually here.

BTW, I didn't examine them all in detail, but I did see that ch12's EIT event start times are off by months! That doesn't affect the EIT-ETT linkage, but it will mess up any EPG that tries to show actual event times for that channel. I'm going to try to contact the station engineer and ask that the times be corrected.

@cfsmp3
Contributor

cfsmp3 commented Dec 15, 2025

Maintainer Review

I've done a deep code review of this PR. Excellent work, @x15sr71! This is a high-quality fix for a real, important bug.

Critical Bug Fix Confirmed ✅

The CHECK_OFFSET macro bug you identified is real and critical. The current master code has:

#define CHECK_OFFSET(val)              \
	if (offset + val < offset_end) \
	return

This returns when we're within bounds (wrong!). Your fix correctly returns when outside bounds:

#define CHECK_OFFSET(val)                \
	if (offset + (val) > offset_end) \
	return

This prevents potential buffer overreads.
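
To illustrate the corrected direction of the check outside of a macro, here is a small self-contained sketch. `in_bounds` and `read_u16_checked` are hypothetical helpers, not functions from ts_tables_epg.c; the test is equivalent to rejecting `offset + n > offset_end`, but it compares against the remaining length so that no pointer past the end of the buffer is formed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 when reading `n` bytes starting at
 * `offset` stays within the buffer that ends at `offset_end`. */
static int in_bounds(const uint8_t *offset, const uint8_t *offset_end, size_t n)
{
	return offset <= offset_end && n <= (size_t)(offset_end - offset);
}

/* Reads a big-endian 16-bit field.
 * Returns 0 on success, -1 if the read would overrun the buffer. */
static int read_u16_checked(const uint8_t *offset, const uint8_t *offset_end, uint16_t *out)
{
	assert(out != NULL);
	if (!in_bounds(offset, offset_end, 2))
		return -1; /* bail out only when out of bounds, as in the fixed macro */
	*out = (uint16_t)((offset[0] << 8) | offset[1]);
	return 0;
}
```

With a 3-byte buffer, reads at positions 0 and 1 succeed and a read at position 2 is rejected, which is the behavior the inverted comparison got backwards.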

Root Cause Analysis Confirmed ✅

Your diagnosis is correct: ATSC events end up in fallback storage (TS_PMT_MAP_SIZE) due to VCT source_id mapping, but EPG_output() was only outputting fallback when nb_program == 0. Since ATSC streams have programs from VCT, the fallback storage was never output - breaking XMLTV for ATSC entirely.
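
A toy model of that output logic, under stated assumptions: the types, `PMT_MAP_SIZE`, and both function names below are hypothetical stand-ins, and the real EPG_output() writes XML rather than counting events. The sketch only shows why events parked in the fallback slot were never emitted once nb_program was non-zero.

```c
#include <assert.h>
#include <stddef.h>

#define PMT_MAP_SIZE 4 /* hypothetical stand-in for TS_PMT_MAP_SIZE */

struct epg_slot
{
	int event_count; /* number of EPG events stored in this slot */
};

struct epg_store
{
	int nb_program;                         /* programs known for the stream */
	struct epg_slot slot[PMT_MAP_SIZE + 1]; /* last slot is the fallback */
};

/* Behaviour described for master: fallback is emitted only when there are no programs. */
static int events_emitted_before(const struct epg_store *s)
{
	int total = 0;

	assert(s != NULL && s->nb_program >= 0 && s->nb_program <= PMT_MAP_SIZE);
	for (int i = 0; i < s->nb_program; i++)
		total += s->slot[i].event_count;
	if (s->nb_program == 0)
		total += s->slot[PMT_MAP_SIZE].event_count;
	return total;
}

/* Behaviour described for this PR: the fallback slot is always checked. */
static int events_emitted_after(const struct epg_store *s)
{
	int total = 0;

	assert(s != NULL && s->nb_program >= 0 && s->nb_program <= PMT_MAP_SIZE);
	for (int i = 0; i < s->nb_program; i++)
		total += s->slot[i].event_count;
	total += s->slot[PMT_MAP_SIZE].event_count;
	return total;
}
```

In this model, a stream with two programs and five events in the fallback slot emits nothing under the old rule and all five under the new one.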

Other Improvements ✅

  • ETT parsing: Good addition for extended program descriptions
  • Multi-segment handling: Correct per ATSC A/65 (segment 0 → title, segment 1 → subtitle)
  • Extended table IDs: EIT 0xCD-0xD0, Cable VCT 0xC9
  • XMLTV formatting: <programme> is correct XMLTV spec

CI Status ✅

All builds and all 237 sample platform tests pass. The sample platform indicates this would actually fix some currently failing tests.

Request

Please rebase on master when you're ready. Master has stabilized from the recent changes I was making.

During the rebase, there's one minor cleanup: the diff shows what appears to be duplicate lines in EPG_ATSC_decode_EIT() (full_id/event.id/service_id assigned twice) - likely a merge artifact. Please verify this gets cleaned up.

Once rebased, this should be ready to merge. Thank you for this thorough fix!

@x15sr71 x15sr71 force-pushed the fix/atsc-eit-xmltv-generation branch from 8355cef to 4d658ed Compare December 15, 2025 19:58
@x15sr71 x15sr71 force-pushed the fix/atsc-eit-xmltv-generation branch from 4d658ed to e0ac99a Compare December 15, 2025 20:18
@x15sr71
Contributor Author

x15sr71 commented Dec 15, 2025

Thanks for the review and the kind words @cfsmp3!

I’ve rebased the PR onto the current master and resolved the conflicts. During the rebase, I also cleaned up the duplicate full_id / event_id / service_id assignments in EPG_ATSC_decode_EIT() that you pointed out.

The PR is now up to date with master. Please let me know if you’d like anything else adjusted before merge.

Thanks again for the thorough review and guidance.

@TPeterson94070

TPeterson94070 commented Jan 31, 2026

@cfsmp3 , I'd like to test my EPG generator using the improved ccextractor xmltv code on a DVB-T full TS file. Do you have any here? If not, can you suggest a source for my testing?

P.S.: I've scanned through the ts files listed in CCExtractor/sample-platform and didn't see any that evidently contained DVB-T full-ts.

@cfsmp3
Contributor

cfsmp3 commented Jan 31, 2026

@cfsmp3 , I'd like to test my EPG generator using the improved ccextractor xmltv code on a DVB-T full TS file. Do you have any here? If not, can you suggest a source for my testing?

P.S.: I've scanned through the ts files listed in CCExtractor/sample-platform and didn't see any that evidently contained DVB-T full-ts.

https://drive.google.com/file/d/1jgTFaJelAaGTNYZCnBZSfaruVqDXJIaq/view?usp=drive_link

I recorded this just now from Spanish TV. Haven't tested it (other than to verify that it does play).

Development

Successfully merging this pull request may close these issues.

ccextractor appears to ignore -xmltv parameter

4 participants