fix(epub): parse manifest attributes in any order#5

Merged
ojspace merged 1 commit into main from fix/epub-manifest-attr-order on Mar 21, 2026

@ojspace ojspace commented Mar 21, 2026

Problem

EPUB conversion returned epub-failed for valid EPUBs (e.g., Project Gutenberg books) even when unzip was available.

Root Cause

The manifest item regex required id= to appear before href= in <item> tags:

/<item\s+[^>]*id="([^"]+)"[^>]*href="([^"]+)"[^>]*/gi

Project Gutenberg EPUBs (and many others) use href first:

<item href="chapter1.html" id="item1" media-type="application/xhtml+xml"/>

This caused zero manifest items to be found, zero spine items, and epub-failed for every valid Gutenberg EPUB.
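
The failure can be reproduced in isolation. The sketch below applies the regex quoted above to two `<item>` tags that differ only in attribute order; the sample strings and the `countMatches` helper are illustrative and not taken from `src/providers/epub.ts`.

```typescript
// Regex copied from the PR description: it expects id="..." to appear
// somewhere before href="..." inside the same <item> tag.
const oldItemRegex = /<item\s+[^>]*id="([^"]+)"[^>]*href="([^"]+)"[^>]*/gi;

// Illustrative inputs: the same manifest entry with the two attribute orders.
const idFirst =
  '<item id="item1" href="chapter1.html" media-type="application/xhtml+xml"/>';
const hrefFirst =
  '<item href="chapter1.html" id="item1" media-type="application/xhtml+xml"/>';

// Hypothetical helper: counts how many <item> tags the old regex recognizes.
function countMatches(opf: string): number {
  return (opf.match(oldItemRegex) || []).length;
}

const idFirstCount = countMatches(idFirst);     // id before href: matched
const hrefFirstCount = countMatches(hrefFirst); // href before id: not matched

console.log({ idFirstCount, hrefFirstCount });
```

With the href-first ordering nothing follows `id="item1"` that looks like `href="..."`, so the tag is skipped entirely.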

Fix

Match the full <item> element and extract id and href separately, independent of attribute order.
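
A minimal sketch of this approach follows. It is not the merged code: `parseManifestItems`, the exact regexes, and the sample OPF are assumptions made for illustration, and only the name `manifestItems` comes from the review walkthrough.

```typescript
// Hypothetical helper illustrating order-independent manifest parsing.
// Returns a map from manifest item id to href.
function parseManifestItems(opf: string): Record<string, string> {
  const manifestItems: Record<string, string> = {};

  // Step 1: match each whole <item ...> tag. Requiring whitespace after
  // "item" keeps <itemref ...> spine entries out of the result.
  const itemTags = opf.match(/<item\s[^>]*>/gi) || [];

  for (const tag of itemTags) {
    // Step 2: pull id and href out of the tag independently, so their
    // relative order no longer matters. The leading \s avoids matching
    // attributes that merely end in "id" or "href".
    const idMatch = /\sid="([^"]+)"/i.exec(tag);
    const hrefMatch = /\shref="([^"]+)"/i.exec(tag);

    // Step 3: record the item only when both attributes are present.
    if (idMatch && hrefMatch) {
      manifestItems[idMatch[1]] = hrefMatch[1];
    }
  }
  return manifestItems;
}

// Illustrative OPF fragment: href-first, id-first, and an item without href.
const sampleOpf = `
<manifest>
  <item href="chapter1.html" id="item1" media-type="application/xhtml+xml"/>
  <item id="item2" href="chapter2.html" media-type="application/xhtml+xml"/>
  <item id="broken" media-type="text/css"/>
</manifest>`;

const parsed = parseManifestItems(sampleOpf);
console.log(parsed);
```

This sketch only handles double-quoted attribute values, like the original regex; a full XML parser would be the more robust choice if one is available.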

Test

  • Alice's Adventures in Wonderland (Project Gutenberg EPUB) → extraction: epub-native, usefulness_score: 1.00, 179 chunks ✅

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes
    • Fixed EPUB manifest parsing to correctly handle variations in attribute ordering, enabling previously unrecognized e-books to be properly read.

The manifest item regex required id= before href=, but EPUB OPF files
(e.g. Project Gutenberg) often place href first. This caused zero manifest
items to be found, leading to epub-failed for valid EPUBs.

Fix: match the full <item> element, then extract id and href separately.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@ojspace ojspace merged commit 943f644 into main Mar 21, 2026
3 of 4 checks passed

coderabbitai bot commented Mar 21, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 66957d42-1987-484e-a535-bcab214ccc99

📥 Commits

Reviewing files that changed from the base of the PR and between 8f34a24 and 5fc698c.

📒 Files selected for processing (1)
  • src/providers/epub.ts

Walkthrough

The EPUB OPF manifest parsing logic was refactored to robustly extract <item> elements by matching the entire tag and separately extracting id and href attributes via case-insensitive regexes, accommodating any attribute ordering. The population of manifestItems is now conditional on both attributes being present.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| EPUB Manifest Parsing: src/providers/epub.ts | Refactored <item> element extraction from single sequential regex to separate case-insensitive regexes for id and href, enabling attribute order independence and improved robustness. Conditional population based on presence of both attributes. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐰 With whiskers twitched and nose held high,
The EPUB's items now fly,
No order matters, left or right,
Our parsing's now more sturdy and bright! ✨

@ojspace ojspace deleted the fix/epub-manifest-attr-order branch March 21, 2026 15:59