fix: add OG images for Twitter/social card previews #4
Merged
- Created OG images (1200x630) for benchmark blog post and default fallback
- Added og:image + twitter:image meta tags to Hugo base template
- Per-post og_image frontmatter param with fallback to default

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
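The per-post fallback described above can be illustrated outside Hugo. The sketch below is Python, not the Hugo template from this PR, and only mirrors the stated behavior: use the post's og_image when set, otherwise a default. The function name, the default path `/images/og-default.png`, and the example URLs are hypothetical.

```python
from html import escape

# Image dimensions stated in the PR description.
OG_WIDTH, OG_HEIGHT = 1200, 630


def social_image_tags(frontmatter: dict, base_url: str,
                      default_image: str = "/images/og-default.png") -> list[str]:
    """Build social-card meta tags, preferring a per-post og_image.

    `default_image` is a hypothetical path, not the one used in the repo.
    """
    # Fall back to the site-wide default when the post sets no og_image.
    path = frontmatter.get("og_image") or default_image
    # Social crawlers generally expect an absolute URL, so join with the base URL.
    url = escape(base_url.rstrip("/") + "/" + path.lstrip("/"), quote=True)
    return [
        f'<meta property="og:image" content="{url}">',
        f'<meta property="og:image:width" content="{OG_WIDTH}">',
        f'<meta property="og:image:height" content="{OG_HEIGHT}">',
        f'<meta name="twitter:image" content="{url}">',
    ]
```

Calling `social_image_tags({}, "https://example.com")` would yield tags pointing at the default image, while a post with `og_image` set gets its own image in both the Open Graph and Twitter tags.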
NameetP pushed a commit that referenced this pull request on Mar 18, 2026
Font-size-based heading detection (headings.py, ~220 lines):
- Analyzes PyMuPDF font metadata to identify heading spans
- Maps distinct font sizes to h1/h2/h3 (relative to body size)
- Detects bold-at-same-size headings common in academic PDFs
- Promotes short bold-only lines to ### as fallback
- Early exit when pymupdf4llm already detected headings

Borderless table fallback (table_fallback.py, ~200 lines):
- Whitespace column detection for tables missed by find_tables()
- Validates: 3+ rows, 2+ columns, numeric column required
- Returns ExtractedTable objects matching existing type

Integration:
- fast.py: always opens fitz doc, injects headings per page
- audit.py: injects headings in multipass/standard quality path

Benchmark results (opendataloader-bench, 200 PDFs):
- Overall: 0.792 → 0.853 (+0.061)
- MHS: 0.500 → 0.740 (+0.240)
- NID: 0.911 → 0.911 (unchanged)
- TEDS: 0.704 → 0.704 (unchanged)
- Leaderboard: #6 → #4 (ahead of opendataloader local, mineru)

21 new tests, 246 total passing, zero new dependencies.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
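To make the heading-detection bullets above concrete, here is a minimal sketch of the idea, not the code in headings.py. It assumes each input dict stands for one text line made of a single span, carrying the `text`, `size`, and `flags` fields that PyMuPDF reports for spans in `page.get_text("dict")`. The function names and the 60-character cutoff are hypothetical.

```python
from collections import Counter

BOLD_FLAG = 1 << 4  # bold bit in PyMuPDF span flags


def heading_levels(spans: list[dict], max_levels: int = 3) -> dict[float, int]:
    """Map font sizes larger than the body size to heading levels 1..max_levels."""
    # Assumption: the size that carries the most characters is body text.
    weight = Counter()
    for s in spans:
        weight[round(s["size"], 1)] += len(s["text"])
    if not weight:
        return {}
    body = weight.most_common(1)[0][0]
    # Largest size becomes h1, next h2, and so on.
    larger = sorted((sz for sz in weight if sz > body), reverse=True)
    return {sz: lvl for lvl, sz in enumerate(larger[:max_levels], start=1)}


def to_markdown_line(span: dict, levels: dict[float, int], max_chars: int = 60) -> str:
    """Render one line as a Markdown heading or as plain text."""
    text = span["text"].strip()
    level = levels.get(round(span["size"], 1))
    # Fallback: a short bold line with no size-based level is promoted to ###.
    if level is None and span["flags"] & BOLD_FLAG and len(text) <= max_chars:
        level = 3
    return f"{'#' * level} {text}" if level else text
```

With an 18 pt title, a 14 pt section line, and 10 pt body text, the sketch maps 18 pt to `#` and 14 pt to `##`, and a short bold 10 pt line falls through to `###`. The early exit for pages where pymupdf4llm already emitted headings is left out here.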
NameetP pushed a commit that referenced this pull request on Mar 18, 2026
- Added benchmark leaderboard (opendataloader-bench, 200 PDFs)
- pdfmux #4 overall (0.853), #2 reading order (0.911)
- Heading detection in pipeline diagram and multi-pass description
- Updated project structure with headings.py, table_fallback.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
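For readers unfamiliar with the borderless table fallback named in these commits (table_fallback.py), the sketch below shows one way whitespace column detection with the stated validation rules could look. It is an illustration under assumptions, not the repository code. It takes plain text lines rather than PDF coordinates, treats runs of two or more spaces as column gaps, counts the header toward the 3-row minimum, and returns nested lists instead of ExtractedTable objects. All names are hypothetical.

```python
import re
from typing import Optional

_GAP = re.compile(r"\s{2,}")  # assumed column separator: two or more spaces
_NUMERIC = re.compile(r"^[+-]?\d[\d,]*(\.\d+)?%?$")


def detect_borderless_table(lines: list[str]) -> Optional[list[list[str]]]:
    """Return rows of cells if the lines look like a table, else None."""
    rows = [_GAP.split(ln.strip()) for ln in lines if ln.strip()]
    # Validation from the commit message: 3+ rows and 2+ columns.
    if len(rows) < 3:
        return None
    width = len(rows[0])
    if width < 2 or any(len(r) != width for r in rows):
        return None
    # Require at least one column whose cells below the first row are all numeric.
    body = rows[1:]
    has_numeric = any(
        all(_NUMERIC.match(r[c]) for r in body) for c in range(width)
    )
    return rows if has_numeric else None
```

The numeric-column requirement is what keeps ordinary multi-column prose from being mistaken for a table. A block of aligned words with no numeric column returns `None`.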
Summary
- No og:image or twitter:image meta tags existed
- Added og:image, og:image:width, og:image:height, twitter:image to Hugo base template
- Per-post og_image frontmatter param with automatic fallback

Test plan
- og:image and twitter:image tags in built HTML

🤖 Generated with Claude Code
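The test-plan item could be automated with a small check over the generated pages. The following is a hedged sketch using only Python's standard library. The helper names are hypothetical and it is not part of this PR.

```python
from html.parser import HTMLParser


class _MetaCollector(HTMLParser):
    """Collect <meta> tags keyed by their property or name attribute."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        # Open Graph tags use property=..., Twitter tags use name=...
        key = a.get("property") or a.get("name")
        if key and a.get("content"):
            self.meta[key] = a["content"]


def missing_social_tags(html: str, required=("og:image", "twitter:image")) -> list[str]:
    """Return the required meta keys that are absent or have empty content."""
    parser = _MetaCollector()
    parser.feed(html)
    return [k for k in required if k not in parser.meta]
```

Running `missing_social_tags` over each HTML file in the build output (Hugo writes to `public/` by default) and asserting an empty result would cover the check listed in the test plan.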