fix(docparser): preserve MinerU markdown and persist relative images #1404
Merged
MinerU already returns markdown with embedded HTML blocks, but the current reader runs the whole document back through html-to-markdown. That second conversion escapes valid headings and image syntax, so chunk profiling sees plain text instead of markdown structure and relative image references stop matching the storage pipeline.

Keep MinerU output in its original markdown form and only apply narrow compatibility normalization for the specific over-escaped patterns we actually need to recover. The converter now matches image refs by the paths that are really present in markdown or embedded HTML instead of assuming a single images/<name> form.

Extend ImageResolver so relative HTML <img src=...> references share the same storage rewrite path as markdown images, deduplicate repeated saves, and keep the frontend sanitizer compatible with MinerU's details/summary blocks. Add focused docparser tests that cover escaped markdown repair, variant image path matching, and relative HTML image persistence.
Pull request overview
This PR fixes MinerU document ingestion by preserving MinerU’s original Markdown (including embedded HTML blocks) and improving relative image persistence so chunking and previews retain structure and images correctly.
Changes:
- Stop round-tripping MinerU `md_content` through `html-to-markdown`; instead apply narrow normalization to repair specific over-escaped patterns.
- Improve MinerU image reference matching to support multiple relative-path variants and HTML `<img src="...">` references, and extend `ImageResolver` to persist/replace relative HTML image sources.
- Add focused tests for markdown normalization and relative HTML image persistence; relax frontend DOMPurify allowlist for `details`/`summary` and `open`.
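
To make the "narrow normalization" idea concrete, here is a minimal sketch of repairing over-escaped headings and image syntax. The function name `repairEscapedMarkdown` and both regular expressions are assumptions for illustration; the PR's actual `normalizeMinerUMarkdown` patterns are not shown in this thread and may differ.

```go
package main

import "regexp"

// Hypothetical patterns, not the ones used in the PR.
var (
	// A line that starts with a backslash-escaped ATX heading marker, e.g. `\## Title`.
	escapedHeading = regexp.MustCompile(`(?m)^\\(#{1,6}[ \t])`)
	// An image whose brackets were backslash-escaped, e.g. `\!\[alt\]\(images/a.png\)`.
	// The escaped `\[` and `\]` are required, so already-valid images are left alone.
	escapedImage = regexp.MustCompile(`\\?!\\\[([^\]\\]*)\\\]\\?\(([^)\\]*)\\?\)`)
)

// repairEscapedMarkdown undoes only the two escape patterns above and
// leaves every other byte of the document untouched.
func repairEscapedMarkdown(md string) string {
	md = escapedHeading.ReplaceAllString(md, "$1")
	md = escapedImage.ReplaceAllString(md, "![$1]($2)")
	return md
}
```

This sketch does not skip fenced code blocks, so it would also rewrite escaped text inside them. That is one reason to apply such a helper only to MinerU output rather than to every parser's content.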
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| internal/infrastructure/docparser/mineru_converter.go | Preserve MinerU markdown, normalize only specific escaped patterns, and broaden image ref matching. |
| internal/infrastructure/docparser/mineru_converter_test.go | Add tests for MinerU markdown normalization and image-ref variant detection. |
| internal/infrastructure/docparser/image_resolver.go | Normalize markdown before resolving, dedupe saves, and add relative HTML <img> src persistence/rewrite. |
| internal/infrastructure/docparser/image_resolver_relative_html_test.go | Add test covering relative HTML <img> persistence and rewrite. |
| frontend/src/utils/security.ts | Allow MinerU details/summary and open attribute through DOMPurify sanitizer config. |
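
As a rough illustration of the relative HTML `<img>` rewrite described for `image_resolver.go`, the sketch below replaces relative `src` values with stored URLs. Everything here is hypothetical: the names `rewriteRelativeImgSrc` and `isRelativeRef`, the regular expression, and the `stored` map (relative path to serving URL) are assumptions, not the PR's implementation.

```go
package main

import (
	"regexp"
	"strings"
)

// htmlImgSrc captures the tag text up to the opening quote, the src value,
// and the closing quote. Hypothetical pattern, not taken from the PR.
var htmlImgSrc = regexp.MustCompile(`(?i)(<img\b[^>]*?\ssrc\s*=\s*["'])([^"']+)(["'])`)

// isRelativeRef reports whether src looks like a document-relative path
// rather than an absolute URL, a data URI, or a root-relative path.
func isRelativeRef(src string) bool {
	lower := strings.ToLower(src)
	for _, prefix := range []string{"http://", "https://", "data:", "/"} {
		if strings.HasPrefix(lower, prefix) {
			return false
		}
	}
	return true
}

// rewriteRelativeImgSrc swaps each relative src that has an entry in stored
// for its serving URL; all other tags are returned unchanged.
func rewriteRelativeImgSrc(content string, stored map[string]string) string {
	return htmlImgSrc.ReplaceAllStringFunc(content, func(tag string) string {
		m := htmlImgSrc.FindStringSubmatch(tag)
		url, ok := stored[m[2]]
		if !ok || !isRelativeRef(m[2]) {
			return tag
		}
		return m[1] + url + m[3]
	})
}
```

A regular expression is enough for a sketch, but it does not handle unquoted attribute values or mismatched quotes; an HTML tokenizer would be the more robust choice.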
          tenantID uint64,
      ) (updatedMarkdown string, images []StoredImage, err error) {
    -     markdown := UnwrapLinkedImages(result.MarkdownContent)
    +     markdown := UnwrapLinkedImages(normalizeMinerUMarkdown(result.MarkdownContent))
Comment on lines +170 to +172

    if !ref.IsOriginal && isIconImage(ref.ImageData) {
        return StoredImage{}, false
    }
Comment on lines +328 to +332

    for _, match := range imgMarkdownPattern.FindAllStringSubmatch(content, -1) {
        if len(match) >= 3 {
            refs = append(refs, match[2])
        }
    }
Comment on lines 202 to 206

      for ipath, b64Str := range imagesB64 {
    -     originalRef := "images/" + ipath
    -     if !strings.Contains(mdContent, originalRef) {
    +     matchedRefs := mineruImageOriginalRefs(mdContent, ipath)
    +     if len(matchedRefs) == 0 {
              continue
          }
lyingbug added a commit to lyingbug/WeKnora that referenced this pull request on May 21, 2026

Three follow-up fixes on top of the MinerU markdown preservation work:
- Stop applying normalizeMinerUMarkdown inside ResolveAndStore. The helper is already called by MinerUReader.Read, and ResolveAndStore is shared by every parser (docreader, session attachments, ...). Running the heading/image unescape regexes globally would silently rewrite content (including inside fenced code blocks) for non-MinerU sources.
- Recognize MinerU image references whose path contains spaces, e.g. "images/第 1 页.jpg". The previous regex used in extractImageRefsFromContent disallowed whitespace in the URL group, so such images were never matched and never persisted. Use a whitespace-tolerant pattern aligned with ResolveAndStore's own imgPattern.
- Deduplicate uploads when the same MinerU image is referenced under multiple path forms (e.g. "images/foo.png" vs "./images/foo.png"). saveReferencedImage now caches by ref.Filename in addition to the raw ref path, so the second variant reuses the previously stored ServingURL instead of writing the same bytes to object storage again.

Tests added:
- TestProcessImagesMatchesPathsWithSpaces
- TestResolveAndStoreDedupsSameImageRefVariants
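
The deduplication described in the third fix can be sketched as a two-level cache: one keyed by the raw reference, one keyed by filename. The types and names below (`uploadFunc`, `dedupSaver`, `save`) are hypothetical, and deriving the filename with `path.Base` is an assumption; the real `saveReferencedImage` and `ref.Filename` are not shown in this thread.

```go
package main

import "path"

// uploadFunc stands in for the object-storage write and returns a serving URL.
type uploadFunc func(filename string, data []byte) (string, error)

// dedupSaver remembers serving URLs by raw reference and by filename.
type dedupSaver struct {
	upload     uploadFunc
	byRef      map[string]string
	byFilename map[string]string
}

func newDedupSaver(upload uploadFunc) *dedupSaver {
	return &dedupSaver{
		upload:     upload,
		byRef:      make(map[string]string),
		byFilename: make(map[string]string),
	}
}

// save uploads data at most once per filename; later references to the same
// file under another path form reuse the cached serving URL.
func (s *dedupSaver) save(ref string, data []byte) (string, error) {
	if url, ok := s.byRef[ref]; ok {
		return url, nil
	}
	filename := path.Base(ref)
	if url, ok := s.byFilename[filename]; ok {
		s.byRef[ref] = url
		return url, nil
	}
	url, err := s.upload(filename, data)
	if err != nil {
		return "", err
	}
	s.byRef[ref] = url
	s.byFilename[filename] = url
	return url, nil
}
```

Keying on the base filename alone would conflate two different images that share a name in different directories, so this only suits inputs where filenames are unique per document.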
lyingbug added a commit that referenced this pull request on May 21, 2026, with the same commit message as the commit above.
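Description
Fixes broken chunk previews and image persistence after importing MinerU documents. The main changes (the leading text of each item was lost in extraction):
- … a second `html-to-markdown` conversion of `md_content`
- … `<img>` image references are written to object storage and displayed correctly in the preview
- … `details`/`summary` tags

Type of Change

Related Issue
Fixes #1393

Testing
Ran:
`go test ./internal/infrastructure/docparser`
Verification points include:
- … `<img src="...">` can be stored and have its reference replaced
- … `details`/`summary`

Checklist
- `make fmt && make lint && make test` pass locally
- … `docs/`, Swagger annotations, etc.)

Screenshots / Recordings
None. This change is mainly a backend parsing and image persistence fix, with no new UI interaction changes.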