fix(docparser): preserve MinerU markdown and persist relative images#1404

Merged
lyingbug merged 1 commit into Tencent:main from M1dnightSUN:fix/mineru-preview-images
May 21, 2026

Conversation

@M1dnightSUN
Contributor

MinerU already returns markdown with embedded HTML blocks, but the current
reader runs the whole document back through html-to-markdown. That
second conversion escapes valid headings and image syntax, so chunk
profiling sees plain text instead of markdown structure and relative
image references stop matching the storage pipeline.

Keep MinerU output in its original markdown form and only apply narrow
compatibility normalization for the specific over-escaped patterns we
actually need to recover. The converter now matches image refs by the
paths that are really present in markdown or embedded HTML instead of
assuming a single images/<name> form.

Extend ImageResolver so relative HTML <img src=...> references share the
same storage rewrite path as markdown images, deduplicate repeated saves,
and keep the frontend sanitizer compatible with MinerU's details/summary
blocks. Add focused docparser tests that cover escaped markdown repair,
variant image path matching, and relative HTML image persistence.
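To make the "narrow compatibility normalization" concrete, here is a minimal sketch of what un-escaping over-escaped headings and images could look like. The function name `repairEscapedMarkdown` and both regular expressions are assumptions made for illustration; the actual `normalizeMinerUMarkdown` in this PR may target different patterns.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Assumed pattern: a backslash placed before a line-leading run of 1-6 '#'.
	escapedHeading = regexp.MustCompile(`(?m)^\\(#{1,6}[ \t])`)
	// Assumed pattern: an image whose brackets were escaped, e.g. !\[alt\](path).
	escapedImage = regexp.MustCompile(`!\\\[([^\]\n]*?)\\\]\(`)
)

// repairEscapedMarkdown is a hypothetical stand-in for normalizeMinerUMarkdown.
// It only undoes the two escape forms above and leaves everything else alone.
func repairEscapedMarkdown(md string) string {
	md = escapedHeading.ReplaceAllString(md, "${1}")
	return escapedImage.ReplaceAllString(md, "![${1}](")
}

func main() {
	in := "\\## Results\n!\\[fig 1\\](images/a.png)\nplain \\# text\n"
	fmt.Print(repairEscapedMarkdown(in))
}
```

A regex pass like this does not know about fenced code blocks, so it is only safe on content that is known to come from MinerU.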

Description

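Fixes broken chunk previews and image persistence after importing MinerU documents. The main changes:

  • Preserve the original MinerU Markdown; the whole md_content is no longer run through a second html-to-markdown conversion
  • Fix Markdown image and heading syntax that was corrupted by escaping
  • Complete the detection, storage, and reference-rewrite logic for relative-path images
  • Support writing HTML <img> image references to object storage so they display correctly in the preview
  • Relax the frontend safe-rendering allowlist to permit the details / summary tags that MinerU legitimately emits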

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📚 Documentation update
  • 🎨 Refactor
  • ⚡ Performance improvement
  • 🧪 Test
  • 🔧 Configuration / Build / CI

Related Issue

Fixes #1393

Testing

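Ran:

go test ./internal/infrastructure/docparser

Verified:

  • MinerU Markdown heading and image syntax is no longer corrupted by a second conversion
  • Relative-path Markdown images are stored and their references rewritten
  • HTML <img src="..."> images are stored and their references rewritten
  • The frontend can safely render details / summary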

Checklist

  • make fmt && make lint && make test pass locally
  • Self-reviewed the code
  • Added/updated tests covering the change
  • Updated related documentation (README, docs/, Swagger annotations, etc.)
  • Breaking changes are clearly called out in the description above

Screenshots / Recordings

None. This change mainly fixes backend parsing and image persistence; it adds no new UI interactions.

Copilot AI review requested due to automatic review settings May 20, 2026 05:31

Copilot AI left a comment


Pull request overview

This PR fixes MinerU document ingestion by preserving MinerU’s original Markdown (including embedded HTML blocks) and improving relative image persistence so chunking and previews retain structure and images correctly.

Changes:

  • Stop round-tripping MinerU md_content through html-to-markdown; instead apply narrow normalization to repair specific over-escaped patterns.
  • Improve MinerU image reference matching to support multiple relative-path variants and HTML <img src="..."> references, and extend ImageResolver to persist/replace relative HTML image sources.
  • Add focused tests for markdown normalization and relative HTML image persistence; relax frontend DOMPurify allowlist for details/summary and open.
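The relative HTML image rewrite described above can be sketched roughly as follows. This is not the PR's `ImageResolver` code: the regex, the `rewriteRelativeImgSrc` helper, and the `stored` map (relative ref to serving URL) are all hypothetical, and a real implementation would upload the bytes rather than look them up.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Assumed pattern: captures the tag prefix up to src= and the quoted value.
var imgSrc = regexp.MustCompile(`(?i)(<img\b[^>]*?\ssrc\s*=\s*)(?:"([^"]*)"|'([^']*)')`)

// isRelativeRef reports whether src is neither absolute, remote, nor a data URI.
func isRelativeRef(src string) bool {
	lower := strings.ToLower(src)
	return src != "" &&
		!strings.HasPrefix(src, "/") &&
		!strings.HasPrefix(lower, "http://") &&
		!strings.HasPrefix(lower, "https://") &&
		!strings.HasPrefix(lower, "data:")
}

// rewriteRelativeImgSrc replaces relative <img> sources that have a stored
// counterpart; every other tag is returned unchanged.
func rewriteRelativeImgSrc(html string, stored map[string]string) string {
	return imgSrc.ReplaceAllStringFunc(html, func(tag string) string {
		m := imgSrc.FindStringSubmatch(tag)
		src, quote := m[2], `"`
		if m[3] != "" {
			src, quote = m[3], "'"
		}
		url, ok := stored[src]
		if !isRelativeRef(src) || !ok {
			return tag
		}
		return m[1] + quote + url + quote
	})
}

func main() {
	stored := map[string]string{"images/a.png": "https://storage.example/a.png"}
	fmt.Println(rewriteRelativeImgSrc(`<img alt="x" src="images/a.png" width="10">`, stored))
}
```

Only the `src` value is touched, so other attributes on the tag survive the rewrite.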

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| internal/infrastructure/docparser/mineru_converter.go | Preserve MinerU markdown, normalize only specific escaped patterns, and broaden image ref matching. |
| internal/infrastructure/docparser/mineru_converter_test.go | Add tests for MinerU markdown normalization and image-ref variant detection. |
| internal/infrastructure/docparser/image_resolver.go | Normalize markdown before resolving, dedupe saves, and add relative HTML <img> src persistence/rewrite. |
| internal/infrastructure/docparser/image_resolver_relative_html_test.go | Add test covering relative HTML <img> persistence and rewrite. |
| frontend/src/utils/security.ts | Allow MinerU details/summary and open attribute through DOMPurify sanitizer config. |


 	tenantID uint64,
 ) (updatedMarkdown string, images []StoredImage, err error) {
-	markdown := UnwrapLinkedImages(result.MarkdownContent)
+	markdown := UnwrapLinkedImages(normalizeMinerUMarkdown(result.MarkdownContent))
Comment on lines +170 to +172
if !ref.IsOriginal && isIconImage(ref.ImageData) {
	return StoredImage{}, false
}
Comment on lines +328 to +332
for _, match := range imgMarkdownPattern.FindAllStringSubmatch(content, -1) {
	if len(match) >= 3 {
		refs = append(refs, match[2])
	}
}
Comment on lines 202 to 206
 for ipath, b64Str := range imagesB64 {
-	originalRef := "images/" + ipath
-	if !strings.Contains(mdContent, originalRef) {
+	matchedRefs := mineruImageOriginalRefs(mdContent, ipath)
+	if len(matchedRefs) == 0 {
 		continue
 	}
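The hunk above swaps a fixed `"images/" + ipath` substring check for a lookup of the references that actually occur in the document. A rough sketch of that idea follows. The body is an assumption, not the PR's `mineruImageOriginalRefs`: the helper name `matchingRefs`, both patterns, and the use of `path.Clean` to compare path variants are all illustrative.

```go
package main

import (
	"fmt"
	"path"
	"regexp"
	"strings"
)

var (
	// Assumed patterns for markdown images and HTML <img> tags.
	mdImage   = regexp.MustCompile(`!\[[^\]]*\]\(([^)\n]+)\)`)
	htmlImage = regexp.MustCompile(`(?i)<img\b[^>]*?\ssrc\s*=\s*["']([^"']+)["']`)
)

// matchingRefs returns, in document order per syntax, every distinct raw
// reference in content that points at key, either directly or under images/.
func matchingRefs(content, key string) []string {
	want := map[string]bool{
		path.Clean(key):             true,
		path.Clean("images/" + key): true,
	}
	seen := map[string]bool{}
	var refs []string
	collect := func(ref string) {
		ref = strings.TrimSpace(ref)
		// path.Clean folds "./images/a.png" and "images/a.png" together.
		if want[path.Clean(ref)] && !seen[ref] {
			seen[ref] = true
			refs = append(refs, ref)
		}
	}
	for _, m := range mdImage.FindAllStringSubmatch(content, -1) {
		collect(m[1])
	}
	for _, m := range htmlImage.FindAllStringSubmatch(content, -1) {
		collect(m[1])
	}
	return refs
}

func main() {
	content := "![a](images/a.png)\n<img src=\"./images/a.png\">\n![b](images/b.png)"
	fmt.Println(matchingRefs(content, "a.png"))
}
```

Returning the raw references rather than a boolean lets the caller replace each spelling exactly as it appears in the document.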
@lyingbug lyingbug merged commit 6210f44 into Tencent:main May 21, 2026
4 of 5 checks passed
lyingbug added a commit to lyingbug/WeKnora that referenced this pull request May 21, 2026
Three follow-up fixes on top of the MinerU markdown preservation work:

- Stop applying normalizeMinerUMarkdown inside ResolveAndStore. The
  helper is already called by MinerUReader.Read, and ResolveAndStore is
  shared by every parser (docreader, session attachments, ...). Running
  the heading/image unescape regexes globally would silently rewrite
  content (including inside fenced code blocks) for non-MinerU sources.

- Recognize MinerU image references whose path contains spaces, e.g.
  "images/第 1 页.jpg". The previous regex used in
  extractImageRefsFromContent disallowed whitespace in the URL group,
  so such images were never matched and never persisted. Use a
  whitespace-tolerant pattern aligned with ResolveAndStore's own
  imgPattern.

- Deduplicate uploads when the same MinerU image is referenced under
  multiple path forms (e.g. "images/foo.png" vs "./images/foo.png").
  saveReferencedImage now caches by ref.Filename in addition to the
  raw ref path, so the second variant reuses the previously stored
  ServingURL instead of writing the same bytes to object storage
  again.
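The whitespace issue in the second item can be illustrated with two patterns. Both are guesses at the shape of the real ones (the source only says the old URL group disallowed whitespace), and `extractImageRefs` is a hypothetical helper modeled on the loop quoted in the review.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// Assumed old form: the URL group stops at any whitespace.
	strictImage = regexp.MustCompile(`!\[([^\]]*)\]\(([^)\s]+)\)`)
	// Assumed new form: the URL group only stops at ')' or a newline.
	tolerantImage = regexp.MustCompile(`!\[([^\]]*)\]\(([^)\n]+)\)`)
)

// extractImageRefs collects the URL group (match[2]) of every image match.
func extractImageRefs(pattern *regexp.Regexp, content string) []string {
	var refs []string
	for _, match := range pattern.FindAllStringSubmatch(content, -1) {
		if len(match) >= 3 {
			refs = append(refs, strings.TrimSpace(match[2]))
		}
	}
	return refs
}

func main() {
	content := "![p1](images/第 1 页.jpg)\n![logo](images/logo.png)"
	fmt.Println(len(extractImageRefs(strictImage, content)))   // the spaced path is skipped
	fmt.Println(len(extractImageRefs(tolerantImage, content))) // both images are found
}
```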

Tests added:
- TestProcessImagesMatchesPathsWithSpaces
- TestResolveAndStoreDedupsSameImageRefVariants
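The deduplication described in the third item amounts to a two-level cache. The sketch below is a simplified model, not `saveReferencedImage` itself: the `imageStore` type, its fields, the placeholder URL, and the use of `path.Base` as the filename key are assumptions.

```go
package main

import (
	"fmt"
	"path"
)

// imageStore is a hypothetical stand-in for the resolver's storage state.
type imageStore struct {
	uploads int               // number of simulated object-storage writes
	byRef   map[string]string // raw reference -> serving URL
	byName  map[string]string // filename -> serving URL
}

func newImageStore() *imageStore {
	return &imageStore{byRef: map[string]string{}, byName: map[string]string{}}
}

// save returns the serving URL for ref, uploading only when neither the raw
// reference nor its filename has been stored before.
func (s *imageStore) save(ref string) string {
	if url, ok := s.byRef[ref]; ok {
		return url
	}
	name := path.Base(ref)
	if url, ok := s.byName[name]; ok {
		s.byRef[ref] = url // remember this spelling for next time
		return url
	}
	s.uploads++ // stands in for the real upload
	url := "https://storage.example/" + name
	s.byRef[ref], s.byName[name] = url, url
	return url
}

func main() {
	s := newImageStore()
	a := s.save("images/foo.png")
	b := s.save("./images/foo.png")
	fmt.Println(a == b, s.uploads)
}
```

Keying on the bare filename, as this sketch does, would merge two different images that share a name in different directories, so a real cache needs a key that is unique per image.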
lyingbug added a commit that referenced this pull request May 21, 2026
3 participants