fix(docparser): preserve MinerU markdown and persist relative images by M1dnightSUN · Pull Request #1404 · Tencent/WeKnora

M1dnightSUN · 2026-05-20T05:31:51Z

MinerU already returns markdown with embedded HTML blocks, but the current reader runs the whole document back through html-to-markdown. That second conversion escapes valid headings and image syntax, so chunk profiling sees plain text instead of markdown structure and relative image references stop matching the storage pipeline.

Keep MinerU output in its original markdown form and only apply narrow compatibility normalization for the specific over-escaped patterns we actually need to recover. The converter now matches image refs by the paths that are really present in markdown or embedded HTML instead of assuming a single images/<name> form.

Extend ImageResolver so relative HTML <img src=...> references share the same storage rewrite path as markdown images, deduplicate repeated saves, and keep the frontend sanitizer compatible with MinerU's details/summary blocks. Add focused docparser tests that cover escaped markdown repair, variant image path matching, and relative HTML image persistence.
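The "narrow compatibility normalization" described above can be pictured roughly as follows. This is a minimal sketch, not the PR's code: the function name `repairEscapedMarkdown` and both regular expressions are assumptions made for illustration, since the actual patterns in `normalizeMinerUMarkdown` are not shown on this page.

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical patterns; the PR's real regexes are not visible here.
var (
	// A heading marker that was escaped at the start of a line, e.g. `\## Title`.
	escapedHeading = regexp.MustCompile(`(?m)^\\(#{1,6} )`)
	// An image whose brackets were escaped, e.g. `!\[alt\](images/a.png)`.
	escapedImage = regexp.MustCompile(`!\\\[([^\]\\]*)\\?\]\(([^)]+)\)`)
)

// repairEscapedMarkdown undoes only the two over-escaped forms above and
// leaves every other backslash in the document untouched.
func repairEscapedMarkdown(md string) string {
	md = escapedHeading.ReplaceAllString(md, "${1}")
	return escapedImage.ReplaceAllString(md, "![${1}](${2})")
}

func main() {
	in := "\\## Results\n!\\[fig\\](images/a.png)"
	fmt.Println(repairEscapedMarkdown(in))
}
```

Note that this sketch does not skip fenced code blocks, so it should only be applied to content known to come from MinerU.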

Description

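Fixes broken chunk previews and image persistence after importing MinerU documents. The main changes are:

- Preserve MinerU's original Markdown and stop running the whole md_content through a second html-to-markdown conversion
- Fix Markdown image and heading syntax that was corrupted by escaping
- Complete the detection, storage, and reference-replacement logic for images with relative paths
- Support writing HTML <img> image references to object storage so they display correctly in the preview
- Relax the frontend safe-rendering allowlist to permit the details / summary tags that MinerU legitimately emits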
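To illustrate the HTML <img> item in the list above, here is a minimal sketch of rewriting relative `src` values to stored URLs. The names `rewriteRelativeImgSrc`, `isRelative`, and the `store` callback are hypothetical and do not come from the PR; the real ImageResolver logic may differ, for example in how it handles single-quoted attributes.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Matches the src attribute of an <img> tag (double-quoted values only).
var htmlImgSrc = regexp.MustCompile(`(<img\b[^>]*?\bsrc=")([^"]+)(")`)

// isRelative reports whether src is neither an absolute URL nor a data URI.
func isRelative(src string) bool {
	lower := strings.ToLower(src)
	for _, prefix := range []string{"http://", "https://", "data:", "//"} {
		if strings.HasPrefix(lower, prefix) {
			return false
		}
	}
	return true
}

// rewriteRelativeImgSrc replaces each relative src with the URL returned by
// store. Tags whose image cannot be stored are left exactly as they were.
func rewriteRelativeImgSrc(html string, store func(src string) (string, bool)) string {
	return htmlImgSrc.ReplaceAllStringFunc(html, func(tag string) string {
		m := htmlImgSrc.FindStringSubmatch(tag)
		if m == nil || !isRelative(m[2]) {
			return tag
		}
		url, ok := store(m[2])
		if !ok {
			return tag
		}
		return m[1] + url + m[3]
	})
}

func main() {
	// Hypothetical storage lookup used only for this demonstration.
	store := func(src string) (string, bool) {
		if src == "images/a.png" {
			return "store://bucket/a.png", true
		}
		return "", false
	}
	fmt.Println(rewriteRelativeImgSrc(`<img src="images/a.png" alt="x">`, store))
}
```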

Type of Change

Related Issue

Fixes #1393

Testing

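Ran:

go test ./internal/infrastructure/docparser

Verified that:

- MinerU Markdown heading and image syntax is no longer broken by a second conversion
- Relative-path Markdown images are stored and their references replaced
- HTML <img src="..."> images are stored and their references replaced
- The frontend can safely render details / summary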

Checklist

make fmt && make lint && make test pass locally
Self-reviewed the code
Added/updated tests covering the change
Updated related documentation (README, docs/, Swagger annotations, etc.)
Breaking changes are clearly called out in the description above

Screenshots / Recordings

None. This change is mainly a backend parsing and image-persistence fix, with no new UI interaction changes.


Copilot

Pull request overview

This PR fixes MinerU document ingestion by preserving MinerU’s original Markdown (including embedded HTML blocks) and improving relative image persistence so chunking and previews retain structure and images correctly.

Changes:

Stop round-tripping MinerU md_content through html-to-markdown; instead apply narrow normalization to repair specific over-escaped patterns.
Improve MinerU image reference matching to support multiple relative-path variants and HTML <img src="..."> references, and extend ImageResolver to persist/replace relative HTML image sources.
Add focused tests for markdown normalization and relative HTML image persistence; relax the frontend DOMPurify allowlist to permit the details/summary tags and the open attribute.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.


| File | Description |
| --- | --- |
| internal/infrastructure/docparser/mineru_converter.go | Preserve MinerU markdown, normalize only specific escaped patterns, and broaden image ref matching. |
| internal/infrastructure/docparser/mineru_converter_test.go | Add tests for MinerU markdown normalization and image-ref variant detection. |
| internal/infrastructure/docparser/image_resolver.go | Normalize markdown before resolving, dedupe saves, and add relative HTML `<img>` src persistence/rewrite. |
| internal/infrastructure/docparser/image_resolver_relative_html_test.go | Add test covering relative HTML `<img>` persistence and rewrite. |
| frontend/src/utils/security.ts | Allow MinerU `details/summary` and `open` attribute through DOMPurify sanitizer config. |


 	tenantID uint64,
 ) (updatedMarkdown string, images []StoredImage, err error) {
-	markdown := UnwrapLinkedImages(result.MarkdownContent)
+	markdown := UnwrapLinkedImages(normalizeMinerUMarkdown(result.MarkdownContent))


+	if !ref.IsOriginal && isIconImage(ref.ImageData) {
+		return StoredImage{}, false
+	}


+	for _, match := range imgMarkdownPattern.FindAllStringSubmatch(content, -1) {
+		if len(match) >= 3 {
+			refs = append(refs, match[2])
+		}
+	}


 	for ipath, b64Str := range imagesB64 {
-		originalRef := "images/" + ipath
-		if !strings.Contains(mdContent, originalRef) {
+		matchedRefs := mineruImageOriginalRefs(mdContent, ipath)
+		if len(matchedRefs) == 0 {
 			continue
 		}


Three follow-up fixes on top of the MinerU markdown preservation work:

- Stop applying normalizeMinerUMarkdown inside ResolveAndStore. The helper is already called by MinerUReader.Read, and ResolveAndStore is shared by every parser (docreader, session attachments, ...). Running the heading/image unescape regexes globally would silently rewrite content (including inside fenced code blocks) for non-MinerU sources.
- Recognize MinerU image references whose path contains spaces, e.g. "images/第 1 页.jpg". The previous regex used in extractImageRefsFromContent disallowed whitespace in the URL group, so such images were never matched and never persisted. Use a whitespace-tolerant pattern aligned with ResolveAndStore's own imgPattern.
- Deduplicate uploads when the same MinerU image is referenced under multiple path forms (e.g. "images/foo.png" vs "./images/foo.png"). saveReferencedImage now caches by ref.Filename in addition to the raw ref path, so the second variant reuses the previously stored ServingURL instead of writing the same bytes to object storage again.

Tests added:

- TestProcessImagesMatchesPathsWithSpaces
- TestResolveAndStoreDedupsSameImageRefVariants
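The deduplication fix can be sketched as a two-level cache: one keyed by the raw reference path and one keyed by the image's filename. The types and names below (`imageRef`, `imageSaver`, `newImageSaver`, the `upload` callback) are hypothetical stand-ins rather than the PR's actual `saveReferencedImage` implementation, which is not shown on this page.

```go
package main

import "fmt"

// imageRef is a hypothetical stand-in for a resolved image reference.
type imageRef struct {
	Path     string // reference as written in the document
	Filename string // canonical name of the underlying image
	Data     []byte
}

// imageSaver caches stored URLs by raw path and by filename.
type imageSaver struct {
	byPath     map[string]string
	byFilename map[string]string
	upload     func(name string, data []byte) (string, error)
}

func newImageSaver(upload func(string, []byte) (string, error)) *imageSaver {
	return &imageSaver{
		byPath:     map[string]string{},
		byFilename: map[string]string{},
		upload:     upload,
	}
}

// save uploads ref at most once per path and once per non-empty filename,
// returning the cached URL for any later variant of the same image.
func (s *imageSaver) save(ref imageRef) (string, error) {
	if url, ok := s.byPath[ref.Path]; ok {
		return url, nil
	}
	if ref.Filename != "" {
		if url, ok := s.byFilename[ref.Filename]; ok {
			s.byPath[ref.Path] = url
			return url, nil
		}
	}
	url, err := s.upload(ref.Filename, ref.Data)
	if err != nil {
		return "", err
	}
	s.byPath[ref.Path] = url
	if ref.Filename != "" {
		s.byFilename[ref.Filename] = url
	}
	return url, nil
}

func main() {
	s := newImageSaver(func(name string, _ []byte) (string, error) {
		fmt.Println("uploading", name)
		return "store://" + name, nil
	})
	s.save(imageRef{Path: "images/foo.png", Filename: "foo.png"})
	s.save(imageRef{Path: "./images/foo.png", Filename: "foo.png"})
}
```

Keeping the path cache alongside the filename cache means repeated occurrences of the same reference never reach the filename lookup, and references without a filename still deduplicate by path.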

Copilot AI review requested due to automatic review settings May 20, 2026 05:31

Copilot started reviewing on behalf of M1dnightSUN May 20, 2026 05:32

Copilot AI reviewed May 20, 2026


lyingbug mentioned this pull request May 21, 2026

fix(docparser): address review feedback on #1404 (MinerU markdown & image persistence) #1420

Merged

3 tasks

lyingbug merged commit 6210f44 into Tencent:main May 21, 2026
4 of 5 checks passed

lyingbug merged 1 commit into Tencent:main from M1dnightSUN:fix/mineru-preview-images
