add extractor: resiliparse trafilatura magic-html #3

pekopoke · 2025-07-30T10:11:42Z

No description provided.

e06084 · 2025-07-30T11:31:46Z

examples/magic_html_extract_demo.py

+from magic_html import GeneralExtractor
+
+# Initialize the extractor
+extractor = GeneralExtractor()


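Are these examples missing the extractor initialization step?

The examples here should mainly demonstrate how to use the newly defined extractor, not the extraction method of the original installed package.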
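To make the distinction concrete, here is a minimal, self-contained sketch of what a demo built around a wrapped extractor could look like. The names `BaseExtractor`, `ExtractionResult`, and `extractor` are taken from the imports in this PR, but their shapes below (the decorator signature, the `_setup` and `_extract_content` hooks, the `create` helper, and the toy `TagStripperExtractor`) are assumptions made for illustration, not the actual webmainbench code. A real wrapper would presumably construct something like `GeneralExtractor()` inside its setup hook.

```python
import re
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional, Type

# Hypothetical registry standing in for the project's factory module.
_REGISTRY: Dict[str, Type["BaseExtractor"]] = {}


def extractor(name: str) -> Callable[[type], type]:
    """Register an extractor class under a name (assumed decorator shape)."""
    def decorate(cls: type) -> type:
        _REGISTRY[name] = cls
        return cls
    return decorate


@dataclass
class ExtractionResult:
    """Assumed result shape: extracted text plus a success flag."""
    content: str
    success: bool = True
    error: Optional[str] = None


class BaseExtractor:
    """Assumed base class: owns initialization and error handling."""

    def __init__(self, config: Optional[Dict[str, Any]] = None) -> None:
        self.config = config or {}
        self._setup()  # backend initialization happens here, once

    def _setup(self) -> None:
        pass

    def extract(self, html: str, url: Optional[str] = None) -> ExtractionResult:
        try:
            return ExtractionResult(content=self._extract_content(html, url))
        except Exception as exc:  # report failures instead of raising
            return ExtractionResult(content="", success=False, error=str(exc))

    def _extract_content(self, html: str, url: Optional[str]) -> str:
        raise NotImplementedError


@extractor("tag-stripper")
class TagStripperExtractor(BaseExtractor):
    """Toy backend used only so the sketch runs without extra packages."""

    def _setup(self) -> None:
        self._tag_re = re.compile(r"<[^>]+>")

    def _extract_content(self, html: str, url: Optional[str]) -> str:
        return " ".join(self._tag_re.sub(" ", html).split())


def create(name: str, config: Optional[Dict[str, Any]] = None) -> BaseExtractor:
    """Look up a registered extractor by name and instantiate it."""
    return _REGISTRY[name](config)


# The demo goes through the wrapper, not the underlying package.
demo = create("tag-stripper")
result = demo.extract("<html><body><h1>Title</h1><p>Hello world</p></body></html>")
print(result.success, result.content)
```

The point of the sketch is that the example script only touches the registered wrapper and its result object, so the initialization of the underlying library stays inside the wrapper class.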

e06084 · 2025-07-30T11:33:16Z

webmainbench/extractors/magic_html_extractor.py

+from typing import Dict, Any, Optional
+from .base import BaseExtractor, ExtractionResult
+from .factory import extractor
+from magic_html import GeneralExtractor


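Let's put together a requirement.txt file; it looks like a lot of dependencies need to be installed.

Please also add unit test cases under the tests directory for each extractor.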
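As a rough idea of what a per-extractor unit test could look like, here is a sketch using the standard `unittest` module. `StubExtractor` and its dictionary return value are hypothetical stand-ins so that the example runs without resiliparse, trafilatura, or magic-html installed; a real test would import the corresponding wrapper from the package instead.

```python
import re
import unittest


class StubExtractor:
    """Hypothetical stand-in for one of the wrapped extractors."""

    def extract(self, html: str) -> dict:
        # Treat blank input as a failed extraction.
        if not html.strip():
            return {"success": False, "content": ""}
        text = " ".join(re.sub(r"<[^>]+>", " ", html).split())
        return {"success": True, "content": text}


class TestStubExtractor(unittest.TestCase):
    def setUp(self) -> None:
        self.extractor = StubExtractor()

    def test_extracts_main_text(self) -> None:
        result = self.extractor.extract("<article><p>Main text</p></article>")
        self.assertTrue(result["success"])
        self.assertEqual(result["content"], "Main text")

    def test_empty_input_is_reported_as_failure(self) -> None:
        result = self.extractor.extract("   ")
        self.assertFalse(result["success"])
        self.assertEqual(result["content"], "")


# Run the suite programmatically so the sketch works as a plain script.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestStubExtractor)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```

One test file per extractor following this pattern would cover at least a normal page and an empty input.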

add extractor: resiliparse trafilatura magic-html

5492816

e06084 reviewed Jul 30, 2025

pekopoke added 16 commits July 31, 2025 14:29

add requirements and extractor demo : resiliparse trafilatura magic-html

f3d8b82

add requirements and extractor demo : resiliparse trafilatura magic-html

d0b827c

add requirements and extractor demo : resiliparse trafilatura magic-html

a03e980

fix test.yml / add requirements and extractor demo : resiliparse traf…

8731592

…ilatura magic-html

fix test.yml / add requirements and extractor demo : resiliparse traf…

f3002e4

…ilatura magic-html

fix test.yml / add requirements and extractor demo : resiliparse traf…

78157f0

…ilatura magic-html

update basic_usage of extractor

40ba4ca

update basic_usage of extractor

7f59780

update basic_usage of extractor and fix text_edit metric

c21b3d9

update basic_usage of extractor and fix text_edit metric

2d1bd8b

update basic_usage of extractor and fix text_edit metric

b7994f9

delete requirements

38f0430

delete requirements

5fe92cf

update text edit

efa7b02

update text edit

19b47ca

update text edit

9b63678

e06084 merged commit 07b095c into opendatalab:main Aug 4, 2025
1 check passed
