Releases: opendatalab/WebMainBench
release v1.0.0
Includes 4 extractors and a benchmark of 545 data items
What's Changed
- fix bug: duplicated tables by @pekopoke in #42
- Optimize table edit distance calculation by using normalization by @pekopoke in #43
- add extractor version in results by @pekopoke in #44
- fix back to old formula match by @pekopoke in #45
- feat: add language and style classify by @e06084 in #46
- Use LLM to correct predicted formulas by @1041206149 in #47
- feat: refactor _extract_from_markdown with LLM-enhanced table/formula/code extraction by @1041206149 in #48
- Dev: add a method for trafilatura to output txt by @pekopoke in #50
- Move LLM API configuration into config.py by @1041206149 in #51
- fix: skip table and formula extraction inside inline and block code by @pekopoke in #52
New Contributors
- @1041206149 made their first contribution in #47
Full Changelog: v0.2.0...v1.0.0
v0.2.0
What's Changed
- feat: add multi extractor compare script by @e06084 in #34
- feat(metrics): implement comprehensive memoization for TEDS algorithm by @SHUzhangshuo in #35
- Main html by @darkrush in #37
- feat: add dataset statistics script by @e06084 in #38
- feat: text_edit metric use all text by @e06084 in #39
- Dev: optimize table splitting, remove inline code splitting, improve TEDS performance by @pekopoke in #40
- fix code match by @pekopoke in #41
New Contributors
Full Changelog: v0.0.1...v0.2.0
v0.1.0
What's Changed
- feat: add ci by @e06084 in #1
- feat: refine llm-webkit extractor by @e06084 in #2
- docs: update readme by @e06084 in #4
- feat: refine data saver by @e06084 in #5
- feat: commit results by @e06084 in #6
- feat: extractor no content_list by @e06084 in #7
- feat: update leaderboard by @e06084 in #8
- tests: update metrics test by @e06084 in #9
- feat: update text metrics calculate by @e06084 in #10
- feat: update summary info by @e06084 in #11
- feat: add save_dataset_with_extraction by @e06084 in #12
- feat: add evaluate_batched by @e06084 in #13
- add extractor: resiliparse trafilatura magic-html by @pekopoke in #3
- feat: update llm-webkit extract by @e06084 in #14
- Add test_model_extractor, an extractor that can be used directly for evaluation by @SHUzhangshuo in #15
- add three extractors by @pekopoke in #16
- feat: add llm_webkit_with_preprocessed_html by @e06084 in #17
- Dev:update metrics by @pekopoke in #19
- Modify the _extract_from_markdown method and run statistical tests based on the new method by @SHUzhangshuo in #21
- Revert "Modify the _extract_from_markdown method and run statistical tests based on the new method" by @e06084 in #22
- fix: llm-webkit extraction_success status capture by @e06084 in #23
- fix: llm_web_kit extractor by @e06084 in #24
- fix:extract from markdown by @SHUzhangshuo in #25
- Dev:add gt and pre of code formula table text in result jsonl by @pekopoke in #26
- fix: llm_web_kit extractor by @e06084 in #27
- Dev:fix match formula and code by @pekopoke in #28
- feat:Data Modification Tools by @SHUzhangshuo in #29
- update label tool dir by @e06084 in #30
- fix:label_tool bug by @SHUzhangshuo in #32
- update llm_webkit_extractor by @e06084 in #31
- fix: table TEDS bug by @SHUzhangshuo in #33
New Contributors
- @e06084 made their first contribution in #1
- @pekopoke made their first contribution in #3
- @SHUzhangshuo made their first contribution in #15
Full Changelog: https://github.com/ccprocessor/WebMainBench/commits/v0.0.1