Commit 4baad03

add CEPF
1 parent 4ca71ec commit 4baad03

2 files changed: 8 additions and 0 deletions
README.md

Lines changed: 5 additions & 0 deletions
@@ -3,6 +3,11 @@ WebCollector is an open source web crawler framework based on Java. It provides
 some simple interfaces for crawling the Web; you can set up a
 multi-threaded web crawler in less than 5 minutes.

+
+In addition to a general crawler framework, WebCollector also integrates __CEPF__, a well-designed, state-of-the-art web content extraction algorithm proposed by Wu et al.:
++ Wu GQ, Hu J, Li L, Xu ZH, Liu PC, Hu XG, Wu XD. Online Web news extraction via tag path feature fusion. Ruan Jian Xue Bao/Journal of Software, 2016,27(3):714-735 (in Chinese). http://www.jos.org.cn/1000-9825/4868.htm
+
+
 ## HomePage
 [https://github.com/CrawlScript/WebCollector](https://github.com/CrawlScript/WebCollector)

README.zh-cn.md

Lines changed: 3 additions & 0 deletions
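@@ -4,6 +4,9 @@ WebCollector
 ### Crawler introduction
 WebCollector is a Java crawler framework (kernel) that requires no configuration and is easy to build on. It provides a concise API, so a powerful crawler can be implemented with only a small amount of code.

+In addition to the crawler framework, WebCollector also integrates CEPF, an automatic web content extraction algorithm proposed by Wu Gongqing et al. and one of the most advanced algorithms currently available:
++ Wu GQ, Hu J, Li L, Xu ZH, Liu PC, Hu XG, Wu XD. Online Web news extraction via tag path feature fusion. Ruan Jian Xue Bao/Journal of Software, 2016,27(3):714-735 (in Chinese). http://www.jos.org.cn/1000-9825/4868.htm
+
 ### Crawler kernel:
 WebCollector is committed to maintaining a stable, extensible crawler kernel that developers can build on flexibly. The kernel is highly extensible, and users can develop the crawler they need on top of it. Jsoup is integrated into the source code for precise web page parsing.
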