Commit 038ad23

reformat
1 parent 5b0d720 commit 038ad23

2 files changed

+46
-2
lines changed

blogs/htmlcleaner/htmlcleaner.md

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
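Studying the htmlcleaner Code
---
Unlike Jsoup, htmlcleaner supports extraction with XPath, which is quite useful.

htmlcleaner is hosted on SourceForge at [http://htmlcleaner.sourceforge.net/](http://htmlcleaner.sourceforge.net/). For some reason, access to SourceForge is not very smooth, so I ended up using this fairly recent fork on GitHub: [https://github.com/amplafi/htmlcleaner](https://github.com/amplafi/htmlcleaner)

htmlcleaner's package structure still falls somewhat short of Jsoup's; at first I was put off by the classes all lined up flat in a single row.

htmlcleaner still has its own tree structure, whose node types derive from `HtmlNode`. It does, however, provide conversion to `org.w3c.dom.Document` and `org.jdom2.Document`.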
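
Once a page is available as an `org.w3c.dom.Document`, the JDK's own `javax.xml.xpath` API can query it. The sketch below does not call htmlcleaner at all: it assumes a small, well-formed XHTML string parsed with the JDK's `DocumentBuilder` as a stand-in for the document that htmlcleaner's conversion would hand over. The class name `DomXPathSketch`, its helper methods, and the sample markup are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomXPathSketch {

    // Hypothetical helper: evaluates an XPath expression and collects the text of every match.
    static List<String> selectText(Document doc, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getTextContent());
        }
        return result;
    }

    // Stand-in for a cleaned document (assumption): the input here must already be well-formed XML.
    static Document parseWellFormed(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><div class=\"post\"><a href=\"/a\">first</a></div>"
                + "<div><a href=\"/b\">second</a></div></body></html>";
        Document doc = parseWellFormed(xhtml);
        System.out.println(selectText(doc, "//div[@class='post']/a"));
        System.out.println(selectText(doc, "//a/@href"));
    }
}
```

The first query selects link text inside the `div` with class `post`; the second collects every `href` attribute value.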

`HtmlTokenizer` handles lexical analysis. It is stateful but does not use a state machine; instead it keeps its state in a handful of plain fields, for example:

    public class HtmlTokenizer {

        // Input source and working buffer
        private BufferedReader _reader;
        private char[] _working = new char[WORKING_BUFFER_SIZE];

        // Position bookkeeping: buffer offset and length, source row and column
        private transient int _pos;
        private transient int _len = -1;
        private transient int _row = 1;
        private transient int _col = 1;

        // Characters accumulated so far
        private transient StringBuffer _saved = new StringBuffer(512);

        // Parsing state held directly in fields rather than in a state machine
        private transient boolean _isLateForDoctype;
        private transient DoctypeToken _docType;
        private transient TagToken _currentTagToken;
        private transient List<BaseToken> _tokenList = new ArrayList<BaseToken>();
        private transient Set<String> _namespacePrefixes = new HashSet<String>();

        private boolean _asExpected = true;

        private boolean _isScriptContext;

        // ... methods omitted
    }

It has a strong flavor of procedural programming.
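
For contrast, here is a minimal sketch of the state-machine style that the paragraph above says `HtmlTokenizer` does not use: each state is an enum constant that consumes one character and returns the next state. This is a toy under stated assumptions, not code from htmlcleaner or Jsoup. The class `StateMachineTokenizerSketch`, its two states, and the simplification that a tag is everything between `<` and `>` are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class StateMachineTokenizerSketch {

    // Each state consumes one character and decides which state comes next.
    enum State {
        TEXT {
            @Override
            State read(char c, StateMachineTokenizerSketch t) {
                if (c == '<') {
                    t.emit();            // flush pending text before the tag starts
                    t.buffer.append(c);
                    return TAG;
                }
                t.buffer.append(c);
                return TEXT;
            }
        },
        TAG {
            @Override
            State read(char c, StateMachineTokenizerSketch t) {
                t.buffer.append(c);
                if (c == '>') {
                    t.emit();            // the tag is complete
                    return TEXT;
                }
                return TAG;
            }
        };

        abstract State read(char c, StateMachineTokenizerSketch t);
    }

    private final StringBuilder buffer = new StringBuilder();
    private final List<String> tokens = new ArrayList<>();
    private State state = State.TEXT;

    // Moves the buffered characters into the token list, skipping empty tokens.
    private void emit() {
        if (buffer.length() > 0) {
            tokens.add(buffer.toString());
            buffer.setLength(0);
        }
    }

    static List<String> tokenize(String html) {
        StateMachineTokenizerSketch t = new StateMachineTokenizerSketch();
        for (int i = 0; i < html.length(); i++) {
            t.state = t.state.read(html.charAt(i), t);
        }
        t.emit(); // flush trailing text, if any
        return t.tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("<p>hi</p>"));
    }
}
```

Here the transitions live in the states themselves, whereas the field-based style spreads the same decisions across flags such as `_isScriptContext`.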

After `Tokenize`, the tree is simply assembled with a stack.
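
The stack-based assembly can be sketched roughly as follows. This is an illustration of the general technique only, not htmlcleaner's actual algorithm. The `Node` type, the string token format (`<x>`, `</x>`, or plain text), and the policy of ignoring stray end tags are assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

class StackTreeBuilderSketch {

    // Hypothetical minimal node type; not htmlcleaner's own node classes.
    static class Node {
        final String name;       // tag name, or the raw text for a text node
        final boolean isText;
        final List<Node> children = new ArrayList<>();

        Node(String name, boolean isText) {
            this.name = name;
            this.isText = isText;
        }

        // Compact rendering such as div(p('hi')) for inspection.
        String render() {
            if (isText) return "'" + name + "'";
            StringBuilder sb = new StringBuilder(name).append('(');
            for (int i = 0; i < children.size(); i++) {
                if (i > 0) sb.append(',');
                sb.append(children.get(i).render());
            }
            return sb.append(')').toString();
        }
    }

    static Node build(List<String> tokens) {
        Node root = new Node("#root", false);
        Deque<Node> open = new ArrayDeque<>();   // currently open elements, innermost first
        for (String token : tokens) {
            Node parent = open.isEmpty() ? root : open.peek();
            if (token.startsWith("</")) {
                String name = token.substring(2, token.length() - 1);
                if (isOpen(open, name)) {
                    // Pop up to and including the matching element; unclosed children close implicitly.
                    while (!open.pop().name.equals(name)) { }
                }
                // A stray end tag with no matching open element is ignored.
            } else if (token.startsWith("<")) {
                Node element = new Node(token.substring(1, token.length() - 1), false);
                parent.children.add(element);
                open.push(element);
            } else {
                parent.children.add(new Node(token, true));
            }
        }
        return root;
    }

    private static boolean isOpen(Deque<Node> open, String name) {
        for (Node n : open) {
            if (n.name.equals(name)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("<div>", "<p>", "hi", "</div>", "</span>", "<b>", "</b>");
        System.out.println(build(tokens).render());
    }
}
```

In this run the unclosed `<p>` is closed implicitly when `</div>` arrives, and the stray `</span>` is dropped.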

I ran a quick test: for a 44k document, parsing takes 3.5 ms with Jsoup and 7.9 ms with htmlcleaner, so htmlcleaner is roughly twice as slow.
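
The original measurement setup is not shown. A comparison of this kind could be sketched with a small timing harness like the one below, which averages many runs after a warm-up phase. Everything in it is an assumption for illustration: the class `ParseTimingSketch`, the warm-up and run counts, and the stand-in workload. With both libraries on the classpath, the stand-in could be swapped for `Jsoup.parse(html)` and `new HtmlCleaner().clean(html)`.

```java
import java.util.function.Function;

class ParseTimingSketch {

    // Returns the mean wall-clock time per call in milliseconds, measured after a warm-up phase.
    static double meanMillis(Function<String, ?> parser, String html, int warmups, int runs) {
        Object sink = null;   // holds the last result so the calls are not obviously dead code
        for (int i = 0; i < warmups; i++) {
            sink = parser.apply(html);
        }
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            sink = parser.apply(html);
        }
        long elapsed = System.nanoTime() - start;
        if (sink == null) {
            throw new IllegalStateException("parser returned null");
        }
        return elapsed / 1_000_000.0 / runs;
    }

    public static void main(String[] args) {
        // Build a synthetic document of roughly 44 KB (the real test document is not available).
        StringBuilder sb = new StringBuilder();
        while (sb.length() < 44 * 1024) {
            sb.append("<div><p>hello</p></div>");
        }
        String html = sb.toString();

        // Stand-in workload (assumption); replace with real parser calls to compare libraries.
        Function<String, Object> standIn = s -> s.split("<").length;
        System.out.printf("%.3f ms per parse%n", meanMillis(standIn, html, 200, 1000));
    }
}
```

Single-shot timings on the JVM are dominated by class loading and JIT compilation, so a warm-up phase and an average over many runs give a fairer comparison.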

The XPath part is also rather baffling,

blogs/jsoup6.md

Lines changed: 2 additions & 2 deletions
@@ -167,15 +167,15 @@ Jsoup Code Walkthrough, Part 6: parser (part 2)
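1. The start tag is missing and only the end tag is written

```java
case EndTag:
    if (StringUtil.in(name, "div", "dl", "fieldset", "figcaption", "figure", "footer", "header", "pre", "section", "summary", "ul")) {
        // No matching open element is in scope: report a parse error and reject the end tag
        if (!tb.inScope(name)) {
            tb.error(this);
            return false;
        }
    }
```

Congratulations: this `</div>` is treated as an error and discarded, so your page will undoubtedly end up garbled! Of course, if you have merely written one extra `</div>`, it does not seem to do any harm, does it? (I remember someone telling me a story about adding a few extra `</div>` tags at the bottom of a page to guard against unclosed tags.)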
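
The idea behind the `inScope` check can be sketched as a walk over the stack of open elements. This is a simplified illustration, not Jsoup's implementation: the class `ScopeCheckSketch`, the use of plain strings for elements, and the reduced set of scope boundaries are assumptions (the HTML5 parsing rules list more boundary elements than shown here).

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

class ScopeCheckSketch {

    // Simplified subset of scope boundaries (assumption); the HTML5 rules define a longer list.
    private static final Set<String> SCOPE_BOUNDARIES =
            new HashSet<>(Arrays.asList("html", "table", "td", "th", "caption"));

    // Walks the open-element stack from the innermost element outwards.
    static boolean inScope(Deque<String> openElements, String name) {
        for (String open : openElements) {
            if (open.equals(name)) return true;
            if (SCOPE_BOUNDARIES.contains(open)) return false;
        }
        return false;
    }

    // Mirrors the quoted branch: a stray end tag is an error and leaves the stack untouched.
    static boolean handleEndTag(Deque<String> openElements, String name) {
        if (!inScope(openElements, name)) {
            return false;   // parse error: nothing to close
        }
        // Pop up to and including the matching element.
        while (!openElements.pop().equals(name)) { }
        return true;
    }

    public static void main(String[] args) {
        Deque<String> open = new ArrayDeque<>();
        open.push("html");
        open.push("body");
        System.out.println(handleEndTag(open, "div"));   // stray </div>
        open.push("div");
        System.out.println(handleEndTag(open, "div"));   // matching </div>
    }
}
```

A stray `</div>` with no open `div` in scope is rejected without touching the stack, which matches the `return false` branch in the excerpt.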