Commit cfd2ce3

readme
1 parent c06f6d1 commit cfd2ce3

4 files changed: +224 -1 lines changed

README.md

Lines changed: 2 additions & 0 deletions
@@ -18,6 +18,8 @@ Fork of Jsoup in [https://github.com/jhy/jsoup](https://github.com/jhy/jsoup)
 
 ### [Jsoup Code Walkthrough 5: parser (Part 2)](https://github.com/code4craft/jsoup/blob/master/blogs/jsoup5.md)
 
+### [Jsoup Code Walkthrough 6: parser (Part 3)](https://github.com/code4craft/jsoup/blob/master/blogs/jsoup6.md)
+
 -------
 
 ## License:

blogs/jsoup6.md

Lines changed: 178 additions & 0 deletions
@@ -0,0 +1,178 @@
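Jsoup Code Walkthrough 6: parser (Part 3)
--------
Life has been busy lately: my daughter keeps staying up in the middle of the night, and I have not been in great shape either. Work has not gone smoothly; I have plenty of ideas but few of them get accepted, and some things take more than writing good code. Never mind. I should keep the right attitude, since I am still junior, and carry on with my own work.

I am not reading the Jsoup source out of boredom. The goal is to make webmagic better, since a parser is one of the key components of a crawler. Reading the code has paid off, and I now understand HTML better as well.

## How the DOM tree is built

Calling the `TreeBuilder` part on its own "syntax analysis" may be slightly inaccurate, since it is really just the process of building a DOM tree from tokens, but I will keep the compiler term anyway.

`TreeBuilder` is also a facade object. The code that actually performs the parsing is the following:
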
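<!-- lang: java -->
    protected void runParser() {
        while (true) {
            // Pull the next token from the lexer.
            Token token = tokeniser.read();

            // Let the concrete tree builder (HTML or XML) handle it.
            process(token);

            // EOF is processed like any other token, then the loop ends.
            if (token.type == Token.TokenType.EOF)
                break;
        }
    }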
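
`TreeBuilder` has two subclasses, `HtmlTreeBuilder` and `XmlTreeBuilder`. `XmlTreeBuilder` is, naturally, the class that builds XML trees. Its implementation is quite simple: it basically maintains a stack and inserts nodes according to the token type:
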
<!-- lang: java -->
    @Override
    protected boolean process(Token token) {
        // start tag, end tag, doctype, comment, character, eof
        switch (token.type) {
            case StartTag:
                insert(token.asStartTag());
                break;
            case EndTag:
                popStackToClose(token.asEndTag());
                break;
            case Comment:
                insert(token.asComment());
                break;
            case Character:
                insert(token.asCharacter());
                break;
            case Doctype:
                insert(token.asDoctype());
                break;
            case EOF: // could put some normalisation here if desired
                break;
            default:
                Validate.fail("Unexpected token type: " + token.type);
        }
        return true;
    }

The `insert` code for a start tag looks roughly like this (several methods, including `insertNode`, have been merged here to make it easier to show):

<!-- lang: java -->
    Element insert(Token.StartTag startTag) {
        Tag tag = Tag.valueOf(startTag.name());
        Element el = new Element(tag, baseUri, startTag.attributes);
        // Attach the new element to the element currently on top of the stack.
        stack.getLast().appendChild(el);
        if (startTag.isSelfClosing()) {
            tokeniser.acknowledgeSelfClosingFlag();
            if (!tag.isKnownTag()) // unknown tag, remember this is self closing for output. see above.
                tag.setSelfClosing();
        } else {
            // Only elements that can have children become the new top of the stack.
            stack.add(el);
        }
        return el;
    }
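
The stack discipline described above can be reduced to a few lines. The following is only a sketch under simplifying assumptions: tokens are plain strings (`<x>`, `</x>`, `<x/>`), and `StackTreeSketch`, `Node`, `build` and `render` are hypothetical names, not jsoup classes. It shows a start tag being appended to the top of the stack and pushed, a self-closing tag being appended without a push, and an end tag popping back to the nearest matching open element.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical, simplified stand-in for a stack-based tree builder; not jsoup's real classes. */
final class StackTreeSketch {

    /** A bare-bones element node: a name plus child nodes. */
    static final class Node {
        final String name;
        final List<Node> children = new ArrayList<>();

        Node(String name) { this.name = name; }

        /** Renders the subtree as nested parentheses, for example a(b(c),d). */
        String render() {
            if (children.isEmpty()) return name;
            StringBuilder sb = new StringBuilder(name).append('(');
            for (int i = 0; i < children.size(); i++) {
                if (i > 0) sb.append(',');
                sb.append(children.get(i).render());
            }
            return sb.append(')').toString();
        }
    }

    /** Builds a tree from string tokens: "<x>" start tag, "</x>" end tag, "<x/>" self-closing. */
    static Node build(String... tokens) {
        Node root = new Node("#root");
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        for (String token : tokens) {
            if (token.startsWith("</")) {
                popStackToClose(stack, token.substring(2, token.length() - 1));
            } else if (token.endsWith("/>")) {
                // Self-closing: attach to the current element, but do not push.
                stack.peek().children.add(new Node(token.substring(1, token.length() - 2)));
            } else {
                Node el = new Node(token.substring(1, token.length() - 1));
                stack.peek().children.add(el);
                stack.push(el);
            }
        }
        return root;
    }

    /** Pops up to and including the nearest open element with this name; unmatched end tags are ignored. */
    private static void popStackToClose(Deque<Node> stack, String name) {
        boolean found = false;
        for (Node open : stack) {
            if (open.name.equals(name)) { found = true; break; }
        }
        if (!found) return;
        while (!stack.pop().name.equals(name)) {
            // keep popping inner elements that were left open
        }
    }

    public static void main(String[] args) {
        System.out.println(build("<a>", "<b>", "<c/>", "</b>", "<d>", "</a>").render());
    }
}
```

Note that closing `</a>` while `<d>` is still open simply pops `d` along the way, which is why this kind of builder needs no separate error recovery.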
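
## The HTML parsing state machine

Compared with `XmlTreeBuilder`, `HtmlTreeBuilder` is more complex. Besides a similar stack structure, it uses `HtmlTreeBuilderState` to build a state machine for analysing HTML. Why? Let us look at which states `HtmlTreeBuilderState` actually uses (each state is marked in the code below with `<!-- State: -->`):
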
<!-- lang: html -->
    <!-- State: Initial -->
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <!-- State: BeforeHtml -->
    <html lang='zh-CN' xml:lang='zh-CN' xmlns='http://www.w3.org/1999/xhtml'>
    <!-- State: BeforeHead -->
    <head>
      <!-- State: InHead -->
      <script type="text/javascript">
      //<!-- State: Text -->
        function xx(){
        }
      </script>
      <noscript>
        <!-- State: InHeadNoscript -->
        Your browser does not support JavaScript!
      </noscript>
    </head>
    <!-- State: AfterHead -->
    <body>
    <!-- State: InBody -->
    <textarea>
        <!-- State: Text -->
        xxx
    </textarea>
    <table>
        <!-- State: InTable -->
        <!-- State: InTableText -->
        xxx
        <tbody>
        <!-- State: InTableBody -->
        </tbody>
        <tr>
            <!-- State: InRow -->
            <td>
                <!-- State: InCell -->
            </td>
        </tr>
    </table>
    </html>
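
As this shows, HTML tags have nesting requirements; for example, `<tr>` and `<td>` must be used together with `<table>`. From the Jsoup code, `HtmlTreeBuilderState` does the following:

* ### Syntax checking

For example, a `tr` that is not nested inside a `table` tag is a syntax error. If any of the following tags appears directly in the `InBody` state, it is an error. When Jsoup hits such an error, it drops the token, records the error, and continues parsing what follows rather than aborting.
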
<!-- lang: java -->
    InBody {
        boolean process(Token t, HtmlTreeBuilder tb) {
            // Excerpt: `name` is the name of the start tag being processed.
            if (StringUtil.in(name,
                    "caption", "col", "colgroup", "frame", "head", "tbody", "td", "tfoot", "th", "thead", "tr")) {
                tb.error(this);  // record the error
                return false;    // drop this token; parsing continues with the next one
            }
            // ... remaining cases omitted
        }
    }
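
* ### Tag completion

For example, if the `head` tag is not closed and tags allowed only inside body appear, `</head>` is closed automatically. Some states in `HtmlTreeBuilderState` have an `anythingElse()` method that provides this auto-completion. For instance, the auto-close code for the `InHead` state is:
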
<!-- lang: java -->
    private boolean anythingElse(Token t, TreeBuilder tb) {
        // Synthesize the missing </head>, then reprocess the original token in the new state.
        tb.process(new Token.EndTag("head"));
        return tb.process(t);
    }

Tags can also be closed by checking which element is currently in scope, as in the following code:

<!-- lang: java -->
    private void closeCell(HtmlTreeBuilder tb) {
        if (tb.inTableScope("td"))
            tb.process(new Token.EndTag("td"));
        else
            tb.process(new Token.EndTag("th")); // only here if th or td in scope
    }
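
The two behaviours in this list, recording an error while dropping the token and closing a tag by synthesizing an end tag and reprocessing, can be sketched with one enum constant per state. This is a deliberately tiny illustration with string tokens; `HeadStateSketch`, `Builder` and the three states are hypothetical names chosen to mirror the pattern, not jsoup's actual API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical miniature of the state-per-enum-constant pattern; not jsoup's real API. */
final class HeadStateSketch {

    enum State {
        IN_HEAD {
            @Override boolean process(String token, Builder b) {
                if (token.equals("<title>") || token.equals("<meta>")) {
                    b.emit(token);
                    return true;
                }
                if (token.equals("</head>")) {
                    b.emit(token);
                    b.state = AFTER_HEAD;
                    return true;
                }
                return anythingElse(token, b);
            }

            /** Auto-closes head, then lets the next state handle the original token. */
            private boolean anythingElse(String token, Builder b) {
                b.process("</head>");
                return b.process(token);
            }
        },
        AFTER_HEAD {
            @Override boolean process(String token, Builder b) {
                if (token.equals("<body>")) {
                    b.emit(token);
                    b.state = IN_BODY;
                    return true;
                }
                // Anything else implies a body: synthesize it and reprocess.
                b.process("<body>");
                return b.process(token);
            }
        },
        IN_BODY {
            @Override boolean process(String token, Builder b) {
                if (token.equals("<head>") || token.equals("<tr>")) {
                    b.errors.add("unexpected " + token + " in IN_BODY");
                    return false; // drop the token, keep parsing
                }
                b.emit(token);
                return true;
            }
        };

        abstract boolean process(String token, Builder b);
    }

    static final class Builder {
        State state = State.IN_HEAD;
        final List<String> output = new ArrayList<>();
        final List<String> errors = new ArrayList<>();

        boolean process(String token) { return state.process(token, this); }
        void emit(String token) { output.add(token); }
    }

    static Builder run(String... tokens) {
        Builder b = new Builder();
        for (String t : tokens) b.process(t);
        return b;
    }

    public static void main(String[] args) {
        Builder b = run("<title>", "<p>", "<tr>", "</p>");
        System.out.println(b.output);
        System.out.println(b.errors);
    }
}
```

Because every state receives the builder and can call `process` again, a single unexpected token can cascade through several synthesized tags before it is finally handled.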
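
## Case study

### What happens when a tag is missing?

Having read this much parser source, let us return to everyday use. We know that leaving one or two unclosed tags in a page is perfectly common, so how are they parsed?

Take the `<div>` tag as an example:

1. The start tag is missing and only the end tag is written
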
<!-- lang: java -->
    case EndTag:
        // Excerpt: `name` is the name of the end tag being processed.
        if (StringUtil.in(name, "div", "dl", "fieldset", "figcaption", "figure", "footer", "header", "pre", "section", "summary", "ul")) {
            if (!tb.inScope(name)) {
                // No matching open element in scope: record the error and drop the end tag.
                tb.error(this);
                return false;
            }
        }
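
Congratulations: this `</div>` is treated as an error and discarded, so your page will no doubt be messed up! Of course, if you merely wrote one extra `</div>`, it seems to have no effect at all. (I remember someone telling me a story about adding a few extra `</div>` at the bottom of a page to guard against unclosed tags.)

2. The start tag is written but the end tag is missing

This case is more complicated to analyse. If the tag cannot nest content inside itself, it is closed when an unacceptable tag is encountered. A `<div>`, however, can contain most tags, so in this case its scope extends to the end of the HTML.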
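
Both cases can be reproduced with a small model of the open-element stack. The sketch below is an illustration only: `DivScopeSketch` and its methods are hypothetical, and the scope boundary is assumed to be just `html` and `table`, a much smaller set than a real HTML parser uses. A stray `</div>` fails the scope check and is recorded as an error, while an unclosed `<div>` simply stays on the stack until the input ends.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical model of scope checking on a stack of open elements; not jsoup's real API. */
final class DivScopeSketch {
    final Deque<String> stack = new ArrayDeque<>(); // open elements, innermost first
    final List<String> errors = new ArrayList<>();

    /** Walks from the innermost open element outward and stops at a scope boundary. */
    boolean inScope(String name) {
        for (String open : stack) {
            if (open.equals(name)) return true;
            // Assumed boundary set for this sketch; real parsers use a longer list.
            if (open.equals("html") || open.equals("table")) return false;
        }
        return false;
    }

    void startTag(String name) {
        stack.push(name);
    }

    /** Returns false, recording an error, when no matching element is in scope. */
    boolean endTag(String name) {
        if (!inScope(name)) {
            errors.add("stray </" + name + ">");
            return false;
        }
        while (!stack.pop().equals(name)) {
            // implicitly closes anything left open inside the element
        }
        return true;
    }

    /** Feeds string tokens of the form "<x>" or "</x>" into a fresh sketch. */
    static DivScopeSketch parse(String... tokens) {
        DivScopeSketch s = new DivScopeSketch();
        for (String token : tokens) {
            if (token.startsWith("</")) {
                s.endTag(token.substring(2, token.length() - 1));
            } else {
                s.startTag(token.substring(1, token.length() - 1));
            }
        }
        return s;
    }

    public static void main(String[] args) {
        // Case 1: end tag without a start tag.
        System.out.println(parse("<html>", "<body>", "</div>").errors);
        // Case 2: start tag without an end tag; div is still open at end of input.
        System.out.println(parse("<html>", "<body>", "<div>", "<p>").stack);
    }
}
```

In this model the extra `</div>` leaves the stack untouched, which matches the observation that a single surplus end tag appears to do no harm.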
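
That wraps up the parser series. I learned a lot about HTML and state machines along the way, but it is fairly far from practical use. Next comes the select part, which is probably more relevant to day-to-day use.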

src/main/java/org/jsoup/parser/HtmlTreeBuilderState.java

Lines changed: 2 additions & 1 deletion
@@ -315,7 +315,8 @@ boolean process(Token t, HtmlTreeBuilder tb) {
                         tb.transition(InFrameset);
                     }
                 } else if (StringUtil.in(name,
-                        "address", "article", "aside", "blockquote", "center", "details", "dir", "div", "dl",
+                        "address", "article", "aside", "blockquote", "center", "details", "dir", "" +
+                        "div", "dl",
                         "fieldset", "figcaption", "figure", "footer", "header", "hgroup", "menu", "nav", "ol",
                         "p", "section", "summary", "ul")) {
                     if (tb.inButtonScope("p")) {
Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
package us.codecraft.learning.parser;

import org.jsoup.nodes.Document;
import org.jsoup.parser.ParseError;
import org.jsoup.parser.Parser;

import java.util.List;

/**
 * @author code4crafter@gmail.com
 */
public class ParserCorrectorTest {

    public static void main(String[] args) {
        // Input with two <div> start tags but only one </div>.
        String htmlWithDivUnclosed = "<body>\n" +
                " <textarea>\n" +
                " &lt;!-- Text --&gt;\n" +
                " xxx\n" +
                " </textarea> \n" +
                " <div> \n" +
                " <div>\n" +
                " <table> \n" +
                " <!-- InTable --> \n" +
                " <!-- InTableText --> xxx \n" +
                " <tbody> \n" +
                " <tr> \n" +
                " <!-- InRow --> \n" +
                " <td> \n" +
                " <!-- InCell --> </td> \n" +
                " </tr> \n" +
                " </tbody> \n" +
                " </table> \n" +
                " </div> \n" +
                "</body>";
        Parser parser = Parser.htmlParser();
        // Track up to 100 parse errors so they can be inspected after parsing.
        parser.setTrackErrors(100);
        Document document = parser.parseInput(htmlWithDivUnclosed, "");
        List<ParseError> errors = parser.getErrors();
        System.out.println(errors);

    }
}
