reformat

code4craft · code4craft · commit 038ad231382e · 2013-09-01T08:15:31.000+08:00
diff --git a/blogs/htmlcleaner/htmlcleaner.md b/blogs/htmlcleaner/htmlcleaner.md
@@ -0,0 +1,44 @@
+Studying the htmlcleaner source code
+---
+Unlike Jsoup, htmlcleaner supports extraction with XPath, which is quite useful.
+
+htmlcleaner is hosted on SourceForge at [http://htmlcleaner.sourceforge.net/](http://htmlcleaner.sourceforge.net/).
+For some reason, access to SourceForge is not very smooth, so I ended up using this fairly recent fork on GitHub: [https://github.com/amplafi/htmlcleaner](https://github.com/amplafi/htmlcleaner).
+
+htmlcleaner's package structure falls somewhat short of Jsoup's: at first glance I was put off by all the classes lined up in one flat list.
+
+htmlcleaner still has its own tree structure, whose nodes derive from `HtmlNode`, but it provides conversion to `org.w3c.dom.Document` and `org.jdom2.Document`.
+
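+`HtmlTokenizer` is the lexical analysis part. It is stateful but does not use a state machine; instead it keeps its state in a handful of primitive-typed fields, for example: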
+
+    public class HtmlTokenizer {
+
+        private BufferedReader _reader;
+        private char[] _working = new char[WORKING_BUFFER_SIZE];
+
+        private transient int _pos;
+        private transient int _len = -1;
+        private transient int _row = 1;
+        private transient int _col = 1;
+        
+
+        private transient StringBuffer _saved = new StringBuffer(512);
+
+        private transient boolean _isLateForDoctype;
+        private transient DoctypeToken _docType;
+        private transient TagToken _currentTagToken;
+        private transient List<BaseToken> _tokenList = new ArrayList<BaseToken>();
+        private transient Set<String> _namespacePrefixes = new HashSet<String>();
+
+        private boolean _asExpected = true;
+
+        private boolean _isScriptContext;
+    }
+
+It has a strong flavor of procedural programming.
+
+After `Tokenize`, the tree is simply assembled with a stack.
+
+In a quick test on a 44k document, parsing took 3.5ms with Jsoup and 7.9ms with htmlcleaner, so htmlcleaner is roughly twice as slow.
+
+The XPath part also left me in a fog,
diff --git a/blogs/jsoup6.md b/blogs/jsoup6.md
@@ -167,15 +167,15 @@ Reading the Jsoup Source, Part 6: parser (part 2)
 
 1. The opening tag is missing and only the closing tag is written
 
-```java
+	```java
 		case EndTag:
 			if (StringUtil.in(name,"div","dl", "fieldset", "figcaption", "figure", "footer", "header", "pre", "section", "summary", "ul")) {                
 				if (!tb.inScope(name)) {
 				tb.error(this);
 				return false;
 				} 
 			}	
-```
+	```
 			
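 	Congratulations: this `</div>` will be treated as an error and dropped, so your page will undoubtedly end up broken! Of course, if you simply wrote one extra `</div>`, it seems there would be no real effect? (I remember someone telling me a story about adding a few extra `</div>` tags at the bottom of a page to guard against unclosed tags.)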
 	
