webptt_crawler

Running Environment:

Editor: Sublime Text

Python version: 2.7.x

OS: Windows 7

Required packages:
```
 bs4 (beautifulsoup)
 lxml
 mechanize
 glob (part of the Python standard library)
```
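Before running the crawler, it can help to confirm that the modules listed above are importable. The following is a minimal sketch, not part of this repository; the helper name `find_missing` is hypothetical.

```python
import importlib

# Module names as listed in this README; glob ships with Python itself.
REQUIRED_MODULES = ["bs4", "lxml", "mechanize", "glob"]


def find_missing(modules):
    """Return the names in `modules` that cannot be imported."""
    missing = []
    for name in modules:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All required packages are importable.")
```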
Single-board crawling (starts from a chosen page, then keeps checking for the latest posts and crawling them indefinitely)
```
 $ python crawler_auto.py [Boardname (case sensitive)] [Start page number]
```
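As an illustration of how the two command-line arguments might be consumed, here is a small sketch. It is not the code in `crawler_auto.py`: the function names `parse_args` and `index_url` are hypothetical, and the URL pattern for PTT's web index pages is an assumption that does not come from this README.

```python
import argparse


def parse_args(argv=None):
    """Parse a board name and a start page number (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Sketch of argument handling for a single-board crawl.")
    parser.add_argument("board", help="board name (case sensitive)")
    parser.add_argument("start_page", type=int,
                        help="index page number to start from")
    return parser.parse_args(argv)


def index_url(board, page):
    """Build an index page URL for `board`.

    Assumption: PTT's web index pages follow this pattern; the README
    does not state it.
    """
    return "https://www.ptt.cc/bbs/{0}/index{1}.html".format(board, page)


# Example with explicit arguments instead of sys.argv.
args = parse_args(["Gossiping", "100"])
print(index_url(args.board, args.start_page))
```

The board name is passed through unchanged because it is case sensitive.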
Multiple-board crawling

Warning: this mode is very unstable because of network issues and the large number of requests it sends.

Edit "boardlist.txt" with Sublime Text (Notepad can cause newline problems), putting one board name per line (separated by '\n', i.e. an "Enter" in the editor), then run:
```
 $ python crawler_multi.py
```
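The newline caveat above can be sidestepped when reading the list. The sketch below shows one way to load a newline-separated board list while tolerating Windows line endings, blank lines, and a leading byte-order mark. It is not the code used by `crawler_multi.py`; the function name `read_board_list` and the UTF-8 encoding are assumptions.

```python
import io


def read_board_list(path="boardlist.txt"):
    """Return board names from a file with one name per line."""
    # "utf-8-sig" drops a leading byte-order mark if an editor added one.
    # Assumption: the file is UTF-8 encoded.
    with io.open(path, "r", encoding="utf-8-sig") as handle:
        # splitlines() treats "\n", "\r\n", and "\r" as line breaks.
        lines = handle.read().splitlines()
    # Drop surrounding whitespace and skip empty lines.
    return [line.strip() for line in lines if line.strip()]
```

Board names are returned exactly as written, since they are case sensitive.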
Parsing data

Raw data must be stored in the "raw_data" folder. Then run the following commands:

(1) Parse post contents, arranged by author, along with "post_info.txt", which provides post_id, author, title, and datetime for each post, using '\t' as the separator. The results are stored in the "output" folder.
```
 $ python content_parser.py [Boardname (case sensitive)]
```
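To consume "post_info.txt" afterwards, a tab-separated reader along these lines may be enough. This is a sketch under assumptions: the names `PostInfo`, `parse_post_info_line`, and `load_post_info` are hypothetical, the file is assumed to be UTF-8, and each line is assumed to hold exactly the four fields listed above in that order.

```python
import io
from collections import namedtuple

# Field order as described in this README.
PostInfo = namedtuple("PostInfo", ["post_id", "author", "title", "datetime"])


def parse_post_info_line(line):
    """Split one tab-separated line into a PostInfo, or None if malformed."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 4:
        return None
    return PostInfo(*fields)


def load_post_info(path):
    """Read every well-formed record from the file at `path`."""
    # Assumption: the file is UTF-8 encoded.
    with io.open(path, "r", encoding="utf-8") as handle:
        records = [parse_post_info_line(line) for line in handle]
    return [record for record in records if record is not None]
```

The datetime field is kept as a string because the README does not specify its format.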
(2) Parse pushes (comments), arranged by board. Each output file is named after a post_id and provides push_status, author, and push_content, using '\t' as the separator. The results are stored in the "output_push" folder.
```
 $ python push_parser.py [Boardname (case sensitive)]
```
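The push files can be read in the same tab-separated way. The sketch below tallies pushes by status for one post. The function names are hypothetical, and the status values in the usage example ("push", "boo") are placeholders, since the README does not list the actual values.

```python
from collections import Counter


def parse_push_line(line):
    """Return (push_status, author, push_content), or None if malformed."""
    # Split on the first two tabs only, so tabs inside the content survive.
    fields = line.rstrip("\r\n").split("\t", 2)
    if len(fields) != 3:
        return None
    return tuple(fields)


def count_push_status(lines):
    """Count how often each push_status occurs in an iterable of lines."""
    counts = Counter()
    for line in lines:
        parsed = parse_push_line(line)
        if parsed is not None:
            counts[parsed[0]] += 1
    return counts


# Placeholder data; real files may use different status values.
sample = ["push\tuserA\tnice post\n", "boo\tuserB\tdisagree\n"]
print(count_push_status(sample))
```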

Feel free to send any questions to joekaojoekao@gmail.com
