
Happy-Crawler

These are crawlers made by the HC-team. If you like them or want to see more upgrades, please press Star at the top right corner, or join us (contact: zoucongyu1993@hotmail.com, WeChat: zoucongyu1109). Thank you!

To use a crawler, just:

    from xxx_crawler import xxx_crawler
    xxx_c = xxx_crawler()
    xxx_c.run()

Please be careful: these crawlers are designed for research purposes only; no commercial usage is allowed. The user takes full responsibility for any illegal use or intended attack.

1) bbc_crawler

It is good at dealing with news pages.
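Its exact constructor arguments are not documented here, but following the generic pattern above, usage should look like this sketch:

    from bbc_crawler import bbc_crawler

    b_c = bbc_crawler()
    b_c.run()  # crawls BBC news pages via the common run() interface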

2) douban_crawler

It is good at finding hot words.

To use it, just:

    from douban_crawler import douban_crawler
    d_c = douban_crawler()
    d_c.run()

It will crawl 5 pages of the douban group blabla (https://www.douban.com/group/blabla//discussion?start=0), get the posts, break them down into words, calculate the heat of each word, and then save a word-cloud graph as HTML in the material folder.

Example:

[sample word-cloud image]

2.1 Pages and flags

You can adjust the number of pages and the word flags (parts of speech) of interest.

    d_c.run(num_pages=5, consider_flags=['n','nr','nrt'])  # these are the default values

Some basic flags are: "n" for noun, "a" for adjective, "v" for verb. For more details, please see: https://blog.csdn.net/suibianshen2012/article/details/53487157
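These flags match the part-of-speech tags used by the jieba segmenter ("nr" is a person name, "nrt" a transliterated person name). Assuming the crawler segments posts with jieba, which the flag set suggests but the README does not state, you can inspect the tags yourself:

    import jieba.posseg as pseg

    # each pair exposes .word and .flag; 'n' marks nouns, 'v' verbs, 'a' adjectives
    for pair in pseg.cut("豆瓣小组的热门帖子"):
        print(pair.word, pair.flag)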

2.2 Heat calculation

You can also adjust the method used to calculate heat; currently, three different methods are supported.

    d_c.run(include_heat=False)  # considers only the frequency of words
    d_c.run(adjustment='log')    # a compromise between word frequency and post heat; default is None
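The exact formulas are not documented in the repo; here is a minimal sketch of how the three modes described above might combine word frequency with post heat (the function and variable names are hypothetical):

    import math

    def word_heat(freq, post_heat, include_heat=True, adjustment=None):
        # hypothetical reconstruction of the three modes
        if not include_heat:
            return freq                            # 1) frequency only
        if adjustment == 'log':
            return freq * math.log(1 + post_heat)  # 3) log-damped compromise
        return freq * post_heat                    # 2) full post-heat weighting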

2.3 Get raw data

The douban_crawler also provides other ways to visualize the data.

    d_c.get_words_list(num_pages=5, include_heat=True)
    d_c.create_words_table(get_raw_data=True)  

It will output a CSV file to the material folder. The DataFrame has three columns: word, flag, and heat. The heat is the heat of the original post.
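Because the output is a plain CSV with those three columns, you can also load it with pandas and build your own view (the file name below is a placeholder; check the material folder for the actual name):

    import pandas as pd

    df = pd.read_csv('material/words_table.csv')  # placeholder file name
    # total heat per word, highest first
    top = df.groupby('word')['heat'].sum().sort_values(ascending=False)
    print(top.head(20))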

3) wg_crawler

It will scrape 10 pages of the website wg-gesucht.de with the filter set to {Munich, WG}, build a DataFrame containing information about the WGs, and store it in the material folder under the name 'The_wg_information_in_munich'.

The DataFrame contains 4 columns: name, link, room_size, price.

To use it, just:

    from wg_crawler import wg_crawler
    w_c = wg_crawler()
    w_c.run()

If you want to access the DataFrame:

    the_dataframe = w_c.df
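With the four documented columns you can, for example, rank listings by price per square metre (a sketch; it assumes room_size and price are stored as numbers):

    # price per square metre, cheapest first (assumes numeric columns)
    the_dataframe['eur_per_sqm'] = the_dataframe['price'] / the_dataframe['room_size']
    print(the_dataframe.sort_values('eur_per_sqm')[['name', 'price', 'room_size', 'link']].head())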

3.1 Relationship between room size and price

    from wg_crawler import wg_analyse  # assuming wg_analyse lives in the same module; check the source

    w_c = wg_crawler()
    w_a = wg_analyse(w_c)   # to analyse the data, first connect the analyser to the crawler
    w_a.size_price()        # stores an image in the material folder; set the path with path='...'

The figure will look like:

[sample scatter-plot image]
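If you prefer to plot it yourself, here is a minimal matplotlib sketch of the same size-vs-price scatter, assuming w_c.df holds the scraped DataFrame as shown above:

    import matplotlib.pyplot as plt

    df = w_c.df
    plt.scatter(df['room_size'], df['price'])
    plt.xlabel('room size (m²)')
    plt.ylabel('price (EUR)')
    plt.savefig('material/size_price.png')  # hypothetical output path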

Main contributors:

Apollo1840

Other folders:

tutorial

It contains some learning materials.

(A very important tool I often use is Pandas; a tutorial about Pandas is in another repo of mine: Data-Analysis-Tools.)

Upcoming parts are:

  • how to login
  • how to use req.pull
  • more about BeautifulSoup

dummy_websites

It contains some simple demonstrations of HTML5.

material

Not needed for usage; it stores the sample outputs of the crawlers.
