Get the most popular repository from the GitHub Python trending page.
- Here we crawl and parse the HTML from https://github.com/trending/python?since=daily, although you could also get the data from api.github.com.
- Get the request args (a minimal sketch of the equivalent request follows this list).
- Use the URL: https://github.com/trending/python?since=daily
- Or copy the cURL string from Chrome DevTools.
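For reference, here is a minimal Python sketch of the equivalent request; the User-Agent value is just an example desktop-browser string, matching the request_args generated later:

```python
import requests

# Minimal sketch of the request the crawler rule will make.
url = "https://github.com/trending/python?since=daily"
headers = {
    # Example desktop-browser User-Agent; any realistic value works.
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"
    ),
}
resp = requests.get(url, headers=headers, timeout=10)
print(resp.status_code)  # expect 200
```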
- Create a crawler rule
- Open the watchdog page (default: http://127.0.0.1:9901/).
- Click the <New> tab.
- The first step is to set the CrawlerRule's meta info.
- Now make sure the request is correct.
- Click the <cURL Parse> link.
- Input the cURL string or URL obtained above.
- It then generates the default regex and request args; they may need some changes to match more URL patterns (a regex sketch follows below).
- Click the <Download> button and wait for the download to finish => Response Body [200]
- If <Rule Name> is still empty after downloading, fill it in manually.
- Check the downloaded source code and make sure it is what you want.
- You can also check it in the parse rules by using a rule named `__schema__`; the parser will raise an error unless this `__schema__` rule returns `True`.
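For illustration only, a hypothetical `__schema__` rule could be a ParseRule whose `udf` expression checks the downloaded HTML and evaluates to `True`; this sketch assumes the `udf` parser receives the response body as `input_object`, as in the `udf` rule of the CrawlerRule JSON further down:

```python
# Hypothetical __schema__ ParseRule sketch (same shape as the ParseRules in
# the CrawlerRule JSON below). Assumption: the udf expression receives the
# downloaded HTML as `input_object` and must evaluate to True, otherwise the
# parser raises an error.
schema_rule = {
    "name": "__schema__",
    "chain_rules": [
        ["udf", "'Trending Python repositories' in input_object", ""],
    ],
    "child_rules": [],
    "iter_parse_child": False,
}
```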
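About the auto-generated regex mentioned above: it matches only the exact daily URL, so it may need to be relaxed to cover more URLs. A small sketch follows; the relaxed pattern is just an example:

```python
import re

# Pattern generated from the cURL parse step: matches only the daily page.
strict = re.compile(r"^https://github\.com/trending/python\?since=daily$")
# Example of a relaxed pattern that also matches other `since` values.
relaxed = re.compile(r"^https://github\.com/trending/python(\?since=\w+)?$")

print(bool(strict.match("https://github.com/trending/python?since=daily")))    # True
print(bool(strict.match("https://github.com/trending/python?since=weekly")))   # False
print(bool(relaxed.match("https://github.com/trending/python?since=weekly")))  # True
```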
- Now set the ParseRules of this CrawlerRule.
- A valid CrawlerRule should contain a `text` rule and a `url` rule, and the `url` rule is optional.
- Delete the existing text rule and create a new parse rule named `list`.
- Create a new Parse Rule as below:
- Here we get the list item for the child rules.
- Then we need two child rules named `text` and `url` for the `list` rule.
- Create a new parse rule named `text` like this:
- Click the button to send the `text` rule to the `list` rule.
- Create a new parse rule named `url` like `text`, or ignore this rule. But the `$text` attribute should use `@href` to get the href attribute. Also send this rule to the `list` rule (a plain-Python sketch of what these rules extract follows this list).
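To make the rules above concrete, here is a rough plain-Python equivalent of what the `list`, `text`, and `url` rules extract, using requests and BeautifulSoup purely for illustration (the crawler itself applies the chain rules shown in the JSON below):

```python
import requests
from bs4 import BeautifulSoup

# Rough equivalent of the list/text/url parse rules:
# css "h1.lh-condensed>a" -> take index 0 -> read the text and @href.
url = "https://github.com/trending/python?since=daily"
headers = {"User-Agent": "Mozilla/5.0"}  # example header only
html = requests.get(url, headers=headers, timeout=10).text

soup = BeautifulSoup(html, "html.parser")
first = soup.select("h1.lh-condensed>a")[0]         # top trending repository
text = first.get_text().strip().replace("\n", "")   # like the udf chain rule
link = "https://github.com" + first["href"]         # absolute URL, like the re-replace rule
print({"text": text, "url": link})
```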
- OK, now click the <Parse> button to parse this CrawlerRule and get the result.
- Click the <1. Save Crawler Rule> button to save the rule into the database.
Parse result:

```
{'Trending Python repositories on GitHub today · GitHub': {'list': {'text': 'gwen001 / pentest-tools', 'url': 'https://github.com/gwen001/pentest-tools'}}}
```

CrawlerRule JSON (this JSON string can be loaded by clicking the <Loads> button):

```json
{"name":"Trending Python repositories on GitHub today · GitHub","request_args":{"method":"get","url":"https://github.com/trending/python?since=daily","headers":{"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"}},"parse_rules":[{"name":"list","chain_rules":[["css","h1.lh-condensed>a","$string"],["python","index","0"],["re","=\"/","@=\"https://github.com/"]],"child_rules":[{"name":"text","chain_rules":[["css","a","$text"],["py","index","0"],["udf","input_object.strip().replace('\\n', '')",""]],"child_rules":[],"iter_parse_child":false},{"name":"url","chain_rules":[["css","a","@href"],["python","index","0"]],"child_rules":[],"iter_parse_child":false}],"iter_parse_child":false}],"regex":"^https://github\\.com/trending/python\\?since=daily$","encoding":""}
```
- Click the <2. Add New Task> button.
- Confirm the task info.
- Click the <Submit> button. The task is created successfully.
- Click <Tasks> tab.
- Double-click the task's row.
- Update it and submit.