Python simple module for data grabbing from websites with JavaScript support

rootKot/invader

Invader

Invader is a simple Python module for grabbing data from websites, with JavaScript support.

Invader is based on BeautifulSoup and dryscrape.


Dependencies

Getting Started

  • Install all dependencies if you haven't already:
$ sudo pip install requests
$ sudo apt-get install python-bs4
$ sudo pip install beautifulsoup4
$ sudo apt-get install qt5-default libqt5webkit5-dev build-essential python-lxml python-pip xvfb
$ sudo pip install dryscrape
  • Install invader:
$ sudo pip install invader

Example of grabbing a list of items:

from invader import Invader

url = 'https://duckduckgo.com/?q=python&t=hb&ia=web'
invader = Invader(url, js=True)

res = invader.take_list('#links .result', {
    'title': ['.result__a', 'text'],
    'src': ['.result__a', 'href']
})

print(res)

The response will be a list of dictionaries, each containing an item's title and link URL:

[
    {"title": "Welcome to Python.org", "src": "https://www.python.org/"},
    {"title": "Python (programming language) - Wikipedia", "src": "https://en.wikipedia.org/wiki/Python_%28programming_language%29"},
    {"title": "Python | Codecademy", "src": "https://www.codecademy.com/learn/python"}
]

Here are some examples of usage

Documentation

First, import the Invader class from invader. Then create an Invader instance, passing the website URL as the argument, and js=True if JavaScript support is needed.

from invader import Invader
invader = Invader('http://some.site', js=True)

The content of the website is then fetched and saved in the instance.

Public functions

take(selector_list)

For example, if you have the URL of a specific topic page on a forum and only need to pull the topic title, or you need a list of all picture sources, this function makes it easy. take() receives a single list argument: the first element is the CSS selector of an HTML element, and the second is what to take from it. It returns a string or a list of results.

res = invader.take(['.content .topic-title', 'text'])

In this example, we get the text of the element with class topic-title. You can also take an attribute value from the element.

res = invader.take(['.content .topic-title a', 'href'])

The result will be:

http://some.site/link
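Because take() is described as returning either a string or a list, calling code may want to normalize the result before looping over it. The following is a minimal sketch under that assumption only; as_list is a hypothetical helper, not part of Invader, and the sample values stand in for real take() results.

```python
def as_list(result):
    """Normalize a take() result to a list of strings.

    Hypothetical helper, not part of Invader. It assumes take() returns
    either a single string or a list of strings, as the text describes.
    """
    if result is None:
        return []
    if isinstance(result, str):
        # A single match: wrap it so callers can always iterate.
        return [result]
    return list(result)


# Stand-in values used in place of real invader.take(...) calls.
single = as_list('http://some.site/link')
many = as_list(['/img/a.jpg', '/img/b.jpg'])

for value in single + many:
    print(value)
```

With this wrapper, the same loop works whether the selector matched one element or several.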

take_list(wrapper, fields_dict)

If you need information about each item on a shopping site, use this function. take_list() receives two arguments. The first is a string with the selector of the item wrapper element. The second is a dictionary whose keys are field names and whose values are lists holding a selector and the thing to take (text, src, href, etc.).

res = invader.take_list('.products-wrap > a', {
    'img_url': ['.pr-item-wrap > img', 'src'],
    'title': ['.pr-title', 'text']
})

The response will be a list of dictionaries, each containing an item's img_url and title:

[
  {"img_url": "/files/items/30735/icon_219x270.jpg", "title": "Поло  Vit 16 9713tr"},
  {"img_url": "/files/items/30734/icon_219x240.jpg", "title": "Поло  Vit 16 9713tr"}
]
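The img_url values in this sample output are site-relative paths, so they usually need to be joined with the page URL before they can be downloaded. Here is a small sketch using the standard library's urljoin; the absolutize helper and the base URL are assumptions made for illustration and are not part of Invader.

```python
from urllib.parse import urljoin


def absolutize(items, base_url, field='img_url'):
    """Return copies of the items with one URL field made absolute.

    Hypothetical helper, not part of Invader.
    """
    return [
        dict(item, **{field: urljoin(base_url, item[field])})
        for item in items
    ]


# Assumed page URL; replace it with the URL passed to Invader.
base_url = 'http://some.site/catalog/'

# Sample data copied from the output shown above.
items = [
    {"img_url": "/files/items/30735/icon_219x270.jpg", "title": "Поло  Vit 16 9713tr"},
    {"img_url": "/files/items/30734/icon_219x240.jpg", "title": "Поло  Vit 16 9713tr"},
]

for item in absolutize(items, base_url):
    print(item['img_url'])
```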

You can also pass None as the first argument if the items have no wrapper element and simply follow one after another. But be careful in that case: make sure every item has the same HTML elements that you want to get. Otherwise the order will be broken and the result will be wrong.
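The warning about missing elements can be illustrated in plain Python. The sketch below assumes that, without a wrapper, each field is collected across the whole page and the values are matched by position; that is a reading of the warning, not confirmed Invader behavior, and no Invader calls are made. The item names and the fields_aligned helper are hypothetical.

```python
# Three items on a page, where the second one has no image element.
titles = ['Item A', 'Item B', 'Item C']
img_urls = ['/img/a.jpg', '/img/c.jpg']  # Item B contributed nothing

# Matching by position pairs Item B with Item C's image and drops Item C.
paired = [{'title': t, 'img_url': u} for t, u in zip(titles, img_urls)]
print(paired)


def fields_aligned(*columns):
    """Return True when every flat field list has the same length.

    Hypothetical sanity check, not part of Invader.
    """
    return len({len(column) for column in columns}) <= 1


if not fields_aligned(titles, img_urls):
    print('Field counts differ; results would be misaligned.')
```

A length check like this only detects the mismatch; it cannot tell which item is missing the element, which is why a wrapper selector is the safer choice.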

screenshot(path)

If js is enabled, requests go through a virtual browser using dryscrape, so you can take a screenshot of the website you visited. Pass a path where the screenshot should be saved, if needed.

invader = Invader('https://google.com', js=True)
invader.screenshot('/var/www/screenshots/')