### Introduction to Web Scraping
- **BeautifulSoup**
  - *Introduction*

    Beautiful Soup is a Python package that lets us pull data out of HTML and XML documents.
  - *Beautiful Soup - Installation*

        pip install beautifulsoup4

  - *Import BeautifulSoup*

        from bs4 import BeautifulSoup
  - *Important Methods*

    1) **find**(name, attrs, recursive, string, **kwargs)

       Scans the document and returns only the first matching element, or None if nothing matches.

    2) **find_all**(name, attrs, recursive, string, limit, **kwargs)

       Returns a list of every occurrence of a matching tag in the page response.

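The difference between the two is easiest to see on a small inline snippet (a minimal sketch, using the `html.parser` backend that ships with Python):

```python
from bs4 import BeautifulSoup

html = "<ul><li>one</li><li>two</li><li>three</li></ul>"
soup = BeautifulSoup(html, "html.parser")

first = soup.find("li")            # only the first <li>
items = soup.find_all("li")        # every <li>, as a list

print(first.text)                  # one
print([li.text for li in items])   # ['one', 'two', 'three']
```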
- **Regex**

  ***Introduction***

  The Python module re provides full support for Perl-like regular expressions.

  ***Important Methods***

  * **re.match**(pattern, string, flags=0)

    Matches the pattern only at the beginning of the string. It returns a match object on success and None on failure; use the group(num) or groups() methods of the match object to retrieve the matched text.

  * **re.search**(pattern, string, flags=0)

    Scans the whole string and returns a Match object for the first location where the pattern matches, or None otherwise.

  * **re.findall**(pattern, string, flags=0)

    Returns a list containing all non-overlapping matches.

  * **re.sub**(pattern, replace_string, string)

    Replaces every match of the pattern with the text of your choice.

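A short sketch contrasting the four methods on one example string (the pattern and text here are illustrative only):

```python
import re

text = "Invoice 42 paid, invoice 7 pending"

m = re.match(r"Invoice (\d+)", text)   # anchored at the start of the string
print(m.group(1))                      # 42

s = re.search(r"pending", text)        # first match anywhere in the string
print(s.group())                       # pending

print(re.findall(r"\d+", text))        # ['42', '7']

print(re.sub(r"\d+", "#", text))       # Invoice # paid, invoice # pending
```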
  ***Metacharacters***

  - `[]` a set of characters
  - `.`  any character (except newline)
  - `^`  starts with
  - `$`  ends with
  - `*`  zero or more occurrences
  - `+`  one or more occurrences
  - `?`  zero or one occurrence
  - `{}` exactly the specified number of occurrences
  - `()` captures a group
  ***Important Special Sequences***

  - `\w` matches word characters
  - `\W` matches non-word characters
  - `\s` matches whitespace; equivalent to `[ \t\n\r\f\v]`
  - `\S` matches non-whitespace
  - `\d` matches digits; equivalent to `[0-9]`
  - `\D` matches non-digits
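A few of these pieces combined (the patterns are hypothetical, just to show the notation together):

```python
import re

# ^ and $ anchor the pattern, \d is a digit, {4} repeats it exactly 4 times
assert re.match(r"^\d{4}$", "2024") is not None
assert re.match(r"^\d{4}$", "20x4") is None

# [] is a character set, + means one or more, () captures a group
m = re.search(r"([A-Z][a-z]+)", "say Hello")
print(m.group(1))  # Hello
```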
- **urllib/requests**

  ***Requests***

  *Introduction*

  The requests module allows you to send HTTP requests using Python.

  *Important Methods*

  1) **get**(url, params, args)

     Sends a GET request to the specified URL.

  2) **post**(url, data, json, args)

     Sends a POST request to the specified URL.

  3) **delete**(url, args)

     Sends a DELETE request to the specified URL.
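A minimal GET sketch, assuming network access (the URL is the same site the sample script uses; any reachable site works):

```python
import requests

# params are URL-encoded and appended to the query string
res = requests.get("https://www.lipsum.com/", params={"what": "paragraphs"})

print(res.status_code)              # 200 on success
print(res.headers["Content-Type"])  # e.g. text/html; charset=UTF-8
```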
  ***urllib***

  *Introduction*

  urllib is a Python 3 package that allows you to access and interact with websites via their URLs (Uniform Resource Locators). It contains several modules for working with URLs, including:

  - **urllib.request**: using urllib.request, with urlopen, lets you open the specified URL.
  - **urllib.error**: this module catches the exceptions raised by urllib.request.

- **Writing a script using the above packages and running it in Docker**

  *web_scraping_sample.py*

      import re

      import requests
      from bs4 import BeautifulSoup

      res = requests.get('https://www.lipsum.com/')
      # html5lib is a separate package: pip install html5lib
      soup = BeautifulSoup(res.content, 'html5lib')

      # The panel that holds the headings and their paragraphs
      data = soup.find('div', attrs={'id': 'Panes'})

      # First piece of text containing "lorem", if any
      print(data.find(string=re.compile('lorem')))

      qes_list = []
      ans_list = []
      for row in data.find_all('div'):
          qes_list.append(row.h2.text)
          ans_list.append('\n'.join(p.text for p in row.find_all('p')))

      for qes, ans in zip(qes_list, ans_list):
          print(qes)
          print(ans)
          print('-' * 98 + '\n')

  ***Creating a Dockerfile in the same directory***

      FROM python:3.10.2-alpine3.15
      # Install the packages the script needs
      RUN pip install requests beautifulsoup4 html5lib
      # Create the project directory
      RUN mkdir -p /root/workspace/src
      COPY ./web_scraping_sample.py /root/workspace/src
      # Switch to the project directory
      WORKDIR /root/workspace/src
      # Run the script when the container starts (not at build time)
      CMD ["python", "web_scraping_sample.py"]
  - ***Build the Docker image***

        docker build -t simple_python .

  - ***Run the image as a Docker container***

        docker run -d --name container1 simple_python