This repo demonstrates basic web scraping with Python. The application uses the BeautifulSoup library to retrieve specific data from a website and output the top posts.
You'll need Python 3.7 installed on your system. Download the installer appropriate for your OS from https://www.python.org/downloads/release/python-370/.
You'll install the packages needed for this project with pip, the package installer that lets you install, upgrade, or uninstall PyPI packages. Pip is already included with Python 3.7.
requests
The Requests library is an HTTP library designed to be simple for humans to use. We need it in this project to get the content at 'url' by making an HTTP GET request with the requests.get() function.
To install this module, type on your terminal:
pip install requests
For this package (and for the following packages),
pip3 install requests
might also work.
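A minimal sketch of how requests.get() is used; the URL and the printed fields here are illustrative assumptions, not taken from the project code:

```python
import requests

# Fetch a page with an HTTP GET request (example URL assumed).
url = "https://news.ycombinator.com/news"
response = requests.get(url)
print(response.status_code)  # 200 on success
print(response.text[:200])   # first characters of the returned HTML
```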
contextlib
Contextlib provides utilities for common tasks involving the with statement, which "sets things up" and "tears things down" automatically when needed. The contextlib.closing() function is used here to ensure that any network resources are freed when they go out of scope in the with block.
Contextlib is part of the Python standard library, so no installation is needed.
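A minimal sketch of contextlib.closing() wrapped around a request; the URL and the stream=True argument are assumptions for illustration:

```python
from contextlib import closing
import requests

# closing() calls resp.close() automatically when the with block exits,
# even if an exception is raised while the response is being read.
with closing(requests.get("https://news.ycombinator.com/news", stream=True)) as resp:
    html = resp.content
```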
bs4
The Beautiful Soup library is a toolkit for dissecting a document and extracting what you need. We need the bs4.BeautifulSoup class to parse the HTML.
Type on your terminal
pip install bs4
to install it.
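A minimal sketch of parsing HTML with BeautifulSoup; the HTML snippet is a made-up example, not the real Hacker News markup:

```python
from bs4 import BeautifulSoup

# Parse a small HTML snippet and pull out a link's text and href.
html = '<a href="https://example.com">Example post</a>'
soup = BeautifulSoup(html, "html.parser")
link = soup.find("a")
print(link.text)         # Example post
print(link.get("href"))  # https://example.com
```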
re
We need the re regular-expressions module in this project to build a "compiled" regular expression object with the re.compile() function, which find_all() consumes.
re is part of the Python standard library, so no installation is needed.
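A minimal sketch of passing a compiled pattern to find_all(); the HTML snippet and the class names are assumptions for illustration:

```python
import re
from bs4 import BeautifulSoup

# Compile a pattern once and pass it to find_all(), which accepts a
# compiled regular expression as an attribute filter.
html = '<td class="title">First</td><td class="subtext">Second</td>'
soup = BeautifulSoup(html, "html.parser")
pattern = re.compile("title|subtext")
cells = soup.find_all("td", class_=pattern)
print([cell.text for cell in cells])  # ['First', 'Second']
```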
json
We need the json library here for storing the data in JSON format, with the json.dumps() function.
json is part of the Python standard library, so no installation is needed.
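A minimal sketch of serializing a post with json.dumps(); the sample dictionary is made up for illustration:

```python
import json

# Serialize a scraped post to a JSON-formatted string.
post = {"title": "Example post", "points": 100, "rank": 1}
print(json.dumps(post, indent=4))
```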
collections
This module implements specialized container datatypes that provide alternatives to Python's general-purpose built-in containers. We need the collections.OrderedDict class to initialize an ordered dictionary.
collections is part of the Python standard library, so no installation is needed.
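A minimal sketch of OrderedDict; the keys and values are assumptions for illustration:

```python
from collections import OrderedDict

# An OrderedDict remembers insertion order, so the output fields stay
# in the order they were added (title, author, rank, ...).
post = OrderedDict()
post["title"] = "Example post"
post["author"] = "someone"
post["rank"] = 1
print(list(post.keys()))  # ['title', 'author', 'rank']
```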
The main file of this project is hacker_news_scraping.py. Type "py hacker_news_scraping.py" in a terminal to run it, and "py test.py" to run the tests.
Type in your terminal:
python hacker_news_scraping.py
You will be prompted with "Enter the number of URLs to be fetched". Type an integer.
The expected output is:
Enter the number of URLs to be fetched: 2
[
{
"title": "New alternatives to HSL and HSV that better match color perception",
"author": "bjornornorn",
"uri": "https://bottosson.github.io/posts/colorpicker/",
"points": 150,
"comments": 20,
"rank": 1
},
{
"title": "2MW Electric Aircraft Engine",
"author": "nixass",
"uri": "https://www.weflywright.com/technology#motors",
"points": 42,
"comments": 32,
"rank": 2
}
]