My Site URL Finder

A simple Python-based web crawler that extracts and filters URLs from a given website while avoiding unwanted paths and file types. The crawler follows links recursively within the same domain and provides a clean list of URLs found across the website.

Features

  • Crawl a website recursively to extract all unique links.
  • Exclude specific paths (e.g., shopping cart, login) and file extensions (e.g., PDFs, images, scripts).
  • Clean URLs by removing query parameters.
  • Filter out external links and only crawl within the specified domain.
  • Randomized delay between requests to avoid overwhelming the server (see the sketch below).
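
For instance, the randomized delay could be implemented with Python's standard library. The 1-3 second bounds below are an illustrative assumption, not necessarily the values crawler.py uses:

import random
import time

# Pause for a random interval between requests so the target server
# is not flooded; the 1-3 second range is illustrative.
time.sleep(random.uniform(1, 3))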

Requirements

To run this script, you need Python 3 and the following libraries:

  • requests
  • beautifulsoup4

You can install these dependencies using pip:

pip install requests beautifulsoup4

How to Use

Clone the repository:

git clone https://github.com/BaseMax/my-site-url-finders.git
cd my-site-url-finders

Set the start_url variable in crawler.py to the website you want to crawl.

Run the script:

python crawler.py

The script will begin crawling from start_url and print each link it finds to the console.

Configuration

start_url

The starting URL for the crawl. Change this to the website URL you want to crawl.

Example:

start_url = "https://example.com/"

exclude_paths

A list of URL path segments the crawler should ignore. Any link whose path contains one of these segments is skipped.

Example:

exclude_paths = [
    "shop/",
    "cart/",
    "checkout/",
    "my-account/",
    "wp-admin/",
    "wp-content/",
    "wp-includes/",
]
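
A minimal sketch of how such a filter might be applied; the helper name is hypothetical and the script's actual logic may differ:

def is_excluded_path(url, exclude_paths):
    # Hypothetical helper: skip any URL whose path contains
    # one of the excluded segments.
    return any(segment in url for segment in exclude_paths)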

exclude_extensions

A list of file extensions the crawler should ignore. Any link ending in one of these file types is skipped.

Example:

exclude_extensions = [
    ".pdf", ".jpg", ".jpeg", ".png", ".gif", ".bmp", ".svg", ".css", ".js", ".zip", ".tar", ".mp3", ".mp4", ".rar"
]
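
Extension filtering can be sketched the same way; again, the helper name is hypothetical:

def has_excluded_extension(url, exclude_extensions):
    # Hypothetical helper: skip URLs that end in an excluded file type.
    return any(url.lower().endswith(ext) for ext in exclude_extensions)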

clean_url

This function removes query parameters from URLs to return the clean base URL.

Example:

cleaned_url = clean_url("https://example.com/page?query=1&param=2")
print(cleaned_url)
# Output: https://example.com/page
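
The implementation in crawler.py may differ, but stripping query parameters is typically done with urllib.parse from the standard library:

from urllib.parse import urlsplit, urlunsplit

def clean_url(url):
    # Drop the query string (and fragment), keeping scheme, host, and path.
    scheme, netloc, path, _query, _fragment = urlsplit(url)
    return urlunsplit((scheme, netloc, path, "", ""))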

get_links

This function fetches all the links from a given webpage and returns them as a list.

Example:

links = get_links("https://example.com/")
print(links)
# List of all the links on the page
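
A minimal sketch of what get_links might look like, assuming requests and BeautifulSoup; the actual implementation in crawler.py may differ:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def get_links(url):
    # Fetch the page and resolve every <a href> to an absolute URL.
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")
    return [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]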

License

This project is licensed under the MIT License - see the LICENSE file for details.

Author

© 2025 Max Base. All rights reserved.
