
🕷️ Web Crawler

A lightweight, Python-based web crawler designed to recursively scan and analyze websites for data extraction, link discovery, and basic site structure mapping. Ideal for cybersecurity assessments, educational purposes, or as a foundational component for more advanced data mining tools.

📚 External and Third-Party Libraries

✅ Standard Library
  • urllib.parse – for parsing and joining URLs

📦 Third-Party Libraries (installed via pip)
  • requests – for sending HTTP requests
  • beautifulsoup4 – for parsing and navigating HTML

🔍 Features

  • Recursive URL crawling with depth control
  • Duplicate URL detection and prevention
  • Configurable user-agent header
  • Simple and readable console output
  • Lightweight, with only two third-party dependencies (requests and beautifulsoup4)
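
A minimal sketch of how these features might fit together (illustrative only, not the project's actual code – function and variable names such as crawl and max_depth are assumptions):

# Hypothetical sketch – not the actual crawler.py
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

HEADERS = {"User-Agent": "Crawler/1.0"}   # configurable user-agent
visited = set()                           # duplicate URL detection

def crawl(url, depth=0, max_depth=2):
    if depth > max_depth or url in visited:   # depth control + dedup
        return
    visited.add(url)
    print("  " * depth + url)                 # simple console output
    try:
        resp = requests.get(url, headers=HEADERS, timeout=10)
    except requests.RequestException as exc:
        print(f"[!] Failed to fetch {url}: {exc}")
        return
    soup = BeautifulSoup(resp.text, "html.parser")
    for a in soup.find_all("a", href=True):
        link = urljoin(url, a["href"])         # resolve relative links
        if urlparse(link).scheme in ("http", "https"):
            crawl(link, depth + 1, max_depth)

if __name__ == "__main__":
    crawl("https://example.com")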

📁 Project Structure

crawler/
│
├── crawler.py          # Main crawler script
├── requirements.txt    # List of dependencies (requests, beautifulsoup4)
└── README.md           # This file

⚙️ Installation
1. Clone the repository:

git clone https://github.com/DanielVihorev/Crawler.git
cd Crawler


2. (Optional) Create a virtual environment:

python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

3. Install the dependencies:

pip install -r requirements.txt
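
For reference, a requirements.txt covering the libraries listed above might contain (the version pins are illustrative assumptions):

requests>=2.31
beautifulsoup4>=4.12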



🚀 Usage

Run the crawler script from the command line:

python crawler.py

You may edit the script to modify the starting URL or crawling depth.

🛠️ Customization

To change the starting URL or behavior:
	•	Modify the start_url and max_depth variables in crawler.py
	•	Swap in a heavier framework such as Scrapy if you outgrow BeautifulSoup
	•	Integrate logging or export results to a file or database – see the sketch below
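
As an example of the last point, results could be exported once a crawl finishes. This is a hypothetical addition that assumes the discovered URLs are collected in a set named visited, as in the sketch under Features:

# Hypothetical export helper – assumes a `visited` set of crawled URLs
import json

def export_results(visited, path="results.json"):
    # Sort for stable, diff-friendly output
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(sorted(visited), fh, indent=2)

export_results({"https://example.com", "https://example.com/about"})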

🧠 Use Cases
	•	Cybersecurity Reconnaissance – Discover site structure during initial scans
	•	SEO Audits – Map out links and page connections
	•	Learning Tool – Great starting point to understand how crawlers work

⚖️ License

This project is licensed under the MIT License – see the LICENSE file for details.

👨‍💻 Authors

Daniel Vihorev
Cybersecurity enthusiast & Python developer
GitHub Profile

Ilay Zendani
Cybersecurity enthusiast & Python developer
GitHub Profile

⸻

🧠 “Built from curiosity, crafted for exploration.”


© All rights reserved to Daniel Vihorev and Ilay Zendani (Wild Life Cyber Security)
