Protext Scraper is a high-performance, concurrent web scraper designed to extract press releases and PR content from Protext.cz. It utilizes Tor circuit rotation and randomized user agents to bypass rate limiting and ensure reliable data collection.
This project was implemented as part of a semester essay for the course 4IT550 - Competitive Intelligence. It demonstrates engineering capabilities in handling protected web sources, concurrent network programming, and robust error handling, serving as a proof of concept for reliable technical data-extraction infrastructure.
The application systematically iterates through article IDs or processes RSS feeds to locate valid content. For each identified article, it downloads the HTML, detects the encoding (handling legacy Windows-1250 often used in Czech web archives), sanitizes the content by stripping non-text elements, and extracts metadata such as title, date, and keywords. The final output is aggregated into a machine-readable JSON format.
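A minimal sketch of this parsing step, assuming `chardet` and `BeautifulSoup4` from the tech stack listed below; the function name and the returned fields are illustrative, not the project's actual API:

```python
import chardet
from bs4 import BeautifulSoup

def parse_article(raw_bytes: bytes) -> dict:
    """Illustrative parsing step: detect the encoding, strip non-text
    elements, and pull out basic metadata."""
    # Detect the character set (legacy Czech pages are often Windows-1250).
    detected = chardet.detect(raw_bytes)
    html = raw_bytes.decode(detected.get("encoding") or "windows-1250", errors="replace")

    soup = BeautifulSoup(html, "html.parser")
    # Remove non-text elements before extracting the article body.
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()

    title_tag = soup.find("h1") or soup.find("title")
    keywords_tag = soup.find("meta", attrs={"name": "keywords"})
    return {
        "title": title_tag.get_text(strip=True) if title_tag else None,
        "keywords": keywords_tag.get("content") if keywords_tag else None,
        "text": soup.get_text(" ", strip=True),
    }
```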
The system follows a concurrent worker pool pattern:
- Input Generation: An ID range is generated based on RSS feed analysis or user input.
- Worker Pool: A `ThreadPoolExecutor` spawns worker threads to process IDs in parallel.
- Network Layer: Each request is routed through a local Tor SOCKS5 proxy.
- Resilience Layer: The system monitors response codes; 429 (Too Many Requests) or 403 (Forbidden) trigger a Tor circuit renewal (New Identity) and exponential backoff.
- Storage: Validated data is written to a JSON file using a thread-safe locking mechanism.
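The sketch below shows how these layers could fit together: a `ThreadPoolExecutor` worker pool, requests routed through the Tor SOCKS5 proxy, retry with exponential backoff on 403/429, and a lock-guarded write. The URL pattern, file name, and helper names are assumptions for illustration, and `renew_tor_identity()` is only stubbed here (a control-port version is sketched under the design decisions below).

```python
import json
import random
import threading
import time
from concurrent.futures import ThreadPoolExecutor

import requests

# All traffic is routed through the local Tor SOCKS5 proxy (assumed on port 9050).
TOR_PROXIES = {"http": "socks5h://127.0.0.1:9050", "https": "socks5h://127.0.0.1:9050"}
# A small pool of user agents rotated per request (illustrative values).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

write_lock = threading.Lock()   # guards the shared output file
OUTPUT_PATH = "articles.json"   # hypothetical output location


def renew_tor_identity() -> None:
    """Stub; a control-port implementation is sketched under the design decisions."""


def fetch_article(article_id: int, max_retries: int = 4) -> None:
    # Hypothetical URL pattern for direct article access via ID.
    url = f"https://www.protext.cz/zprava.php?id={article_id}"
    for attempt in range(max_retries):
        try:
            resp = requests.get(
                url,
                proxies=TOR_PROXIES,
                headers={"User-Agent": random.choice(USER_AGENTS)},
                timeout=30,
            )
        except requests.RequestException:
            time.sleep(2 ** attempt)
            continue
        if resp.status_code in (403, 429):
            # Blocked: request a fresh Tor circuit and back off exponentially.
            renew_tor_identity()
            time.sleep(2 ** attempt)
            continue
        if resp.ok:
            record = {"id": article_id, "html_length": len(resp.text)}
            with write_lock:  # thread-safe append to the output file
                with open(OUTPUT_PATH, "a", encoding="utf-8") as fh:
                    fh.write(json.dumps(record, ensure_ascii=False) + "\n")
        return


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=8) as pool:
        pool.map(fetch_article, range(199900, 200001))
```

This sketch appends one JSON object per line to keep the lock held briefly; the actual project aggregates results into a single JSON document.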
- Language: Python 3.13
- Networking: `requests`, `pysocks` (SOCKS proxy support)
- Parsing: `BeautifulSoup4` (HTML parsing)
- Encoding: `chardet` (character set detection)
- Concurrency: `concurrent.futures` (threading)
- Proxy: Tor service (external dependency)
- Primary: https://www.protext.cz/ (Direct article access via ID)
- Discovery: https://www.protext.cz/rss/cz.php (RSS feed for latest ID detection)
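A hedged sketch of how the discovery step might derive the latest article ID from the RSS feed; the `id=` query-parameter format in the item links is an assumption:

```python
import re
import requests

def latest_article_id(feed_url: str = "https://www.protext.cz/rss/cz.php") -> int:
    """Illustrative discovery step: fetch the RSS feed and take the highest
    numeric ID found in the item links."""
    feed = requests.get(feed_url, timeout=30).text
    ids = [int(m) for m in re.findall(r"id=(\d+)", feed)]
    return max(ids) if ids else 0
```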
- ID-based Iteration: The target site exposes sequential integer IDs for articles. Iterating through these IDs directly proved more reliable and exhaustive than traversing pagination, which can be inconsistent or limited in depth.
- Tor & User-Agents: Standard IP rotation was insufficient due to aggressive blocking. Integrating the Tor control port allows the application to programmatically request a new exit node ("New Identity") immediately upon detection of a block, rather than waiting for timeouts.
- JSON Storage: JSON was selected for the output format to prioritize portability and ease of inspection over the complexity of setting up a relational database for this specific demonstration scope.
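As an illustration of the "New Identity" mechanism, the sketch below sends the standard `SIGNAL NEWNYM` command over the Tor control port using only the standard library; the control-port password and the post-renewal delay are assumptions that depend on the local `torrc`:

```python
import socket
import time

def renew_tor_identity(password: str = "", control_port: int = 9051) -> None:
    """Ask the local Tor control port for a new circuit (SIGNAL NEWNYM)."""
    with socket.create_connection(("127.0.0.1", control_port), timeout=10) as s:
        s.sendall(f'AUTHENTICATE "{password}"\r\n'.encode())
        if not s.recv(1024).startswith(b"250"):
            raise RuntimeError("Tor control-port authentication failed")
        s.sendall(b"SIGNAL NEWNYM\r\n")
        if not s.recv(1024).startswith(b"250"):
            raise RuntimeError("Tor refused the NEWNYM signal")
    # Tor rate-limits NEWNYM; give the new circuit a moment to establish.
    time.sleep(5)
```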
- Tor Dependency: The application requires a running Tor service on `localhost:9050` and its control port on `9051`. It cannot function without this external service.
- Stateless Execution: The scraper does not maintain a persistent state file of scraped IDs between runs. Restarting a job requires manually specifying the range or relying on the "check for duplicates" logic, which incurs network overhead.
- Vertical Scaling: As a single-node application, scraping speed is tied to the local machine's resource limits and the latency of the Tor network.
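Because of the hard dependency on a local Tor service, a pre-flight connectivity check such as the sketch below can fail fast before any workers are spawned; it relies on the public check.torproject.org endpoint, which is not part of this project:

```python
import requests

def tor_is_reachable() -> bool:
    """Quick pre-flight check: can we reach the web through the SOCKS proxy
    and are we actually exiting through Tor?"""
    proxies = {"http": "socks5h://127.0.0.1:9050", "https": "socks5h://127.0.0.1:9050"}
    try:
        resp = requests.get("https://check.torproject.org/api/ip",
                            proxies=proxies, timeout=15)
        return resp.ok and resp.json().get("IsTor", False)
    except requests.RequestException:
        return False
```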
- Install Tor:
  - macOS: `brew install tor && brew services start tor`
  - Linux: `sudo apt install tor && sudo systemctl start tor`
- Install Dependencies: `pip install -r requirements.txt`
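Based on the tech stack listed above, `requirements.txt` would roughly contain the following packages (exact pinned versions are not specified here):

```text
requests
pysocks
beautifulsoup4
chardet
```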
- Run Application: `python3 main.py`
Upon launching, the interactive menu provides several modes. To scrape the latest 100 articles:
```text
🥷 TOR SCRAPING MODE:
1. TEST - range 199900-200000 (quick test)
...
Enter choice: 1
```
- Database Backend: Migrate from JSON to SQLite or PostgreSQL to support resumable scrapes and complex querying.
- Dockerization: Containerize the application and the Tor service into a `docker-compose` setup for one-command deployment.
- Distributed Workers: Decouple the scraping logic from the scheduler to allow multiple worker nodes to scrape distinct ranges concurrently.
Jan Alexandr Kopřiva jan.alexandr.kopriva@gmail.com
MIT