A specialized web crawler for collecting structured football data from sportnet.sme.sk futbalnet pages. Built with Crawl4AI, it automatically extracts the relevant content while excluding non-essential sections such as "Správy z Futbalnetu" (Futbalnet news) and "Inzercia" (advertising). The extracted content is converted to Markdown and scored for quality so it can be used as AI training data.
- Asynchronous web crawling
- Automatic HTML to Markdown conversion
- Content quality scoring
- Intelligent content filtering
- Targeted extraction of football data
- Exclusion of "Správy z Futbalnetu" and "Inzercia" sections
- Focus on program and match information
- JSON output format
- Rate limiting support
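Under the hood these features build on Crawl4AI's asynchronous API. As a quick orientation, a minimal crawl of a single futbalnet page looks roughly like the sketch below (class and attribute names follow recent Crawl4AI releases and may differ slightly from this project's `crawler.py`):

```python
import asyncio

from crawl4ai import AsyncWebCrawler


async def main() -> None:
    async with AsyncWebCrawler() as crawler:
        # Fetch one futbalnet page and let Crawl4AI convert it to Markdown.
        result = await crawler.arun(
            url="https://sportnet.sme.sk/futbalnet/k/fk-fc-raznany/tim/dospeli-m-a/program/"
        )
        # Depending on the Crawl4AI version, `result.markdown` is either a plain
        # string or an object exposing `raw_markdown` / `fit_markdown`.
        markdown = getattr(result.markdown, "raw_markdown", result.markdown)
        print(markdown)


asyncio.run(main())
```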
- Create and activate a virtual environment:

  ```bash
  # Windows
  python -m venv venv
  venv\Scripts\activate

  # Linux/Mac
  python3 -m venv venv
  source venv/bin/activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Install Playwright browsers:

  ```bash
  playwright install
  ```

- Update the URLs in `crawler.py` with sportnet.sme.sk futbalnet URLs:

  ```python
  urls = [
      "https://sportnet.sme.sk/futbalnet/k/fk-fc-raznany/tim/dospeli-m-a/program/",
      # Add more futbalnet URLs as needed
  ]
  ```

- Run the crawler:

  ```bash
  python crawler.py
  ```

The crawler will:
- Extract content from the specified futbalnet pages
- Remove "Správy z Futbalnetu" and "Inzercia" sections
- Process the entire page content
- Save the results in the `ai_training_data` directory in JSON format (see the sketch below)
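How the results are laid out inside `ai_training_data` is determined by `crawler.py`; the snippet below is only an illustrative sketch of writing one crawled page in the JSON structure documented further down (the file naming scheme and the origin of `quality_score` are assumptions):

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def save_result(url: str, raw_md: str, filtered_md: str, quality_score: float) -> Path:
    """Write a single crawled page into ai_training_data/ as JSON (illustrative sketch)."""
    out_dir = Path("ai_training_data")
    out_dir.mkdir(exist_ok=True)

    record = {
        "url": url,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "content": {
            "raw_markdown": raw_md,
            "filtered_markdown": filtered_md,
        },
        "metadata": {
            "length": len(filtered_md),
            "quality_score": quality_score,
        },
    }

    # Hypothetical naming scheme: one timestamped file per crawled page.
    out_path = out_dir / f"crawl_{datetime.now(timezone.utc):%Y%m%dT%H%M%S%f}.json"
    out_path.write_text(json.dumps(record, ensure_ascii=False, indent=2), encoding="utf-8")
    return out_path
```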
You can configure the crawler by modifying these parameters in `crawler.py`:

- `threshold`: content quality threshold (0.0 to 1.0)
- `min_word_threshold`: minimum word count for content blocks (default: 50)
- `headless`: browser visibility (True/False)
- `delay`: delay between requests, in seconds
- `excluded_tags`: HTML tags to exclude (e.g., 'form', 'header', 'footer', 'nav')
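As a rough guide, these parameters map onto Crawl4AI's configuration objects approximately as shown below; exact class and argument names depend on the installed Crawl4AI version, and how `delay` is applied between requests is an assumption rather than a description of the project's actual code:

```python
import asyncio

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

# Browser visibility (headless) is a browser-level setting.
browser_config = BrowserConfig(headless=True)

# Quality filtering: low-scoring or too-short blocks are pruned from the Markdown.
content_filter = PruningContentFilter(
    threshold=0.5,           # content quality threshold (0.0 to 1.0)
    min_word_threshold=50,   # minimum word count for content blocks
)

run_config = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator(content_filter=content_filter),
    excluded_tags=["form", "header", "footer", "nav"],  # HTML tags to exclude
)


async def crawl(urls: list[str], delay: float = 2.0) -> None:
    async with AsyncWebCrawler(config=browser_config) as crawler:
        for url in urls:
            await crawler.arun(url=url, config=run_config)
            await asyncio.sleep(delay)  # simple rate limiting between requests
```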
The crawler uses a custom content filter that:
- Identifies and removes the "Správy z Futbalnetu" section
- Processes the entire page content
- Applies the standard PruningContentFilter to improve content quality
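The actual filter implementation lives in `crawler.py`; the sketch below only illustrates one way such a filter could be structured, assuming the excluded sections can be located via their headings and that BeautifulSoup is available (it is typically installed alongside Crawl4AI):

```python
from bs4 import BeautifulSoup
from crawl4ai.content_filter_strategy import PruningContentFilter


class FutbalnetContentFilter(PruningContentFilter):
    """Hypothetical filter: drop unwanted futbalnet sections, then prune as usual."""

    EXCLUDED_HEADINGS = ("Správy z Futbalnetu", "Inzercia")

    def filter_content(self, html, *args, **kwargs):
        soup = BeautifulSoup(html, "html.parser")

        # Remove every section whose heading matches one of the excluded titles.
        for heading in soup.find_all(["h2", "h3", "h4"]):
            if getattr(heading, "decomposed", False):
                continue  # already removed as part of a previously dropped section
            if heading.get_text(strip=True) in self.EXCLUDED_HEADINGS:
                container = heading.find_parent("section") or heading.parent
                container.decompose()

        # Hand the remaining HTML to the standard PruningContentFilter.
        return super().filter_content(str(soup), *args, **kwargs)
```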
The crawler saves data in JSON format with the following structure:
```json
{
  "url": "crawled_url",
  "timestamp": "ISO-8601 timestamp",
  "content": {
    "raw_markdown": "original markdown",
    "filtered_markdown": "filtered content"
  },
  "metadata": {
    "length": "content length",
    "quality_score": "calculated quality score"
  }
}
```

This project includes a GitHub Actions workflow that automatically runs the crawler on a weekly schedule and sends the data to an n8n webhook.
- In your GitHub repository, go to Settings > Secrets and Variables > Actions
- Add a new repository secret named `N8N_WEBHOOK_URL` with your n8n webhook URL
- The workflow will run automatically every Sunday at 23:59 UTC
The GitHub Actions workflow:
- Runs on a weekly schedule (Sunday at 23:59 UTC)
- Installs all required dependencies
- Runs the crawler script
- Sends the collected data to your n8n webhook
- Uploads the crawled data as a GitHub Actions artifact
Note: The workflow uses current major versions of the official GitHub Actions (`checkout@v4`, `setup-python@v5`, `upload-artifact@v4`).
You can also manually trigger the workflow from the Actions tab in your GitHub repository.
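The webhook step itself is defined in the workflow file; as a rough Python sketch of the equivalent logic (using `requests`, which is not necessarily in this project's `requirements.txt`), the collected JSON files could be bundled and POSTed to the URL from the `N8N_WEBHOOK_URL` secret like this, matching the payload structure shown below:

```python
import json
import os
from pathlib import Path

import requests  # assumption: install separately if not already a dependency


def send_to_n8n(data_dir: str = "ai_training_data") -> None:
    """Bundle all crawled JSON files and POST them to the n8n webhook (illustrative sketch)."""
    files = sorted(Path(data_dir).glob("*.json"))
    items = [json.loads(f.read_text(encoding="utf-8")) for f in files]

    payload = {
        "data": items,
        "metadata": {"total_files": len(files), "total_items": len(items)},
    }

    response = requests.post(os.environ["N8N_WEBHOOK_URL"], json=payload, timeout=30)
    response.raise_for_status()


if __name__ == "__main__":
    send_to_n8n()
```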
The data sent to the n8n webhook has the following structure:
```json
{
  "data": [
    {
      "url": "crawled_url_1",
      "timestamp": "ISO-8601 timestamp",
      "content": {
        "raw_markdown": "original markdown",
        "filtered_markdown": "filtered content"
      },
      "metadata": {
        "length": "content length",
        "quality_score": "calculated quality score"
      }
    }
    // Additional crawled pages...
  ],
  "metadata": {
    "total_files": 2,
    "total_items": 2
  }
}
```

License: MIT