An intelligent web crawling and content extraction tool that transforms web content into structured JSON data suitable for fine-tuning Large Language Models (LLMs). This project combines web scraping capabilities with Google's Gemini AI to automatically generate question-answer pairs from crawled content.
- Intelligent Web Crawling: Automatically discovers and crawls web pages from a starting URL
- Content Extraction: Uses advanced content extraction techniques to get clean, readable text (a minimal sketch follows this list)
- AI-Powered Q&A Generation: Leverages Google Gemini AI to create contextual question-answer pairs
- Modern GUI: User-friendly tkinter-based graphical interface
- CLI Support: Command-line interface for automation and scripting
- Configurable Settings: Customizable crawling parameters, delays, and AI model selection
- JSON Output: Generates structured JSON data perfect for LLM fine-tuning
- Robots.txt Compliance: Respects website crawling policies
- Real-time Progress Tracking: Monitor crawling progress with detailed logs
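The crawl-and-extract step can be pictured with a short sketch. This is only an illustration using `requests` and `BeautifulSoup`, which may or may not match the project's actual dependencies or logic: fetch a page, strip non-content tags, and collect outgoing links for the crawl queue.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def extract_page(url):
    """Return the readable text of a page plus the links it exposes."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()  # drop boilerplate elements before extracting text
    text = " ".join(soup.get_text(separator=" ").split())
    links = [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]
    return text, links
```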
- Python 3.7 or higher
- Google Gemini API key
- Internet connection for web crawling and AI processing
- Clone the repository:

  ```bash
  git clone https://github.com/razee4315/AI-agent.git
  cd AI-agent
  ```
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Set up your Gemini API key:
  - Get your API key from Google AI Studio
  - You can either:
    - Set it as an environment variable: `export GENAI_API_KEY="your_api_key_here"`
    - Enter it directly in the GUI
    - Pass it as a command-line argument
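If you rely on the environment variable, a quick sanity check (illustrative only, not part of the project's code) confirms the key is visible to Python before you start a crawl:

```python
import os

api_key = os.environ.get("GENAI_API_KEY")
if not api_key:
    raise SystemExit("GENAI_API_KEY is not set; export it or pass --api_key.")
print("Gemini API key found.")
```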
Run the modern graphical interface:
```bash
python gui_agent.py
```
GUI Features:
- Step 1: Enter and verify your Gemini API key
- Step 2: Configure crawling settings (start URL, number of pages)
- Advanced Settings: Customize AI model, request delays, and character limits
- Real-time Progress: Monitor crawling and processing in the log area
- Save Results: Export generated data to JSON files
For automation and scripting:
```bash
python web_to_json_agent.py "https://example.com" -n 5 -o output.json --api_key "your_api_key"
```
CLI Parameters:
- `start_url`: The starting URL to crawl (required)
- `-n, --num_pages`: Maximum number of pages to crawl (default: 10)
- `-o, --output_file`: Output JSON file path (default: `fine_tuning_data.json`)
- `--api_key`: Gemini API key (optional if set as an environment variable)
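As a rough sketch of how these options map onto Python's `argparse` (the actual script may define them differently):

```python
import argparse

parser = argparse.ArgumentParser(description="Crawl a site and emit Q&A JSON for fine-tuning.")
parser.add_argument("start_url", help="The starting URL to crawl")
parser.add_argument("-n", "--num_pages", type=int, default=10,
                    help="Maximum number of pages to crawl")
parser.add_argument("-o", "--output_file", default="fine_tuning_data.json",
                    help="Output JSON file path")
parser.add_argument("--api_key", default=None,
                    help="Gemini API key (falls back to the GENAI_API_KEY environment variable)")
args = parser.parse_args()
```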
The tool generates JSON data in the following format:
```json
[
  {
    "context": "Extracted content from the webpage...",
    "question": "What is the main topic discussed in this content?",
    "answer": "The main topic is about..."
  },
  {
    "context": "Another piece of extracted content...",
    "question": "How does this relate to the previous topic?",
    "answer": "This relates to the previous topic by..."
  }
]
```
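Because the output is plain JSON, it is easy to reshape for a specific fine-tuning pipeline. A minimal example, assuming the default output file name and one common prompt/completion layout:

```python
import json

with open("fine_tuning_data.json", encoding="utf-8") as f:
    records = json.load(f)

# Reshape each record into a prompt/completion pair.
examples = [
    {
        "prompt": f"Context: {r['context']}\n\nQuestion: {r['question']}",
        "completion": r["answer"],
    }
    for r in records
]
print(f"Prepared {len(examples)} training examples")
```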
- Gemini Model: Choose from available models (gemini-1.5-flash-latest, gemini-1.5-pro-latest, etc.)
- Request Delay: Configure delay between requests (0.5-60 seconds)
- Max Characters: Set character limit for AI processing (1000-100000)
- Number of Pages: Control crawling depth (1-1000 pages)
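In code, the request delay and character limit amount to throttling page fetches and truncating text before it is sent to the model. The function and constants below are an illustrative sketch, not the project's actual implementation:

```python
import time
import requests

REQUEST_DELAY = 1.5   # seconds between requests (configurable from 0.5 to 60)
MAX_CHARS = 20000     # character cap before text is sent to the AI (1000-100000)

def throttled_fetch(url):
    """Fetch a page, wait out the configured delay, and truncate the text."""
    response = requests.get(url, timeout=30)
    time.sleep(REQUEST_DELAY)
    return response.text[:MAX_CHARS]
```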
The tool uses respectful crawling practices with appropriate user agents and follows robots.txt guidelines.
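Robots.txt checks can be done with the standard library's `urllib.robotparser`; a minimal sketch follows (the user agent string here is a placeholder, not the one this tool sends):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

allowed = rp.can_fetch("ExampleCrawler/1.0", "https://example.com/some/page")
print("Allowed to crawl" if allowed else "Blocked by robots.txt")
```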
This project utilizes Google's Gemini AI for intelligent content processing:
- Context Understanding: AI analyzes extracted content to understand main themes
- Question Generation: Creates relevant, contextual questions about the content
- Answer Synthesis: Provides concise, accurate answers based on the source material
- JSON Formatting: Ensures output is properly structured for LLM training
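A stripped-down version of this flow with the `google-generativeai` library might look like the following; the prompt wording is illustrative, not the exact prompt the project uses:

```python
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GENAI_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash-latest")

page_text = "Extracted content from a crawled page..."
prompt = (
    "From the text below, write one relevant question and a concise answer. "
    "Return JSON with the keys 'question' and 'answer'.\n\n" + page_text
)

response = model.generate_content(prompt)
print(response.text)  # model output, expected to contain the JSON pair
```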
- API Key Errors: Ensure your Gemini API key is valid and has appropriate permissions
- Network Issues: Check internet connection and firewall settings
- Content Extraction Failures: Some websites may block crawling or have complex structures
- Memory Issues: For large crawls, consider processing in smaller batches
- Check the progress log for detailed error messages
- Verify the starting URL is accessible
- Ensure sufficient disk space for output files
- Use the request delay setting to stay within rate limits
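For rate limits specifically, a simple backoff loop around page fetches is one way to stay under the limit. This is just an illustration, not how the tool itself recovers from errors:

```python
import time
import requests

def fetch_with_backoff(url, retries=3):
    """Retry transient failures, pausing longer after each attempt."""
    for attempt in range(retries):
        try:
            response = requests.get(url, timeout=30)
            if response.status_code == 429:        # rate limited
                time.sleep(5 * (attempt + 1))
                continue
            response.raise_for_status()
            return response.text
        except requests.RequestException:
            time.sleep(5 * (attempt + 1))
    return None
```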
- LLM Fine-tuning: Generate training data for language models
- Content Analysis: Extract and analyze website content at scale
- Knowledge Base Creation: Build Q&A datasets from web sources
- Research Automation: Automate content gathering for research projects
- Educational Resources: Create study materials from educational websites
Developer: Saqlain Abbas
- GitHub: @razee4315
- Email: saqlainrazee@gmail.com
Developer: AleenaTahir1
- GitHub: @AleenaTahir1
- Email: aleenatahirf23@nutech.edu.pk
This project is licensed under the MIT License - see the LICENSE file for details.
- Google for providing the Gemini AI API
- The open-source community for the excellent libraries used in this project
Happy Crawling! 🕷️🤖