This project implements a multi-stage automated pipeline designed to generate high-quality Question & Answer (Q&A) datasets for a given technical topic. It leverages Large Language Models (LLMs) for various tasks including query generation, content filtering, relevance scoring, content extraction, and Q&A pair creation. The pipeline is configurable, supporting different LLM providers (Groq, Mistral, Ollama), search engines, and processing parameters.
The core objective of this pipeline is to automate the laborious process of finding relevant information on a specific topic, extracting the core explanatory content, and then transforming that content into a structured Q&A dataset suitable for training smaller LLMs or for knowledge base construction.
- Multi-LLM Provider Support: Easily switch between Groq, Mistral, and local Ollama instances.
- Configurable LLM Models: Specify different LLM models for various tasks (query generation, scoring, extraction, etc.).
- Multi-Search Engine Support: Utilizes Google or DuckDuckGo for initial information retrieval.
- Advanced Web Scraping: Employs `crawl4ai`, `Trafilatura`, and `BeautifulSoup` for robust content extraction from web pages and PDFs.
- Multi-Stage Filtering:
  - LLM-based filtering of initial search results.
  - Relevance scoring of downloaded documents using LLM and sentence splitting.
  - Detailed focus scoring of relevant documents using LLM and semantic splitting.
- Semantic Chunking: Uses `llama-index` (SentenceSplitter and SemanticSplitter) for intelligent document segmentation.
- LLM-Powered Content Extraction: Isolates core explanatory content from noisy text chunks.
- Automated Q&A Generation: Creates question-answer pairs from the extracted content using an LLM.
- Dataset Refinement: An optional LLM-powered step to review and improve the generated Q&A pairs.
- Flexible Configuration: Most parameters are configurable via a `.env` file and `config.py`.
- Comprehensive Logging: Detailed logging for monitoring and debugging.
- Structured Output: Organizes outputs from each stage into dedicated directories.
The pipeline is divided into several distinct stages, each performing a specific part of the data processing workflow:
- Purpose: To generate a diverse set of search queries relevant to the input `TOPIC_FOR_PROCESSING` and select the most promising ones for web searching.
- Process:
  - An LLM (configured by `PRIMARY_LLM_PROVIDER` and `LLM_MODELS["query_generation"]`) generates an initial list of search queries based on the topic. The prompt for this is `prompts.get_query_generation_prompt`.
  - Another LLM call (model: `LLM_MODELS["query_evaluation"]`) evaluates these generated queries and selects a diverse subset (`NUM_QUERIES_TO_SELECT_AFTER_GENERATION`). The prompt for this is `prompts.get_query_evaluation_prompt`.
- Input: `TOPIC_FOR_PROCESSING` (from `config.py`).
- Output: A list of selected query strings.
- Modules: `stages/query_generation.py`, `core/prompts.py`, `processing/llm_interactor.py`.
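The two LLM calls in this stage can be pictured with a minimal sketch. This is not the project's actual code: `ask_llm` is a hypothetical stand-in for `processing/llm_interactor.py`, and the inline prompt strings stand in for the `core/prompts.py` helpers.

```python
from typing import Callable, List

def select_search_queries(
    topic: str,
    ask_llm: Callable[[str, str], str],   # hypothetical: (model, prompt) -> response text
    generation_model: str,                # LLM_MODELS["query_generation"]
    evaluation_model: str,                # LLM_MODELS["query_evaluation"]
    num_to_select: int,                   # NUM_QUERIES_TO_SELECT_AFTER_GENERATION
) -> List[str]:
    # Step 1: generate a broad list of candidate search queries for the topic.
    raw = ask_llm(generation_model, f"Generate diverse search queries about: {topic}")
    candidates = [line.strip("- ").strip() for line in raw.splitlines() if line.strip()]

    # Step 2: a second LLM call evaluates the candidates and keeps a diverse subset.
    raw = ask_llm(
        evaluation_model,
        f"From these queries, pick the {num_to_select} most diverse and promising "
        f"for researching '{topic}', one per line:\n" + "\n".join(candidates),
    )
    return [line.strip() for line in raw.splitlines() if line.strip()][:num_to_select]
```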
- Purpose: To use the selected queries to find relevant web pages/documents, filter these results using an LLM, and download/scrape their content.
- Process:
  - For each selected query:
    - Perform a web search using the configured `SEARCH_ENGINE_TYPE` (Google or DuckDuckGo) to get `SEARCH_MAX_RESULTS_PER_QUERY` results.
    - An LLM (model: `LLM_MODELS["results_filtering"]`) filters these search results to identify the most relevant URLs. The prompt is `prompts.get_search_results_filtering_prompt`.
  - All unique, filtered URLs are collected.
  - The content from these URLs is downloaded (for PDFs) or scraped (for HTML pages, using the `DEFAULT_TEXT_EXTRACTION_STRATEGY`, like `trafilatura` or `beautifulsoup`).
- Input: List of selected queries from Stage 1.
- Output: Downloaded PDF files and scraped text files stored in the `pipeline_output/01_initial_downloads/` directory, plus a dictionary mapping processed URLs to their local file paths.
- Modules: `stages/search_and_download.py`, `processing/search_engine.py`, `processing/web_scraper.py`, `processing/llm_interactor.py`, `core/utils.py`.
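A simplified view of this loop, with hypothetical helpers (`search`, `ask_llm`, `fetch`) standing in for `processing/search_engine.py`, `processing/llm_interactor.py`, and `processing/web_scraper.py`:

```python
from pathlib import Path
from typing import Callable, Dict, List

def search_and_download(
    queries: List[str],
    search: Callable[[str, int], List[str]],   # hypothetical: (query, max_results) -> URLs
    ask_llm: Callable[[str, str], str],        # hypothetical: (model, prompt) -> response text
    fetch: Callable[[str, Path], Path],        # hypothetical: downloads/scrapes a URL to a local file
    filtering_model: str,                      # LLM_MODELS["results_filtering"]
    max_results: int,                          # SEARCH_MAX_RESULTS_PER_QUERY
    out_dir: Path,                             # .../01_initial_downloads/
) -> Dict[str, Path]:
    filtered_urls = set()
    for query in queries:
        urls = search(query, max_results)
        # Ask the LLM which results actually look relevant to the topic.
        reply = ask_llm(filtering_model, "Keep only the relevant URLs:\n" + "\n".join(urls))
        filtered_urls.update(line.strip() for line in reply.splitlines() if line.startswith("http"))

    url_to_path: Dict[str, Path] = {}
    for url in filtered_urls:
        try:
            url_to_path[url] = fetch(url, out_dir)  # PDF download or HTML scrape
        except Exception:
            continue  # skip URLs that fail to download or parse
    return url_to_path
```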
- Purpose: To perform an initial pass on all downloaded/scraped files, scoring them for overall relevance to the topic.
- Process:
  - For each file in `01_initial_downloads/`:
    - The content is read (PDFs are parsed using `unstructured`).
    - The text is split into chunks using `llama-index.SentenceSplitter` (configured by `STAGE3_CHUNK_SIZE`, `STAGE3_CHUNK_OVERLAP`).
    - An LLM (model: `LLM_MODELS["relevance_scoring"]`) assigns a relevance score (0-100) to each chunk based on `prompts.get_chunk_relevance_scoring_prompt`.
    - The average relevance score for the file is calculated.
  - Files with an average score above `STAGE3_RELEVANCE_THRESHOLD` are copied to `pipeline_output/02_relevant_files/`.
- Input: Files from `pipeline_output/01_initial_downloads/`.
- Output: A list of paths to relevant files copied to `pipeline_output/02_relevant_files/`.
- Modules: `stages/initial_filtering.py`, `processing/document_parser.py`, `processing/llm_interactor.py`.
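The chunk-score-average pattern of this stage looks roughly like the sketch below. The `SentenceSplitter` usage is the real `llama-index` API; the `ask_llm` helper, the inline prompt, and the default chunk sizes are illustrative assumptions.

```python
from typing import Callable
from llama_index.core.node_parser import SentenceSplitter

def average_relevance_score(
    text: str,
    topic: str,
    ask_llm: Callable[[str, str], str],  # hypothetical: (model, prompt) -> response text
    scoring_model: str,                  # LLM_MODELS["relevance_scoring"]
    chunk_size: int = 1024,              # STAGE3_CHUNK_SIZE (illustrative default)
    chunk_overlap: int = 100,            # STAGE3_CHUNK_OVERLAP (illustrative default)
) -> float:
    splitter = SentenceSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    scores = []
    for chunk in splitter.split_text(text):
        reply = ask_llm(scoring_model, f"Score 0-100 how relevant this text is to '{topic}':\n{chunk}")
        try:
            scores.append(float(reply.strip()))
        except ValueError:
            continue  # ignore chunks where the reply is not a bare number
    return sum(scores) / len(scores) if scores else 0.0

# A file is copied to 02_relevant_files/ only if this average exceeds STAGE3_RELEVANCE_THRESHOLD.
```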
- Purpose: To further refine the set of relevant files by assessing how focused their content is on explaining the topic, rather than just being generally related.
- Process:
  - For each file in `02_relevant_files/`:
    - The content is split into chunks using `llama-index.SemanticSplitterNodeParser`. This uses an embedding model (`EMBEDDING_MODEL_PROVIDER`, like `ollama` or `sentence_transformers`) to create more semantically coherent chunks.
    - An LLM (model: `LLM_MODELS["focus_scoring"]`) assigns a focus score (0-100) to each semantic chunk based on `prompts.get_chunk_focus_scoring_prompt`.
    - The average focus score for the file is calculated.
  - Files with an average score above `STAGE4_FOCUS_THRESHOLD` are copied to `pipeline_output/03_focused_files/`.
- Input: Files from `pipeline_output/02_relevant_files/`.
- Output: A list of paths to focused files copied to `pipeline_output/03_focused_files/`.
- Modules: `stages/detailed_focus_filtering.py`, `processing/document_parser.py`, `processing/llm_interactor.py`.
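Semantic chunking here relies on `llama-index`'s `SemanticSplitterNodeParser` driven by an embedding model. A minimal sketch using the Sentence Transformers backend (the parameter values only mirror the `SEMANTIC_SPLITTER_*` settings illustratively):

```python
from llama_index.core import Document
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

def semantic_chunks(text: str) -> list[str]:
    embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")
    parser = SemanticSplitterNodeParser.from_defaults(
        embed_model=embed_model,
        buffer_size=1,                       # SEMANTIC_SPLITTER_BUFFER_SIZE (illustrative)
        breakpoint_percentile_threshold=95,  # SEMANTIC_SPLITTER_BREAKPOINT_PERCENTILE (illustrative)
    )
    nodes = parser.get_nodes_from_documents([Document(text=text)])
    return [node.get_content() for node in nodes]
```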
- Purpose: To extract only the core, explanatory text related to the topic from the focused files, discarding irrelevant sections.
- Process:
  - For each file in `03_focused_files/`:
    - The content is split into semantic chunks (as in Stage 4).
    - An LLM (model: `LLM_MODELS["content_extraction"]`) processes each chunk, guided by `prompts.get_clean_content_extraction_prompt`, to extract only the relevant explanatory paragraphs.
  - All extracted text segments from all files are compiled into a single text file, with separators indicating the original file sources. Text is wrapped for readability (`STAGE5_TEXT_WRAP_WIDTH`).
- Input: Files from `pipeline_output/03_focused_files/`.
- Output: A single compiled text file (`compiled_dataset_for_qa.txt`) in `pipeline_output/04_extracted_content/`.
- Modules: `stages/content_extraction.py`, `processing/document_parser.py`, `processing/llm_interactor.py`.
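The compilation step at the end of this stage can be sketched with the standard library; the separator format and function names below are illustrative only, not the project's exact output.

```python
import textwrap
from pathlib import Path
from typing import Dict

def compile_extracted(extracted: Dict[str, str], out_path: Path, wrap_width: int = 100) -> None:
    """`extracted` maps a source file name to its LLM-extracted explanatory text;
    wrap_width plays the role of STAGE5_TEXT_WRAP_WIDTH."""
    sections = []
    for source, text in extracted.items():
        paragraphs = [p for p in text.split("\n") if p.strip()]
        wrapped = "\n".join(textwrap.fill(p, width=wrap_width) for p in paragraphs)
        sections.append(f"--- Source: {source} ---\n{wrapped}")
    out_path.write_text("\n\n".join(sections), encoding="utf-8")
```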
- Purpose: To generate question-answer pairs from the final compiled text.
- Process:
  - The compiled text file from Stage 5 is read.
  - The content is split into semantic chunks.
  - For each chunk:
    - An LLM (model: `LLM_MODELS["qa_generation"]`) generates a specified number of questions (`STAGE6_NUM_QUESTIONS_PER_CHUNK`) based only on the content of that chunk. The prompt is `prompts.get_questions_generate_prompt`.
    - The same LLM then answers each generated question based strictly on the same chunk's content. The prompt is `prompts.get_answered_questions_prompt`.
  - All generated Q&A pairs are collected and formatted into a dataset (e.g., the default JSON list of input/output pairs, or Gemma/Llama-specific formats based on `DEFAULT_DATASET_TEMPLATE`).
- Input: The compiled text file from `pipeline_output/04_extracted_content/`.
- Output: A Q&A dataset file (e.g., `qna_dataset_YYYYMMDD_HHMMSS.json`) in `pipeline_output/05_final_datasets/`.
- Modules: `stages/question_generation_and_answers.py`, `processing/document_parser.py`, `processing/llm_interactor.py`, `core/utils.py` (for templating).
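With the default template, the output is a JSON list of question/answer records. The sketch below shows the per-chunk loop and one plausible record shape; the field names and the `ask_llm` helper are assumptions, not the project's exact format.

```python
import json
from typing import Callable, List

def build_qa_dataset(
    chunks: List[str],
    ask_llm: Callable[[str, str], str],  # hypothetical: (model, prompt) -> response text
    qa_model: str,                       # LLM_MODELS["qa_generation"]
    questions_per_chunk: int,            # STAGE6_NUM_QUESTIONS_PER_CHUNK
    out_path: str,
) -> None:
    pairs = []
    for chunk in chunks:
        reply = ask_llm(qa_model, f"Write {questions_per_chunk} questions answerable only from:\n{chunk}")
        questions = [q.strip() for q in reply.splitlines() if q.strip()]
        for question in questions[:questions_per_chunk]:
            answer = ask_llm(qa_model, f"Answer strictly from this text:\n{chunk}\n\nQuestion: {question}")
            pairs.append({"input": question, "output": answer})  # assumed default record shape
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(pairs, f, ensure_ascii=False, indent=2)
```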
- Purpose: To use an LLM to review and potentially improve the generated Q&A dataset for accuracy, consistency, clarity, and completeness.
- Process:
  - The Q&A dataset file from Stage 6 is read.
  - The entire dataset (as a JSON string) is provided to an LLM (model: `LLM_MODELS["dataset_refinement"]`) along with `prompts.get_check_dataset_prompt`. This prompt instructs the LLM to:
    - Verify factual accuracy.
    - Resolve inconsistencies.
    - Attempt to answer questions previously marked as "information not available" using general knowledge of the topic (if appropriate).
    - Enhance clarity and completeness of answers.
    - Improve natural language flow.
  - The LLM returns a revised JSON dataset.
- Input: The Q&A dataset file from `pipeline_output/05_final_datasets/`.
- Output: A refined Q&A dataset file (e.g., `qna_dataset_refined_YYYYMMDD_HHMMSS.json`) in `pipeline_output/refined_dataset/`.
- Modules: `stages/dataset_refinement.py`, `processing/llm_interactor.py`.
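Conceptually, the refinement pass is a single round trip: serialize the dataset, ask the refinement model for a corrected version, and parse it back. A minimal sketch, assuming the whole dataset fits in the model's context window and using a hypothetical `ask_llm` helper:

```python
import json
from typing import Callable

def refine_dataset(in_path: str, out_path: str, ask_llm: Callable[[str, str], str], model: str) -> None:
    with open(in_path, "r", encoding="utf-8") as f:
        dataset = json.load(f)

    prompt = (
        "Review this Q&A dataset: verify facts, resolve inconsistencies, and improve "
        "clarity and flow. Return the corrected dataset as JSON only.\n"
        + json.dumps(dataset, ensure_ascii=False)
    )
    revised = ask_llm(model, prompt)  # model = LLM_MODELS["dataset_refinement"]

    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(json.loads(revised), f, ensure_ascii=False, indent=2)
```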
- Python 3.9+
- Access to LLM APIs:
  - Groq API Key (if using Groq)
  - Mistral API Key (if using Mistral)
  - A running Ollama instance with desired models pulled (if using Ollama, e.g., `ollama pull llama3`, `ollama pull nomic-embed-text`)
- `git` for cloning the repository.
- (Recommended) Tesseract OCR and Poppler-utils for robust PDF processing with `unstructured`:
  - Ubuntu/Debian: `sudo apt-get install tesseract-ocr poppler-utils`
  - macOS: `brew install tesseract poppler`
  - Windows: Requires manual installation. Add Tesseract to PATH.
- Clone the repository:
  ```bash
  git clone https://github.com/gokhaneraslan/pipeline.git
  cd pipeline
  ```
- Create and activate a virtual environment (recommended):
  ```bash
  python -m venv venv
  # On Windows
  venv\Scripts\activate
  # On macOS/Linux
  source venv/bin/activate
  ```
- Install Python dependencies:
  ```bash
  pip install -r requirements.txt
  ```
- Install Playwright browsers (for `crawl4ai` if its default browser-based strategies are used, though this project primarily uses LXML/Trafilatura):
  ```bash
  playwright install
  ```
  (Note: The current `web_scraper.py` uses `LXMLWebScrapingStrategy` for `crawl4ai`, which may not require this, but it's good practice for full `crawl4ai` functionality.)
The pipeline is configured primarily through environment variables, which can be set in a `.env` file in the project root directory. A `config.py` file loads these variables and provides defaults.
- Create a `.env` file in the root of the project:
  ```bash
  cp .env.example .env  # If an example file is provided, otherwise create manually
  ```
- Edit `.env` and `config.py` with your settings:

  Key Environment Variables for `.env`:
  ```
  # --- General ---
  LOG_LEVEL=INFO  # DEBUG, INFO, WARNING, ERROR, CRITICAL
  LOG_TO_FILE=True
  LOG_FILE_NAME=data_pipeline.log
  BASE_OUTPUT_DIR=./pipeline_output

  # --- LLM Provider ---
  PRIMARY_LLM_PROVIDER=groq  # Options: "groq", "mistral", "ollama"

  # API Keys (only fill for the provider you use)
  GROQ_API_KEY=your_groq_api_key
  MISTRAL_API_KEY=your_mistral_api_key

  # Ollama Settings (if PRIMARY_LLM_PROVIDER="ollama")
  OLLAMA_BASE_URL=http://localhost:11434
  OLLAMA_CLIENT_TIMEOUT=120

  # --- Embedding Model (for Semantic Splitting) ---
  EMBEDDING_MODEL_PROVIDER=sentence_transformers  # "ollama" or "sentence_transformers"
  OLLAMA_EMBEDDING_MODEL=nomic-embed-text  # If EMBEDDING_MODEL_PROVIDER="ollama"
  SENTENCE_TRANSFORMERS_EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2  # If EMBEDDING_MODEL_PROVIDER="sentence_transformers"

  # --- Search Engine ---
  SEARCH_ENGINE_TYPE=google  # "google" or "duckduckgo"
  SEARCH_MAX_RESULTS_PER_QUERY=5
  SEARCH_LANGUAGE=en

  # --- Topic ---
  TOPIC_FOR_PROCESSING="Your chosen technical topic here"  # e.g., "Quantum Entanglement Explained"

  # --- Other configurations can be adjusted directly in config.py if needed ---
  # LLM_MODELS (per provider, per task)
  # LLM_TEMPERATURE, LLM_MAX_RETRIES, LLM_RETRY_DELAY_SECONDS
  # STAGE3_CHUNK_SIZE, STAGE3_CHUNK_OVERLAP, STAGE3_RELEVANCE_THRESHOLD
  # SEMANTIC_SPLITTER_BUFFER_SIZE, SEMANTIC_SPLITTER_BREAKPOINT_PERCENTILE
  # STAGE4_FOCUS_THRESHOLD
  # STAGE5_TEXT_WRAP_WIDTH
  # STAGE6_NUM_QUESTIONS_PER_CHUNK
  # DEFAULT_DATASET_TEMPLATE
  ```
  `config.py` adjustments:
  - You can customize `LLM_MODELS` in `config.py` to specify different models for each LLM provider and for each task (e.g., `query_generation`, `relevance_scoring`).
  - Other numerical thresholds and parameters for pipeline stages are also defined in `config.py`.
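For orientation, `config.py` typically reads these variables with `python-dotenv` and exposes them as module-level constants. The snippet below is only a sketch of that pattern; the actual file may nest `LLM_MODELS` by provider, and the model names shown are placeholders.

```python
import os
from dotenv import load_dotenv

load_dotenv()  # pick up the .env file from the project root

PRIMARY_LLM_PROVIDER = os.getenv("PRIMARY_LLM_PROVIDER", "groq")
TOPIC_FOR_PROCESSING = os.getenv("TOPIC_FOR_PROCESSING", "")

# One model per task (placeholder names; the real config may key this by provider first).
LLM_MODELS = {
    "query_generation": "llama3-70b-8192",
    "relevance_scoring": "llama3-8b-8192",
    "qa_generation": "llama3-70b-8192",
}
```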
After installation and configuration:
- Ensure your LLM provider is accessible (e.g., Ollama server is running if selected).
- Run the main pipeline script:
  ```bash
  python main.py
  ```
- Monitor the console output and the log file (`pipeline_output/data_pipeline.log` by default) for progress and any errors.
- Outputs from each stage will be saved in subdirectories within `pipeline_output/` (or your configured `BASE_OUTPUT_DIR`). The final refined dataset will typically be in `pipeline_output/refined_dataset/`.
```
.
├── core/                      # Core utilities and prompt definitions
│   ├── llm_client.py          # LLM client initialization
│   ├── prompts.py             # LLM prompt templates
│   └── utils.py               # General utility functions
├── processing/                # Modules for specific data processing tasks
│   ├── document_parser.py     # PDF/text parsing, document splitting
│   ├── llm_interactor.py      # Logic for LLM interactions for various tasks
│   ├── search_engine.py       # Web search functionalities
│   └── web_scraper.py         # Web scraping and PDF downloading
├── stages/                    # Modules defining each pipeline stage
│   ├── query_generation.py
│   ├── search_and_download.py
│   ├── initial_filtering.py
│   ├── detailed_focus_filtering.py
│   ├── content_extraction.py
│   ├── question_generation_and_answers.py
│   └── dataset_refinement.py
├── pipeline_output/           # Default directory for all generated outputs and logs
│   ├── 01_initial_downloads/  # Raw downloaded/scraped files
│   ├── 02_relevant_files/     # Files passing initial relevance filter
│   ├── 03_focused_files/      # Files passing detailed focus filter
│   ├── 04_extracted_content/  # Compiled extracted text
│   ├── 05_final_datasets/     # Generated Q&A datasets (before refinement)
│   ├── refined_dataset/       # Refined Q&A datasets
│   └── data_pipeline.log      # Log file
├── config.py                  # Main configuration loading settings and defaults
├── logger_setup.py            # Logging configuration
├── main.py                    # Main script to run the pipeline
├── requirements.txt           # Python dependencies
├── .env.example               # Example environment file (optional)
└── README.md                  # This file
```
- LLM Interaction:
  - `groq`: For Groq API access.
  - `mistralai`: For Mistral API access.
  - `ollama`: For local LLM interaction via Ollama.
- Web Interaction & Scraping:
  - `googlesearch-python`: For Google search.
  - `httpx`: For robust HTTP requests.
  - `BeautifulSoup4`: For HTML parsing (DDG search, basic scraping).
  - `trafilatura`: For extracting main content from web pages.
  - `crawl4ai`: Advanced asynchronous web crawling framework.
- Document Processing & Chunking:
  - `unstructured`: For partitioning PDFs and extracting text from various elements.
  - `PyMuPDF` (fitz): Often a dependency for PDF processing.
  - `llama-index-core`: Core components for document representation and node parsing.
  - `llama-index-embeddings-ollama`: Ollama embeddings for LlamaIndex.
  - `llama-index-embeddings-huggingface`: HuggingFace Sentence Transformers embeddings.
  - `sentence-transformers`: For local text embeddings.
- Utilities:
  - `python-dotenv`: For loading environment variables.
  - `pandas`: For saving datasets to CSV (for the default template).
  - `nest_asyncio`: To manage asyncio event loops, especially when `crawl4ai` is used.
Contributions are welcome! Please feel free to submit a pull request or open an issue for bugs, feature requests, or improvements.
- Fork the repository.
- Create your feature branch (`git checkout -b feature/AmazingFeature`).
- Commit your changes (`git commit -m 'Add some AmazingFeature'`).
- Push to the branch (`git push origin feature/AmazingFeature`).
- Open a Pull Request.