LLM-Extractify is an end-to-end data ingestion and extraction pipeline that leverages large language models (LLMs) and vector search to transform unstructured web content into structured, queryable knowledge. With support for multiple LLM providers and Firecrawl integration, this project simplifies the process of scraping, chunking, embedding, and indexing data.
- Multi-Provider LLM Support: OpenAI, Gemma, Mistral
- Web Scraping: Integrated with Firecrawl for dynamic and semantic extraction
- Vector Storage: Zilliz Milvus for efficient similarity search
- Configurable Pipeline: YAML-based prompt templates, environment-variable configuration via .env
- Streamlit UI: Quick start interface for URL intake and retrieval testing
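The chunking step mentioned above can be illustrated with a minimal sketch. This is not the project's actual implementation: the function name `chunk_text`, its parameters, and the character-based window sizes are assumptions chosen for illustration, and a real pipeline may split on tokens or sentences instead.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into fixed-size character windows that overlap by `overlap` characters.

    Hypothetical helper for illustration; not part of LLM-Extractify's API.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each window advances
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text,
        # so no trailing chunk is a pure repeat of the overlap region.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    sample = "Scraped page content would go here before embedding and indexing."
    for i, chunk in enumerate(chunk_text(sample, chunk_size=30, overlap=10)):
        print(i, repr(chunk))
```

Overlapping windows are a common choice because they reduce the chance that a sentence relevant to a query is cut in half at a chunk boundary; the right sizes depend on the embedding model in use.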
- Install Poetry (if not already installed):
pip install poetry
- Clone the repo and install dependencies:
git clone https://github.com/your-org/llm-extractify.git
cd llm-extractify
poetry install
- Activate the virtual environment:
source .venv/bin/activate # macOS/Linux
# or
.\.venv\Scripts\activate # Windows PowerShell
Create a .env file in the project root (use .env.example as a template) and populate the following keys:
OPENAI_API_KEY=
GEMMA_API_KEY=
MISTRAL_API_KEY=
FIRECRAWL_API_KEY=
ZILLIZ_AUTH_TOKEN=
ZILLIZ_CLOUD_URI=
Note (Windows): If you encounter execution policy issues, run:
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope Process
Launch the interactive frontend:
poetry run streamlit run frontend/streamlit_ui.py
- Onboard URLs/files (end-to-end processing):
poetry run python scripts/onboard.py
Run unit and integration tests under the tests/ folder:
# Single test
poetry run python tests/test_onboard.py
Model Evaluations (on feat/model-evals branch):
poetry run python tests/test_gpt_models.py
Use these for quick testing or demos:
- https://foundation.wikimedia.org/wiki/Policy:Privacy_policy
- https://docs.llamaindex.ai/en/stable/examples/embeddings/huggingface/
- https://www.sas.com/en/events/sas-innovate/faq.html
- https://aiconference.com/faq/
- Missing API keys? Ensure all keys are set in .env.
- Zilliz cluster access: Confirm ZILLIZ_CLOUD_URI and ZILLIZ_AUTH_TOKEN match your cloud cluster configuration.
- Windows venv issues: Use the PowerShell activation command above.
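To check for missing API keys quickly, a small standalone script along the following lines may help. It is a sketch rather than something shipped with the project: the helper names `parse_env_file` and `missing_keys` are hypothetical, the parser handles only simple KEY=VALUE lines, and it assumes every key listed in this README is required, which may not hold if you use a single LLM provider.

```python
import os
from typing import Dict, List, Mapping, Sequence

# Keys taken from the .env section of this README.
REQUIRED_KEYS = [
    "OPENAI_API_KEY",
    "GEMMA_API_KEY",
    "MISTRAL_API_KEY",
    "FIRECRAWL_API_KEY",
    "ZILLIZ_AUTH_TOKEN",
    "ZILLIZ_CLOUD_URI",
]


def parse_env_file(content: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for raw_line in content.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: Mapping[str, str],
                 required: Sequence[str] = REQUIRED_KEYS) -> List[str]:
    """Return the required keys that are absent or set to an empty value."""
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    env_path = ".env"  # assumed to sit in the project root
    env: Dict[str, str] = {}
    if os.path.exists(env_path):
        with open(env_path, encoding="utf-8") as handle:
            env = parse_env_file(handle.read())
    missing = missing_keys(env)
    print("Missing keys:", ", ".join(missing) if missing else "none")
```

Run it from the project root, for example with `poetry run python` followed by whatever filename you save it under; it only reads the .env file and prints key names, never values.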