LLM-Extractify

LLM-Extractify is an end-to-end data ingestion and extraction pipeline that leverages large language models (LLMs) and vector search to transform unstructured web content into structured, queryable knowledge. With support for multiple LLM providers and Firecrawl integration, this project simplifies the process of scraping, chunking, embedding, and indexing data.
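The chunking stage mentioned above can be illustrated with a minimal sketch. This README does not describe how the project actually splits text, so the function name `chunk_text`, the character-based windows, and the default sizes are all assumptions chosen for illustration only.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper: the real pipeline may chunk by tokens,
    sentences, or document structure instead.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each new window advances
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

Overlapping windows keep some shared context between neighbouring chunks, which can help when a relevant sentence straddles a chunk boundary. The trade-off is more chunks to embed and store.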

🚀 Features

Multi-Provider LLM Support: OpenAI, Gemma, Mistral
Web Scraping: Integrated with Firecrawl for dynamic and semantic extraction
Vector Storage: Zilliz Milvus for efficient similarity search
Configurable Pipeline: YAML-based prompt templates and environment-variable configuration
Streamlit UI: Quick start interface for URL intake and retrieval testing
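To make the similarity-search feature concrete, here is a small pure-Python sketch of ranking stored vectors by cosine similarity. It does not use the Milvus or Zilliz client and is not how the project queries its collection. The names `cosine_similarity` and `top_k`, and the in-memory dictionary index, are hypothetical stand-ins for what a vector database does at scale.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query: list[float], index: dict[str, list[float]], k: int = 3) -> list[tuple[str, float]]:
    """Return the k (id, score) pairs most similar to the query, best first."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

A brute-force scan like this is linear in the number of stored vectors. A dedicated vector store exists precisely to avoid that cost on large collections.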

📦 Installation

Install Poetry (if not already installed):
```
pip install poetry
```

Clone the repo and install dependencies:

```
git clone https://github.com/your-org/llm-extractify.git
cd llm-extractify
poetry install
```

Activate the virtual environment:

```
source .venv/bin/activate    # macOS/Linux
# or
.\.venv\Scripts\activate     # Windows PowerShell
```

🔑 Configuration

Create a .env file in the project root (use .env.example as a template) and populate the following keys:

```
OPENAI_API_KEY=
GEMMA_API_KEY=
MISTRAL_API_KEY=
FIRECRAWL_API_KEY=
ZILLIZ_AUTH_TOKEN=
ZILLIZ_CLOUD_URI=
```
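A quick pre-flight check can catch empty or missing keys before a run fails partway through. The sketch below is not part of the repository. The helper name `missing_keys` is hypothetical, and it assumes the values from .env have already been loaded into the process environment, since this README does not say how the project loads them.

```python
import os
from collections.abc import Mapping
from typing import Optional

# Key names copied from the configuration list in this README.
REQUIRED_KEYS = (
    "OPENAI_API_KEY",
    "GEMMA_API_KEY",
    "MISTRAL_API_KEY",
    "FIRECRAWL_API_KEY",
    "ZILLIZ_AUTH_TOKEN",
    "ZILLIZ_CLOUD_URI",
)


def missing_keys(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the required keys that are unset or blank.

    Hypothetical helper: pass a mapping to check it directly,
    or omit the argument to check os.environ.
    """
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]
```

Calling `missing_keys()` with no argument returns an empty list when every key is populated. Otherwise it returns the names that still need values, in the order listed above.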

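Note (Windows): If PowerShell refuses to run the virtual environment activation script because of its execution policy, relax the policy for the current session only:
```
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope Process
```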

⚙️ Usage

1. Streamlit UI

Launch the interactive frontend:

```
poetry run streamlit run frontend/streamlit_ui.py
```

2. CLI Scripts

Onboard URLs/files (end-to-end processing):
```
poetry run python scripts/onboard.py
```

🧪 Testing

Run unit and integration tests under the tests/ folder:

```
# Single test
poetry run python tests/test_onboard.py
```

Model Evaluations (on feat/model-evals branch):

```
poetry run python tests/test_gpt_models.py
```

🌐 Sample URLs

Use these for quick testing or demos:

🛠️ Troubleshooting

Missing API keys? Ensure all keys are set in .env.
Zilliz cluster access: Confirm ZILLIZ_CLOUD_URI and ZILLIZ_AUTH_TOKEN match your cloud cluster configuration.
Windows venv issues: Use the PowerShell activation command above.

Repository: EshanW313/LLM-Extractify
