This project provides a comprehensive system for processing ZIM files (compressed Wikipedia/offline content databases) and creating a vector database for Retrieval-Augmented Generation (RAG) with Large Language Models, effectively turning them into an offline knowledge base.
This is a good solution for running local LLMs on underpowered computers while still providing access to extensive knowledge bases. By leveraging pre-compressed ZIM files and efficient vector search, you can have a powerful AI assistant with encyclopedic knowledge without requiring high-end hardware or constant internet connectivity.
graph TB
A[ZIM Files<br/>from Kiwix Library] --> B[ZIM Processing<br/>libzim/zimply]
B --> C[Text Extraction<br/>& Chunking]
C --> D[Embedding<br/>Generation<br/>sentence-transformers]
D --> E[Vector Database<br/>ChromaDB/FAISS]
F[User Query] --> G[Semantic Search<br/>Vector Similarity]
G --> H[Context Retrieval<br/>Top-K Results]
H --> I[RAG Pipeline<br/>with Local LLM]
I --> J[Generated Answer<br/>with Sources]
E --> G
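To make the retrieval leg of this diagram concrete, here is a minimal sketch using sentence-transformers and plain NumPy cosine similarity. The chunk texts and variable names are illustrative, not the project's actual internals:

```python
# Minimal sketch of the retrieval leg (steps F-H) of the pipeline above.
# Assumes sentence-transformers is installed; chunk texts and variable
# names are illustrative, not the project's actual internals.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# In the real system these chunks come from ZIM articles (steps B-C).
chunks = [
    "PTSD is commonly treated with trauma-focused psychotherapy.",
    "SSRIs such as sertraline are approved for treating PTSD.",
    "FAISS is a library for efficient similarity search.",
]
chunk_vecs = model.encode(chunks, normalize_embeddings=True)

# Embed the query and take the top-k most similar chunks.
query_vec = model.encode(["What are treatments for PTSD?"], normalize_embeddings=True)
scores = (chunk_vecs @ query_vec.T).ravel()  # cosine similarity (normalized vectors)
for i in np.argsort(-scores)[:2]:
    print(f"{scores[i]:.3f}  {chunks[i]}")
```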
- Dynamic ZIM Library: Automatically discovers and processes multiple ZIM files from a dedicated library directory
- Multi-format ZIM Support: Supports libzim, zimply, and CLI-based ZIM file reading
- Flexible Vector Databases: Choose between ChromaDB and FAISS for storage
- Local LLM Support: Docker Model Runner for offline LLM capabilities
- Smart Text Processing: Automatic text chunking and preprocessing with source attribution (see the chunking sketch after this list)
- RAG Pipeline: Complete retrieval-augmented generation pipeline with multi-source answers
- CLI Interface: Easy-to-use command-line interface with library management commands
- Export Functionality: Export articles from multiple ZIM files to JSON
- Source Tracking: Maintains source information for all extracted content
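As a rough illustration of the chunking and source-tracking behavior above, the sketch below splits text into overlapping windows. The chunk_size and chunk_overlap parameters mirror the config keys shown later, but the helper itself is hypothetical, not the project's code:

```python
# Hypothetical character-based chunker with overlap; the real
# implementation may differ, but the chunk_size and chunk_overlap
# config keys (see config.json below) behave in this spirit.
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200):
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# Each chunk keeps source attribution so answers can cite where they came from.
article = {"title": "PTSD", "zim_file": "wikipedia_en_medicine_maxi_2023-12.zim"}
article_text = "example text " * 500  # stand-in for an extracted article body
docs = [
    {"text": c, "title": article["title"], "source": article["zim_file"]}
    for c in chunk_text(article_text)
]
```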
./setup.sh
Download ZIM files from the Kiwix Library or the Wikimedia Dump and place them in the zim_library directory:
# Example: Add multiple ZIM files to your library
cp ~/Downloads/*.zim ./zim_library/
# List available ZIM files
python zim_rag.py list-zim
⚠️ Important Note: The first build is necessary only once and can take a very long time depending on the ZIM file size and your machine specifications. Large ZIM files (2GB+) may take several hours on slower machines. Subsequent queries will be much faster as they use the pre-built vector database.
# Build from all ZIM files in library
python zim_rag.py build
# Or build from specific ZIM file
python zim_rag.py build --zim-file "wikipedia_en_medicine_maxi_2023-12.zim"
# Limit articles per ZIM file for faster processing
python zim_rag.py build --limit 1000
This system is configured to automatically use Docker Model Runner with the ai/smollm3:Q4_K_M model for local, offline LLM capabilities.
# Simple semantic search
python zim_rag.py query "What are treatments for PTSD?"
# Full RAG with LLM generation
python zim_rag.py rag-query "Explain the latest developments in military medicine"
# Get system information
python zim_rag.py info
python zim_rag.py build [OPTIONS]
Options:
--zim-file TEXT Specific ZIM file to process (optional, processes all if not specified)
--limit INTEGER Limit number of articles per ZIM file
--force Force rebuild even if vector DB exists
# Semantic search only
python zim_rag.py query [OPTIONS] QUESTION
Options:
--k INTEGER Number of documents to retrieve [default: 5]
# RAG with LLM generation
python zim_rag.py rag-query QUESTION
# List all ZIM files in library
python zim_rag.py list-zim
# Show library and database information
python zim_rag.py info
# Export articles to JSON
python zim_rag.py export [OPTIONS]
Options:
--zim-file TEXT Specific ZIM file to export (optional, exports all if not specified)
--output TEXT Output file [default: zim_articles.json]
--limit INTEGER Limit number of articles per ZIM file
Create a config.json file to customize settings:
{
"zim_library_path": "./zim_library",
"embedding_model": "all-MiniLM-L6-v2",
"vector_db_type": "chroma",
"chunk_size": 1000,
"chunk_overlap": 200,
"persist_directory": "./vector_db",
"collection_name": "zim_articles",
"llm_provider": "docker_model_runner",
"llm_model": "ai/smollm3:Q4_K_M",
"max_articles_per_zim": null
}
Use the config file:
python zim_rag.py --config config.json build
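For reference, merging config.json over built-in defaults could look like the following; load_config here is a hypothetical helper, not the CLI's actual loading logic:

```python
# Hypothetical sketch of merging config.json over built-in defaults;
# the CLI's actual loading logic may differ.
import json
from pathlib import Path

DEFAULTS = {
    "zim_library_path": "./zim_library",
    "embedding_model": "all-MiniLM-L6-v2",
    "vector_db_type": "chroma",
    "chunk_size": 1000,
    "chunk_overlap": 200,
}

def load_config(path: str = "config.json") -> dict:
    cfg = dict(DEFAULTS)
    p = Path(path)
    if p.exists():
        cfg.update(json.loads(p.read_text()))  # user values override defaults
    return cfg
```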
Supported embedding models:
- all-MiniLM-L6-v2 (fast, good quality)
- all-mpnet-base-v2 (higher quality, slower)
- paraphrase-multilingual-MiniLM-L12-v2 (multilingual support)
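Note that different models produce vectors of different dimensionality, so switching models requires rebuilding the vector database (build --force). A quick way to check dimensions, assuming sentence-transformers is installed:

```python
# Different models produce different vector sizes, so the database
# must be rebuilt (build --force) after switching models.
from sentence_transformers import SentenceTransformer

for name in ("all-MiniLM-L6-v2", "all-mpnet-base-v2"):
    model = SentenceTransformer(name)
    print(name, model.get_sentence_embedding_dimension())  # 384 vs 768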
Supported vector databases:
- chroma: ChromaDB (recommended, persistent, metadata-rich)
- faiss: FAISS (faster search, less metadata)
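A minimal ChromaDB sketch, reusing the path and collection name from the default config; the code is illustrative and not the project's internals:

```python
# Illustrative ChromaDB usage with the default config's path and
# collection name; not the project's actual internals.
import chromadb

client = chromadb.PersistentClient(path="./vector_db")
collection = client.get_or_create_collection("zim_articles")

collection.add(
    ids=["doc-0"],
    documents=["SSRIs such as sertraline are approved for treating PTSD."],
    metadatas=[{"source": "wikipedia_en_medicine_maxi_2023-12.zim"}],
)
results = collection.query(query_texts=["PTSD treatments"], n_results=5)
print(results["documents"][0])
```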
This system uses Docker Model Runner for local, offline LLM capabilities with the ai/smollm3:Q4_K_M model. The LLM provider is configured in config.json and currently supports only Docker Model Runner.
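Docker Model Runner exposes an OpenAI-compatible API, so querying the model directly might look like the sketch below. The endpoint URL assumes host-side TCP access on Docker Model Runner's default port and may need adjusting for your setup:

```python
# Docker Model Runner serves an OpenAI-compatible API; the base_url
# below ASSUMES host-side TCP access on the default port and may need
# adjusting for your setup.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:12434/engines/v1",  # assumed endpoint
    api_key="not-needed",  # local endpoint; no real key required
)
resp = client.chat.completions.create(
    model="ai/smollm3:Q4_K_M",
    messages=[{"role": "user", "content": "What are treatments for PTSD?"}],
)
print(resp.choices[0].message.content)
```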
- Python 3.8+
- For ZIM reading: libzim or zimply (recommended)
- For embeddings: sentence-transformers
- For vector storage: faiss-cpu or chromadb
- For LLMs: Docker Model Runner (currently the only supported provider)
- RAM: 4GB minimum, 8GB+ recommended
- Storage: 2-3x the size of your ZIM file for the vector database (e.g., a 2 GB ZIM needs roughly 4-6 GB)
- GPU: Optional, but recommended for faster embedding generation
For large ZIM files:
- Process in batches using --limit
- Increase system RAM
- Use smaller chunk sizes
- Consider using FAISS instead of ChromaDB
- Choose appropriate embedding model size
- Adjust chunk size based on content type
- Use FAISS for faster search on large datasets (see the sketch below)
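The FAISS sketch referenced above: a flat inner-product index over normalized embeddings gives exact cosine-similarity search. This is a sketch under those assumptions, not the project's actual FAISS backend:

```python
# Sketch of the FAISS tip above: a flat inner-product index over
# normalized embeddings gives exact cosine-similarity search.
# Illustrative only; the project's FAISS backend may differ.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
vecs = model.encode(["chunk one", "chunk two"], normalize_embeddings=True)

index = faiss.IndexFlatIP(vecs.shape[1])  # inner product == cosine here
index.add(np.asarray(vecs, dtype=np.float32))

query = np.asarray(
    model.encode(["a query"], normalize_embeddings=True), dtype=np.float32
)
scores, ids = index.search(query, 2)  # top-2 nearest chunks
print(ids[0], scores[0])
```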
- Download from Kiwix Library: Visit https://library.kiwix.org/ or the Wikimedia Dump
- Place in library directory: Copy ZIM files to ./zim_library/
- Scan library: Run python3 zim_rag.py list-zim to see available files
- Build database: Run python3 zim_rag.py build to process all files
# Query across multiple ZIM files
python3 zim_rag.py rag-query "What are the current treatments for PTSD?"
# Get information about programming concepts
python3 zim_rag.py rag-query "Explain machine learning algorithms"
# Research medical and technical topics
python3 zim_rag.py rag-query "How do neural networks relate to brain function?"
# List all available ZIM files
python3 zim_rag.py list-zim
# Process only specific ZIM file
python3 zim_rag.py build --zim-file "wikipedia_en_medicine_maxi_2023-12.zim"
# Get detailed library information
python3 zim_rag.py info
# Export articles from all ZIM files
python3 zim_rag.py export --output all_articles.json
# Export from specific ZIM file
python3 zim_rag.py export --zim-file "fas-military-medicine_en_2025-06.zim" --output military_medicine.json
# Export sample articles
python3 zim_rag.py export --limit 100 --output sample_articles.json
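The exported JSON might look something like this; the field names are a plausible guess based on the source-tracking feature, not the exact schema:

```json
[
  {
    "title": "Post-traumatic stress disorder",
    "text": "...",
    "zim_file": "wikipedia_en_medicine_maxi_2023-12.zim"
  }
]
```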
# Build with different embedding model
python3 zim_rag.py build --embedding-model all-mpnet-base-v2
# Limit articles per ZIM file for faster processing
python3 zim_rag.py build --limit 500
# Force rebuild existing database
python3 zim_rag.py build --force
- Fork the repository
- Create a feature branch
- Make your changes
- Test thoroughly
- Submit a pull request
For development setup, follow the Quick Start guide above.
This project is open source. Please check the license file for details.
For issues and questions:
- Check the troubleshooting section
- Review the logs in zim_rag.log
- Open an issue with relevant details