PDFLens is a powerful document analysis tool that allows you to upload PDF, Word, and Text files, then ask questions about their content with accurate page references.
- Core: Python, Streamlit
- AI & NLP: LangChain, Cerebras AI, Hugging Face Transformers
- Document Processing: PyPDF2, python-docx, Unstructured
- Vector Database: FAISS
- Embeddings: Sentence Transformers
- 📄 Multi-format Support: Upload PDF, DOCX, or TXT files with ease
- 🔍 Accurate Page References: Get answers with precise page numbers (for PDFs) or sections (for other formats)
- 🤖 AI-Powered Insights: Advanced language understanding for comprehensive answers
- 📊 Document Structure View: Preview document layout and content organization
- ⚡ Fast Processing: Quick document analysis and response generation
- 🎯 Source Citations: See which pages the answers come from
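As an illustration of how page references and source citations can work in principle, here is a minimal sketch in plain Python. It is not taken from app.py: the `Chunk` class and the `chunk_pages` and `cited_pages` helpers are hypothetical names, and the keyword filter is only a stand-in for the embedding search that FAISS would perform. The idea is simply that each chunk of text keeps the page number it came from, so any retrieved chunk can be traced back to a page.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    page: int  # 1-based page number the text came from


def chunk_pages(pages, max_words=50):
    """Split each page's text into word-limited chunks that remember their page."""
    chunks = []
    for page_number, page_text in enumerate(pages, start=1):
        words = page_text.split()
        for start in range(0, len(words), max_words):
            piece = " ".join(words[start:start + max_words])
            chunks.append(Chunk(piece, page_number))
    return chunks


def cited_pages(retrieved):
    """Return the sorted, de-duplicated page numbers behind a set of chunks."""
    return sorted({chunk.page for chunk in retrieved})


# Hypothetical two-page document, one string per page.
pages = ["Introduction to the report.", "Revenue grew in the second quarter."]
chunks = chunk_pages(pages)

# Stand-in for vector search: keep chunks that mention the query term.
hits = [c for c in chunks if "revenue" in c.text.lower()]
print(cited_pages(hits))
```

For formats without pages (DOCX, TXT), the same pattern would apply with a section index in place of the page number.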
- Python 3.8 or higher
- Cerebras API key (Get one here)
- Clone or download this repository
- Create a virtual environment
  python -m venv venv
- Activate the virtual environment
  - Windows (PowerShell):
    .\venv\Scripts\Activate.ps1
  - Windows (Command Prompt):
    .\venv\Scripts\activate.bat
  - macOS/Linux:
    source venv/bin/activate
- Install dependencies
  pip install -r requirements.txt
- Set up your API key
  Create a .env file in the project root and add your Cerebras API key:
  CEREBRAS_API_KEY=your_api_key_here
- Start the application
  streamlit run app.py
  Or, on Windows, using the venv Python directly:
  .\venv\Scripts\python.exe -m streamlit run app.py
- Open your browser
  The app will automatically open at http://localhost:8501
- Upload a document
  Click "Browse files" and select a PDF, DOCX, or XLSX file
- Ask questions
  Type your question in the chat input and get AI-powered answers based on your document
chatbot/
├── app.py # Main Streamlit application
├── requirements.txt # Python dependencies
├── .env # Environment variables (API key)
├── README.md # This file
└── venv/ # Virtual environment (created after setup)
- Streamlit: Web interface
- LangChain: LLM orchestration framework
- Cerebras AI: Language model API
- FAISS: Vector database for semantic search
- Sentence Transformers: Text embeddings
- PDFPlumber: PDF text extraction
- python-docx: Word document processing
- Pandas: Excel file handling
The app currently uses qwen-3-235b-a22b-instruct-2507. You can change this in app.py:
llm = ChatCerebras(
    model="qwen-3-235b-a22b-instruct-2507",  # Change model here
    temperature=0,
    max_tokens=600,
)
Default embedding model: sentence-transformers/all-MiniLM-L6-v2
You can change this in the create_vectorstore() function.
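If you would rather switch models without editing app.py each time, one option is to read the model names from environment variables and fall back to the defaults above. The sketch below is an assumption, not part of the current app: the variable names `PDFLENS_LLM_MODEL` and `PDFLENS_EMBEDDING_MODEL` and the `load_model_settings` helper are hypothetical.

```python
import os

# Defaults taken from this README.
DEFAULT_LLM_MODEL = "qwen-3-235b-a22b-instruct-2507"
DEFAULT_EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"


def load_model_settings(env=None):
    """Return model names, preferring environment overrides over the defaults."""
    env = os.environ if env is None else env
    return {
        # Hypothetical variable names; the app does not define them today.
        "llm_model": env.get("PDFLENS_LLM_MODEL", DEFAULT_LLM_MODEL),
        "embedding_model": env.get("PDFLENS_EMBEDDING_MODEL", DEFAULT_EMBEDDING_MODEL),
    }


# Passing a dict instead of os.environ makes the behavior easy to try out.
settings = load_model_settings({"PDFLENS_LLM_MODEL": "my-other-model"})
print(settings["llm_model"])
print(settings["embedding_model"])
```

The returned values could then be passed as the `model` argument of `ChatCerebras(...)` and to the embedding setup in `create_vectorstore()`.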
Make sure you're using the virtual environment:
.\venv\Scripts\python.exe -m streamlit run app.py
- Verify your API key is correct in the .env file
- Ensure the .env file is in the same directory as app.py
- Restart the Streamlit app after changing the .env file
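The API key checks above can be partly automated. The following is a minimal sketch using only the standard library; `read_env_value` and `check_api_key` are hypothetical helpers, not functions from app.py, and the parser only understands plain `KEY=VALUE` lines. It confirms that a .env file exists in the given directory and that `CEREBRAS_API_KEY` is present and no longer the placeholder. It cannot tell whether the key is actually valid for the Cerebras API.

```python
from pathlib import Path


def read_env_value(env_path, key):
    """Return the value for `key` in a simple KEY=VALUE file, or None if absent."""
    path = Path(env_path)
    if not path.is_file():
        return None
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if name.strip() == key:
            return value.strip().strip("'\"")
    return None


def check_api_key(project_dir="."):
    """Describe the state of CEREBRAS_API_KEY in <project_dir>/.env."""
    value = read_env_value(Path(project_dir) / ".env", "CEREBRAS_API_KEY")
    if value is None:
        return "CEREBRAS_API_KEY not found: is .env next to app.py?"
    if not value or value == "your_api_key_here":
        return "CEREBRAS_API_KEY is empty or still the placeholder"
    return "CEREBRAS_API_KEY looks set"


# Run from the project root, i.e. the directory that contains app.py.
print(check_api_key())
```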
Reinstall dependencies in the virtual environment:
.\venv\Scripts\python.exe -m pip install -r requirements.txt
- Never commit your .env file to version control
- Keep your API key confidential
- Add .env to your .gitignore file
This project is open source and available for personal and educational use.
Feel free to fork this project and submit pull requests for improvements!
For issues with:
- Cerebras API: Visit Cerebras Documentation
- Streamlit: Check Streamlit Documentation
- LangChain: See LangChain Documentation
