A privacy-first PDF search application that uses AI-powered semantic search to find information in your documents. Built with React and TensorFlow.js, it runs entirely in your browser, so your documents never leave your device.
Unlike traditional keyword search, Smart Docs understands the meaning of your query. Search for "cost" and find results containing "pricing", "expenses", or "budget", even if your exact words aren't in the document.
Hybrid scoring combines two signals:
- 60% semantic matching: understands context and meaning
- 40% keyword matching: boosts exact word matches
This ensures you get relevant results whether you're searching for specific terms or general concepts.
- All processing happens in your browser
- Documents are never uploaded to any server
- No data collection, no tracking
- Works completely offline after initial load
- Real-time search with debounced input
- Instant results as you type
- Efficient chunking and batch embedding
- Side-by-side view with search results
- Text highlighting for matched terms
- Zoom controls and page navigation
- Click any result to jump to that page
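The debounced search input listed above can be sketched as follows. This is a minimal illustration, not the code used in Smart Docs: the `debounce` helper and the 300 ms default delay are assumptions chosen for the example.

```javascript
// Hypothetical helper: delays calling `fn` until `delayMs` milliseconds
// have passed without another call. Each new call restarts the timer.
function debounce(fn, delayMs = 300) {
  let timerId = null;
  return function debounced(...args) {
    clearTimeout(timerId); // cancel the previously scheduled call, if any
    timerId = setTimeout(() => fn.apply(this, args), delayMs);
  };
}

// Usage: only the last query typed within the delay window triggers a search.
const runSearch = debounce((query) => {
  console.log("searching for:", query);
}, 300);

runSearch("qu");
runSearch("quarterly");
runSearch("quarterly revenue");
```

Debouncing keeps the embedding model from being invoked on every keystroke, which matters because each query has to be embedded before it can be compared with the document chunks.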
| Technology | Purpose |
|---|---|
| React 18 | UI framework with hooks |
| TensorFlow.js | Machine learning in the browser |
| Universal Sentence Encoder | Text embeddings for semantic search |
| PDF.js | PDF parsing and rendering |
| Vite | Fast development and bundling |
- Node.js 18+
- npm or yarn
- Clone the repository: `git clone https://github.com/NehaChaudhary311/smart-docs.git`, then `cd smart-docs`
- Install dependencies: `npm install`
- Start the development server: `npm run dev`
- Open your browser and navigate to `http://localhost:5173`
To create a production build and preview it locally, run `npm run build` followed by `npm run preview`.

When you upload a PDF:
- Text Extraction: PDF.js extracts text from each page
- Chunking: Text is split into chunks of roughly 400 characters, breaking at sentence boundaries
- Embedding: Each chunk is converted to a 512-dimensional vector using Universal Sentence Encoder
- Indexing: Embeddings are stored in memory for fast retrieval
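The chunking step can be sketched as follows. This is an illustrative approximation, not the project's actual `splitIntoChunks`: the function name `chunkBySentence`, the punctuation-based sentence splitter, and the handling of oversized sentences are all assumptions.

```javascript
// Hypothetical sketch: group whole sentences into chunks of at most
// roughly `maxChars` characters.
function chunkBySentence(text, maxChars = 400) {
  // Split after sentence-ending punctuation that is followed by whitespace.
  const sentences = text
    .split(/(?<=[.!?])\s+/)
    .map((s) => s.trim())
    .filter((s) => s.length > 0);

  const chunks = [];
  let current = "";
  for (const sentence of sentences) {
    const candidate = current ? current + " " + sentence : sentence;
    if (candidate.length > maxChars && current) {
      // Adding this sentence would overflow, so close the current chunk.
      chunks.push(current);
      current = sentence;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

console.log(chunkBySentence("One two three. Four five six. Seven eight nine.", 30));
```

In this sketch, a single sentence longer than `maxChars` is kept whole rather than cut mid-sentence, so chunk length is a target rather than a hard limit.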
Hybrid Score = (Semantic Score × 0.6) + (Keyword Score × 0.4)
Semantic Score: Cosine similarity between query embedding and document chunk embeddings
Keyword Score: Based on:
- Percentage of query words found in the chunk
- Bonus for multiple occurrences (capped at 0.3)
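As a rough sketch, the scoring described above might look like the code below. The 0.6/0.4 weights and the 0.3 bonus cap come from this README; the function names, the tokenizer, the 0.1 bonus per repeated occurrence, and clamping the keyword score to 1 are assumptions made for illustration.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid division by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical keyword score: share of query words found in the chunk,
// plus a bonus for repeated occurrences (capped at 0.3).
function keywordScore(query, chunkText) {
  const tokenize = (s) => s.toLowerCase().match(/[a-z0-9]+/g) || [];
  const queryWords = [...new Set(tokenize(query))];
  if (queryWords.length === 0) return 0;

  const counts = new Map();
  for (const word of tokenize(chunkText)) {
    counts.set(word, (counts.get(word) || 0) + 1);
  }

  let found = 0;
  let repeats = 0;
  for (const word of queryWords) {
    const n = counts.get(word) || 0;
    if (n > 0) found += 1;
    if (n > 1) repeats += n - 1;
  }
  const coverage = found / queryWords.length;
  const bonus = Math.min(0.3, repeats * 0.1); // 0.1 per repeat is an assumption
  return Math.min(1, coverage + bonus);
}

// Weighted combination using the 60/40 split from this README.
function combineScores(semanticScore, kwScore) {
  return semanticScore * 0.6 + kwScore * 0.4;
}

const kw = keywordScore("cost budget", "The budget is tight");
console.log(kw, combineScores(0.5, kw));
```

Here `cosineSimilarity` would receive the query embedding and a chunk embedding, and its result would be passed to `combineScores` as the semantic score.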
Results are filtered to ensure quality:
- With keyword match: Minimum 15% semantic similarity
- Without keyword match: Minimum 50% semantic similarity (prevents noise)
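The two thresholds above could be applied with a filter like the one below. The 0.15 and 0.5 values come from this README; the result shape (`semanticScore`, `kwScore`), the function names, and treating the thresholds as inclusive are assumptions.

```javascript
const MIN_SEMANTIC_WITH_KEYWORD = 0.15;
const MIN_SEMANTIC_WITHOUT_KEYWORD = 0.5;

// Hypothetical result shape: { text, semanticScore, kwScore }.
function passesQualityFilter(result) {
  const hasKeywordMatch = result.kwScore > 0;
  const threshold = hasKeywordMatch
    ? MIN_SEMANTIC_WITH_KEYWORD
    : MIN_SEMANTIC_WITHOUT_KEYWORD;
  return result.semanticScore >= threshold;
}

function filterResults(results) {
  return results.filter(passesQualityFilter);
}

const kept = filterResults([
  { text: "pricing table", semanticScore: 0.2, kwScore: 0.5 }, // kept
  { text: "unrelated", semanticScore: 0.3, kwScore: 0 }, // dropped
  { text: "expenses overview", semanticScore: 0.6, kwScore: 0 }, // kept
]);
console.log(kept.map((r) => r.text));
```

The stricter threshold for chunks without a keyword match means a purely semantic hit has to be a strong one before it is shown.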
- Be specific: "quarterly revenue growth" works better than "money"
- Use natural language: "what are the main conclusions?"
- Try synonyms: The semantic search understands related concepts
- ✅ Text-based PDFs
- ❌ Scanned documents (image-only PDFs)
- ❌ Password-protected PDFs
To change the chunk size, modify this line in `useSearchEngine.js`:

    const chunks = splitIntoChunks(doc.text, 400); // characters per chunk

To adjust the semantic vs. keyword balance:

    const hybridScore = (semanticScore * 0.6) + (kwScore * 0.4);

If no text is extracted, your PDF is likely a scanned (image-based) document. Smart Docs requires text-based PDFs.
The Universal Sentence Encoder is ~30MB. First load may take 30-60 seconds depending on your connection.
- Try broader search terms
- Check if text was extracted (look for console warnings)
- Ensure the AI model finished loading
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request