🔍 Smart Docs

A powerful, privacy-first PDF search application that uses AI-powered semantic search to find information in your documents. Built with React and TensorFlow.js, it runs entirely in your browser, so your documents never leave your device.

Features

Semantic Search

Unlike traditional keyword search, Smart Docs understands the meaning of your query. Search for "cost" and find results containing "pricing", "expenses", or "budget", even if your exact words aren't in the document.

Hybrid Search

Combines the best of both worlds:

60% semantic matching: understands context and meaning
40% keyword matching: boosts exact word matches

This ensures you get relevant results whether you're searching for specific terms or general concepts.

100% Private

All processing happens in your browser
Documents are never uploaded to any server
No data collection, no tracking
Works completely offline after initial load

Fast & Responsive

Real-time search with debounced input
Instant results as you type
Efficient chunking and batch embedding
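
The debounced input mentioned above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `debounce` helper and the 300 ms delay in the usage comment are assumptions.

```javascript
// Hypothetical helper (not taken from the Smart Docs source):
// returns a wrapper that calls `fn` only after `waitMs` milliseconds
// have passed without another call to the wrapper.
function debounce(fn, waitMs) {
  let timerId = null;
  return function debounced(...args) {
    // A new call cancels the pending one, so only the last call fires.
    if (timerId !== null) clearTimeout(timerId);
    timerId = setTimeout(() => {
      timerId = null;
      fn(...args);
    }, waitMs);
  };
}

// Usage sketch (the delay value is an assumption):
// const runSearch = debounce((query) => search(query), 300);
// input.addEventListener('input', (e) => runSearch(e.target.value));
```

With this pattern, typing several characters in quick succession triggers one search for the final query rather than one search per keystroke.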

Built-in PDF Viewer

Side-by-side view with search results
Text highlighting for matched terms
Zoom controls and page navigation
Click any result to jump to that page

Tech Stack

| Technology | Purpose |
| --- | --- |
| React 18 | UI framework with hooks |
| TensorFlow.js | Machine learning in the browser |
| Universal Sentence Encoder | Text embeddings for semantic search |
| PDF.js | PDF parsing and rendering |
| Vite | Fast development and bundling |

Getting Started

Prerequisites

Node.js 18+
npm or yarn

Installation

Clone the repository
```
git clone https://github.com/NehaChaudhary311/smart-docs.git
cd smart-docs
```
Install dependencies
```
npm install
```
Start the development server
```
npm run dev
```
Open your browser and navigate to http://localhost:5173

Production Build

```
npm run build
npm run preview
```

How It Works

1. Document Processing

When you upload a PDF:

Text Extraction: PDF.js extracts text from each page
Chunking: Text is split into ~400-character chunks along sentence boundaries
Embedding: Each chunk is converted to a 512-dimensional vector using Universal Sentence Encoder
Indexing: Embeddings are stored in memory for fast retrieval
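
The chunking step above might look something like the sketch below. It is an illustration under stated assumptions, not the project's `splitIntoChunks` implementation: the function name `chunkBySentence`, the sentence-splitting regex, and the handling of oversized sentences are all hypothetical.

```javascript
// Hypothetical stand-in for the project's chunking logic.
// Groups whole sentences into chunks of at most `maxChars` characters.
function chunkBySentence(text, maxChars = 400) {
  // Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
  const sentences = text
    .split(/(?<=[.!?])\s+/)
    .map((s) => s.trim())
    .filter((s) => s.length > 0);

  const chunks = [];
  let current = '';
  for (const sentence of sentences) {
    const candidate = current ? `${current} ${sentence}` : sentence;
    if (candidate.length > maxChars && current) {
      // Adding this sentence would overflow, so close the current chunk.
      chunks.push(current);
      current = sentence;
    } else {
      // Note: a single sentence longer than maxChars is kept whole.
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Keeping sentences intact means chunks are only approximately 400 characters, which matches the "~400" in the description: a chunk can be shorter, or longer when one sentence exceeds the limit on its own.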

2. Search Algorithm

Hybrid Score = (Semantic Score × 0.6) + (Keyword Score × 0.4)

Semantic Score: Cosine similarity between query embedding and document chunk embeddings

Keyword Score: Based on:

Percentage of query words found in the chunk
Bonus for multiple occurrences (capped at 0.3)
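
The scoring described above can be sketched as follows. Only the 0.6/0.4 weights and the 0.3 bonus cap come from this README. The function names, the tokenizer, and the 0.05 bonus per extra occurrence are assumptions made for illustration, so the real keyword score may be computed differently.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Hypothetical keyword score: fraction of distinct query words found in the
// chunk, plus a bonus for repeated occurrences capped at 0.3.
function keywordScore(query, chunk) {
  // Assumption: words are runs of letters or digits, compared in lowercase.
  const tokenize = (s) => s.toLowerCase().match(/[\p{L}\p{N}]+/gu) || [];
  const queryWords = [...new Set(tokenize(query))];
  if (queryWords.length === 0) return 0;

  const counts = new Map();
  for (const word of tokenize(chunk)) {
    counts.set(word, (counts.get(word) || 0) + 1);
  }

  let matched = 0;
  let extraOccurrences = 0;
  for (const word of queryWords) {
    const n = counts.get(word) || 0;
    if (n > 0) {
      matched += 1;
      extraOccurrences += n - 1;
    }
  }
  const coverage = matched / queryWords.length;
  const bonus = Math.min(0.3, extraOccurrences * 0.05); // 0.05 is an assumed step
  return Math.min(1, coverage + bonus);
}

// Weighted combination, using the 60/40 split stated in this README.
function hybridScore(semanticScore, kwScore) {
  return semanticScore * 0.6 + kwScore * 0.4;
}
```

For example, a chunk with a semantic score of 0.5 that contains every query word once gets a keyword score of 1 and a hybrid score of about 0.7.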

3. Result Filtering

Results are filtered to ensure quality:

With keyword match: Minimum 15% semantic similarity
Without keyword match: Minimum 50% semantic similarity (prevents noise)
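
A minimal sketch of this filtering rule, assuming each result carries a `semanticScore` and a `kwScore` in the 0 to 1 range. The result shape, the constant names, and the inclusive comparison are assumptions, and only the 15% and 50% thresholds come from the description above.

```javascript
// Thresholds taken from the README; the names are hypothetical.
const MIN_SEMANTIC_WITH_KEYWORD = 0.15;
const MIN_SEMANTIC_WITHOUT_KEYWORD = 0.5;

// Keeps a result if its semantic score reaches the threshold that applies
// to it. Results with no keyword match must clear the stricter threshold.
function passesFilter(result) {
  const threshold = result.kwScore > 0
    ? MIN_SEMANTIC_WITH_KEYWORD
    : MIN_SEMANTIC_WITHOUT_KEYWORD;
  return result.semanticScore >= threshold;
}

function filterResults(results) {
  return results.filter(passesFilter);
}
```

The asymmetry is the point of the design: an exact word match is strong evidence on its own, so a weak semantic score is tolerated, while a purely semantic hit has to be convincing to be shown.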

Usage Tips

Effective Searching

Be specific: "quarterly revenue growth" works better than "money"
Use natural language: "what are the main conclusions?"
Try synonyms: The semantic search understands related concepts

Supported Documents

✅ Text-based PDFs
❌ Scanned documents (image-only PDFs)
❌ Password-protected PDFs

Configuration

Chunk Size

Modify in useSearchEngine.js:

```
const chunks = splitIntoChunks(doc.text, 400); // characters per chunk
```

Search Weights

Adjust semantic vs keyword balance:

```
const hybridScore = (semanticScore * 0.6) + (kwScore * 0.4);
```
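
When changing the weights, it helps to keep them summing to 1 so the hybrid score stays in the same 0 to 1 range as its inputs. One way to guard against mistakes is sketched below. The `makeHybridScorer` helper is hypothetical and not part of the project.

```javascript
// Hypothetical helper: builds a scoring function from two non-negative
// weights and normalizes them so they always sum to 1.
function makeHybridScorer(semanticWeight, keywordWeight) {
  if (semanticWeight < 0 || keywordWeight < 0) {
    throw new RangeError('weights must not be negative');
  }
  const total = semanticWeight + keywordWeight;
  if (!(total > 0)) {
    throw new RangeError('at least one weight must be positive');
  }
  const ws = semanticWeight / total;
  const wk = keywordWeight / total;
  return (semanticScore, kwScore) => semanticScore * ws + kwScore * wk;
}

// Usage sketch: a 3:1 ratio becomes weights of 0.75 and 0.25.
// const score = makeHybridScorer(3, 1);
// score(1, 0) gives 0.75
```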

🐛 Troubleshooting

"No text extracted" warning

Your PDF is likely a scanned document (image-based). Smart Docs requires text-based PDFs.

Model loading takes too long

The Universal Sentence Encoder model is about 30 MB, so the first load may take 30-60 seconds depending on your connection.

Search returns no results

Try broader search terms
Check if text was extracted (look for console warnings)
Ensure the AI model finished loading

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Fork the repository
Create your feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request
