
🔍 Smart Docs

A privacy-first PDF search application that uses AI-powered semantic search to find information in your documents. Built with React and TensorFlow.js, it runs entirely in your browser, so your documents never leave your device.


Features

Semantic Search

Unlike traditional keyword search, Smart Docs understands the meaning of your query. Search for "cost" and find results containing "pricing", "expenses", or "budget", even if your exact words aren't in the document.

Hybrid Search

Each result's score blends two signals:

  • 60% semantic matching: understands context and meaning
  • 40% keyword matching: boosts exact word matches

This ensures you get relevant results whether you're searching for specific terms or general concepts.

100% Private

  • All processing happens in your browser
  • Documents are never uploaded to any server
  • No data collection, no tracking
  • Works completely offline after initial load

Fast & Responsive

  • Real-time search with debounced input
  • Instant results as you type
  • Efficient chunking and batch embedding
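
As an illustration of the debounced input mentioned above, here is a minimal sketch of a debounce helper. The name `debounce` and the wait time are assumptions for illustration; the project's actual implementation may differ (for example, it may live inside a React hook).

```javascript
// Hypothetical helper: postpones calling `fn` until `waitMs` milliseconds
// have passed without another call, so a search runs once per typing pause.
function debounce(fn, waitMs) {
  let timerId = null;
  return function debounced(...args) {
    if (timerId !== null) {
      clearTimeout(timerId); // a newer keystroke cancels the pending call
    }
    timerId = setTimeout(() => {
      timerId = null;
      fn(...args);
    }, waitMs);
  };
}

// Usage: only the last query typed within the wait window triggers a search.
const runSearch = debounce((query) => console.log("searching for:", query), 300);
runSearch("c");
runSearch("co");
runSearch("cost");
```

Only the final call ("cost") reaches the search function, which avoids computing an embedding for every keystroke.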

Built-in PDF Viewer

  • Side-by-side view with search results
  • Text highlighting for matched terms
  • Zoom controls and page navigation
  • Click any result to jump to that page
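
One way to highlight matched terms is to split the text into matched and unmatched segments that the viewer can style differently. The sketch below is an assumption about how this could be done, not the project's code; `highlightSegments` and `escapeRegExp` are hypothetical names, and matching is case-insensitive on substrings.

```javascript
// Escape characters that have special meaning in a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: returns [{ text, match }] segments covering `text`,
// where `match` is true for parts that equal one of the query terms.
function highlightSegments(text, query) {
  const terms = query.split(/\s+/).filter(Boolean).map(escapeRegExp);
  if (terms.length === 0) {
    return [{ text, match: false }];
  }
  // With a capturing group, split() keeps the matches at odd indices.
  const pattern = new RegExp(`(${terms.join("|")})`, "gi");
  return text
    .split(pattern)
    .map((part, i) => ({ text: part, match: i % 2 === 1 }))
    .filter((segment) => segment.text !== "");
}

console.log(highlightSegments("Cost and pricing", "cost"));
```

A UI layer could render each segment with `match: true` inside a highlighted element and the rest as plain text.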

Tech Stack

| Technology                 | Purpose                             |
| -------------------------- | ----------------------------------- |
| React 18                   | UI framework with hooks             |
| TensorFlow.js              | Machine learning in the browser     |
| Universal Sentence Encoder | Text embeddings for semantic search |
| PDF.js                     | PDF parsing and rendering           |
| Vite                       | Fast development and bundling       |

Getting Started

Prerequisites

  • Node.js 18+
  • npm or yarn

Installation

  1. Clone the repository

    git clone https://github.com/NehaChaudhary311/smart-docs.git
    cd smart-docs
  2. Install dependencies

    npm install
  3. Start the development server

    npm run dev
  4. Open your browser and navigate to http://localhost:5173

Production Build

npm run build
npm run preview

How It Works

1. Document Processing

When you upload a PDF:

  1. Text Extraction: PDF.js extracts text from each page
  2. Chunking: Text is split at sentence boundaries into chunks of roughly 400 characters
  3. Embedding: Each chunk is converted to a 512-dimensional vector using Universal Sentence Encoder
  4. Indexing: Embeddings are stored in memory for fast retrieval
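
The chunking step can be sketched as follows. The function name `splitIntoChunks` appears in the Configuration section, but the body below is an assumption about how sentence-boundary chunking might work, not the project's actual code. It packs whole sentences into a chunk until adding the next one would exceed the limit.

```javascript
// Sketch only: groups whole sentences into chunks of at most `maxChars`
// characters. A single sentence longer than `maxChars` is kept intact as
// its own oversized chunk rather than being cut mid-sentence.
function splitIntoChunks(text, maxChars = 400) {
  // Split after ., ! or ? when followed by whitespace.
  const sentences = text
    .split(/(?<=[.!?])\s+/)
    .map((s) => s.trim())
    .filter((s) => s.length > 0);

  const chunks = [];
  let current = "";
  for (const sentence of sentences) {
    const candidate = current ? current + " " + sentence : sentence;
    if (candidate.length > maxChars && current) {
      chunks.push(current); // close the chunk before it overflows
      current = sentence;
    } else {
      current = candidate;
    }
  }
  if (current) {
    chunks.push(current);
  }
  return chunks;
}

console.log(splitIntoChunks("One. Two! Three? Four.", 10));
```

Each resulting chunk would then be passed to the embedding model, and the returned vectors kept in memory alongside the chunk's page number.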

2. Search Algorithm

Hybrid Score = (Semantic Score × 0.6) + (Keyword Score × 0.4)

Semantic Score: Cosine similarity between query embedding and document chunk embeddings
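
For reference, cosine similarity between two embedding vectors can be computed as in the sketch below. The function name `cosineSimilarity` is illustrative; the project may compute this with TensorFlow.js tensor operations instead of plain arrays.

```javascript
// Cosine similarity: dot(a, b) / (|a| * |b|).
// Returns 0 when either vector has zero length to avoid dividing by zero.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    return 0;
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineSimilarity([1, 0], [0, 1])); // orthogonal vectors: 0
```

Vectors pointing in the same direction score 1 and unrelated (orthogonal) vectors score 0, which is why the score can be read as a similarity percentage.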

Keyword Score: Based on:

  • Percentage of query words found in the chunk
  • Bonus for multiple occurrences (capped at 0.3)
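
The keyword score and the hybrid formula can be sketched as below. The 0.3 cap and the 0.6/0.4 weights come from this README. The per-occurrence bonus of 0.05, the clamp to 1, the tokenization, and the function names are assumptions made for illustration.

```javascript
// Lowercase and split on anything that is not a letter or digit.
function tokenize(text) {
  return text.toLowerCase().split(/[^\p{L}\p{N}]+/u).filter(Boolean);
}

// Sketch: fraction of query words present in the chunk, plus a bonus for
// repeated occurrences that is capped at 0.3.
function keywordScore(query, chunkText) {
  const queryWords = tokenize(query);
  if (queryWords.length === 0) {
    return 0;
  }
  const chunkTokens = tokenize(chunkText);
  let found = 0;
  let extraOccurrences = 0;
  for (const word of queryWords) {
    const count = chunkTokens.filter((token) => token === word).length;
    if (count > 0) {
      found += 1;
      extraOccurrences += count - 1;
    }
  }
  const coverage = found / queryWords.length;
  const bonus = Math.min(0.3, extraOccurrences * 0.05); // 0.05 per repeat is assumed
  return Math.min(1, coverage + bonus); // clamping to 1 is assumed
}

// Weighted blend from the formula above.
function hybridScore(semanticScore, kwScore) {
  return semanticScore * 0.6 + kwScore * 0.4;
}

const kw = keywordScore("cost budget", "The cost is high. Cost matters.");
console.log(kw, hybridScore(0.8, kw));
```

In this example one of the two query words is found (coverage 0.5) and it appears twice, so a small repeat bonus is added before the score is blended with the semantic score.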

3. Result Filtering

Results are filtered to ensure quality:

  • With keyword match: Minimum 15% semantic similarity
  • Without keyword match: Minimum 50% semantic similarity (prevents noise)
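
A minimal sketch of this filter follows. The two thresholds come from the list above. The result shape (`semanticScore`, `kwScore`), the function name, and treating the thresholds as inclusive are assumptions.

```javascript
const MIN_SEMANTIC_WITH_KEYWORD = 0.15;
const MIN_SEMANTIC_WITHOUT_KEYWORD = 0.5;

// Sketch: keep a result only if its semantic score clears the threshold
// that applies to it. Any keyword match (kwScore > 0) lowers the bar.
function filterResults(results) {
  return results.filter((result) => {
    const threshold =
      result.kwScore > 0
        ? MIN_SEMANTIC_WITH_KEYWORD
        : MIN_SEMANTIC_WITHOUT_KEYWORD;
    return result.semanticScore >= threshold;
  });
}

const kept = filterResults([
  { id: 1, semanticScore: 0.2, kwScore: 0.5 }, // kept: keyword match, above 0.15
  { id: 2, semanticScore: 0.4, kwScore: 0 },   // dropped: no keyword, below 0.5
]);
console.log(kept.map((r) => r.id));
```

The stricter threshold for results without any keyword match is what keeps loosely related chunks out of the list.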

Usage Tips

Effective Searching

  • Be specific: "quarterly revenue growth" works better than "money"
  • Use natural language: "what are the main conclusions?"
  • Try synonyms: The semantic search understands related concepts

Supported Documents

  • ✅ Text-based PDFs
  • ❌ Scanned documents (image-only PDFs)
  • ❌ Password-protected PDFs

Configuration

Chunk Size

Change the chunk size in useSearchEngine.js:

const chunks = splitIntoChunks(doc.text, 400); // characters per chunk

Search Weights

Adjust semantic vs keyword balance:

const hybridScore = (semanticScore * 0.6) + (kwScore * 0.4);

🐛 Troubleshooting

"No text extracted" warning

Your PDF is likely a scanned document (image-based). Smart Docs requires text-based PDFs.

Model loading takes too long

The Universal Sentence Encoder model is about 30 MB, so the first load may take 30-60 seconds depending on your connection.

Search returns no results

  • Try broader search terms
  • Check if text was extracted (look for console warnings)
  • Ensure the AI model finished loading

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request
