🚀 Update (This project is hosted)

Hosted Backend: https://pdf-search.niteshjangid.tech/

☁️ AWS / Cloud Services for Deployment

For deploying this PDF search backend, a robust and scalable set of AWS services would be ideal:

Backend API:
- AWS EC2 Machine: For hosting the backend API.
Database (Postgres):
- AWS RDS (Relational Database Service): Managed PostgreSQL database.
Search (Elasticsearch):
- AWS OpenSearch Service: Managed OpenSearch (Elasticsearch) service.
Asynchronous Processing:
- AWS SQS (Simple Queue Service): Api will use SQS to queue PDF processing tasks, worker will pick up tasks from the queue.
- AWS Lambda: Serverless functions to process PDFs. This will download the pdf from S3, perform extraction, embedding and indexing in OpenSearch.
- AWS S3: Storage for uploaded PDFs and processed results.

🚀 Workflow: PDF Upload & Search

1. Get Access 🔑: First, you register and login to get a secure access token (JWT). You'll need this token for most actions.

2. Upload PDF ⬆️: You upload a PDF file to the app. The backend then:

Extracts everything from the PDF: text, tables (formatted into searchable strings), and even text from images (using OCR).
Creates embeddings for all this content.
Stores the content in Elasticsearch for fast searching and saves the PDF's details in PostgreSQL.

3. Search Your PDFs 🔍: You can now search by typing a query. The app uses your query's embeddings to find the most relevant pieces of content across all your uploaded PDFs.

4. Manage Your Files 📂: You can also easily view a list of all the PDFs you've uploaded.

🚀 Technologies Used

Backend Framework: FastAPI (Python)
Database: PostgreSQL (with SQLAlchemy ORM)
Search Engine: Elasticsearch
Vector Embeddings: sentence-transformers
PDF Parsing: PyMuPDF
OCR: PyTesseract (requires Tesseract-OCR engine)
Containerization: Docker

🚀 Project Setup Locally

1. Using Docker (Recommended)

Clone the repository using Git

git clone https://github.com/niteshjangid29/pdf-search.git
cd pdf-search

Build the Docker image of the project (in root of the project)
```
docker build -t pdf-search .
```
Run the Docker container
```
docker run -p 8000:8000 pdf-search
```

2. Setup without Docker

Clone the repository

git clone https://github.com/niteshjangid29/pdf-search.git
cd pdf-search

Create the virtual environment

python3 -m venv venv
source venv/bin/activate

Install Dependencies
```
pip install -r requirements.txt
```
Run the Application
```
uvicorn main:app --reload --port 8000
```

Note: This may additionally require to install PyTesseract for image-ocr

Lets go!

The project is now set up and running! You can start using the PDF search API at http://localhost:8000

http://localhost:8000

API Endpoints

This document provides a detailed overview of the API endpoints for the PDF Semantic Search application. All API endpoints are available under the main application URL.

Authentication Endpoints

This section covers endpoints related to user registration, login, and profile management.

1. Register a New User

Endpoint: POST /register
Description: Creates a new user account.
Authorization: None required.

Request Body:

{
  "fullname": "Nitesh Jangid",
  "email": "nitesh@gmail.com",
  "password": "nitesh123"
}

Success Response (200 OK):

{
    "email": "hitendra@gmail.com",
    "fullname": "Hitendra",
    "id": "ba39a875-53fc-470f-9210-cc2787c24f4b",
    "created_at": "2025-08-26T14:44:15.284624"
}

2. User Login

Endpoint: POST /login
Description: Authenticates a user and returns a JWT access token.
Authorization: None required.

Request Body:

{
  "email": "nitesh@gmail.com",
  "password": "nitesh123"
}

Success Response (200 OK):

{
  "access_token": "ey...",
  "token_type": "bearer"
}

3. Get User Profile

Endpoint: GET /profile
Description: Retrieves the profile information of the currently authenticated user.
Authorization: Bearer Token required.

Success Response (200 OK):

{
    "email": "hitendra@gmail.com",
    "fullname": "Hitendra",
    "id": "ba39a875-53fc-470f-9210-cc2787c24f4b",
    "created_at": "2025-08-26T14:44:15.284624"
}

PDF Management Endpoints

These endpoints are for uploading, searching, and managing PDF documents. All endpoints in this section require authentication.

Authorization Header For all protected routes, you must include the JWT token in the Authorization header:

Authorization: Bearer <your_access_token>

1. Upload a PDF

Endpoint: POST /upload-pdf
Description: Uploads a PDF file. The system processes the file, extracts its content, creates vector embeddings, and indexes it for searching.
Authorization: Bearer Token required.

Request Body:

This is a multipart/form-data request. The file should be sent under the key file.

Success Response (200 OK):

Returns the metadata of the newly created PDF record.

{
    "id": "4e59d204-b32f-4e2f-a37a-cb38bff6852c",
    "user_id": "ba39a875-53fc-470f-9210-cc2787c24f4b",
    "file_name": "Stock-market (1).pdf",
    "created_at": "2025-08-26T21:58:35.283575"
}

Note: id of the pdf is required in searching, so copy that.

2. Search Within a PDF

Endpoint: POST /search
Description: Performs a semantic search for a query string within a specific PDF document.
Return: Return top 5 search results
Authorization: Bearer Token required.

Request Body:

{
    "query": "Stock Market",
    "pdf_id": "4e59d204-b32f-4e2f-a37a-cb38bff6852c"
}

Success Response (200 OK):

[
    {
        "pdf_id": "4e59d204-b32f-4e2f-a37a-cb38bff6852c",
        "type": "text",
        "page_number": 1,
        "block_index": 2,
        "content": "How the Stock Market Works"
    },
    {
        "pdf_id": "4e59d204-b32f-4e2f-a37a-cb38bff6852c",
        "type": "text",
        "page_number": 3,
        "block_index": 2,
        "content": "Stock Market Analysis"
    },
    {
        "pdf_id": "4e59d204-b32f-4e2f-a37a-cb38bff6852c",
        "type": "text",
        "page_number": 0,
        "block_index": 2,
        "content": "An Introduction to the Stock Market"
    },
    {
        "pdf_id": "4e59d204-b32f-4e2f-a37a-cb38bff6852c",
        "type": "text",
        "page_number": 0,
        "block_index": 3,
        "content": "The stock market refers to the collection of markets and exchanges where the issuing and \ntrading of shares of publicly-held companies take place. It's a cornerstone of modern capitalism, \nproviding a platform for companies to raise capital for growth and for investors to own a piece of \nthose companies and share in their potential profits. At its core, the stock market is a dynamic \nenvironment where the ownership of corporations is bought and sold, driven by the collective \nassessment of their current and future value."
    },
    {
        "pdf_id": "4e59d204-b32f-4e2f-a37a-cb38bff6852c",
        "type": "text",
        "page_number": 3,
        "block_index": 0,
        "content": "Value \nStocks"
    }
]

3. Get All User Documents

Endpoint: GET /pdfs
Description: Retrieves a list of metadata for all PDF documents uploaded by the authenticated user.
Authorization: Bearer Token required.

Success Response (200 OK):

[
    {
        "id": "4e59d204-b32f-4e2f-a37a-cb38bff6852c",
        "user_id": "ba39a875-53fc-470f-9210-cc2787c24f4b",
        "file_name": "Stock-market (1).pdf",
        "created_at": "2025-08-26T21:58:35.283575"
    },
    {
        "id": "f0e9d8c7-b6a5-4321-fedc-ba0987654321",
        "user_id": "ba39a875-53fc-470f-9210-cc2787c24f4b",
        "file_name": "project_report.pdf",
        "created_at": "2025-08-26T21:58:35.283575"
    }
]

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
app		app
.DS_Store		.DS_Store
.env		.env
.env.example		.env.example
.gitignore		.gitignore
Dockerfile		Dockerfile
README.md		README.md
main.py		main.py
requirements.txt		requirements.txt
stock-market-intro.pdf		stock-market-intro.pdf
upload-pdf-api.png		upload-pdf-api.png
vercel.json		vercel.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

🚀 Update (This project is hosted)

☁️ AWS / Cloud Services for Deployment

🚀 Workflow: PDF Upload & Search

🚀 Technologies Used

🚀 Project Setup Locally

1. Using Docker (Recommended)

2. Setup without Docker

Lets go!

API Endpoints

Authentication Endpoints

1. Register a New User

2. User Login

3. Get User Profile

PDF Management Endpoints

1. Upload a PDF

2. Search Within a PDF

3. Get All User Documents

About

Uh oh!

Releases

Packages

Languages

niteshjangid29/pdf-search

Folders and files

Latest commit

History

Repository files navigation

🚀 Update (This project is hosted)

☁️ AWS / Cloud Services for Deployment

🚀 Workflow: PDF Upload & Search

🚀 Technologies Used

🚀 Project Setup Locally

1. Using Docker (Recommended)

2. Setup without Docker

Lets go!

API Endpoints

Authentication Endpoints

1. Register a New User

2. User Login

3. Get User Profile

PDF Management Endpoints

1. Upload a PDF

2. Search Within a PDF

3. Get All User Documents

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages