A Streamlit-based app with a FastAPI backend for extracting structured data (text, images, tables) from websites and PDFs. Processed data is stored in AWS S3 and rendered in a markdown-standardized format. The APIs are deployed on Google Cloud Run.

Web & PDF Data Extraction Tool

The project develops an application with Streamlit for the user interface and FastAPI for the API endpoints, using a Python backend. It processes website URLs and PDF files to extract structured data, including text, images, and tables, using various parsing techniques. The extracted data is stored securely in an AWS S3 bucket and rendered in the Streamlit UI in a markdown-standardized format for consistency. The implementation integrates open-source libraries (Scrapy, PyMuPDF) and enterprise tools (Diffbot, Microsoft Document Intelligence) alongside Docling to evaluate tool compatibility and performance. The prototype serves as a scalable framework for testing and validating data extraction capabilities across diverse input formats.

Team Members

  • Vedant Mane
  • Abhinav Gangurde
  • Yohan Markose

Attestation:

WE ATTEST THAT WE HAVEN’T USED ANY OTHER STUDENTS’ WORK IN OUR ASSIGNMENT AND ABIDE BY THE POLICIES LISTED IN THE STUDENT HANDBOOK

Resources

Application: Streamlit Deployment

Backend API: Google Cloud Run

Google Codelab: Codelab

Google Docs: Project Document

Video Walkthrough: Video

Technologies Used

  • Streamlit: Frontend Framework
  • FastAPI: API Framework
  • Google Cloud Run: Backend Deployment
  • AWS S3: External Cloud Storage
  • Scrapy: Open-source website data extraction tool
  • PyMuPDF: Open-source PDF data extraction tool
  • Diffbot: Enterprise website data extraction tool
  • Microsoft Document Intelligence: Enterprise PDF data extraction tool
  • Docling: Document data extraction tool

Application Workflow Diagram

[Application workflow diagram image]

Workflow

  1. User submits a request via the Streamlit UI.
  2. Frontend forwards the request to the respective API:
    • Web Processing API for URLs.
    • PDF Processing API for document uploads.
  3. Backend processes the request using appropriate tools.
  4. Processed data is stored in Amazon S3.
  5. The user receives extracted content back in the UI.
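
For illustration, a minimal Python sketch of step 2, assuming hypothetical endpoint paths /scrape-url and /process-pdf (the real routes live in the repository's API code) and a placeholder service URL:

import requests
import streamlit as st

API_URL = "https://<YOUR_CLOUD_RUN_URL>"  # the Cloud Run service URL from the deployment steps

option = st.radio("Input type", ["Web URL", "PDF Upload"])

if option == "Web URL":
    url = st.text_input("Enter a website URL")
    if st.button("Extract") and url:
        # Forward the URL to the Web Processing API (endpoint path is an assumption)
        resp = requests.post(f"{API_URL}/scrape-url", json={"url": url})
        st.markdown(resp.json()["markdown"])
else:
    pdf = st.file_uploader("Upload a PDF", type="pdf")
    if pdf is not None and st.button("Extract"):
        # Send the file as multipart form data to the PDF Processing API (path is an assumption)
        files = {"file": (pdf.name, pdf.getvalue(), "application/pdf")}
        resp = requests.post(f"{API_URL}/process-pdf", files=files)
        st.markdown(resp.json()["markdown"])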

Environment Setup

Required Python version: 3.12.*

1. Clone the Repository

git clone https://github.com/BigDataIA-Spring2025-4/DAMG7245_Assignment01.git
cd DAMG7245_Assignment01

2. Set Up the Virtual Environment

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

3. AWS S3 Setup

Step 1: Create an AWS Account

  • Go to AWS Signup and click Create an AWS Account.
  • Follow the instructions to enter your email, password, and billing details.
  • Verify your identity and choose a support plan.

Step 2: Log in to AWS Management Console

  • Visit AWS Console and log in with your credentials.
  • Search for S3 in the AWS services search bar and open it.

Step 3: Create an S3 Bucket

  • Click Create bucket.
  • Enter a unique Bucket name.
  • Select a region closest to your users.
  • Configure settings as needed (e.g., versioning, encryption).
  • Click Create bucket to finalize.
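
Once the bucket exists, you can sanity-check access from Python with boto3; the bucket name below is a placeholder, and credentials are read from the standard AWS environment variables or ~/.aws/credentials:

import boto3
from botocore.exceptions import ClientError

BUCKET = "<YOUR_BUCKET_NAME>"  # the bucket created above

s3 = boto3.client("s3")
try:
    # head_bucket raises ClientError if the bucket is missing or inaccessible
    s3.head_bucket(Bucket=BUCKET)
    print(f"Bucket '{BUCKET}' is reachable")
except ClientError as e:
    print(f"Cannot access bucket: {e}")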

4. Google Cloud SDK Setup

Step 1: Download and Install Google Cloud SDK

  • Visit the Google Cloud SDK documentation for platform-specific installation instructions.
  • Download the installer for your operating system (Windows, macOS, or Linux).
  • Follow the installation steps provided for your system.

Step 2: Initialize Google Cloud SDK

  • Open a terminal or command prompt.
  • Run gcloud init to begin the setup process.
  • Follow the prompts to log in with your Google account and select a project.

Step 3: Verify Installation

  • Run gcloud --version to confirm installation.
  • Use gcloud config list to check the active configuration.

5. Building and Deploying the Docker Image to Google Cloud Run

  1. Build the Docker Image
# Build and tag your image (make sure you're in the project directory)
docker build --platform=linux/amd64 --no-cache -t gcr.io/<YOUR_PROJECT_ID>/fastapi-app .
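
The build assumes a Dockerfile at the project root. A minimal sketch for a FastAPI app served by uvicorn on port 8080 (the port Cloud Run expects by default); the api.main:app module path is an assumption:

FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8080
CMD ["uvicorn", "api.main:app", "--host", "0.0.0.0", "--port", "8080"]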

  2. Test Locally (Optional but Recommended)
# Run the container locally
docker run -p 8080:8080 gcr.io/<YOUR_PROJECT_ID>/fastapi-app

# For Managing Environment Variables
docker run --env-file .env -p 8080:8080 gcr.io/<YOUR_PROJECT_ID>/fastapi-app

Visit http://localhost:8080/docs to verify the API works.
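
The --env-file option expects a .env file that is not committed to the repository. The variable names below are assumptions inferred from the tools used; adjust them to match the code:

AWS_ACCESS_KEY_ID=<your-access-key>
AWS_SECRET_ACCESS_KEY=<your-secret-key>
AWS_BUCKET_NAME=<your-bucket-name>
DIFFBOT_API_TOKEN=<your-diffbot-token>
AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT=<your-endpoint>
AZURE_DOCUMENT_INTELLIGENCE_KEY=<your-key>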

  3. Push to Google Container Registry
# Push the image
docker push gcr.io/<YOUR_PROJECT_ID>/fastapi-app
  4. Deploy to Cloud Run
gcloud run deploy fastapi-service \
  --image gcr.io/<YOUR_PROJECT_ID>/fastapi-app \
  --platform managed \
  --region us-central1 \
  --allow-unauthenticated  
  5. Get Your Service URL
# Use the same region you deployed to (us-central1 above)
gcloud run services describe fastapi-service \
  --platform managed \
  --region <REGION> \
  --format 'value(status.url)'
  6. Check Application Logs
gcloud run services logs read fastapi-service --region <REGION>

Data Flow & Backend Processes

1. User Input

The user provides input via:

  • A Web URL Input for extracting content from websites.
  • A PDF File Upload for extracting text and data from PDFs.

2. Frontend (Streamlit)

  • The Streamlit UI acts as the interface where users enter URLs or upload PDFs.
  • This UI sends requests to the Backend (FastAPI) for processing.

3. Backend (FastAPI)

The FastAPI service handles requests through two separate APIs:

  • Web Processing API – Handles web scraping and content extraction from URLs.
  • PDF Processing API – Extracts text and structured data from uploaded PDFs.
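
A minimal sketch of these two endpoints, assuming hypothetical route paths and response shapes (the real ones are defined in the repository):

from fastapi import FastAPI, File, UploadFile
from pydantic import BaseModel

app = FastAPI()

class URLRequest(BaseModel):
    url: str

@app.post("/scrape-url")
def scrape_url(req: URLRequest) -> dict:
    # Web Processing API: scrape req.url (Scrapy or Diffbot), convert the
    # result to markdown, store it in S3, and return it to the frontend.
    return {"markdown": f"# Extracted from {req.url}"}

@app.post("/process-pdf")
async def process_pdf(file: UploadFile = File(...)) -> dict:
    # PDF Processing API: parse the upload (PyMuPDF, Document Intelligence,
    # or Docling), then store and return the markdown.
    content = await file.read()
    return {"markdown": f"# Extracted {len(content)} bytes from {file.filename}"}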

4. Processing Components

The FastAPI service interacts with multiple tools for processing:

  • Open Source Tools – Scrapy for web content and PyMuPDF for PDFs.
  • Enterprise Tools – Diffbot for web content and Microsoft Document Intelligence for PDFs.
  • Docling Tool – Extracts structured content from PDFs.
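
As an example of the open-source PDF path, a hedged sketch using PyMuPDF; the markdown layout is an assumption:

import fitz  # PyMuPDF

def pdf_to_markdown(path: str) -> str:
    # Extract per-page text and note embedded images, emitting markdown.
    doc = fitz.open(path)
    parts = []
    for page in doc:
        parts.append(f"## Page {page.number + 1}")
        parts.append(page.get_text("text"))
        images = page.get_images(full=True)
        if images:
            # In the real pipeline the images would be saved and uploaded to S3.
            parts.append(f"*{len(images)} embedded image(s) on this page*")
    doc.close()
    return "\n\n".join(parts)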

5. Deployment & Execution

  • The backend runs on Google Cloud Run, packaged as a Docker Image.
  • Requests are processed in a cloud-based environment for scalability.

6. Data Storage & Output

  • Extracted data is stored in Amazon S3, ensuring persistence and easy retrieval.
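
A sketch of that storage step with boto3; the bucket name and key layout are assumptions:

import boto3

s3 = boto3.client("s3")

def store_markdown(markdown: str, key: str, bucket: str = "<YOUR_BUCKET_NAME>") -> str:
    # Persist the extracted content so the frontend can retrieve it later.
    s3.put_object(Bucket=bucket, Key=key, Body=markdown.encode("utf-8"),
                  ContentType="text/markdown")
    return f"s3://{bucket}/{key}"

s3_uri = store_markdown("# Extracted content", "extractions/example.md")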

References

Streamlit documentation

FastAPI Documentation

Scrapy Documentation

PyMuPDF Documentation

Diffbot Documentation

Microsoft Document Intelligence Documentation

Docling Documentation
