MLPapers Scraper and Summarizer

A web application that scrapes recent Machine Learning research papers from arXiv and generates summaries using either OpenAI or Claude API. This project demonstrates how to implement a resilient AI-powered system that can seamlessly switch between different language model providers.

Live Demo: https://mlpapers-summarizer.onrender.com

Project Motivation

I created this project to address a personal pain point. As I started building ML projects both for university assignments and personal skill improvement, I found myself spending excessive time searching through research papers to find useful techniques and stay current with the industry.

This tool automates the most time-consuming parts:

Discovering new papers in specific ML categories
Reading and understanding technical papers quickly
Extracting the key contributions and methodologies

By scraping papers from arXiv and leveraging AI to generate comprehensive summaries directly from the full PDF content, this project has significantly cut down the labor involved in staying up-to-date with ML research.

Key Features

Dual API Support: Integration with both OpenAI and Claude APIs, with fallback options if one service is unavailable
Full PDF Analysis: Extracts text directly from research paper PDFs to generate more comprehensive and accurate summaries
Automated Scraping: Retrieves recent ML research papers from arXiv based on categories and keywords
AI Summarization: Generates concise, structured summaries of complex research papers
Provider Selection: Choose which AI model to use for summarization through an intuitive admin interface
API Diagnostics: Built-in tools to check API connection status and troubleshoot problems

The application is designed to be resilient to API outages or quota limitations by supporting multiple providers. If one service is unavailable, it can automatically switch to the other.

Screenshots

Home Page

View recently scraped papers from arXiv with their summaries, categorized for easy browsing.

Paper Detail

Read detailed information about a paper along with an AI-generated summary that breaks down complex research into digestible sections. Summaries are now generated using the full paper PDF, not just the abstract.

Admin Panel

Manage papers and control which AI provider to use for summarization. The dropdown allows switching between OpenAI, Claude, or automatic selection.

API Diagnostics

Check the status of your API connections and troubleshoot issues. As shown here, the system can detect when one API has quota limitations while another is working properly.

Setup

Clone the repository

git clone <repository-url>
cd MLPapers_scraper-summarizer

Create and activate a virtual environment

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies
```
pip install -r requirements.txt
```
Set up environment variables
- Copy .env.example to .env
- Configure your API keys (see API Configuration section)
Run the application
```
python app.py
```
Access the web interface at http://localhost:5001

Note for macOS users: If you encounter an error with port 5000 (the default Flask port), this is likely due to AirPlay using this port. The application has been configured to use port 5001 instead, but if you need to change it, you can modify the port in app.py.

API Configuration

The application supports two AI providers for generating summaries:

Option 1: OpenAI API

Go to https://platform.openai.com/account/api-keys to get your API key

Add your key to the .env file:

OPENAI_API_KEY=sk-your-openai-key-goes-here

Option 2: Claude API

Go to https://console.anthropic.com/ to get your API key

Add your key to the .env file:

ANTHROPIC_API_KEY=sk-ant-your-key-goes-here

You can provide either one or both API keys. The application is designed to work with just one of the services if needed.

Summary Generation

The application uses two sophisticated approaches to generate summaries, depending on the available API:

Claude Approach (PDF Direct Upload)

When using Claude API, the application:

Downloads the complete PDF from arXiv
Sends the entire PDF file directly to Claude's API using its multimodal capabilities
Enables Claude to "see" all figures, tables, and formatted content in the paper
Produces more comprehensive summaries that include visual information

OpenAI Approach (Text Extraction)

When using OpenAI API or when Claude is unavailable:

Downloads the PDF from arXiv
Extracts text content from the document (up to 20 pages)
Sends the extracted text to the API
Generates summaries based on the text content only

Both methods produce well-structured summaries with technical details, but the Claude direct approach can often capture more nuanced information from figures and complex formatting that might be lost in text extraction.

Switching Between API Providers

In the Admin Panel, you can select which AI provider to use for summarization:

Auto: The application will automatically use available APIs, with Claude preferred if both are available
OpenAI: Force using OpenAI's GPT model even if Claude is available
Claude: Force using Claude even if OpenAI is available

You can also check the API connection status with the API Diagnostics tool in the Admin Panel.

Usage

Home Page: View recently scraped papers
Paper Details: Click on a paper to view its details and AI-generated summary
Admin Panel: Access admin functionality to manage papers and trigger updates
- Set API provider preferences
- Check API status
- Generate missing summaries

Deployment

Deploying on Render

This project is configured for easy deployment to Render's free tier:

Fork/clone this repository to your GitHub account
Create a new Web Service on Render, connecting to your GitHub repository
Render will automatically detect the configuration in render.yaml
Add your API keys as environment variables in the Render dashboard:
- OPENAI_API_KEY (if using OpenAI)
- ANTHROPIC_API_KEY (if using Claude)
The application will be deployed and accessible via your Render URL

Auto-Deployment: Once set up, any changes pushed to your main branch will automatically trigger a new deployment on Render. This continuous deployment feature ensures your live site always reflects the latest code in your repository.

Note: On Render's free tier, the application will sleep after 15 minutes of inactivity. The first request after inactivity may take 30-60 seconds to respond.

Troubleshooting

If you encounter API issues:

Check your API keys are correctly set in the .env file
Use the API Diagnostics tool in the Admin Panel to test connections
Look at the app.log and summarizer.log files for detailed error messages

If you encounter port issues on macOS:

The application is configured to use port 5001 to avoid conflicts with AirPlay (which uses port 5000)
If needed, you can disable AirPlay Receiver in System Preferences > General > AirDrop & Handoff
Alternatively, you can change the port in app.py

Known Deployment Issues

Note About the Live Demo: The live demo at https://mlpapers-summarizer.onrender.com may display an "Internal Server Error" when attempting to generate summaries. This is a limitation of the current hosting plan, not a bug in the application itself.

The error occurs due to:

Memory Limitations: PDF processing and AI summarization require significant memory which exceeds the limits of the current hosting plan
Execution Timeouts: The hosting provider imposes strict timeouts which are insufficient for downloading PDFs and generating summaries
Resource Constraints: The application faces resource limitations when processing large PDF files

The application works properly in local environments with sufficient resources. The deployed demo still allows you to browse papers and view existing summaries, but generating new summaries might fail due to these hosting constraints.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT License

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
scraper		scraper
screenshots		screenshots
static		static
summarizer		summarizer
templates		templates
.env.example		.env.example
.gitignore		.gitignore
LICENSE		LICENSE
Procfile		Procfile
README.md		README.md
app.py		app.py
database.py		database.py
render.yaml		render.yaml
requirements.txt		requirements.txt
runtime.txt		runtime.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

MLPapers Scraper and Summarizer

Project Motivation

Key Features

Screenshots

Home Page

Paper Detail

Admin Panel

API Diagnostics

Setup

API Configuration

Option 1: OpenAI API

Option 2: Claude API

Summary Generation

Claude Approach (PDF Direct Upload)

OpenAI Approach (Text Extraction)

Switching Between API Providers

Usage

Deployment

Deploying on Render

Troubleshooting

Known Deployment Issues

Contributing

License

About

Uh oh!

Packages

Uh oh!

Languages

License

vatsalmehta2001/MLPapers_scraper-summarizer

Folders and files

Latest commit

History

Repository files navigation

MLPapers Scraper and Summarizer

Project Motivation

Key Features

Screenshots

Home Page

Paper Detail

Admin Panel

API Diagnostics

Setup

API Configuration

Option 1: OpenAI API

Option 2: Claude API

Summary Generation

Claude Approach (PDF Direct Upload)

OpenAI Approach (Text Extraction)

Switching Between API Providers

Usage

Deployment

Deploying on Render

Troubleshooting

Known Deployment Issues

Contributing

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Packages 0

Uh oh!

Languages

Packages