A web application that scrapes recent Machine Learning research papers from arXiv and generates summaries using either the OpenAI or Claude API. This project demonstrates how to implement a resilient AI-powered system that can seamlessly switch between language model providers.
Live Demo: https://mlpapers-summarizer.onrender.com
I created this project to address a personal pain point. As I started building ML projects both for university assignments and personal skill improvement, I found myself spending excessive time searching through research papers to find useful techniques and stay current with the industry.
This tool automates the most time-consuming parts:
- Discovering new papers in specific ML categories
- Reading and understanding technical papers quickly
- Extracting the key contributions and methodologies
By scraping papers from arXiv and leveraging AI to generate comprehensive summaries directly from the full PDF content, this project has significantly cut down the labor involved in staying up-to-date with ML research.
- Dual API Support: Integration with both OpenAI and Claude APIs, with fallback options if one service is unavailable
- Full PDF Analysis: Extracts text directly from research paper PDFs to generate more comprehensive and accurate summaries
- Automated Scraping: Retrieves recent ML research papers from arXiv based on categories and keywords
- AI Summarization: Generates concise, structured summaries of complex research papers
- Provider Selection: Choose which AI model to use for summarization through an intuitive admin interface
- API Diagnostics: Built-in tools to check API connection status and troubleshoot problems
The application is designed to be resilient to API outages or quota limitations by supporting multiple providers. If one service is unavailable, it can automatically switch to the other.
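For illustration, here is a minimal sketch of what the scraping step could look like using arXiv's public Atom API via `feedparser`; the category (`cs.LG`), result count, and function name are placeholders and not necessarily what this project uses.

```python
import feedparser
from urllib.parse import urlencode

# Sketch of the arXiv scraping step (not the project's exact code):
# query the public arXiv Atom API for recent papers in one category.
ARXIV_API = "http://export.arxiv.org/api/query"

def fetch_recent_papers(category: str = "cs.LG", max_results: int = 10):
    """Return (title, abstract, pdf_url) tuples for recent papers in a category."""
    params = urlencode({
        "search_query": f"cat:{category}",
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "max_results": max_results,
    })
    feed = feedparser.parse(f"{ARXIV_API}?{params}")
    papers = []
    for entry in feed.entries:
        # Each arXiv entry links to both the abstract page and the PDF.
        pdf_url = next(
            (link.href for link in entry.links if link.get("type") == "application/pdf"),
            None,
        )
        papers.append((entry.title, entry.summary, pdf_url))
    return papers

if __name__ == "__main__":
    for title, _, pdf in fetch_recent_papers("cs.LG", 5):
        print(title, "->", pdf)
```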
View recently scraped papers from arXiv with their summaries, categorized for easy browsing.
Read detailed information about a paper along with an AI-generated summary that breaks down complex research into digestible sections. Summaries are now generated using the full paper PDF, not just the abstract.
Manage papers and control which AI provider to use for summarization. The dropdown allows switching between OpenAI, Claude, or automatic selection.
Check the status of your API connections and troubleshoot issues. As shown here, the system can detect when one API has quota limitations while another is working properly.
- Clone the repository

  ```bash
  git clone <repository-url>
  cd MLPapers_scraper-summarizer
  ```

- Create and activate a virtual environment

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies

  ```bash
  pip install -r requirements.txt
  ```

- Set up environment variables
  - Copy `.env.example` to `.env`
  - Configure your API keys (see API Configuration section)

- Run the application

  ```bash
  python app.py
  ```

- Access the web interface at `http://localhost:5001`

Note for macOS users: If you encounter an error with port 5000 (the default Flask port), this is likely due to AirPlay using this port. The application has been configured to use port 5001 instead, but if you need to change it, you can modify the port in `app.py`.
The application supports two AI providers for generating summaries:
- Go to https://platform.openai.com/account/api-keys to get your OpenAI API key
- Add your key to the `.env` file:

  ```
  OPENAI_API_KEY=sk-your-openai-key-goes-here
  ```

- Go to https://console.anthropic.com/ to get your Claude (Anthropic) API key
- Add your key to the `.env` file:

  ```
  ANTHROPIC_API_KEY=sk-ant-your-key-goes-here
  ```
You can provide either one or both API keys. The application is designed to work with just one of the services if needed.
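For reference, a minimal sketch of how the keys might be loaded at startup with `python-dotenv`; the actual loading code in `app.py` may differ.

```python
import os
from dotenv import load_dotenv  # pip install python-dotenv

# Pull variables from the .env file into the process environment.
load_dotenv()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")        # may be None
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")  # may be None

# At least one provider is required for summarization to work.
if not (OPENAI_API_KEY or ANTHROPIC_API_KEY):
    raise RuntimeError("Set OPENAI_API_KEY and/or ANTHROPIC_API_KEY in your .env file")
```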
The application uses two approaches to generate summaries, depending on which API is available:
When using Claude API, the application:
- Downloads the complete PDF from arXiv
- Sends the entire PDF file directly to Claude's API using its multimodal capabilities
- Enables Claude to "see" all figures, tables, and formatted content in the paper
- Produces more comprehensive summaries that include visual information
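As a rough illustration of this direct-PDF path, the sketch below uses the `anthropic` Python SDK and its base64 `document` content block; the model name, prompt, and function name are placeholders rather than the project's actual code.

```python
import base64

import anthropic  # pip install anthropic

def summarize_pdf_with_claude(pdf_bytes: bytes, title: str) -> str:
    """Send the whole PDF to Claude and ask for a structured summary."""
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder; use a PDF-capable model
        max_tokens=1500,
        messages=[{
            "role": "user",
            "content": [
                {
                    # The full PDF is attached as a base64 document block,
                    # so figures and tables are visible to the model.
                    "type": "document",
                    "source": {
                        "type": "base64",
                        "media_type": "application/pdf",
                        "data": base64.standard_b64encode(pdf_bytes).decode("utf-8"),
                    },
                },
                {
                    "type": "text",
                    "text": f"Summarize the paper '{title}': key contributions, "
                            "methodology, results, and limitations.",
                },
            ],
        }],
    )
    return response.content[0].text
```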
When using OpenAI API or when Claude is unavailable:
- Downloads the PDF from arXiv
- Extracts text content from the document (up to 20 pages)
- Sends the extracted text to the API
- Generates summaries based on the text content only
Both methods produce well-structured summaries with technical details, but the Claude direct approach can often capture more nuanced information from figures and complex formatting that might be lost in text extraction.
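A comparable sketch of the text-extraction fallback, assuming `pypdf` for extraction and the OpenAI chat completions API; the 20-page cap mirrors the description above, while the library, model, prompt, and function name are placeholders that may differ from the project's actual code.

```python
from io import BytesIO

from openai import OpenAI    # pip install openai
from pypdf import PdfReader  # pip install pypdf

def summarize_pdf_text_with_openai(pdf_bytes: bytes, title: str, max_pages: int = 20) -> str:
    """Extract text from up to `max_pages` pages and summarize it with OpenAI."""
    reader = PdfReader(BytesIO(pdf_bytes))
    # Text-only extraction: figures and complex formatting are lost here.
    text = "\n".join(page.extract_text() or "" for page in reader.pages[:max_pages])

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": "You summarize ML research papers into structured sections."},
            {"role": "user", "content": f"Summarize the paper '{title}':\n\n{text}"},
        ],
    )
    return response.choices[0].message.content
```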
In the Admin Panel, you can select which AI provider to use for summarization:
- Auto: The application will automatically use available APIs, with Claude preferred if both are available
- OpenAI: Force using OpenAI's GPT model even if Claude is available
- Claude: Force using Claude even if OpenAI is available
You can also check the API connection status with the API Diagnostics tool in the Admin Panel.
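As a rough sketch of what the Auto selection logic could look like (function and variable names here are illustrative, not the project's actual API):

```python
def choose_provider(preference: str, claude_ok: bool, openai_ok: bool) -> str:
    """Illustrative provider selection: honor an explicit choice when that API
    is available, otherwise prefer Claude and fall back to OpenAI."""
    if preference == "claude" and claude_ok:
        return "claude"
    if preference == "openai" and openai_ok:
        return "openai"
    # "auto", or the forced provider being unavailable: Claude first, then OpenAI.
    if claude_ok:
        return "claude"
    if openai_ok:
        return "openai"
    raise RuntimeError("No AI provider is currently available")
```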
- Home Page: View recently scraped papers
- Paper Details: Click on a paper to view its details and AI-generated summary
- Admin Panel: Access admin functionality to manage papers and trigger updates
  - Set API provider preferences
  - Check API status
  - Generate missing summaries
This project is configured for easy deployment to Render's free tier:
- Fork/clone this repository to your GitHub account
- Create a new Web Service on Render, connecting to your GitHub repository
- Render will automatically detect the configuration in `render.yaml`
- Add your API keys as environment variables in the Render dashboard:
  - `OPENAI_API_KEY` (if using OpenAI)
  - `ANTHROPIC_API_KEY` (if using Claude)
- The application will be deployed and accessible via your Render URL
Auto-Deployment: Once set up, any changes pushed to your main branch will automatically trigger a new deployment on Render. This continuous deployment feature ensures your live site always reflects the latest code in your repository.
Note: On Render's free tier, the application will sleep after 15 minutes of inactivity. The first request after inactivity may take 30-60 seconds to respond.
If you encounter API issues:
- Check your API keys are correctly set in the `.env` file
- Use the API Diagnostics tool in the Admin Panel to test connections
- Look at the `app.log` and `summarizer.log` files for detailed error messages
If you encounter port issues on macOS:
- The application is configured to use port 5001 to avoid conflicts with AirPlay (which uses port 5000)
- If needed, you can disable AirPlay Receiver in System Preferences > General > AirDrop & Handoff
- Alternatively, you can change the port in `app.py`
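For reference, the port is normally set in the Flask entry point; a minimal sketch of what that part of `app.py` might look like (the real file contains much more):

```python
from flask import Flask

app = Flask(__name__)

if __name__ == "__main__":
    # Change the port here if 5001 also conflicts with something on your machine.
    app.run(host="0.0.0.0", port=5001)
```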
Note About the Live Demo: The live demo at https://mlpapers-summarizer.onrender.com may display an "Internal Server Error" when attempting to generate summaries. This is a limitation of the current hosting plan, not a bug in the application itself.
The error occurs due to:
- Memory Limitations: PDF processing and AI summarization require significant memory which exceeds the limits of the current hosting plan
- Execution Timeouts: The hosting provider imposes strict request timeouts that are too short for downloading PDFs and generating summaries
- Resource Constraints: The application faces resource limitations when processing large PDF files
The application works properly in local environments with sufficient resources. The deployed demo still allows you to browse papers and view existing summaries, but generating new summaries might fail due to these hosting constraints.
Contributions are welcome! Please feel free to submit a Pull Request.