Automatically collect papers from ArXiv and organize them in your Zotero library with AI-powered paper summarization! Perfect for researchers, students, and academics who want to keep their paper collections and references organized.
- 🔍 Search ArXiv papers using keywords, authors, or categories
- 📥 Automatically download PDFs
- 🤖 AI-powered summarization of papers
- 📝 Add papers to Zotero with complete metadata
- 📁 Organize papers into collections
- 📅 Filter papers by date range
- 🎯 Search specific types of content (journals, conference papers, preprints)
-
Install Python (version 3.7 or newer)
- Download from Python's website
- During installation, check "Add Python to PATH"
-
Install Git
- Download from Git's website
-
Get your Zotero Library ID:
- Visit Zotero Settings
- Navigate to "Feed Settings"
- Find "Your user ID for use in API calls is XXXXXX"
-
Create your API Key:
- In Zotero Settings, go to "API Settings"
- Click "Create new private key"
- Enable all permissions
- Click "Save Key"
- Copy the generated key
-
(Optional) Get a Collection Key:
- Open your Zotero library in a web browser
- Select the desired collection (folder)
- The collection key is the last part of the URL (format: "XXX1XXX0")
Open your terminal/command prompt and run:
# 1. Clone the repository
git clone https://github.com/StepanKropachev/arxiv-zotero-connector.git
cd arxiv-zotero-connector
# 2. Create a virtual environment
python -m venv .venv
# 3. Activate the environment
# On Windows:
.\.venv\Scripts\activate
# On Mac/Linux:
source .venv/bin/activate
# 4. Install dependencies
pip install -r requirements.txt
- Create a
.env
file in the project folder - Add your credentials:
ZOTERO_LIBRARY_ID=your_library_id
ZOTERO_API_KEY=your_api_key
COLLECTION_KEY=your_collection_key # Optional
Search for papers about "machine learning" in computer science:
python main.py --keywords "machine learning" --categories cs.AI --max-results 10
Search for recent papers by a specific author:
python main.py --author "John Smith" --start-date 2024-01-01
Download papers without PDFs:
python main.py --keywords "deep learning" --no-pdf
- Create
my_search.yaml
:
keywords:
- "reinforcement learning"
- "deep learning"
categories:
- "cs.AI"
- "cs.LG"
max_results: 20
start_date: "2024-01-01"
- Run with config:
python main.py --config my_search.yaml
from src.core.connector import ArxivZoteroCollector
from src.core.search_params import ArxivSearchParams
from src.utils.credentials import load_credentials
# Load credentials
credentials = load_credentials()
# Create collector
collector = ArxivZoteroCollector(
zotero_library_id=credentials['library_id'],
zotero_api_key=credentials['api_key'],
collection_key=credentials['collection_key'] # Optional
)
# Configure search
search_params = ArxivSearchParams(
keywords=["artificial intelligence"],
categories=["cs.AI"],
max_results=10
)
# Run collection
successful, failed = await collector.run_collection_async(
search_params=search_params,
download_pdfs=True
)
--keywords
or-k
: Search terms--title
or-t
: Search in titles only--categories
or-c
: ArXiv categories--author
or-a
: Author name--start-date
: Start date (YYYY-MM-DD)--end-date
: End date (YYYY-MM-DD)--max-results
or-m
: Maximum papers to retrieve--no-pdf
: Skip PDF downloads
cs.AI
: Artificial Intelligencecs.LG
: Machine Learningcs.CL
: Computation and Languagecs.CV
: Computer Visionphysics.comp-ph
: Computational Physicsmath.NA
: Numerical Analysisq-bio
: Quantitative Biology
-
Get your Gemini API Key:
- Visit Google AI Studio
- Create a new API key
- Add to
.env
:
GOOGLE_API_KEY=your_api_key
-
Configure in
my_search.yaml
:summarizer: enabled: true prompt: "Your custom prompt here" max_length: 300 rate_limit_delay: 5
--summarizer-enabled
: Toggle summarization--summarizer-prompt
: Custom AI prompt--summary-length
: Maximum summary length--rate-limit
: API request delay (seconds)
python main.py \
--keywords "survey" "review" "deep learning" \
--categories cs.AI cs.LG \
--start-date 2023-01-01 \
--content-type journal
python main.py \
--keywords "transformer" "attention mechanism" \
--categories cs.CL \
--start-date 2024-01-01 \
--max-results 20
python main.py \
--keywords "reinforcement learning" \
--content-type conference \
--start-date 2023-06-01 \
--end-date 2024-01-01
python main.py \
--keywords "quantum computing" \
--summarizer-prompt "Explain this research paper as if you're talking to a high school student" \
--max-results 5
python main.py \
--keywords "machine learning" \
--summarizer-prompt "Summarize this paper in Russian, focusing on methodology and results" \
--max-results 3
python main.py \
--keywords "artificial intelligence" "ethics" \
--summarizer-prompt "Create a Twitter-style thread (5 tweets max) explaining the key findings" \
--start-date 2024-01-01
- "Command not found" error:
- Verify correct directory
- Confirm Python installation
- Try
python3
instead ofpython
- Credentials not working:
- Verify Library ID and API key
- Check
.env
file formatting - Confirm API permissions
- Downloads failing:
- Check internet connection
- Verify available disk space
- Reduce
max_results
- Program seems slow:
- Consider ArXiv rate limits
- Be patient with large downloads
- Reduce
max_results
- "API Key Invalid" error:
- Verify GEMINI_API_KEY in
.env
- Check API key permissions
- Confirm key validity
- Verify GEMINI_API_KEY in
- Rate Limit Issues:
- Increase
rate_limit_delay
- Reduce batch size
- Monitor API quotas
- Increase
- Summary Quality:
- Refine prompts
- Adjust
max_length
- Use specific instructions
- 🐛 Report bugs via GitHub Issues
- 💡 Share suggestions for improvement
- 🤝 Contributions welcome!
MIT License - Free to use and modify!