A Python tool to enhance BibTeX entries with metadata from various academic APIs and DOI services. It was mostly written by ChatGPT o3; I just helped it a bit with structuring the code and fixing some bugs.
Tested performance on a sample bibliography:
- Sync mode: 78.2 seconds
- Async mode: 22.7 seconds
- Two-stage processing:
  - Metadata enhancement from multiple APIs
  - DOI-based BibTeX conversion using doi2bib.org
- Fetches metadata from multiple sources:
  - CrossRef API
  - arXiv API
  - Semantic Scholar API
  - DBLP API
- Web scraping support for doi2bib.org
- Supports both synchronous and asynchronous operations (note: async mode is currently broken)
- Modular provider system for easy extension (see the provider sketch after the project layout)
- Handles DOI and title-based searches
- Comprehensive logging system
- Organized output file management
Install the dependencies:

```
pip install bibtexparser aiohttp requests beautifulsoup4
```
```
├── bibtex_processor.py      # Combined processing orchestrator
├── metadata_enhancer.py     # API-based metadata enhancement
├── doi_converter.py         # DOI to BibTeX conversion
├── bibtex_output/           # Enhanced and converted BibTeX files
├── logs/                    # Detailed processing logs
├── providers/               # API providers directory
│   ├── __init__.py
│   ├── arxiv_provider.py    # arXiv API integration
│   └── crossref_provider.py # CrossRef API integration + base class
└── README.md
```
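To extend the tool with a new source, you add a provider in the `providers/` directory. The sketch below shows the general idea with OpenAlex as an example source; the base class name `BaseProvider` and the `fetch_metadata` signature are assumptions for illustration only — the actual base class lives in `providers/crossref_provider.py` and may differ.

```python
from typing import Optional

import requests


class BaseProvider:
    """Assumed interface, for illustration only; see providers/crossref_provider.py
    for the real base class, which may differ in name and signature."""

    def fetch_metadata(self, entry: dict) -> Optional[dict]:
        raise NotImplementedError


class OpenAlexProvider(BaseProvider):
    """Example of wiring an additional source (OpenAlex) into the provider pattern."""

    BASE_URL = "https://api.openalex.org/works"

    def fetch_metadata(self, entry: dict) -> Optional[dict]:
        # Title-based search; return a dict of fields to merge into the entry.
        title = entry.get("title")
        if not title:
            return None
        resp = requests.get(self.BASE_URL, params={"search": title, "per-page": 1})
        if resp.status_code != 200:
            return None
        results = resp.json().get("results", [])
        if not results:
            return None
        work = results[0]
        return {
            "doi": (work.get("doi") or "").replace("https://doi.org/", ""),
            "year": str(work.get("publication_year") or ""),
        }
```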
Basic usage:

```python
import asyncio

from bibtex_processor import BibTexProcessor

# Initialize the processor with optional API keys
processor = BibTexProcessor(crossref_api_key="YOUR_API_KEY")

# Run asynchronously (faster for many entries)
async def main():
    final_bibtex = await processor.process_bibtex_file('your_file.bib', is_async=True)
    print(f"Processed file saved to: {final_bibtex}")

asyncio.run(main())
```
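For smaller bibliographies (or while async mode is broken), a synchronous run may be simpler. A minimal sketch, assuming `is_async=False` selects the synchronous path and returns the output path directly:

```python
from bibtex_processor import BibTexProcessor

processor = BibTexProcessor(crossref_api_key="YOUR_API_KEY")

# Assumed: is_async=False runs the providers sequentially, without an event loop.
final_bibtex = processor.process_bibtex_file('your_file.bib', is_async=False)
print(f"Processed file saved to: {final_bibtex}")
```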
The processor creates two types of output:
- Enhanced BibTeX files with metadata from the APIs (this is what I mostly use)
- Final BibTeX files with DOI-based conversions

You can also use doi_converter.py on its own to convert a BibTeX file into a new one built from DOI lookups (see the sketch below).
All files are saved in the `bibtex_output` directory with timestamps.
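To run only the second stage, doi_converter.py can be invoked from your own script. The class name `DOIConverter` and the `convert_file` method below are purely hypothetical placeholders; check doi_converter.py for the actual interface.

```python
# Hypothetical sketch only; the names are placeholders, not the real API.
from doi_converter import DOIConverter  # assumed class name

converter = DOIConverter()
# Assumed: reads a .bib file, resolves each entry via doi2bib.org,
# and writes the converted file into bibtex_output/ with a timestamp.
output_path = converter.convert_file('your_file.bib')
print(f"Converted file saved to: {output_path}")
```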
Three types of logs are generated in the logs directory:
- Success/failure logs for each operation
- Statistical information about processing
- Combined processing log with overall results
API keys per provider (a sketch of passing them to the processor follows this list):
- CrossRef: Optional, but recommended for better rate limits
- arXiv: No API key required
- Semantic Scholar: Optional, but recommended for better rate limits
- DBLP: No API key required
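A sketch of passing keys to the processor: `crossref_api_key` appears in the usage example above, while `semantic_scholar_api_key` is an assumed parameter name and may differ in the actual constructor.

```python
from bibtex_processor import BibTexProcessor

processor = BibTexProcessor(
    crossref_api_key="YOUR_CROSSREF_KEY",
    semantic_scholar_api_key="YOUR_S2_KEY",  # assumed name; verify in bibtex_processor.py
)
```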
Feel free to submit issues and enhancement requests!