Here is a tool to assist with your next literature review! This script retrieves article information from Google Scholar using a DOI and tracks that information in an Excel file. It also saves the PDF of the article if one is available: no more laborious manual file management!
- Validate the DOI and fetch article metadata from Google Scholar.
- Save metadata into an Excel file (`papers.xlsx`).
- Avoid duplicate entries in the Excel file.
- Download PDFs of articles into a `pdfs` folder, with filenames formatted as `year_authors_title.pdf`.
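The DOI validation step can be illustrated with a short sketch. This is not the script's actual code: the names `DOI_PATTERN`, `normalize_doi`, and `is_valid_doi` are hypothetical, and the regular expression loosely follows the pattern Crossref has recommended for modern DOIs, so unusual or legacy DOIs may be rejected.

```python
import re

# Assumption: "10.", a 4-9 digit registrant prefix, "/", then a suffix made of
# letters, digits, and a few punctuation characters. Matching is case-insensitive.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/[-._;()/:A-Z0-9]+$", re.IGNORECASE)

# Prefixes users commonly paste along with the DOI itself.
_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw):
    """Strip surrounding whitespace and a leading URL or 'doi:' prefix."""
    doi = raw.strip()
    for prefix in _PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi.strip()


def is_valid_doi(raw):
    """Return True if the normalized input looks like a DOI."""
    return bool(DOI_PATTERN.match(normalize_doi(raw)))
```

A check like this only confirms that the input is shaped like a DOI; whether the article can actually be found is decided by the lookup that follows.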
The script requires the following Python packages:
- `scholarly`
- `requests`
- `pandas`
- `openpyxl`
If these are not installed, you can run the provided example script to check and install missing dependencies.
- Clone the repository or copy the script into your Python environment.
- Run the script in your terminal or IDE.
- When prompted, enter a DOI or type `escape` to exit.
- Enter DOI: `10.1257/jel.41.3.788`
- The script will retrieve the article information, save it in `papers.xlsx`, and download the PDF (if available).
- Type `escape` or press `Enter` on an empty input to stop the program.
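The prompt behaviour described above can be sketched as a small loop. This is an illustration under stated assumptions, not the script's own implementation: `run_prompt_loop` and its `handle_doi` and `read_input` parameters are hypothetical names, and `handle_doi` stands in for whatever function processes a single DOI.

```python
def run_prompt_loop(handle_doi, read_input=input):
    """Ask for DOIs until the user types 'escape' or submits an empty line.

    handle_doi: callable invoked once per DOI entered (hypothetical hook).
    read_input: callable with the same signature as input(); replaceable in tests.
    Returns the list of entries that were passed to handle_doi.
    """
    processed = []
    while True:
        entry = read_input("Enter DOI (or 'escape' to exit): ").strip()
        # An empty line or the word 'escape' ends the session.
        if entry == "" or entry.lower() == "escape":
            break
        handle_doi(entry)
        processed.append(entry)
    return processed
```

Passing the input function as a parameter keeps the loop easy to exercise without a console, for example by feeding it a prepared list of entries.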
To ensure you have all required packages installed, you can use the following script:
    import subprocess
    import sys


    def check_and_install(package):
        """Try to import the package; install it with pip if the import fails."""
        try:
            __import__(package)
            print(f"{package} is already installed.")
        except ImportError:
            print(f"{package} is not installed. Installing...")
            # Use the current interpreter's pip so the package lands in this environment.
            subprocess.check_call([sys.executable, "-m", "pip", "install", package])


    required_packages = ["scholarly", "requests", "pandas", "openpyxl"]
    for package in required_packages:
        check_and_install(package)
Run the above script before using the main program to ensure all dependencies are available.
The file can be run as is: press `Enter` or type `escape` to leave the console and exit the program. It can also be modified fairly easily to process a list of DOIs, for example one read from a CSV file.
    from scholarly import scholarly
    import re
    import os
    import requests
    import pandas as pd

    # Example DOI list; it could also be read from a CSV file.
    dois = [
        "10.1257/jel.41.3.788",
        "10.1016/j.jfineco.2020.04.003",
    ]

    # populate_excel_from_dois is not defined in this excerpt; it is assumed
    # to come from the main script.
    populate_excel_from_dois(dois)
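Reading the DOI list from a CSV file could look like the sketch below. It uses only the standard `csv` module; the function name `read_dois_from_csv` and the assumption that the file has a header row with a `doi` column are illustrative and not part of the original script.

```python
import csv


def read_dois_from_csv(path, column="doi"):
    """Return the non-empty values of `column`, in file order, without repeats.

    Assumes the CSV has a header row containing `column` (hypothetical layout).
    """
    dois = []
    seen = set()
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            doi = (row.get(column) or "").strip()
            # DOIs are case-insensitive, so compare lowercased values.
            if doi and doi.lower() not in seen:
                seen.add(doi.lower())
                dois.append(doi)
    return dois
```

The resulting list can then be passed on in place of the hard-coded example, e.g. `populate_excel_from_dois(read_dois_from_csv("dois.csv"))`, where `dois.csv` is a placeholder filename.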
The script will create and manage the following files and folders:
- `papers.xlsx`: The Excel file where metadata is stored.
- `pdfs/`: A folder where downloaded PDFs are stored.
- The script ensures no duplicate DOIs are added to the Excel file.
- PDF filenames are sanitized to prevent issues with invalid characters.
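As a rough illustration of the two notes above, the sketch below shows one way a duplicate check and filename sanitization might work. The helper names (`sanitize_component`, `build_pdf_filename`, `is_duplicate`), the set of stripped characters, and the 60-character cap are all assumptions made for this example; the script's real rules may differ.

```python
import re


def sanitize_component(text, max_length=60):
    """Remove characters that are not allowed in filenames on common systems."""
    # Drop path separators, reserved punctuation, and control characters.
    cleaned = re.sub(r'[\\/:*?"<>|\x00-\x1f]', "", str(text))
    # Collapse runs of whitespace into single underscores.
    cleaned = re.sub(r"\s+", "_", cleaned.strip())
    # Truncate (assumed limit) and avoid trailing dots or underscores.
    return cleaned[:max_length].rstrip("._")


def build_pdf_filename(year, authors, title):
    """Assemble a name of the form year_authors_title.pdf from cleaned parts."""
    parts = [sanitize_component(part) for part in (year, authors, title)]
    return "_".join(part for part in parts if part) + ".pdf"


def is_duplicate(doi, existing_dois):
    """Return True if `doi` already appears in `existing_dois`.

    DOIs are case-insensitive, so values are compared trimmed and lowercased.
    """
    known = {str(value).strip().lower() for value in existing_dois}
    return doi.strip().lower() in known
```

In practice `existing_dois` would be the DOI column already stored in `papers.xlsx`, checked before a new row is appended.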
This project is licensed under the MIT License.