diff --git a/README.md b/README.md
index 71cb4a4..43b0d0a 100644
--- a/README.md
+++ b/README.md
@@ -2,22 +2,78 @@
 
 ## Why would you want to rename scientific papers?
 
-Well, if you're like me and you spend a lot of time going through research papers, you will have found it quite annoying to figure out which pdf is for which paper once you've
-downloaded them all to a single folder.
+Well, if you're like me and you've had to download a lot of research papers for literature reviews, you will have found it quite annoying to figure out which PDF is for which paper once you've downloaded them all to a single folder.
 
 You'll have some named as "231514232" while others will have more sophisticated naming but in the end, this is all gibberish to you.
 
-So, I decided to make a web app where you can upload all the pdfs you've downloaded, and get them back in a zipped file with each renamed with the title and year of publication.
+So, I decided to make a tool that you can use to rename each PDF with the title and year of publication so that it becomes a lot easier to organize and manage your research papers.
 
-I will also make a simple script to use this application directly from the terminal (in the future).
+I originally set out to make a Flask web app using Python where a user can upload all the papers and get them back (renamed) inside a zip file. However, this proved to be too cumbersome and slowed down the processing by a great deal. In the end, I settled on a simple script that can be run from the command line and gives excellent processing speed.
 
-## How to use
+## Requirements
 
-### Requirements
+This project only works on Linux-based operating systems or MacOS, for two reasons: (1) The service used for processing PDFs, Grobid, is not supported on Windows, and (2) Docker is not supported well on Windows either. So if you're a Windows user, I'm sorry :(
 
-You will need the Docker Engine (with Docker Compose) to easily set up and run this application. If you run a Linux-based OS or MacOS, you will find using this application quite easy.
+If you use Linux or MacOS, proceed with the instructions given below. You'll need the Docker Engine installed since this project uses a Docker image from the Docker Hub.
 
-However, if you're using a version of Windows that does not support Docker, you won't be able to use it :(
+## Usage Instructions
+
+### Grab the Grobid image from Docker
+
+Make sure docker is installed in your system. If so, run the following command to pull the Grobid image:
+
+```
+sudo docker pull lfoppiano/grobid:0.6.2
+```
+
+### Clone this repo and go inside the directory
+
+```
+git clone https://github.com/ishaqibrahimbot/rename-scientific-publications.git
+cd rename-scientific-publications
+```
+
+### Install the grobid client
+
+```
+git clone https://github.com/kermitt2/grobid_client_python.git
+cd grobid_client_python
+sudo python setup.py install
+cd ..
+```
+
+All these commands are doing is cloning the repo for the grobid client, going inside its folder, installing the dependencies, and getting outside the folder.
+
+### Start the Grobid server
+
+```
+./start_grobid_server.sh
+```
+
+### Run the script
+
+Finally, make a directory named "pdfs" inside the project's root folder and paste all of your PDF files inside it.
+
+Now go back to your terminal (open a new tab since the first one will now be running the Grobid server) and run the following command:
+
+```
+python rename_pdf.py
+```
+The project will start processing your files. If you check the terminal window where the Grobid server is running, you will see the requests being sent and processed sequentially.
+
+Run the above command with a -h argument to see all possible arguments. If you want your new PDF names to include the year of publication as well, run the following command:
+
+```
+python rename_pdf.py --include_year=True
+```
+
+Once the processing is finished (took me about 1-2 mins to process 10 PDFs with a total size of 9.8MB), you can go back to the "pdfs" folder and find your PDF files, renamed.
+
+For queries, suggestions, and issues (especially if you find glaring problems in the code), contact: ishaqibrahimbss@gmail.com
+
+## In case you want to check out the Flask app
+
+Although I don't recommend that you use the Flask app (it works fine but is just too slow compared to the command line method), you can still do so by following these instructions.
 
 ### Clone the repo
 
@@ -35,7 +91,7 @@ Go into the cloned directory using your terminal and after that, run the followi
 sudo docker-compose up
 ```
 
-You will notice log messages from two apps: the Flask app and the Grobid server (that this application uses for processing of pdf content).
+You will notice log messages from two apps: the Flask app and the Grobid server (that this application uses for processing of PDF content).
 
 ### Open the Flask URL
 
@@ -45,4 +101,3 @@ That's it! Now you can upload as many PDFs as you want and get a zipped file in
 
 Note: The processing might take some time, so please be patient and carry on with your other work. The browser will prompt you automatically to download the zip file once the processing is completed.
 
-For queries, suggestions, and issues, contact: ishaqibrahimbss@gmail.com
diff --git a/__pycache__/getYear.cpython-38.pyc b/__pycache__/getYear.cpython-38.pyc
index caeeca6..789b4a4 100644
Binary files a/__pycache__/getYear.cpython-38.pyc and b/__pycache__/getYear.cpython-38.pyc differ
diff --git a/getYear.py b/getYear.py
index e4f281f..eb97a94 100644
--- a/getYear.py
+++ b/getYear.py
@@ -1,3 +1,7 @@
+"""
+A simple function to extract a YYYY format date from the date string extracted by Grobid
+"""
+
 def get_year(date):
     date = date.split()
 
diff --git a/rename_pdf.py b/rename_pdf.py
index 6bf4e47..c3b572d 100644
--- a/rename_pdf.py
+++ b/rename_pdf.py
@@ -7,6 +7,16 @@ import argparse
 
 
 def process_pdfs(UPLOAD_FOLDER, include_year, concurrency):
+    """
+    Use grobid_client_python to send pdf files to the Grobid Server and save the *.tei.xml files
+    returned inside the "processed_pdfs" folder.
+
+    Two modes:
+
+    - include_year=True -> adds the year of publication to the start of the name of each pdf
+    - include_year=False -> Does not do the above
+    """
+
     client = GrobidClient(config_path="./grobid_client_python/config.json")
 
     if include_year:
@@ -16,6 +26,10 @@ def process_pdfs(UPLOAD_FOLDER, include_year, concurrency):
 
 
 def get_xml_files(pdf_files):
+    """
+    Get the list of corresponding .tei.xml files for each pdf file, once Grobid has completed its
+    processing.
+    """
     xml_files = []
 
     for file in pdf_files:
@@ -28,14 +42,19 @@ def get_xml_files(pdf_files):
 
 
 def get_title(soup, pdfs_folder, include_year):
+    """
+    Get a standardized title from the object returned by BeautifulSoup
+    """
     title = soup.title.getText()
 
-    title = re.sub("[^a-zA-Z ]+", " ", title)
+    title = re.sub("[^a-zA-Z ]+", " ", title) #Remove any characters other than letters
 
     if len(title) > 40:
-        title = title[:40]
+        title = title[:40] #Just take the first 40 characters of the title for the new name of the pdf
 
-    if include_year:
+
+    # Depending on the value of include_year, either add the YoP to the name or not
+    if include_year:
         date = get_year(soup.date.getText())
         if date is not None:
             new_title = os.path.join(pdfs_folder, "(" + date + ")" + " " + title + ".pdf")
@@ -48,6 +67,10 @@ def get_title(soup, pdfs_folder, include_year):
 
 
 def rename_pdfs(UPLOAD_FOLDER, include_year):
+    """
+    Grab all the .tei.xml files, run them through BeautifulSoup to get the new file name, and rename
+    each pdf file.
+    """
     processed_pdfs_folder = "processed_pdfs"
     pdfs_folder = UPLOAD_FOLDER
 
@@ -72,9 +95,15 @@ def rename_pdfs(UPLOAD_FOLDER, include_year):
 
 
 def main(pdf_folder, include_year, n):
+
+    # Make the "processed_pdfs" folder if it doesn't already exist
+    if not os.path.exists("processed_pdfs"):
+        os.mkdir("processed_pdfs")
+
     process_pdfs(pdf_folder, include_year, n) #Process the pdfs using Grobid
     rename_pdfs(pdf_folder, include_year) #Rename by extracting the title and date from xml files
 
+
 if __name__ == "__main__":
     parser = argparse.ArgumentParser()