Update Readme
ishaqibrahimbot committed Jul 1, 2021
1 parent 57ee7d5 commit d8e57e4
Showing 4 changed files with 101 additions and 13 deletions.
75 changes: 65 additions & 10 deletions README.md
@@ -2,22 +2,78 @@

## Why would you want to rename scientific papers?

Well, if you're like me and you spend a lot of time going through research papers, you will have found it quite annoying to figure out which pdf is for which paper once you've
downloaded them all to a single folder.
Well, if you're like me and you've had to download a lot of research papers for literature reviews, you will have found it quite annoying to figure out which PDF is for which paper once you've downloaded them all to a single folder.

You'll have some named as "231514232" while others will have more sophisticated naming but in the end, this is all gibberish to you.

So, I decided to make a web app where you can upload all the pdfs you've downloaded, and get them back in a zipped file with each renamed with the title and year of publication.
So, I decided to make a tool that you can use to rename each PDF with the title and year of publication so that it becomes a lot easier to organize and manage your research papers.

I will also make a simple script to use this application directly from the terminal (in the future).
I originally set out to make a Flask web app using Python where a user can upload all the papers and get them back (renamed) inside a zip file. However, this proved to be too cumbersome and slowed down the processing by a great deal. In the end, I settled on a simple script that can be run from the command line and gives excellent processing speed.

## How to use
## Requirements

### Requirements
This project only works on Linux-based operating systems or macOS, for two reasons: (1) Grobid, the service used for processing PDFs, is not supported on Windows, and (2) Docker is not supported well on Windows either. So if you're a Windows user, I'm sorry :(

You will need the Docker Engine (with Docker Compose) to easily set up and run this application. If you run a Linux-based OS or MacOS, you will find using this application quite easy.
If you use Linux or macOS, proceed with the instructions given below. You'll need the Docker Engine installed, since this project uses a Docker image from Docker Hub.

However, if you're using a version of Windows that does not support Docker, you won't be able to use it :(
## Usage Instructions

### Grab the Grobid image from Docker

Make sure Docker is installed on your system. If so, run the following command to pull the Grobid image:

```
sudo docker pull lfoppiano/grobid:0.6.2
```

### Clone this repo and go inside the directory

```
git clone https://github.com/ishaqibrahimbot/rename-scientific-publications.git
cd rename-scientific-publications
```

### Install the grobid client

```
git clone https://github.com/kermitt2/grobid_client_python.git
cd grobid_client_python
sudo python setup.py install
cd ..
```

All these commands do is clone the repo for the Grobid client, go inside its folder, install the package, and return to the project's root folder.

### Start the Grobid server

```
./start_grobid_server.sh
```

### Run the script

Finally, make a directory named "pdfs" inside the project's root folder and paste all of your PDF files inside it.

Now go back to your terminal (open a new tab since the first one will now be running the Grobid server) and run the following command:

```
python rename_pdf.py
```
The script will start processing your files. If you check the terminal window where the Grobid server is running, you will see the requests being sent and processed sequentially.

Run the above command with the `-h` argument to see all available arguments. If you want your new PDF names to include the year of publication as well, run the following command:

```
python rename_pdf.py --include_year=True
```
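
The `--include_year=True` form suggests that the flag takes an explicit value. The parser definition is not visible in this diff, so the following is only a minimal sketch of one way such a flag could be declared with `argparse`. The `str_to_bool` helper and the `build_parser` function are hypothetical names, not the script's actual code.

```python
import argparse


def str_to_bool(value):
    """Hypothetical helper: turn 'True'/'False' style strings into a bool."""
    lowered = value.strip().lower()
    if lowered in ("true", "yes", "1"):
        return True
    if lowered in ("false", "no", "0"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser():
    # Assumed flag declaration; the real rename_pdf.py may differ.
    parser = argparse.ArgumentParser(description="Rename PDFs by title")
    parser.add_argument(
        "--include_year",
        type=str_to_bool,
        default=False,
        help="prefix each new name with the year of publication",
    )
    return parser
```

A plain `type=bool` would treat any non-empty string, including `False`, as true, which is why the sketch converts the string explicitly.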

Once the processing is finished (it took me about 1-2 minutes to process 10 PDFs with a total size of 9.8 MB), you can go back to the "pdfs" folder and find your PDF files, renamed.

For queries, suggestions, and issues (especially if you find glaring problems in the code), contact: ishaqibrahimbss@gmail.com

## In case you want to check out the Flask app

Although I don't recommend that you use the Flask app (it works fine but is just too slow compared to the command line method), you can still do so by following these instructions.

### Clone the repo

@@ -35,7 +91,7 @@ Go into the cloned directory using your terminal and after that, run the followi
sudo docker-compose up
```

You will notice log messages from two apps: the Flask app and the Grobid server (that this application uses for processing of pdf content).
You will notice log messages from two apps: the Flask app and the Grobid server (that this application uses for processing of PDF content).

### Open the Flask URL

@@ -45,4 +101,3 @@ That's it! Now you can upload as many PDFs as you want and get a zipped file in

Note: The processing might take some time, so please be patient and carry on with your other work. The browser will prompt you automatically to download the zip file once the processing is completed.

For queries, suggestions, and issues, contact: ishaqibrahimbss@gmail.com
Binary file modified __pycache__/getYear.cpython-38.pyc
Binary file not shown.
4 changes: 4 additions & 0 deletions getYear.py
@@ -1,3 +1,7 @@
"""
A simple function to extract a YYYY format date from the date string extracted by Grobid
"""

def get_year(date):
    date = date.split()
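
The rest of `get_year` is not shown in this hunk. As a rough illustration of the idea in the docstring (pulling a YYYY value out of a free-form date string), here is a minimal sketch. The name `extract_year` and the regex approach are assumptions; the original splits the string on whitespace instead. Returning `None` when nothing matches mirrors the `if date is not None` check in `get_title`.

```python
import re


def extract_year(date_text):
    """Hypothetical helper: return the first four-digit year (1900-2099)
    found in date_text, or None if there is none."""
    if not date_text:
        return None
    match = re.search(r"\b(?:19|20)\d{2}\b", date_text)
    return match.group(0) if match else None
```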

35 changes: 32 additions & 3 deletions rename_pdf.py
@@ -7,6 +7,16 @@
import argparse

def process_pdfs(UPLOAD_FOLDER, include_year, concurrency):
    """
    Use grobid_client_python to send pdf files to the Grobid Server and save the *.tei.xml files
    returned inside the "processed_pdfs" folder.
    Two modes:
    - include_year=True -> adds the year of publication to the start of the name of each pdf
    - include_year=False -> Does not do the above
    """

    client = GrobidClient(config_path="./grobid_client_python/config.json")

    if include_year:
@@ -16,6 +26,10 @@ def process_pdfs(UPLOAD_FOLDER, include_year, concurrency):


def get_xml_files(pdf_files):
    """
    Get the list of corresponding .tei.xml files for each pdf file, once Grobid has completed its
    processing.
    """
    xml_files = []

    for file in pdf_files:
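
The loop body is cut off by the hunk. As a minimal sketch of the mapping the docstring describes, assuming each `name.pdf` yields a `name.tei.xml` in the output folder (the helper name `tei_path_for` is hypothetical, and the real loop may build the path differently):

```python
import os


def tei_path_for(pdf_path, output_folder="processed_pdfs"):
    """Hypothetical helper: map 'pdfs/paper.pdf' to
    'processed_pdfs/paper.tei.xml'."""
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    return os.path.join(output_folder, stem + ".tei.xml")
```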
@@ -28,14 +42,19 @@ def get_xml_files(pdf_files):


def get_title(soup, pdfs_folder, include_year):
    """
    Get a standardized title from the object returned by BeautifulSoup
    """

    title = soup.title.getText()
    title = re.sub("[^a-zA-Z ]+", " ", title)
    title = re.sub("[^a-zA-Z ]+", " ", title) # Replace each run of characters other than letters and spaces with a space

    if len(title) > 40:
        title = title[:40]
        title = title[:40] # Just take the first 40 characters of the title for the new name of the pdf

    if include_year:

    # Depending on the value of include_year, either add the YoP to the name or not
    if include_year:
        date = get_year(soup.date.getText())
        if date is not None:
            new_title = os.path.join(pdfs_folder, "(" + date + ")" + " " + title + ".pdf")
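
The naming steps in `get_title` (strip non-letters, cap at 40 characters, optionally prefix the year in parentheses) can be sketched as a standalone function. This is an illustration under assumptions, not the author's code: `build_filename` is a hypothetical name, and the whitespace collapsing and trimming are additions that the original does not perform.

```python
import re


def build_filename(raw_title, year=None, max_len=40):
    """Hypothetical helper mirroring the naming steps in get_title."""
    # Same pattern as the source: anything that is not a letter or a space
    # becomes a space.
    title = re.sub("[^a-zA-Z ]+", " ", raw_title)
    # Assumption (not in the original): collapse repeated spaces and trim.
    title = " ".join(title.split())
    # Keep at most max_len characters, as the original does with 40.
    title = title[:max_len].strip()
    if year is not None:
        return f"({year}) {title}.pdf"
    return f"{title}.pdf"
```

For example, `build_filename("Attention Is All You Need!", "2017")` gives `(2017) Attention Is All You Need.pdf`.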
@@ -48,6 +67,10 @@ def get_title(soup, pdfs_folder, include_year):


def rename_pdfs(UPLOAD_FOLDER, include_year):
    """
    Grab all the .tei.xml files, run them through BeautifulSoup to get the new file name, and rename
    each pdf file.
    """

    processed_pdfs_folder = "processed_pdfs"
    pdfs_folder = UPLOAD_FOLDER
@@ -72,9 +95,15 @@ def rename_pdfs(UPLOAD_FOLDER, include_year):


def main(pdf_folder, include_year, n):

    # Make the "processed_pdfs" folder if it doesn't already exist
    if not os.path.exists("processed_pdfs"):
        os.mkdir("processed_pdfs")

    process_pdfs(pdf_folder, include_year, n) # Process the pdfs using Grobid
    rename_pdfs(pdf_folder, include_year) # Rename by extracting the title and date from xml files


if __name__ == "__main__":

    parser = argparse.ArgumentParser()
