
Jacow Validator's Deployment

April / May 2019

Where

The project is deployed on an Amazon EC2 micro instance running RHEL, currently on the lowest tier.

Getting access

Provide your public SSH key to one of the project developers, who can then add it to the authorized_keys file on the server while SSH'd in themselves (a rough sketch of the commands follows).
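The exact steps aren't recorded here; a minimal sketch of what the developer would run while SSH'd in as ec2-user (the key below is a placeholder, not a real key):

# append the new public key and keep the file's permissions tight
echo "ssh-ed25519 AAAA...placeholder newuser@example" >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys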

Updating the ec2 instance

To update the deployed instance to reflect the latest developments in the master branch:

  1. ssh into the ec2 instance

    ssh ec2-user@ec2-54-187-195-5.us-west-2.compute.amazonaws.com

  2. cd jacow-validator

  3. source gitpull.sh

This is a manual process: because the project had a finite lifetime with a clear end date and subsequent wrap-up, CI/CD pipelines were never established.

gitpull.sh

#sudo iptables -A PREROUTING -t nat -p tcp --dport 80 -j REDIRECT --to-ports 8080
# run this script to update this deployed instance with the latest changes from the github repo by running `source gitpull.sh`
# stop the running service before replacing the code and dependencies
sudo systemctl stop jacow
# fetch the latest changes from the master branch
git pull
# remove the existing virtualenv and reinstall dependencies, skipping the lockfile
pipenv --rm
pipenv install --skip-lock
# bring the service back up
sudo systemctl start jacow

The commented-out iptables command is a remnant from the beginning, when the server was run with the flask run command: the Flask server listened on port 8080 while users tried to access it through port 80, so the rule redirected port 80 traffic to port 8080.

Managing the gunicorn server

The gunicorn server is managed by systemd; the systemctl command is how you interface with it.

stopping the server

sudo systemctl stop jacow

starting the server

sudo systemctl start jacow

viewing the logs

journalctl

or just the last hour's worth:

journalctl --since "1 hour ago"

or constantly follow it and see the latest updates live:

journalctl -f
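Since the unit is named jacow.service, the logs can also be scoped to just this service using journalctl's standard unit filter (not listed in the original commands above, but handy):

journalctl -u jacow --since "1 hour ago"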

How gunicorn is set up with systemd

A file was created in /etc/systemd/system/jacow.service with the following content:

[Unit]
Description=The Jacow web server for verifying research papers
After=network.target
Wants=clean-jacow-docs.service

[Service]
PIDFile=/run/gunicorn/pid
User=ec2-user
WorkingDirectory=/home/ec2-user/jacow-validator
ExecStart=/usr/local/bin/pipenv run gunicorn -w 3 --pid /run/gunicorn/pid --timeout 180 wsgi:app -b 0.0.0.0:8080
Restart=always

[Install]
WantedBy=multi-user.target

The file's mode was then adjusted to rwxr-xr-x, as suggested by online tutorials.

A directory was also created at /run/gunicorn, with similarly open permissions, to ensure the services would have no trouble accessing it.
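The exact commands used were not recorded; a sketch of what the permission and directory setup described above might have looked like (the chown to ec2-user is an assumption, made so the pid file can be written):

sudo chmod 755 /etc/systemd/system/jacow.service
sudo mkdir -p /run/gunicorn
# gunicorn runs as ec2-user and writes its pid file here
sudo chown ec2-user /run/gunicorn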

That file declares (via Restart=always) that the service should be restarted if it ever goes down, so systemd now ensures this happens automatically for us.

It also declares (via ExecStart) what the server run command is: the gunicorn command.
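After creating or editing the unit file, systemd needs to be told about it, and the service enabled so the WantedBy=multi-user.target line takes effect at boot. The exact commands run at the time weren't recorded, but the usual sequence is:

sudo systemctl daemon-reload
sudo systemctl enable jacow
sudo systemctl start jacow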

Cleaning up after a crash

In the unit above, the line Wants=clean-jacow-docs.service declares that the 'clean-jacow-docs' service should be run whenever the jacow service is started.

There was an issue where, if the server crashed (or was killed for using too much memory), any documents it was working with were not cleared from the folder they are stored in when uploaded by a user. Subsequent attempts to upload the same document would then save it with _1 appended so as not to overwrite the existing file. The service and script below address this buildup of docx files.

clean-jacow-docs.service in /etc/systemd/system/

[Unit]
Description=Service to run script before starting jacow service to clear out remnant docx files after a previous jacow crash

[Service]
User=ec2-user
WorkingDirectory=/home/ec2-user
ExecStart=/bin/bash clean-jacow-docs.sh

[Install]
WantedBy=jacow.service
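To confirm the cleanup service actually ran on the last jacow start, its status and recent logs can be checked (a quick check, not part of the original notes):

systemctl status clean-jacow-docs
journalctl -u clean-jacow-docs --since "1 hour ago"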

clean-jacow-docs.sh in /home/ec2-user

#!/bin/bash
#
# This script removes docx files from the jacow server's upload directory after
# logging them to journalctl. It is run by an associated systemd service
# whenever jacow.service is started up. This ensures that if the jacow service
# is killed, the document folder is cleared out before it is restarted;
# otherwise a user's subsequent attempt to upload a document with the same
# filename would be automatically renamed with _1 appended and would then fail
# the SPMS check.
#
#
# bail out if the upload directory is missing rather than deleting files elsewhere
cd /var/tmp/document || exit 1
for filename in *.docx; do
  [ -e "$filename" ] || continue
  echo "$filename removed from cache before restarting service" | systemd-cat -p err -t jacow
  rm "$filename"
done
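The removals are logged with the jacow tag via systemd-cat, so they can be reviewed afterwards (a quick check, not part of the original notes):

journalctl -t jacow --since "1 day ago"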

A note on the chosen gunicorn settings

The --timeout was set to 180 seconds (3 minutes) to ensure that if processing an uploaded document took longer than the default 30 seconds, the gunicorn worker would not be killed and replaced.

The number of workers (-w 3) was set after stumbling into memory usage issues when attempting to run 20 workers, and subsequently finding the suggested rule of thumb for gunicorn of 1 worker plus 2 workers for every CPU; we only have one virtual CPU, which gives 3.
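As a sanity check of that rule of thumb, the recommended count can be computed from the instance's CPU count (a small sketch, not from the original notes):

# (2 x number of CPUs) + 1; on this single-vCPU instance this prints 3
echo $((2 * $(nproc) + 1))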

Excerpt from the journalctl logs from the last time memory issues happened, before the change that dropped the number of workers down to 3:

May 10 15:38:53 ip-172-31-20-79.us-west-2.compute.internal kernel: Out of memory: Kill process 8470 (gunicorn) score 63 or sacrifice child
May 10 15:38:53 ip-172-31-20-79.us-west-2.compute.internal kernel: Killed process 8470 (gunicorn) total-vm:340852kB, anon-rss:63568kB, file-rss:296kB, shmem-rss:0kB

Hopefully this will be sufficient, given that the gunicorn documentation states:

Gunicorn should only need 4-12 worker processes to handle hundreds or thousands of requests per second.

SPMS csv file

A cron job was set up to run every hour, automatically downloading the CSV file from the official JACoW website and keeping a local copy for this tool to use when comparing crucial information between uploaded documents and the CSV file.
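The crontab entry itself isn't reproduced here; a hypothetical sketch of an hourly download (the URL and destination path below are placeholders, not the real values used on the instance):

# m h dom mon dow  command
0 * * * * curl -sSf -o /home/ec2-user/jacow-validator/spms.csv https://example.org/spms/references.csv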

Issues encountered

The RHEL EC2 instance we have running appears to have an issue with its yum package manager: yum has trouble contacting its repositories. This is unresolved at the time of writing, though research revealed a couple of possible solutions.

The python3 that comes with the EC2 instance appears to be missing the sqlite module that should ship with Python 3. As a result we could not use an SQLite database and would get errors to the effect of no module named _sqlite3.
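A quick way to reproduce the check for the missing module (a minimal one-liner, not part of the original notes):

python3 -c "import sqlite3" || echo "this Python build is missing _sqlite3"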