8 changes: 8 additions & 0 deletions .gitignore
@@ -1,5 +1,6 @@
.DS_Store
*~
*#
*.pem
deploy/janrain.key
deploy/GITHUB_CLIENT_SECRET
@@ -10,3 +11,10 @@ deploy/setup/CONFIG
*.out
*.jar
*.tmp
asterales/ott
asterales/subset
asterales/synth/Source_nexsons
asterales/taxomachine.db
asterales/treemachine.db
trees_report/shards
trees_report/work
153 changes: 153 additions & 0 deletions files-server/README.md
@@ -0,0 +1,153 @@
# Setting up and maintaining the 'files server' on Amazon S3

This site is maintained on the assumption that files (other than
`index.html` and other 'glue' files) are not updated in place. When a
new file or directory is added, it gets a new name that includes a
version number or date.
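
For example, successive releases live side by side rather than
replacing one another (the directory names here are illustrative):

```
files.opentreeoflife.org/ott/ott2.9/
files.opentreeoflife.org/ott/ott3.0/
files.opentreeoflife.org/synthesis/synthesis9.0/
```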

## Access to S3

Make sure you have an AWS account with rights to the
files.opentreeoflife.org S3 bucket, and set up the `~/.aws/credentials`
and `~/.aws/config` files per the instructions at Amazon.
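
The two files follow the standard AWS CLI layout and look something
like this (the key values and region below are placeholders):

```
# ~/.aws/credentials
[default]
aws_access_key_id = AKIA...
aws_secret_access_key = ...

# ~/.aws/config
[default]
region = us-east-1
```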

Install the `aws` command-line tool per Amazon's instructions. The
following is what I used on a MacBook:

```
pip install --upgrade --user awscli
export PYTHON_BIN=~/Library/Python/2.7/bin
```

## Set up http://files.opentreeoflife.org web hosting at S3

This involves:
1. configuring the bucket for hosting
1. setting up name service for the files.opentreeoflife.org subdomain
with Amazon
1. establishing an A record for the subdomain using Amazon's Route 53
'alias' feature
1. updating host records at Namecheap.

The instructions on the AWS site are pretty good.
Don't add an A record at Namecheap; only add a set of NS records as
directed. Note that we are _not_ doing a domain (registrar) transfer, only
subdomain name server redirection. The instructions do not cover this
case, but if you have an understanding of DNS, it's not hard.
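
For illustration, the resulting host records at Namecheap would look
roughly like this (the `awsdns` server names below are placeholders;
use the four name servers listed in your Route 53 hosted zone):

```
Type  Host   Value
NS    files  ns-123.awsdns-10.com.
NS    files  ns-456.awsdns-20.net.
NS    files  ns-789.awsdns-30.org.
NS    files  ns-1012.awsdns-40.co.uk.
```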

## Keeping track of metadata

Copy files to S3 using the `aws s3 cp` and `aws s3 sync` commands.
An unfortunate aspect of these commands is that
they do not preserve file modification date/time metadata. It is
useful to get the file write dates correct in each directory's
index.html. Therefore a script is provided for maintaining this
metadata.

The write dates for each directory are stored in a file
`.write_dates.json` in that directory, which is updated using the
`capture_write_dates.py` script in either of two ways:

* `.write_dates.json` can be generated afresh from a local mirror of
a directory on the files.opentreeoflife.org web site, when it is
absent or when the `--refresh` flag is given.

* An existing `.write_dates.json` can be updated from a local mirror
directory, even if the mirror does not contain the entire contents
of the directory. Metadata for unmirrored files will be carried
forward, and metadata for local files will be added or updated as
necessary.

In either case, if a file's size has not changed, its write date in
`.write_dates.json` is left unmodified.
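
Each entry in `.write_dates.json` carries the fields that
`capture_write_dates.py` records: a write date, a size (used to detect
changes), a directory flag, and the optional timestamp. A typical file
looks something like this (names and values are illustrative):

```
{
 "ott3.0": {"date": "2016-09-01", "size": 4096,
            "directory": true, "timestamp": 1472745600.0},
 "ott3.0.tgz": {"date": "2016-09-01", "size": 12345678,
                "directory": false, "timestamp": 1472745600.0}
}
```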

To generate all of the `.write_dates.json` files locally, based on a
local mirror or previous instantiation of the site (e.g. the one on
varela.csail.mit.edu):

```
cd files.opentreeoflife.org # local mirror
find . -type d -exec python {path-to-here}/capture_write_dates.py {} \;
```

where `{path-to-here}` is the path to the directory containing this
README.
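
The script can also be run on a single directory. With the `--refresh`
flag, any existing `.write_dates.json` is ignored and the metadata is
rebuilt from scratch:

```
python {path-to-here}/capture_write_dates.py --refresh ott/ott3.0
```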

## Creating index.html files

S3 does not automatically generate directory indexes, so we have to do
this explicitly. One solution is
[here](https://github.com/rufuspollock/s3-bucket-listing) but the
difficulty is that the wrong file modification dates will be
shown - the script shows the dates stored on S3, but we want the
actual dates from the origin of the file.

The `prepare_index.py` script creates an `index.html` file from the
data in `.write_dates.json` (see above).

To generate all of the index.html files:

```
cd files.opentreeoflife.org # local mirror
find . -type d -exec python {path-to-here}/prepare_index.py {} \;
```

If an `index.html` already exists that was not generated by the
script, it is not overwritten.
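
A generated index begins with an `AUTOGENERATED` marker (which is how
the script recognizes its own output), followed by a breadcrumb line
and one line per file. Abbreviated, and with illustrative names and
dates, it looks like this:

```
<!-- AUTOGENERATED by prepare_index.py -->
<pre>
<a href="..">.</a> > ott

<a href="ott3.0/">ott3.0/</a>     2016-09-01 
<a href="ott3.0.tgz">ott3.0.tgz</a>  2016-09-01  12,345,678
</pre>
```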

## Populating S3

To copy all local files (mirror or varela.csail.mit.edu) to S3:

```
cd files.opentreeoflife.org # local mirror
$PYTHON_BIN/aws s3 sync . s3://files.opentreeoflife.org/
```

`sync` documentation is [here](http://docs.aws.amazon.com/cli/latest/reference/s3/sync.html).

Copying the full 11G can take a long time. If interrupted, the `sync`
can simply be rerun; files that have already been copied are not
copied again.
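
To preview what a `sync` would transfer without copying anything, add
the `--dryrun` flag:

```
$PYTHON_BIN/aws s3 sync . s3://files.opentreeoflife.org/ --dryrun
```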

## Maintenance

When there are new files to copy to S3 (e.g. a taxonomy or synthesis
release), proceed as follows (a worked example appears after the
list):

1. Place the new file(s) correctly in a local mirror of the site.
   (This is not strictly necessary, but it makes things easier.)
1. If the `index.html` files are to be autogenerated, then
   1. Create `.write_dates.json` files for any new directories
   1. Ensure that `.write_dates.json` is up to date for each directory
      that will be changing (fetching it with `aws s3 cp` first if
      necessary)
   1. Update `.write_dates.json` to account for the changing directories
   1. Run `prepare_index.py` for new and changing directories
1. Copy the new file(s), as well as the updated `index.html` and
   `.write_dates.json` files, to S3 using `aws s3 cp` or `aws s3 sync`.
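
For instance, adding a hypothetical new OTT release directory
`ott/ott3.1` might look like this:

```
cd files.opentreeoflife.org    # local mirror, with ott/ott3.1 already in place
python {path-to-here}/capture_write_dates.py ott/ott3.1
python {path-to-here}/capture_write_dates.py ott    # the parent changed too
python {path-to-here}/prepare_index.py ott/ott3.1
python {path-to-here}/prepare_index.py ott
$PYTHON_BIN/aws s3 sync ott/ott3.1 s3://files.opentreeoflife.org/ott/ott3.1
$PYTHON_BIN/aws s3 cp ott/index.html s3://files.opentreeoflife.org/ott/index.html
$PYTHON_BIN/aws s3 cp ott/.write_dates.json s3://files.opentreeoflife.org/ott/.write_dates.json
```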

For deletion, do it manually using `aws s3 rm`, `aws s3 rm ...
--recursive`, or `aws s3 sync ... --delete`, and then edit or
regenerate the affected `.write_dates.json` and/or `index.html` files.

Don't worry too much about the `"timestamp"` field in `.write_dates.json`.
It's not currently used for anything and can be omitted; the write
date is stored in the separate `"date"` field.

## Redirections

Amazon's documentation on configuring redirects is [here](http://docs.aws.amazon.com/AmazonS3/latest/dev/how-to-page-redirect.html).

To make OTT version 3.0 the current version of OTT:

```
echo foo >dummy.tmp
$PYTHON_BIN/aws s3 cp dummy.tmp s3://files.opentreeoflife.org/ott/current --website-redirect /ott/ott3.0
```

Similarly for synthesis 9.0:

```
echo foo >dummy.tmp
$PYTHON_BIN/aws s3 cp dummy.tmp s3://files.opentreeoflife.org/synthesis/current --website-redirect /synthesis/synthesis9.0
```
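
To check that a redirect is in place, point `curl` at the website
endpoint; the response should be a `301` whose `Location` header names
the target:

```
curl -I http://files.opentreeoflife.org/ott/current
```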

62 changes: 62 additions & 0 deletions files-server/capture_write_dates.py
@@ -0,0 +1,62 @@
# Get write dates for all files in a directory D, and put them
# in a JSON file as D/.write_dates.json

import sys, os, datetime, json, argparse

dotfile = '.write_dates.json'

def capture_write_dates(dirname, refreshp):

    # Leave git metadata directories alone
    if '.git' in dirname:
        sys.exit(0)

    dotpath = os.path.join(dirname, dotfile)

    # Unless --refresh was given, start from the existing metadata file
    oldmeta = {}
    if not refreshp:
        if os.path.exists(dotpath):
            with open(dotpath, 'r') as infile:
                oldmeta = json.load(infile)

    newmeta = {}
    for f in os.listdir(dirname):

        if f.startswith('.') or f.endswith('~'):
            continue

        path = os.path.join(dirname, f)
        s = os.stat(path)

        # If the size is unchanged, assume the file is unchanged and
        # keep its old write date
        if f in oldmeta:
            have = oldmeta[f]
            if have['size'] == s.st_size:
                newmeta[f] = have
                continue
            else:
                print 'File changed: %s' % f
        else:
            print 'New file: %s' % f

        t = s.st_mtime
        d = datetime.datetime.utcfromtimestamp(t)
        ymd = '%04d-%02d-%02d' % (d.year, d.month, d.day)
        newmeta[f] = {'timestamp': t, 'date': ymd, 'size': s.st_size,
                      'directory': os.path.isdir(path)}

    # Carry forward metadata for files not present in the local mirror
    for f in oldmeta:
        if not f in newmeta:
            print 'Carrying over metadata for: %s' % f
            newmeta[f] = oldmeta[f]

    # Write out the new metadata
    with open(dotpath, 'w') as outfile:
        print 'Writing %s' % dotpath
        json.dump(newmeta, outfile, indent=1)

if __name__ == '__main__':
    argp = argparse.ArgumentParser()
    argp.add_argument('--refresh', dest='refresh', action='store_true')
    argp.add_argument('dirname')
    args = argp.parse_args()
    capture_write_dates(args.dirname, args.refresh)

18 changes: 18 additions & 0 deletions files-server/migrate.sh
@@ -0,0 +1,18 @@
#!/bin/sh

# Usage: migrate.sh <path>
# Copy a local file or directory to the corresponding location on S3.

set -e

path=$1

# The following assumes you're on a Mac and have done
#   pip install --upgrade --user awscli
[ -z "$PYTHON_BIN" ] && PYTHON_BIN=~/Library/Python/2.7/bin

if [ ! -e "$path" ]; then
    echo "Cannot find $path"
    exit 1
fi

echo "Would copy $path to s3://files.opentreeoflife.org/$path (dry run)"

# Uncomment to perform the copy for real:
# $PYTHON_BIN/aws s3 cp "$path" "s3://files.opentreeoflife.org/$path"

94 changes: 94 additions & 0 deletions files-server/prepare_index.py
@@ -0,0 +1,94 @@
# python prepare_index.py <directorypath>

# Generate an index.html file for directory D, based on the write-date
# metadata stored in D/.write_dates.json

import sys, os, json

def main(dirname):

    # Leave git metadata directories alone
    if '.git' in dirname:
        sys.exit(0)

    json_path = os.path.join(dirname, '.write_dates.json')

    if not os.path.exists(json_path):
        print >>sys.stderr, 'Metadata file %s not found' % json_path
        sys.exit(1)

    index_path = os.path.join(dirname, 'index.html')

    # Never overwrite a hand-written index.html
    if os.path.exists(index_path) and not autogenerated(index_path):
        print >>sys.stderr, 'Index file %s already exists' % index_path
        sys.exit(1)

    with open(json_path, 'r') as infile:
        dates = json.load(infile)

    # Column width for aligning the dates after the file names
    longest = 0
    for f in dates:
        longest = max(longest, len(f))

    # Look for excess files (relative to metadata blob)
    for g in os.listdir(dirname):
        if keep(g) and not g in dates:
            print >>sys.stderr, \
                'Warning: File %s not present in .write_dates.json' % os.path.join(dirname, g)

    with open(index_path, 'w') as outfile:
        print 'Writing %s' % index_path
        outfile.write('<!-- AUTOGENERATED by prepare_index.py -->\n')
        outfile.write('<pre>\n')

        # Breadcrumb trail: each ancestor directory links to its own index
        components = dirname.split('/')
        dots = ['..' for component in components[1:]]
        for component in components:
            if len(dots) > 0:
                outfile.write('<a href="{}">{}</a> > '.format('/'.join(dots), component))
                dots = dots[1:]
            else:
                outfile.write(component)
        outfile.write('\n\n')

        # One line per file: link, padding, write date, size
        for f in sorted(dates.keys()):
            if not keep(f): continue
            f_path = os.path.join(dirname, f)
            # Check for missing files (relative to metadata blob)
            if not os.path.exists(f_path):
                print >>sys.stderr, 'Warning: File %s does not exist any more' % f_path
            blob = dates[f]
            if blob['directory']:
                size = ''
                f = f + '/'
            else:
                size = '{:>11,}'.format(blob['size'])
            outfile.write('<a href="{}">{}</a>{} {} {}\n' \
                          .format(f, f, (' ' * (longest - len(f))), blob['date'], size))
        outfile.write('</pre>\n')

        # Append the directory's README, if it has one
        if not maybe_append(os.path.join(dirname, 'README.html'), outfile):
            maybe_append(os.path.join(dirname, 'README'), outfile)

def keep(name):
    # Files that should not be listed in the index
    if name.startswith('.') and name != '..': return False
    if name.endswith('~'): return False
    if name == 'index.html': return False
    if name == 'README.md': return False
    return True

def maybe_append(path, outfile):
    # Copy the file at 'path' into the index, wrapped in <pre> unless
    # it is already HTML; return True if the file existed
    if os.path.exists(path):
        pre = not path.endswith('.html')
        if pre: outfile.write('<pre>\n')
        with open(path, 'r') as infile:
            for line in infile:
                outfile.write(line)
        if pre: outfile.write('</pre>\n')
        return True
    else:
        return False

def autogenerated(path):
    # Recognize index.html files written by this script
    with open(path, 'r') as infile:
        return 'AUTOGENERATED' in infile.readline()

if __name__ == '__main__':
    main(sys.argv[1])