8 changes: 8 additions & 0 deletions .gitignore
@@ -1,5 +1,6 @@
.DS_Store
*~
*#
*.pem
deploy/janrain.key
deploy/GITHUB_CLIENT_SECRET
@@ -10,3 +11,10 @@ deploy/setup/CONFIG
*.out
*.jar
*.tmp
asterales/ott
asterales/subset
asterales/synth/Source_nexsons
asterales/taxomachine.db
asterales/treemachine.db
trees_report/shards
trees_report/work
153 changes: 153 additions & 0 deletions files-server/README.md
@@ -0,0 +1,153 @@
# Setting up and maintaining the 'files server' on Amazon S3

This site is maintained on the assumption that files (other than
`index.html` and other 'glue' files) are not updated in place. When a
new file or directory is added, it gets a new name that includes a
version number or date.
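
For example, successive releases live side by side rather than
replacing one another (the directory names here are illustrative):

```
files.opentreeoflife.org/ott/ott2.9/
files.opentreeoflife.org/ott/ott3.0/
files.opentreeoflife.org/synthesis/synthesis9.0/
```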

## Access to S3

Make sure you have an AWS account with rights to the
files.opentreeoflife.org S3 bucket, and set up the `~/.aws/credentials`
and `~/.aws/config` files per the instructions at Amazon.
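
The two files follow the standard AWS CLI layout and look something
like this (the key values and region below are placeholders):

```
# ~/.aws/credentials
[default]
aws_access_key_id = AKIA...
aws_secret_access_key = ...

# ~/.aws/config
[default]
region = us-east-1
```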

Install the `aws` command-line tool per Amazon's instructions. The
following is what I used on a MacBook:

```
pip install --upgrade --user awscli
export PYTHON_BIN=~/Library/Python/2.7/bin
```

## Set up http://files.opentreeoflife.org web hosting at S3

This involves:
1. configuring the bucket for hosting
1. setting up name service for the files.opentreeoflife.org subdomain
with Amazon
1. establishing an A record for the subdomain using Amazon's Route 53
'alias' feature
1. updating host records at Namecheap.

The instructions on the AWS site are pretty good.
Don't add an A record at Namecheap; only add a set of NS records as
directed. Note that we are _not_ doing a domain (registrar) transfer, only
subdomain name server redirection. The instructions do not cover this
case, but if you have an understanding of DNS, it's not hard.
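
For illustration, the resulting host records at Namecheap would look
roughly like this (the `awsdns` server names below are placeholders;
use the four name servers listed in your Route 53 hosted zone):

```
Type  Host   Value
NS    files  ns-123.awsdns-10.com.
NS    files  ns-456.awsdns-20.net.
NS    files  ns-789.awsdns-30.org.
NS    files  ns-1012.awsdns-40.co.uk.
```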

## Keeping track of metadata

Copy files to S3 using the `aws s3 cp` and `aws s3 sync` commands.
An unfortunate aspect of these commands is that
they do not preserve file modification date/time metadata. It is
useful to get the file write dates correct in each directory's
index.html. Therefore a script is provided for maintaining this
metadata.

The write dates for each directory are stored in a file
`.write_dates.json` in that directory, which is updated using the
`capture_write_dates.py` script in either of two ways:

* `.write_dates.json` can be generated afresh from a local mirror of
a directory on the files.opentreeoflife.org web site, when it is
absent or when the `--refresh` flag is given.

* An existing `.write_dates.json` can be updated from a local mirror
directory, even if the mirror does not contain the entire contents
of the directory. Metadata for unmirrored files will be carried
forward, and metadata for local files will be added or updated as
necessary.

In either case, if a file's size has not changed, its write date in
`.write_dates.json` is left unmodified.
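
Each entry in `.write_dates.json` carries the fields that
`capture_write_dates.py` records: a write date, a size (used to detect
changes), a directory flag, and the optional timestamp. A typical file
looks something like this (names and values are illustrative):

```
{
 "ott3.0": {"date": "2016-09-01", "size": 4096,
            "directory": true, "timestamp": 1472745600.0},
 "ott3.0.tgz": {"date": "2016-09-01", "size": 12345678,
                "directory": false, "timestamp": 1472745600.0}
}
```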

To generate all of the `.write_dates.json` files locally, based on a
local mirror or previous instantiation of the site (e.g. the one on
varela.csail.mit.edu):

```
cd files.opentreeoflife.org # local mirror
find . -type d -exec python {path-to-here}/capture_write_dates.py {} \;
```

where `{path-to-here}` is the path to the directory containing this
README.
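
The script can also be run on a single directory. With the `--refresh`
flag, any existing `.write_dates.json` is ignored and the metadata is
rebuilt from scratch:

```
python {path-to-here}/capture_write_dates.py --refresh ott/ott3.0
```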

## Creating index.html files

S3 does not automatically generate directory indexes, so we have to do
this explicitly. One solution is
[here](https://github.com/rufuspollock/s3-bucket-listing) but the
difficulty is that the wrong file modification dates will be
shown - the script shows the dates stored on S3, but we want the
actual dates from the origin of the file.

The `prepare_index.py` script creates an `index.html` file from the
data in `.write_dates.json` (see above).

To generate all of the index.html files:

```
cd files.opentreeoflife.org # local mirror
find . -type d -exec python {path-to-here}/prepare_index.py {} \;
```

If an `index.html` already exists that was not generated by the
script, it is not overwritten.
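
A generated index begins with an `AUTOGENERATED` marker (which is how
the script recognizes its own output), followed by a breadcrumb line
and one line per file. Abbreviated, and with illustrative names and
dates, it looks like this:

```
<!-- AUTOGENERATED by prepare_index.py -->
<pre>
<a href="..">.</a> > ott

<a href="ott3.0/">ott3.0/</a>     2016-09-01 
<a href="ott3.0.tgz">ott3.0.tgz</a>  2016-09-01  12,345,678
</pre>
```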

## Populating S3

To copy all local files (mirror or varela.csail.mit.edu) to S3:

```
cd files.opentreeoflife.org # local mirror
$PYTHON_BIN/aws s3 sync . s3://files.opentreeoflife.org/
```

`sync` documentation is [here](http://docs.aws.amazon.com/cli/latest/reference/s3/sync.html).

Copying the full 11G can take a long time. If interrupted, the `sync`
can simply be rerun; files that have already been copied are not
copied again.
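
To preview what a `sync` would transfer without copying anything, add
the `--dryrun` flag:

```
$PYTHON_BIN/aws s3 sync . s3://files.opentreeoflife.org/ --dryrun
```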

## Maintenance

When there are new files to copy to S3 (e.g. a taxonomy or synthesis
release), proceed as follows (a worked example appears after the
list):

1. Place the new file(s) correctly in a local mirror of the site.
   (This is not strictly necessary, but it makes things easier.)
1. If the `index.html` files are to be autogenerated, then
   1. Create `.write_dates.json` files for any new directories
   1. Ensure that `.write_dates.json` is up to date for each directory
      that will be changing (fetching it with `aws s3 cp` first if
      necessary)
   1. Update `.write_dates.json` to account for the changing directories
   1. Run `prepare_index.py` for new and changing directories
1. Copy the new file(s), as well as the updated `index.html` and
   `.write_dates.json` files, to S3 using `aws s3 cp` or `aws s3 sync`.
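
For instance, adding a hypothetical new OTT release directory
`ott/ott3.1` might look like this:

```
cd files.opentreeoflife.org    # local mirror, with ott/ott3.1 already in place
python {path-to-here}/capture_write_dates.py ott/ott3.1
python {path-to-here}/capture_write_dates.py ott    # the parent changed too
python {path-to-here}/prepare_index.py ott/ott3.1
python {path-to-here}/prepare_index.py ott
$PYTHON_BIN/aws s3 sync ott/ott3.1 s3://files.opentreeoflife.org/ott/ott3.1
$PYTHON_BIN/aws s3 cp ott/index.html s3://files.opentreeoflife.org/ott/index.html
$PYTHON_BIN/aws s3 cp ott/.write_dates.json s3://files.opentreeoflife.org/ott/.write_dates.json
```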

For deletion, do it manually using `aws s3 rm`, `aws s3 rm ...
--recursive`, or `aws s3 sync ... --delete`, and then edit or
regenerate the affected `.write_dates.json` and/or `index.html` files.

Don't worry too much about the `"timestamp"` field in `.write_dates.json`.
It's not currently used for anything and can be omitted; the write
date is stored in the separate `"date"` field.

## Redirections

Amazon's documentation on configuring redirects is [here](http://docs.aws.amazon.com/AmazonS3/latest/dev/how-to-page-redirect.html).

To make OTT version 3.0 the current version of OTT:

```
echo foo >dummy.tmp
$PYTHON_BIN/aws s3 cp dummy.tmp s3://files.opentreeoflife.org/ott/current --website-redirect /ott/ott3.0
```

Similarly for synthesis 9.0:

```
echo foo >dummy.tmp
$PYTHON_BIN/aws s3 cp dummy.tmp s3://files.opentreeoflife.org/synthesis/current --website-redirect /synthesis/synthesis9.0
```
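
To check that a redirect is in place, point `curl` at the website
endpoint; the response should be a `301` whose `Location` header names
the target:

```
curl -I http://files.opentreeoflife.org/ott/current
```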

62 changes: 62 additions & 0 deletions files-server/capture_write_dates.py
@@ -0,0 +1,62 @@
# Get write dates for all files in a directory D, and put them
# in a JSON file as D/.write_dates.json

import sys, os, datetime, json, argparse

dotfile = '.write_dates.json'

def capture_write_dates(dirname, refreshp):

    # Leave git metadata directories alone
    if '.git' in dirname:
        sys.exit(0)

    dotpath = os.path.join(dirname, dotfile)

    # Unless --refresh was given, start from the existing metadata file
    oldmeta = {}
    if not refreshp:
        if os.path.exists(dotpath):
            with open(dotpath, 'r') as infile:
                oldmeta = json.load(infile)

    newmeta = {}
    for f in os.listdir(dirname):

        if f.startswith('.') or f.endswith('~'):
            continue

        path = os.path.join(dirname, f)
        s = os.stat(path)

        # If the size is unchanged, assume the file is unchanged and
        # keep its old write date
        if f in oldmeta:
            have = oldmeta[f]
            if have['size'] == s.st_size:
                newmeta[f] = have
                continue
            else:
                print 'File changed: %s' % f
        else:
            print 'New file: %s' % f

        t = s.st_mtime
        d = datetime.datetime.utcfromtimestamp(t)
        ymd = '%04d-%02d-%02d' % (d.year, d.month, d.day)
        newmeta[f] = {'timestamp': t, 'date': ymd, 'size': s.st_size,
                      'directory': os.path.isdir(path)}

    # Carry forward metadata for files not present in the local mirror
    for f in oldmeta:
        if not f in newmeta:
            print 'Carrying over metadata for: %s' % f
            newmeta[f] = oldmeta[f]

    # Write out the new metadata
    with open(dotpath, 'w') as outfile:
        print 'Writing %s' % dotpath
        json.dump(newmeta, outfile, indent=1)

if __name__ == '__main__':
    argp = argparse.ArgumentParser()
    argp.add_argument('--refresh', dest='refresh', action='store_true')
    argp.add_argument('dirname')
    args = argp.parse_args()
    capture_write_dates(args.dirname, args.refresh)

18 changes: 18 additions & 0 deletions files-server/migrate.sh
@@ -0,0 +1,18 @@
#!/bin/sh

# Usage: migrate.sh <path>
# Copy a local file or directory to the corresponding location on S3.

set -e

path=$1

# The following assumes you're on a Mac and have done
#   pip install --upgrade --user awscli
[ -z "$PYTHON_BIN" ] && PYTHON_BIN=~/Library/Python/2.7/bin

if [ ! -e "$path" ]; then
    echo "Cannot find $path"
    exit 1
fi

echo "Would copy $path to s3://files.opentreeoflife.org/$path (dry run)"

# Uncomment to perform the copy for real:
# $PYTHON_BIN/aws s3 cp "$path" "s3://files.opentreeoflife.org/$path"

94 changes: 94 additions & 0 deletions files-server/prepare_index.py
@@ -0,0 +1,94 @@
# python prepare_index.py <directorypath>

# Generate an index.html file for directory D, based on the write-date
# metadata stored in D/.write_dates.json

import sys, os, json

def main(dirname):

    # Leave git metadata directories alone
    if '.git' in dirname:
        sys.exit(0)

    json_path = os.path.join(dirname, '.write_dates.json')

    if not os.path.exists(json_path):
        print >>sys.stderr, 'Metadata file %s not found' % json_path
        sys.exit(1)

    index_path = os.path.join(dirname, 'index.html')

    # Never overwrite a hand-written index.html
    if os.path.exists(index_path) and not autogenerated(index_path):
        print >>sys.stderr, 'Index file %s already exists' % index_path
        sys.exit(1)

    with open(json_path, 'r') as infile:
        dates = json.load(infile)

    # Column width for aligning the dates after the file names
    longest = 0
    for f in dates:
        longest = max(longest, len(f))

    # Look for excess files (relative to metadata blob)
    for g in os.listdir(dirname):
        if keep(g) and not g in dates:
            print >>sys.stderr, \
                'Warning: File %s not present in .write_dates.json' % os.path.join(dirname, g)

    with open(index_path, 'w') as outfile:
        print 'Writing %s' % index_path
        outfile.write('<!-- AUTOGENERATED by prepare_index.py -->\n')
        outfile.write('<pre>\n')

        # Breadcrumb trail: each ancestor directory links to its own index
        components = dirname.split('/')
        dots = ['..' for component in components[1:]]
        for component in components:
            if len(dots) > 0:
                outfile.write('<a href="{}">{}</a> > '.format('/'.join(dots), component))
                dots = dots[1:]
            else:
                outfile.write(component)
        outfile.write('\n\n')

        # One line per file: link, padding, write date, size
        for f in sorted(dates.keys()):
            if not keep(f): continue
            f_path = os.path.join(dirname, f)
            # Check for missing files (relative to metadata blob)
            if not os.path.exists(f_path):
                print >>sys.stderr, 'Warning: File %s does not exist any more' % f_path
            blob = dates[f]
            if blob['directory']:
                size = ''
                f = f + '/'
            else:
                size = '{:>11,}'.format(blob['size'])
            outfile.write('<a href="{}">{}</a>{} {} {}\n' \
                          .format(f, f, (' ' * (longest - len(f))), blob['date'], size))
        outfile.write('</pre>\n')

        # Append the directory's README, if it has one
        if not maybe_append(os.path.join(dirname, 'README.html'), outfile):
            maybe_append(os.path.join(dirname, 'README'), outfile)

def keep(name):
    # Files that should not be listed in the index
    if name.startswith('.') and name != '..': return False
    if name.endswith('~'): return False
    if name == 'index.html': return False
    if name == 'README.md': return False
    return True

def maybe_append(path, outfile):
    # Copy the file at 'path' into the index, wrapped in <pre> unless
    # it is already HTML; return True if the file existed
    if os.path.exists(path):
        pre = not path.endswith('.html')
        if pre: outfile.write('<pre>\n')
        with open(path, 'r') as infile:
            for line in infile:
                outfile.write(line)
        if pre: outfile.write('</pre>\n')
        return True
    else:
        return False

def autogenerated(path):
    # Recognize index.html files written by this script
    with open(path, 'r') as infile:
        return 'AUTOGENERATED' in infile.readline()

if __name__ == '__main__':
    main(sys.argv[1])