Merge pull request #63 from BinPro/develop
Version 0.2.2 released.
alneberg committed Apr 15, 2014
2 parents 4f4685e + f46e629 commit 65376c0
Showing 6 changed files with 189 additions and 111 deletions.
17 changes: 10 additions & 7 deletions .travis.yml
@@ -7,15 +7,17 @@ python:
virtualenv:
system_site_packages: true
before_install:
#Uses miniconda installation of scientific python packages instead of building from source
#or using old versions supplied by apt-get. Source: https://gist.github.com/dan-blanchard/7045057
- if [ ${TRAVIS_PYTHON_VERSION:0:1} == "2" ]; then wget http://repo.continuum.io/miniconda/Miniconda-3.3.0-Linux-x86_64.sh -O miniconda.sh; else wget http://repo.continuum.io/miniconda/Miniconda3-3.3.0-Linux-x86_64.sh -O miniconda.sh; fi
- chmod +x miniconda.sh
- ./miniconda.sh -b
- export PATH=/home/travis/miniconda/bin:$PATH
- conda update --yes conda
- sudo apt-get update -qq
- sudo apt-get install -qq cython libatlas-dev liblapack-dev gfortran python-numpy python-scipy python-biopython build-essential libgsl0-dev
- sudo apt-get install -qq build-essential libgsl0-dev
install:
#Test with packages from binary install, takes a long time to build numpy and scipy
# - pip install -q -U numpy --use-mirrors
# - pip install scipy
# - pip install -q biopython --use-mirrors
- pip install pandas
- pip install scikit-learn
- conda install --yes python=$TRAVIS_PYTHON_VERSION cython numpy scipy biopython pandas pip scikit-learn
- python setup.py install
# command to run tests
script: nosetests
@@ -24,3 +26,4 @@ branches:
only:
- master
- travis_tryout
- develop
94 changes: 65 additions & 29 deletions README.md
@@ -1,4 +1,4 @@
#CONCOCT 0.2.1 [![Build Status](https://travis-ci.org/BinPro/CONCOCT.png?branch=master)](https://travis-ci.org/BinPro/CONCOCT)#
#CONCOCT 0.2.2 [![Build Status](https://travis-ci.org/BinPro/CONCOCT.png?branch=master)](https://travis-ci.org/BinPro/CONCOCT)#

A program for unsupervised binning of metagenomic contigs using nucleotide composition,
coverage data across multiple samples and linkage data from paired-end reads.
@@ -13,49 +13,85 @@ Feel free to contact our mailing list concoct-support@lists.sourceforge.net for
If you would like to subscribe to the concoct-support mailing list, you can do so [here](https://lists.sourceforge.net/lists/listinfo/concoct-support)

##Dependencies##

Installing concoct requires python version 2.7.* and the python package installer ```pip```. It also requires a c compiler, e.g. ```gcc```, and the GNU Scientific Library ```gsl```. For linux (ubuntu) this is installed through:
###Fundamental dependencies###
```
apt-get install build-essential gsl-bin
python v2.7.*
gcc
gsl
```

Before or during the installation of concoct, several other python packages will be downloaded and installed by pip.

##Install##
Installs the package concoct in the default python path and adds the script concoct to bin. You can use sudo if needed.
These items are prerequisites for the installation of concoct as described below. The installation procedure varies between systems; this README only describes how to proceed on a linux (ubuntu) distribution.

###Using pip###
Download the CONCOCT distribution from https://github.com/BinPro/CONCOCT/releases (stable) and extract the files, or clone the repository with github (potentially unstable)
The first item, ```python v2.7.*```, should be installed on a modern Ubuntu distribution. A c compiler, e.g. ```gcc```, is needed to compile the c parts of concoct that use the GNU Scientific Library ```gsl```. For linux (ubuntu) this is installed through:
```
git clone https://github.com/BinPro/CONCOCT.git
apt-get install build-essential libgsl0-dev
```

Resolve all dependencies, see above and then execute:
###Python packages###
```
cd CONCOCT
pip install -r requirements.txt
python setup.py install
cython>=0.19.2
numpy>=1.7.1
scipy>=0.12.0
pandas>=0.11.0
biopython>=1.62b
scikit-learn>=0.13.1
```
These are the python packages that need to be installed in order to run concoct. If you follow the installation instructions below, these will be installed automatically, but are listed here for transparency.

###Optional dependencies###

* To create the input table (containing average coverage per sample and contig)
* [BEDTools](https://github.com/arq5x/bedtools2/releases) version >= 2.15.0 (only genomeCoverageBed)
* [Picard](https://launchpad.net/ubuntu/+source/picard-tools/) tools version >= 1.77
* [samtools](http://samtools.sourceforge.net/) version >= 0.1.18
* [bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml) version >= 2.1.0
* [GNU parallel](http://www.gnu.org/software/parallel/) version >= 20130422

###Using apt-get###
Another way to get the dependencies (given Ubuntu / Debian, similar for other distros) is through ```apt-get```. However, for some packages, only deprecated versions are available. Make sure that the requirements for these packages are fulfilled:
* For validation of clustering using single-copy core genes
* [PROKKA](http://www.vicbioinformatics.com/software.prokka.shtml)
* Python packages: ```bcbio-gff>=0.4```
* R packages: ```gplots, reshape, ggplot2, ellipse, getopt``` and ```grid```

biopython>=1.62b
numpy>=1.7.1
pandas>=0.11.0
scikit-learn>=0.13.1
scipy>=0.12.0
##Installation##
Here we describe two recommended ways of getting concoct to run on your computer/server. The first option, using Anaconda, should work for any *nix (e.g. Mac OS X or Linux) system even where you do not have 'sudo' rights (e.g. on a common computer cluster). The second option is suitable for a linux computer where you have root privileges and you prefer to use a virtual machine where all dependencies to run concoct are included.

The actual commands for installing are then
###Using Anaconda###
These instructions show how to install all dependencies (except the 'Fundamental dependencies' and the 'Optional dependencies' listed above) using an Anaconda environment. Anaconda is a tool to isolate your python installation, which allows you to have multiple parallel installations using different versions of different packages, and gives you a very convenient and fast way to install the most common scientific python packages. Anaconda is free but not open source; you can download it [here](https://store.continuum.io/cshop/anaconda/). Installation instructions can be found [here](http://docs.continuum.io/anaconda/install.html).

After installing Anaconda, create a new environment that will contain the concoct installation:
```
conda create -n concoct_env python=2.7.6
```
After choosing to proceed, run the suggested command:
```
source activate concoct_env
```
then install the concoct dependencies into this environment:
```
conda install cython numpy scipy biopython pandas pip scikit-learn
```
Finally, download the CONCOCT distribution from https://github.com/BinPro/CONCOCT/releases (stable) and extract the files, or clone the repository with github (potentially unstable). Resolve all dependencies as described above, then execute within the CONCOCT directory:
```
sudo apt-get install git python-setuptools python-biopython python-nose \
python-numpy python-pandas python-scikits-learn python-scipy \
build-essential gsl-bin
git clone https://github.com/BinPro/CONCOCT.git
cd CONCOCT
python setup.py install
```
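To verify that the installation succeeded, a quick check (a minimal sketch, assuming the test suite is included in your copy of the source, as in the Docker instructions below) is to run the tests from the CONCOCT directory:
```
nosetests
```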

###Using Docker###
If you have root access where you want to install concoct, and storage for the roughly 1.2G "virtual machine", then Docker provides a very nice way to get an image with concoct and its dependencies installed. This way the only thing you install on your host system is Docker; the rest is contained in a Docker image. This allows you to install and run programs in that image without affecting your host system. You can get to know Docker here: https://www.docker.io/the_whole_story/
You need to get Docker installed (see https://www.docker.io/gettingstarted/ and, especially if you have Ubuntu, http://docs.docker.io/en/latest/installation/ubuntulinux/). When Docker is installed, you need to download and log into the concoct image, which can be done in one command. We also want to map a folder from the host (/home/user/MyData) to a folder in the image (/opt/MyData). To get all this working we execute:
```
sudo docker run -v /home/user/MyData:/opt/MyData -i -t binnisb/concoct_0.2.2 bash
```
This downloads the image (about 1.2G) and logs you into a bash shell. To test concoct you can then do:
```
$ cd /opt/CONCOCT-0.2.2
$ nosetests
```
This should execute all tests without errors. Then, to run concoct on your data (stored in /home/user/MyData on the host), you can do:
```
$ cd /opt/MyData
$ concoct --coverage_file coverage.csv --composition_file composition.fa -b output_folder/
```


##Execute concoct##
The script concoct takes two input files. The first file, the coverage
file, contains a table where each row corresponds to a contig, and each
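A hypothetical sketch of such a coverage table (the column names and tab separator here are illustrative assumptions; one row per contig, one coverage column per sample):
```
contig_id	sample_1	sample_2	sample_3
contig-001	12.4	0.8	3.1
contig-002	5.6	7.2	0.0
```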
9 changes: 6 additions & 3 deletions concoct/input.py
@@ -80,15 +80,18 @@ def load_coverage(cov_file, contig_lengths, no_cov_normalization, add_total_cove
cov.ix[:,cov_range[0]:cov_range[1]] = cov.ix[:,cov_range[0]:cov_range[1]].add(
(100/contig_lengths),
axis='index')
if add_total_coverage:
cov['total_coverage'] = cov.ix[:,cov_range[0]:cov_range[1]].sum(axis=1)
temp_cov_range = (cov_range[0],'total_coverage')

if not no_cov_normalization:
#Normalize per sample first
cov.ix[:,cov_range[0]:cov_range[1]] = \
_normalize_per_sample(cov.ix[:,cov_range[0]:cov_range[1]])

# Total coverage should be calculated after per sample normalization
if add_total_coverage:
cov['total_coverage'] = cov.ix[:,cov_range[0]:cov_range[1]].sum(axis=1)
temp_cov_range = (cov_range[0],'total_coverage')

if not no_cov_normalization:
# Normalize contigs next
cov.ix[:,cov_range[0]:cov_range[1]] = \
_normalize_per_contig(cov.ix[:,cov_range[0]:cov_range[1]])
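The hunk above moves the total_coverage computation so that it happens after per-sample normalization. A minimal standalone sketch of the intended order (modern pandas; `normalize_per_sample` is a stand-in for concoct's internal `_normalize_per_sample`, whose exact definition is not shown in this diff):
```python
import pandas as pd

# Toy coverage table: one row per contig, one column per sample.
cov = pd.DataFrame({"sample_1": [10.0, 30.0],
                    "sample_2": [1.0, 3.0]},
                   index=["contig-001", "contig-002"])

def normalize_per_sample(df):
    # Stand-in for _normalize_per_sample: scale each sample (column)
    # so that it sums to one.
    return df.div(df.sum(axis=0), axis="columns")

# As in the patched load_coverage: normalize per sample first, then
# derive total_coverage, so that deeply sequenced samples do not
# dominate the total.
cov = normalize_per_sample(cov)
cov["total_coverage"] = cov.sum(axis=1)
print(cov)
```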
39 changes: 39 additions & 0 deletions doc/Dockerfile
@@ -0,0 +1,39 @@
# Docker for CONCOCT (http://github.com/BinPro/CONCOCT) v0.2.2
# VERSION 0.2.2
#
# This docker creates and sets up an Ubuntu environment with all
# dependencies for CONCOCT v0.2.2 installed.
#
# To login to the docker with a shared directory from the host do:
#
# sudo docker run -v /my/host/shared/directory:/my/docker/location -i -t binnisb/concoct_0.2.2 /bin/bash
#
# This environment does not set up the assembler and preprocessing
# for concoct. We will be creating another docker for that.

FROM ubuntu:13.10
MAINTAINER CONCOCT developer group, concoct-support@lists.sourceforge.net

ENV PATH /opt/miniconda/bin:$PATH

# Get basic ubuntu packages needed
RUN apt-get update -qq;\
apt-get install -qq wget build-essential libgsl0-dev

# Set up Miniconda environment for python2
RUN cd /opt;\
wget http://repo.continuum.io/miniconda/Miniconda-3.3.0-Linux-x86_64.sh -O miniconda.sh;\
chmod +x miniconda.sh;\
./miniconda.sh -p /opt/miniconda -b

# Install python dependencies and fetch and install CONCOCT 0.2.2
RUN cd /opt;\
conda update --yes conda;\
conda install --yes python=2.7 atlas cython numpy scipy biopython pandas pip scikit-learn;\
wget --no-check-certificate https://github.com/BinPro/CONCOCT/archive/0.2.2.tar.gz;\
tar xf 0.2.2.tar.gz;\
cd CONCOCT-0.2.2;\
python setup.py install
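To build this image locally instead of pulling binnisb/concoct_0.2.2 from the registry, a minimal sketch (assuming the file above is saved as doc/Dockerfile) would be:
```
cd doc
sudo docker build -t binnisb/concoct_0.2.2 .
```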



132 changes: 62 additions & 70 deletions scripts/PROKKA_COG.py
@@ -21,11 +21,12 @@
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
# **************************************************************/
import sys, getopt, urllib
from xml.dom import minidom
import sys
from BCBio import GFF
import argparse
from Bio import Entrez

def get_record_from_cdd(query):
def get_records_from_cdd(queries, email):
# We need a mapping from CDD accessions to COG accessions. For this we will use NCBI eutils and parse the returned XML
# file. For example,
#
@@ -43,31 +44,16 @@ def get_record_from_cdd(query):
# <Item Name="LivePssmID" Type="Integer">0</Item>
# </DocSum>
# </eSummaryResult>

params = {
'db':'cdd',
}

params['id'] = query
# get citation info:
url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?' + urllib.urlencode(params)
data = urllib.urlopen(url).read()
xmldoc = minidom.parseString(data)
items=xmldoc.getElementsByTagName("Item")
r={}
for i in range(items.length):
r[items[i].getAttribute('Name')]=items[i].firstChild.data
return r
Entrez.email = email # Always tell ncbi who you are.
search_result = Entrez.read(Entrez.epost("cdd", id=",".join(queries)))
records = Entrez.read(Entrez.efetch(db="cdd",
rettype='docsum',
webenv=search_result['WebEnv'],
query_key=search_result['QueryKey']))
return records
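A minimal usage sketch for the batched lookup above (hypothetical PSSM IDs; a real run needs network access and a valid email address, and the IDs come from the sseqid field of the rpsblast output, e.g. `gnl|CDD|223531`):
```python
# Assumes the imports and get_records_from_cdd definition from the
# file above are in scope.
records = get_records_from_cdd(["223531", "224159"], "mail@example.com")
for rec in records:
    print(rec["Accession"] + "\t" + rec["Title"])
```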

def usage():
print '\n'.join([
'Usage:',
'\t./PROKKA_COG.py -g <gfffile> -b <blastoutfile>',
'',
'Optional parameters:',
'\t-s (--scovs-threshold)\t\tsubject coverage threshold (Default:60)',
'\t-p (--pident-threshold)\t\tpident threshold (Default:0)',
'',
'Example usage:',
'',
'\tStep 1: Run PROKKA_XXXXXXXX.faa with rpsblast against the Cog database',
@@ -77,16 +63,19 @@ def usage():
'\t\t\tsstart send length slen\" -out blast_output.out',
'',
'\tStep 2: Run this script to generate COG anotations:',
'\t\t\t./PROKKA_COG.py -g PROKKA_XXXXXXXX.gff -b blast_output.out',
'\t\t\t./PROKKA_COG.py -g PROKKA_XXXXXXXX.gff -b blast_output.out -e mail@example.com',
'\t\t\t > annotation.cog',
'',
'Refer to rpsblast tutorial: http://www2.warwick.ac.uk/fac/sci/moac/people/students/peter_cock/python/rpsblast/'])
'Refer to rpsblast tutorial: http://www2.warwick.ac.uk/fac/sci/moac/people/students/peter_cock/python/rpsblast/',
''])

def main(argv):
def main(args):
blastoutfile = args.blastoutfile
gfffile = args.gfffile
RPSBLAST_SCOVS_THRESHOLD = args.scovs_threshold
RPSBLAST_PIDENT_THRESHOLD = args.pident_threshold

# = Parameters to set ============== #
RPSBLAST_SCOVS_THRESHOLD=60.0
RPSBLAST_PIDENT_THRESHOLD=0.0
RPSBLAST_QSEQID_FIELD=0
RPSBLAST_SSEQID_FIELD=1
RPSBLAST_EVALUE_FIELD=2
@@ -101,52 +90,40 @@ def main(argv):
# = /Parameters to set ============= #


gfffile = ''
blastoutfile=''

try:
opts, args = getopt.getopt(argv,"hg:b:s:p:",["gfffile=","blastoutfile=","scovs-threshold=","pident-threshold="])
except getopt.GetoptError:
usage()
sys.exit(2)
for opt, arg in opts:
if opt == '-h':
usage()
sys.exit()
elif opt in ("-g", "--gfffile"):
gfffile = arg
elif opt in ("-b", "--blastoutfile"):
blastoutfile = arg
elif opt in ("-s", "--scovs-threshold"):
RPSBLAST_SCOVS_THRESHOLD = float(arg)
elif opt in ("-p", "--pident-threshold"):
RPSBLAST_PIDENT_THRESHOLD = float(arg)

if (gfffile =='' or blastoutfile == ''):
usage()
sys.exit()

featureid_locations={}
limits=dict(gff_type=["gene","mRNA","CDS"])
in_handle=open(gfffile)
for rec in GFF.parse(in_handle,limit_info=limits):
for feature in rec.features:
if str(feature.location.strand)!="-1":
featureid_locations[feature.id]=[rec.id,str(feature.location.start),str(feature.location.end),'+']
else:
featureid_locations[feature.id]=[rec.id,str(feature.location.start),str(feature.location.end),'-']
in_handle.close()
with open(gfffile) as in_handle:
for rec in GFF.parse(in_handle, limit_info=limits):
for feature in rec.features:
l = [rec.id, str(feature.location.start), str(feature.location.end)]
if feature.location.strand == 1:
l.append('+')
else:
l.append('-')
featureid_locations[feature.id] = l

print '#Query\tHit\tE-value\tIdentity\tScore\tQuery-start\tQuery-end\tHit-start\tHit-end\tHit-length\tDescription\tTitle\tClass-description\tComments'

sseq_ids = []
with open(blastoutfile) as in_handle:
for line in in_handle:
sseq_ids.append(line.split("\t")[RPSBLAST_SSEQID_FIELD].split('|')[2])
cogrecords_l = get_records_from_cdd(sseq_ids, args.email)
cogrecords = {}
for rec in cogrecords_l:
cogrecords[rec['Id']] = rec

in_handle=open(blastoutfile)
for line in in_handle:
record=line.split("\t")
if (float(record[RPSBLAST_PIDENT_FIELD])>= RPSBLAST_PIDENT_THRESHOLD and ((float(abs(int(record[RPSBLAST_SEND_FIELD])-int(record[RPSBLAST_SSTART_FIELD]))+1)/float(record[RPSBLAST_SLEN_FIELD]))*100.0)>= RPSBLAST_SCOVS_THRESHOLD):
cogrecord=get_record_from_cdd(record[RPSBLAST_SSEQID_FIELD].split('|')[2])
featureidlocrecord=featureid_locations[record[RPSBLAST_QSEQID_FIELD]]
print ( featureidlocrecord[0]+'_'+record[RPSBLAST_QSEQID_FIELD][7:]+'\t'+
l_covered = (float(abs(int(record[RPSBLAST_SEND_FIELD])-int(record[RPSBLAST_SSTART_FIELD]))+1))

if (float(record[RPSBLAST_PIDENT_FIELD])>= RPSBLAST_PIDENT_THRESHOLD and
((l_covered/float(record[RPSBLAST_SLEN_FIELD]))*100.0 >= RPSBLAST_SCOVS_THRESHOLD)):

cogrecord = cogrecords[record[RPSBLAST_SSEQID_FIELD].split('|')[2]]
featureidlocrecord=featureid_locations[record[RPSBLAST_QSEQID_FIELD]]
print(featureidlocrecord[0]+'_'+record[RPSBLAST_QSEQID_FIELD][7:]+'\t'+
cogrecord['Accession']+'\t'+
record[RPSBLAST_EVALUE_FIELD]+'\t'+
record[RPSBLAST_PIDENT_FIELD]+'\t'+
@@ -157,10 +134,25 @@
record[RPSBLAST_SEND_FIELD]+'\t'+
record[RPSBLAST_LENGTH_FIELD]+'\t'+
cogrecord['Abstract'].split('[')[0].strip()+'\t'+
cogrecord['Title']+'\t'+cogrecord['Abstract'].split('[')[1].strip()[:-1]+'\t'+
cogrecord['Title']+'\t'+
cogrecord['Abstract'].split('[')[1].strip()[:-1]+'\t'+
'['+featureidlocrecord[1]+','+featureidlocrecord[2]+']('+featureidlocrecord[3]+')'
)
in_handle.close()

if __name__ == "__main__":
main(sys.argv[1:])
parser = argparse.ArgumentParser(usage=usage())
parser.add_argument('-g', '--gfffile', required=True,
help='GFF file generated by e.g. prodigal')
parser.add_argument('-b', '--blastoutfile', required=True,
help='Output of rpsblast run')
parser.add_argument('-s', '--scovs-threshold', type=float, default=60.0,
help='Threshold covered in percent, default=60.0')
parser.add_argument('-p', '--pident-threshold', type=float, default=0.0,
help='Threshold identity in percent, default=0.0')
parser.add_argument('-e', '--email',
help='Email address needed to fetch data through the ncbi api')

args = parser.parse_args()

main(args)