Merge pull request #63 from BinPro/develop
Version 0.2.2 released.
alneberg committed Apr 15, 2014
2 parents 4f4685e + f46e629 commit 65376c0
Showing 6 changed files with 189 additions and 111 deletions.
17 changes: 10 additions & 7 deletions .travis.yml
@@ -7,15 +7,17 @@ python:
virtualenv:
system_site_packages: true
before_install:
#Uses miniconda installation of scientific python packages instead of building from source
#or using old versions supplied by apt-get. Source: https://gist.github.com/dan-blanchard/7045057
- if [ ${TRAVIS_PYTHON_VERSION:0:1} == "2" ]; then wget http://repo.continuum.io/miniconda/Miniconda-3.3.0-Linux-x86_64.sh -O miniconda.sh; else wget http://repo.continuum.io/miniconda/Miniconda3-3.3.0-Linux-x86_64.sh -O miniconda.sh; fi
- chmod +x miniconda.sh
- ./miniconda.sh -b
- export PATH=/home/travis/miniconda/bin:$PATH
- conda update --yes conda
- sudo apt-get update -qq
- sudo apt-get install -qq cython libatlas-dev liblapack-dev gfortran python-numpy python-scipy python-biopython build-essential libgsl0-dev
- sudo apt-get install -qq build-essential libgsl0-dev
install:
#Test with packages from binary install, takes a long time to build numpy and scipy
# - pip install -q -U numpy --use-mirrors
# - pip install scipy
# - pip install -q biopython --use-mirrors
- pip install pandas
- pip install scikit-learn
- conda install --yes python=$TRAVIS_PYTHON_VERSION cython numpy scipy biopython pandas pip scikit-learn
- python setup.py install
# command to run tests
script: nosetests
@@ -24,3 +26,4 @@ branches:
only:
- master
- travis_tryout
- develop
94 changes: 65 additions & 29 deletions README.md
@@ -1,4 +1,4 @@
#CONCOCT 0.2.1 [![Build Status](https://travis-ci.org/BinPro/CONCOCT.png?branch=master)](https://travis-ci.org/BinPro/CONCOCT)#
#CONCOCT 0.2.2 [![Build Status](https://travis-ci.org/BinPro/CONCOCT.png?branch=master)](https://travis-ci.org/BinPro/CONCOCT)#

A program for unsupervised binning of metagenomic contigs using nucleotide composition,
coverage data across multiple samples and linkage data from paired-end reads.
@@ -13,49 +13,85 @@ Feel free to contact our mailing list concoct-support@lists.sourceforge.net for
If you would like to subscribe to the concoct-support mailing list, you can do so [here](https://lists.sourceforge.net/lists/listinfo/concoct-support)

##Dependencies##

Installing concoct requires python version 2.7.* and the python package installer ```pip```. It also requires a c compiler, e.g. ```gcc```, and the GNU Scientific Library ```gsl```. For linux (ubuntu) this is installed through:
###Fundamental dependencies###
```
apt-get install build-essential gsl-bin
python v2.7.*
gcc
gsl
```

Before or during the installation of concoct, several other python packages will be downloaded and installed by pip.

##Install##
Installs the package concoct in the default python path and adds the script concoct to bin. You can use sudo if needed.
These items are prerequisites for the installation of concoct as described below. The installation procedure varies between systems; this README only describes how to proceed on a linux (ubuntu) distribution.

###Using pip###
Download the CONCOCT distribution from https://github.com/BinPro/CONCOCT/releases (stable) and extract the files, or clone the repository with github (potentially unstable)
The first item, ```python v2.7.*```, should be installed on a modern Ubuntu distribution. A c compiler, e.g. ```gcc```, is needed to compile the c parts of concoct that use the GNU Scientific Library ```gsl```. For linux (ubuntu) this is installed through:
```
git clone https://github.com/BinPro/CONCOCT.git
apt-get install build-essential libgsl0-dev
```

Resolve all dependencies, see above and then execute:
###Python packages###
```
cd CONCOCT
pip install -r requirements.txt
python setup.py install
cython>=0.19.2
numpy>=1.7.1
scipy>=0.12.0
pandas>=0.11.0
biopython>=1.62b
scikit-learn>=0.13.1
```
These are the python packages that need to be installed in order to run concoct. If you follow the installation instructions below, these will be installed automatically, but are listed here for transparency.

###Optional dependencies###

* To create the input table (containing average coverage per sample and contig)
* [BEDTools](https://github.com/arq5x/bedtools2/releases) version >= 2.15.0 (only genomeCoverageBed)
* [Picard](https://launchpad.net/ubuntu/+source/picard-tools/) tools version >= 1.77
* [samtools](http://samtools.sourceforge.net/) version >= 0.1.18
* [bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml) version >= 2.1.0
* [GNU parallel](http://www.gnu.org/software/parallel/) version >= 20130422

###Using apt-get###
Another way to get the dependencies (given Ubuntu / Debian, similar for other distros) is through ```apt-get```. However, for some packages, only deprecated versions are available. Make sure that the requirements for these packages are fulfilled:
* For validation of clustering using single-copy core genes
* [PROKKA](http://www.vicbioinformatics.com/software.prokka.shtml)
* Python packages: ```bcbio-gff>=0.4```
* R packages: ```gplots, reshape, ggplot2, ellipse, getopt``` and ```grid```

biopython>=1.62b
numpy>=1.7.1
pandas>=0.11.0
scikit-learn>=0.13.1
scipy>=0.12.0
##Installation##
Here we describe two recommended ways of getting concoct to run on your computer/server. The first option, using Anaconda, should work for any *nix (e.g. Mac OS X or Linux) system even where you do not have 'sudo' rights (e.g. on a common computer cluster). The second option is suitable for a linux computer where you have root privileges and you prefer to use a virtual machine where all dependencies to run concoct are included.

The actual commands for installing are then
###Using Anaconda###
These instructions show how to install all dependencies (except the 'Fundamental dependencies' and the 'Optional dependencies' listed above) using an Anaconda environment. Anaconda is a tool to isolate your python installation, which allows you to have multiple parallel installations using different versions of different packages, and gives you a very convenient and fast way to install the most common scientific python packages. Anaconda is free but not open source; you can download it [here](https://store.continuum.io/cshop/anaconda/). Installation instructions can be found [here](http://docs.continuum.io/anaconda/install.html).

After installing Anaconda, create a new environment that will contain the concoct installation:
```
conda create -n concoct_env python=2.7.6
```
After choosing to proceed, run the suggested command:
```
source activate concoct_env
```
then install the concoct dependencies into this environment:
```
conda install cython numpy scipy biopython pandas pip scikit-learn
```
Finally, download the CONCOCT distribution from https://github.com/BinPro/CONCOCT/releases (stable) and extract the files, or clone the repository with github (potentially unstable). Resolve all dependencies as described above, then execute within the CONCOCT directory:
```
sudo apt-get install git python-setuptools python-biopython python-nose \
python-numpy python-pandas python-scikits-learn python-scipy \
build-essential gsl-bin
git clone https://github.com/BinPro/CONCOCT.git
cd CONCOCT
python setup.py install
```
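To verify that the installation succeeded, a quick check (a minimal sketch, assuming the test suite is included in your copy of the source, as in the Docker instructions below) is to run the tests from the CONCOCT directory:
```
nosetests
```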

###Using Docker###
If you have root access where you want to install concoct, and storage for the roughly 1.2G "virtual machine", then Docker provides a very nice way to get an image with concoct and its dependencies installed. This way the only thing you install on your host system is Docker; the rest is contained in a Docker image. This allows you to install and run programs in that image without affecting your host system. You can get to know Docker here: https://www.docker.io/the_whole_story/
You need to get Docker installed (see https://www.docker.io/gettingstarted/ and, especially if you have Ubuntu, http://docs.docker.io/en/latest/installation/ubuntulinux/). When Docker is installed, you need to download and log into the concoct image, which can be done in one command. We also want to map a folder from the host (/home/user/MyData) to a folder in the image (/opt/MyData). To get all this working we execute:
```
sudo docker run -v /home/user/MyData:/opt/MyData -i -t binnisb/concoct_0.2.2 bash
```
This downloads the image (about 1.2G) and logs you into a bash shell. To test concoct you can then do:
```
$ cd /opt/CONCOCT-0.2.2
$ nosetests
```
This should execute all tests without errors. Then, to run concoct on your data (stored in /home/user/MyData on the host), you can do:
```
$ cd /opt/MyData
$ concoct --coverage_file coverage.csv --composition_file composition.fa -b output_folder/
```


##Execute concoct##
The script concoct takes two input files. The first file, the coverage
file, contains a table where each row corresponds to a contig, and each
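A hypothetical sketch of such a coverage table (the column names and tab separator here are illustrative assumptions; one row per contig, one coverage column per sample):
```
contig_id	sample_1	sample_2	sample_3
contig-001	12.4	0.8	3.1
contig-002	5.6	7.2	0.0
```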
9 changes: 6 additions & 3 deletions concoct/input.py
@@ -80,15 +80,18 @@ def load_coverage(cov_file, contig_lengths, no_cov_normalization, add_total_cove
cov.ix[:,cov_range[0]:cov_range[1]] = cov.ix[:,cov_range[0]:cov_range[1]].add(
(100/contig_lengths),
axis='index')
if add_total_coverage:
cov['total_coverage'] = cov.ix[:,cov_range[0]:cov_range[1]].sum(axis=1)
temp_cov_range = (cov_range[0],'total_coverage')

if not no_cov_normalization:
#Normalize per sample first
cov.ix[:,cov_range[0]:cov_range[1]] = \
_normalize_per_sample(cov.ix[:,cov_range[0]:cov_range[1]])

# Total coverage should be calculated after per sample normalization
if add_total_coverage:
cov['total_coverage'] = cov.ix[:,cov_range[0]:cov_range[1]].sum(axis=1)
temp_cov_range = (cov_range[0],'total_coverage')

if not no_cov_normalization:
# Normalize contigs next
cov.ix[:,cov_range[0]:cov_range[1]] = \
_normalize_per_contig(cov.ix[:,cov_range[0]:cov_range[1]])
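The hunk above moves the total_coverage computation so that it happens after per-sample normalization. A minimal standalone sketch of the intended order (modern pandas; `normalize_per_sample` is a stand-in for concoct's internal `_normalize_per_sample`, whose exact definition is not shown in this diff):
```python
import pandas as pd

# Toy coverage table: one row per contig, one column per sample.
cov = pd.DataFrame({"sample_1": [10.0, 30.0],
                    "sample_2": [1.0, 3.0]},
                   index=["contig-001", "contig-002"])

def normalize_per_sample(df):
    # Stand-in for _normalize_per_sample: scale each sample (column)
    # so that it sums to one.
    return df.div(df.sum(axis=0), axis="columns")

# As in the patched load_coverage: normalize per sample first, then
# derive total_coverage, so that deeply sequenced samples do not
# dominate the total.
cov = normalize_per_sample(cov)
cov["total_coverage"] = cov.sum(axis=1)
print(cov)
```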
39 changes: 39 additions & 0 deletions doc/Dockerfile
@@ -0,0 +1,39 @@
# Docker for CONCOCT (http://github.com/BinPro/CONCOCT) v0.2.2
# VERSION 0.2.2
#
# This docker creates and sets up an Ubuntu environment with all
# dependencies for CONCOCT v0.2.2 installed.
#
# To login to the docker with a shared directory from the host do:
#
# sudo docker run -v /my/host/shared/directory:/my/docker/location -i -t binnisb/concoct_0.2.2 /bin/bash
#
# This environment does not set up the assembler and preprocessing
# for concoct. We will be creating another docker for that.

FROM ubuntu:13.10
MAINTAINER CONCOCT developer group, concoct-support@lists.sourceforge.net

ENV PATH /opt/miniconda/bin:$PATH

# Get basic ubuntu packages needed
RUN apt-get update -qq;\
apt-get install -qq wget build-essential libgsl0-dev

# Set up Miniconda environment for python2
RUN cd /opt;\
wget http://repo.continuum.io/miniconda/Miniconda-3.3.0-Linux-x86_64.sh -O miniconda.sh;\
chmod +x miniconda.sh;\
./miniconda.sh -p /opt/miniconda -b

# Install python dependencies and fetch and install CONCOCT 0.2.2
RUN cd /opt;\
conda update --yes conda;\
conda install --yes python=2.7 atlas cython numpy scipy biopython pandas pip scikit-learn;\
wget --no-check-certificate https://github.com/BinPro/CONCOCT/archive/0.2.2.tar.gz;\
tar xf 0.2.2.tar.gz;\
cd CONCOCT-0.2.2;\
python setup.py install
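To build this image locally instead of pulling binnisb/concoct_0.2.2 from the registry, a minimal sketch (assuming the file above is saved as doc/Dockerfile) would be:
```
cd doc
sudo docker build -t binnisb/concoct_0.2.2 .
```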



132 changes: 62 additions & 70 deletions scripts/PROKKA_COG.py
@@ -21,11 +21,12 @@
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
# **************************************************************/
import sys, getopt, urllib
from xml.dom import minidom
import sys
from BCBio import GFF
import argparse
from Bio import Entrez

def get_record_from_cdd(query):
def get_records_from_cdd(queries, email):
# We need a mapping from CDD accessions to COG accessions. For this we will use NCBI eutils and parse the returned XML
# file. For example,
#
@@ -43,31 +44,16 @@ def get_record_from_cdd(query):
# <Item Name="LivePssmID" Type="Integer">0</Item>
# </DocSum>
# </eSummaryResult>

params = {
'db':'cdd',
}

params['id'] = query
# get citation info:
url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?' + urllib.urlencode(params)
data = urllib.urlopen(url).read()
xmldoc = minidom.parseString(data)
items=xmldoc.getElementsByTagName("Item")
r={}
for i in range(items.length):
r[items[i].getAttribute('Name')]=items[i].firstChild.data
return r
Entrez.email = email # Always tell ncbi who you are.
search_result = Entrez.read(Entrez.epost("cdd", id=",".join(queries)))
records = Entrez.read(Entrez.efetch(db="cdd",
rettype='docsum',
webenv=search_result['WebEnv'],
query_key=search_result['QueryKey']))
return records
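A minimal usage sketch for the batched lookup above (hypothetical PSSM IDs; a real run needs network access and a valid email address, and the IDs come from the sseqid field of the rpsblast output, e.g. `gnl|CDD|223531`):
```python
# Assumes the imports and get_records_from_cdd definition from the
# file above are in scope.
records = get_records_from_cdd(["223531", "224159"], "mail@example.com")
for rec in records:
    print(rec["Accession"] + "\t" + rec["Title"])
```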

def usage():
print '\n'.join([
'Usage:',
'\t./PROKKA_COG.py -g <gfffile> -b <blastoutfile>',
'',
'Optional parameters:',
'\t-s (--scovs-threshold)\t\tsubject coverage threshold (Default:60)',
'\t-p (--pident-threshold)\t\tpident threshold (Default:0)',
'',
'Example usage:',
'',
'\tStep 1: Run PROKKA_XXXXXXXX.faa with rpsblast against the Cog database',
@@ -77,16 +63,19 @@ def usage():
'\t\t\tsstart send length slen\" -out blast_output.out',
'',
'\tStep 2: Run this script to generate COG anotations:',
'\t\t\t./PROKKA_COG.py -g PROKKA_XXXXXXXX.gff -b blast_output.out',
'\t\t\t./PROKKA_COG.py -g PROKKA_XXXXXXXX.gff -b blast_output.out -e mail@example.com',
'\t\t\t > annotation.cog',
'',
'Refer to rpsblast tutorial: http://www2.warwick.ac.uk/fac/sci/moac/people/students/peter_cock/python/rpsblast/'])
'Refer to rpsblast tutorial: http://www2.warwick.ac.uk/fac/sci/moac/people/students/peter_cock/python/rpsblast/',
''])

def main(argv):
def main(args):
blastoutfile = args.blastoutfile
gfffile = args.gfffile
RPSBLAST_SCOVS_THRESHOLD = args.scovs_threshold
RPSBLAST_PIDENT_THRESHOLD = args.pident_threshold

# = Parameters to set ============== #
RPSBLAST_SCOVS_THRESHOLD=60.0
RPSBLAST_PIDENT_THRESHOLD=0.0
RPSBLAST_QSEQID_FIELD=0
RPSBLAST_SSEQID_FIELD=1
RPSBLAST_EVALUE_FIELD=2
@@ -101,52 +90,40 @@ def main(argv):
# = /Parameters to set ============= #


gfffile = ''
blastoutfile=''

try:
opts, args = getopt.getopt(argv,"hg:b:s:p:",["gfffile=","blastoutfile=","scovs-threshold=","pident-threshold="])
except getopt.GetoptError:
usage()
sys.exit(2)
for opt, arg in opts:
if opt == '-h':
usage()
sys.exit()
elif opt in ("-g", "--gfffile"):
gfffile = arg
elif opt in ("-b", "--blastoutfile"):
blastoutfile = arg
elif opt in ("-s", "--scovs-threshold"):
RPSBLAST_SCOVS_THRESHOLD = float(arg)
elif opt in ("-p", "--pident-threshold"):
RPSBLAST_PIDENT_THRESHOLD = float(arg)

if (gfffile =='' or blastoutfile == ''):
usage()
sys.exit()

featureid_locations={}
limits=dict(gff_type=["gene","mRNA","CDS"])
in_handle=open(gfffile)
for rec in GFF.parse(in_handle,limit_info=limits):
for feature in rec.features:
if str(feature.location.strand)!="-1":
featureid_locations[feature.id]=[rec.id,str(feature.location.start),str(feature.location.end),'+']
else:
featureid_locations[feature.id]=[rec.id,str(feature.location.start),str(feature.location.end),'-']
in_handle.close()
with open(gfffile) as in_handle:
for rec in GFF.parse(in_handle, limit_info=limits):
for feature in rec.features:
l = [rec.id, str(feature.location.start), str(feature.location.end)]
if feature.location.strand == 1:
l.append('+')
else:
l.append('-')
featureid_locations[feature.id] = l

print '#Query\tHit\tE-value\tIdentity\tScore\tQuery-start\tQuery-end\tHit-start\tHit-end\tHit-length\tDescription\tTitle\tClass-description\tComments'

sseq_ids = []
with open(blastoutfile) as in_handle:
for line in in_handle:
sseq_ids.append(line.split("\t")[RPSBLAST_SSEQID_FIELD].split('|')[2])
cogrecords_l = get_records_from_cdd(sseq_ids, args.email)
cogrecords = {}
for rec in cogrecords_l:
cogrecords[rec['Id']] = rec

in_handle=open(blastoutfile)
for line in in_handle:
record=line.split("\t")
if (float(record[RPSBLAST_PIDENT_FIELD])>= RPSBLAST_PIDENT_THRESHOLD and ((float(abs(int(record[RPSBLAST_SEND_FIELD])-int(record[RPSBLAST_SSTART_FIELD]))+1)/float(record[RPSBLAST_SLEN_FIELD]))*100.0)>= RPSBLAST_SCOVS_THRESHOLD):
cogrecord=get_record_from_cdd(record[RPSBLAST_SSEQID_FIELD].split('|')[2])
featureidlocrecord=featureid_locations[record[RPSBLAST_QSEQID_FIELD]]
print ( featureidlocrecord[0]+'_'+record[RPSBLAST_QSEQID_FIELD][7:]+'\t'+
l_covered = (float(abs(int(record[RPSBLAST_SEND_FIELD])-int(record[RPSBLAST_SSTART_FIELD]))+1))

if (float(record[RPSBLAST_PIDENT_FIELD])>= RPSBLAST_PIDENT_THRESHOLD and
((l_covered/float(record[RPSBLAST_SLEN_FIELD]))*100.0 >= RPSBLAST_SCOVS_THRESHOLD)):

cogrecord = cogrecords[record[RPSBLAST_SSEQID_FIELD].split('|')[2]]
featureidlocrecord=featureid_locations[record[RPSBLAST_QSEQID_FIELD]]
print(featureidlocrecord[0]+'_'+record[RPSBLAST_QSEQID_FIELD][7:]+'\t'+
cogrecord['Accession']+'\t'+
record[RPSBLAST_EVALUE_FIELD]+'\t'+
record[RPSBLAST_PIDENT_FIELD]+'\t'+
@@ -157,10 +134,25 @@
record[RPSBLAST_SEND_FIELD]+'\t'+
record[RPSBLAST_LENGTH_FIELD]+'\t'+
cogrecord['Abstract'].split('[')[0].strip()+'\t'+
cogrecord['Title']+'\t'+cogrecord['Abstract'].split('[')[1].strip()[:-1]+'\t'+
cogrecord['Title']+'\t'+
cogrecord['Abstract'].split('[')[1].strip()[:-1]+'\t'+
'['+featureidlocrecord[1]+','+featureidlocrecord[2]+']('+featureidlocrecord[3]+')'
)
in_handle.close()

if __name__ == "__main__":
main(sys.argv[1:])
parser = argparse.ArgumentParser(usage=usage())
parser.add_argument('-g', '--gfffile', required=True,
help='GFF file generated by e.g. prodigal')
parser.add_argument('-b', '--blastoutfile', required=True,
help='Output of rpsblast run')
parser.add_argument('-s', '--scovs-threshold', type=float, default=60.0,
help='Threshold covered in percent, default=60.0')
parser.add_argument('-p', '--pident-threshold', type=float, default=0.0,
help='Threshold identity in percent, default=0.0')
parser.add_argument('-e', '--email',
help='Email address needed to fetch data through the ncbi api')

args = parser.parse_args()

main(args)