This repository contains supporting material for the manuscript:
"Clonal heterogeneity influences the fate of new adaptive mutations"
Ignacio Vázquez-García, Francisco Salinas, Jing Li, Andrej Fischer, Benjamin Barré, Johan Hallin, Anders Bergström, Elisa Alonso-Pérez, Jonas Warringer, Ville Mustonen, Gianni Liti
Cell Reports 21, 732-744 (2017), doi: https://doi.org/10.1016/j.celrep.2017.09.046
To clone this repository, run the following command in a local directory:
$ git clone --recursive https://github.com/ivazquez/clonal-heterogeneity.git
The --recursive flag is required to download the nested git submodule from an external repository.
The source code contains IPython notebooks to reproduce the manuscript figures, which make use of numpy, scipy, the matplotlib plotting environment and others. These are found in the src/ directory. The repository also includes the cloneHD submodule, which is written in C++ and requires g++ with the GSL library.
To install all Python dependencies inside a virtual environment and build the cloneHD executables into the build/ directory, run:
$ cd clonal-heterogeneity
$ make
You can then browse and run the notebooks locally to reproduce all figures with:
$ jupyter notebook
Alternatively, you can run the notebooks online using Binder.
Figures | Notebook |
---|---|
Figure 1 | Schematic of study design |
Figures 2, S2 | Driver-passenger dynamics |
Figures 3, S3, S4, S9 | Reconstruction of subclonal heterogeneity |
Figures 4, S5 | Pervasive selection for adaptive mutations and genome instability |
Figure 5 | Elevated rates of loss of heterozygosity |
Figures 6, S10, S11, S12 | Ensemble-averaged fitness effects of genetic background and de novo mutations |
Figures S6, S7, S8 | Engineered genetic constructs |
Sequencing reads are available in BAM or CRAM format from the European Nucleotide Archive and the NCBI BioProject. Sequence data for the parental strains and the ancestral individuals were previously submitted to the SRA/ENA databases under study accession no. ERP000780 and the NCBI BioProject under accession no. PRJEB2608. Sequence data for the time-resolved populations and the evolved individuals have been submitted to the SRA/ENA databases under study accession no. ERP003953 and the NCBI BioProject under accession no. PRJEB4645. To download the files programmatically from the FTP server (156GB):
wget -i <(awk -F, '{gsub("#","%23",$NF); print $NF}' data/seq/sample_ids_unmerged.csv)
Sequences must be aligned to the S. cerevisiae reference genome R64-1-1.
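The awk expression in the download command above takes the last comma-separated field of each row and percent-encodes any # character so that wget can fetch the URL. A minimal Python sketch of the same transformation is given below; the function name and the example rows are hypothetical, and the real rows come from data/seq/sample_ids_unmerged.csv.

```python
def last_field_urls(lines):
    """Return the last comma-separated field of each line, with '#' encoded as '%23'."""
    urls = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # unlike the awk one-liner, blank lines are skipped
        last = line.split(",")[-1]  # same naive comma split as awk -F,
        urls.append(last.replace("#", "%23"))
    return urls


# Hypothetical rows for illustration only (not taken from the repository).
example_rows = [
    "sampleA,ftp://example.org/reads/sampleA#1.cram\n",
    "sampleB,ftp://example.org/reads/sampleB#2.cram\n",
]
urls = last_field_urls(example_rows)
print("\n".join(urls))
```

Like the awk command, this sketch does not treat a header row specially, so any header line would also contribute its last field.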
Variant calls are available in VCF format with accession no. PRJEB13491 and can be browsed on the European Variation Archive. Each VCF file corresponds to one sample and contains either pre-existing variants (*.background.vcf.gz) or de novo variants (*.de_novo.vcf.gz). They can be downloaded programmatically from the FTP server (713MB):
wget -i <(awk -F, '{gsub(";","\n",$NF); print $NF;}' data/seq/sample_ids_merged_dup.csv)
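Here the awk expression splits the last field of each row on semicolons, so that one sample can list several VCF files. The sketch below reproduces that step in Python and groups the resulting URLs by the two file suffixes named above. The function names and example rows are hypothetical; the real rows come from data/seq/sample_ids_merged_dup.csv.

```python
def split_vcf_urls(lines):
    """Split the semicolon-separated last field of each row into individual URLs."""
    urls = []
    for line in lines:
        last = line.rstrip("\n").split(",")[-1]
        urls.extend(u for u in last.split(";") if u)
    return urls


def variant_class(url):
    """Classify a VCF file by its suffix: pre-existing or de novo variants."""
    if url.endswith(".background.vcf.gz"):
        return "background"
    if url.endswith(".de_novo.vcf.gz"):
        return "de_novo"
    return "other"


# Hypothetical rows for illustration only (not taken from the repository).
example_rows = [
    "sampleA,ftp://example.org/vcf/sampleA.background.vcf.gz;ftp://example.org/vcf/sampleA.de_novo.vcf.gz\n",
    "sampleB,ftp://example.org/vcf/sampleB.background.vcf.gz\n",
]
vcf_urls = split_vcf_urls(example_rows)
classes = [variant_class(u) for u in vcf_urls]
print(list(zip(classes, vcf_urls)))
```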
Alternatively, variants can be found in tab-separated format or serialized in Pickle format for Python in the data/seq/ directory, annotated with Ensembl Variant Effect Predictor.
With the sequence data we carry out subclonal decomposition using a probabilistic inference method named cloneHD, as shown in Figure 3 of the manuscript. The source code contains a minimal example to carry out subclonal decomposition in a simulated dataset. To test this method with simulated data:
src/subclonality_simulated.sh
Also, to test this on a representative time series dataset for one of the populations (as shown in Figure 2):
src/subclonality_experiment.sh
The full documentation for filterHD and cloneHD can be found in the cloneHD repository.
This dataset comprises phenotype measurements of intra-population heterogeneity, of engineered genetic constructs, and of a recombinant library of pre-existing and de novo mutations created by genetic crossing. Raw imaging data is available upon request (~250GB). Phenotype measurements are analysed using scan-o-matic and are available in comma-separated format. They can be found in the data/pheno/ directory.
- Temporal changes to the phenotype distribution are summarised in this dataset and are analysed in this notebook. The results are shown in Figures 3 and S9.
- Phenotype measurements of the genetic cross are summarised in this dataset (spores and hybrids) and are analysed in this notebook. The results are shown in Figures 6, S10 and S11.
- Phenotype measurements of the genetic constructs are summarised in this dataset and are analysed in this notebook.
Each measurement is indexed by experimental run, plate, row and column. All datasets report the growth rate and the doubling time. For each observable, the dataset reports absolute values and normalised values extracted after spatial normalisation. NaN is used to indicate missing data.
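As a rough illustration of this layout, the sketch below reads a comma-separated table keyed by (run, plate, row, column) and averages one observable while skipping NaN entries. The column names (run, plate, row, column, growth_rate) and the values are assumptions made for this example; check the headers of the files in data/pheno/ before adapting it.

```python
import csv
import io
import math

# Hypothetical excerpt; the column names are assumptions, not the repository's headers.
EXAMPLE_CSV = """run,plate,row,column,growth_rate
1,1,0,0,0.35
1,1,0,1,NaN
1,1,1,0,0.25
"""


def load_measurements(text, value_column):
    """Map (run, plate, row, column) to one observable; 'NaN' parses to float nan."""
    table = {}
    for rec in csv.DictReader(io.StringIO(text)):
        key = (int(rec["run"]), int(rec["plate"]), int(rec["row"]), int(rec["column"]))
        table[key] = float(rec[value_column])
    return table


def nanmean(values):
    """Mean of the non-missing values, or nan if every value is missing."""
    kept = [v for v in values if not math.isnan(v)]
    return sum(kept) / len(kept) if kept else math.nan


table = load_measurements(EXAMPLE_CSV, "growth_rate")
mean_rate = nanmean(table.values())
print(len(table), mean_rate)
```

Keying on the full (run, plate, row, column) tuple keeps positions from different runs or plates distinct, and filtering NaN explicitly avoids a single missing well turning the whole average into NaN.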
This dataset comprises locus-specific measurements of the LOH rate obtained with a Luria-Delbruck fluctuation test. It reports the raw colony counts measured in the fluctuation assay and the estimated LOH rates, which can be found in the data/fluctuation/ directory.
- The raw counts contain the number of colony-forming units (CFU) in YPD medium and in 5-FOA+ dropout medium. This provides the average number of cells per culture, N, and the average number of LOH events per culture, m.
- For every background and environment, the mean LOH rate and 5%/95% confidence intervals are estimated using the probability generating function of the Luria-Delbruck distribution defined by Hamon et al. (2012).
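To make the roles of N and m concrete, the sketch below estimates m from per-culture counts and converts it to a rate as m / N. It is not the generating-function estimator of Hamon et al. (2012) used for the reported rates. Instead it swaps in a simpler, standard alternative: the Lea-Coulson form of the Luria-Delbruck distribution computed with the Ma-Sandri-Sarkar recursion, and a maximum-likelihood estimate of m found by grid search. The counts, the value of N and the grid are hypothetical.

```python
import math


def ld_pmf(m, n_max):
    """P(0), ..., P(n_max) events per culture via the Ma-Sandri-Sarkar recursion."""
    p = [math.exp(-m)]
    for n in range(1, n_max + 1):
        s = sum(p[i] / (n - i + 1) for i in range(n))
        p.append(m / n * s)
    return p


def log_likelihood(m, counts):
    """Log-likelihood of the observed per-culture counts for a given m."""
    p = ld_pmf(m, max(counts))
    return sum(math.log(p[c]) for c in counts)


def mle_m(counts, grid):
    """Grid-search maximum-likelihood estimate of m (a sketch, not an optimiser)."""
    return max(grid, key=lambda m: log_likelihood(m, counts))


# Hypothetical inputs for illustration only.
counts = [0, 1, 0, 3, 2, 0, 1, 7, 0, 1]  # events observed in each parallel culture
N = 2.0e7                                # average number of cells per culture
grid = [0.05 * k for k in range(1, 201)]  # candidate values of m from 0.05 to 10

m_hat = mle_m(counts, grid)
rate = m_hat / N  # events per cell, assuming rate = m / N
print(m_hat, rate)
```

A grid search is used only because it is easy to check by eye; confidence intervals are not computed here and would need a separate procedure.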