Update README.md
Preprocess, OTHER finished
yil8 committed Sep 30, 2013
1 parent c245d49 commit 019f5d3
Showing 1 changed file with 100 additions and 8 deletions.
108 changes: 100 additions & 8 deletions README.md
@@ -69,9 +69,9 @@ convert BICseq results to BED file. You can copy these two folders somewhere eas
USAGE
=====


Overview
--------

PyLOH is composed of three modules:
* `preprocess`. Preprocesses the read alignments of paired normal-tumor samples in BAM format and produces the paired counts file,
preprocessed segments file and preprocessed BAF heat map file as output.
@@ -84,12 +84,13 @@ allele type of each segment.
The general workflow of PyLOH is shown below:
![alt tag](https://github.com/uci-cbcl/PyLOH/blob/gh-pages/images/workflow.png?raw=true)


Preprocess
----------
This part of the README is based on [JointSNVMix](https://code.google.com/p/joint-snv-mix/wiki/runningOld). To preprocess the paired
cancer sequencing data, execute:
```
$ PyLOH.py preprocess REFERENCE_GENOME.fasta NORMAL.bam TUMOUR.bam BASENAME --segments_bed SEGMENTS.bed --min_depth 20 --min_base_qual 10 --min_map_qual 10 --process_num 10
```

**REFERENCE_GENOME.fasta** The path to the FASTA file to which the paired BAM files were aligned. Note that the index file should be generated
@@ -106,13 +107,104 @@ be done by running

**BASENAME** The base name of preprocessed files to be created.

**--segments_bed SEGMENTS.bed** Use the genome segmentation stored in SEGMENTS.bed. If not provided, the 22 autosomes are used as the segmentation.
However, using an automatic segmentation algorithm such as [BICseq](http://compbio.med.harvard.edu/Supplements/PNAS11.html) is highly recommended.

**--min_depth** Minimum depth in both the normal and tumor samples required to use a site in the analysis.

**--min_base_qual** Minimum base quality required for each base.

**--min_map_qual** Minimum mapping quality required for each base.

**--process_num** Number of processes to launch for preprocessing.
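As a rough illustration of the `--min_depth` filter, the sketch below keeps a site only when both samples reach the threshold. The `SiteCounts` container and the `passes_min_depth` helper are hypothetical names introduced for this example, and the sketch assumes that depth is the sum of the A and B allele counts; PyLOH's actual implementation may compute depth differently.

```python
from typing import NamedTuple


class SiteCounts(NamedTuple):
    """Allelic read counts at one site (hypothetical container, not a PyLOH type)."""
    normal_A: int
    normal_B: int
    tumor_A: int
    tumor_B: int


def passes_min_depth(site: SiteCounts, min_depth: int = 20) -> bool:
    """Return True when both the normal and the tumor sample reach min_depth.

    Assumption: depth is taken as the sum of the A and B allele counts.
    """
    normal_depth = site.normal_A + site.normal_B
    tumor_depth = site.tumor_A + site.tumor_B
    return normal_depth >= min_depth and tumor_depth >= min_depth


# Two made-up sites: the second has only 11 reads in the normal sample.
sites = [SiteCounts(12, 11, 20, 5), SiteCounts(6, 5, 30, 28)]
kept = [s for s in sites if passes_min_depth(s, min_depth=20)]
```

Requiring the threshold in both samples means a site with deep tumor coverage is still discarded if the normal sample is too shallow to call it reliably.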


Run model
---------
After the paired cancer sequencing data is preprocessed, we can run the probabilistic model of PyLOH by executing:
```
$ PyLOH.py run_model BASENAME --allele_number_max 2 --max_iters 100 --stop_value 1e-7
```
**BASENAME** The base name of preprocessed files created in the preprocess step.

**--allele_number_max** The maximum copy number that each allele is allowed to take.

**--priors_file_name** Path to the file of the prior distribution. The prior file must be consistent with --allele_number_max. If not
provided, a uniform prior is used, which is recommended.

**--max_iters** Maximum number of iterations for training.

**--stop_value** Stop value of the EM algorithm for training. If the change of log-likelihood is lower than this value, stop training.
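The interplay between `--max_iters` and `--stop_value` can be sketched as a generic training loop. This is not PyLOH's code: `run_until_converged` and `em_step` are hypothetical names, and the sketch assumes the "change of log-likelihood" is the absolute difference between consecutive iterations.

```python
from typing import Callable, Tuple


def run_until_converged(
    em_step: Callable[[], float],
    max_iters: int = 100,
    stop_value: float = 1e-7,
) -> Tuple[int, float]:
    """Repeat em_step until the log-likelihood stops changing or max_iters is hit.

    em_step is a hypothetical callable that performs one E-step and M-step
    and returns the resulting log-likelihood.
    Returns (number of iterations run, last log-likelihood).
    """
    iteration = 0
    old_ll = float("-inf")
    new_ll = old_ll
    for iteration in range(1, max_iters + 1):
        new_ll = em_step()
        # Assumption: convergence is judged on the absolute change.
        if abs(new_ll - old_ll) < stop_value:
            break
        old_ll = new_ll
    return iteration, new_ll


# Toy stand-in for a model: a fixed sequence of log-likelihood values.
toy_values = iter([-100.0, -50.0, -49.0, -49.0, -49.0])
iters_run, final_ll = run_until_converged(lambda: next(toy_values))
```

In the toy run the loop stops at the fourth iteration, the first one whose log-likelihood equals the previous value.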


Postprocess
-----------
Currently, the postprocess module is only for plotting the BAF heat map of each segment:
```
$ PyLOH.py BAF_heatmap BASENAME
```

**BASENAME** The base name of preprocessed files created in the preprocess step.


Output files
------------
**\*.PyLOH.counts** The preprocessed paired counts file. It contains the allelic counts of sites that are heterozygous
loci in the normal genome. The definition of each column in a *.PyLOH.counts file is listed here:

| Column | Definition |
| :-------- | :------------------------------------------------- |
| seg_index | Index of each segment |
| normal_A | Count of bases matching the A allele in the normal sample |
| normal_B | Count of bases matching the B allele in the normal sample |
| tumor_A | Count of bases matching the A allele in the tumor sample |
| tumor_B | Count of bases matching the B allele in the tumor sample |
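To show how these columns relate to the B allele frequency (BAF) used in the heat maps, the sketch below computes the tumor BAF per site as B / (A + B). It assumes the counts are available as tab-delimited text with a header row naming the columns above; that layout is an assumption for illustration, so check a real `*.PyLOH.counts` file before relying on it. The helper `b_allele_frequency` and the example rows are made up.

```python
import csv
import io

# Made-up example rows; the tab-delimited layout with a header is an assumption.
example = (
    "seg_index\tnormal_A\tnormal_B\ttumor_A\ttumor_B\n"
    "0\t15\t14\t30\t10\n"
    "0\t12\t13\t9\t27\n"
)


def b_allele_frequency(a_count: int, b_count: int) -> float:
    """Return B / (A + B), or NaN when the site has no reads."""
    total = a_count + b_count
    return b_count / total if total else float("nan")


rows = list(csv.DictReader(io.StringIO(example), delimiter="\t"))
tumor_bafs = [
    b_allele_frequency(int(row["tumor_A"]), int(row["tumor_B"])) for row in rows
]
```

For the two example sites the tumor BAFs are 0.25 and 0.75, the kind of symmetric departure from 0.5 that an allelic imbalance produces at sites that are heterozygous in the normal genome.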

**\*.PyLOH.segments** The preprocessed segments file. It contains the genomic information of each segment. The definition of each
column in a *.PyLOH.segments file is listed here:

| Column | Definition |
| :--------------- | :-------------------------------------------------------- |
| seg_name | Name of the segment |
| chrom | Chromosome of the segment |
| start | Start position of the segment |
| end | End position of the segment |
| normal_reads_num | Count of reads mapped to the segment in the normal sample |
| tumor_reads_num | Count of reads mapped to the segment in the tumor sample |
| LOH_frec | Fraction of LOH sites in the segment |
| log2_ratio | Log2 ratio between tumor_reads_num and normal_reads_num |
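The `log2_ratio` column can be read as in the minimal sketch below, which takes the definition above literally. The function name `log2_ratio` is introduced here for illustration, and the sketch assumes no additional normalization (for example by total library size); PyLOH may apply one.

```python
import math


def log2_ratio(tumor_reads_num: int, normal_reads_num: int) -> float:
    """Log2 of tumor read count over normal read count for one segment.

    Assumption: the raw counts are compared directly, without normalization.
    """
    if tumor_reads_num <= 0 or normal_reads_num <= 0:
        raise ValueError("read counts must be positive")
    return math.log2(tumor_reads_num / normal_reads_num)


gain = log2_ratio(2000, 1000)    # twice as many tumor reads
loss = log2_ratio(500, 1000)     # half as many tumor reads
neutral = log2_ratio(1000, 1000)
```

Under this reading, positive values indicate relatively more tumor reads in the segment and negative values relatively fewer.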

**\*.PyLOH.segments.extended** The extended segments file after run_model. There are two additional columns:

| Column | Definition |
| :--------------- | :-------------------------------------------------------- |
| allele_type | Estimated allele type of the segment |
| copy_number | Estimated copy number of the segment |

**\*.PyLOH.purity** Estimated tumor purity.

**\*.PyLOH.heatmap.pkl** The preprocessed BAF heat map file in Python pickle format.

**\*.PyLOH.heatmap.plot** The folder of BAF heat maps plotted for each segment. A typical BAF heat map looks like this
![alt tag](https://github.com/uci-cbcl/PyLOH/blob/gh-pages/images/BAF_heamap_sample.png?raw=true)



OTHER
=====

BIC-seq related utilities
-------------------------
We highly recommend using an automatic segmentation algorithm to partition the tumor genome and thus prepare the segments file in BED format.
For example, we used [BICseq](http://compbio.med.harvard.edu/Supplements/PNAS11.html) in the original paper. To run a BICseq analysis, you
can copy the commands in `bin/BICseq.R` and paste them into an interactive R shell. Alternatively, you can run the R script from the command line:
```
$ R CMD BATCH bin/BICseq.R
```
Note that `normal.bam` and `tumor.bam` must be in the directory where you run the command. The R script will output a segments file
`segments.BICseq`. You can then use the other script, `bin/BICseq2bed.py`, to convert the segments file into BED format:
```
$ BICseq2bed.py segments.BICseq segments.bed --seg_length 1000000
```

**--seg_length** Only convert segments with length longer than the threshold.
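The effect of `--seg_length` can be sketched as a simple length filter followed by writing tab-separated BED lines. This is not the code of `BICseq2bed.py`: the function `segments_to_bed_lines` and the `(chrom, start, end)` tuples are hypothetical, the layout of `segments.BICseq` is not reproduced here, and the sketch assumes length is `end - start` with coordinates passed through unchanged.

```python
from typing import Iterable, List, Tuple

Segment = Tuple[str, int, int]  # (chrom, start, end), a hypothetical in-memory form


def segments_to_bed_lines(
    segments: Iterable[Segment], seg_length: int = 1000000
) -> List[str]:
    """Keep segments strictly longer than seg_length and format them as BED lines.

    Assumptions: length is end - start, and coordinates are written as given.
    """
    lines = []
    for chrom, start, end in segments:
        if end - start > seg_length:
            lines.append(f"{chrom}\t{start}\t{end}")
    return lines


# Made-up segments: only the first exceeds the 1,000,000 bp threshold.
example_segments = [("chr1", 0, 2500000), ("chr1", 2500000, 3000000)]
bed_lines = segments_to_bed_lines(example_segments, seg_length=1000000)
```

Dropping short segments leaves fewer, larger regions, each of which is more likely to contain enough heterozygous sites for the later modeling step.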
