Update README.md

Preprocess, OTHER finished
sejooning · Sep 30, 2013 · 019f5d3
1 parent c245d49
Showing 1 changed file with 100 additions and 8 deletions.
diff --git a/README.md b/README.md
@@ -69,9 +69,9 @@ convert BICseq results to BED file. You can copy these two folders somewhere eas
 USAGE
 =====
 
+
 Overview
 --------
-
 PyLOH is composed of three modules: 
 * `preprocess`. Preprocesses the read alignments of paired normal-tumor samples in BAM format and produces the paired counts file, 
 preprocessed segments file and preprocessed BAF heat map file as output.
@@ -84,12 +84,13 @@ allele type of each segment.
 The general workflow of PyLOH is this
 ![alt tag](https://github.com/uci-cbcl/PyLOH/blob/gh-pages/images/workflow.png?raw=true)
 
+
 Preprocess
 ----------
 This part of the README is based on [JointSNVMix](https://code.google.com/p/joint-snv-mix/wiki/runningOld). To preprocess the paired 
 cancer sequencing data, execute:
 ```
-$ PyLOH.py preprocess REFERENCE_GENOME.fasta NORMAL.bam TUMOUR.bam BASENAME --segments_bed_file_name SEGMENTS.bed --min_depth 20 --min_base_qual 10 --min_map_qual 10 --process_num 10
+$ PyLOH.py preprocess REFERENCE_GENOME.fasta NORMAL.bam TUMOUR.bam BASENAME --segments_bed SEGMENTS.bed --min_depth 20 --min_base_qual 10 --min_map_qual 10 --process_num 10
 ```
 
 **REFERENCE_GENOME.fasta** The path to the FASTA file that the paired BAM files were aligned to. Note that the index file should be generated 
@@ -106,13 +107,104 @@ be done by running
 
 **BASENAME** The base name of preprocessed files to be created.
 
-**--segments_bed_file_name SEGMENTS.bed** Use the genome segmentation stored in SEGMENTS.bed. If not provided, use 22 autosomes as the 
-segmentaion. But using automatic segmentation algorithm is highly recommended, such as [BICseq](http://compbio.med.harvard.edu/Supplements/PNAS11.html).
+**--segments_bed SEGMENTS.bed** Use the genome segmentation stored in SEGMENTS.bed. If not provided, the 22 autosomes are used as the segmentation. 
+However, using an automatic segmentation algorithm, such as [BICseq](http://compbio.med.harvard.edu/Supplements/PNAS11.html), is highly recommended.
+
+**--min_depth** Minimum depth in both normal and tumor sample required to use a site in the analysis.
+
+**--min_base_qual** Minimum base quality required for each base.
+
+**--min_map_qual** Minimum mapping quality required for each base.
+
+**--process_num** Number of processes to launch for preprocessing.
+
+
+Run model
+---------
+After the paired cancer sequencing data is preprocessed, we can run the probabilistic model of PyLOH by executing:
+```
+$ PyLOH.py run_model BASENAME --allele_number_max 2 --max_iters 100 --stop_value 1e-7
+```
+**BASENAME** The base name of preprocessed files created in the preprocess step.
+
+**--allele_number_max** The maximum copy number that each allele is allowed to take.
+
+**--priors_file_name** Path to the file of the prior distribution. The prior file must be consistent with --allele_number_max. If not 
+provided, a uniform prior is used, which is recommended.
+
+**--max_iters** Maximum number of iterations for training.
+
+**--stop_value** Stop value of the EM algorithm for training. If the change in log-likelihood is lower than this value, training stops.
+
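As an illustration of how `--max_iters` and `--stop_value` interact, here is a minimal sketch of the stopping rule described above. It is not PyLOH's actual training loop: `run_em` and `toy_step` are hypothetical names, and the toy update merely halves a negative log-likelihood so that the loop has something to converge on.

```python
def run_em(step, max_iters=100, stop_value=1e-7):
    """Call step() until the log-likelihood change drops below stop_value.

    `step` is a hypothetical callable that performs one EM iteration and
    returns the current log-likelihood. Returns (iterations_used, last_ll).
    """
    previous = None
    for iteration in range(1, max_iters + 1):
        current = step()
        # Stop as soon as the change between two iterations is small enough.
        if previous is not None and abs(current - previous) < stop_value:
            return iteration, current
        previous = current
    # The iteration budget ran out before the change fell below stop_value.
    return max_iters, previous


# Toy stand-in for an EM update: the log-likelihood moves halfway to 0 each call.
state = {"ll": -1.0}


def toy_step():
    state["ll"] /= 2.0
    return state["ll"]


iterations, log_likelihood = run_em(toy_step, max_iters=100, stop_value=1e-7)
print(iterations, log_likelihood)
```

With a smaller `--max_iters`, the same loop would simply return after that many iterations, whether or not the change had dropped below `--stop_value`.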
+
+Postprocess
+-----------
+Currently, the postprocess module is only for plotting the BAF heat map of each segment:
+```
+$ PyLOH.py BAF_heatmap BASENAME
+```
+
+**BASENAME** The base name of preprocessed files created in the preprocess step.
+
+
+Output files
+------------
+**\*.PyLOH.counts** The preprocessed paired counts file. It contains the allelic counts of sites that are heterozygous 
+loci in the normal genome. The definition of each column in a *.PyLOH.counts file is listed here:
+
+| Column    | Definition                                                |
+| :-------- | :-------------------------------------------------------- |
+| seg_index | Index of each segment                                     |
+| normal_A  | Count of bases matching the A allele in the normal sample |
+| normal_B  | Count of bases matching the B allele in the normal sample |
+| tumor_A   | Count of bases matching the A allele in the tumor sample  |
+| tumor_B   | Count of bases matching the B allele in the tumor sample  |
+
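To make the counts columns concrete, the sketch below computes a B allele frequency (BAF) per site as B / (A + B), which is the standard definition. The rows are made-up values, not real PyLOH output, and treating A + B as the site depth for the `--min_depth` check is an assumption made for illustration only.

```python
def b_allele_frequency(count_a, count_b):
    """Return B / (A + B), or None when the site has no reads."""
    total = count_a + count_b
    if total == 0:
        return None
    return count_b / total


# Hypothetical rows: (seg_index, normal_A, normal_B, tumor_A, tumor_B)
rows = [
    (0, 12, 11, 20, 5),
    (0, 15, 14, 18, 6),
    (1, 9, 8, 10, 10),   # normal depth is 17, below the threshold used here
    (1, 10, 10, 10, 10),
]

min_depth = 20  # assumed to mean A + B >= min_depth in both samples

tumor_bafs = []
for seg_index, normal_a, normal_b, tumor_a, tumor_b in rows:
    # Skip sites that are too shallow in either sample.
    if normal_a + normal_b < min_depth or tumor_a + tumor_b < min_depth:
        continue
    tumor_bafs.append((seg_index, b_allele_frequency(tumor_a, tumor_b)))

print(tumor_bafs)
```

A tumor BAF far from 0.5 at a site that is heterozygous in the normal sample is the kind of signal the BAF heat maps visualize.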
+**\*.PyLOH.segments** The preprocessed segments file. It contains the genomic information of each segment. The definition of each
+column in a *.PyLOH.segments file is listed here:
+
+| Column           | Definition                                                | 
+| :--------------- | :-------------------------------------------------------- | 
+| seg_name         | Name of the segment                                       |      
+| chrom            | Chromosome of the segment                                 |  
+| start            | Start position of the segment                             |
+| end              | End position of the segment                               |
+| normal_reads_num | Count of reads mapped to the segment in the normal sample |
+| tumor_reads_num  | Count of reads mapped to the segment in the tumor sample  |
+| LOH_frec         | Fraction of LOH sites in the segment                      |
+| log2_ratio       | Log2 ratio between tumor_reads_num and normal_reads_num   |
+
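As a worked example for the `log2_ratio` column, the following sketch computes the plain log2 ratio of the two read counts. Whether PyLOH additionally normalizes for total sequencing depth is not stated here, so this shows only the unnormalized ratio; `log2_ratio` is a hypothetical helper name.

```python
import math


def log2_ratio(tumor_reads_num, normal_reads_num):
    """Plain log2(tumor / normal) for one segment, without any normalization."""
    if tumor_reads_num <= 0 or normal_reads_num <= 0:
        raise ValueError("read counts must be positive to take a log ratio")
    return math.log2(tumor_reads_num / normal_reads_num)


print(log2_ratio(2000, 1000))  # twice as many tumor reads
print(log2_ratio(500, 1000))   # half as many tumor reads
```

A positive value means relatively more reads in the tumor sample for that segment, and a negative value means fewer.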
+**\*.PyLOH.segments.extended** The extended segments file after run_model. There are two additional columns:
+
+| Column           | Definition                                                | 
+| :--------------- | :-------------------------------------------------------- | 
+| allele_type      | Estimated allele type of the segment                      |      
+| copy_number      | Estimated copy number of the segment                      |  
 
-**--min_depth 20** Minimum depth of 20 in both tumor and normal sample required to use a site in the analysis.
+**\*.PyLOH.purity** Estimated tumor purity.
 
-**--min_base_qual 10** Remove bases with base quality lower than 10.
+**\*.PyLOH.heatmap.pkl** The preprocessed BAF heat map file in Python pickle format.
 
-**--min_map_qual 10** Remove bases with mapping quality lower than 10.
+**\*.PyLOH.heatmap.plot** The folder of BAF heat maps plotted for each segment. A typical BAF heat map looks like this
+![alt tag](https://github.com/uci-cbcl/PyLOH/blob/gh-pages/images/BAF_heamap_sample.png?raw=true)
+
+
+
+OTHER
+=====
+
+BIC-seq related utilities
+-------------------------
+We highly recommend using an automatic segmentation algorithm to partition the tumor genome and prepare the segments file in BED format.
+For example, we used [BICseq](http://compbio.med.harvard.edu/Supplements/PNAS11.html) in the original paper. To run a BICseq analysis, you
+can copy the commands in `bin/BICseq.R` and paste them into an interactive R shell. Alternatively, you can run the R script from the command line:
+```
+$ R CMD BATCH bin/BICseq.R
+```
+Note that `normal.bam` and `tumor.bam` must be in the directory where you run the command. The R script will output a segments file
+`segments.BICseq`. You can then use the other script, `bin/BICseq2bed.py`, to convert the segments file into BED format:
+```
+$ BICseq2bed.py segments.BICseq segments.bed --seg_length 1000000
+```
 
-**--process_num 10** Use 10 processes to launch the preprocess module.
+**--seg_length** Only convert segments longer than this threshold.
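
The length filter can be pictured with the sketch below. It is not the code of `bin/BICseq2bed.py`: `segments_to_bed` is a hypothetical helper, the input tuples are made up, and "longer than" is read here as strictly greater than the threshold. The coordinate convention of the BICseq output is also not covered.

```python
def segments_to_bed(segments, seg_length=1000000):
    """Keep segments longer than seg_length and format them as BED lines.

    `segments` is an iterable of (chrom, start, end) tuples. Each BED line
    holds the three required tab-separated fields: chrom, start, end.
    """
    lines = []
    for chrom, start, end in segments:
        if end - start > seg_length:
            lines.append("{}\t{}\t{}".format(chrom, start, end))
    return lines


# Made-up segments with lengths of 2.5 Mb, 0.5 Mb and 1.5 Mb.
segments = [
    ("chr1", 0, 2500000),
    ("chr1", 2500000, 3000000),
    ("chr2", 100, 1500100),
]

bed_lines = segments_to_bed(segments, seg_length=1000000)
print("\n".join(bed_lines))
```

With a threshold of 1000000, only the first and third segments are kept.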