Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
456 commits
Select commit Hold shift + click to select a range
80e13e7
Option to make qsubber run locally
petercombs Jul 21, 2016
574ada4
Start to compare whether absolute expression perturbed in hybrids
petercombs Jul 28, 2016
31950e8
Smarter treatment of calling maternal
petercombs Jul 28, 2016
09a06e1
Skip calculation of FDR
petercombs Jul 28, 2016
3486c93
Option to calculate svASE Kegg enrichment
petercombs Jul 28, 2016
b094bde
Make PlotUtils great again
petercombs Jul 28, 2016
c02339f
Tweak down the map rate requirement to include slices at the end
petercombs Jul 28, 2016
04c1889
Code for making a graph of connected phenotypes/genes
petercombs Jul 28, 2016
74f36d3
Better estimation of figure size
petercombs Jul 29, 2016
94046b1
Take the average of parents for non-maternal genes
petercombs Jul 29, 2016
feede2a
Return type for not enough data to confidently say
petercombs Aug 1, 2016
c3851ed
I already have this number calculated---and don't have to worry about…
petercombs Aug 1, 2016
16f9495
Code cleanup and write list of genes
petercombs Aug 12, 2016
d614bec
Finally do this as a standalone script that can run
petercombs Aug 12, 2016
a8dac92
Rework how I calculate which genes are mat/dominant/etc
petercombs Aug 12, 2016
9f280ca
Code cleanup
petercombs Aug 12, 2016
41899ca
Make parentals sort better by renaming
petercombs Aug 12, 2016
1ea2c5d
Option to supresss redrawing of outliers
petercombs Aug 12, 2016
916fa7d
Start putting code into place for a weighted rolling average
petercombs Aug 12, 2016
5e8b4b5
Complete options for +/-progress and +/- multiprocessing
petercombs Aug 12, 2016
fb58b89
See if svASE genes are more likely to have SNPs nearby
petercombs Aug 23, 2016
a1323f9
Add Get_chroms in Utils
petercombs Aug 23, 2016
66ba8c9
Skip recalculating all the fits if rerun
petercombs Aug 23, 2016
dee7976
Try to calculate the image width if not provided
petercombs Aug 24, 2016
1ae7947
Option to pass in the base link to use (e.g. for linking to flybase)
petercombs Aug 24, 2016
53743fe
Fix a copy/pasta error
petercombs Aug 24, 2016
d864a1b
Also look at differences in zelda binding
petercombs Aug 25, 2016
c57ac39
Ignore exonic peaks
petercombs Aug 25, 2016
58b6139
Many changes from oenocyte branch of this code
petercombs Aug 31, 2016
06e3365
Add option to use MEME to find motif hits
petercombs Sep 1, 2016
8c89234
Start thinking about how to systematically make predictions of TF bin…
petercombs Sep 2, 2016
b6d0671
Start looking for localized effects of different TF biases
petercombs Sep 7, 2016
4b1811a
Trick to get just the indices where true
petercombs Sep 20, 2016
e82354d
Tweaks to the processing
petercombs Sep 20, 2016
71dd5fd
Evolution of the code
petercombs Sep 20, 2016
139081a
Dig up old code for doing a combination heatmap and scatterplot
petercombs Sep 20, 2016
da9ba86
Make unexpressed stuff just barely off-white, and true white becomes …
petercombs Sep 20, 2016
4db2ab7
A few sets of default plotting options
petercombs Sep 20, 2016
5394c52
Now that we are using gene names throughout, row_labels is unnecessary
petercombs Sep 20, 2016
a0f9f58
Rewrite to use any number of metrics for best alignment, followed by …
petercombs Sep 20, 2016
e5f49e1
Use WASP and the unified chromosome function in Utils
petercombs Sep 20, 2016
5c5b8f1
Function to calculate the EMD within a set of embryos
petercombs Sep 20, 2016
a4a4e06
Assorted DistributionDifference tweaks
petercombs Sep 20, 2016
00ad8f8
Bug fixes
petercombs Sep 20, 2016
7057b7c
Option to "squeeze" each row by applying a function
petercombs Sep 20, 2016
99fc994
Bug fixes
petercombs Sep 20, 2016
e6f246a
More comprehensive way of taking the best motif
petercombs Sep 22, 2016
a0342fa
Also run alignments/binding site finding on Kr and pinocchio and Bsg25A
petercombs Sep 22, 2016
5a8324e
Use the newest ideas on what binding sites actually change
petercombs Oct 18, 2016
8b838db
Lots of changes--too many to clearly catalog. sorry.
petercombs Oct 18, 2016
09391d4
Changes to actually use the R5 reference genome
petercombs Oct 18, 2016
1cbd792
Variable for the list of variants
petercombs Oct 18, 2016
3acfa93
Lots of code to get the binding data
petercombs Oct 18, 2016
98fc795
Frist attempt for CompareAtlases
petercombs Nov 8, 2016
e32ffb1
Standardized code to get rid of extra plot spines
petercombs Nov 8, 2016
1592be7
Make some effort to match the mel and sim histograms
petercombs Nov 11, 2016
be862d3
Match nuclei as modeled by Fowlkes 2011
petercombs Nov 11, 2016
e95c762
Be more explicit in which stage I'm using
petercombs Nov 16, 2016
9d4e87f
Rename denom to avoid conflicts
petercombs Nov 16, 2016
7256e3d
Use a variable for the target gene
petercombs Nov 28, 2016
77ef4d2
Actually calculate the best-fit cells
petercombs Nov 28, 2016
e6d47ff
Use the cohort_names variable to ensure consistency
petercombs Nov 28, 2016
85ceb2d
Manual 'database' of the different parts of the embryo with no expres…
petercombs Nov 28, 2016
dc606c9
Cleaner output
petercombs Nov 28, 2016
cde75f8
Summarize the effect on hunchback targets
petercombs Nov 28, 2016
ec83b14
Merge branch 'HybridSliceSeq' into ReferenceR5
petercombs Nov 28, 2016
f6f2450
Lower tolerance to improve sim-only FPKM estimation
petercombs Nov 28, 2016
81a40bc
Load module for STAR before using it
petercombs Nov 28, 2016
cf0b8a9
Test file for using the Isolator splicing analysis
petercombs Dec 1, 2016
1f1bfc8
Make SVGs of interesting ranges
petercombs Dec 1, 2016
54bd992
Look up the sequence regions around genes with svASE
petercombs Dec 1, 2016
aca4d87
More descriptive figure
petercombs Dec 1, 2016
2c0c4d3
Compute the correlation with actual data
petercombs Dec 1, 2016
85ba205
Compute the correlation with actual data
petercombs Dec 1, 2016
b801f8a
Merge branch 'ReferenceR5' of github.com:petercombs/HybridSliceSeq in…
petercombs Dec 1, 2016
26e9408
Better formatting of very small values
petercombs Dec 2, 2016
eb67731
Report FDR stats for fitting
petercombs Dec 2, 2016
ec976c0
Report TF binding data for the paper
petercombs Dec 2, 2016
5300b97
Improved atlas plotting
petercombs Dec 16, 2016
db9787f
Generalize the Compare_svASE code
petercombs Dec 16, 2016
993c4cd
Make HybUtils bias testing a bit safer
petercombs Dec 16, 2016
c02a0ac
Try to nicely kill the job if a sigterm is received
petercombs Dec 16, 2016
8729600
Be more consistent about using the right flybase version
petercombs Dec 16, 2016
4840849
Don't throw away bad columns (we need them for spacing)
petercombs Jan 4, 2017
3ef54d4
Automated analysis of TF binding
petercombs Jan 4, 2017
03eb2ec
Allow to be called from outside the current working directory
petercombs Jan 4, 2017
09a6ddf
Code to figure out if svASE is different between cross directions
petercombs Jan 4, 2017
d81b0bd
Start to refactor drawing code to separate functions
petercombs Jan 4, 2017
6eaa419
Handle misalignments more gracefully
petercombs Jan 4, 2017
2e6233c
Refactor out the alignment drawing
petercombs Jan 4, 2017
0fbe6a8
First successful stab at overlaying the chromatin accessibility data
petercombs Jan 4, 2017
bd54f0b
Take advantage of new code for plotting DNase
petercombs Jan 4, 2017
ea17859
Move peak files to a subdirectory
petercombs Jan 4, 2017
f5c70b9
See if the bicoid site pushes strength negative, or just to zero
petercombs Jan 4, 2017
164ecfa
Ignore invisible files
petercombs Jan 4, 2017
a95971f
Sort alignments for consistent order
petercombs Jan 5, 2017
8a11140
Better handle being in a different directory
petercombs Jan 10, 2017
43434f5
Create in situ-like images of mel and sim expression
petercombs Jan 10, 2017
3b7c5e9
Start looking for genes that may drive interesting phenotypes
petercombs Jan 10, 2017
38b471f
Log an error and don't crash if a column is missing
petercombs Jan 13, 2017
f4e73da
Use R5 *everywhere*
petercombs Jan 13, 2017
2beaa4e
Use the right bed files everywhere, and expand search radius
petercombs Jan 13, 2017
bd5329a
Use a single cutoff everywhere (make it easier to set a different one)
petercombs Jan 13, 2017
8d80c49
More flexible plotting/output options
petercombs Jan 13, 2017
8523588
Make expression files with hybrids only, for easier flicking between …
petercombs Jan 13, 2017
75f3f5b
Add more TFs and make easier to download
petercombs Mar 9, 2017
71ed9ea
Also add in ORegAnno regions
petercombs Mar 9, 2017
23f803d
Look in all available gene map tables for chromosomes of genes
petercombs Mar 9, 2017
1a6cb4c
Also add synonyms to chromosome lookup table
petercombs Mar 9, 2017
7149b43
Calculate the upper bound of FDR
petercombs Mar 9, 2017
38d597e
Better way of comparing alignments
petercombs Mar 9, 2017
aa087c0
Bug fix on sorting
petercombs Mar 9, 2017
8b679f5
Picard now has a tool to do the correct deduplication
petercombs Mar 9, 2017
fd3ae29
Now use an adjustable cutoff
petercombs Mar 9, 2017
c6527eb
Tons of plotting additions, including virtual sliced ase
petercombs May 11, 2017
715043d
Generate heatmaps with just two replicates
petercombs May 15, 2017
68070c0
Save output to a file
petercombs May 15, 2017
aee4502
The cluster now seems to require me to accurately guess memory requir…
petercombs May 15, 2017
1367027
Code for running the sign test on GO categories
petercombs May 15, 2017
d8977ed
More selector functions
petercombs May 15, 2017
3cb8f27
Find the coordinates of all genes
petercombs May 15, 2017
19bc7c3
Try to kill job on the cluster when a kill command is sent
petercombs May 15, 2017
c140de3
More vigorously try to print where the error is
petercombs May 15, 2017
e6cf762
I thought I had code to do a maternal ttest already, but I can't find…
petercombs May 15, 2017
54830c0
Ignore genes with identical CDSes
petercombs May 19, 2017
452c05c
Combine code for plotting both virtual ASE atlas and virtual slices
petercombs May 19, 2017
b65875e
Search for the best TF tweak, rather than guessing a constant
petercombs May 19, 2017
c35a21e
Fix sample numbering in case of multiple entries
petercombs May 31, 2017
b1a1d71
Print out genes at various extrema of chromosomes
petercombs May 31, 2017
fc23a1d
Code from a couple months ago for sign test stuff
petercombs May 31, 2017
3622dd4
Code for doing DEXseq automagically
petercombs May 31, 2017
5a7ef0a
quickly set background colors
petercombs May 31, 2017
7e05bc9
Code for quickly plotting coordinates of genes on the various chromos…
petercombs May 31, 2017
70cf982
Capitalize Kr consistently
petercombs May 31, 2017
525aca9
Option for a figure title
petercombs May 31, 2017
cdf7400
Bugfix for index keyword
petercombs May 31, 2017
bab55ec
Options to normalize by constants
petercombs May 31, 2017
c26c13b
Output a single-column that has quick summaries of all the fits
petercombs May 31, 2017
4cd7472
Color NaN's white by default
petercombs May 31, 2017
1324f1b
Plot various classes of putative targets
petercombs May 31, 2017
ffbb9a2
Various tweaks to try to find a phenotype
petercombs May 31, 2017
5b6f184
Another attempt at spatially-varying splicing
petercombs Jun 7, 2017
102fc52
Calculate explicitly zygotic genes and plot a heatmap of Susan's data
petercombs Jun 7, 2017
ffc1825
Easier parsing of output
petercombs Jun 9, 2017
269cfbd
Use command line args to separate out analyses
petercombs Jun 9, 2017
eb798ff
Skip bad files if I want an incomplete datset
petercombs Jun 9, 2017
7729525
First stab at calculating PSI values everywhere
petercombs Jun 13, 2017
d862119
Speed enhancements
petercombs Jun 14, 2017
3c48f89
Fix bug if reads jump over gene
petercombs Jun 14, 2017
78cfb81
Set Maximum sharpness of a function to avoid overflowing
petercombs Jun 14, 2017
b42276f
Add option to keep all male data
petercombs Jun 14, 2017
6ec6616
Options to do basic filtering of the data
petercombs Jun 14, 2017
cf72ea5
Option to have multiple, aligned labels
petercombs Jun 20, 2017
1dbb760
Add species dominance to GetASEStats
petercombs Jun 20, 2017
c1f32d5
Move the writing of files sooner
petercombs Jul 24, 2017
dc18dd9
Output the r^2 as well
petercombs Jul 24, 2017
f252b4d
First stab at implementing Hunter's proposed pipeline in email of ~7/13
petercombs Jul 24, 2017
b87ada4
Cast to bool in case input is some other type
petercombs Jul 24, 2017
e496343
Figure title can now take a tuple for multiple rows
petercombs Jul 24, 2017
3d38d71
Use an actual variable for the ASE cutoff
petercombs Jul 24, 2017
0612fa7
Splicing code
petercombs Jul 24, 2017
70bd3fe
Remove dummy lines
petercombs Jul 24, 2017
0cc9b1a
Add option to add a baseline to control for noisy expression
petercombs Jul 24, 2017
f5e6652
Switch to melXmel instead of just mel. Been a long time since that wo…
petercombs Jul 24, 2017
df06bc3
Command line switch for multiprocessing
petercombs Jul 24, 2017
a47c577
Summary table for the gene expression omnibus
petercombs Jul 24, 2017
b01380d
Add option for key column number, rather than key string
petercombs Jul 24, 2017
a5fcce6
Slightly tweak the GEO code
petercombs Jul 24, 2017
d45f988
Kade found a bug where I'm not writing out the pair
petercombs Aug 3, 2017
9ff2a56
Hunter asked for a sans serif font, since the serif font had ugly jag…
petercombs Aug 3, 2017
58cfb64
Write out type one and type 2, since I was manually changing this before
petercombs Aug 3, 2017
d214dc2
Properly deal with the lack of information on male X chromosomes
petercombs Aug 3, 2017
fe96139
Portability improvements
petercombs Aug 3, 2017
fff3f65
Option to draw GTF track
petercombs Aug 3, 2017
3805cee
Minor tweaks to argument names for consistency
petercombs Aug 3, 2017
4eb4220
Add option to use a multiple alignment
petercombs Aug 3, 2017
1077039
Create drawing groups
petercombs Aug 3, 2017
c690813
Just accept that I'm not calculating the FDR in here
petercombs Aug 3, 2017
76d4503
More orphan analyses
petercombs Aug 3, 2017
933c2b2
Igore collected files for GEO
petercombs Aug 3, 2017
03f1829
Start using Snakemake where possible
petercombs Aug 3, 2017
0f96d45
Helper code to get everything read for submission to GEO
petercombs Aug 3, 2017
032f313
Use full list of binding site changes
petercombs Aug 3, 2017
c5a575e
Simplify title option
petercombs Aug 3, 2017
c244b87
Many plotting tweaks
petercombs Aug 3, 2017
c323d78
More sans-serif
petercombs Aug 10, 2017
1adce13
Code to help get allele-specific primers
petercombs Aug 31, 2017
2681473
Try calculating ASE weighted by expression
petercombs Aug 31, 2017
3a75a7e
Add option to give the name of the file created
petercombs Aug 31, 2017
2de1fb6
Tweak output for better organization
petercombs Aug 31, 2017
c0cc121
Add data for the gene sala
petercombs Aug 31, 2017
e8abb65
Use an expression-weighted ASE score
petercombs Aug 31, 2017
7253b83
More cleanly generate customizable binding change maps
petercombs Aug 31, 2017
fb3b2b2
Merge branch 'ReferenceR5'
petercombs Aug 31, 2017
b4a4715
More of the push towards turnkey Snakefile operation
petercombs Sep 1, 2017
3ec274d
Add in the SRA keys for the gDNA
petercombs Sep 5, 2017
5c4960e
More of the sliceseq pipeline
petercombs Sep 5, 2017
4db76f2
Modernize the readme file
petercombs Sep 5, 2017
5152272
Start with default cluster parametes
petercombs Sep 5, 2017
625582d
Mention switching to Snakemake
petercombs Sep 5, 2017
a7db17a
Final set of tweaks to get it to run
petercombs Sep 11, 2017
c44a191
Look more closely for an effect of transposons on svASE
petercombs Nov 13, 2017
6c7f872
Restrict species ids to three letters
petercombs Nov 13, 2017
dccff22
Fix parenthesization bug
petercombs Nov 13, 2017
c8af60e
GTF processing function
petercombs Nov 13, 2017
2b716e8
Use the original version to build mel/sim genes
petercombs Nov 13, 2017
0c4a2b1
Merge branch 'master' of github.com:petercombs/HybridSliceSeq
petercombs Dec 19, 2017
6d87861
Split the genome prior to running GATK
petercombs Dec 19, 2017
b5c8904
Be move verbose with prereqs and arguments
petercombs Dec 19, 2017
5e33162
Specify a samtools prefix
petercombs Dec 19, 2017
da94dad
Merge branch 'master' of github.com:petercombs/HybridSliceSeq
petercombs Dec 19, 2017
4fabfc8
Remove Unused scripts and files
petercombs Feb 2, 2018
2757232
Fix edge case bug with multiple NaNs
petercombs Feb 2, 2018
21edc9b
Start stripping out usage of deprecated .ix on DataFrames
petercombs Feb 2, 2018
c675a35
Be more permissive if empty column isn't present
petercombs Feb 2, 2018
fd587a3
Use argparse to be more flexible
petercombs Feb 2, 2018
a4cbc80
Use argparse for FDR on other files
petercombs Feb 3, 2018
af4964e
Switch FDR to use fyrd
petercombs Feb 3, 2018
c506931
Merge branch 'master' of github.com:petercombs/HybridSliceSeq
petercombs Feb 3, 2018
9a3221d
Use only true heterozygous snps
petercombs Feb 3, 2018
33ea33f
Calculate Percent Spliced In automatically
petercombs Feb 3, 2018
3019124
Ignore commented lines
petercombs Feb 3, 2018
42009b5
More tweaks to get this to run on Sherlock
petercombs Feb 4, 2018
bfbabdb
Merge branch 'master' of github.com:petercombs/HybridSliceSeq
petercombs Feb 4, 2018
d72099a
More aggressively write output (in multiple formats)
petercombs Feb 5, 2018
a878223
Merge
petercombs Feb 5, 2018
d14a13c
Start updating the readme
petercombs Feb 7, 2018
2f62b49
Merge branch 'master' of github.com:petercombs/HybridSliceSeq
petercombs Feb 7, 2018
01384e1
Tweaks to the FDR calculation scripts
petercombs Feb 7, 2018
600da8f
Merge branch 'master' of github.com:petercombs/HybridSliceSeq
petercombs Feb 7, 2018
b0cc9b6
Merge branch 'master' of github.com:petercombs/HybridSliceSeq
petercombs Feb 7, 2018
eb635e4
Option to allow start and stop to be cast
petercombs Feb 10, 2018
649c14d
Snakefile generate prerequisites for binding figures
petercombs Feb 10, 2018
9734a0c
Specify sans-serif everywhere (validator demands all lowercase)
petercombs Feb 10, 2018
28cae88
Another attempt at finding spatial splicing
petercombs Feb 15, 2018
3e150e4
Compute ASE on smaller portions of genes (CDS, exons)
petercombs Feb 16, 2018
673148c
One attempt to describe 5' UTRs
petercombs Feb 16, 2018
6e98902
More explicit output
petercombs Feb 16, 2018
d1a4145
Extend functionality of autocorrs
petercombs Feb 16, 2018
f996e0d
Better scaling of figure with really wide labels
petercombs Feb 16, 2018
98ec007
Option for a full variance colorscaling
petercombs Feb 16, 2018
4682f0d
Ignore fyrd files
petercombs Feb 16, 2018
56bc034
Merge
petercombs Feb 16, 2018
95c67aa
Bugfixes
petercombs Feb 18, 2018
5394825
Merge branch 'master' of github.com:petercombs/HybridSliceSeq
petercombs Feb 18, 2018
4a741cb
Use kallisto for quantification
petercombs Feb 21, 2018
c8d26d1
Merge branch 'master' of github.com:petercombs/HybridSliceSeq
petercombs Feb 21, 2018
bd7f436
Final changes before submission
petercombs Mar 7, 2018
e6560fc
Better running on Sherlock
petercombs Mar 7, 2018
b3b8e06
Merge branch 'master' of github.com:petercombs/HybridSliceSeq
petercombs Mar 7, 2018
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
23 changes: 23 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,23 @@
analysis*
Reference/*
sequence/*
prereqs/*
*.swp
*.out
*.pyc
*[Ll]og*
tmp*
*.pkl
__pycache__
config.make
godot
iPythonNotebooks/.ipynb_checkpoints
geo
.snakemake
fit_and_eval*
*cluster.py
*.sbatch
*.qsub
*.err
*.pickle.in
*.func.pickle
142 changes: 0 additions & 142 deletions BED2FASTA.pl

This file was deleted.

217 changes: 217 additions & 0 deletions CalculatePSI.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,217 @@
from __future__ import print_function
from collections import defaultdict
from pysam import Samfile
from progressbar import ProgressBar as PBar
import pandas as pd
import numpy as np
from argparse import ArgumentParser
from sys import stdout

def parse_annotation(annot):
retval = {}
for i in annot.strip(';').split(';'):
key, val = i.strip().split(' ', maxsplit=1)
retval[key.strip()] = val.strip().strip(';"')
return retval


def get_supported_exons(read, exons_in_gene):
exon_num = 0
exons_iter = iter(exons_in_gene)
supported_exons = []
supported_lens = []
excluded_exons = []
cur_pos = read.pos
cur_exon = next(exons_iter)
try:
while cur_exon[1] < cur_pos:
#print("Skipping", cur_exon, cur_pos)
cur_exon = next(exons_iter)
#print("Skipped, now on:", cur_exon, cur_exon[1] < cur_pos)
exon_num += 1
#print(read.cigartuples)
for cigartype, length in read.cigartuples:
# Match = 0, Deletion = 2, Skipped = 3
#print(cigartype, length, cur_exon, cur_pos,
#cur_pos <= cur_exon[1] <= cur_pos+length,
#cur_pos < cur_exon[1] < cur_pos + length,
#)
if cigartype == 0:
while ((cur_pos <= cur_exon[1] <= cur_pos+length) or
(cur_pos <= cur_exon[0] <= cur_pos+length) or
(cur_exon[0] < cur_pos < cur_exon[1])):
supported_lens.append(cur_exon[1] - cur_exon[0])
supported_exons.append(cur_exon[2])
#print("Adding", cur_exon)
cur_exon = next(exons_iter)
cur_pos += length
if cigartype==3 or cigartype == 2:
while cur_pos < cur_exon[1] < cur_pos + length:
#print("Excluding", cur_exon)
excluded_exons.append(cur_exon[2])
cur_exon = next(exons_iter)
cur_pos += length
except StopIteration:
#print("StoppedIteration", cur_exon)
pass
finally:
return (
supported_lens,
supported_exons,
(excluded_exons if supported_exons else [])
)


def get_exon_dictionary(gtf_file):
current_gene = ''
all_exons_by_chrom_by_gene = defaultdict(list)
exon_names = {}

num_exons = 0
for line in open(gtf_file):
data = line.strip().split('\t')
left = int(data[3])
right = int(data[4])
annot = parse_annotation(data[-1])
if data[2] == 'aggregate_gene':
current_exons = []
current_gene = annot['gene_id']
# Note here that what *should* happen is due to python's
# pass-by-reference magic, we should be able to keep editing
# current_exons even after it's been inserted into the dictionary
all_exons_by_chrom_by_gene[data[0]].append(
(left, right, current_gene, current_exons)
)
elif data[2] == 'exonic_part':
current_exons.append((left, right, num_exons,
'{}_{}'.format(
annot['gene_id'], annot['exonic_part_number']
)
))
exon_names[num_exons] = '{}_{}'.format( annot['gene_id'], annot['exonic_part_number'])
num_exons += 1
return all_exons_by_chrom_by_gene, exon_names


def parse_overlapping_reads(samfile, raw_exon_counts, exons_on_chrom):
for read in samfile:
chrom = read.reference_name
for read in samfile:
for gene_left, gene_right, gene_name, exons in exons_on_chrom[chrom]:
positions = read.positions
if gene_left <= positions[0] < positions[-1] <= gene_right:
old_exon_right = -1
for exon_left, exon_right in exons:
if exon_right < positions[0] or positions[-1] < exon_left:
continue
elif exon_left in positions or exon_right in positions:
raw_exon_counts.ix[(chrom, exon_left, exon_right,
gene), 'INCLUDED'] += 1
else:
pass

def iterate_over_samfile(samfile, exon_dictionary, exon_names, timeit=1e100):
samfile.reset()
curr_reference = ''
all_exons = pd.DataFrame(index=exon_names.keys(),
columns=['SUPPORTED', 'EXCLUDED',
'N_SUPPORTED', 'N_EXCLUDED'],
data=0.0, dtype=float,
#data={'SUPPORTED': 0.0, 'EXCLUDED': 0.0,
#'N_SUPPORTED': 0, 'N_EXCLUDED': 0}
)
num_exons = len(exon_names)
arr_supported = np.zeros(num_exons, dtype=float)
arr_excluded = np.zeros(num_exons, dtype=float)
arr_n_supported = np.zeros(num_exons, dtype=int)
arr_n_excluded = np.zeros(num_exons, dtype=int)
references=samfile.references
pb = PBar(maxval=samfile.mapped)
pb.start()
#pb = lambda x: x
for i, read in enumerate(samfile):
if i % 1e5 == 0:
pb.update(i)
refid = read.reference_id
if refid != curr_reference:
curr_reference = refid
genes_on_chrom = exon_dictionary[references[refid]]
current_position_in_annotation = 0
curpos = read.pos
while current_position_in_annotation < len(genes_on_chrom) and genes_on_chrom[current_position_in_annotation][1] < curpos:
current_position_in_annotation += 1
for left, right, genedata, exons in genes_on_chrom[current_position_in_annotation:]:
if curpos < left:
break
readlen = sum(i for t, i in read.cigartuples if t == 0)
lens, supported, excluded = get_supported_exons(read, exons)
for length, ix in zip(lens, supported):
arr_supported[ix] += 1/(length + readlen - 1)
arr_n_supported[ix] += 1
for ix in excluded:
arr_excluded[ix] += 1/(readlen - 1)
arr_n_excluded[ix] += 1
if i > timeit:
break
pb.finish()
all_exons['SUPPORTED'] = arr_supported
all_exons['EXCLUDED'] = arr_excluded
all_exons['N_SUPPORTED'] = arr_n_supported
all_exons['N_EXCLUDED'] = arr_n_excluded
all_exons.rename(index=exon_names, inplace=True)
return all_exons

def iterate_over_references(samfile, exon_dictionary):
samfile.reset()
references = samfile.references
all_exons = pd.DataFrame(index=[exon
for genes in exon_dictionary.values()
for _, _, _, exons in genes
for _, _, exon in exons
],
columns=['SUPPORTED', 'EXCLUDED', 'N_SUPPORTED',
'N_EXCLUDED'],
data=0)

for reference in PBar()(references):
genes_on_chrom = exon_dictionary[reference]
current_position_in_annotation = 0
for read in samfile.fetch(reference):
curpos = read.pos
while current_position_in_annotation < len(genes_on_chrom) and genes_on_chrom[current_position_in_annotation][1] < curpos:
current_position_in_annotation += 1
for left, right, genedata, exons in genes_on_chrom[current_position_in_annotation:]:
if curpos < left:
break
supported, excluded = get_supported_exons(read, exons)
all_exons.ix[supported, 'SUPPORTED'] += 1
all_exons.ix[excluded, 'EXCLUDED'] += 1


return

def parse_args():
parser = ArgumentParser()
parser.add_argument('samfile', type=Samfile)
parser.add_argument('gtf_file')
parser.add_argument('--outfile', '-o', default=stdout)
parser.add_argument('--min-reads', default=20, type=int)
parser.add_argument('--timeit', default=1e100, type=float)
return parser.parse_args()


if __name__ == "__main__":
args = parse_args()
exon_dict, exon_names = get_exon_dictionary(args.gtf_file)
all_exons = iterate_over_samfile(args.samfile, exon_dict, exon_names, args.timeit)
all_exons.index.name = 'exon_id'
all_exons['psi'] = all_exons.SUPPORTED / (all_exons.SUPPORTED +
all_exons.EXCLUDED)
all_exons.ix[(all_exons.N_SUPPORTED + all_exons.N_EXCLUDED) < args.min_reads,
'psi'] = pd.np.nan
all_exons.to_csv(args.outfile, sep='\t', na_rep='NA')





Loading