Data concentration plot #36
base: master
…t since it shows the read length neighborhood where most data is concentrated, if such a neighborhood exists (they exist for PacBio reads, but the plot seems more uniform with MinION reads). This is the type of plot one sees in MinKNOW during the sequencing run. poretools/poretool_main.py was edited to include this subcommand and poretools/dataconc.py was added to the repertoire. dataconc.py uses the matplotlib and pandas libraries and is only a few simple lines of code.
… data as a percent of all data. The other is --cumulative, which plots the cumulative data with increasing read length. --percent and --cumulative can also be used together.
I just updated data_conc so that it can also plot the cumulative amount of data with increasing read length, and so that both the regular and cumulative plots can be shown as a percent of total data instead of as absolute amounts (in bp). The command-line options are --cumulative and --percent.
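For readers following along, here is a minimal sketch of the binning that --cumulative and --percent describe. It is not the code in dataconc.py (which uses pandas and matplotlib); the function name, the `bin_width` parameter, and the return format are assumptions made for illustration, and it uses only the standard library.

```python
def data_concentration(read_lengths, bin_width=1000, cumulative=False, percent=False):
    """Sum the bases contributed by reads in each read-length bin.

    Hypothetical helper: returns a list of (bin_start, value) pairs,
    where value is in bp, or in percent of all data if percent=True.
    Read lengths are assumed to be positive integers.
    """
    if not read_lengths:
        return []

    # One bin per bin_width bp, up to the longest read.
    n_bins = max(read_lengths) // bin_width + 1
    totals = [0] * n_bins
    for length in read_lengths:
        # Each read adds its full length (in bp) to its bin.
        totals[length // bin_width] += length

    if cumulative:
        # Running sum of data with increasing read length.
        running = 0
        for i, value in enumerate(totals):
            running += value
            totals[i] = running

    if percent:
        # Express each bin as a percent of all sequenced bases.
        grand_total = sum(read_lengths)
        totals = [100.0 * value / grand_total for value in totals]

    return [(i * bin_width, value) for i, value in enumerate(totals)]
```

With both flags set, the last bin always reaches 100 percent, which is a quick sanity check on the output.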
This is great, John, thanks for the pull request. My only slight concern is that this adds new dependencies, and we have found that users struggle to install many dependencies on quite diverse setups. It might be good to re-code this to use Rpy2/ggplot2, as this is what we already use for plotting. One of us could perhaps do this.
I can try to re-code it, but I have to familiarize myself with rpy2 and ggplot2 first. I do a lot of coding in R (in fact, I first coded this in R, which is why I used the pandas library in Python), but I usually just use regular old plot(). I first tried to do this in rpy2 for the reasons you mention, but found rpy2 somewhat confusing despite being familiar with both R and Python. Any tips on it are welcome. Despite the disadvantage of extra dependencies, one advantage of matplotlib plotting is that when the user does not use "--saveas" and the plot goes to the screen, the user can still choose to save what they see (in any format).
…tration plots would look like with uniform sampling of read lengths
I just added a feature to data_conc that lets the user simulate what the data concentration plot would look like if the read lengths were uniformly sampled. By default it uses the same number of reads and the same range of sizes as the real data. The user can override this default and simulate any number of reads and any range. MinION data concentration plots look strikingly uniform (though not completely) compared to PacBio plots.
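A rough sketch of the uniform simulation described above, assuming it amounts to drawing integer read lengths uniformly between a minimum and maximum. The function and parameter names here are hypothetical and are not taken from dataconc.py.

```python
import random


def simulate_uniform_lengths(observed_lengths, n_reads=None,
                             min_len=None, max_len=None, seed=None):
    """Draw read lengths uniformly at random (hypothetical helper).

    Defaults mirror the observed data: same number of reads and the
    same range of sizes. Any of the three can be overridden.
    """
    rng = random.Random(seed)  # seed is optional, for reproducibility
    n = len(observed_lengths) if n_reads is None else n_reads
    lo = min(observed_lengths) if min_len is None else min_len
    hi = max(observed_lengths) if max_len is None else max_len
    # randint is inclusive on both ends.
    return [rng.randint(lo, hi) for _ in range(n)]
```

The simulated lengths can then be binned in the same way as the real reads, so the two plots are directly comparable.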
… types, time constraints, etc. Simulation also updated to reflect these changes where relevant.
DC plots can now be generated based on read type, start/end times, etc. I show examples on my ONT poreminion page. I still have not fully converted it to rpy2, though.
Hi Aaron,
If you are interested, I added a new subtool that returns the type of plot one sees in MinKNOW during the sequencing run with the sum of data over each bin.
poretools/poretool_main.py was edited to include the subcommand "data_conc" and poretools/dataconc.py was added to the repertoire.
dataconc.py uses the matplotlib and pandas libraries. I kept with the structure and lingo you used in hist.py. data_conc can actually write to any image extension type that matplotlib allows, but I arbitrarily restricted it to pdf and jpg to avoid errors thrown from erroneous extensions.
best,
John
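The pdf/jpg restriction mentioned in the description above could be enforced with a check along these lines. This is only an illustrative sketch: `check_saveas` and `ALLOWED_EXTENSIONS` are hypothetical names, and the actual check in dataconc.py may differ.

```python
import os

# Assumed whitelist, based on the description: only pdf and jpg.
ALLOWED_EXTENSIONS = {".pdf", ".jpg"}


def check_saveas(path):
    """Return the lowercased extension of path, or raise ValueError
    if it is not one of the allowed image types."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(
            "Unsupported extension %r; use one of: %s"
            % (ext, ", ".join(sorted(ALLOWED_EXTENSIONS)))
        )
    return ext
```

Failing early on the file name avoids running the whole plot only to have the save step raise an error at the end.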