Figures for Labeled Optimal PARTitioning paper, arXiv:2006.13967.
- Figure 1: example data and cost computation. R script, standAlone pdf, tex for inclusion in paper.
- Figure 2: timings. R script, tex for inclusion in paper: time vs number of labels (left), time vs number of data (right).
- Figures 3-4: Best case label error. R script, OPART pdf (Fig 3 left), BinSeg pdf (Fig 3 right), SegAnnot pdf (Fig 4).
- Comparison with BIC figures. The R script makes the predictions CSV.
- First run figure-label-errors-data.R and figure-label-errors-data-binseg.R to compute label error rates for a grid of penalties. This should create the files figure-label-errors-data/*.csv and figure-label-errors-data-binseg/*.csv, which are the same as the contents of figure-label-errors-data.tgz.
- Figure 5: Error rates, CSV for data in figure.
- Figure 6: ROC curves, CSV for data in figure.
Reproducibility: use “make” with one of the recipes defined in Makefile.
Video screencast for NAU ML group meeting, 31 Aug 2020.
slides.tex makes slides.pdf
- figure-candidates.R demo slides.
- figure-candidates-interactive.R creates a more interactive version of the demo slide figures, where you can select any t,tau and other penalty values: http://ml.nau.edu/viz/2021-07-23-LOPART-candidates/
- figure-baselines.R makes figures that show why previous baselines are not good enough.
- Predicted constant penalty label error figures. R script, OPART pdf, SegAnnot pdf, ROC curves.
- Sequential cross-validation (fold 1 = first half of labels, fold 2 = second half of labels) figures. R script, Best case OPART differences, Best case SegAnnot differences, ROC curves, Test error rates.
The 413 real genomic data sequences described in the paper are downloadable here:
- Each CSV file has a sequenceID column which is of the form A.B where A is the profile ID number and B is the chromosome name. This ID can be used to separate the data into different segmentation/changepoint detection problems.
- signals, from 39 to 43628 observations per data sequence.
- data.i column ranges from 1 to the number of data points in that sequenceID (data.i=1 is the first data point in the sequence, data.i=2 is the second, etc.).
- logratio column is the raw/noisy data which should be used as input to the segmentation/changepoint detection algorithm.
- labels, from 2 to 12 per data sequence.
- changes column is either 0 or 1 (the number of changepoints which should be predicted in this region label).
- A false positive label is one where the predicted number of changepoints is greater than the labeled number of changepoints (e.g., 2 changes predicted in either a 0 or 1 label).
- A false negative label is one where changes=1 and the predicted number of changepoints is 0.
- fold column is either 1 or 2 (fold IDs used in cross-validation in the paper).
- start/end columns define a region of the corresponding data sequence (in terms of data.i index values) in which the labeled number of changes should be detected.
- Note on data source: the labels were created using SegAnnDB software, in which expert biologists looked at scatterplots and created labels based on visual interpretation of significant signal/noise patterns. Source data file, R Script to process, SegAnnDB paper.
I recommend computing the label error using the penaltyLearning R package (see the useR 2017 tutorial materials).
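As a rough illustration of the column definitions and label error rules above, here is a minimal base R sketch. It is not the penaltyLearning implementation. The toy tables, the helper names (split_id, label_error), and the predicted changepoints are all made up for the example. It also assumes that a predicted changepoint is recorded as the data.i of the last point before the change, and that it counts toward a label when start <= data.i < end; check that convention against your own segmentation code.

```r
# Toy stand-ins for the signals and labels tables (values are made up).
signals <- data.frame(
  sequenceID = rep(c("1.2", "1.X", "2.1"), each = 6),
  data.i = rep(1:6, times = 3),
  logratio = rep(c(0, 0, 0, 1, 1, 1), times = 3),
  stringsAsFactors = FALSE)
labels <- data.frame(
  sequenceID = c("1.2", "1.2", "1.X", "2.1"),
  start = c(1, 3, 2, 1),
  end = c(3, 6, 5, 6),
  changes = c(0, 1, 0, 1),
  stringsAsFactors = FALSE)

# Hypothetical helper: split a sequenceID of the form A.B into
# profile ID (A) and chromosome name (B).
split_id <- function(id) {
  data.frame(
    profile = sub("[.].*$", "", id),
    chromosome = sub("^[^.]*[.]", "", id),
    stringsAsFactors = FALSE)
}

# One segmentation/changepoint detection problem per sequenceID.
signal_list <- split(signals, signals$sequenceID)

# Hypothetical predictions: for each sequenceID, the data.i values of the
# last point before each predicted change (assumed convention).
# Sequence "2.1" has no entry, meaning no predicted changes.
predicted <- list("1.2" = 3L, "1.X" = c(2L, 4L))

# Hypothetical helper: count predicted changes in each label region and
# apply the false positive / false negative definitions from the list above.
label_error <- function(labels, predicted) {
  n.pred <- mapply(function(id, start, end) {
    t <- predicted[[id]]            # NULL when nothing was predicted
    sum(start <= t & t < end)       # assumed region membership rule
  }, as.character(labels$sequenceID), labels$start, labels$end)
  data.frame(
    labels,
    predicted.changes = as.integer(n.pred),
    fp = as.integer(n.pred > labels$changes),
    fn = as.integer(labels$changes == 1 & n.pred == 0))
}

ids <- split_id(names(signal_list))
err <- label_error(labels, predicted)
print(err)
```

In this toy example the third label (0 changes, 2 predicted) is a false positive and the fourth label (1 change, 0 predicted) is a false negative. For real analyses, the penaltyLearning package recommended above is the better choice.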