From 329b5b3e921d7c36a808986a50378ec665755533 Mon Sep 17 00:00:00 2001
From: Toby Dylan Hocking
Date: Fri, 16 Oct 2020 10:16:47 -0700
Subject: [PATCH] benchmark data section

---
 README.org | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/README.org b/README.org
index 5d91305..1d5df08 100644
--- a/README.org
+++ b/README.org
@@ -35,3 +35,37 @@ Reproducibility: use "make" with one of the recipes defined in
   second half labels) figures. [[file:figure-sequence-cv.R][R script]], [[file:figure-sequence-cv-OPART.pdf][Best case OPART differences]], [[file:figure-sequence-cv-SegAnnot.pdf][Best case SegAnnot differences]], [[file:figure-sequence-cv-roc.pdf][ROC curves]], [[file:figure-sequence-cv.pdf][Test error rates]].
+** Benchmark data set
+
+The data described in the paper are downloadable here:
+- Data were created using SegAnnDB software, in which expert
+  biologists looked at scatterplots and created labels based on visual
+  interpretation of significant signal/noise
+  patterns. [[http://members.cbio.mines-paristech.fr/~thocking/neuroblastoma/signal.list.annotation.sets.RData][Source data file]], [[https://github.com/tdhock/LabeledFPOP-paper/blob/master/signal.list.annotation.sets.R][R Script to process]].
+- Each CSV file has a sequenceID column which is of the form A.B where
+  A is the profile ID number and B is the chromosome name. This ID can
+  be used to separate the data into different segmentation/changepoint
+  detection problems.
+- [[https://github.com/tdhock/LOPART-paper/raw/master/data-for-LOPART-signals.csv.gz][signals]], from 39 to 43628 observations per data sequence.
+  - data.i column ranges from 1 to the number of data on that
+    sequenceID. (data.i=1 is the first data point in the sequence,
+    data.i=2 is the second, etc.)
+  - logratio column is the raw/noisy data which should be used as
+    input to the segmentation/changepoint detection algorithm.
+- [[https://github.com/tdhock/LOPART-paper/raw/master/data-for-LOPART-labels.csv.gz][labels]], from 2 to 12 per data sequence.
+  - changes column is either 0 or 1 (the number of changepoints which
+    should be predicted in this region label).
+    - A false positive label is when the predicted number of changepoints
+      is greater than the labeled number of changepoints (e.g., 2 changes
+      predicted in either a 0 or 1 label).
+    - A false negative label is when changes=1 and the predicted number
+      of changepoints is 0.
+  - fold column is either 1 or 2 (fold IDs used in
+    cross-validation in the paper).
+  - start/end columns define a region of the corresponding data
+    sequence (in terms of data.i index values) in which the labeled
+    number of changes should be detected.
+
+I recommend computing the label error using the [[https://github.com/tdhock/penaltyLearning][penaltyLearning]] R package; see the
+[[https://github.com/tdhock/change-tutorial][useR
+2017 tutorial materials]].
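
The false positive and false negative rules described above can be sketched in a few lines of Python. This is only an illustration and is not the penaltyLearning implementation: the function names are hypothetical, and the convention for deciding whether a changepoint falls inside a label (start <= t < end, where a changepoint t means a change between data.i=t and data.i=t+1) is an assumption that the README does not specify.

```python
from bisect import bisect_left


def split_sequence_id(sequence_id):
    """Split a sequenceID of the form 'A.B' into (profile ID, chromosome name)."""
    profile_id, _, chromosome = sequence_id.partition(".")
    return profile_id, chromosome


def count_changes_in_label(changepoints, start, end):
    """Count predicted changepoints inside one label region.

    Assumption (not stated in the README): a changepoint t means the
    signal changes between data.i=t and data.i=t+1, and it is counted
    inside the label when start <= t < end.
    """
    pts = sorted(changepoints)
    return bisect_left(pts, end) - bisect_left(pts, start)


def label_error(changepoints, labels):
    """Return (false_positives, false_negatives) for one data sequence.

    labels is an iterable of (start, end, changes) tuples, where
    changes is 0 or 1 as in the labels CSV file.
    """
    false_positives = 0
    false_negatives = 0
    for start, end, changes in labels:
        predicted = count_changes_in_label(changepoints, start, end)
        if predicted > changes:
            # More changes predicted than labeled.
            false_positives += 1
        elif changes == 1 and predicted == 0:
            # A change was labeled but none was predicted.
            false_negatives += 1
    return false_positives, false_negatives
```

For example, with predicted changepoints at 10, 12 and 50, a 1-change label on the region 5 to 15 contains two predicted changes and so counts as a false positive, while a 1-change label on the region 20 to 30 contains none and so counts as a false negative.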