This repository was archived by the owner on Sep 11, 2023. It is now read-only.
Estimation Data types
Frank Noe edited this page Oct 31, 2015 · 1 revision
Some data processing steps are currently inefficient in memory usage, CPU usage, or both. Possible improvements:
- Allow sparse matrices as input
- Allow different data types, e.g. boolean arrays or bitarrays for contact maps
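To make the memory argument concrete, here is a minimal sketch comparing the footprint of the same contact-map data stored as float64, as boolean, and as a packed bit array. The array shape and the contact probability are made up for illustration, and NumPy's `packbits` stands in for a bitarray type.

```python
import numpy as np

# Hypothetical contact-map trajectory: 1000 frames, 400 binary contact features.
rng = np.random.default_rng(0)
contacts_bool = rng.random((1000, 400)) < 0.1        # dtype=bool, 1 byte per entry

contacts_float = contacts_bool.astype(np.float64)    # 8 bytes per entry
contacts_bits = np.packbits(contacts_bool, axis=1)   # 1 bit per entry, stored as uint8

print(contacts_float.nbytes)  # 3200000
print(contacts_bool.nbytes)   # 400000
print(contacts_bits.nbytes)   # 50000

# Packing is lossless: unpacking recovers the original boolean array.
restored = np.unpackbits(contacts_bits, axis=1)[:, :400].astype(bool)
assert np.array_equal(restored, contacts_bool)
```

In this example the boolean array is 8 times smaller than the float64 array, and the packed representation is 64 times smaller.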
The question arises how we can still keep the data processing pipeline general.
- Build specialized low-level estimators for specific data types, e.g. covariance estimators for integer and sparse boolean data. (A simple one-pass algorithm is robust for integral data, and a C implementation can deal efficiently with 1's and 0's.)
- A high-level estimator (e.g. TICA) encapsulates multiple data types, e.g. float/int.
- There is a fallback implementation for cases where no specialized low-level algorithm is implemented. For example, a boolean array can be cast to a float array containing 0.0 and 1.0, a sparse data chunk can be copied into a dense data chunk, etc.
- Clustering output is integer, and MSM/HMSM input is integer.
- How can these integer data types be included in a data processing pipeline?
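The layering described above (specialized low-level estimators, a high-level entry point, and a cast-to-float fallback) could be sketched as follows. This is only an illustration under stated assumptions, not the library's implementation: the names `covariance`, `_cov_bool`, `_cov_float` and the `_SPECIALIZED` registry are hypothetical, and plain NumPy stands in for the C implementation mentioned above.

```python
import numpy as np


def _cov_float(X):
    """Generic fallback: cast to float64 and compute the mean-free covariance."""
    X = np.asarray(X, dtype=np.float64)
    Xc = X - X.mean(axis=0)
    return Xc.T @ Xc / X.shape[0]


def _cov_bool(X):
    """Specialized path for boolean data: one pass over exact integer counts."""
    n = X.shape[0]
    # A C implementation could count bits directly; this sketch copies to int64.
    Xi = X.astype(np.int64)
    counts = Xi.sum(axis=0)      # number of 1's per feature
    co_counts = Xi.T @ Xi        # pairwise co-occurrence counts, exact in integers
    mean = counts / n
    return co_counts / n - np.outer(mean, mean)


# Hypothetical registry: NumPy dtype kind -> specialized low-level estimator.
_SPECIALIZED = {"b": _cov_bool}


def covariance(X):
    """High-level entry point: dispatch on dtype, otherwise use the fallback."""
    X = np.asarray(X)
    impl = _SPECIALIZED.get(X.dtype.kind)
    if impl is not None:
        return impl(X)
    return _cov_float(X)         # e.g. integer input is cast to float here


# Usage: the specialized and fallback paths agree on the same boolean data.
rng = np.random.default_rng(1)
X_bool = rng.random((500, 6)) < 0.3
C_special = covariance(X_bool)
C_fallback = covariance(X_bool.astype(np.float64))
assert np.allclose(C_special, C_fallback)
```

Dispatching on `dtype.kind` keeps the caller's interface unchanged: a new specialized estimator only needs a registry entry, and any type without one still works through the float fallback.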