A collection of scripts used to classify Independent Components output by pySEAS as artifact or neural signal.
Publication: Weiser SC, Mullen BR, Ascencio D, Ackman JB (2023) Data-driven segmentation of cortical calcium dynamics. PLOS Computational Biology 19(5): e1011085. https://doi.org/10.1371/journal.pcbi.1011085
pySEAS: https://github.com/ackmanlab/pySEAS
Full datasets (not just the metrics in this repository): Weiser, S, Mullen, BR, Ackman, J (2023), Data-driven segmentation of cortical calcium dynamics, Dryad, Dataset, https://doi.org/10.7291/D1N96W
The metric training set comprised 7 full datasets randomly selected from the 12 available. The metric testing set is what we call the novel testing set: 5 completely new datasets that did not contribute to the model.
./data/training_dataset.tsv
./data/testing_dataset.tsv
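The 7-of-12 random split described above can be sketched as follows; the dataset names and seed here are placeholders, not the actual recordings used in the paper.

```python
import random

# Sketch of the 7-of-12 random train/novel split; dataset names and the
# seed are placeholders, not the actual recordings.
all_datasets = [f"dataset_{i:02d}" for i in range(12)]

rng = random.Random(0)                   # fixed seed for reproducibility
training = rng.sample(all_datasets, 7)   # 7 datasets train the model
novel = [d for d in all_datasets if d not in training]  # remaining 5: novel set
```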
Metrics generated from the control data are also saved in the data directory:
#astrocyte GFP
aGFP_list = ['190613_01-02_600components_ica_metrics.tsv',
'190613_03-04_600components_ica_metrics.tsv',
'190329_02-03_600components_ica_metrics.tsv']
#microglia GFP
mGFP_list = ['171211_01-02_600components_ica_metrics.tsv',
'191111_01-02_600components_ica_metrics.tsv',
'191112_01-02_600components_ica_metrics.tsv']
#C57BL/6 (black 6), non-transgenic mice
WT_list = ['190625_01-02_600components_ica_metrics.tsv',
'190904_01-02_600components_ica_metrics.tsv',
'190625_03-04_600components_ica_metrics.tsv']
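One way to pool these control files by expression group, assuming each TSV loads with pandas; the `group` label and column names below are our own, only the file names come from the lists above.

```python
import pandas as pd

# Build a manifest of the control metrics files, tagged by expression group.
# The 'group' labels/columns are our own convention, not from the pipeline.
control_groups = {
    'aGFP': ['190613_01-02_600components_ica_metrics.tsv',
             '190613_03-04_600components_ica_metrics.tsv',
             '190329_02-03_600components_ica_metrics.tsv'],
    'mGFP': ['171211_01-02_600components_ica_metrics.tsv',
             '191111_01-02_600components_ica_metrics.tsv',
             '191112_01-02_600components_ica_metrics.tsv'],
    'WT':   ['190625_01-02_600components_ica_metrics.tsv',
             '190904_01-02_600components_ica_metrics.tsv',
             '190625_03-04_600components_ica_metrics.tsv'],
}

manifest = pd.DataFrame(
    [{'file': f, 'group': g} for g, files in control_groups.items() for f in files]
)
# To pool the actual metrics, one would read each TSV, e.g.:
# pd.read_csv(path, sep='\t').assign(group=g)
```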
Also included is some of the code used to generate several of the figures in the paper.
Figure4.1_scripts.ipynb/Figure4.2_scripts.ipynb: This is the core code used to generate Figure 4 and several of the supplemental figures.
- How we generated the sICs displayed in the paper
- How we generated the wavelet power-to-noise ratio
- Statistics comparing GCaMP animals and control animals
- Histograms showing distributions of the generated metrics
Figure6.1_ML feature selection.ipynb/Figure6.2_ML assessment.ipynb: This is the core code used to generate Figure 6 and its supplemental figure.
- How we selected important features
- PCA projections and mapping metric values onto PCA space
- Iterative training, while storing accuracy metrics
- Iterative testing on novel data, while storing accuracy metrics
- Performance/confidence of the model vs. how it performed on the novel data
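The PCA projection step listed above might look like the following sketch, using synthetic stand-in values in place of the real per-component metric columns:

```python
import numpy as np
from sklearn.decomposition import PCA

# Sketch of projecting component metrics into PCA space; the input here is
# synthetic (100 components x 8 stand-in metrics), not the real TSV columns.
rng = np.random.default_rng(0)
metrics = rng.normal(size=(100, 8))

# Standardize, then project onto the first two principal components
metrics = (metrics - metrics.mean(axis=0)) / metrics.std(axis=0)
pca = PCA(n_components=2)
projection = pca.fit_transform(metrics)   # shape (100, 2), used for plotting
```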
This code works directly off of the files generated by pySEAS. The code is parallelized to speed up metric generation.
python component_metrics.py -h
optional arguments:
-h, --help show this help message and exit
-i INPUT [INPUT ...], --input INPUT [INPUT ...]
path to the ica that needs characterization or tsv file that needs
class update
-f FPS, --fps FPS frames per second from recordings
-g GROUP_PATH [GROUP_PATH ...], --group_path GROUP_PATH [GROUP_PATH ...]
save path to a file that groups the experiment. If used on experiments
that have already been characterized, this will force the re-
calculation of any already processed data file
-pr PROCESS, --process PROCESS
Number of CPU dedicated to processing; 0 will max out the number of
CPU
-uc UPDATECLASS [UPDATECLASS ...], --updateClass UPDATECLASS [UPDATECLASS ...]
directory to ica.hdf5; put the tsv path into the input argument;
updates the tsv based on ica.hdf5 classification
-fc, --force force re-calculation if not grouped
Each script should be run in the terminal. Here are some common commands that will produce metrics.
Be sure to set the FPS to your recording rate (default 10 Hz). As currently written, frequency-based metrics use the Morlet (ω = 4) wavelet family described in the paper.
python component_metrics.py -i /path/to/hdf5/files/XXXXXXXXX_ica.hdf5
This base command will generate a metrics file named after the input ica file; the output will be saved to /path/to/hdf5/files/XXXXXXXXX_ica_metrics.tsv
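The Morlet (ω = 4) power mentioned above can be illustrated with a minimal numpy implementation; this is a sketch of the general technique, not the exact pySEAS code.

```python
import numpy as np

# Minimal sketch (not the pySEAS implementation) of wavelet power with a
# Morlet (omega0 = 4) mother wavelet, computed by direct convolution.
def morlet_power(signal, freq, fps, omega0=4.0):
    """Wavelet power of `signal` at `freq` Hz for a recording at `fps` frames/s."""
    scale = omega0 * fps / (2 * np.pi * freq)          # scale in samples
    t = np.arange(-4 * scale, 4 * scale + 1)           # wavelet support
    wavelet = np.pi**-0.25 * np.exp(1j * omega0 * t / scale - 0.5 * (t / scale)**2)
    coef = np.convolve(signal, np.conj(wavelet)[::-1], mode='same') / np.sqrt(scale)
    return np.abs(coef)**2

fps = 10.0                                # default recording rate from the text
time = np.arange(0, 60, 1 / fps)
sig = np.sin(2 * np.pi * 0.5 * time)      # 0.5 Hz test oscillation
power_at_half_hz = morlet_power(sig, 0.5, fps).mean()
power_at_3_hz = morlet_power(sig, 3.0, fps).mean()
```

The scale-to-frequency relation used here (scale = ω · fps / 2πf) is the standard Morlet approximation; a matched frequency yields much higher power than an unmatched one.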
python component_metrics.py -i /path/to/hdf5/files/*_ica.hdf5 -g /path/to/save/metrics/in/a/group/GROUP_metrics.tsv
`*` is a wildcard expansion and will find all files ending in `_ica.hdf5`. Metrics for all of these files will be generated and placed into a single group metrics file. Each index is generated based on the name of the `_ica` file.
If the code is re-run on the same file, it will overwrite the previous metrics; this prevents duplication of metrics from a single experiment. Each component has its own unique index saved in the data frame. Outputs will be saved to /path/to/save/metrics/in/a/group/GROUP_metrics.tsv, with an extra column, `anml`, which numbers each animal independently.
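As an illustration (not the exact pySEAS code), a group data frame with a unique per-component index and an `anml` column could be built like this; the experiment names and component counts are examples only:

```python
import pandas as pd

# Illustration of a group metrics frame with a unique per-component index
# and a per-animal 'anml' number; experiment names are examples only.
rows = [
    {'exp': exp, 'component': comp, 'anml': anml}
    for anml, exp in enumerate(['190408_07-08', '190423_03-04'])
    for comp in range(3)                  # pretend each experiment has 3 components
]
group = pd.DataFrame(rows)
# Unique index: experiment name + component number
group.index = group['exp'] + '_' + group['component'].astype(str)
```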
python component_metrics.py -i /path/to/save/metrics/in/a/group/GROUP_metrics.tsv -uc /path/to/hdf5/files/XXXXXXXXX_ica.hdf5
As the code is written, only the random forest classifier is saved; however, the script trains several other models as well.
python ML_classify.py -h
usage: ML_classify.py [-h] [-i INPUT_TSV [INPUT_TSV ...]] [-h5 INPUT_HDF5 [INPUT_HDF5 ...]] [-uc] [-g GROUP_PATH [GROUP_PATH ...]]
[-t] [-fc] [-cf CLASSIFIER [CLASSIFIER ...]] [-p]
optional arguments:
-h, --help show this help message and exit
-i INPUT_TSV [INPUT_TSV ...], --input_tsv INPUT_TSV [INPUT_TSV ...]
path to the .tsv file for classification
-h5 INPUT_HDF5 [INPUT_HDF5 ...], --input_hdf5 INPUT_HDF5 [INPUT_HDF5 ...]
path to the .hdf5 file for updating artifact classifications
-uc, --updateClass updates the ica.hdf5 based on the classifier model and metrics .tsv file; requires tsv and hdf5 inputs
-g GROUP_PATH [GROUP_PATH ...], --group_path GROUP_PATH [GROUP_PATH ...]
save path to a file that groups the experiment. If used on experiments that have already been characterized,
this will force the re-calculation of any already processed data file
-t, --train train the classifier on the newest class_metric data frame
-fc, --force force re-calculation
-cf CLASSIFIER [CLASSIFIER ...], --classifier CLASSIFIER [CLASSIFIER ...]
path to the classifier.hdf5 file
-p, --plot Visualize training outcome
First, you will need to train a classifier. To do this, use the metrics generated previously:
python ML_classify.py -i ./data/training_dataset.tsv -cf ../classifier/test_classifier.hdf5 -t -p
Note: These metrics are dependent on experimental conditions and recording equipment. If you are using your own data, you will need to validate the classifier on it to ensure proper classification.
With the plot flag `-p`, two plots are displayed:
- Each component along the x-axis, across various experiments (labeled on top of graph). Each component has a confidence of either artifact or neural classification. False positives and false negatives are identified with vertical lines.
- ROC plot used in Supplemental Figure 7 (PLOS, https://doi.org/10.1371/journal.pcbi.1011085)
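The ROC computation behind that plot can be sketched with scikit-learn on synthetic labels and confidences; the real inputs are held-out human labels and classifier scores.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

# Sketch of an ROC curve on synthetic data; the real inputs are held-out
# labels and classifier confidences.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)             # 1 = neural, 0 = artifact
scores = labels * 0.6 + rng.random(200) * 0.5     # scores correlated with labels

fpr, tpr, thresholds = roc_curve(labels, scores)  # one point per threshold
roc_auc = auc(fpr, tpr)
```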
The training will output the accuracy of the various trained models: score, precision, and recall are all displayed, along with neural vs. artifact accuracy:
score precision recall neural acc. artifact acc.
classifier
LogisticRegression 0.95 0.96 0.97 0.97 0.89
GaussianNB 0.93 0.97 0.94 0.94 0.92
SVM 0.96 0.97 0.97 0.97 0.93
RandomForest 0.96 0.97 0.98 0.98 0.92
Voting 0.96 0.97 0.98 0.98 0.92
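A sketch of this kind of model comparison with scikit-learn, using a synthetic two-class problem in place of the real metric features (the numbers will not match the table above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic two-class stand-in for the real component metrics
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'GaussianNB': GaussianNB(),
    'SVM': SVC(),
    'RandomForest': RandomForestClassifier(random_state=0),
}
# Hard-voting ensemble over the individual models, as in the 'Voting' row
models['Voting'] = VotingClassifier(list(models.items()))

scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
```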
You can load several data frames for training (the following will load both the training and novel datasets):
python ML_classify.py -i ./data/*_dataset.tsv -cf ../classifier/test_classifier.hdf5 -t
Once you have a trained classifier, you can test the classifier on novel data (do not test your classifier on any data used to train the model):
python ML_classify.py -i ./data/novel_dataset.tsv -cf ../classifier/test_classifier.hdf5 -uc
The `-uc` flag will create a new column in the data frame marking machine-identified neural components: 'm_neural'. Accuracy comparisons will print in the terminal window.
Accuracy comparing human to machine classification: 96.03 %
HDF5 will not be updated. File was not found or specified.
Saving to file: ./data/novel_dataset.tsv
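The reported accuracy is a simple agreement rate between human and machine labels. Assuming the human labels live in a `neural` column alongside the machine's `m_neural` (the human column name is our assumption; the data below is a toy stand-in), it can be computed as:

```python
import pandas as pd

# Toy stand-in frame; 'neural' (human) column name is an assumption,
# 'm_neural' (machine) comes from the text above.
df = pd.DataFrame({
    'neural':   [1, 1, 0, 0, 1, 0, 1, 1],
    'm_neural': [1, 1, 0, 1, 1, 0, 1, 1],
})
accuracy = (df['neural'] == df['m_neural']).mean() * 100
print(f'Accuracy comparing human to machine classification: {accuracy:.2f} %')
```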
If you give the `-uc` flag and `INPUT_HDF5`, the script will update the `ICA.hdf5` files with the artifact classification. Pass a single file, a list, or a `*` wildcard expansion. Each component identity is based on the unique index generated in the metrics file.
python ML_classify.py -i ./data/novel_dataset.tsv -h5 path/to/*ICA.hdf5 -cf ../classifier/test_classifier.hdf5 -uc
The terminal will tell you which files are being updated:
Accuracy comparing human to machine classification: 96.27 %
Saving artifact_component to ../../Desktop/hdf5test/190408_07-08_ica.hdf5
Saving artifact_component to ../../Desktop/hdf5test/190423_03-04_ica.hdf5
Saving to file: ./data/novel_dataset.tsv