A collection of scripts used to classify Independent Components output by pySEAS as artifact or neural signal.
Publication: Weiser SC, Mullen BR, Ascencio D, Ackman JB (2023) Data-driven segmentation of cortical calcium dynamics. PLOS Computational Biology 19(5): e1011085. https://doi.org/10.1371/journal.pcbi.1011085
pySEAS: https://github.com/ackmanlab/pySEAS
Full datasets (not just the metrics in this repository): Weiser, S, Mullen, BR, Ackman, J (2023), Data-driven segmentation of cortical calcium dynamics, Dryad, Dataset, https://doi.org/10.7291/D1N96W
The metric training set comprised 7 full datasets randomly selected from the 12 available. The metric testing set is what we call the novel testing set: 5 completely new datasets that did not contribute to the model.
./data/training_dataset.tsv
./data/testing_dataset.tsv
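The 7-of-12 random split described above can be sketched as follows; the dataset names and seed here are placeholders, not the actual recordings used in the paper.

```python
import random

# Sketch of the 7-of-12 random train/novel split; dataset names and the
# seed are placeholders, not the actual recordings.
all_datasets = [f"dataset_{i:02d}" for i in range(12)]

rng = random.Random(0)                   # fixed seed for reproducibility
training = rng.sample(all_datasets, 7)   # 7 datasets train the model
novel = [d for d in all_datasets if d not in training]  # remaining 5: novel set
```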
Metrics generated from the control data are also saved in the data directory:
#astrocyte GFP
aGFP_list = ['190613_01-02_600components_ica_metrics.tsv',
'190613_03-04_600components_ica_metrics.tsv',
'190329_02-03_600components_ica_metrics.tsv']
#microglia GFP
mGFP_list = ['171211_01-02_600components_ica_metrics.tsv',
'191111_01-02_600components_ica_metrics.tsv',
'191112_01-02_600components_ica_metrics.tsv']
#C57BL/6 (black 6), non-transgenic mice
WT_list = ['190625_01-02_600components_ica_metrics.tsv',
'190904_01-02_600components_ica_metrics.tsv',
'190625_03-04_600components_ica_metrics.tsv']
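One way to pool these control files by expression group, assuming each TSV loads with pandas; the `group` label and column names below are our own, only the file names come from the lists above.

```python
import pandas as pd

# Build a manifest of the control metrics files, tagged by expression group.
# The 'group' labels/columns are our own convention, not from the pipeline.
control_groups = {
    'aGFP': ['190613_01-02_600components_ica_metrics.tsv',
             '190613_03-04_600components_ica_metrics.tsv',
             '190329_02-03_600components_ica_metrics.tsv'],
    'mGFP': ['171211_01-02_600components_ica_metrics.tsv',
             '191111_01-02_600components_ica_metrics.tsv',
             '191112_01-02_600components_ica_metrics.tsv'],
    'WT':   ['190625_01-02_600components_ica_metrics.tsv',
             '190904_01-02_600components_ica_metrics.tsv',
             '190625_03-04_600components_ica_metrics.tsv'],
}

manifest = pd.DataFrame(
    [{'file': f, 'group': g} for g, files in control_groups.items() for f in files]
)
# To pool the actual metrics, one would read each TSV, e.g.:
# pd.read_csv(path, sep='\t').assign(group=g)
```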
Also included is some of the code used to generate several of the figures in the paper.
Figure4.1_scripts.ipynb/Figure4.2_scripts.ipynb: This is the core code used to generate Figure 4 and several of the supplemental figures.
- How we generated the sICs displayed in the paper
- How we generated the wavelet power-to-noise ratio
- Statistics comparing GCaMP animals and control animals
- Histograms showing distributions of the generated metrics
Figure6.1_ML feature selection.ipynb/Figure6.2_ML assessment.ipynb: This is the core code used to generate Figure 6 and its supplemental figure.
- How we selected important features
- PCA projections and mapping metric values onto PCA space
- Iterative training, while storing accuracy metrics
- Iterative testing on novel data, while storing accuracy metrics
- Performance/confidence of the model vs. how it performed on the novel data
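The PCA projection step listed above might look like the following sketch, using synthetic stand-in values in place of the real per-component metric columns:

```python
import numpy as np
from sklearn.decomposition import PCA

# Sketch of projecting component metrics into PCA space; the input here is
# synthetic (100 components x 8 stand-in metrics), not the real TSV columns.
rng = np.random.default_rng(0)
metrics = rng.normal(size=(100, 8))

# Standardize, then project onto the first two principal components
metrics = (metrics - metrics.mean(axis=0)) / metrics.std(axis=0)
pca = PCA(n_components=2)
projection = pca.fit_transform(metrics)   # shape (100, 2), used for plotting
```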
This code works directly off of the files generated by pySEAS. The code is parallelized to speed up metric generation.
python component_metrics.py -h
optional arguments:
-h, --help show this help message and exit
-i INPUT [INPUT ...], --input INPUT [INPUT ...]
path to the ica that needs characterization or tsv file that needs
class update
-f FPS, --fps FPS frames per second from recordings
-g GROUP_PATH [GROUP_PATH ...], --group_path GROUP_PATH [GROUP_PATH ...]
save path to a file that groups the experiment. If used on experiments
that have already been characterized, this will force the re-
calculation of any already processed data file
-pr PROCESS, --process PROCESS
Number of CPU dedicated to processing; 0 will max out the number of
CPU
-uc UPDATECLASS [UPDATECLASS ...], --updateClass UPDATECLASS [UPDATECLASS ...]
directory to ica.hdf5; put the tsv path into the input argument;
updates the tsv based on ica.hdf5 classification
-fc, --force force re-calculation if not grouped
Each script should be run in the terminal. Here are some common commands that will produce metrics.
Be sure to set the FPS to your recording rate (default 10 Hz). As currently written, frequency-based metrics use the Morlet (ω = 4) wavelet family described in the paper.
python component_metrics.py -i /path/to/hdf5/files/XXXXXXXXX_ica.hdf5
This base command will generate a metrics file named after the input ica file; the output will be saved to /path/to/hdf5/files/XXXXXXXXX_ica_metrics.tsv
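The Morlet (ω = 4) power mentioned above can be illustrated with a minimal numpy implementation; this is a sketch of the general technique, not the exact pySEAS code.

```python
import numpy as np

# Minimal sketch (not the pySEAS implementation) of wavelet power with a
# Morlet (omega0 = 4) mother wavelet, computed by direct convolution.
def morlet_power(signal, freq, fps, omega0=4.0):
    """Wavelet power of `signal` at `freq` Hz for a recording at `fps` frames/s."""
    scale = omega0 * fps / (2 * np.pi * freq)          # scale in samples
    t = np.arange(-4 * scale, 4 * scale + 1)           # wavelet support
    wavelet = np.pi**-0.25 * np.exp(1j * omega0 * t / scale - 0.5 * (t / scale)**2)
    coef = np.convolve(signal, np.conj(wavelet)[::-1], mode='same') / np.sqrt(scale)
    return np.abs(coef)**2

fps = 10.0                                # default recording rate from the text
time = np.arange(0, 60, 1 / fps)
sig = np.sin(2 * np.pi * 0.5 * time)      # 0.5 Hz test oscillation
power_at_half_hz = morlet_power(sig, 0.5, fps).mean()
power_at_3_hz = morlet_power(sig, 3.0, fps).mean()
```

The scale-to-frequency relation used here (scale = ω · fps / 2πf) is the standard Morlet approximation; a matched frequency yields much higher power than an unmatched one.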
python component_metrics.py -i /path/to/hdf5/files/*_ica.hdf5 -g /path/to/save/metrics/in/a/group/GROUP_metrics.tsv
`*` is a wildcard expansion and will find all files ending in `_ica.hdf5`. Metrics for all of these files will be generated and placed into a single group metrics file. Each index is generated based on the name of the `_ica` file.
If the code is re-run on the same file, it will overwrite the previous metrics; this prevents duplication of metrics from a single experiment. Each component has its own unique index saved in the data frame. Outputs will be saved to /path/to/save/metrics/in/a/group/GROUP_metrics.tsv, with an extra column, `anml`, which numbers each animal independently.
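As an illustration (not the exact pySEAS code), a group data frame with a unique per-component index and an `anml` column could be built like this; the experiment names and component counts are examples only:

```python
import pandas as pd

# Illustration of a group metrics frame with a unique per-component index
# and a per-animal 'anml' number; experiment names are examples only.
rows = [
    {'exp': exp, 'component': comp, 'anml': anml}
    for anml, exp in enumerate(['190408_07-08', '190423_03-04'])
    for comp in range(3)                  # pretend each experiment has 3 components
]
group = pd.DataFrame(rows)
# Unique index: experiment name + component number
group.index = group['exp'] + '_' + group['component'].astype(str)
```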
python component_metrics.py -i /path/to/save/metrics/in/a/group/GROUP_metrics.tsv -uc /path/to/hdf5/files/XXXXXXXXX_ica.hdf5
As the code is written, only the random forest classifier is saved; however, the script trains several other models as well.
python ML_classify.py -h
usage: ML_classify.py [-h] [-i INPUT_TSV [INPUT_TSV ...]] [-h5 INPUT_HDF5 [INPUT_HDF5 ...]] [-uc] [-g GROUP_PATH [GROUP_PATH ...]]
[-t] [-fc] [-cf CLASSIFIER [CLASSIFIER ...]] [-p]
optional arguments:
-h, --help show this help message and exit
-i INPUT_TSV [INPUT_TSV ...], --input_tsv INPUT_TSV [INPUT_TSV ...]
path to the .tsv file for classification
-h5 INPUT_HDF5 [INPUT_HDF5 ...], --input_hdf5 INPUT_HDF5 [INPUT_HDF5 ...]
path to the .hdf5 file for updating artifact classifications
-uc, --updateClass updates the ica.hdf5 based on the classifier model and metrics .tsv file; requires tsv and hdf5 inputs
-g GROUP_PATH [GROUP_PATH ...], --group_path GROUP_PATH [GROUP_PATH ...]
save path to a file that groups the experiment. If used on experiments that have already been characterized,
this will force the re-calculation of any already processed data file
-t, --train train the classifier on the newest class_metric data frame
-fc, --force force re-calculation
-cf CLASSIFIER [CLASSIFIER ...], --classifier CLASSIFIER [CLASSIFIER ...]
path to the classifier.hdf5 file
-p, --plot Visualize training outcome
First, you will need to train a classifier. To do this, use the metrics generated previously:
python ML_classify.py -i ./data/training_dataset.tsv -cf ../classifier/test_classifier.hdf5 -t -p
Note: These metrics are dependent on experimental conditions and recording equipment. If you are using your own data, you will need to validate the classifier on it to ensure proper classification.
With the plot flag `-p`, two plots are displayed:
- Each component along the x-axis, across various experiments (labeled on top of graph). Each component has a confidence of either artifact or neural classification. False positives and false negatives are identified with vertical lines.
- ROC plot used in Supplemental Figure 7 (PLOS, https://doi.org/10.1371/journal.pcbi.1011085)
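The ROC computation behind that plot can be sketched with scikit-learn on synthetic labels and confidences; the real inputs are held-out human labels and classifier scores.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

# Sketch of an ROC curve on synthetic data; the real inputs are held-out
# labels and classifier confidences.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)             # 1 = neural, 0 = artifact
scores = labels * 0.6 + rng.random(200) * 0.5     # scores correlated with labels

fpr, tpr, thresholds = roc_curve(labels, scores)  # one point per threshold
roc_auc = auc(fpr, tpr)
```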
The training will output the accuracy of the various trained models: score, precision, and recall are all displayed, along with neural vs. artifact accuracy:
score precision recall neural acc. artifact acc.
classifier
LogisticRegression 0.95 0.96 0.97 0.97 0.89
GaussianNB 0.93 0.97 0.94 0.94 0.92
SVM 0.96 0.97 0.97 0.97 0.93
RandomForest 0.96 0.97 0.98 0.98 0.92
Voting 0.96 0.97 0.98 0.98 0.92
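A sketch of this kind of model comparison with scikit-learn, using a synthetic two-class problem in place of the real metric features (the numbers will not match the table above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic two-class stand-in for the real component metrics
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'GaussianNB': GaussianNB(),
    'SVM': SVC(),
    'RandomForest': RandomForestClassifier(random_state=0),
}
# Hard-voting ensemble over the individual models, as in the 'Voting' row
models['Voting'] = VotingClassifier(list(models.items()))

scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
```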
You can load several data frames for training (the following will load both the training and novel datasets):
python ML_classify.py -i ./data/*_dataset.tsv -cf ../classifier/test_classifier.hdf5 -t
Once you have a trained classifier, you can test the classifier on novel data (do not test your classifier on any data used to train the model):
python ML_classify.py -i ./data/novel_dataset.tsv -cf ../classifier/test_classifier.hdf5 -uc
The `-uc` flag will create a new column in the data frame marking machine-identified neural components: 'm_neural'. Accuracy comparisons will print in the terminal window.
Accuracy comparing human to machine classification: 96.03 %
HDF5 will not be updated. File was not found or specified.
Saving to file: ./data/novel_dataset.tsv
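The reported accuracy is a simple agreement rate between human and machine labels. Assuming the human labels live in a `neural` column alongside the machine's `m_neural` (the human column name is our assumption; the data below is a toy stand-in), it can be computed as:

```python
import pandas as pd

# Toy stand-in frame; 'neural' (human) column name is an assumption,
# 'm_neural' (machine) comes from the text above.
df = pd.DataFrame({
    'neural':   [1, 1, 0, 0, 1, 0, 1, 1],
    'm_neural': [1, 1, 0, 1, 1, 0, 1, 1],
})
accuracy = (df['neural'] == df['m_neural']).mean() * 100
print(f'Accuracy comparing human to machine classification: {accuracy:.2f} %')
```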
If you give the `-uc` flag and `INPUT_HDF5`, the script will update the `ICA.hdf5` files with the artifact classification. Pass a single file, a list, or a `*` wildcard expansion. Each component identity is based on the unique index generated in the metrics file.
python ML_classify.py -i ./data/novel_dataset.tsv -h5 path/to/*ICA.hdf5 -cf ../classifier/test_classifier.hdf5 -uc
The terminal will tell you which files are being updated:
Accuracy comparing human to machine classification: 96.27 %
Saving artifact_component to ../../Desktop/hdf5test/190408_07-08_ica.hdf5
Saving artifact_component to ../../Desktop/hdf5test/190423_03-04_ica.hdf5
Saving to file: ./data/novel_dataset.tsv