-
Notifications
You must be signed in to change notification settings - Fork 1
Querying Data
Module for data query, checking for file existence, and generating dataframe with images to process. This repo is designed to be generalizable to different datasets.
The structure of your data should generally adhere to the following structure:
└── experiment
├── plate1
│ ├── well1_site1_ch1.png
│ ├── well1_site2_ch1.png
│ ├── well2_site1_ch1.png
│ └── well2_site2_ch1.png
└── plate2
├── well1_site1_ch1.png
├── well1_site2_ch1.png
├── well2_site1_ch1.png
└── well2_site2_ch1.png
with a main "experiment" directory containing a folder for each plate's images you wish to analyze.
There can be any number of subfolders between a plate folder (e.g. experiment/plate1/) and its images, as long as each plate's images follows the same structure with the same subfolder names.
A correct example:
└── experiment
├── plate1
│ └── images
│ ├── well1_site1_ch1.png
│ ├── well1_site2_ch1.png
│ ├── well2_site1_ch1.png
│ └── well2_site2_ch1.png
└── plate2
└── images
├── well1_site1_ch1.png
├── well1_site2_ch1.png
├── well2_site1_ch1.png
└── well2_site2_ch1.png
An incorrect example:
└── experiment
├── plate1
│ └── images
│ ├── well1_site1_ch1.png
│ ├── well1_site2_ch1.png
│ ├── well2_site1_ch1.png
│ └── well2_site2_ch1.png
└── plate2
└── imgs <--- different name
├── well1_site1_ch1.png
├── well1_site2_ch1.png
├── well2_site1_ch1.png
└── well2_site2_ch1.png
Parameters:
-
exp_folder: main directory with a subdirectory for each plate you would like to get images for -
pattern: arrangement of info in filenames to parse, such aswell,site,channel, etc. (Explained in detail below) -
plates: ids of desired plates (each must haveplate_identifiersthat come immediately before and after) -
plate_identifiers: two substrings that immediately precede and succeed each plate inplates -
exts: file extensions of your images (without a period) (e.g.,'png','tiff','jpg')
Say you have data structured like this:
└── HUVEC-1
├── Plate1
│ ├── AB08_s1_w1.png
│ ├── AB08_s2_w1.png
│ ├── AB08_s3_w1.png
│ ├── AB08_s4_w1.png
│ ├── L18_s1_w1.png
│ ├── L18_s2_w1.png
│ ├── L18_s3_w1.png
│ └── L18_s4_w1.png
└── Plate2
├── AB08_s1_w1.png
├── AB08_s2_w1.png
├── AB08_s3_w1.png
├── AB08_s4_w1.png
├── L18_s1_w1.png
├── L18_s2_w1.png
├── L18_s3_w1.png
└── L18_s4_w1.png
as is the case for the RxRx2 test images in the tests/sample_data/rxrx2 directory of this repo.
Consider the image path:
'/path/to/HUVEC-1/Plate1/AB08_s1_w1.png'
The exp_folder parameter in this case would be '/path/to/HUVEC-1/' and plate_identifiers could be something like ['Plate',''].
The pattern parameter is meant to help with parsing relevant metadata encoded in filenames of imaging datasets, like:
- well
- site/field
- channel
- plane
for the data above, the filenames have a pattern of '<well>_<site>_<channel>.<ext> (where <ext> will substituted for each of the extensions list in the exts parameter)
Using this function to parse this sample data will look like this:
files_df = query_data_updated(exp_folder='/path/to/HUVEC-1/', plate_identifiers=['Plate',''],
pattern='<well>_<site>_<channel>.<ext>,exts=['png'])
where files_df will be a pandas DataFrame with this structure:
plate well site channel plane filename file_path
0 Plate1 AB08 s1 w1 plane01 AB08_s1_w1.png /path/to/HUVEC-1/Plate1/AB08_s1_w1.png
1 Plate1 AB08 s2 w1 plane01 AB08_s2_w1.png /path/to/HUVEC-1/Plate1/AB08_s2_w1.png
2 Plate1 AB08 s3 w1 plane01 AB08_s3_w1.png /path/to/HUVEC-1/Plate1/AB08_s3_w1.png
3 Plate1 AB08 s4 w1 plane01 AB08_s4_w1.png /path/to/HUVEC-1/Plate1/AB08_s4_w1.png
4 Plate1 L18 s1 w1 plane01 L18_s1_w1.png /path/to/HUVEC-1/Plate1/L18_s1_w1.png
5 Plate1 L18 s2 w1 plane01 L18_s2_w1.png /path/to/HUVEC-1/Plate1/L18_s2_w1.png
6 Plate1 L18 s3 w1 plane01 L18_s3_w1.png /path/to/HUVEC-1/Plate1/L18_s3_w1.png
7 Plate1 L18 s4 w1 plane01 L18_s4_w1.png /path/to/HUVEC-1/Plate1/L18_s4_w1.png
8 Plate2 AB08 s1 w1 plane01 AB08_s1_w1.png /path/to/HUVEC-1/Plate2/AB08_s1_w1.png
9 Plate2 AB08 s2 w1 plane01 AB08_s2_w1.png /path/to/HUVEC-1/Plate2/AB08_s2_w1.png
10 Plate2 AB08 s3 w1 plane01 AB08_s3_w1.png /path/to/HUVEC-1/Plate2/AB08_s3_w1.png
11 Plate2 AB08 s4 w1 plane01 AB08_s4_w1.png /path/to/HUVEC-1/Plate2/AB08_s4_w1.png
12 Plate2 L18 s1 w1 plane01 L18_s1_w1.png /path/to/HUVEC-1/Plate2/L18_s1_w1.png
13 Plate2 L18 s2 w1 plane01 L18_s2_w1.png /path/to/HUVEC-1/Plate2/L18_s2_w1.png
14 Plate2 L18 s3 w1 plane01 L18_s3_w1.png /path/to/HUVEC-1/Plate2/L18_s3_w1.png
15 Plate2 L18 s4 w1 plane01 L18_s4_w1.png /path/to/HUVEC-1/Plate2/L18_s4_w1.png
- Do not include any leading or following forward slashes (
/) in thepattern,plates, orplate_identifiersparameters - You must include field character lengths in a
patternif consecutive metadata fields are not separated by any characters (e.g.<well(6)><site(3)><plane(3)>-<channel(3)>.<ext>is the pattern for Opera Phenix imagesr11c06f23p02-ch1sk1fk1fl1.tiff) - A
plate_identifierof''indicates the beginning or end of plate directory name
In cases where you have consecutive fields that you would like separate, you can specify the number of characters in the pattern string.
Images from the Opera Phenix line of high-content imaging systems typically have filenames that follow this format:
r11c06f23p02-ch1sk1fk1fl1.tiff
and the corresponding pattern to parse these filenames would be:
<well(6)><site(3)><plane(3)>-<channel(3)>.<ext>
where <well(6)><site(3)> indicates that the well is 6 characters long, the site is 3 characters long after the well, and so on.
Varying the plate_identifiers and plates parameters:
files_df = query_functions_local.query_data('tests/sample_data/rxrx2/images/HUVEC-1',
pattern='<well>_<site>_<channel>.<ext>',
plate_identifiers=['',''],
plates=['Plate1'],
exts=['png'])
gives a DataFrame with the structure
plate well site channel plane filename plate_folder file_path
0 Plate1 AB08 s1 w1 plane01 AB08_s1_w1.png Plate1 tests/sample_data/rxrx2/images/HUVEC-1/Plate1/AB08_s1_w1.png
1 Plate1 AB08 s2 w1 plane01 AB08_s2_w1.png Plate1 tests/sample_data/rxrx2/images/HUVEC-1/Plate1/AB08_s2_w1.png
2 Plate1 AB08 s3 w1 plane01 AB08_s3_w1.png Plate1 tests/sample_data/rxrx2/images/HUVEC-1/Plate1/AB08_s3_w1.png
3 Plate1 AB08 s4 w1 plane01 AB08_s4_w1.png Plate1 tests/sample_data/rxrx2/images/HUVEC-1/Plate1/AB08_s4_w1.png
4 Plate1 L18 s1 w1 plane01 L18_s1_w1.png Plate1 tests/sample_data/rxrx2/images/HUVEC-1/Plate1/L18_s1_w1.png
5 Plate1 L18 s2 w1 plane01 L18_s2_w1.png Plate1 tests/sample_data/rxrx2/images/HUVEC-1/Plate1/L18_s2_w1.png
6 Plate1 L18 s3 w1 plane01 L18_s3_w1.png Plate1 tests/sample_data/rxrx2/images/HUVEC-1/Plate1/L18_s3_w1.png
7 Plate1 L18 s4 w1 plane01 L18_s4_w1.png Plate1 tests/sample_data/rxrx2/images/HUVEC-1/Plate1/L18_s4_w1.png
Setting plates=[1] and plate_identifiers=['Plate',''] results in:
plate well site channel plane filename plate_folder file_path
0 1 AB08 s1 w1 plane01 AB08_s1_w1.png Plate1 tests/sample_data/rxrx2/images/HUVEC-1/Plate1/AB08_s1_w1.png
1 1 AB08 s2 w1 plane01 AB08_s2_w1.png Plate1 tests/sample_data/rxrx2/images/HUVEC-1/Plate1/AB08_s2_w1.png
2 1 AB08 s3 w1 plane01 AB08_s3_w1.png Plate1 tests/sample_data/rxrx2/images/HUVEC-1/Plate1/AB08_s3_w1.png
3 1 AB08 s4 w1 plane01 AB08_s4_w1.png Plate1 tests/sample_data/rxrx2/images/HUVEC-1/Plate1/AB08_s4_w1.png
4 1 L18 s1 w1 plane01 L18_s1_w1.png Plate1 tests/sample_data/rxrx2/images/HUVEC-1/Plate1/L18_s1_w1.png
5 1 L18 s2 w1 plane01 L18_s2_w1.png Plate1 tests/sample_data/rxrx2/images/HUVEC-1/Plate1/L18_s2_w1.png
6 1 L18 s3 w1 plane01 L18_s3_w1.png Plate1 tests/sample_data/rxrx2/images/HUVEC-1/Plate1/L18_s3_w1.png
7 1 L18 s4 w1 plane01 L18_s4_w1.png Plate1 tests/sample_data/rxrx2/images/HUVEC-1/Plate1/L18_s4_w1.png
Here is an example Phenix image directory structure for an experiment called 'EXP0001':
├── EXP0001
│ ├── EXP0001_CCU384_401__2024-02-17T22_04_34-Measurement 19
│ │ └── Images
│ │ ├── r03c07f01p01-ch1sk1fk1fl1.tiff
│ │ ├── r03c07f01p01-ch2sk1fk1fl1.tiff
│ │ ├── r06c17f01p01-ch1sk1fk1fl1.tiff
│ │ └── r06c17f01p01-ch2sk1fk1fl1.tiff
│ ├── EXP0001_CCU384_402__2024-02-13T22_52_28-Measurement 19
│ │ └── Images
│ │ ├── r03c07f01p01-ch1sk1fk1fl1.tiff
│ │ ├── r03c07f01p01-ch2sk1fk1fl1.tiff
│ │ ├── r06c17f01p01-ch1sk1fk1fl1.tiff
│ │ └── r06c17f01p01-ch2sk1fk1fl1.tiff
│ ├── EXP0001_CV384W_201__2024-06-15T05_02_40-Measurement 1
│ │ └── Images
│ │ ├── r03c07f01p01-ch1sk1fk1fl1.tiff
│ │ ├── r03c07f01p01-ch2sk1fk1fl1.tiff
│ │ ├── r06c17f01p01-ch1sk1fk1fl1.tiff
│ │ └── r06c17f01p01-ch2sk1fk1fl1.tiff
│ └── EXP0001_CV384W_202__2024-06-15T05_09_49-Measurement 1
│ └── Images
│ ├── r03c07f01p01-ch1sk1fk1fl1.tiff
│ ├── r03c07f01p01-ch2sk1fk1fl1.tiff
│ ├── r06c17f01p01-ch1sk1fk1fl1.tiff
│ └── r06c17f01p01-ch2sk1fk1fl1.tiff
To get the metadata for the plates numbered 201 and 401, we use these parameters:
files_df = query_functions_local.query_data('/home/jeffdatasci/test_imaging_data/EXP0001',
pattern='Images/<Well(6)><Site(3)><Plane(3)>-<Channel(3)>.<ext>',
plates=['201','401'],
plate_identifiers=['_','_'],
exts=['tiff'])
NOTE:
The metadata fields include character lengths (e.g., <Well(6)>) because metadata for well, site, etc are not separated by any other characters. Also, pattern includes the Images subdirectory of each plate directory.
This function call results in a DataFrame of the structure:
plate well site channel plane filename \
0 401 r03c07 f01 ch1 p01 r03c07f01p01-ch1sk1fk1fl1.tiff
1 401 r03c07 f01 ch2 p01 r03c07f01p01-ch2sk1fk1fl1.tiff
2 401 r06c17 f01 ch1 p01 r06c17f01p01-ch1sk1fk1fl1.tiff
3 401 r06c17 f01 ch2 p01 r06c17f01p01-ch2sk1fk1fl1.tiff
4 201 r03c07 f01 ch1 p01 r03c07f01p01-ch1sk1fk1fl1.tiff
5 201 r03c07 f01 ch2 p01 r03c07f01p01-ch2sk1fk1fl1.tiff
6 201 r06c17 f01 ch1 p01 r06c17f01p01-ch1sk1fk1fl1.tiff
7 201 r06c17 f01 ch2 p01 r06c17f01p01-ch2sk1fk1fl1.tiff
plate_folder \
0 EXP0001_CCU384_401__2024-02-17T22_04_34-Measurement 19
1 EXP0001_CCU384_401__2024-02-17T22_04_34-Measurement 19
2 EXP0001_CCU384_401__2024-02-17T22_04_34-Measurement 19
3 EXP0001_CCU384_401__2024-02-17T22_04_34-Measurement 19
4 EXP0001_CV384W_201__2024-06-15T05_02_40-Measurement 1
5 EXP0001_CV384W_201__2024-06-15T05_02_40-Measurement 1
6 EXP0001_CV384W_201__2024-06-15T05_02_40-Measurement 1
7 EXP0001_CV384W_201__2024-06-15T05_02_40-Measurement 1
file_path
0 EXP0001/EXP0001_CCU384_401__2024-02-17T22_04_34-Measurement 19/Images/r03c07f01p01-ch1sk1fk1fl1....
1 EXP0001/EXP0001_CCU384_401__2024-02-17T22_04_34-Measurement 19/Images/r03c07f01p01-ch2sk1fk1fl1....
2 EXP0001/EXP0001_CCU384_401__2024-02-17T22_04_34-Measurement 19/Images/r06c17f01p01-ch1sk1fk1fl1....
3 EXP0001/EXP0001_CCU384_401__2024-02-17T22_04_34-Measurement 19/Images/r06c17f01p01-ch2sk1fk1fl1....
4 EXP0001/EXP0001_CV384W_201__2024-06-15T05_02_40-Measurement 1/Images/r03c07f01p01-ch1sk1fk1fl1.tiff
5 EXP0001/EXP0001_CV384W_201__2024-06-15T05_02_40-Measurement 1/Images/r03c07f01p01-ch2sk1fk1fl1.tiff
6 EXP0001/EXP0001_CV384W_201__2024-06-15T05_02_40-Measurement 1/Images/r06c17f01p01-ch1sk1fk1fl1.tiff
7 EXP0001/EXP0001_CV384W_201__2024-06-15T05_02_40-Measurement 1/Images/r06c17f01p01-ch2sk1fk1fl1.tiff
If you have any issues/questions regarding this code, feel free to open an issue