Download a tarballed DICOM dataset from the CFMM DICOM server
cfmm2tar is a command-line tool for querying and downloading DICOM studies from the CFMM (Centre for Functional and Metabolic Mapping) DICOM server.
cfmm2tar uses pixi for dependency management, which automatically handles all dependencies including Python, dcm4che tools, and required libraries.
Requirements:
- Pixi package manager
- Git
Installation:
- Install pixi (if not already installed):
  curl -fsSL https://pixi.sh/install.sh | bash
  Or on Windows:
  iwr -useb https://pixi.sh/install.ps1 | iex
- Clone the repository:
  git clone https://github.com/khanlab/cfmm2tar
  cd cfmm2tar
- Install dependencies:
  pixi install
- Activate the environment:
  # Option 1: Start a shell with the environment activated
  pixi shell
  # Option 2: Use pixi shell-hook for automatic activation
  eval "$(pixi shell-hook)"
- Install cfmm2tar globally:
  pixi global install cfmm2tar -c khanlab
Usage:
OUTPUT_DIR=/path/to/dir
mkdir -p ${OUTPUT_DIR}
# Show help
cfmm2tar --help
# Download studies for a specific Principal^Project on a specific date
cfmm2tar -p 'Khan^NeuroAnalytics' -d '20240101' ${OUTPUT_DIR}
# Download all studies on a specific date
cfmm2tar -d '20170530' ${OUTPUT_DIR}
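# Date searches also accept DICOM range syntax, as used elsewhere in this README:
# a closed range (start-end) or an open-ended range (start-); dates shown are placeholders
cfmm2tar -d '20240101-20240131' ${OUTPUT_DIR}
cfmm2tar -d '20240601-' ${OUTPUT_DIR}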
# Download a specific study by StudyInstanceUID
cfmm2tar -u '1.2.840.113619.2.55.3.1234567890.123' ${OUTPUT_DIR}

You will be prompted for your UWO username and password. You can only download datasets to which you have read permissions.
Running without activating the shell:
You can also run commands directly using pixi run:
pixi run cfmm2tar -p 'Khan^Project' -d '20240101' ${OUTPUT_DIR}

Using pixi provides several advantages:
- ✅ All dependencies included: Python, dcm4che tools, and all libraries are automatically managed
- ✅ Cross-platform: Works on Linux, macOS, and Windows
- ✅ Reproducible environments: Lock file ensures consistent dependency versions
- ✅ No containers needed: Direct installation on your system
- ✅ Easy development: Simple setup for both users and contributors
- ✅ Fast: Binary packages from conda-forge install quickly
Search and download DICOM studies based on search criteria:
# Download all studies for a specific Principal^Project on a specific date
cfmm2tar -p 'Khan^NeuroAnalytics' -d '20240101' output_dir
# Download studies for a specific patient
cfmm2tar -n '*subj01*' output_dir
# Download a specific study by StudyInstanceUID
cfmm2tar -u '1.2.840.113619.2.55.3.1234567890.123' output_dir

You can query and export study metadata to a TSV file without downloading the actual DICOM files. Metadata is always saved to study_metadata.tsv in the output directory:
# Export metadata for all studies on a specific date
cfmm2tar -m -d '20240101' output_dir
# Export metadata for a specific Principal^Project
cfmm2tar -m -p 'Khan^NeuroAnalytics' -d '20240101-20240131' output_dir

This creates a TSV file at output_dir/study_metadata.tsv with columns:
- StudyInstanceUID: Unique identifier for the study
- PatientName: Patient name
- PatientID: Patient ID
- StudyDate: Date of the study
- StudyDescription: Study description (typically Principal^Project)
Note: When downloading studies (without -m), metadata is automatically saved to study_metadata.tsv in the output directory.
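Because the file is plain tab-separated text, standard shell tools are enough for a quick look at it. A minimal sketch (column and head are ordinary shell utilities, not part of cfmm2tar):

# Pretty-print the first few rows of the exported metadata
column -t -s $'\t' output_dir/study_metadata.tsv | head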
You can include additional DICOM tags in the metadata TSV using the --metadata-tags option. This is useful for downstream filtering, remapping, or analysis based on custom metadata fields:
# Include PatientBirthDate in metadata
cfmm2tar -m --metadata-tags 00100030:PatientBirthDate -d '20240101' output_dir
# Include multiple additional tags
cfmm2tar -m \
--metadata-tags 00100030:PatientBirthDate \
--metadata-tags 00100040:PatientSex \
-p 'Khan^NeuroAnalytics' -d '20240101' output_dir
# Works with download mode too
cfmm2tar --metadata-tags 00100030:PatientBirthDate -d '20240101' output_dir

The format is TAG:NAME where:
- TAG is the DICOM tag in hexadecimal format (e.g., 00100030)
- NAME is the column name you want in the TSV (e.g., PatientBirthDate)
Common DICOM tags you might want to include:
- 00100030:PatientBirthDate - Patient's birth date
- 00100040:PatientSex - Patient's sex (M/F/O)
- 00101010:PatientAge - Patient's age at time of study
- 00080050:AccessionNumber - Accession number
- 00200010:StudyID - Study ID
- 00080090:ReferringPhysicianName - Referring physician
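Any of these can be combined in one command by repeating --metadata-tags, exactly as in the examples above. A minimal sketch using two tags from this list (the project name and date range are placeholders):

cfmm2tar -m \
    --metadata-tags 00101010:PatientAge \
    --metadata-tags 00080050:AccessionNumber \
    -p 'Khan^NeuroAnalytics' -d '20240101-20240131' output_dir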
Note: The DICOM tag must exist in the PACS query response. If a tag is missing for a particular study, the column will contain an empty value.
After reviewing the metadata file, you can download specific studies:
# Download all studies from the metadata file
cfmm2tar --from-metadata study_metadata.tsv output_dir
# Or create a filtered version of the metadata file and download only those
# (e.g., filter in Excel, grep, awk, or Python)
cfmm2tar --from-metadata study_metadata_filtered.tsv output_dir
# You can also use a simple text file with one UID per line
cfmm2tar --from-metadata uid_list.txt output_dir

The typical metadata-driven workflow is:
- Query and export metadata for review:
  cfmm2tar -m -p 'Khan^NeuroAnalytics' -d '20240101-20240131' output_dir
  This creates output_dir/study_metadata.tsv
- Review and filter the study_metadata.tsv file (e.g., in Excel or with command-line tools)
- Download the filtered studies:
  cfmm2tar --from-metadata output_dir/study_metadata_filtered.tsv output_dir
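As an example of the filtering step above, here is one possible command-line filter (a sketch only; the column number and the pattern are illustrative and depend on your TSV layout and naming conventions):

# Keep the header plus any row whose PatientName column matches subj01
awk -F '\t' 'NR == 1 || $2 ~ /subj01/' output_dir/study_metadata.tsv \
    > output_dir/study_metadata_filtered.tsv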
This workflow is especially useful when:
- You want to review available studies before downloading
- Storage is limited and you need to select specific studies
- You're sharing the metadata with collaborators to decide what to download
- You need to filter studies based on multiple criteria
You can use the --skip-derived flag to exclude DICOM files with ImageType containing "DERIVED". This is useful to filter out:
- Reformatted images (MPR, MIP, etc.)
- Screen captures
- Derived/calculated images
- Post-processed images
Only ORIGINAL/PRIMARY images will be included in the tar file when using this option.
# Download studies, skipping derived images
cfmm2tar --skip-derived -p 'Khan^NeuroAnalytics' -d '20240101' output_dir
# Can be combined with other options
cfmm2tar --skip-derived --from-metadata study_metadata.tsv output_dir

This is particularly useful when:
- You only need the original acquired images for analysis
- Storage is limited and you want to exclude redundant reformats
- Your pipeline doesn't require derived images
In addition to the command-line interface, cfmm2tar provides a Python API for programmatic access. This is useful for integration into Python scripts, Jupyter notebooks, or workflow management tools like Snakemake.
# Basic installation
pip install cfmm2tar
# With pandas support for DataFrame operations
pip install cfmm2tar[dataframe]

Note: The Python API requires dcm4che tools to be installed separately, or you can use the --dcm4che-container option (future feature) to point to a container with dcm4che.
The API functions automatically handle credentials in the following order of precedence:
- Provided parameters: username and password arguments (if supplied)
- Environment variables: UWO_USERNAME and UWO_PASSWORD
- Credentials file: ~/.uwo_credentials (line 1: username, line 2: password)
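One way to set up the credentials file from a shell (a sketch; the restrictive permissions are a suggested precaution, not something cfmm2tar enforces):

# Create ~/.uwo_credentials with the username on line 1 and the password on line 2
printf 'your_username\nyour_password\n' > ~/.uwo_credentials
chmod 600 ~/.uwo_credentials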
This means you can use the API without explicitly passing credentials in most cases:
from cfmm2tar import query_metadata
# Credentials automatically loaded from ~/.uwo_credentials or environment variables
studies = query_metadata(
study_description="Khan^NeuroAnalytics",
study_date="20240101-20240131"
)

Or provide credentials explicitly when needed:
studies = query_metadata(
username="your_username",
password="your_password",
study_description="Khan^NeuroAnalytics",
study_date="20240101-20240131"
)

Or use environment variables in scripts or CI/CD:
export UWO_USERNAME="your_username"
export UWO_PASSWORD="your_password"
python your_script.py

Query study metadata and get results as a list of dictionaries or pandas DataFrame:
from cfmm2tar import query_metadata
# Query metadata (credentials from ~/.uwo_credentials or env vars)
studies = query_metadata(
study_description="Khan^NeuroAnalytics",
study_date="20240101-20240131",
patient_name="*",
return_type="list" # or "dataframe" for pandas DataFrame
)
print(f"Found {len(studies)} studies")
for study in studies:
    print(f"  {study['StudyDate']}: {study['StudyDescription']}")

Query with additional DICOM tags:
from cfmm2tar import query_metadata
# Query metadata with additional DICOM tags
studies = query_metadata(
study_description="Khan^NeuroAnalytics",
study_date="20240101",
additional_tags={
"00100030": "PatientBirthDate",
"00100040": "PatientSex",
"00101010": "PatientAge"
}
)
# Access additional fields
for study in studies:
    print(f"{study['PatientName']}: Age {study['PatientAge']}, Sex {study['PatientSex']}")

With pandas DataFrame:
import pandas as pd
from cfmm2tar import query_metadata
# Query metadata and get as DataFrame
df = query_metadata(
study_description="Khan^*",
study_date="20240101-",
return_type="dataframe"
)
# Filter and analyze
recent_studies = df[df['StudyDate'] > '20240601']
print(recent_studies[['StudyDate', 'PatientName', 'StudyDescription']])

Download studies programmatically:
from cfmm2tar import download_studies
# Download studies matching criteria (credentials from ~/.uwo_credentials or env vars)
output_dir = download_studies(
output_dir="/path/to/output",
study_description="Khan^NeuroAnalytics",
study_date="20240101",
patient_name="*subj01*"
)
print(f"Studies downloaded to: {output_dir}")

Download a specific study by UID:
from cfmm2tar import download_studies
download_studies(
output_dir="/path/to/output",
study_instance_uid="1.2.840.113619.2.55.3.1234567890.123"
)

Download with additional DICOM tags in metadata:
from cfmm2tar import download_studies
# Download studies and include additional tags in metadata TSV
output_dir = download_studies(
output_dir="/path/to/output",
study_description="Khan^NeuroAnalytics",
study_date="20240101",
additional_tags={
"00100030": "PatientBirthDate",
"00100040": "PatientSex"
}
)
# The metadata TSV will include PatientBirthDate and PatientSex columns

Download studies using metadata from various sources:
from cfmm2tar import download_studies_from_metadata
# From a list of study metadata dicts (credentials from ~/.uwo_credentials or env vars)
studies = [
{'StudyInstanceUID': '1.2.3.4', 'PatientName': 'Patient1'},
{'StudyInstanceUID': '5.6.7.8', 'PatientName': 'Patient2'}
]
download_studies_from_metadata(
output_dir="/path/to/output",
metadata=studies
)
# From a TSV file
download_studies_from_metadata(
output_dir="/path/to/output",
metadata="study_metadata.tsv"
)
# From a pandas DataFrame
import pandas as pd
df = pd.read_csv("study_metadata.tsv", sep="\t")
filtered_df = df[df['StudyDate'] > '20240101']
download_studies_from_metadata(
output_dir="/path/to/output",
metadata=filtered_df
)

Here's a complete workflow that queries metadata, filters studies, and downloads selected ones:
from cfmm2tar import query_metadata, download_studies_from_metadata
import pandas as pd
# Step 1: Query all available studies (credentials from ~/.uwo_credentials or env vars)
studies_df = query_metadata(
study_description="Khan^*",
study_date="20240101-20240131",
return_type="dataframe"
)
print(f"Found {len(studies_df)} total studies")
# Step 2: Filter studies based on criteria
# For example, only studies with specific patient names
filtered_df = studies_df[
studies_df['PatientName'].str.contains('subj0[1-3]', regex=True)
]
print(f"Filtered to {len(filtered_df)} studies")
# Step 3: Download the filtered studies
download_studies_from_metadata(
output_dir="/path/to/output",
metadata=filtered_df
)
print("Download complete!")

The Python API works seamlessly with Snakemake workflows:
# Snakefile
from cfmm2tar import query_metadata, download_studies_from_metadata
# Query metadata in a rule
rule query_studies:
output:
"metadata/study_list.tsv"
run:
# Credentials automatically loaded from env vars or ~/.uwo_credentials
studies = query_metadata(
study_description=config["project"],
study_date=config["date_range"],
return_type="dataframe"
)
studies.to_csv(output[0], sep="\t", index=False)
# Download studies in another rule
rule download_studies:
input:
"metadata/study_list.tsv"
output:
directory("data/dicoms")
run:
download_studies_from_metadata(
output_dir=output[0],
metadata=input[0]
)

query_metadata: Query study metadata from the DICOM server.
Parameters:
- username (str, optional): UWO username for authentication (see credential precedence below)
- password (str, optional): UWO password for authentication (see credential precedence below)
- credentials_file (str, optional): Custom path to credentials file
- study_description (str): Study description search string (default: "*")
- study_date (str): Date search string (default: "-")
- patient_name (str): PatientName search string (default: "*")
- dicom_server (str): DICOM server connection string (default: "CFMM@dicom.cfmm.uwo.ca:11112")
- dcm4che_options (str): Additional dcm4che options (default: "")
- force_refresh_trust_store (bool): Force refresh trust store (default: False)
- return_type (str): "list" or "dataframe" (default: "list")
Credential Precedence:
- Provided username/password parameters
- UWO_USERNAME and UWO_PASSWORD environment variables
- ~/.uwo_credentials file (line 1: username, line 2: password)
Returns:
- List of dicts or pandas DataFrame with study metadata
download_studies: Download DICOM studies and create tar archives.
Parameters:
- output_dir (str): Output directory for tar archives
- username (str, optional): UWO username for authentication (see credential precedence)
- password (str, optional): UWO password for authentication (see credential precedence)
- credentials_file (str, optional): Custom path to credentials file
- study_description (str): Study description search string (default: "*")
- study_date (str): Date search string (default: "-")
- patient_name (str): PatientName search string (default: "*")
- study_instance_uid (str): Specific StudyInstanceUID (default: "*")
- temp_dir (str, optional): Temporary directory for intermediate files
- dicom_server (str): DICOM server connection string (default: "CFMM@dicom.cfmm.uwo.ca:11112")
- dcm4che_options (str): Additional dcm4che options (default: "")
- force_refresh_trust_store (bool): Force refresh trust store (default: False)
- keep_sorted_dicom (bool): Keep sorted DICOM files (default: False)
- skip_derived (bool): Skip DICOM files with ImageType containing DERIVED (default: False)
Returns:
- Path to output directory
download_studies_from_metadata: Download studies using UIDs from a metadata source.
Parameters:
- output_dir (str): Output directory for tar archives
- metadata (str, list, or DataFrame): Metadata source (file path, list of dicts, or DataFrame)
- username (str, optional): UWO username for authentication (see credential precedence)
- password (str, optional): UWO password for authentication (see credential precedence)
- credentials_file (str, optional): Custom path to credentials file
- temp_dir (str, optional): Temporary directory for intermediate files
- dicom_server (str): DICOM server connection string (default: "CFMM@dicom.cfmm.uwo.ca:11112")
- dcm4che_options (str): Additional dcm4che options (default: "")
- force_refresh_trust_store (bool): Force refresh trust store (default: False)
- keep_sorted_dicom (bool): Keep sorted DICOM files (default: False)
- skip_derived (bool): Skip DICOM files with ImageType containing DERIVED (default: False)
Returns:
- Path to output directory
For complete working examples, see the examples/ directory:
- examples/api_usage.py: Interactive examples demonstrating various API usage patterns
- examples/Snakefile_example: Example Snakemake workflow integrating cfmm2tar
- examples/README.md: Detailed documentation for the examples
Run the interactive examples:
python examples/api_usage.py

When connecting to the CFMM DICOM server, cfmm2tar requires a valid TLS certificate trust store for secure communication. The tool automatically handles certificate management for you.
cfmm2tar automatically:
- Downloads the UWO Sectigo certificate from the institutional PKI server
- Creates a JKS (Java KeyStore) trust store file using keytool
- Caches the trust store in ~/.cfmm2tar/mytruststore.jks for future use
- Adds the --trust-store option to all dcm4che commands
This happens transparently on first use - no manual setup required!
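If you want to confirm that the cached trust store exists on your machine, a quick check against the default path mentioned above:

ls -l ~/.cfmm2tar/mytruststore.jks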
If the certificate expires or you need to refresh the cached trust store:
# Force refresh the trust store
cfmm2tar --refresh-trust-store -p 'Khan^NeuroAnalytics' output_dir
# The trust store will be automatically refreshed before downloading

For advanced users or troubleshooting:
from cfmm2tar import truststore
# Get the default trust store path
path = truststore.get_truststore_path()
print(f"Trust store location: {path}")
# Force creation/refresh of trust store
truststore.ensure_truststore(force_refresh=True)

The automatic trust store setup requires:
- wget (for downloading the certificate)
- keytool (part of Java JRE/JDK)
These are automatically included when using pixi, as the Java runtime is installed as a dependency of dcm4che-tools.
Note: If trust store setup fails (e.g., network issues, missing tools), cfmm2tar will log a warning but continue to operate. However, TLS connections may fail without a valid trust store.
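If you suspect the tools are missing outside of a pixi environment, a quick check (command -v is a standard shell builtin, not part of cfmm2tar):

# Prints the path of each tool that is found; exits non-zero if either is missing
command -v wget keytool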
For contributors and developers:
# Install pixi (if not already installed)
curl -fsSL https://pixi.sh/install.sh | bash
# Clone the repository
git clone https://github.com/khanlab/cfmm2tar
cd cfmm2tar
# Install dependencies (including dev dependencies)
pixi install
# Activate the development environment
pixi shell
# Set up pre-commit hooks (runs quality checks before each commit)
pre-commit install

This project uses ruff for linting and formatting:
# Format code
ruff format .
# Check for lint issues
ruff check .
# Fix auto-fixable issues
ruff check --fix .
# Run pre-commit hooks manually
pre-commit run --all-files

This project includes a comprehensive testing framework using a containerized dcm4che PACS instance.
# Activate the pixi environment
pixi shell
# Run unit tests (no PACS server required)
pytest tests/test_dcm4che_utils.py::TestDcm4cheUtilsUnit -v
# Run unit tests with coverage
pytest tests/test_dcm4che_utils.py::TestDcm4cheUtilsUnit -v --cov=cfmm2tar --cov-report=term-missing
# Run integration tests (requires Docker)
cd tests
docker compose up -d
sleep 60 # Wait for PACS to be ready
cd ..
pytest tests/test_dcm4che_utils.py::TestDcm4cheUtilsIntegration -v
# Clean up
cd tests
docker compose down -v

Alternatively, you can run tests using pixi directly without activating the shell:
# Run unit tests
pixi run pytest tests/test_dcm4che_utils.py::TestDcm4cheUtilsUnit -v
# Run all tests with coverage
pixi run pytest tests/ -v --cov=cfmm2tar --cov-report=term-missing --cov-report=html

See tests/README.md for detailed testing documentation.
The project uses pytest-cov for code coverage analysis:
# Run tests with coverage report
pytest tests/ --cov=cfmm2tar --cov-report=term-missing --cov-report=html
# View coverage report in browser
# Open htmlcov/index.html in your browser
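# (For example, assuming a desktop session; pick the line that matches your OS)
xdg-open htmlcov/index.html    # Linux
# open htmlcov/index.html      # macOS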
# Generate XML coverage report (for CI/CD integration)
pytest tests/ --cov=cfmm2tar --cov-report=xml

Coverage reports are automatically generated in CI/CD and uploaded as artifacts.
The project uses GitHub Actions for automated testing. The workflow:
- Sets up the pixi environment
- Runs unit tests with code coverage on every push and pull request
- Starts a containerized dcm4chee PACS server
- Runs integration tests against the PACS server with coverage
- Uploads coverage reports as artifacts
- Displays coverage summary in the workflow
See .github/workflows/test.yml for the complete workflow.