This is a curated list of computational biology datasets that have been pre-processed for machine learning. This list is a work in progress; please submit a pull request for any dataset you would like to advertise!
Name | Description | Comments |
---|---|---|
The Cancer Genome Atlas | Variety of Cancer Data | most cancer types have 100-1000 samples |
NIH GDC | Cancer, many types of genomic data | |
UK Biobank | ||
European Genome-Phenome Archive | ||
METABRIC | The genomic profiles (somatic mutations [targeted sequencing], copy number alterations, and gene expression) of 2509 breast cancers. | |
HapMap | ||
23andMe | 2280 Public Domain Curated Genotypes | |
Mice | SNPs, 2000+ samples | 4 generations. It might be possible to learn a family structure from the data. |
Arabidopsis | SNPs, 100+ phenotypes | |

Name | Description | Comments |
---|---|---|
TargetFinder | ~100,000 DNA-DNA interaction pairs | |

Name | Description | Comments |
---|---|---|
GEO | NCBI's main repository for gene expression and other functional genomics data | |
ENCODE | Variety of assays to identify functional elements | |
ArrayExpress | DNA sequencing, gene/protein expression, epigenetics | |
Cytometry Continuous | Flow cytometry data on 11 proteins and phospholipids; discretized and cleaned data available offline | Classical benchmark dataset for learning graphical models; contains known errors |
Transcription factor binding | ChIP-Seq data on 12 TFs | |
GTEx | Landmark study for eQTL analysis | |
PharmacoGenomics DB | ||
ProteomeXChange | ||
BeatAML | whole-exome sequencing, RNA sequencing and analyses of ex vivo drug sensitivity | 672 tumour specimens collected from 562 patients |

Name | Description | Comments |
---|---|---|
Single-cell expression atlas | ||
scPerturb | single-cell perturbation-response datasets | harmonized and preprocessed across 44 original datasets |

Name | Description | Comments |
---|---|---|
TRRUST | manually curated database of human transcriptional regulatory network | |
Yeast Network | 23-million yeast 2-hybrid experiments to investigate genetic interactions | |
Perturb-Seq | Integrated model of perturbations, single cell phenotypes, and epistatic interactions | |
KEGG Metabolic Regulatory Network (Undirected) | 65554 instances, 29 attributes each | |
KEGG Metabolic Regulatory Network (Directed) | 53414 instances, 24 attributes each | |

Name | Description | Comments |
---|---|---|
The Cancer Imaging Archive | Imaging data matched to TCGA cases | |
Multiple Myeloma DREAM Challenge | Challenge to identify Multiple Myeloma Patients | |
Breast Cancer Wisconsin (Diagnostic) Data Set | Predict whether the cancer is benign or malignant | |
DDSM | Mammogram Database | |
Kaggle Soft Tissue Sarcomas | Preprocessed subset of the TCIA study "Soft Tissue Sarcoma" | segmentation task |
Kaggle Cervical Cancer Screening | Classify cervix type from images | |
CAMELYON17 | Pathology challenge: automated detection and classification of breast cancer metastases in whole-slide images of histological lymph node sections | |
Grand Challenges | Datasets from biomedical image analysis competitions | |
Breast Cancer MRI Dataset | Demographic, clinical, pathology, treatment, outcomes, and genomic data + MRI images | |

Name | Description | Comments |
---|---|---|
ENIGMA Cerebellum | Goal: Examine the relationships between regional atrophy and motor and cognitive dysfunction | |
Seizure Prediction | Goal: Classify EEG time series into pre-seizure vs. interictal (i.e., not preceding a seizure). | |

Name | Description | Comments |
---|---|---|
MIMIC | 59,000 EHRs | |
UCI Diabetes | Data from 130 US hospitals, 1999-2008 | |
i2b2 | Clinical notes only, designed for NLP tasks | |
PhysioNet | ||
Metadata Acquired from Clinical Case Reports (MACCRs) | 3,100 curated clinical case reports spanning 15 disease groups and more than 750 reports of rare diseases | |
eICU | 200k EHRs | |
All of Us | >250k EHRs, some genomic data | |
PMC-Patients | 167k patient summaries with 3.1M patient-article relevance annotations and 293k patient-patient similarity annotations | |

Name | Description | Comments |
---|---|---|
CheXpert | 200k chest radiographs | Competition and leaderboard associated |
MIMIC-CXR | ~400k chest x-rays, 14 labels | Data on PhysioNet |
PadChest | 160k chest x-rays, 174 different findings | |

Name | Description | Comments |
---|---|---|
HINT (High-quality INTeractomes) | curated compilation of high-quality protein-protein interactions from 8 interactome resources | |

Name | Description | Comments |
---|---|---|
National Population Health Survey | Longitudinal survey that collects health information every two years. | |

Name | Description | Comments |
---|---|---|
ProteinNet | Standardized dataset for learning protein structure. Includes sequences, structures, alignments, PSSMs, and standardized train/test/valid splits. | |

Name | Description | Comments |
---|---|---|
BioASQ | Abstracts of medical articles (from PubMed); ontologies of medical concepts. | Tasks: MLC, QA. |
Cases | Articles from medical case studies. | |
UPMC Pathology | UPMC Pathology case studies. | |

Name | Description | Comments |
---|---|---|
Therapeutic Data Commons | Many preprocessed datasets for therapeutic discovery, including target discovery, activity modeling, efficacy and safety, and manufacturing. | Available as Python modules. |
Cancer Omics Drug Experiment Response Dataset | Molecular datasets paired with corresponding drug sensitivity data | Seeks to standardize datasets of cancer drug responses into a standard schema |