This repository contains the official implementation and pre-trained weights for Clinical ModernBERT, a transformer-based encoder optimized for biomedical and clinical natural language processing (NLP) tasks. Our model leverages state-of-the-art innovations from ModernBERT, adapted specifically to biomedical literature and clinical data.
Clinical ModernBERT (137M parameters) incorporates ModernBERT's architectural enhancements, including:
- Extended Context Length (8,192 tokens): Accommodates lengthy clinical documents such as discharge summaries and comprehensive patient narratives.
- Rotary Positional Embeddings (RoPE): Facilitates efficient modeling of long-range dependencies critical for clinical text understanding.
- Flash Attention: Significantly reduces computational overhead, enabling efficient processing of extensive biomedical corpora.
- GeGLU Activation: Enhances representational capability with improved gradient flow.
The model is pre-trained on approximately 40 million PubMed abstracts and real-world clinical notes from MIMIC-IV, combined with structured medical terminologies (e.g., ICD codes). Pre-training ran for 150,000 steps on a single NVIDIA A100 80GB GPU with a batch size of 128.
Create the required environment with Conda:
conda env create -f environment.yml
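Then activate the environment before running any scripts. The environment name is defined in environment.yml; clinical_modernbert below is a placeholder:

conda activate clinical_modernbert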
Pretrained model weights and tokenizer artifacts are provided for easy integration into downstream biomedical NLP tasks:
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained('Simonlee711/Clinical_ModernBERT')
tokenizer = AutoTokenizer.from_pretrained('Simonlee711/Clinical_ModernBERT')
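As a quick sanity check, the snippet below embeds a short clinical sentence with the loaded model and mean-pools the token states into a single vector. Mean pooling is one common choice for sentence-level embeddings, not necessarily the pooling strategy used in the paper:

import torch

# Illustrative example: embed a short clinical note snippet.
text = "Patient admitted with acute exacerbation of COPD; started on prednisone."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=8192)
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the token embeddings, masking out padding positions.
mask = inputs["attention_mask"].unsqueeze(-1).float()
embedding = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(embedding.shape)  # (1, hidden_size)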
Pre-training Data Sources
The pre-training corpus comprises approximately 13 billion tokens:
- PubMed Abstracts (up to 2025): Biomedical literature metadata and abstracts from NLM's PubMed baseline; a primary source for pretraining language models on scientific discourse.
- MIMIC-IV Note (v2.2): A de-identified corpus of real-world hospital clinical notes, covering diverse specialties and temporal contexts, suitable for modeling clinical language patterns.
- ICD-9/10/11 Disease & Procedure Codes: Canonical taxonomies for diagnostic and procedural coding across multiple ICD versions. These codes offer structured clinical semantics useful for task supervision or embedding learning.
- ICD-10-CM Medication Codes: The U.S. Clinical Modification of ICD-10, providing detailed coding for drugs, toxic agents, and pharmacologic categories; valuable for aligning text spans to standardized medication representations.
The code-ontology pre-training text was constructed using the following template:
ICD [VERSION] code for [CODE]: [DESCRIPTION]
Example:
ICD 9 code for 250.00: Diabetes mellitus without mention of complication, type II or unspecified type, not stated as uncontrolled.
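A minimal sketch of how code descriptions can be serialized into this template; the mapping below is a hypothetical example, not the actual source files used for pre-training:

# Hypothetical code-to-description mapping, keyed by (ICD version, code).
icd_descriptions = {
    ("9", "250.00"): "Diabetes mellitus without mention of complication, "
                     "type II or unspecified type, not stated as uncontrolled",
    ("10", "E11.9"): "Type 2 diabetes mellitus without complications",
}

def serialize_code(version: str, code: str, description: str) -> str:
    # Render one entry into the "ICD [VERSION] code for [CODE]: [DESCRIPTION]" template.
    return f"ICD {version} code for {code}: {description}"

ontology_corpus = [serialize_code(v, c, d) for (v, c), d in icd_descriptions.items()]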
To launch pre-training, run (CUDA_VISIBLE_DEVICES=1 pins the job to GPU index 1; adjust for your hardware):
CUDA_VISIBLE_DEVICES=1 python3 pre-train.py
Clinical ModernBERT produces clear semantic clustering of medical codes, significantly outperforming general-domain models in capturing the inherent structure of medical terminologies.
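This can be probed qualitatively by comparing cosine similarities between embeddings of related and unrelated code descriptions. The sketch below reuses the model, tokenizer, and mean-pooling logic from the quick-start example; the specific codes are illustrative:

import torch
import torch.nn.functional as F

def embed(text: str) -> torch.Tensor:
    # Mean-pooled sentence embedding, as in the quick-start example above.
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    return (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

type2 = embed("ICD 10 code for E11.9: Type 2 diabetes mellitus without complications")
type1 = embed("ICD 10 code for E10.9: Type 1 diabetes mellitus without complications")
fracture = embed("ICD 10 code for S52.5: Fracture of lower end of radius")

# Related diagnoses should score higher than unrelated ones.
print(F.cosine_similarity(type2, type1).item())
print(F.cosine_similarity(type2, fracture).item())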
If you find Clinical ModernBERT useful in your work, please cite:
Paper Citation:
@misc{lee2025clinicalmodernbertefficientlong,
  title         = {Clinical ModernBERT: An efficient and long context encoder for biomedical text},
  author        = {Simon A. Lee and Anthony Wu and Jeffrey N. Chiang},
  year          = {2025},
  eprint        = {2504.03964},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2504.03964},
}
Model Card Citation:
@misc{simon_lee_2025,
  author    = {Simon Lee},
  title     = {Clinical_ModernBERT (Revision 24e72d6)},
  year      = {2025},
  url       = {https://huggingface.co/Simonlee711/Clinical_ModernBERT},
  doi       = {10.57967/hf/4999},
  publisher = {Hugging Face}
}
This work utilized resources provided by the UCLA Department of Computational Medicine.
For further inquiries or collaboration, please contact:
- Simon A. Lee: simonlee711@g.ucla.edu
- Anthony Wu: anthonytkwu@g.ucla.edu
- Jeffrey N. Chiang: njchiang@g.ucla.edu