This is a benchmark repo for the LLM-IE Python package. We used a GPT-4-synthesized medical note for a system-wide evaluation, and the 2012, 2014, and 2018 i2b2/n2c2 datasets for benchmarking. Note that the datasets are NOT included in this repo, in compliance with the data use agreements. To access the datasets, please refer to the DBMI data portal.
We used the LLM-IE package to build an information extraction pipeline for drug, condition, and ADE entities, attributes, and relations. For each frame extracted by the frame extractor, the attribute "Type" labels the frame as one of "Drug", "Condition", or "ADE". If the Type is "Drug", "Dosage" and "Frequency" are extracted as additional attributes. If the Type is "Condition", an "Assertion" attribute is assigned. The relations between a "Condition" frame and a "Drug" frame, and between an "ADE" frame and a "Drug" frame, are extracted by the relation extractor. We visualized the results with the viz_render() method and displayed them in a browser.
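For illustration, frames under this schema might look like the following (a hypothetical sketch; the attribute names follow the description above, but the exact output format is set by the prompt templates):

```python
# Hypothetical frames under the schema above (illustrative only)
drug_frame = {"entity_text": "Losartan", "Type": "Drug", "Dosage": "50 mg", "Frequency": "daily"}
condition_frame = {"entity_text": "hypertension", "Type": "Condition", "Assertion": "present"}
ade_frame = {"entity_text": "dry cough", "Type": "ADE"}

# Relations link "Condition" and "ADE" frames to the corresponding "Drug" frame,
# e.g., (condition_frame, drug_frame) and (ade_frame, drug_frame).
```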
For the NER and entity attribute extraction tasks, the Sentence Frame Extractor achieved the best F1 scores while consuming the most GPU time. The Review Frame Extractor had higher recall than the Basic Frame Extractor on all NER tasks.
**Named Entity Recognition**

*2012 Temporal Relations Challenge*

| Algorithm | GPU time (s)/note | EVENT Precision | EVENT Recall | EVENT F1 | TIMEX Precision | TIMEX Recall | TIMEX F1 |
|---|---|---|---|---|---|---|---|
| Basic | 67.5 | 0.9406 | 0.2841 | 0.4364 | 0.9595 | 0.3516 | 0.5147 |
| Review | 84.0 | 0.8965 | 0.3995 | 0.5527 | 0.9352 | 0.5473 | 0.6905 |
| Sentence | 132.9 | 0.9101 | 0.6824 | 0.7799 | 0.8891 | 0.739 | 0.8071 |

*2014 De-identification Challenge*

| Algorithm | GPU time (s)/note | Strict Precision | Strict Recall | Strict F1 | Relaxed Precision | Relaxed Recall | Relaxed F1 |
|---|---|---|---|---|---|---|---|
| Basic | 9.4 | 0.7154 | 0.4813 | 0.5755 | 0.7172 | 0.4826 | 0.5769 |
| Review | 15.7 | 0.5649 | 0.5454 | 0.555 | 0.5667 | 0.5471 | 0.5567 |
| Sentence | 20.7 | 0.6683 | 0.7379 | 0.7014 | 0.6703 | 0.7401 | 0.7035 |

*2018 (Track 2) ADE and Medication Extraction Challenge*

| Algorithm | GPU time (s)/note | Strict Precision | Strict Recall | Strict F1 | Lenient Precision | Lenient Recall | Lenient F1 |
|---|---|---|---|---|---|---|---|
| Basic | 44.3 | 0.7384 | 0.3534 | 0.478 | 0.8537 | 0.4034 | 0.5479 |
| Review | 63.2 | 0.7209 | 0.427 | 0.5363 | 0.8416 | 0.4918 | 0.6208 |
| Sentence | 114.1 | 0.852 | 0.6166 | 0.7154 | 0.963 | 0.692 | 0.8053 |

**Entity Attribute Extraction**

*2012 Temporal Relations Challenge*

| Algorithm | GPU time (s)/note | EVENT Type | EVENT Polarity | EVENT Modality | TIMEX Type | TIMEX Value | TIMEX Modifier |
|---|---|---|---|---|---|---|---|
| Basic | 67.5 | 0.2589 | 0.2707 | 0.2737 | 0.3236 | 0.2835 | 0.3198 |
| Review | 84.0 | 0.358 | 0.3799 | 0.3828 | 0.4934 | 0.4209 | 0.4857 |
| Sentence | 132.9 | 0.6056 | 0.642 | 0.6432 | 0.678 | 0.5505 | 0.667 |

**Relation Extraction**

*2018 (Track 2) ADE and Medication Extraction Challenge*

| Algorithm | GPU time (s)/note | Precision | Recall | F1 |
|---|---|---|---|---|
| Multi-class | 213.9 | 0.3831 | 0.978 | 0.5505 |
All experiments were conducted with the LLM-IE Python package and the vLLM inference engine.
pip install llm-ie==0.3.1
pip install vllm==0.5.4
For visualization, our plug-in package ie-viz is needed.
pip install ie-viz==0.1.4
We used vLLM's OpenAI-compatible server to run Llama-3.1-70B-Instruct.
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct --api-key EMPTY --tensor-parallel-size 4 --enable-prefix-caching
The full code is available in the pipeline directory. The configuration files are in the config directories for each benchmark: 2012 i2b2, 2014 i2b2, and 2018 n2c2. Below are technical highlights for each task.
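The snippets below read their settings from a config object loaded from those files. A minimal sketch of the keys they rely on (the key names come from the code below; the out_dir and run_name values here are placeholders):

```python
# Assumed config structure (key names from the snippets below; values are placeholders)
config = {
    "base_url": "http://localhost:8000/v1",  # vLLM OpenAI-compatible server default
    "system_prompt": ("You are a highly skilled clinical AI assistant, proficient in "
                      "reviewing clinical notes and performing accurate information extraction"),
    "out_dir": "outputs",        # placeholder output directory
    "run_name": "sentence_run",  # placeholder run name
}
```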
We use the sentence frame extraction pipeline as the demo. The full code is available in NER_sentence.py.
We import the inference engine, the extractor (prompting algorithm), and the document class (for entity output storage) from LLM-IE.
from llm_ie.engines import OpenAIInferenceEngine
from llm_ie.extractors import SentenceFrameExtractor
from llm_ie.data_types import LLMInformationExtractionDocument
Define the inference engine. Since we use vLLM's OpenAI-compatible server, we use OpenAIInferenceEngine. The config['base_url'] is http://localhost:8000/v1, following the vLLM default.
engine = OpenAIInferenceEngine(base_url=config['base_url'],
api_key="EMPTY",
model="meta-llama/Meta-Llama-3.1-70B-Instruct")
Define the extractor with a prompt template and system prompt. The full prompt templates are in the prompt_templates directories under each benchmark. The system prompt for all tasks is "You are a highly skilled clinical AI assistant, proficient in reviewing clinical notes and performing accurate information extraction".
extractor = SentenceFrameExtractor(inference_engine=engine,
prompt_template=prompt_template,
system_prompt=config['system_prompt'])
Iterate through all documents and extract frames with the extractor.extract_frames() method. The extracted frames are stored in an LLMInformationExtractionDocument and saved to disk.
from tqdm import tqdm
import os

loop = tqdm(IEs, total=len(IEs), leave=True)
for ie in loop:
    loop.set_description(f"doc_id: {ie['doc_id']}")
    frames = extractor.extract_frames(text_content=ie['text'], entity_key="entity_text", multi_turn=False, stream=False)
    doc = LLMInformationExtractionDocument(doc_id=ie['doc_id'], text=ie['text'])
    for frame in frames:
        doc.add_frame(frame, valid_mode="span", create_id=True)
    doc.save(os.path.join(config['out_dir'], config['run_name'], f"{doc.doc_id}.llmie"))
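Note that the loop assumes IEs is a list of dict-like records holding each note's ID and text; the loading code is in the full script. A hypothetical sketch:

```python
# Assumed shape of IEs (illustrative; see the full script for the actual loading code)
IEs = [
    {"doc_id": "0001", "text": "Admission Date: 2012-03-01. The patient ..."},
    {"doc_id": "0002", "text": "HISTORY OF PRESENT ILLNESS: ..."},
]
```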
Named entity recognition and entity attribute extraction use the same pipeline, following the steps above. The only difference is the prompt template: the schema definition specifies which attributes to extract.
...
# Schema definition
Your output should contain:
"entity_text": the exact wording as mentioned in the note.
"entity_type": type of the entity. It should be one of the "EVENT" or "TIMEX3".
if entity_type is "EVENT",
"type": the event type as one of the "TEST", "PROBLEM", "TREATMENT", "CLINICAL_DEPT", "EVIDENTIAL", or "OCCURRENCE".
"polarity": whether an EVENT is positive ("POS") or negative ("NAG"). For example, in “the patient reports headache, and denies chills”, the EVENT [headache] is positive in its polarity, and the EVENT [chills] is negative in its polarity.
"modality": whether an EVENT actually occurred or not. Must be one of the "FACTUAL", "CONDITIONAL", "POSSIBLE", or "PROPOSED".
if entity_type is "TIMEX3",
"type": the type as one of the "DATE", "TIME", "DURATION", or "FREQUENCY".
"val": the numeric value 1) DATE: [YYYY]-[MM]-[DD], 2) TIME: [hh]:[mm]:[ss], 3) DURATION: P[n][Y/M/W/D]. So, “for eleven days” will be
represented as “P11D”, meaning a period of 11 days. 4) R[n][duration], where n denotes the number of repeats. When the n is omitted, the expression denotes an unspecified amount of repeats. For example, “once a day for 3 days” is “R3P1D” (repeat the time interval of 1 day (P1D) for 3 times (R3)), twice every day is “RP12H” (repeat every 12 hours)
"mod": additional information regarding the temporal value of a time expression. Must be one of the:
“NA”: the default value, no relevant modifier is present;
“MORE”, means “more than”, e.g. over 2 days (val = P2D, mod = MORE);
“LESS”, means “less than”, e.g. almost 2 months (val = P2M, mod=LESS);
“APPROX”, means “approximate”, e.g. nearly a week (val = P1W, mod=APPROX);
“START”, describes the beginning of a period of time, e.g. Christmas morning, 2005 (val = 2005-12-25, mod = START);
“END”, describes the end of a period of time, e.g. late last year (val = 2010, mod = END);
“MIDDLE”, describes the middle of a period of time, e.g. mid-September 2001 (val = 2001-09, mod = MIDDLE).
# Output format definition
Your output should follow JSON format,
if there are EVENT or TIMEX3 entity mentions:
[
{"entity_text": "<Exact entity mentions as in the note>", "entity_type": "EVENT", "type": "<event type>", "polarity": "<event polarity>", "modality": "<event modality>"},
{"entity_text": "<Exact entity mentions as in the note>", "entity_type": "TIMEX3", "type": "<TIMEX3 type>", "val": "<time value>", "mod": "<additional information>"}
...
]
if there is no entity mentioned in the given sentence, just output an empty list:
[]
I am only interested in the extracted contents in []. Do NOT explain your answer.
...
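For example, for a sentence like “The patient was admitted on 2012-03-01 and denies chills.”, a well-formed output under this schema would be (a hypothetical illustration, not actual model output):

```python
# Hypothetical well-formed output for the sentence above (illustrative only)
[
    {"entity_text": "admitted", "entity_type": "EVENT", "type": "OCCURRENCE", "polarity": "POS", "modality": "FACTUAL"},
    {"entity_text": "chills", "entity_type": "EVENT", "type": "PROBLEM", "polarity": "NEG", "modality": "FACTUAL"},
    {"entity_text": "2012-03-01", "entity_type": "TIMEX3", "type": "DATE", "val": "2012-03-01", "mod": "NA"}
]
```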
The full code is available in RE_multiclass. We import the MultiClassRelationExtractor class for relation extraction.
from llm_ie.engines import OpenAIInferenceEngine
from llm_ie.extractors import MultiClassRelationExtractor
from llm_ie.data_types import LLMInformationExtractionDocument
Define the inference engine. As in the NER pipeline, we use OpenAIInferenceEngine with vLLM's OpenAI-compatible server; config['base_url'] is http://localhost:8000/v1, following the default.
engine = OpenAIInferenceEngine(base_url=config['base_url'],
api_key="EMPTY",
model="meta-llama/Meta-Llama-3.1-70B-Instruct")
We define a Python function possible_relation_types_func() that takes two frames and returns the possible relation types between them. This dataset defines the following relations:
- Strength-Drug: this is a relationship between the drug strength and its name.
- Dosage-Drug: this is a relationship between the drug dosage and its name.
- Duration-Drug: this is a relationship between a drug duration and its name.
- Frequency-Drug: this is a relationship between a drug frequency and its name.
- Form-Drug: this is a relationship between a drug form and its name.
- Route-Drug: this is a relationship between the route of administration for a drug and its name.
- Reason-Drug: this is a relationship between the reason for which a drug was administered (e.g., symptoms, diseases, etc.) and a drug name.
- ADE-Drug: this is a relationship between an adverse drug event (ADE) and a drug name.
The possible_relation_types_func() returns [] ("no relation") when the two frames are more than 500 characters apart. If one entity type is "Drug" and the other is not, it returns the corresponding "<EntityType>-Drug" relation type; otherwise, it returns [].
from typing import List

def possible_relation_types_func(frame_1, frame_2) -> List[str]:
    # If the two frames are > 500 characters apart, we assume "No Relation"
    if abs(frame_1.start - frame_2.start) > 500:
        return []
    # If one frame is a "Drug" and the other is an attribute entity,
    # the only possible relation type is "<EntityType>-Drug"
    if frame_1.attr["EntityType"] == "Drug" and frame_2.attr["EntityType"] != "Drug":
        return [f'{frame_2.attr["EntityType"]}-Drug']
    if frame_2.attr["EntityType"] == "Drug" and frame_1.attr["EntityType"] != "Drug":
        return [f'{frame_1.attr["EntityType"]}-Drug']
    return []
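As a quick sanity check, a "Dosage" frame near a "Drug" frame should yield ["Dosage-Drug"], while distant frames yield []. The stand-in frames below are hypothetical, assuming only that frames expose .start and .attr as used above:

```python
from types import SimpleNamespace

# Hypothetical stand-in frames exposing .start and .attr as used above
drug = SimpleNamespace(start=100, attr={"EntityType": "Drug"})
dosage = SimpleNamespace(start=130, attr={"EntityType": "Dosage"})
reason_far = SimpleNamespace(start=900, attr={"EntityType": "Reason"})

print(possible_relation_types_func(drug, dosage))      # ['Dosage-Drug']
print(possible_relation_types_func(drug, reason_far))  # [] (> 500 characters apart)
```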
Define the extractor and pass in possible_relation_types_func().
extractor = MultiClassRelationExtractor(inference_engine=engine,
prompt_template=prompt_template,
system_prompt=config['system_prompt'],
possible_relation_types_func=possible_relation_types_func)
Run the extractor with the extractor.extract_relations() method and add the relations to the document object, then save to disk.
from tqdm import tqdm
import os

loop = tqdm(docs, total=len(docs), leave=True)
for doc in loop:
    loop.set_description(f"doc_id: {doc.doc_id}")
    relations = extractor.extract_relations(doc=doc, stream=False)
    doc.add_relations(relations)
    doc.save(os.path.join(config['out_dir'], config['run_name'], f"{doc.doc_id}.llmie"))
The GPT-4-synthesized medical note and the full code are available in demo_ADE_extraction.py.
Import LLM-IE
from llm_ie.engines import LlamaCppInferenceEngine
from llm_ie.extractors import SentenceFrameExtractor, BinaryRelationExtractor
from llm_ie.data_types import LLMInformationExtractionDocument
The medical note
note_text = """**Patient:** John Doe, 45 M
**Physician:** Dr. Emily Johnson, Cardiologist, Green Valley Hospital
---
John is a 45-year-old male with a history of hypertension (dx 2015), Type 2 diabetes (dx 2018), and hyperlipidemia. He has been experiencing
increased angina episodes since July 2024. He initially presented with complaints of occasional dizziness and fatigue, likely due to
Lisinopril 10 mg daily.
**Meds Adjustments:**
- Lisinopril was reduced to 5 mg daily, but the patient later developed a persistent dry cough (suspected ADR). Switched to Losartan 50 mg daily,
which resolved the cough.
- Added Atorvastatin 20 mg daily in May 2024 for cholesterol control but caused muscle cramps. Switched to Rosuvastatin 10 mg daily in June 2024.
- Noticed palpitations and headaches since starting Sitagliptin 100 mg daily for better glucose control. Reduced to 50 mg due to GI upset and
added Pantoprazole 20 mg.
**Current Meds:**
- Losartan 50 mg daily
- Metformin 500 mg BID
- Rosuvastatin 10 mg daily
- Sitagliptin 50 mg daily + Pantoprazole 20 mg daily
- Carvedilol 12.5 mg BID (increased from 6.25 mg for angina)
---
**Plan:**
Dr. Johnson advised John to monitor his blood pressure closely and keep a log of any side effects or new symptoms, especially related to the
recent medication changes. Follow-up scheduled for October 2024 to reassess symptom control, particularly regarding angina frequency and GI
symptoms.
"""
We use llama.cpp to run Meta-Llama-3.1-70B-Instruct with int8 quantization.
llm = LlamaCppInferenceEngine(repo_id="bullerwins/Meta-Llama-3.1-70B-Instruct-GGUF",
gguf_filename="Meta-Llama-3.1-70B-Instruct-Q8_0-00001-of-00002.gguf",
n_ctx=16000,
verbose=False)
The named entity recognition and entity attribute extraction are performed end-to-end.
# Define extractor
extractor = SentenceFrameExtractor(llm, prompt_template, system_prompt="You are a helpful medical AI assistant.")
# Extract
frames = extractor.extract_frames(note_text, entity_key="EntityText", stream=True)
# Check extractions
for frame in frames:
print(frame.to_dict())
# Define document
doc = LLMInformationExtractionDocument(doc_id="Meidcal note", text=note_text)
# Add frames to document
doc.add_frames(frames, valid_mode="span", create_id=True)
Relation extraction
def possible_relation_func(frame_1, frame_2) -> bool:
    # If the two frames are > 500 characters apart, we assume "No Relation"
    if abs(frame_1.start - frame_2.start) > 500:
        return False
    # A "Drug" frame and a "Condition" frame may be related
    if (frame_1.attr["Type"] == "Drug" and frame_2.attr["Type"] == "Condition") or \
       (frame_2.attr["Type"] == "Drug" and frame_1.attr["Type"] == "Condition"):
        return True
    # A "Drug" frame and an "ADE" frame may be related
    if (frame_1.attr["Type"] == "Drug" and frame_2.attr["Type"] == "ADE") or \
       (frame_2.attr["Type"] == "Drug" and frame_1.attr["Type"] == "ADE"):
        return True
    return False
# Define relation extractor
relation_extractor = BinaryRelationExtractor(llm, prompt_template=prompt_template, possible_relation_func=possible_relation_func)
# Extract binary relations
relations = relation_extractor.extract_relations(doc, stream=True)
# Add to document
doc.add_relations(relations)
To visualize, we render the results to HTML and save them to a file.
# Render the document to an HTML string, coloring entities by their "Type" attribute
html = doc.viz_render(color_attr_key="Type")
with open("demo_ADE_extraction.html", "w") as f:
    f.write(html)