daviden1013/LLM-IE_Benchmark

This is a benchmark repo for the LLM-IE Python package. We used a GPT-4-synthesized medical note for a system-wide evaluation. The 2012, 2014, and 2018 i2b2/n2c2 datasets are used for benchmarking. Note that the datasets are NOT included in this repo, in compliance with the data use agreements. To access the datasets, please refer to the DBMI data portal.

Overview

We used the LLM-IE package to build an information extraction pipeline for drug, condition, and ADE entities, their attributes, and their relations. For every frame produced by the frame extractor, the attribute "Type" gives the frame type: "Drug", "Condition", or "ADE". If the Type is "Drug", "Dosage" and "Frequency" are extracted as additional attributes. If the Type is "Condition", an "Assertion" attribute is assigned. The relations between "Condition" and "Drug" frames and between "ADE" and "Drug" frames are extracted by the relation extractor. We visualized the results with the viz_render() method and displayed them in a browser.
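
For illustration, frames following this schema might carry attributes like the following (a minimal sketch with made-up values, not actual pipeline output):

# Illustrative frames following the schema above (values are made up)
frames = [
    {"entity_text": "Lisinopril", "attr": {"Type": "Drug", "Dosage": "10 mg", "Frequency": "daily"}},
    {"entity_text": "dry cough", "attr": {"Type": "ADE"}},
    {"entity_text": "hypertension", "attr": {"Type": "Condition", "Assertion": "present"}},
]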

For the NER and EA tasks, the Sentence Frame Extractor achieved the best F1 scores, at the cost of the most GPU time per note (roughly 2 to 2.6 times that of the Basic Frame Extractor). The Review Frame Extractor had higher recall than the Basic Frame Extractor on all NER tasks.

Benchmarks

Named Entity Recognition

2012 Temporal Relations Challenge

| Algorithm | GPU time (s)/note | EVENT Precision | EVENT Recall | EVENT F1 | TIMEX Precision | TIMEX Recall | TIMEX F1 |
|-----------|-------------------|-----------------|--------------|----------|-----------------|--------------|----------|
| Basic     | 67.5              | 0.9406          | 0.2841       | 0.4364   | 0.9595          | 0.3516       | 0.5147   |
| Review    | 84.0              | 0.8965          | 0.3995       | 0.5527   | 0.9352          | 0.5473       | 0.6905   |
| Sentence  | 132.9             | 0.9101          | 0.6824       | 0.7799   | 0.8891          | 0.739        | 0.8071   |

2014 De-identification Challenge

| Algorithm | GPU time (s)/note | Strict Precision | Strict Recall | Strict F1 | Relaxed Precision | Relaxed Recall | Relaxed F1 |
|-----------|-------------------|------------------|---------------|-----------|-------------------|----------------|------------|
| Basic     | 9.4               | 0.7154           | 0.4813        | 0.5755    | 0.7172            | 0.4826         | 0.5769     |
| Review    | 15.7              | 0.5649           | 0.5454        | 0.555     | 0.5667            | 0.5471         | 0.5567     |
| Sentence  | 20.7              | 0.6683           | 0.7379        | 0.7014    | 0.6703            | 0.7401         | 0.7035     |

2018 (Track 2) ADE and Medication Extraction Challenge

| Algorithm | GPU time (s)/note | Strict Precision | Strict Recall | Strict F1 | Lenient Precision | Lenient Recall | Lenient F1 |
|-----------|-------------------|------------------|---------------|-----------|-------------------|----------------|------------|
| Basic     | 44.3              | 0.7384           | 0.3534        | 0.478     | 0.8537            | 0.4034         | 0.5479     |
| Review    | 63.2              | 0.7209           | 0.427         | 0.5363    | 0.8416            | 0.4918         | 0.6208     |
| Sentence  | 114.1             | 0.852            | 0.6166        | 0.7154    | 0.963             | 0.692          | 0.8053     |

Entity Attribute Extraction

2012 Temporal Relations Challenge

| Algorithm | GPU time (s)/note | EVENT Type | EVENT Polarity | EVENT Modality | TIMEX Type | TIMEX Value | TIMEX Modifier |
|-----------|-------------------|------------|----------------|----------------|------------|-------------|----------------|
| Basic     | 67.5              | 0.2589     | 0.2707         | 0.2737         | 0.3236     | 0.2835      | 0.3198         |
| Review    | 84.0              | 0.358      | 0.3799         | 0.3828         | 0.4934     | 0.4209      | 0.4857         |
| Sentence  | 132.9             | 0.6056     | 0.642          | 0.6432         | 0.678      | 0.5505      | 0.667          |

Relation Extraction

2018 (Track 2) ADE and Medication Extraction Challenge

| Algorithm   | GPU time (s)/note | Precision | Recall | F1     |
|-------------|-------------------|-----------|--------|--------|
| Multi-class | 213.9             | 0.3831    | 0.978  | 0.5505 |
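
As a consistency check, each F1 in the tables above follows from its precision and recall. For the multi-class relation extractor, for example:

$$F_1 = \frac{2PR}{P + R} = \frac{2 \times 0.3831 \times 0.978}{0.3831 + 0.978} \approx 0.5505$$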

Prerequisites

All experiments were conducted with the LLM-IE Python package and the vLLM inference engine.

pip install llm-ie==0.3.1
pip install vllm==0.5.4

For visualization, our plug-in package ie-viz is needed.

pip install ie-viz==0.1.4

We used vLLM's OpenAI-compatible server to run Llama-3.1-70B-Instruct.

vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct --api-key EMPTY --tensor-parallel-size 4 --enable-prefix-caching

Methods

The full code is available in the pipeline directory. The configuration files are in the config directories of each benchmark: 2012 i2b2, 2014 i2b2, and 2018 n2c2. Below are technical highlights for each task.

Named Entity Recognition

We use the Sentence Frame Extractor pipeline as a demo. The full code is available in NER_sentence.py.

We import the inference engine, the extractor (prompting algorithm), and the document class (for storing entity outputs) from LLM-IE.

from llm_ie.engines import OpenAIInferenceEngine
from llm_ie.extractors import SentenceFrameExtractor
from llm_ie.data_types import LLMInformationExtractionDocument

Define the inference engine. Since we use vLLM's OpenAI-compatible server, we use OpenAIInferenceEngine. Following the vLLM default, config['base_url'] is http://localhost:8000/v1.

engine = OpenAIInferenceEngine(base_url=config['base_url'],
                               api_key="EMPTY",
                               model="meta-llama/Meta-Llama-3.1-70B-Instruct")

Define the extractor with a prompt template and a system prompt. The full prompt templates are in the prompt_templates directories under each benchmark. The system prompt for all tasks is "You are a highly skilled clinical AI assistant, proficient in reviewing clinical notes and performing accurate information extraction".

extractor = SentenceFrameExtractor(inference_engine=engine,
                                   prompt_template=prompt_template,
                                   system_prompt=config['system_prompt'])
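
The prompt_template variable above holds the text of one of those template files. A minimal loading sketch (the filename here is illustrative; see each benchmark's prompt_templates directory for the actual files):

# Illustrative: read a prompt template from file (hypothetical filename)
with open("prompt_templates/NER_sentence.txt") as f:
    prompt_template = f.read()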

Iterate through all documents and extract frames with the extractor.extract_frames() method. The extracted frames are stored in an LLMInformationExtractionDocument and saved to disk.

import os
from tqdm import tqdm

loop = tqdm(IEs, total=len(IEs), leave=True)
for ie in loop:
    loop.set_description(f"doc_id: {ie['doc_id']}")
    frames = extractor.extract_frames(text_content=ie['text'], entity_key="entity_text", multi_turn=False, stream=False)
    doc = LLMInformationExtractionDocument(doc_id=ie['doc_id'], text=ie['text'])
    for frame in frames:
        # Add each frame with span validation and an auto-generated frame ID
        doc.add_frame(frame, valid_mode="span", create_id=True)

    doc.save(os.path.join(config['out_dir'], config['run_name'], f"{doc.doc_id}.llmie"))

Entity Attribute Extraction

The named entity recognition and entity attribute extraction tasks use the same pipeline, following the steps above. The only difference is the prompt template: its schema definition specifies the attributes to extract.

...
# Schema definition
Your output should contain: 
    "entity_text": the exact wording as mentioned in the note.
    "entity_type": type of the entity. It should be one of the "EVENT" or "TIMEX3".
    if entity_type is "EVENT",
        "type": the event type as one of the "TEST", "PROBLEM", "TREATMENT", "CLINICAL_DEPT", "EVIDENTIAL", or "OCCURRENCE".
        "polarity": whether an EVENT is positive ("POS") or negative ("NAG"). For example, in “the patient reports headache, and denies chills”, the EVENT [headache] is positive in its polarity, and the EVENT [chills] is negative in its polarity.
        "modality": whether an EVENT actually occurred or not. Must be one of the "FACTUAL", "CONDITIONAL", "POSSIBLE", or "PROPOSED".

    if entity_type is "TIMEX3",
        "type": the type as one of the "DATE", "TIME", "DURATION", or "FREQUENCY".
        "val": the numeric value 1) DATE: [YYYY]-[MM]-[DD], 2) TIME: [hh]:[mm]:[ss], 3) DURATION: P[n][Y/M/W/D]. So, “for eleven days” will be 
represented as “P11D”, meaning a period of 11 days. 4)  R[n][duration], where n denotes the number of repeats. When the n is omitted, the expression denotes an unspecified amount of repeats. For example, “once a day for 3 days” is “R3P1D” (repeat the time interval of 1 day (P1D) for 3 times (R3)), twice every day is “RP12H” (repeat every 12 hours)
        "mod": additional information regarding the temporal value of a time expression. Must be one of the:
            “NA”: the default value, no relevant modifier is present;  
            “MORE”, means “more than”, e.g. over 2 days (val = P2D, mod = MORE);  
            “LESS”, means “less than”, e.g. almost 2 months (val = P2M, mod=LESS); 
            “APPROX”, means “approximate”, e.g. nearly a week (val = P1W, mod=APPROX);  
            “START”, describes the beginning of a period of time, e.g.  Christmas morning, 2005 (val= 2005-12-25, mod= START).  
            “END”, describes the end of a period of time, e.g. late last year, (val = 2010, mod = END)
            “MIDDLE”, describes the middle of a period of time, e.g. mid-September 2001 (val = 2001-09, mod = MIDDLE) 

# Output format definition
Your output should follow JSON format,
if there are any EVENT or TIMEX3 entity mentions:
    [
        {"entity_text": "<Exact entity mentions as in the note>", "entity_type": "EVENT", "type": "<event type>", "polarity": "<event polarity>", "modality": "<event modality>"},
        {"entity_text": "<Exact entity mentions as in the note>", "entity_type": "TIMEX3", "type": "<TIMEX3 type>", "val": "<time value>", "mod": "<additional information>"}
        ...
     ]
if there is no entity mentioned in the given sentence, just output an empty list:
    []

I am only interested in the extracted contents in []. Do NOT explain your answer.
...
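
For instance, for the polarity example sentence in the schema above (“the patient reports headache, and denies chills”), a schema-conformant output could look like this (illustrative, not actual model output):

[
    {"entity_text": "headache", "entity_type": "EVENT", "type": "PROBLEM", "polarity": "POS", "modality": "FACTUAL"},
    {"entity_text": "chills", "entity_type": "EVENT", "type": "PROBLEM", "polarity": "NEG", "modality": "FACTUAL"}
]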

Relation Extraction

The full code is available in RE_multiclass.

We import the MultiClassRelationExtractor class for relation extraction.

from llm_ie.engines import OpenAIInferenceEngine
from llm_ie.extractors import MultiClassRelationExtractor
from llm_ie.data_types import LLMInformationExtractionDocument

Define the inference engine. Since we use vLLM's OpenAI-compatible server, we use OpenAIInferenceEngine. Following the vLLM default, config['base_url'] is http://localhost:8000/v1.

engine = OpenAIInferenceEngine(base_url=config['base_url'],
                               api_key="EMPTY",
                               model="meta-llama/Meta-Llama-3.1-70B-Instruct")

We define a Python function, possible_relation_types_func(), that takes two frames and returns the possible relation types between them. This dataset defines the following relations:

  • Strength-Drug: this is a relationship between the drug strength and its name.
  • Dosage-Drug: this is a relationship between the drug dosage and its name.
  • Duration-Drug: this is a relationship between a drug duration and its name.
  • Frequency-Drug: this is a relationship between a drug frequency and its name.
  • Form-Drug: this is a relationship between a drug form and its name.
  • Route-Drug: this is a relationship between the route of administration for a drug and its name.
  • Reason-Drug: this is a relationship between the reason for which a drug was administered (e.g., symptoms, diseases, etc.) and a drug name.
  • ADE-Drug: this is a relationship between an adverse drug event (ADE) and a drug name.

possible_relation_types_func() returns [] ("no relation") when the two frames are more than 500 characters apart. If exactly one of the two frames is a Drug, it returns the corresponding <EntityType>-Drug relation type; otherwise, it returns [].

from typing import List

def possible_relation_types_func(frame_1, frame_2) -> List[str]:
    # If the two frames are > 500 characters apart, we assume "No Relation"
    if abs(frame_1.start - frame_2.start) > 500:
        return []

    # If exactly one of the two frames is a "Drug", the only possible relation
    # type is "<other entity type>-Drug"
    if (frame_1.attr["EntityType"] == "Drug" and frame_2.attr["EntityType"] != "Drug"):
        return [f'{frame_2.attr["EntityType"]}-Drug']
    if (frame_2.attr["EntityType"] == "Drug" and frame_1.attr["EntityType"] != "Drug"):
        return [f'{frame_1.attr["EntityType"]}-Drug']

    return []
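
A quick sanity check of the logic above. The Frame class here is a hypothetical stand-in for the package's frame objects, used only to exercise the function:

from dataclasses import dataclass

@dataclass
class Frame:
    start: int
    attr: dict

# One "Strength" frame and one "Drug" frame, 15 characters apart
print(possible_relation_types_func(Frame(120, {"EntityType": "Strength"}),
                                   Frame(135, {"EntityType": "Drug"})))  # ['Strength-Drug']

# Frames more than 500 characters apart are assumed unrelated
print(possible_relation_types_func(Frame(135, {"EntityType": "Drug"}),
                                   Frame(800, {"EntityType": "ADE"})))   # []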

Define the extractor and pass in possible_relation_types_func().

extractor = MultiClassRelationExtractor(inference_engine=engine,
                                        prompt_template=prompt_template,
                                        system_prompt=config['system_prompt'],
                                        possible_relation_types_func=possible_relation_types_func)

Run the extractor with extractor.extract_relations(), add the relations to the document object, and save to disk.

loop = tqdm(docs, total=len(docs), leave=True)
for doc in loop:
    loop.set_description(f"doc_id: {doc.doc_id}")
    relations = extractor.extract_relations(doc=doc, stream=False)
    doc.add_relations(relations)
    doc.save(os.path.join(config['out_dir'], config['run_name'], f"{doc.doc_id}.llmie"))

System Evaluation

The GPT-4-synthesized medical note and the full code are available in demo_ADE_extraction.py.

Import LLM-IE

from llm_ie.engines import LlamaCppInferenceEngine
from llm_ie.extractors import SentenceFrameExtractor, BinaryRelationExtractor
from llm_ie.data_types import LLMInformationExtractionDocument

The medical note

note_text = """**Patient:** John Doe, 45 M  
**Physician:** Dr. Emily Johnson, Cardiologist, Green Valley Hospital

---

John is a 45-year-old male with a history of hypertension (dx 2015), Type 2 diabetes (dx 2018), and hyperlipidemia. He has been experiencing 
increased angina episodes since July 2024. He initially presented with complaints of occasional dizziness and fatigue, likely due to 
Lisinopril 10 mg daily.

**Meds Adjustments:**  
- Lisinopril was reduced to 5 mg daily, but the patient later developed a persistent dry cough (suspected ADR). Switched to Losartan 50 mg daily, 
which resolved the cough.
- Added Atorvastatin 20 mg daily in May 2024 for cholesterol control but caused muscle cramps. Switched to Rosuvastatin 10 mg daily in June 2024.
- Noticed palpitations and headaches since starting Sitagliptin 100 mg daily for better glucose control. Reduced to 50 mg due to GI upset and 
added Pantoprazole 20 mg.

**Current Meds:**  
- Losartan 50 mg daily  
- Metformin 500 mg BID  
- Rosuvastatin 10 mg daily  
- Sitagliptin 50 mg daily + Pantoprazole 20 mg daily  
- Carvedilol 12.5 mg BID (increased from 6.25 mg for angina)

---

**Plan:**  
Dr. Johnson advised John to monitor his blood pressure closely and keep a log of any side effects or new symptoms, especially related to the 
recent medication changes. Follow-up scheduled for October 2024 to reassess symptom control, particularly regarding angina frequency and GI 
symptoms.
"""

We use llama.cpp to run Meta-Llama-3.1-70B-Instruct with int8 quantization.

llm = LlamaCppInferenceEngine(repo_id="bullerwins/Meta-Llama-3.1-70B-Instruct-GGUF",
                              gguf_filename="Meta-Llama-3.1-70B-Instruct-Q8_0-00001-of-00002.gguf",
                              n_ctx=16000,
                              verbose=False)

The named entity recognition and entity attribute extraction are performed end-to-end.

# Define extractor
extractor = SentenceFrameExtractor(llm, prompt_template, system_prompt="You are a helpful medical AI assistant.")

# Extract
frames = extractor.extract_frames(note_text, entity_key="EntityText", stream=True)

# Check extractions
for frame in frames:
    print(frame.to_dict())

# Define document
doc = LLMInformationExtractionDocument(doc_id="Medical note", text=note_text)

# Add frames to document
doc.add_frames(frames, valid_mode="span", create_id=True)

Relation extraction

def possible_relation_func(frame_1, frame_2) -> bool:
    # If the two frames are > 500 characters apart, we assume "No Relation"
    if abs(frame_1.start - frame_2.start) > 500:
        return False

    # A "Drug" frame and a "Condition" frame may hold a relation
    if (frame_1.attr["Type"] == "Drug" and frame_2.attr["Type"] == "Condition") or \
        (frame_2.attr["Type"] == "Drug" and frame_1.attr["Type"] == "Condition"):
        return True

    # A "Drug" frame and an "ADE" frame may hold a relation
    if (frame_1.attr["Type"] == "Drug" and frame_2.attr["Type"] == "ADE") or \
        (frame_2.attr["Type"] == "Drug" and frame_1.attr["Type"] == "ADE"):
        return True

    return False

# Define relation extractor
relation_extractor = BinaryRelationExtractor(llm, prompt_template=prompt_template, possible_relation_func=possible_relation_func)

# Extract binary relations (candidate pairs filtered by possible_relation_func)
relations = relation_extractor.extract_relations(doc, stream=True)

# Add to document
doc.add_relations(relations)

To visualize, we render the results to HTML and save to file.

html = doc.viz_render(color_attr_key="Type")

with open("demo_ADE_extraction.html", "w") as f:
    f.write(html)
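
To display the result in a browser, as mentioned in the Overview, one option is Python's built-in webbrowser module (a convenience sketch, not part of the benchmark code):

import webbrowser

# Open the rendered HTML in the default browser
webbrowser.open("demo_ADE_extraction.html")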
