daviden1013/LLM-IE_Benchmark

This is a benchmark repo for the LLM-IE Python package. We used a GPT-4-synthesized medical note for a system-wide evaluation. The 2012, 2014, and 2018 i2b2/n2c2 datasets are used for benchmarking. Note that the datasets are NOT included in this repo, in compliance with the data use agreements. To access the datasets, please refer to the DBMI data portal.

Overview

We used the LLM-IE package to build an information extraction pipeline for drug, condition, and ADE entities, their attributes, and their relations. For every frame produced by the frame extractor, the attribute "Type" gives the frame type: "Drug", "Condition", or "ADE". If the Type is "Drug", "Dosage" and "Frequency" are extracted as additional attributes. If the Type is "Condition", an "Assertion" attribute is assigned. The relations between "Condition" and "Drug" frames and between "ADE" and "Drug" frames are extracted by the relation extractor. We visualized the results with the viz_render() method and displayed them in a browser.
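
For illustration, frames following this schema might carry attributes like the following (a minimal sketch with made-up values, not actual pipeline output):

# Illustrative frames following the schema above (values are made up)
frames = [
    {"entity_text": "Lisinopril", "attr": {"Type": "Drug", "Dosage": "10 mg", "Frequency": "daily"}},
    {"entity_text": "dry cough", "attr": {"Type": "ADE"}},
    {"entity_text": "hypertension", "attr": {"Type": "Condition", "Assertion": "present"}},
]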

For the NER and EA tasks, the Sentence Frame Extractor achieved the best F1 scores, at the cost of the most GPU time per note (roughly 2 to 2.6 times that of the Basic Frame Extractor). The Review Frame Extractor had higher recall than the Basic Frame Extractor on all NER tasks.

Benchmarks

Named Entity Recognition

2012 Temporal Relations Challenge

| Algorithm | GPU time (s)/note | EVENT Precision | EVENT Recall | EVENT F1 | TIMEX Precision | TIMEX Recall | TIMEX F1 |
|-----------|-------------------|-----------------|--------------|----------|-----------------|--------------|----------|
| Basic     | 67.5              | 0.9406          | 0.2841       | 0.4364   | 0.9595          | 0.3516       | 0.5147   |
| Review    | 84.0              | 0.8965          | 0.3995       | 0.5527   | 0.9352          | 0.5473       | 0.6905   |
| Sentence  | 132.9             | 0.9101          | 0.6824       | 0.7799   | 0.8891          | 0.739        | 0.8071   |

2014 De-identification Challenge

| Algorithm | GPU time (s)/note | Strict Precision | Strict Recall | Strict F1 | Relaxed Precision | Relaxed Recall | Relaxed F1 |
|-----------|-------------------|------------------|---------------|-----------|-------------------|----------------|------------|
| Basic     | 9.4               | 0.7154           | 0.4813        | 0.5755    | 0.7172            | 0.4826         | 0.5769     |
| Review    | 15.7              | 0.5649           | 0.5454        | 0.555     | 0.5667            | 0.5471         | 0.5567     |
| Sentence  | 20.7              | 0.6683           | 0.7379        | 0.7014    | 0.6703            | 0.7401         | 0.7035     |

2018 (Track 2) ADE and Medication Extraction Challenge

| Algorithm | GPU time (s)/note | Strict Precision | Strict Recall | Strict F1 | Lenient Precision | Lenient Recall | Lenient F1 |
|-----------|-------------------|------------------|---------------|-----------|-------------------|----------------|------------|
| Basic     | 44.3              | 0.7384           | 0.3534        | 0.478     | 0.8537            | 0.4034         | 0.5479     |
| Review    | 63.2              | 0.7209           | 0.427         | 0.5363    | 0.8416            | 0.4918         | 0.6208     |
| Sentence  | 114.1             | 0.852            | 0.6166        | 0.7154    | 0.963             | 0.692          | 0.8053     |

Entity Attribute Extraction

2012 Temporal Relations Challenge

| Algorithm | GPU time (s)/note | EVENT Type | EVENT Polarity | EVENT Modality | TIMEX Type | TIMEX Value | TIMEX Modifier |
|-----------|-------------------|------------|----------------|----------------|------------|-------------|----------------|
| Basic     | 67.5              | 0.2589     | 0.2707         | 0.2737         | 0.3236     | 0.2835      | 0.3198         |
| Review    | 84.0              | 0.358      | 0.3799         | 0.3828         | 0.4934     | 0.4209      | 0.4857         |
| Sentence  | 132.9             | 0.6056     | 0.642          | 0.6432         | 0.678      | 0.5505      | 0.667          |

Relation Extraction

2018 (Track 2) ADE and Medication Extraction Challenge

| Algorithm   | GPU time (s)/note | Precision | Recall | F1     |
|-------------|-------------------|-----------|--------|--------|
| Multi-class | 213.9             | 0.3831    | 0.978  | 0.5505 |
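
As a consistency check, each F1 in the tables above follows from its precision and recall. For the multi-class relation extractor, for example:

$$F_1 = \frac{2PR}{P + R} = \frac{2 \times 0.3831 \times 0.978}{0.3831 + 0.978} \approx 0.5505$$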

Prerequisites

All experiments were conducted with the LLM-IE Python package and the vLLM inference engine.

pip install llm-ie==0.3.1
pip install vllm==0.5.4

For visualization, our plug-in package ie-viz is needed.

pip install ie-viz==0.1.4

We used vLLM's OpenAI-compatible server to run Llama-3.1-70B-Instruct.

vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct --api-key EMPTY --tensor-parallel-size 4 --enable-prefix-caching

Methods

The full code is available in the pipeline directory. The configuration files are in the config directories of each benchmark: 2012 i2b2, 2014 i2b2, and 2018 n2c2. Below are technical highlights for each task.

Named Entity Recognition

We use the Sentence Frame Extractor pipeline as a demo. The full code is available in NER_sentence.py.

We import the inference engine, the extractor (prompting algorithm), and the document class (for storing entity outputs) from LLM-IE.

from llm_ie.engines import OpenAIInferenceEngine
from llm_ie.extractors import SentenceFrameExtractor
from llm_ie.data_types import LLMInformationExtractionDocument

Define the inference engine. Since we use vLLM's OpenAI-compatible server, we use OpenAIInferenceEngine. Following the vLLM default, config['base_url'] is http://localhost:8000/v1.

engine = OpenAIInferenceEngine(base_url=config['base_url'],
                               api_key="EMPTY",
                               model="meta-llama/Meta-Llama-3.1-70B-Instruct")

Define the extractor with a prompt template and a system prompt. The full prompt templates are in the prompt_templates directories under each benchmark. The system prompt for all tasks is "You are a highly skilled clinical AI assistant, proficient in reviewing clinical notes and performing accurate information extraction".

extractor = SentenceFrameExtractor(inference_engine=engine,
                                   prompt_template=prompt_template,
                                   system_prompt=config['system_prompt'])
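
The prompt_template variable above holds the text of one of those template files. A minimal loading sketch (the filename here is illustrative; see each benchmark's prompt_templates directory for the actual files):

# Illustrative: read a prompt template from file (hypothetical filename)
with open("prompt_templates/NER_sentence.txt") as f:
    prompt_template = f.read()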

Iterate through all documents and extract frames with the extractor.extract_frames() method. The extracted frames are stored in an LLMInformationExtractionDocument and saved to disk.

import os
from tqdm import tqdm

loop = tqdm(IEs, total=len(IEs), leave=True)
for ie in loop:
    loop.set_description(f"doc_id: {ie['doc_id']}")
    frames = extractor.extract_frames(text_content=ie['text'], entity_key="entity_text", multi_turn=False, stream=False)
    doc = LLMInformationExtractionDocument(doc_id=ie['doc_id'], text=ie['text'])
    for frame in frames:
        # Add each frame with span validation and an auto-generated frame ID
        doc.add_frame(frame, valid_mode="span", create_id=True)

    doc.save(os.path.join(config['out_dir'], config['run_name'], f"{doc.doc_id}.llmie"))

Entity Attribute Extraction

The named entity recognition and entity attribute extraction tasks use the same pipeline, following the steps above. The only difference is the prompt template: its schema definition specifies the attributes to extract.

...
# Schema definition
Your output should contain: 
    "entity_text": the exact wording as mentioned in the note.
    "entity_type": type of the entity. It should be one of the "EVENT" or "TIMEX3".
    if entity_type is "EVENT",
        "type": the event type as one of the "TEST", "PROBLEM", "TREATMENT", "CLINICAL_DEPT", "EVIDENTIAL", or "OCCURRENCE".
        "polarity": whether an EVENT is positive ("POS") or negative ("NAG"). For example, in “the patient reports headache, and denies chills”, the EVENT [headache] is positive in its polarity, and the EVENT [chills] is negative in its polarity.
        "modality": whether an EVENT actually occurred or not. Must be one of the "FACTUAL", "CONDITIONAL", "POSSIBLE", or "PROPOSED".

    if entity_type is "TIMEX3",
        "type": the type as one of the "DATE", "TIME", "DURATION", or "FREQUENCY".
        "val": the numeric value 1) DATE: [YYYY]-[MM]-[DD], 2) TIME: [hh]:[mm]:[ss], 3) DURATION: P[n][Y/M/W/D]. So, “for eleven days” will be 
represented as “P11D”, meaning a period of 11 days. 4)  R[n][duration], where n denotes the number of repeats. When the n is omitted, the expression denotes an unspecified amount of repeats. For example, “once a day for 3 days” is “R3P1D” (repeat the time interval of 1 day (P1D) for 3 times (R3)), twice every day is “RP12H” (repeat every 12 hours)
        "mod": additional information regarding the temporal value of a time expression. Must be one of the:
            “NA”: the default value, no relevant modifier is present;  
            “MORE”, means “more than”, e.g. over 2 days (val = P2D, mod = MORE);  
            “LESS”, means “less than”, e.g. almost 2 months (val = P2M, mod=LESS); 
            “APPROX”, means “approximate”, e.g. nearly a week (val = P1W, mod=APPROX);  
            “START”, describes the beginning of a period of time, e.g.  Christmas morning, 2005 (val= 2005-12-25, mod= START).  
            “END”, describes the end of a period of time, e.g. late last year, (val = 2010, mod = END)
            “MIDDLE”, describes the middle of a period of time, e.g. mid-September 2001 (val = 2001-09, mod = MIDDLE) 

# Output format definition
Your output should follow JSON format,
if there are any EVENT or TIMEX3 entity mentions:
    [
        {"entity_text": "<Exact entity mentions as in the note>", "entity_type": "EVENT", "type": "<event type>", "polarity": "<event polarity>", "modality": "<event modality>"},
        {"entity_text": "<Exact entity mentions as in the note>", "entity_type": "TIMEX3", "type": "<TIMEX3 type>", "val": "<time value>", "mod": "<additional information>"}
        ...
     ]
if there is no entity mentioned in the given sentence, just output an empty list:
    []

I am only interested in the extracted contents in []. Do NOT explain your answer.
...
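
For instance, for the polarity example sentence in the schema above (“the patient reports headache, and denies chills”), a schema-conformant output could look like this (illustrative, not actual model output):

[
    {"entity_text": "headache", "entity_type": "EVENT", "type": "PROBLEM", "polarity": "POS", "modality": "FACTUAL"},
    {"entity_text": "chills", "entity_type": "EVENT", "type": "PROBLEM", "polarity": "NEG", "modality": "FACTUAL"}
]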

Relation Extraction

The full code is available in RE_multiclass.

We import the MultiClassRelationExtractor class for relation extraction.

from llm_ie.engines import OpenAIInferenceEngine
from llm_ie.extractors import MultiClassRelationExtractor
from llm_ie.data_types import LLMInformationExtractionDocument

Define the inference engine. Since we use vLLM's OpenAI-compatible server, we use OpenAIInferenceEngine. Following the vLLM default, config['base_url'] is http://localhost:8000/v1.

engine = OpenAIInferenceEngine(base_url=config['base_url'],
                               api_key="EMPTY",
                               model="meta-llama/Meta-Llama-3.1-70B-Instruct")

We define a Python function, possible_relation_types_func(), that takes two frames and returns the possible relation types between them. This dataset defines the following relations:

  • Strength-Drug: this is a relationship between the drug strength and its name.
  • Dosage-Drug: this is a relationship between the drug dosage and its name.
  • Duration-Drug: this is a relationship between a drug duration and its name.
  • Frequency-Drug: this is a relationship between a drug frequency and its name.
  • Form-Drug: this is a relationship between a drug form and its name.
  • Route-Drug: this is a relationship between the route of administration for a drug and its name.
  • Reason-Drug: this is a relationship between the reason for which a drug was administered (e.g., symptoms, diseases, etc.) and a drug name.
  • ADE-Drug: this is a relationship between an adverse drug event (ADE) and a drug name.

possible_relation_types_func() returns [] ("no relation") when the two frames are more than 500 characters apart. If exactly one of the two frames is a Drug, it returns the corresponding <EntityType>-Drug relation type; otherwise, it returns [].

from typing import List

def possible_relation_types_func(frame_1, frame_2) -> List[str]:
    # If the two frames are > 500 characters apart, we assume "No Relation"
    if abs(frame_1.start - frame_2.start) > 500:
        return []

    # If exactly one of the two frames is a "Drug", the only possible relation
    # type is "<other entity type>-Drug"
    if (frame_1.attr["EntityType"] == "Drug" and frame_2.attr["EntityType"] != "Drug"):
        return [f'{frame_2.attr["EntityType"]}-Drug']
    if (frame_2.attr["EntityType"] == "Drug" and frame_1.attr["EntityType"] != "Drug"):
        return [f'{frame_1.attr["EntityType"]}-Drug']

    return []
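
A quick sanity check of the logic above. The Frame class here is a hypothetical stand-in for the package's frame objects, used only to exercise the function:

from dataclasses import dataclass

@dataclass
class Frame:
    start: int
    attr: dict

# One "Strength" frame and one "Drug" frame, 15 characters apart
print(possible_relation_types_func(Frame(120, {"EntityType": "Strength"}),
                                   Frame(135, {"EntityType": "Drug"})))  # ['Strength-Drug']

# Frames more than 500 characters apart are assumed unrelated
print(possible_relation_types_func(Frame(135, {"EntityType": "Drug"}),
                                   Frame(800, {"EntityType": "ADE"})))   # []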

Define the extractor and pass in possible_relation_types_func().

extractor = MultiClassRelationExtractor(inference_engine=engine,
                                        prompt_template=prompt_template,
                                        system_prompt=config['system_prompt'],
                                        possible_relation_types_func=possible_relation_types_func)

Run the extractor with extractor.extract_relations(), add the relations to the document object, and save to disk.

loop = tqdm(docs, total=len(docs), leave=True)
for doc in loop:
    loop.set_description(f"doc_id: {doc.doc_id}")
    relations = extractor.extract_relations(doc=doc, stream=False)
    doc.add_relations(relations)
    doc.save(os.path.join(config['out_dir'], config['run_name'], f"{doc.doc_id}.llmie"))

System Evaluation

The GPT-4-synthesized medical note and the full code are available in demo_ADE_extraction.py.

Import LLM-IE

from llm_ie.engines import LlamaCppInferenceEngine
from llm_ie.extractors import SentenceFrameExtractor, BinaryRelationExtractor
from llm_ie.data_types import LLMInformationExtractionDocument

The medical note

note_text = """**Patient:** John Doe, 45 M  
**Physician:** Dr. Emily Johnson, Cardiologist, Green Valley Hospital

---

John is a 45-year-old male with a history of hypertension (dx 2015), Type 2 diabetes (dx 2018), and hyperlipidemia. He has been experiencing 
increased angina episodes since July 2024. He initially presented with complaints of occasional dizziness and fatigue, likely due to 
Lisinopril 10 mg daily.

**Meds Adjustments:**  
- Lisinopril was reduced to 5 mg daily, but the patient later developed a persistent dry cough (suspected ADR). Switched to Losartan 50 mg daily, 
which resolved the cough.
- Added Atorvastatin 20 mg daily in May 2024 for cholesterol control but caused muscle cramps. Switched to Rosuvastatin 10 mg daily in June 2024.
- Noticed palpitations and headaches since starting Sitagliptin 100 mg daily for better glucose control. Reduced to 50 mg due to GI upset and 
added Pantoprazole 20 mg.

**Current Meds:**  
- Losartan 50 mg daily  
- Metformin 500 mg BID  
- Rosuvastatin 10 mg daily  
- Sitagliptin 50 mg daily + Pantoprazole 20 mg daily  
- Carvedilol 12.5 mg BID (increased from 6.25 mg for angina)

---

**Plan:**  
Dr. Johnson advised John to monitor his blood pressure closely and keep a log of any side effects or new symptoms, especially related to the 
recent medication changes. Follow-up scheduled for October 2024 to reassess symptom control, particularly regarding angina frequency and GI 
symptoms.
"""

We use llama.cpp to run Meta-Llama-3.1-70B-Instruct with int8 quantization.

llm = LlamaCppInferenceEngine(repo_id="bullerwins/Meta-Llama-3.1-70B-Instruct-GGUF",
                              gguf_filename="Meta-Llama-3.1-70B-Instruct-Q8_0-00001-of-00002.gguf",
                              n_ctx=16000,
                              verbose=False)

The named entity recognition and entity attribute extraction are performed end-to-end.

# Define extractor
extractor = SentenceFrameExtractor(llm, prompt_template, system_prompt="You are a helpful medical AI assistant.")

# Extract
frames = extractor.extract_frames(note_text, entity_key="EntityText", stream=True)

# Check extractions
for frame in frames:
    print(frame.to_dict())

# Define document
doc = LLMInformationExtractionDocument(doc_id="Medical note", text=note_text)

# Add frames to document
doc.add_frames(frames, valid_mode="span", create_id=True)

Relation extraction

def possible_relation_func(frame_1, frame_2) -> bool:
    # If the two frames are > 500 characters apart, we assume "No Relation"
    if abs(frame_1.start - frame_2.start) > 500:
        return False

    # A "Drug" frame and a "Condition" frame may hold a relation
    if (frame_1.attr["Type"] == "Drug" and frame_2.attr["Type"] == "Condition") or \
        (frame_2.attr["Type"] == "Drug" and frame_1.attr["Type"] == "Condition"):
        return True

    # A "Drug" frame and an "ADE" frame may hold a relation
    if (frame_1.attr["Type"] == "Drug" and frame_2.attr["Type"] == "ADE") or \
        (frame_2.attr["Type"] == "Drug" and frame_1.attr["Type"] == "ADE"):
        return True

    return False

# Define relation extractor
relation_extractor = BinaryRelationExtractor(llm, prompt_template=prompt_template, possible_relation_func=possible_relation_func)

# Extract binary relations (candidate pairs filtered by possible_relation_func)
relations = relation_extractor.extract_relations(doc, stream=True)

# Add to document
doc.add_relations(relations)

To visualize, we render the results to HTML and save to file.

html = doc.viz_render(color_attr_key="Type")

with open("demo_ADE_extraction.html", "w") as f:
    f.write(html)
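
To display the result in a browser, as mentioned in the Overview, one option is Python's built-in webbrowser module (a convenience sketch, not part of the benchmark code):

import webbrowser

# Open the rendered HTML in the default browser
webbrowser.open("demo_ADE_extraction.html")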
