---
nav_order: 6
---

# 🧪 Testing the Natural Language Processing Layer (NLP)
## 📋 Introduction
This report documents the testing and evaluation of the Natural Language Processing (NLP) layer, whose primary role is the automatic extraction of medical symptoms from unstructured text in English.
The core challenge of this layer is accurately identifying symptoms in complex sentences, distinguishing confirmed from negated conditions, and normalizing extracted terms to canonical English equivalents compatible with medical ontologies (SYMP/DOID).
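To illustrate the task, a single test case pairs a clinical sentence with the gold-standard symptom labels the model is expected to return. This is a minimal sketch: the field names below are illustrative assumptions, not the actual schema of symptom-extraction-test-data-en.json.

```python
# Hypothetical test-case shape -- the field names are illustrative
# assumptions, not the real schema of symptom-extraction-test-data-en.json.
test_case = {
    "text": "Patient reports severe headache and nausea, but denies fever.",
    # "fever" is negated in the text, so it must NOT be extracted
    "expected_symptoms": ["headache", "nausea"],
}
```

Note that the negated condition ("denies fever") is deliberately absent from the expected labels; treating it as a symptom would count as a false positive.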
The goal of testing is to identify the optimal combination of LLM architecture and system prompt configuration in order to achieve high precision and reliability before data is passed to the next layer — the embedding (vectorization) layer.
Testing was conducted across 7 phases, varying model size (from 3B to 120B parameters) and instruction complexity, with a focus on reaching an 80% success threshold across key metrics. The seven models evaluated are described in the individual test sections below.
Standard NLP metrics were used to assess extraction quality:

- **Precision** — the model's ability to extract only relevant and correct symptoms without introducing noise
- **Recall** — the model's ability to identify all symptoms mentioned in the text without omissions
- **F1 Score** — the harmonic mean of precision and recall, representing the overall accuracy of the model
## 🎯 Target Metrics

| Metric | Success Threshold | Formula |
|--------|-------------------|---------|
| Precision | ✅ ≥ 80% | TP / (TP + FP) — or 0 if no symptoms extracted |
| Recall | ✅ ≥ 80% | TP / (TP + FN) — or 0 if no symptoms expected |
| F1 Score | ✅ ≥ 80% | 2 × (Precision × Recall) / (Precision + Recall) — or 0 if both are 0 |
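The three formulas above, including the zero-handling edge cases, can be sketched as a per-case scoring function. This is a minimal sketch, not the actual evaluation script used in the tests:

```python
def score(extracted, expected):
    """Per-case precision/recall/F1 over symptom label sets,
    applying the zero-handling rules from the table above."""
    extracted, expected = set(extracted), set(expected)
    tp = len(extracted & expected)   # correctly extracted symptoms
    fp = len(extracted - expected)   # extracted but not expected
    fn = len(expected - extracted)   # expected but missed
    precision = tp / (tp + fp) if extracted else 0.0  # 0 if nothing extracted
    recall = tp / (tp + fn) if expected else 0.0      # 0 if nothing expected
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)             # 0 if both are 0
    return precision, recall, f1
```

The per-case scores are then averaged over the 100 test cases to produce the aggregate figures reported in each test section.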
## 🧪 Test 1

| Field | Value |
|-------|-------|
| Test Set | symptom-extraction-test-data-en.json |
| Model | llama3.2:3b (local) |
| System Prompt | symptom-extraction-prompt-1.txt |
| Results | symptom-extraction-result-1.txt |
### 🔍 Model Overview — llama3.2:3b

| Property | Value |
|----------|-------|
| Developer | Meta |
| Released | September 2024 |
| Architecture | Dense · Decoder-only Transformer |
| Parameters | 3.21B |
| Attention | GQA + RoPE |
| Context Window | 128K tokens |
| Quantization | Q4_K_M (Ollama) |
| Intended Use | On-device / edge deployment |
| License | Llama 3.2 Community License |

Lightweight text-only model optimized for mobile and edge environments. Supports 8 languages. Smallest model in the Llama 3.2 family.
### 📊 Results

| Metric | Score |
|--------|-------|
| Precision | 0.747 |
| Recall | 0.730 |
| F1 Score | 0.731 |
### 💬 Commentary

The llama3.2:3b model offers surprisingly robust baseline extraction for its size, with a balanced precision-recall profile. However, its utility in strict ontology-based systems is limited by its tendency to generate pre-coordinated, hyper-specific clinical phrases (e.g., chest pain on inhalation) and by occasional but significant failures to resolve logical negations (cases 10 and 41) in complex clinical narratives.
## 🧪 Test 2

| Field | Value |
|-------|-------|
| Test Set | symptom-extraction-test-data-en.json |
| Model | llama3:8b (local) |
| System Prompt | symptom-extraction-prompt-1.txt |
| Results | symptom-extraction-result-2.txt |
### 🔍 Model Overview — llama3:8b

| Property | Value |
|----------|-------|
| Developer | Meta |
| Released | April 2024 |
| Architecture | Dense · Decoder-only Transformer |
| Parameters | 8.03B |
| Attention | GQA + RoPE |
| Layers | 32 |
| Hidden Dim | 4096 |
| Context Window | 8K tokens |
| Quantization | Q4_0 (Ollama) |
| License | Meta Llama 3 Community License |

First generation of the Llama 3 family. Trained on 15T tokens. Strong reasoning and instruction following for its size class.
### 📊 Results

| Metric | Score |
|--------|-------|
| Precision | 0.809 |
| Recall | 0.806 |
| F1 Score | 0.800 |
### 💬 Commentary

The llama3:8b model achieved a solid 0.800 F1 score across the 100 test cases. While it excels at identifying standard symptoms like fever and headache, it struggles with semantic mapping: it often extracts a clinically accurate term (e.g., paresthesia) that does not match the expected gold-standard label (e.g., hypoesthesia), or fails to split compound descriptions into separate entities. Despite these mapping nuances, the zero negation errors indicate high reliability in distinguishing present from absent symptoms.
## 🧪 Test 3

| Field | Value |
|-------|-------|
| Test Set | symptom-extraction-test-data-en.json |
| Model | mistral-nemo:12b (local) |
| System Prompt | symptom-extraction-prompt-1.txt |
| Results | symptom-extraction-result-3.txt |
### 🔍 Model Overview — mistral-nemo:12b

| Property | Value |
|----------|-------|
| Developer | Mistral AI + NVIDIA (joint) |
| Released | July 2024 |
| Architecture | Dense · Decoder-only Transformer |
| Parameters | 12.2B |
| Layers | 40 |
| Attention | GQA (32 heads / 8 KV heads) |
| Context Window | 128K tokens |
| Tokenizer | Tekken (Tiktoken-based, 131K vocab) |
| Quantization | FP8-aware training |
| Training Infra | NVIDIA Megatron-LM, 3072× H100 GPUs |
| License | Apache 2.0 |

Co-developed with NVIDIA on DGX Cloud. Drop-in replacement for Mistral 7B with significantly expanded context and multilingual capability across 11+ languages.
### 📊 Results

| Metric | Score |
|--------|-------|
| Precision | 0.784 |
| Recall | 0.810 |
| F1 Score | 0.790 |
### 💬 Commentary

The mistral-nemo:12b model achieved a nearly identical 0.790 F1 score, showing slightly better recall than llama3:8b but frequently losing points to overly technical synonym mapping (e.g., presyncope for lightheadedness) and a tendency to merge separate symptoms into single compound entities.
## 🧪 Test 4

| Field | Value |
|-------|-------|
| Test Set | symptom-extraction-test-data-en.json |
| Model | qwen2.5:14b (local) |
| System Prompt | symptom-extraction-prompt-1.txt |
| Results | symptom-extraction-result-4.txt |
### 🔍 Model Overview — qwen2.5:14b

| Property | Value |
|----------|-------|
| Developer | Alibaba Cloud (Qwen Team) |
| Released | September 2024 |
| Architecture | Dense · Decoder-only Transformer |
| Parameters | 14.7B total / 13.1B non-embedding |
| Layers | 48 |
| Attention | GQA (40 Q heads / 8 KV heads) + RoPE |
| Activation | SwiGLU |
| Normalization | RMSNorm |
| Context Window | 128K tokens (generates up to 8K) |
| Pretraining Data | ~18 trillion tokens |
| Multilingual | 29+ languages |
| License | Apache 2.0 |

Dense open-weight model from the Qwen2.5 family. Pretrained on the largest dataset among the local models tested (~18T tokens). Strong structured-output and instruction-following capabilities.
### 📊 Results

| Metric | Score |
|--------|-------|
| Precision | 0.835 |
| Recall | 0.825 |
| F1 Score | 0.825 |
### 💬 Commentary

The qwen2.5:14b model is the top performer with a 0.825 F1 score. It shows a superior ability to map descriptive symptoms to formal clinical terminology while maintaining perfect accuracy in negation handling, though it still occasionally loses points by grouping separate symptoms into single compound terms.
## 🧪 Test 5

| Field | Value |
|-------|-------|
| Test Set | symptom-extraction-test-data-en.json |
| Model | phi4:14b (local) |
| System Prompt | symptom-extraction-prompt-1.txt |
| Results | symptom-extraction-result-5.txt |
### 🔍 Model Overview — phi4:14b

| Property | Value |
|----------|-------|
| Developer | Microsoft Research |
| Released | December 2024 |
| Architecture | Dense · Decoder-only Transformer |
| Parameters | 14B |
| Layers | 40 |
| Attention | GQA (24 heads / 8 KV heads) + RoPE |
| Context Window | 16K tokens (extendable to 64K) |
| Tokenizer | tiktoken (vocab size 100,352) |
| Pretraining Data | ~9.8T tokens (incl. ~400B synthetic) |
| Knowledge Cutoff | June 2024 |
| License | MIT |

STEM-focused SLM (Small Language Model) from Microsoft. Distinguished by heavy use of synthetic training data for mathematical and scientific reasoning. Punches above its weight class on GPQA and MATH benchmarks.
### 📊 Results

| Metric | Score |
|--------|-------|
| Precision | 0.771 |
| Recall | 0.783 |
| F1 Score | 0.772 |
### 💬 Commentary

The phi4:14b model demonstrates high clinical intelligence with an F1 score of 0.772, though it frequently loses points by using advanced medical terminology (e.g., presyncope instead of faint) that causes mismatches with the dataset's expected labels. While it excels at handling negations and complex symptoms, its primary challenge is over-specification: it often provides more detailed anatomical descriptions than the ground truth requires.
## 🧪 Test 6

| Field | Value |
|-------|-------|
| Test Set | symptom-extraction-test-data-en.json |
| Model | meta-llama/llama-4-scout-17b-16e-instruct (cloud) |
| System Prompt | symptom-extraction-prompt-1.txt |
| Results | symptom-extraction-result-6.txt |
### 🔍 Model Overview — meta-llama/llama-4-scout-17b-16e-instruct

| Property | Value |
|----------|-------|
| Developer | Meta |
| Released | April 2025 |
| Architecture | Mixture-of-Experts (MoE) · Auto-regressive |
| Active Parameters | 17B (1 routed expert + 1 shared expert per token) |
| Total Parameters | 109B (distributed across experts) |
| MoE Experts | 16 specialized + 1 shared (always active) |
| Transformer Layers | 40 total (20 MoE blocks) |
| Activation | SwiGLU |
| Multimodality | ✅ Early fusion (up to 5 images per prompt) |
| Context Window | 128K tokens (cloud) |
| Knowledge Cutoff | August 2024 |
| Multilingual | 12 languages |
| License | Llama 4 Community License |

First MoE model in the Llama family. Each token activates only one routed expert plus the shared expert, giving the inference cost of a 17B dense model with the knowledge capacity of 109B parameters. Natively multimodal via an early-fusion architecture.
### 📊 Results

| Metric | Score |
|--------|-------|
| Precision | 0.746 |
| Recall | 0.799 |
| F1 Score | 0.763 |
### 💬 Commentary

The llama-4-scout-17b model shows a solid baseline for symptom extraction with an average F1 score of 0.763, but it frequently incurs false positives because of its clinically precise vocabulary. Much like phi4:14b, it tends to extract more detailed or technical terms (e.g., presyncope or hyperhidrosis) where the ground truth expects simpler descriptions (e.g., lightheadedness or diaphoresis). While its negation detection is nearly perfect, the overall score is capped primarily by this semantic gap between the model's medical vocabulary and the dataset's labels.
## 🧪 Test 7

| Field | Value |
|-------|-------|
| Test Set | symptom-extraction-test-data-en.json |
| Model | openai/gpt-oss-120b (cloud) |
| System Prompt | symptom-extraction-prompt-1.txt |
| Results | symptom-extraction-result-7.txt |
### 🔍 Model Overview — openai/gpt-oss-120b

| Property | Value |
|----------|-------|
| Developer | OpenAI |
| Released | August 2025 |
| Architecture | Mixture-of-Experts (MoE) · Decoder-only Transformer |
| Parameters | ~117B total / ~5.1B active per token |
| Deployment | Cloud (API) |
| License | Apache 2.0 |

Large-scale open-weight model. Highest total parameter count among all tested models. Exhibits the most pronounced High-Intelligence Bias — consistently extracts more clinically nuanced and anatomically precise terms than the ground truth labels require.
### 📊 Results

| Metric | Score |
|--------|-------|
| Precision | 0.697 |
| Recall | 0.701 |
| F1 Score | 0.691 |
### 💬 Commentary

The gpt-oss-120b model achieves an average F1 score of 0.691, making it the most descriptive but least label-compliant model among those tested. Its primary failure mode is over-descriptive labeling: it frequently includes anatomical or severity details (e.g., severe abdominal pain or arm rash) that result in total mismatches with the simpler ground truth labels (e.g., abdominal pain or rash). While its extraction logic is clinically accurate and highly sensitive to nuances like productive cough or high fever, its lack of constraint to the target label set leads to significantly lower precision and recall than the Llama or Phi models.
## 📊 Overall Results

| Model | Size | Type | Precision | Recall | F1 |
|-------|------|------|-----------|--------|----|
| llama3.2:3b | 3B | Local | 0.747 | 0.730 | 0.731 |
| llama3:8b | 8B | Local | 0.809 | 0.806 | 0.800 |
| mistral-nemo:12b | 12B | Local | 0.784 | 0.810 | 0.790 |
| qwen2.5:14b | 14B | Local | 0.835 | 0.825 | 0.825 |
| phi4:14b | 14B | Local | 0.771 | 0.783 | 0.772 |
| meta-llama/llama-4-scout-17b-16e-instruct | 17B | Cloud | 0.746 | 0.799 | 0.763 |
| openai/gpt-oss-120b | 120B | Cloud | 0.697 | 0.701 | 0.691 |
✅ **qwen2.5:14b** achieved the highest F1 score (0.825) across all tested models, surpassing the 80% target threshold and outperforming models up to 8x larger.
## Conclusion
While the NLP layer achieved satisfactory performance (F1 ≥ 0.82 for the best model), several challenges remain that directly impact the embedding layer.
### ⚠️ Semantic Fragmentation and Normalization
The most critical issue identified is the lack of strict canonical normalization, which leads to semantic fragmentation. Different lexical representations of the same clinical concept (e.g., lightheadedness, dizziness, presyncope) result in distinct vector embeddings. This divergence reduces similarity accuracy and negatively affects downstream disease matching in the graph database.
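A minimal sketch of the missing normalization step follows. The synonym table here is a hand-written illustration, not an actual SYMP/DOID mapping: collapsing lexical variants onto one canonical label before vectorization ensures that all surface forms of a concept yield the same embedding.

```python
# Illustrative canonical map -- a real system would derive this from the
# SYMP/DOID ontologies rather than maintain a hand-written dictionary.
CANONICAL = {
    "lightheadedness": "dizziness",
    "presyncope": "dizziness",
    "dizziness": "dizziness",
    "diaphoresis": "sweating",
    "hyperhidrosis": "sweating",
}

def normalize(symptom: str) -> str:
    """Map an extracted term to its canonical form; pass unknowns through."""
    term = symptom.lower().strip()
    return CANONICAL.get(term, term)
```

With this step in place, lightheadedness, dizziness, and presyncope all embed as a single concept, which would also have neutralized many of the synonym-mapping penalties observed in Tests 3 through 7.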
### 🧠 Interpretation of Model Performance

It is important to emphasize that a lower F1 score does not necessarily indicate a "weaker" model. Analysis of the extraction logs reveals a distinct "High-Intelligence Bias" in larger architectures:

- **Descriptive Precision:** Models like gpt-oss-120b often extract more technical or clinically accurate terms (e.g., productive cough instead of just cough).
- **Metric Penalty:** These models are frequently penalized by the evaluation script for not strictly adhering to the simplified, flat labels of the ground truth, despite their output being medically valid.
- **Instruction Following:** Mid-sized models like qwen2.5:14b demonstrate a superior balance between clinical extraction and adherence to formatting constraints, leading to higher benchmark scores.