⚠️ This tool is a prototype in active development and may change significantly. Always verify results!
LLM Extractinator enables efficient extraction of structured data from unstructured text using large language models (LLMs). It supports configurable task definitions, CLI or Python usage, and flexible data input/output formats.
📘 Full documentation: https://DIAGNijmegen.github.io/llm_extractinator/
LLM Extractinator runs models locally through Ollama, so install Ollama first.

On Linux:

```bash
curl -fsSL https://ollama.com/install.sh | sh
```

On macOS or Windows, download the installer from https://ollama.com/download.

Next, install the package itself. You have two options:
Install from PyPI:

```bash
pip install llm_extractinator
```

Or install from source for development (editable):

```bash
git clone https://github.com/DIAGNijmegen/llm_extractinator.git
cd llm_extractinator
pip install -e .
```
Run a task from the command line:

```bash
extractinate --task_id 001 --model_name "phi4"
```
Or call it from Python:

```python
from llm_extractinator import extractinate

extractinate(task_id=1, model_name="phi4")
```
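The same call can be scripted for batch runs. A minimal sketch, assuming tasks 1-3 are defined in your tasks/ directory (the task IDs and model name are placeholders):

```python
from llm_extractinator import extractinate

# Run several extraction tasks with the same local model.
for task_id in (1, 2, 3):
    extractinate(task_id=task_id, model_name="phi4")
```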
Each task is defined by a JSON file stored in the tasks/ directory.

Filename format: `TaskXXX_name.json` (for example, `Task001_products.json`)

Example contents:
```json
{
  "Description": "Extract product data from text.",
  "Data_Path": "products.csv",
  "Input_Field": "text",
  "Parser_Format": "product_parser.py"
}
```
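Here, `Data_Path` points to the dataset to process, and `Input_Field` is assumed to name the column holding the raw text to extract from. A minimal sketch of what such a `products.csv` could look like (the example rows are invented):

```python
import csv

# Hypothetical products.csv matching the task above. The column named by
# "Input_Field" ("text" here) holds the raw text that the LLM will process.
rows = [
    {"text": "The UltraWidget 3000 costs $19.99 and comes in blue."},
    {"text": "Order the EcoKettle for 34.50 EUR; category: kitchen."},
]

with open("products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["text"])
    writer.writeheader()
    writer.writerows(rows)
```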
`Parser_Format` refers to a `.py` file in `tasks/parsers/` that defines a Pydantic `OutputParser` class used to structure the LLM output.
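For the product task above, such a file might look like the following sketch (the class name `OutputParser` and the filename come from the example above; the fields themselves are illustrative assumptions):

```python
# tasks/parsers/product_parser.py (illustrative sketch)
from typing import Optional

from pydantic import BaseModel, Field


class OutputParser(BaseModel):
    """Defines the structure of the data extracted from each input text."""

    name: str = Field(description="Product name mentioned in the text")
    price: Optional[float] = Field(default=None, description="Price, if stated")
    category: Optional[str] = Field(default=None, description="Product category, if stated")
```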
Alternatively, you can design the output schema visually by running:

```bash
build-parser
```

This launches a web UI to create a Pydantic `OutputParser` model, which defines the structure of the extracted data. Additional models can be added and nested for complex formats.

Save the resulting `.py` file in `tasks/parsers/` and reference it in your task JSON under the `Parser_Format` key.
👉 See parser docs for full usage.
If you use this tool, please cite it via DOI: 10.5281/zenodo.15089764 (https://doi.org/10.5281/zenodo.15089764)
We welcome contributions! See the full contributing guide in the docs.