data format needed for training #460

jwijffels · 2025-11-14T13:22:52Z


Hello,

I'm trying to build a NER model based on the setup explained here: https://aphp.github.io/edsnlp/latest/tutorials/training-ner/
I have a pandas DataFrame with a note_id and note_text per note, plus a set of annotations for each note that I created with INCEpTION. I combined the texts and the entities as follows:

import edsnlp
import pandas as pd

df = pd.DataFrame(dict(note_id = docs["doc_id"], note_text = docs["text"], text = docs["text"], note_datetime = docs["state"]))
# Group the annotations per note into a list of dicts; DataFrame.join aligns this on df's index
df = df.join(
    ents[["note_id", "start_char", "end_char", "ent_text", "ent_label", "label", "note_nlp_source_value", "text"]].set_index('note_id').groupby(level=0).apply(pd.DataFrame.to_dict, orient='records').rename("entities")
)
d = edsnlp.data.from_pandas(df)

>>> z = df.iloc[840]
>>> z
note_id                                                        315
note_text        b'Some text blablabla more blablabla'.
text                 b'Some text blablabla more blablabla'.
note_datetime                               ANNOTATION-IN-PROGRESS
entities         [{'start_char': 188, 'end_char': 203, 'ent_tex...

The entities value for that row looks like this:

[{'start_char': 188, 'end_char': 203, 'ent_text': '14 januari 2011', 'ent_label': '03-Datum', 'label': '03-Datum', 'note_nlp_source_value': '03-Datum', 'text': '14 januari 2011'},
 {'start_char': 199, 'end_char': 203, 'ent_text': '2011', 'ent_label': '03-Datum', 'label': '03-Datum', 'note_nlp_source_value': '03-Datum', 'text': '2011'},
 {'start_char': 211, 'end_char': 231, 'ent_text': 'MAMA MIA Jan Julien', 'ent_label': '01-Naam', 'label': '01-Naam', 'note_nlp_source_value': '01-Naam', 'text': 'MAMA MIA Jan Julien'}]
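
In the sample above, the span 199-203 ('2011') lies inside the span 188-203 ('14 januari 2011'). A minimal sketch for surfacing such cases before training is shown below. The helper name find_overlaps is hypothetical, the entity dicts are trimmed to the keys the check needs, and whether overlapping spans matter for the NER component is an assumption that the post does not confirm.

```python
def find_overlaps(entities):
    """Return pairs of entity dicts whose character spans overlap."""
    # Sort by start offset; for equal starts, put the longer span first
    spans = sorted(entities, key=lambda e: (e["start_char"], -e["end_char"]))
    overlaps = []
    for i, a in enumerate(spans):
        for b in spans[i + 1:]:
            if b["start_char"] >= a["end_char"]:
                break  # sorted by start, so no later span can overlap a
            overlaps.append((a, b))
    return overlaps


# Entities copied from the sample row, reduced to the relevant keys
entities = [
    {"start_char": 188, "end_char": 203, "label": "03-Datum", "text": "14 januari 2011"},
    {"start_char": 199, "end_char": 203, "label": "03-Datum", "text": "2011"},
    {"start_char": 211, "end_char": 231, "label": "01-Naam", "text": "MAMA MIA Jan Julien"},
]

overlapping = find_overlaps(entities)
for a, b in overlapping:
    print(f"{a['text']!r} ({a['start_char']}-{a['end_char']}) overlaps "
          f"{b['text']!r} ({b['start_char']}-{b['end_char']})")
```

Applied per note, a check like this would show how often one annotation sits inside another in the exported data.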

Next, I plug this data into the code from the "From a script or a notebook" section of your training tutorial, where I replace the training and validation data with:

train_data = (d.map(eds.split(nlp=None, max_length=2000, regex="\n\n+")))
val_data = (d.map(eds.split(nlp=None, max_length=2000, regex="\n\n+")))

When I launch the train command, it fails because it cannot find a text attribute (full traceback below).
It's unclear from the docs what the input data should look like. Could you elaborate on that?

>>> train(
...     nlp=nlp,
...     max_steps=max_steps,
...     validation_interval=max_steps // 10,
...     train_data=TrainingData(
...         data=train_data,
...         batch_size="4096 tokens",  # 32 * 128 tokens
...         pipe_names=["ner"],
...         shuffle="dataset",
...     ),
...     val_data=val_data,
...     scorer={"ner": ner_metric},
...     optimizer=optimizer,
...     grad_max_norm=1.0,
...     output_dir="artifacts",
...     logger=loggers,
...     # Do preprocessing in parallel on 1 worker
...     num_workers=1,
...     # Enable on Mac OS X or if you don't want to use available GPUs
...     # cpu=True,
... )
Trainable components: ner
Training phases:
 - 1: ner
  File "<stdin>", line 1, in <module>
  File "C:\Users\jwijf\AppData\Local\Programs\Python\Python312\Lib\site-packages\confit\registry.py", line 393, in wrapper_function
    raise e.with_traceback(remove_lib_from_traceback(e.__traceback__))
  File "C:\Users\jwijf\AppData\Local\Programs\Python\Python312\Lib\site-packages\pydantic\deprecated\decorator.py", line 227, in execute
    return self.raw_function(**d, **var_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\jwijf\AppData\Local\Programs\Python\Python312\Lib\site-packages\edsnlp\training\trainer.py", line 651, in train
    val_docs = list(chain.from_iterable(val_data))
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\jwijf\AppData\Local\Programs\Python\Python312\Lib\site-packages\edsnlp\processing\simple.py", line 104, in process
    for item in items:
                ^^^^^
  File "C:\Users\jwijf\AppData\Local\Programs\Python\Python312\Lib\site-packages\edsnlp\core\stream.py", line 168, in __call__
    yield from res
  File "C:\Users\jwijf\AppData\Local\Programs\Python\Python312\Lib\site-packages\edsnlp\pipes\misc\split\split.py", line 163, in __call__
    for sub_doc in self.split_doc(doc):
                   ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\jwijf\AppData\Local\Programs\Python\Python312\Lib\site-packages\edsnlp\pipes\misc\split\split.py", line 200, in split_doc
    for m in self.regex.finditer(doc.text)
                                 ^^^^^^^^
AttributeError: 'dict' object has no attribute 'text'
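
The last frame of the traceback shows split_doc reading doc.text, and the error says the item it received is a dict. This suggests the stream reaching eds.split carries plain row dictionaries rather than document objects. A minimal sketch of a check for that is shown below; describe_items is a hypothetical helper, and the two sample lists are stand-ins for illustration, not EDS-NLP objects. The EDS-NLP data documentation also describes a converter argument for edsnlp.data.from_pandas (for example converter="omop"); whether that converter accepts the columns used here is an assumption to verify against the docs.

```python
from itertools import islice
from types import SimpleNamespace


def describe_items(stream, n=3):
    """Report the type of the first n items and whether each exposes `.text`."""
    return [(type(item).__name__, hasattr(item, "text")) for item in islice(stream, n)]


# Stand-in for a raw DataFrame row passed through without conversion
raw_rows = [{"note_id": 315, "note_text": "Some text blablabla more blablabla"}]
# Stand-in for an object that exposes a `.text` attribute, as split_doc expects
doc_like = [SimpleNamespace(text="Some text blablabla more blablabla")]

print(describe_items(raw_rows))
print(describe_items(doc_like))
```

Assuming the stream d can be iterated directly, running describe_items(d) before the eds.split step would show whether its items are dicts or objects with a text attribute.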
