Self-contradictory summary in spacy debug data #8035
I want to train a textcat model, and I got this message when I run the `spacy debug data` command.
There are two things that seem strange to me:
I was converting the Prodigy annotations like this:

```python
doc_bin = DocBin()
for file_name in file_names:
    for eg in srsly.read_jsonl(os.path.join(input_path, file_name)):
        doc = nlp.make_doc(eg["text"])
        label = eg["label"]
        score = eg.get("score")
        if score is None:
            if eg.get("answer") == "accept":
                score = 1.0
            else:
                score = 0.0
        doc.cats = {label: score}
        doc_bin.add(doc)
```

All the data in the JSONL files has the single label `NEGATION`. I don't understand why this is not working. Did I do something wrong in the conversion, or is it an error in spaCy?

spaCy version: 3.0.6

Thanks for your response.
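For reference, here is a self-contained version of the conversion above; the imports, the blank pipeline, the input directory, and the output filename are assumptions added for completeness, not part of the original snippet:

```python
import os

import spacy
import srsly
from spacy.tokens import DocBin

nlp = spacy.blank("en")           # assumed: any blank pipeline works for make_doc
input_path = "prodigy_exports"    # assumed: directory containing the .jsonl files
file_names = os.listdir(input_path)

doc_bin = DocBin()
for file_name in file_names:
    for eg in srsly.read_jsonl(os.path.join(input_path, file_name)):
        doc = nlp.make_doc(eg["text"])
        label = eg["label"]
        score = eg.get("score")
        if score is None:
            # Map Prodigy accept/reject answers to 1.0/0.0
            score = 1.0 if eg.get("answer") == "accept" else 0.0
        doc.cats = {label: score}
        doc_bin.add(doc)

# Write the annotations to disk so they can be passed to `spacy debug data` / `spacy train`
doc_bin.to_disk("train.spacy")
```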
Thanks for the note, we'll have another look at the debug messages.

In this case, the warning message is correct given the format described above. If you have a binary classification task, you can use two labels with `textcat` (one label is 1.0 for each instance and one label is 0.0, so `NEGATION` and `NOT_NEGATION`), or you can use one label with `textcat_multilabel` (just `NEGATION` as 1.0 or 0.0 as you have above).

The `textcat_multilabel` labels can be 0.0 or 1.0 for each label individually, but for `textcat` there should always be exactly one label per document with a score of 1.0 and all other labels should be 0.0. The `textcat` model always predicts scores that sum to 1.0 over all labels, so with only a single label the predicted score would always be 1.0.

If there's not already a warning here, we should consider adding one when you start training with just one label, since the model is not going to be useful. Is this the problem you were running into?
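A minimal sketch of the two options, assuming the same kind of data as the conversion snippet above (the blank pipeline and the `is_negation` flag are placeholders for illustration):

```python
import spacy

nlp = spacy.blank("en")  # assumed blank pipeline; only used to create the Doc
doc = nlp.make_doc("There is no problem here.")
is_negation = True  # placeholder for whatever the annotation says

# Option 1: exclusive categories for `textcat` -- every doc carries both labels,
# exactly one of them is 1.0 and the other is 0.0.
doc.cats = {
    "NEGATION": 1.0 if is_negation else 0.0,
    "NOT_NEGATION": 0.0 if is_negation else 1.0,
}

# Option 2: a single, independent label for `textcat_multilabel` -- each label is
# scored on its own, so one label set to 1.0 or 0.0 per doc is fine.
doc.cats = {"NEGATION": 1.0 if is_negation else 0.0}
```

Either way, the component you train has to match the annotation format: exclusive scores that sum to 1.0 go with `textcat`, independent per-label scores go with `textcat_multilabel`.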