The limits of attribution-based explanations #82
cyber-raskolnikov started this conversation in General · 1 comment · 5 replies
Hi @cdpierse and anyone willing to join this discussion,
My name is Dani and I'm currently researching explainability techniques applied to transformer-based models.
For the last few months, I have been applying several XAI techniques to a binary classifier built on a fine-tuned BERT-like transformer, with mixed results. Many of the explanations were coherent and resonated with expert knowledge in the corresponding field, but many others made little sense and did not help at all in understanding the model's decision-making process from a human perspective.
I believe this is a common occurrence in XAI research that many of us have suffered through. I decided to put that model on hold and experiment with a simpler, thoroughly tested classifier: "distilbert-base-uncased-finetuned-sst-2-english" from HuggingFace, a sentiment classifier fine-tuned on the SST-2 dataset.
As stated by HuggingFace, this classifier is known to show bias on gender-related and geographic topics, and I chose the (Layer) Integrated Gradients algorithm implemented in this library (thanks @cdpierse 😃) to investigate those biased cases further.
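For context, this is roughly the setup I am using to produce the attributions shown below. It is only a minimal sketch assuming the current transformers-interpret API (whose SequenceClassificationExplainer wraps Layer Integrated Gradients over the embedding layer, if I understand the implementation correctly); the example sentence is a placeholder, since the actual inputs appear in the screenshots:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers_interpret import SequenceClassificationExplainer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# The explainer attributes the prediction to individual tokens.
explainer = SequenceClassificationExplainer(model, tokenizer)

# Placeholder sentence; attributions come back as (token, score) pairs
# computed with respect to the predicted class.
word_attributions = explainer("A placeholder movie review goes here.")
print(explainer.predicted_class_name)

# Renders the green/red HTML view like the screenshots below.
explainer.visualize("attributions.html")
```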
The following image shows an explanation that is satisfying from the human perspective of XAI; it simply makes sense:
(In all the images below, green marks the parts of the sentence pushing the decision towards a positive sentiment, and red the parts pushing it towards a negative one.)
But if we take a look at the following biased comparison:
One would expect the explanation to blame the country where the movie was filmed for the label change, as it clearly seems to be the source of the bias. Yet, according to the algorithm's output, 'Afghanistan' is actually pushing the decision towards a positive sentiment, while the rest of the sentence is held responsible for the negative rating.
Integrated Gradients seems to point to very different treatments of very similar sentences, which feels contradictory (again, from the perspective of the human need that XAI is supposed to serve).
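For anyone who wants to poke at this themselves, the comparison boils down to running the same explainer on near-identical sentences that differ only in the country. The pair below is a placeholder for illustration (the real sentences are in the images), and I am assuming the same transformers-interpret API as in the snippet above:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers_interpret import SequenceClassificationExplainer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
explainer = SequenceClassificationExplainer(
    AutoModelForSequenceClassification.from_pretrained(model_name),
    AutoTokenizer.from_pretrained(model_name),
)

# Placeholder pair: only the country changes between the two sentences.
pair = [
    "This movie was filmed in the USA.",
    "This movie was filmed in Afghanistan.",
]

for text in pair:
    word_attributions = explainer(text)  # list of (token, attribution) tuples
    print(f"{text!r} -> {explainer.predicted_class_name}")
    for token, score in word_attributions:
        # Positive scores push towards the class the attribution was computed
        # for (the predicted class by default); negative scores push away.
        print(f"  {token:>15} {score:+.3f}")
```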
I have found similar results for several other comparisons such as:
Reflecting on what these explanations mean, I have come to think that the results are not necessarily wrong, especially given the solid axiomatic foundations of IG, but rather that single-word attributions are too simple to capture the decision-making process of the underlying transformer.
It might be true that the rest of the sentence is to blame for the 'negative' sentiment the model assigns, but those words are necessarily being affected by the presence of the biased term, since the very same words were attributed differently when a different biased term appeared.
Such interaction effects are not captured by attribution-based explanations, and they may not be capturable at all, since they run up against the limits of these methods' expressiveness.
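To make that tension concrete, here is how I read it in formulas, using the standard IG definition with a baseline input $x'$ and $F$ the model output for the class of interest (for Layer IG the $x_i$ are token embeddings, with per-dimension attributions summed per token, if I am not mistaken):

$$\mathrm{IG}_i(x) = (x_i - x'_i)\int_0^1 \frac{\partial F\big(x' + \alpha\,(x - x')\big)}{\partial x_i}\,d\alpha, \qquad \sum_i \mathrm{IG}_i(x) = F(x) - F(x').$$

The completeness axiom on the right guarantees that the per-token scores always add up to the change in the model's output, so the explanation is 'correct' in an accounting sense; but any interaction between the biased term and the rest of the sentence has to be squeezed into those single per-token numbers, because the method has no explicit slot for it.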
The closest thing I have found to a discussion of attribution-based explanation methods failing to capture the decision-making process is the follow-up paper to IG, where the authors use Hessians to explain predictions through pairwise interactions between features (which comes with a big trade-off between expressiveness and cognitive overload).
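If I am reading that follow-up (Integrated Hessians) correctly, the pairwise interaction between features $i$ and $j$ is obtained by applying IG to its own attribution function,

$$\Gamma_{i,j}(x) = \mathrm{IG}_j\big(\mathrm{IG}_i\big)(x),$$

which is exactly why it gains expressiveness (interactions get explicit terms of their own) at the cost of an explanation that grows quadratically with the number of tokens.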
What are your thoughts on this? Do you share the view that 'incomprehensible' explanations can be caused by a lack of expressiveness in the explanation method, or do you attribute them to some other flaw in the model or the technique?
-
@jessevig whenever you have a moment, I'd love to hear whether you have come across something similar, given your NLP/XAI background.