The limits of attribution-based explanations #82
cyber-raskolnikov started this conversation in General · 1 comment · 5 replies
Hi @cdpierse and anyone willing to join this discussion,
My name is Dani and I'm currently researching explainability techniques applied to transformer-based models.
For the last few months, I have been applying several XAI techniques to a binary classifier built on a fine-tuned BERT-like transformer, with mixed results. Many of the explanations were coherent and resonated with expert knowledge in the corresponding field, but many others made little sense and did not help at all in understanding the model's decision-making process from a human perspective.
I believe this is a common occurrence in XAI research that many of us have suffered through. I decided to put that model on hold and experiment with a simpler, thoroughly tested classifier: "distilbert-base-uncased-finetuned-sst-2-english" from HuggingFace, a sentiment classifier fine-tuned on the SST-2 dataset.
As stated by HuggingFace, this classifier is known to show bias on gender-related and geographic topics, and I chose the (Layer) Integrated Gradients algorithm implemented in this library (thanks @cdpierse 😃) to investigate those biased cases further.
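For context, this is roughly the setup I am using to produce the attributions shown below. It is only a minimal sketch assuming the current transformers-interpret API (whose SequenceClassificationExplainer wraps Layer Integrated Gradients over the embedding layer, if I understand the implementation correctly); the example sentence is a placeholder, since the actual inputs appear in the screenshots:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers_interpret import SequenceClassificationExplainer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# The explainer attributes the prediction to individual tokens.
explainer = SequenceClassificationExplainer(model, tokenizer)

# Placeholder sentence; attributions come back as (token, score) pairs
# computed with respect to the predicted class.
word_attributions = explainer("A placeholder movie review goes here.")
print(explainer.predicted_class_name)

# Renders the green/red HTML view like the screenshots below.
explainer.visualize("attributions.html")
```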
The following image shows an explanation that is satisfying from the human perspective of XAI; it simply makes sense:
(In all the images below, green marks the parts of the sentence pushing the decision towards a positive sentiment, and red the parts pushing it towards a negative one.)
But if we take a look at the following biased comparison:
One would expect the explanation to blame the country where the movie was filmed for the label change, as it clearly seems to be the source of the bias. Yet, according to the algorithm's output, 'Afghanistan' is actually pushing the decision towards a positive sentiment, while the rest of the sentence is held responsible for the negative rating.
Integrated Gradients seems to point to very different treatments of very similar sentences, which feels contradictory (again, from the perspective of the human need that XAI is supposed to serve).
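For anyone who wants to poke at this themselves, the comparison boils down to running the same explainer on near-identical sentences that differ only in the country. The pair below is a placeholder for illustration (the real sentences are in the images), and I am assuming the same transformers-interpret API as in the snippet above:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers_interpret import SequenceClassificationExplainer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
explainer = SequenceClassificationExplainer(
    AutoModelForSequenceClassification.from_pretrained(model_name),
    AutoTokenizer.from_pretrained(model_name),
)

# Placeholder pair: only the country changes between the two sentences.
pair = [
    "This movie was filmed in the USA.",
    "This movie was filmed in Afghanistan.",
]

for text in pair:
    word_attributions = explainer(text)  # list of (token, attribution) tuples
    print(f"{text!r} -> {explainer.predicted_class_name}")
    for token, score in word_attributions:
        # Positive scores push towards the class the attribution was computed
        # for (the predicted class by default); negative scores push away.
        print(f"  {token:>15} {score:+.3f}")
```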
I have found similar results for several other comparisons such as:
Reflecting on what these explanations mean, I have come to think that the results are not necessarily wrong, especially given the solid axiomatic foundations of IG, but rather that single-word attributions are too simple to capture the decision-making process of the underlying transformer.
It might be true that the rest of the sentence is to blame for the 'negative' sentiment the model assigns, but those words are necessarily being affected by the presence of the biased term, since the very same words were attributed differently when a different biased term appeared.
Such interaction effects are not captured by attribution-based explanations, and they may not be capturable at all, since they run up against the limits of these methods' expressiveness.
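To make that tension concrete, here is how I read it in formulas, using the standard IG definition with a baseline input $x'$ and $F$ the model output for the class of interest (for Layer IG the $x_i$ are token embeddings, with per-dimension attributions summed per token, if I am not mistaken):

$$\mathrm{IG}_i(x) = (x_i - x'_i)\int_0^1 \frac{\partial F\big(x' + \alpha\,(x - x')\big)}{\partial x_i}\,d\alpha, \qquad \sum_i \mathrm{IG}_i(x) = F(x) - F(x').$$

The completeness axiom on the right guarantees that the per-token scores always add up to the change in the model's output, so the explanation is 'correct' in an accounting sense; but any interaction between the biased term and the rest of the sentence has to be squeezed into those single per-token numbers, because the method has no explicit slot for it.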
The closest thing I have found to a discussion of attribution-based explanation methods failing to capture the decision-making process is the follow-up paper to IG, where the authors use Hessians to explain predictions through pairwise interactions between features (which comes with a big trade-off between expressiveness and cognitive overload).
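If I am reading that follow-up (Integrated Hessians) correctly, the pairwise interaction between features $i$ and $j$ is obtained by applying IG to its own attribution function,

$$\Gamma_{i,j}(x) = \mathrm{IG}_j\big(\mathrm{IG}_i\big)(x),$$

which is exactly why it gains expressiveness (interactions get explicit terms of their own) at the cost of an explanation that grows quadratically with the number of tokens.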
What are your thoughts on this? Do you share the view that 'incomprehensible' explanations can be caused by a lack of expressiveness in the explanation method, or do you attribute them to some other flaw in the model or the technique?
-
@jessevig whenever you have a moment, I'd love to hear whether you have come across something similar, given your NLP/XAI background.