Issues with Confusion Matrix normalization and DDP computation #2724

@jpblackburn

Description

🐛 Bug

I started using the ConfusionMatrix metric to compute normalized confusion matrices within a multi-GPU DDP environment. However, I found the following issues:

  1. The normalization divisor is computed correctly in a row-wise manner; however, the division is applied column-wise.
  2. The normalization does not protect against divide by zero if there is no data in a particular row. While this is not a usual case for a well-designed validation set, it is possible when you have a large number of unbalanced classes and limit_val_batches is small (such as when debugging).
  3. There is no way to specify the number of classes for the confusion matrix. This is critical when performing DDP reduction, as the automatic inference of the number of classes can produce different answers on each process. I encountered this with a large number of unbalanced classes: one of the DDP processes saw no true data or declarations of the last class, so its inferred number of classes was one less than on the other processes. This bug sometimes hangs the entire training run, and I had to manually kill each DDP process.
  4. When computing a normalized confusion matrix with DDP reduction, the sum reduction needs to happen prior to the normalization.
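The first two issues together suggest a fix along these lines. This is a minimal sketch (not the library's actual code), assuming the matrix is indexed [true class, predicted class] and normalization is over true classes, i.e. row-wise:

```python
import torch

def normalize_confusion_matrix(cm: torch.Tensor) -> torch.Tensor:
    """Row-wise normalization with divide-by-zero protection.

    Each row is divided by its own sum (the true-class count); rows
    with no samples are left as zeros instead of producing NaNs.
    """
    row_sums = cm.sum(dim=1, keepdim=True)     # shape (C, 1): one divisor per row
    row_sums = torch.clamp(row_sums, min=1.0)  # guard empty rows against 0/0
    return cm / row_sums                       # divisor broadcasts across columns

cm = torch.tensor([[3., 1., 0.],
                   [0., 0., 0.],  # class with no true samples
                   [1., 1., 2.]])
print(normalize_confusion_matrix(cm))
```

Keeping the divisor as a column vector of shape (C, 1) is what makes the division apply row-wise; summing over the wrong dimension (or transposing the divisor) reproduces the column-wise bug described in issue 1.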

To Reproduce

To reproduce these issues, I have attached two Python scripts. I had to change the extension to .txt so that GitHub would accept the upload. The script confusion_matrix.py exposes the first two issues (normalization) within a single process. The script confusion_matrix_ddp.py exposes the final two issues by computing the metric within a model trained using DDP over two GPUs. Both scripts create a 4-class problem with 20 samples unevenly divided among the classes. The true confusion matrix is computed within each script and printed to standard out, along with the confusion matrix computed by PyTorch Lightning.

confusion_matrix.py.txt
confusion_matrix_ddp.py.txt
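For reference, the unnormalized counts for a setup like this can be computed with an explicit class count, which sidesteps the per-process inference problem of issue 3. This helper is my own illustration, not code from the attached scripts:

```python
import torch

def confusion_matrix_counts(preds: torch.Tensor, target: torch.Tensor,
                            num_classes: int) -> torch.Tensor:
    """Unnormalized confusion matrix with an explicit num_classes.

    Fixing num_classes guarantees every DDP process produces a matrix
    of the same shape, even if one process never sees the last class.
    """
    # Encode each (true, predicted) pair as a single bin index.
    idx = target * num_classes + preds
    counts = torch.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)
```

With num_classes inferred from the data instead (e.g. via max()), a process that never observes class 3 would build a 3x3 matrix while its peers build 4x4 ones, and the subsequent all-reduce on mismatched shapes is what can hang training.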

Steps to reproduce the behavior:

  1. Download both scripts
  2. Rename them to remove the .txt extension
  3. ./confusion_matrix.py on a machine with at least 1 GPU. Compare the true and test confusion matrices.
  4. ./confusion_matrix_ddp.py on a machine with at least 2 GPUs. This computes the unnormalized confusion matrix. Compare the true and test confusion matrices.
    • WARNING: This may hang the process and require you to manually kill each process.
  5. ./confusion_matrix_ddp.py --normalize on a machine with at least 2 GPUs. This computes the normalized confusion matrix. Compare the true and test confusion matrices.
    • WARNING: This may hang the process and require you to manually kill each process.

Expected behavior

The computed confusion matrix will be identical to the true confusion matrix printed within the scripts provided above.

Environment

  • PyTorch Version (e.g., 1.0): 1.5.1=py3.8_cuda10.2.89_cudnn7.6.5_0
  • OS (e.g., Linux): CentOS
  • How you installed PyTorch (conda, pip, source): conda
  • Build command you used (if compiling from source): NA
  • Python version: 3.8.3
  • CUDA/cuDNN version: Conda cudatoolkit=10.2.89=hfd86e86_1
  • GPU models and configuration: GeForce GTX 1070 and GeForce GTX 970

Solution

I have forked PyTorch Lightning and created fixes for all of these issues: https://github.com/jpblackburn/pytorch-lightning/tree/bugfix/confusion_matrix. I am willing to turn this into a pull request.

The first two issues did not require any major changes. The third issue required adding a new argument to ConfusionMatrix for the number of classes; it is optional and placed last, so the API remains backwards compatible. The fourth issue was the most involved, as it required changing ConfusionMatrix to derive from Metric rather than TensorMetric. It now uses an internal class that derives from TensorMetric: forward delegates the computation of the unnormalized confusion matrix and the DDP reduction to the internal class, then performs the normalization itself after the DDP reduction is complete.
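The reduce-then-normalize ordering behind the fourth fix can be sketched as follows. The helper name and the direct use of torch.distributed are my own illustration, not the patched API:

```python
import torch
import torch.distributed as dist

def reduced_normalized_cm(local_cm: torch.Tensor) -> torch.Tensor:
    """Sum raw counts across DDP processes first, then normalize.

    Normalizing each process's matrix before the reduction would sum
    per-process ratios with different denominators, giving a matrix
    whose rows no longer sum to one.
    """
    cm = local_cm.clone()
    if dist.is_available() and dist.is_initialized():
        # Reduce the unnormalized counts; every process gets the global sum.
        dist.all_reduce(cm, op=dist.ReduceOp.SUM)
    row_sums = torch.clamp(cm.sum(dim=1, keepdim=True), min=1.0)  # guard empty rows
    return cm / row_sums
```

Outside a distributed context the all-reduce is skipped and the function degrades to plain single-process row normalization.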

I fixed the issues in incremental commits for easier review.

To validate the solution:

  1. ./confusion_matrix.py on a machine with at least 1 GPU. Compare the true and test confusion matrices.
  2. ./confusion_matrix_ddp.py --set-num-classes on a machine with at least 2 GPUs. Compare the true and test confusion matrices.
  3. ./confusion_matrix_ddp.py --set-num-classes --normalize on a machine with at least 2 GPUs. Compare the true and test confusion matrices.

Note the addition of --set-num-classes to the DDP script.

Metadata

Labels

bug (Something isn't working), distributed (Generic distributed-related topic), help wanted (Open to be worked on)
