Description
🐛 Bug
I started using the `ConfusionMatrix` metric to compute normalized confusion matrices in a multi-GPU DDP environment. However, I found the following issues:
- The normalization divisor is computed correctly in a row-wise manner; however, the division is applied column-wise.
- The normalization does not protect against division by zero when a row has no data. While this is not a usual case for a well-designed validation set, it is possible when you have a large number of unbalanced classes and `limit_val_batches` is small (such as when debugging).
- There is no way to specify the number of classes for the confusion matrix. This is critical when performing DDP reduction, because the automatic inference of the number of classes can produce different answers on each process. I encountered this when using a large number of unbalanced classes such that one of the DDP processes did not see any true labels or predictions of the last class, causing its inferred number of classes to be one less than on the other processes. This bug sometimes hangs the entire training process, forcing me to manually kill each DDP process.
- When computing a normalized confusion matrix with DDP reduction, the sum reduction needs to happen prior to the normalization.
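To illustrate the first two normalization issues, here is a minimal NumPy sketch (not the library code) of row-wise normalization with a guard for empty rows:

```python
import numpy as np

def row_normalize(cm):
    # The divisor is the per-row (true-class) total; dividing with
    # keepdims broadcasts it across each row, not down each column.
    row_sums = cm.sum(axis=1, keepdims=True)
    # Guard rows with no true samples against 0/0; they stay all-zero.
    row_sums[row_sums == 0] = 1.0
    return cm / row_sums

cm = np.array([[3., 1., 0., 0.],
               [0., 4., 0., 0.],
               [0., 0., 0., 0.],   # no true samples for class 2
               [2., 0., 0., 6.]])
normalized = row_normalize(cm)
```

Every populated row of `normalized` sums to 1, while the empty row remains all zeros instead of producing NaNs.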
To Reproduce
As a means to reproduce these issues, I have attached two Python scripts. I had to change the extension to .txt so that they would upload to GitHub. The script confusion_matrix.py exposes the first two issues with normalization within a single process. The script confusion_matrix_ddp.py exposes the final two issues by computing the metric within a model trained using DDP over two GPUs. Both scripts create a 4-class problem with 20 samples unevenly divided among the classes. The true confusion matrix is computed within the script and printed to standard out, along with the confusion matrix computed by PyTorch Lightning.
confusion_matrix.py.txt
confusion_matrix_ddp.py.txt
Steps to reproduce the behavior:
- Download both scripts
- Rename them to remove the `.txt` extension
- Run `./confusion_matrix.py` on a machine with at least 1 GPU. Compare the true and test confusion matrices.
- Run `./confusion_matrix_ddp.py` on a machine with at least 2 GPUs. This computes the unnormalized confusion matrix. Compare the true and test confusion matrices.
  - WARNING: This may hang and require you to manually kill each process.
- Run `./confusion_matrix_ddp.py --normalize` on a machine with at least 2 GPUs. This computes the normalized confusion matrix. Compare the true and test confusion matrices.
  - WARNING: This may hang and require you to manually kill each process.
Expected behavior
The computed confusion matrix will be identical to the true confusion matrix printed within the scripts provided above.
Environment
- PyTorch Version (e.g., 1.0): `1.5.1=py3.8_cuda10.2.89_cudnn7.6.5_0`
- OS (e.g., Linux): CentOS
- How you installed PyTorch (`conda`, `pip`, source): `conda`
- Build command you used (if compiling from source): NA
- Python version: 3.8.3
- CUDA/cuDNN version: conda `cudatoolkit=10.2.89=hfd86e86_1`
- GPU models and configuration: GeForce GTX 1070 and GeForce GTX 970
Solution
I have forked PyTorch Lightning and created fixes for all of these issues: https://github.com/jpblackburn/pytorch-lightning/tree/bugfix/confusion_matrix. I am willing to turn this into a pull request.
The solutions to the first two issues did not require any major changes. The third issue required adding a new argument to `ConfusionMatrix` for the number of classes; it is optional and the final argument, so the API remains backwards compatible. The fourth issue was the most involved, as it required changing `ConfusionMatrix` to derive from `Metric` rather than `TensorMetric`. It now uses an internal class that derives from `TensorMetric`: `forward` delegates the computation of the unnormalized confusion matrix and the DDP reduction to the internal class, then performs the normalization itself after the DDP reduction is complete.
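The reduce-before-normalize order matters numerically. A small NumPy sketch with hypothetical per-rank counts (not the actual library code) shows why:

```python
import numpy as np

def row_normalize(cm):
    # Row-wise normalization with a guard for empty rows.
    row_sums = cm.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0
    return cm / row_sums

# Hypothetical unnormalized counts accumulated by two DDP ranks.
cm_rank0 = np.array([[2., 1.],
                     [0., 1.]])
cm_rank1 = np.array([[1., 0.],
                     [3., 3.]])

# Wrong order: normalize on each rank, then sum-reduce.
# Rows no longer sum to 1, so this is not a valid normalized matrix.
wrong = row_normalize(cm_rank0) + row_normalize(cm_rank1)

# Right order: sum-reduce the raw counts across ranks, then normalize once.
right = row_normalize(cm_rank0 + cm_rank1)
```

Only the reduce-then-normalize result has rows that sum to 1, which is what delegating the reduction to the internal class and normalizing afterwards achieves.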
I fixed the issues incrementally for easier understanding.
To validate the solution:
- Run `./confusion_matrix.py` on a machine with at least 1 GPU. Compare the true and test confusion matrices.
- Run `./confusion_matrix_ddp.py --set-num-classes` on a machine with at least 2 GPUs. Compare the true and test confusion matrices.
- Run `./confusion_matrix_ddp.py --set-num-classes --normalize` on a machine with at least 2 GPUs. Compare the true and test confusion matrices.
Note the addition of --set-num-classes to the DDP script.
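Fixing the class count matters because, without it, each process infers a tensor size from the labels it happened to see. A NumPy sketch with hypothetical per-rank label sets illustrates the shape mismatch:

```python
import numpy as np

num_classes = 4
labels_rank0 = np.array([0, 1, 2, 3])
labels_rank1 = np.array([0, 1, 2])  # this rank never sees class 3

# Inferred sizes differ across ranks, so the per-rank confusion
# matrices cannot be sum-reduced together (and DDP can hang waiting).
inferred0 = np.bincount(labels_rank0)  # length 4
inferred1 = np.bincount(labels_rank1)  # length 3

# Passing the class count explicitly makes every rank agree.
fixed0 = np.bincount(labels_rank0, minlength=num_classes)  # length 4
fixed1 = np.bincount(labels_rank1, minlength=num_classes)  # length 4
```

This is the same role the new `num_classes` argument plays for `ConfusionMatrix`: it pins the matrix shape regardless of which classes each process observes.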