Description
🐛 Bug
I started using the `ConfusionMatrix` metric to compute normalized confusion matrices in a multi-GPU DDP environment. However, I found the following issues:
- The normalization divisor is computed correctly in a row-wise manner; however, the division is applied column-wise.
- The normalization does not protect against division by zero when a row has no data. While this is not a usual case for a well-designed validation set, it is possible when you have a large number of unbalanced classes and `limit_val_batches` is small (such as when debugging).
- There is no way to specify the number of classes for the confusion matrix. This is critical when performing DDP reduction, because the automatic inference of the number of classes can produce different answers on each process. I encountered this when using a large number of unbalanced classes such that one of the DDP processes did not see any true labels or predictions of the last class, causing its inferred number of classes to be one less than on the other processes. This bug sometimes hangs the entire training process, forcing me to manually kill each DDP process.
- When computing a normalized confusion matrix with DDP reduction, the sum reduction needs to happen prior to the normalization.
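To illustrate the first two normalization issues, here is a minimal NumPy sketch (not the library code) of row-wise normalization with a guard for empty rows:

```python
import numpy as np

def row_normalize(cm):
    # The divisor is the per-row (true-class) total; dividing with
    # keepdims broadcasts it across each row, not down each column.
    row_sums = cm.sum(axis=1, keepdims=True)
    # Guard rows with no true samples against 0/0; they stay all-zero.
    row_sums[row_sums == 0] = 1.0
    return cm / row_sums

cm = np.array([[3., 1., 0., 0.],
               [0., 4., 0., 0.],
               [0., 0., 0., 0.],   # no true samples for class 2
               [2., 0., 0., 6.]])
normalized = row_normalize(cm)
```

Every populated row of `normalized` sums to 1, while the empty row remains all zeros instead of producing NaNs.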
To Reproduce
As a means to reproduce these issues, I have attached two Python scripts. I had to change the extension to .txt so that they would upload to GitHub. The script confusion_matrix.py exposes the first two issues with normalization within a single process. The script confusion_matrix_ddp.py exposes the final two issues by computing the metric within a model trained using DDP over two GPUs. Both scripts create a 4-class problem with 20 samples unevenly divided among the classes. The true confusion matrix is computed within the script and printed to standard out, along with the confusion matrix computed by PyTorch Lightning.
confusion_matrix.py.txt
confusion_matrix_ddp.py.txt
Steps to reproduce the behavior:
- Download both scripts
- Rename them to remove the `.txt` extension
- Run `./confusion_matrix.py` on a machine with at least 1 GPU. Compare the true and test confusion matrices.
- Run `./confusion_matrix_ddp.py` on a machine with at least 2 GPUs. This computes the unnormalized confusion matrix. Compare the true and test confusion matrices.
  - WARNING: This may hang and require you to manually kill each process.
- Run `./confusion_matrix_ddp.py --normalize` on a machine with at least 2 GPUs. This computes the normalized confusion matrix. Compare the true and test confusion matrices.
  - WARNING: This may hang and require you to manually kill each process.
Expected behavior
The computed confusion matrix will be identical to the true confusion matrix printed within the scripts provided above.
Environment
- PyTorch Version (e.g., 1.0): `1.5.1=py3.8_cuda10.2.89_cudnn7.6.5_0`
- OS (e.g., Linux): CentOS
- How you installed PyTorch (`conda`, `pip`, source): `conda`
- Build command you used (if compiling from source): NA
- Python version: 3.8.3
- CUDA/cuDNN version: conda `cudatoolkit=10.2.89=hfd86e86_1`
- GPU models and configuration: GeForce GTX 1070 and GeForce GTX 970
Solution
I have forked PyTorch Lightning and created fixes for all of these issues: https://github.com/jpblackburn/pytorch-lightning/tree/bugfix/confusion_matrix. I am willing to turn this into a pull request.
The solutions to the first two issues did not require any major changes. The third issue required adding a new argument to `ConfusionMatrix` for the number of classes; it is optional and the final argument, so the API remains backwards compatible. The fourth issue was the most involved, as it required changing `ConfusionMatrix` to derive from `Metric` rather than `TensorMetric`. It now uses an internal class that derives from `TensorMetric`: `forward` delegates the computation of the unnormalized confusion matrix and the DDP reduction to the internal class, then performs the normalization itself after the DDP reduction is complete.
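The reduce-before-normalize order matters numerically. A small NumPy sketch with hypothetical per-rank counts (not the actual library code) shows why:

```python
import numpy as np

def row_normalize(cm):
    # Row-wise normalization with a guard for empty rows.
    row_sums = cm.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0
    return cm / row_sums

# Hypothetical unnormalized counts accumulated by two DDP ranks.
cm_rank0 = np.array([[2., 1.],
                     [0., 1.]])
cm_rank1 = np.array([[1., 0.],
                     [3., 3.]])

# Wrong order: normalize on each rank, then sum-reduce.
# Rows no longer sum to 1, so this is not a valid normalized matrix.
wrong = row_normalize(cm_rank0) + row_normalize(cm_rank1)

# Right order: sum-reduce the raw counts across ranks, then normalize once.
right = row_normalize(cm_rank0 + cm_rank1)
```

Only the reduce-then-normalize result has rows that sum to 1, which is what delegating the reduction to the internal class and normalizing afterwards achieves.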
I fixed the issues incrementally for easier understanding.
To validate the solution:
- Run `./confusion_matrix.py` on a machine with at least 1 GPU. Compare the true and test confusion matrices.
- Run `./confusion_matrix_ddp.py --set-num-classes` on a machine with at least 2 GPUs. Compare the true and test confusion matrices.
- Run `./confusion_matrix_ddp.py --set-num-classes --normalize` on a machine with at least 2 GPUs. Compare the true and test confusion matrices.
Note the addition of --set-num-classes to the DDP script.
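Fixing the class count matters because, without it, each process infers a tensor size from the labels it happened to see. A NumPy sketch with hypothetical per-rank label sets illustrates the shape mismatch:

```python
import numpy as np

num_classes = 4
labels_rank0 = np.array([0, 1, 2, 3])
labels_rank1 = np.array([0, 1, 2])  # this rank never sees class 3

# Inferred sizes differ across ranks, so the per-rank confusion
# matrices cannot be sum-reduced together (and DDP can hang waiting).
inferred0 = np.bincount(labels_rank0)  # length 4
inferred1 = np.bincount(labels_rank1)  # length 3

# Passing the class count explicitly makes every rank agree.
fixed0 = np.bincount(labels_rank0, minlength=num_classes)  # length 4
fixed1 = np.bincount(labels_rank1, minlength=num_classes)  # length 4
```

This is the same role the new `num_classes` argument plays for `ConfusionMatrix`: it pins the matrix shape regardless of which classes each process observes.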