[JOSS review] benchmark on Apple silicon #16

Closed
KedoKudo opened this issue Nov 28, 2023 · 2 comments
@KedoKudo

This is part of the review feedback for JOSS submission (openjournals/joss-reviews#6024)

It would be interesting to see how the software performs on Apple silicon when running as a CPU process and using the mps backend.

@AndySAnker
Collaborator

Thank you for the suggestion.

The short answer:

We do not support MPS GPUs yet, and running on the MPS device with CPU fallback is slower than running the calculation on the CPU without the MPS backend.

The longer answer:

To install torch with the MPS backend, I followed the guide here: https://developer.apple.com/metal/pytorch/
As described in the guide, I get the output tensor([1.], device='mps:0'), meaning that it is correctly installed.

We can allow DebyeCalculator to do calculations on the MPS device. However, the torch.pdist function is not yet supported there, which results in the following error message:
NotImplementedError: The operator 'aten::_pdist_forward' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1 to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.
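
For reference, a minimal sketch of how the fallback can be enabled from Python. It assumes the variable has to be set before torch is imported, which is my understanding of when the flag is read. The helper name pick_device is made up for this illustration and is not part of DebyeCalculator or PyTorch.

```python
import importlib.util
import os

# Assumption: the flag is read when torch is loaded, so set it before
# any "import torch" in the process.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"


def pick_device() -> str:
    """Return 'mps' if PyTorch reports a usable MPS backend, else 'cpu'.

    Hypothetical helper for illustration only.
    """
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # torch is not installed in this environment
    import torch

    # Older torch builds may not expose torch.backends.mps at all.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
print(f"Using device: {device}")
```

Setting the variable in the shell before launching Python should have the same effect.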

I have set the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1, meaning that the time-limiting step (pdist) of DebyeCalculator is performed on the CPU. I then benchmark the MPS device with CPU fallback against the plain CPU using the following code:

from debyecalculator import DebyeCalculator
import timeit
import torch
import random
import matplotlib.pyplot as plt
 
# Define the number of atoms
num_atoms_list = [10, 100, 1000, 10000]
 
# Define the devices
devices = ['cpu', 'mps']
 
# Store the times for each device and number of atoms
times = {device: [] for device in devices}
 
for device in devices:
    for num_atoms in num_atoms_list:
        # Generate random coordinates for the atoms
        coordinates = [[random.random() for _ in range(3)] for _ in range(num_atoms)]
 
        # Create the structure tuple
        structure_tuple = (["Fe"] * num_atoms, torch.tensor(coordinates))
 
        # Initialise calculator object
        calc = DebyeCalculator(qmin=1.0, qmax=8.0, qstep=0.01, device=device)
 
        # Setup the timeit function
        setup = 'from __main__ import calc, structure_tuple'
 
        # Time the calc.iq function
        elapsed_time = timeit.timeit('calc.iq(structure_source=structure_tuple)', setup=setup, number=10)
 
        # Convert the total time for the 10 runs to milliseconds and store it
        elapsed_time_ms = elapsed_time * 1000
        times[device].append(elapsed_time_ms)
 
# Plot the times
for device, device_times in times.items():
    plt.plot(num_atoms_list, device_times, 'o--', label=device)
 
plt.xlabel('Number of atoms')
plt.ylabel('Time (ms)')
plt.legend()
plt.show()

The result is as follows:
[Figure_1: plot of time (ms) against number of atoms for the cpu and mps devices]

We can conclude that, for now, we cannot offer any acceleration on the MPS device. We hope that the torch.pdist function will be implemented for the MPS device. Apple seems to be putting some effort into this area: https://github.com/ml-explore/mlx

If you have any ideas on how we can offer MPS acceleration, please let us know :-)

@AndySAnker
Collaborator

With #42, we now offer MPS calculations, meaning that 'mps' is allowed as an input for device.

However, the software is not optimised for MPS, so using MPS alone gives no speed-up: on a Mac M3 chip, it is about 10 % slower on MPS than on the CPU. It does mean that, when calculating scattering patterns from many structures, the work can be run in parallel on both the CPU and MPS, giving a speed-up of about 15 %.
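
As a rough illustration of the idea (not the code used for the numbers above), the sketch below shares a list of structures between one worker thread per device. Each worker takes the next structure from a common queue, so the faster device ends up doing more of the work. The names run_on_devices and fake_compute are hypothetical, and fake_compute only stands in for a real calculation.

```python
import queue
import threading
from typing import Any, Callable, List, Sequence


def run_on_devices(
    structures: Sequence[Any],
    compute: Callable[[str, Any], Any],
    devices: Sequence[str] = ("cpu", "mps"),
) -> List[Any]:
    """Process structures with one worker thread per device.

    Results are returned in the same order as the input structures.
    """
    tasks: queue.Queue = queue.Queue()
    for index, structure in enumerate(structures):
        tasks.put((index, structure))

    results: List[Any] = [None] * len(structures)

    def worker(device: str) -> None:
        # Keep pulling work until the shared queue is empty.
        while True:
            try:
                index, structure = tasks.get_nowait()
            except queue.Empty:
                return
            results[index] = compute(device, structure)

    threads = [threading.Thread(target=worker, args=(d,)) for d in devices]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


# Hypothetical stand-in for a real calculation: it only records which
# device handled the structure and how many atoms it contained.
def fake_compute(device: str, structure: Sequence[Any]) -> tuple:
    return (device, len(structure))


patterns = run_on_devices([[0] * n for n in (1, 2, 3, 4)], fake_compute)
print(patterns)
```

In a real run, compute would presumably hold one DebyeCalculator per device and call iq(structure_source=...) as in the benchmark above. How much the two devices actually overlap depends on how much of the work releases Python's global interpreter lock, so the gain should be measured rather than assumed.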

Note: MPS does not work with Python 3.7. It works with Python >= 3.8.
