Support for arm64-apple #2247
Comments
The short story is that |
Sure, happy to support. What did you need to modify to get it to build the M1 binaries anyway? |
Add |
You probably do that simply by setting |
Actually, that likely won't work. Try this instead:
|
That was exactly what I first attempted, but the problem is that you cannot use What I have done for now is generate the config file for a native x86_64 build and then patch the config file to add |
OK, gotcha. Should be simple to add an environment variable to add the flags immediately prior to writing out the config. Something like |
That could work. One caveat is the binaries of the dependencies. For ./configure to succeed, you would need access to the x86_64 binaries for Qt etc., and they would then have to be swapped for the arm64 ones before the ./build stage, or they would have to be in separate directories, but then we would also have to modify the config script to reflect that. Anyway, it would still be a bit of a mess, so maybe making the runtime parts of the configure script optional could be a simpler method after all... |
We could I guess skip a lot of the runtime checks, but certainly not all. We'll still need to run |
My M1 Macbook arrived today! The cross-compiled arm64 binaries as well as the original x86_64 binaries appear to work just fine on arm64. arm64 binary on arm64:
x86_64 binary on arm64:
For reference, dwi.mif and mask.mif are taken from the preprocessed single subject Siemens data on osf. |
For reference, a full release build of mrtrix3 on the Apple M1 took 214 seconds (not including ./configure). |
mrview crashes when loading a tractogram with {20,50,100}K tracks (10K works fine) :(
|
Here is a more in-depth view of the crash:
|
All unit tests pass: testing_units.log Some binary tests fail: testing_binaries.log, but it looks reasonable, @jdtournier @Lestropie ? |
Binary test failures are |
This is just to check whether the complexity of the geometry shader, or the actions we perform within it, are responsible for the poor performance and mrview crashes when rendering tractograms as streamtubes. See details in #2247
@bjeurissen, take a look at the |
Thanks! I just tried it, but this throws an uncaught exception (even with just a handful of tracks), so I guess the pass-through geometry shader introduced another problem:
EDIT: This is probably more helpful:
|
OK, I see it's a bit more strict than my AMD drivers... Try that last commit, see if that works (I'm ignoring the warnings for now). |
with the passthrough shader, the behavior stays the same as with pseudotubes: slow performance with 10K tracks, crash with 100K tracks |
Hi Guys, I was wondering whether there is any progress on this issue and whether I could help? I am currently on an M1 Mac and was trying to visualise my streamlines (200k) but, as expected given the discussion above, mrview crashes with a somewhat nondescript error: The version that I'm using is: Note that I was able to view the 200k streamlines by adding 'MRViewDefaultTractGeomType: Lines' to the ~/.mrview.conf file, in line with the suggestion here: |
Dear Max, thanks for reaching out! The problems on M1 Macs are believed to stem from the incomplete or buggy support for Geometry Shaders in the OpenGL implementation that is shipped with Apple M1 Macs. "Natively" the Apple M1 GPUs do not support Geometry Shaders at all, but given the OpenGL version that is bundled with macOS and given the fact that it still works with a limited number of fibre tracks, it should be supported... However, there appear to be some caveats or bugs that we are unaware of... (@neurolabusc: You seem to have quite a lot of experience with 3D rendering on Apple M1 Macs. What is your take on this?) Currently, the way to help would be to identify exactly where this problem occurs (e.g. identify what OpenGL error is reported or where in the shader the crash occurs), so we can either think of a work-around on our side or file a detailed bug report with Apple. I have tried using an OpenGL debugger from Apple to get to the bottom of this, but mrview grinds to a halt as soon as I start logging the OpenGL commands... There are quite a few other tools available for OpenGL debugging, but I struggle to find solutions that are compatible with ARM64... Another way you could help is by systematically testing all mrview features and checking whether anything else breaks. So far we have:
|
I have a feeling we might be able to change our implementation to avoid the use of the geometry shader - but it's going to take quite a bit of effort to work through the maths and make sure the solution is actually OpenGL-compliant... I'm not sure I'll have any time to look into this for some time unfortunately... 😒 |
You may want to take a look at Surfice, which is distributed as a universal binary for macOS (natively supporting Intel and ARM64). It reads TCK format, so you can evaluate the performance. In the
When we developed this, we initially used the Geometry Shader, but this provided very poor performance with some vendors and some drivers. The role of the geometry shader was to generate new vertices dynamically. Our solution was to have the GPU buffer store two vertices instead of one for each line; these are identical in all respects except for the W component of their surface normal. We use this to extrude the two vertices so that they form a plane facing the viewer. Looking at MRtrix, you also use OpenGL 3.3 Core, so you can use the shaders verbatim. As both a caveat and suggestion, I note that MRtrix uses GL_FLOAT for virtually all the buffer data. You can use the GPU RAM much more efficiently by using the smallest acceptable data type (8-bit RGBA for colors, and for normals I have used GL_INT_2_10_10_10_REV, which gives 10 bits of precision for the X, Y, Z components of the normal, while the 2 bits for the W component are sufficient for this application). However, be aware that the Apple M1 will show a massive performance penalty for using GL_INT_2_10_10_10_REV. This is not a native type in Metal. Perhaps someday the Apple engineers will get around to converting this to a native type when you load the buffer, but in the meantime I would suggest having the normal be four 16-bit half floats (GL_HALF_FLOAT). The ARM has native scalar and SIMD instructions for converting 32-bit single floats to 16-bit half floats. Vertex Shader
Fragment Shader
|
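The two-vertices-per-point idea described above can be sketched on the CPU as follows. This is only an illustration, not the Surfice or MRtrix implementation: the function names, the dictionary-based vertex layout, and the choice of `cross(tangent, view_dir)` as the extrusion direction are all assumptions made for the example. In a real renderer the extrusion step would run in the vertex shader.

```python
import math

def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(v):
    """Scale a 3-vector to unit length (fails for a zero-length vector)."""
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)

def duplicate_vertices(points, normal=(0.0, 0.0, 1.0)):
    """Store each streamline point twice; the copies differ only in normal.w."""
    vertices = []
    for p in points:
        for w in (-1.0, 1.0):
            vertices.append({"position": tuple(p), "normal": (*normal, w)})
    return vertices

def extrude(vertex, tangent, view_dir, half_width):
    """CPU stand-in for the vertex shader: push the vertex sideways by w * half_width.

    Assumption: the sideways direction is perpendicular to both the line
    tangent and the view direction, so the resulting strip faces the viewer.
    """
    side = normalize(cross(tangent, view_dir))
    w = vertex["normal"][3]
    return tuple(p + w * half_width * s
                 for p, s in zip(vertex["position"], side))
```

With a line running along X and the viewer looking along Z, the two copies of a point end up on opposite sides of the line, separated by twice `half_width`, so consecutive pairs can be drawn as a triangle strip without any geometry shader.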
Thanks @neurolabusc, your billboarding approach sounds very similar to what we currently do with the geometry shader in I might look into using 16-bit floats though, that might help a bit - though I have to admit I've never really found RAM to be a bottleneck - I've always been able to display 1M+ streamlines on my 2008 laptop, way too many for an acceptable frame rate... But it might come in handy if that does become a problem! |
@Lestropie for the linear algebra, be aware that the M1 includes undocumented but extremely high performance AMX2 instructions. Apple's Accelerate framework allows you to use these, often with the same calling conventions as other linear algebra libraries, as demonstrated by the main_mmul.cpp. If you can use these, you will get superior performance. I agree that different implementations will provide different results, in particular as fused multiply-add (FMA) instructions have less rounding error than sequential multiply and add instructions. My web page also provides links describing why SciPy and NumPy do not use Apple's high performance libraries. If they do work for your applications, the performance is really outstanding. |
@jdtournier I agree that the ability of the geometry shader to emit vertices seems like the perfect solution. However, when I evaluated this (many years ago), I found these shaders often had a real performance hit. Each vertex in the code I describe is pretty compact: For most computers: For the M1: |
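To get a feel for how much the choice of data types matters, the following sketch compares the size of three hypothetical vertex layouts. These are not the layouts from the comment above, nor those used by Surfice or MRtrix; they only combine the types mentioned in this thread (32-bit floats, 8-bit RGBA, a packed 32-bit 2_10_10_10 normal, and four half floats), and the position is assumed to be three 32-bit floats in every case.

```python
import struct

# Hypothetical vertex layouts, written as struct format strings.
# "<" disables padding, f = 32-bit float, B = 8-bit unsigned int,
# I = 32-bit unsigned int (stand-in for a packed 2_10_10_10 normal),
# e = 16-bit half float.
LAYOUTS = {
    "all_float":     "<3f4f4f",  # position, RGBA colour, XYZW normal as floats
    "packed_normal": "<3f4BI",   # position, 8-bit RGBA, packed 32-bit normal
    "half_normal":   "<3f4B4e",  # position, 8-bit RGBA, four half-float normal
}

def bytes_per_vertex(layout):
    """Size in bytes of one vertex for the named layout."""
    return struct.calcsize(LAYOUTS[layout])

def buffer_bytes(layout, n_points, copies_per_point=2):
    """Total buffer size, assuming each streamline point is stored twice."""
    return bytes_per_vertex(layout) * n_points * copies_per_point

if __name__ == "__main__":
    for name in LAYOUTS:
        print(name, bytes_per_vertex(name), "bytes per vertex")
```

Under these assumptions the all-float layout needs 44 bytes per vertex, the packed-normal layout 20, and the half-float-normal layout 24, so moving from the packed normal to half floats costs only 4 extra bytes per vertex.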
Thanks @neurolabusc! I did a quick surfice vs mrview test on an M1 MacBook Pro (click the image to watch the movie): I am using the same 10K tck file and surfice is clearly much more responsive. Moreover, with surfice I can load much larger tractograms on Apple M1, which would result in a crash when using MRview. One additional thing I noticed (e.g. in the first frame) is that the colour encoding of the tracks between the two viewers seems off. This could just be a matter of different coordinate systems, but it might be that there is more to it. |
That's just 10k streamlines...? OK, there's definitely room for improvement... On my Win10 laptop with Intel HD graphics 620, I max out at 60 FPS - basically hitting the monitor refresh rate since we're running with VSync. Running full-screen and zooming in to really fill the screen, I max out at around 40 FPS. I'd expect the M1 to be more capable than that, right...? |
Yes
The M1 should be able to destroy Intel HD graphics 620. It is supposed to be the current king of integrated graphics, so much that it can even rival older discrete GPUs. https://www.anandtech.com/show/16252/mac-mini-apple-m1-tested/3 |
I ran mrview with the -fps option and got about 1 FPS with 10K pseudo tubes on Apple M1 MacBook Pro. Falling back to Line geometry FPS was 60. |
@bjeurissen can you send me some of your streamline files (e.g. send a dropbox/google drive link to the email in my avatar)? I wonder if a few minor tweaks can improve the performance. In particular, Metal performs best if you pad some data types as shown in Table 2.3 of the Metal specification. For example, my shaders may perform best if the vec3 |
@bjeurissen thanks for the sample dataset. I did spend some time investigating this. Specifically, I investigated whether adjusting the alignment of data types to match early specifications of Metal helped (e.g. the GLSL vec3 equivalent in Metal is float3, but as noted in section 2.2 float3 requires the same 16-byte alignment as a float4, rather than the more compact 12 bytes). This had no impact, so I assume that the M1 GPUs are a bit less rigid in their expectations. I did make a few tiny performance gains for the M1. Since this is specific to the M1, I have not released new versions for other platforms. Therefore, when macOS users download the universal binary they will get While Surfice is mesh based, I have also released a new version of my volume-based MRIcroGL. There are two versions of the universal binary for macOS: OpenGL and Metal. They have the same functionality but the latter is a tiny bit faster on modern computers like the M1. Finally, macOS users might want to try out my minimal MRIcro tool from the Apple App Store. This has far fewer features than my other tools, and is constrained by Apple's sandboxing requirement. However, it might be a nice lightweight option for teaching classes. It is also a universal binary. |
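The alignment experiment mentioned above (12-byte vec3 versus a 16-byte, float4-style stride) can be illustrated with a small sketch. The function names are made up for this example, and it only shows how the two buffer layouts differ in memory; it makes no claim about how Surfice actually fills its buffers.

```python
import struct

def pack_tight(vectors):
    """Pack 3-component vectors back to back: 12 bytes each."""
    return b"".join(struct.pack("<3f", *v) for v in vectors)

def pack_padded(vectors):
    """Pack 3-component vectors with 4 pad bytes each: 16-byte stride."""
    return b"".join(struct.pack("<3f4x", *v) for v in vectors)

def read_padded(buffer, index):
    """Read vector number `index` back from a 16-byte-stride buffer."""
    return struct.unpack_from("<3f", buffer, 16 * index)
```

The padded layout uses a third more memory for this attribute, which is the trade-off being tested: in the experiment reported above, the extra padding made no measurable difference on the M1.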
https://blog.dengine.net/2021/01/wonders-and-mysteries-of-the-m1/ Might contain some useful info. |
I just upgraded to the latest Big Sur (11.4) release and it looks as if the OpenGL situation has changed:
|
|
Just had a go: on my system (16 core AMD Ryzen 9 3950X, AMD Radeon RX590), I get frame rates >100 fps with a generic uncropped fixel plot. Seems the geometry shader is still causing issues... |
An interesting finding about this problem is that on my Macbook M2 running under Fedora Asahi (which has a working OpenGL implementation written from scratch by reverse engineering), the performance issues mentioned in this thread still persist. So the issue seems to lie deep in Apple's hardware/drivers. |
TODO:
- modify configure script to facilitate cross compilation:
  - required if we want to build for arm64-apple in CI and GitHub Actions without the availability of an arm64-apple environment (this is the case now)
  - mostly requires infrastructure to disable runtime tests and to get version numbers from source code rather than binary invocations when building for another architecture
  - alternative is to hijack the config file from an x86_64 build and patch it with arm64 target options (requires no modifications to configure script)
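The idea discussed earlier in the thread, appending target flags from an environment variable just before the configuration is written out, might look roughly like the sketch below. Everything specific here is an assumption made for illustration: the environment variable name `CROSS_ARCH_FLAGS`, the keys `cpp_flags` and `ld_flags`, and the representation of the configuration as a dictionary of flag lists are hypothetical and not taken from the actual configure script.

```python
import os
import shlex

def add_cross_flags(config, env=None, var="CROSS_ARCH_FLAGS",
                    keys=("cpp_flags", "ld_flags")):
    """Return a copy of `config` with extra flags appended to the given keys.

    `config` maps names to lists of flags (hypothetical layout). The extra
    flags are read from the environment variable `var` (hypothetical name)
    and split with shell-like rules; if it is unset, nothing is added.
    """
    env = os.environ if env is None else env
    extra = shlex.split(env.get(var, ""))
    patched = {name: list(flags) for name, flags in config.items()}
    for key in keys:
        patched.setdefault(key, []).extend(extra)
    return patched
```

For example, with the variable set to `-arch arm64` (an illustrative value), a configuration generated by a native x86_64 run would gain those two tokens in both its compiler and linker flag lists, while the original dictionary is left untouched.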