Fix issue with multiple SPIR-V devices #552

Merged
merged 5 commits into from
Sep 10, 2024

Conversation


@mairooni mairooni commented Sep 6, 2024

Description

When more than one SPIR-V device is available, the following error is generated, depending on the device selected for execution.

Unable to compile task s0.t0 - add
The internal error is: [Error During the Task Compilation]: Index 1 out of bounds for length 1
Stacktrace: [tornado.drivers.spirv@1.0.8-dev/uk.ac.manchester.tornado.drivers.spirv.runtime.SPIRVTornadoDevice.compileTask(SPIRVTornadoDevice.java:191), tornado.drivers.spirv@1.0.8-dev/uk.ac.manchester.tornado.drivers.spirv.runtime.SPIRVTornadoDevice.installCode(SPIRVTornadoDevice.java:129), tornado.runtime@1.0.8-dev/uk.ac.manchester.tornado.runtime.interpreter.TornadoVMInterpreter.compileTaskFromBytecodeToBinary(TornadoVMInterpreter.java:664), tornado.runtime@1.0.8-dev/uk.ac.manchester.tornado.runtime.interpreter.TornadoVMInterpreter.execute(TornadoVMInterpreter.java:329), tornado.runtime@1.0.8-dev/uk.ac.manchester.tornado.runtime.interpreter.TornadoVMInterpreter.execute(TornadoVMInterpreter.java:903), java.base/java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:1024), java.base/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:762), tornado.runtime@1.0.8-dev/uk.ac.manchester.tornado.runtime.TornadoVM.executeInterpreterSingleThreaded(TornadoVM.java:127), tornado.runtime@1.0.8-dev/uk.ac.manchester.tornado.runtime.TornadoVM.execute(TornadoVM.java:114), tornado.runtime@1.0.8-dev/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskGraph.scheduleInner(TornadoTaskGraph.java:884), tornado.runtime@1.0.8-dev/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskGraph.execute(TornadoTaskGraph.java:1410), tornado.runtime@1.0.8-dev/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskGraph.execute(TornadoTaskGraph.java:1422), tornado.api@1.0.8-dev/uk.ac.manchester.tornado.api.TaskGraph.execute(TaskGraph.java:759), tornado.api@1.0.8-dev/uk.ac.manchester.tornado.api.ImmutableTaskGraph.execute(ImmutableTaskGraph.java:50), tornado.api@1.0.8-dev/uk.ac.manchester.tornado.api.TornadoExecutionPlan$TornadoExecutor.lambda$execute$0(TornadoExecutionPlan.java:466), java.base/java.util.ArrayList.forEach(ArrayList.java:1596), tornado.api@1.0.8-dev/uk.ac.manchester.tornado.api.TornadoExecutionPlan$TornadoExecutor.execute(TornadoExecutionPlan.java:466), 
tornado.api@1.0.8-dev/uk.ac.manchester.tornado.api.TornadoExecutionPlan.execute(TornadoExecutionPlan.java:118), tornado.examples@1.0.8-dev/uk.ac.manchester.tornado.examples.arrays.ArrayAddInt.main(ArrayAddInt.java:65)]
        tornado.runtime@1.0.8-dev/uk.ac.manchester.tornado.runtime.interpreter.TornadoVMInterpreter.compileTaskFromBytecodeToBinary(TornadoVMInterpreter.java:672)
        tornado.runtime@1.0.8-dev/uk.ac.manchester.tornado.runtime.interpreter.TornadoVMInterpreter.execute(TornadoVMInterpreter.java:329)
        tornado.runtime@1.0.8-dev/uk.ac.manchester.tornado.runtime.interpreter.TornadoVMInterpreter.execute(TornadoVMInterpreter.java:903)
        java.base/java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:1024)
        java.base/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:762)
        tornado.runtime@1.0.8-dev/uk.ac.manchester.tornado.runtime.TornadoVM.executeInterpreterSingleThreaded(TornadoVM.java:127)
        tornado.runtime@1.0.8-dev/uk.ac.manchester.tornado.runtime.TornadoVM.execute(TornadoVM.java:114)
        tornado.runtime@1.0.8-dev/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskGraph.scheduleInner(TornadoTaskGraph.java:884)
        tornado.runtime@1.0.8-dev/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskGraph.execute(TornadoTaskGraph.java:1410)
        tornado.runtime@1.0.8-dev/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskGraph.execute(TornadoTaskGraph.java:1422)
        tornado.api@1.0.8-dev/uk.ac.manchester.tornado.api.TaskGraph.execute(TaskGraph.java:759)
        tornado.api@1.0.8-dev/uk.ac.manchester.tornado.api.ImmutableTaskGraph.execute(ImmutableTaskGraph.java:50)
        tornado.api@1.0.8-dev/uk.ac.manchester.tornado.api.TornadoExecutionPlan$TornadoExecutor.lambda$execute$0(TornadoExecutionPlan.java:466)
        java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
        tornado.api@1.0.8-dev/uk.ac.manchester.tornado.api.TornadoExecutionPlan$TornadoExecutor.execute(TornadoExecutionPlan.java:466)
        tornado.api@1.0.8-dev/uk.ac.manchester.tornado.api.TornadoExecutionPlan.execute(TornadoExecutionPlan.java:118)
        tornado.examples@1.0.8-dev/uk.ac.manchester.tornado.examples.arrays.ArrayAddInt.main(ArrayAddInt.java:65) 

This PR proposes a fix.

Problem description

The error above is generated from this function of the SPIRVTornadoDevice class:

public SPIRVBackend getBackend() {
    return findDriver().getBackend(platformIndex, deviceIndex);
}

Specifically, the error is triggered by the getBackend() function, which attempts to get the backend of the SPIR-V device from the 2D array spirvBackends, which is created in the SPIRVBackendImpl class.
The spirvBackends array is initialized in the following way. The number of rows corresponds to the number of SPIR-V dispatch drivers (OpenCL and LevelZero), and each row contains an array that stores the backends of the devices associated with each driver. For example, on the cyclone server, for the following devices:

Number of Tornado drivers: 2
Driver: SPIR-V
  Total number of SPIR-V devices  : 4
  Tornado device=0:0  (DEFAULT)
        SPIRV -- SPIRV OCL - Intel(R) Arc(TM) A770 Graphics
                Global Memory Size: 15.1 GB
                Local Memory Size: 64.0 KB
                Workgroup Dimensions: 3
                Total Number of Block Threads: [1024]
                Max WorkGroup Configuration: [1024, 1024, 1024]
                Device OpenCL C version: OpenCL C 1.2

  Tornado device=0:1
        SPIRV -- SPIRV OCL - Intel(R) UHD Graphics 770
                Global Memory Size: 117.5 GB
                Local Memory Size: 64.0 KB
                Workgroup Dimensions: 3
                Total Number of Block Threads: [512]
                Max WorkGroup Configuration: [512, 512, 512]
                Device OpenCL C version: OpenCL C 1.2

  Tornado device=0:2
        SPIRV -- SPIRV LevelZero - Intel(R) Arc(TM) A770 Graphics
                Global Memory Size: 15.1 GB
                Local Memory Size: 64.0 KB
                Workgroup Dimensions: 3
                Total Number of Block Threads: [1024]
                Max WorkGroup Configuration: [1024, 1024, 1024]
                Device OpenCL C version:  (LEVEL ZERO) 1.3

  Tornado device=0:3
        SPIRV -- SPIRV LevelZero - Intel(R) UHD Graphics 770
                Global Memory Size: 117.5 GB
                Local Memory Size: 64.0 KB
                Workgroup Dimensions: 3
                Total Number of Block Threads: [512]
                Max WorkGroup Configuration: [512, 512, 512]
                Device OpenCL C version:  (LEVEL ZERO) 1.3

the contents of the spirvBackends array are:

spirvBackends[0][0] = Backend: arch=TornadoVM SPIR-V@OPENCL, device=Intel(R) Arc(TM) A770 Graphics
spirvBackends[1][0] = Backend: arch=TornadoVM SPIR-V@OPENCL, device=Intel(R) UHD Graphics 770
spirvBackends[2][0] = Backend: arch=TornadoVM SPIR-V@LEVEL_ZERO, device=Intel(R) Arc(TM) A770 Graphics
spirvBackends[2][1] = Backend: arch=TornadoVM SPIR-V@LEVEL_ZERO, device=Intel(R) UHD Graphics 770

When trying to run on device 0:3 (SPIRV LevelZero - Intel(R) UHD Graphics 770), the platform and device indices passed to getBackend() were 0 and 1, respectively. Since row 0 of spirvBackends holds a single backend, there is no spirvBackends[0][1] entry and the exception was thrown. To fetch the correct backend, the platform index should have been 2, matching the layout of the array.
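The failing lookup can be reproduced in isolation. The following is a minimal sketch, not TornadoVM code: the class JaggedBackendLookup, the BACKENDS array, and the lookup method are hypothetical stand-ins, with plain strings in place of the real backend objects. The row layout mirrors the spirvBackends contents listed above.

```java
public class JaggedBackendLookup {

    // Hypothetical stand-in for spirvBackends: one row per entry in the
    // universal view, with rows of different lengths (a jagged array).
    static final String[][] BACKENDS = {
        { "OPENCL: Intel(R) Arc(TM) A770 Graphics" },
        { "OPENCL: Intel(R) UHD Graphics 770" },
        { "LEVEL_ZERO: Intel(R) Arc(TM) A770 Graphics",
          "LEVEL_ZERO: Intel(R) UHD Graphics 770" }
    };

    // Index-based lookup, analogous in shape to getBackend(platformIndex, deviceIndex).
    static String lookup(int platformIndex, int deviceIndex) {
        return BACKENDS[platformIndex][deviceIndex];
    }

    public static void main(String[] args) {
        // Indices that follow the array layout resolve the LevelZero UHD device.
        System.out.println(lookup(2, 1));

        // Indices taken from the device's own view (platform 0, device 1)
        // point past the end of row 0, which holds a single entry.
        try {
            lookup(0, 1);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("Lookup failed: " + e.getMessage());
        }
    }
}
```

The second lookup fails because the device-local platform index and the global row index are two different numbering schemes that happen to share the same type.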

However, each SPIR-V device instance (OpenCL or LevelZero) only has a view of its own dispatch driver and sets its platform index accordingly. Since there is a single LevelZero dispatch driver, index 0 is correct from the device's point of view, but it does not match the layout of the spirvBackends array, which has a universal view of all the drivers.

This PR proposes the following: instead of getting the backend using the platform index, get it based on the SPIR-V device instance. This way, the LevelZero and OpenCL instances of the SPIR-V devices do not need a view of all the drivers, and the backend returned is guaranteed to be the one associated with the specific SPIR-V device.
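The idea of resolving the backend from the device instance can be sketched as follows. This is an illustration under stated assumptions, not the code merged in this PR: DeviceKeyedLookup, Device, Backend, addPlatform, and the getBackend(Device) signature are all hypothetical names, and the linear search is just one possible way to match an instance.

```java
import java.util.ArrayList;
import java.util.List;

public class DeviceKeyedLookup {

    // Hypothetical stand-in for a SPIR-V device (OpenCL or LevelZero flavour).
    public static final class Device {
        final String runtime;
        final String name;

        public Device(String runtime, String name) {
            this.runtime = runtime;
            this.name = name;
        }
    }

    // Hypothetical stand-in for a backend; it remembers the device it was built for.
    public static final class Backend {
        final Device device;

        Backend(Device device) {
            this.device = device;
        }
    }

    // Universal view: one row per platform, one backend per device in that platform.
    private final List<List<Backend>> backends = new ArrayList<>();

    public void addPlatform(Device... devices) {
        List<Backend> row = new ArrayList<>();
        for (Device device : devices) {
            row.add(new Backend(device));
        }
        backends.add(row);
    }

    // Resolve the backend by device instance, so the caller never needs to
    // know which row of the universal view the device ended up in.
    public Backend getBackend(Device device) {
        for (List<Backend> row : backends) {
            for (Backend backend : row) {
                if (backend.device == device) {
                    return backend;
                }
            }
        }
        throw new IllegalArgumentException("No backend registered for device " + device.name);
    }

    public static void main(String[] args) {
        Device oclArc = new Device("OPENCL", "Intel(R) Arc(TM) A770 Graphics");
        Device oclUhd = new Device("OPENCL", "Intel(R) UHD Graphics 770");
        Device lzArc = new Device("LEVEL_ZERO", "Intel(R) Arc(TM) A770 Graphics");
        Device lzUhd = new Device("LEVEL_ZERO", "Intel(R) UHD Graphics 770");

        DeviceKeyedLookup registry = new DeviceKeyedLookup();
        registry.addPlatform(oclArc);
        registry.addPlatform(oclUhd);
        registry.addPlatform(lzArc, lzUhd);

        Backend backend = registry.getBackend(lzUhd);
        System.out.println(backend.device.runtime + ": " + backend.device.name);
    }
}
```

Matching on the instance removes the dependency between the device-local platform index and the row order of the shared array, at the cost of a search over a handful of entries.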

Backend/s tested

Mark the backends affected by this PR.

  • OpenCL
  • PTX
  • SPIRV

OS tested

Mark the OS where this PR is tested.

  • Linux
  • OSx
  • Windows

Did you check on FPGAs?

If it is applicable, check your changes on FPGAs.

  • Yes
  • No

How to test the new patch?

The problem was identified by running the following test on the cyclone server:
tornado --jvm="-Dtornado.spirv.version=1.0 -Ds0.t0.device=0:3" --debug -m tornado.examples/uk.ac.manchester.tornado.examples.arrays.ArrayAddInt


@mairooni added the bug and spirv labels on Sep 6, 2024
@mairooni mairooni self-assigned this Sep 6, 2024

@stratika stratika left a comment

LGTM, I tested it on the server that has two LevelZero devices.

@jjfumero jjfumero merged commit 0a2da6f into beehive-lab:develop Sep 10, 2024
2 checks passed
jjfumero added a commit to jjfumero/TornadoVM that referenced this pull request Sep 30, 2024
Improvements
============
- beehive-lab#565: New API call in the Execution Plan to log/trace the executed configuration plans.
- beehive-lab#563: Expand the TornadoVM profiler with Level Zero Sysman Energy Metrics.
- beehive-lab#559: Refactoring Power Metric handlers for PTX and OpenCL.
- beehive-lab#548: Benchmarking improvements.
- beehive-lab#549: Prebuilt API tests added using multiple backend-setup.
- Add internal tests for monitoring memory management [link](beehive-lab@0644225).

Compatibility
=============
- beehive-lab#561: Build for OSx 14.6 and OSx 15 fixed.

Bug Fixes
==============
- beehive-lab#564: Jenkins configuration fixed to run KFusion per backend.
- beehive-lab#562: Warmup action from the Execution Plan fixed to run with correct internal IDs.
- beehive-lab#557: Shared Execution Plans Context fixed.
- beehive-lab#553: OpenCL compiler flags for Intel Integrated GPUs fixed.
- beehive-lab#552: Fixed runtime to select any device among multiple SPIR-V devices.
- Fixed zero extend arithmetic operations: [link](beehive-lab@ea7b602).
@jjfumero jjfumero mentioned this pull request Sep 30, 2024
8 tasks