Vitis accelerator #991

Open · wants to merge 44 commits into main from vitis_accelerator_dev

Conversation


@axiotisk axiotisk commented Apr 5, 2024

Description

The Vitis Accelerator Backend builds upon the foundation laid by the Vitis backend and streamlines the generation process for PCIe accelerators using the Vitis Accelerator Flow.
Features:

  • This backend inherits from the Vitis backend, ensuring compatibility with existing workflows and projects.
  • Converts the input of the top-level design from AXI Stream to memory-mapped and the output from memory-mapped to AXI Stream.
  • Automates the generation of host code and the necessary makefile for kernel compilation.
  • Please note that the software and hardware emulation features are still a work in progress and will be added in subsequent commits.

Type of change


  • Bug fix (non-breaking change that fixes an issue)
  • Documentation update
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • A new research paper code implementation
  • Other (Specify)

Tests

The backend has been tested with the hls4ml getting started tutorial example.

Test Configuration:
The Vitis version used for the validation is 2022.2.
The functionality of the project was tested on a VCK5000 accelerator board.

Checklist

  • I have read the guidelines for contributing.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have made corresponding changes to the documentation.
  • My changes generate no new warnings.
  • I have installed and run pre-commit on the files I edited or added.
  • I have added tests that prove my fix is effective or that my feature works.

@jmitrevs jmitrevs added this to the v1.1.0 milestone Apr 5, 2024
@axiotisk axiotisk force-pushed the vitis_accelerator_dev branch 2 times, most recently from e92a6be to 86f75b5 Compare June 14, 2024 09:15
@qberthet qberthet force-pushed the vitis_accelerator_dev branch from c875785 to 64c8baa Compare July 2, 2024 13:13
@axiotisk axiotisk marked this pull request as ready for review July 10, 2024 17:14
@@ -865,6 +865,14 @@ class TraceData(ctypes.Structure):
else:
return output, trace_output

def hardware_predict(self, x, **kwargs):
Contributor
This method has been added to enable performing predictions directly on the FPGA from the Python code. It feels a bit intrusive to add this backend-specific code to the hls4ml core. Another approach could be to modify predict() to allow backend-specific overloading. So, model.hardware_predict(x) could become model.predict(x, target='hw'), but this also requires some modification of the existing core code. Could an hls4ml dev provide advice on the best approach here? (@vloncar, @jmitrevs?). Thanks!

Contributor
I would be in favor of naming this somewhat differently (predict_hw, for example) and moving the exception to the backends (putting it in FPGABackend should be enough to cover those that don't/cannot support it).

A longer-term idea would be to have three ways of doing prediction: predict_emu (emulation, the current one), predict_sim (simulation via pyverilator) and predict_hw (the real deal), with predict being predict_emu by default, and perhaps a switch for the user to control which one is called if it's just predict(x).
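As a rough illustration of that longer-term idea, a thin dispatcher could route predict() to the chosen path. Everything below (the method names, stub bodies, and the target keyword) is a sketch of the proposal, not hls4ml's actual API:

```python
class ModelGraph:
    """Minimal sketch of the proposed three-way dispatch; names are hypothetical."""

    def predict_emu(self, x):
        # Current behavior: emulation of the compiled model.
        # A stand-in computation replaces the real emulation path here.
        return [v * 2 for v in x]

    def predict_sim(self, x):
        # RTL simulation, e.g. via pyverilator (not implemented in this sketch).
        raise NotImplementedError("simulation backend not available")

    def predict_hw(self, x):
        # Real inference on the FPGA; backends that cannot support it would
        # raise from a shared base class such as FPGABackend.
        raise NotImplementedError("no accelerator backend configured")

    def predict(self, x, target="emu"):
        # Default stays the emulation path; the user can switch targets.
        dispatch = {
            "emu": self.predict_emu,
            "sim": self.predict_sim,
            "hw": self.predict_hw,
        }
        return dispatch[target](x)
```

With this shape, `model.predict(x)` keeps its current meaning, while `model.predict(x, target="hw")` opts into hardware execution.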

@qberthet qberthet force-pushed the vitis_accelerator_dev branch from abd46ca to 18f7fc7 Compare January 12, 2025 22:01
@qberthet
Contributor

Since testing of this PR was mentioned in the minutes of the last dev meeting, the most recent work on the host code provided with the VitisAccelerator backend has been pushed to ensure testing of the latest version (also rebased on the current main). The changes include:

- Multiple devices support
- Selection of device by BDF
- OpenCL error checking
- Automatic memory bank association
- Inferences validation
- Improved command line parameters
- Improved debug output
- Dummy buffer copy to avoid benchmarking buffer allocation time
- Removal of mutexes preventing buffer copies overlap with kernel executions on the same CU with multiple workers
- Documentation
@qberthet qberthet force-pushed the vitis_accelerator_dev branch from 18f7fc7 to 22401ba Compare February 8, 2025 08:52
@bo3z bo3z modified the milestones: v1.1.0, v1.2.0 Apr 8, 2025
@bo3z
Contributor

bo3z commented Apr 8, 2025

I've done a first pass, going through the hls4ml-core changes. Most of the comments are minor, just to make sure the code is consistent with the rest of the hls4ml codebase. In the following days, I'll also try out the VitisAccelerator on a local set-up with a U55C / U250 and try to review the accelerator-specific (templates, build files etc.) changes.

Overall, a very nice addition to the hls4ml codebase and seems very orthogonal to all the other functionality, so shouldn't be many issues with merging it soon.

@qberthet
Contributor

qberthet commented Apr 8, 2025

Thanks for the review. There is probably some room for improvement, so please comment on your testing experience. We intend to do a polishing pass, mostly to provide a more seamless integration from the Python code, but maybe this can be done in a subsequent PR if the current PR is deemed usable enough.

@bo3z
Contributor

bo3z commented Apr 10, 2025

I just tried testing the VitisAccelerator backend on Alveo u55c and Alveo u250, but there were some issues:

  • The biggest issue is timing violations: on both the u55c and the u250 there is a very large WNS, around -3 ns to -5 ns. I tried synthesising with clock periods of 4 ns and 5 ns, both with 27% uncertainty. I also tried lowering the batch size to 1 (hoping it would simplify the logic and reduce congestion). Finally, I tried both with and without hw_quant. All of these cases, on both boards, had significant timing violations, which is a bit unexpected. To me this looks like a missing constraint in the build process or something similar. I've commonly seen timing violations on the u55c around the HBM, but they are usually much smaller (-0.5 ns) and can be fixed with some floor-planning and more advanced Vivado directives.

  • I had to change the platform for the u250 to xilinx_u250_gen3x16_xdma_4_1_202210_1, because Vitis reported "Platform not found" with the one in this PR (xilinx_u250_xdma_201830_2), even though a quick Google search does find it. I am wondering whether there are several versions of the u250?

  • On the u250, after I changed the platform, the placer failed during implementation. There was a constraint (I guess generated by hls4ml) that forces the model kernel onto SLR0, but this specific model couldn't fit into SLR0. The model was the jet tagging model, so not too large; still, I think we should avoid such explicit placement of kernels onto SLRs, as it can be quite hard to estimate the resource usage of a model before actual synthesis. Per-SLR placement should probably be left to more advanced users who have trouble meeting timing, in my opinion.
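For reference, this kind of per-SLR pinning is usually expressed as a v++ linker connectivity option, which advanced users could opt into themselves rather than having it generated by default. A minimal sketch (the compute-unit name is hypothetical):

```ini
# v++ linker config file, passed with: v++ --link --config ./link.cfg ...
# Pins compute unit myproject_kernel_1 (illustrative name) to SLR0.
# Omitting this line leaves SLR assignment to the placer.
[connectivity]
slr=myproject_kernel_1:SLR0
```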

@bo3z
Contributor

bo3z commented Apr 10, 2025

So in response to the above comment: the significant timing issues are only for io_parallel; io_stream has no such issues.

@qberthet
Contributor

Thanks again for taking the time to test this!

Yes, timing closure is very design-dependent and is generally expected to be handled by the model creator. That said, you raise a good point: we mostly tested with io_stream, so we didn’t encounter this kind of issue. Since io_stream is typically the preferred option for large models in acceleration contexts, this choice made sense for our use case. However, it does make quick evaluations using io_parallel less effective (and this might be a use case for this backend). Perhaps an io_parallel-optimized version in the HLS wrapper could help address this.

Regarding the platform: yes, there are multiple platform versions (think FPGA shell versions) for each board. Rather than trying to cover all cases, our goal is to offer an easy way for users to switch between them while providing sensible defaults, though these may change over time. We should make this clearer in the documentation, or at least point to the AMD documentation on XRT platforms.
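That "sensible default with an easy override" could be sketched as below. The u250 platform name is the one mentioned earlier in this thread; the helper itself and its shape are purely illustrative, not the backend's actual API:

```python
# Hypothetical helper: map a board to a default XRT platform (shell), while
# still letting the user override it. Only the u250 entry is taken from the
# discussion above; the structure is illustrative.
DEFAULT_PLATFORMS = {
    "u250": "xilinx_u250_gen3x16_xdma_4_1_202210_1",
    # further boards would get their validated shell names here
}

def resolve_platform(board, override=None):
    """Return the platform to target: an explicit override wins over the default."""
    if override is not None:
        return override
    try:
        return DEFAULT_PLATFORMS[board]
    except KeyError:
        raise ValueError(f"no default platform known for board {board!r}")
```

A user with a different shell installed would then pass the platform explicitly instead of editing the defaults.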

You’re also right about constraints: we shouldn’t provide any by default. It’s better to let users add them as needed for their specific designs. In the same spirit, we’ve removed the explicit buffer-object memory associations as well.

We’ll be fixing the constraint handling and updating the platform documentation and defaults soon. Updating the wrapper to better support io_parallel might take a bit longer, so that could come in a future PR.

6 participants