Releases: pulp-platform/Deeploy

Release v0.2.1

05 Feb 12:31

This release includes improvements to the tiling and DMA code generation, new networks and operators, improved CI workflows, migration to PyTest, and support for PyPI package releases.

Note: Since the release tag references the Docker container tagged with the release tag (ghcr.io/pulp-platform/deeploy:v0.2.1), the CI will initially fail. The Deeploy Docker image must be built after the release PR is merged, and the CI must then be restarted.

List of Pull Requests

  • PyPi Package Deployment + Remove Banshee Dept #154
  • PyTest Migration #144
  • Update submodule pulp-nn-mixed #145
  • Improve Profiling #138
  • FP32 ReduceMean operator improvement #137
  • Support for RMSNorm (Pow and Sqrt operators) #136
  • Demo TinyViT compatibility with tiled Siracusa #124
  • TinyViT on non-tiled Siracusa #117
  • Support Fully Asynchronous DMAs #114
  • Disallow shape inference #128
  • Remove memory-aware node bindings #123
  • Fix missing const's layout transformation and refactor NCHWtoNHWC passes #122
  • Fix aliasing #125
  • Support for 1D Autoencoder #98
  • Refactor Logging for Improved Debugging #115
  • Add reuse-tool as an SPDX license header linter #113
  • Bug fixes, API Cleanup and Reduce Compiler Warning on PULP #112
  • Fix PULP GEMM batch serialization #109
  • Split CI Workflows by Platform and Task, Improve Formatting and Linting Reliability #108
  • Refactor tiling code generation #105
  • Change order of typeMatching entries #68
  • Node Mangling to avoid duplication #93
  • Prepare Post v0.2.0 Release #104
  • Use Docker digests instead of arch-specific tags #106
  • Fix Unsqueeze Op. when using ONNX opset 13 or higher (from attribute to input) #119
  • Fix bias hoisting in generic GEMM with no bias #126

Added

  • The publish.yml action to build a branch and push it to PyPI. The action is triggered automatically when a tag matching the "v*" format is pushed.
  • A release of Banshee, so that it no longer needs to be rebuilt every time. The Makefile now pulls that release depending on the platform.
  • Bumped the onnx-graphsurgeon version so that NVIDIA's PyPI index is no longer needed.
  • _export_graph now assigns each tensor its export type before export.
  • pytest and pytest-xdist as dependencies of Deeploy.
  • A pytest.ini for the global configuration of PyTest for the project.
  • conftest.py to define CLI args for PyTest for the whole project. It also defines a set of global fixtures and markers.
  • pytestRunner.py contains helper functions and fixtures for the whole project.
  • test_platforms.py lists the E2E tests and sorts them into marked categories (per platform and per kernel/model).
  • Each platform has a test config file where a list or a dict describes the tests.
  • Support for unknown number of data dimensions in the tiler
  • Parallelization support for the FP32 ReduceMean operator on PULPOpen
  • Extensive testing for the ReduceMean operator
  • Pass to remove ReduceMean operators that change only the shape of the data, not its content
  • Support for RMSNorm operation via operator decomposition.
  • Added Pow (Power) and Sqrt (Square Root) operation support (Parsers, Layers, Bindings, Templates, and FP32 Kernels) for the Generic platform.
  • Support for input tiling for PULP FP regular and DW conv 2D.
  • CI tests for tiled Siracusa FP regular and DW conv 2D, with and without bias, for skip connections, and for the demo version of TinyViT.
  • Documentation for PULP FP regular and DW conv 2D and MatMul tile constraints.
  • PULP ReduceMean and Slice tile constraints.
  • PULP 2D FP DW conv Im2Col template and kernel, with bias support.
  • Bias support for PULP 2D FP regular conv Im2Col in template & kernel.
  • PULP FP DW conv 2D parser.
  • FP conv 2D (simple & DW), reshape & skip connection, and TinyViT demo tests to the non-tiled Siracusa CI pipeline.
  • FP bindings and mappings for PULP slice, DW conv 2D, and reduce mean operations.
  • FP PULP DW conv lowering optimization pass similar to the existing one for the integer version.
  • RemoveEmptyConvBiasPass to the PULP optimizer.
  • Add manual type inference feature (CLI: --input-type-map/--input-offset-map) to resolve ambiguities when test inputs are not representative enough
  • Added a testTypeInferenceDifferentTypes test case to validate type inference for different input types
  • Added _mangleNodeNames function to avoid duplicate node mappings
  • Output Docker image digests per platform (amd64, arm64) after build, which are used to construct the multi-arch Docker manifest. This prevents registry clutter caused by unnecessary per-architecture Docker tags.
  • AsyncDma abstraction of DMAs
  • test runner per DMA and a script that tests all the DMAs
  • generic Single/DoubleBufferingTilingCodeGeneration classes
  • TilingVariableReplacementUpdate class that updates the variable replacement refs
  • TilingHoistingMixIn class that encapsulates all the hoisting helper functions of tiling
  • sorting of input memory allocations to allow references that live in the same memory level as the memory they are referencing
  • a function that tests the tiling solution for correctness; it currently only checks buffer allocations for byte alignment
  • IntrospectiveCodeTransformation: _indexPointer(), indexVars(), dereferenceVars(). The *Vars functions index/dereference a list of variables (useful for tiling)
  • NetworkContext: unravelReference() that unravels a _ReferenceBuffer until the base buffer
  • NetworkContext: is_object() - helper function that determines whether the string represents a name of a local or global object
  • NetworkContext: is_buffer() - helper function that determines whether the string represents a name of a buffer
  • missing checks for environment variables
  • _permuteHyperRectangle helper function
  • Added CI badges to the README
  • Added YAML linting to CI
  • Added missing license headers and C header include guards
  • Extended the pre-commit hooks to remove trailing whitespace, check licenses, format and lint files
  • Reshape operator support for PULP (ReshapeTemplate in bindings)
  • Missing class attributes in Closure.py
  • reuse_skip_wrapper.py to manually skip files
  • Centralized logging with DEFAULT_LOGGER, replacing print statements
  • Debug logs for type checking/parsing; __repr__ for core classes
  • Buffer utilities: checkNumLevels validation and sizeInBytes method
  • Per–memory-level usage tracking and worst-case reporting in NetworkContext
  • Memory/I/O summaries and input/output logging in deployers
  • RequantHelpers.py for Neureka's TileConstraints
  • Added assertion that all the graph tensors after lowering have a shape annotated
  • Added testFloatGEMMnobias
  • Profiling support and optional comments in generated DMA code for better traceability
  • Added new waiting-strategy logic with fine-grained PerTensorWaitingStrategy
  • PULPClusterEngine now accepts a n_cores parameter to set the number of cores used
  • annotateNCores method to PULPDeployer that adds an n_cores key to all PULPClusterEngine templates' operatorRepresentations
  • Calculate non-kernel overhead and show total time spent during profiling
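
As a rough illustration of the PyTest setup described in the list above (a conftest.py that defines CLI arguments and markers, with tests selected by marker), a minimal sketch might look like the following. The option name --example-toolchain and all marker names are hypothetical and not taken from the Deeploy sources; only the hooks pytest_addoption and pytest_configure are standard pytest API.

```python
# conftest.py (illustrative sketch, not Deeploy's actual configuration)

# Hypothetical marker names: one group per platform, one per test category.
PLATFORM_MARKERS = ("generic", "siracusa")
CATEGORY_MARKERS = ("kernels", "models")


def pytest_addoption(parser):
    """Register a project-wide CLI argument (standard pytest hook)."""
    parser.addoption(
        "--example-toolchain",  # hypothetical flag, assumed for illustration
        action="store",
        default="LLVM",
        help="Toolchain used to build the generated code.",
    )


def pytest_configure(config):
    """Declare the markers so that `pytest -m <marker>` can select tests."""
    for name in PLATFORM_MARKERS:
        config.addinivalue_line("markers", f"{name}: tests for the {name} platform")
    for name in CATEGORY_MARKERS:
        config.addinivalue_line("markers", f"{name}: end-to-end {name} tests")
```

With markers declared this way, a CI job could select a subset with something like `pytest -m "siracusa and kernels" -n 4`, where `-n` comes from pytest-xdist. The actual marker names used by the project may differ.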

Changed

  • Renamed the package from PULP-Deeploy to deeploy-pulp.
  • Each CI workflow has been simplified to call the pytest suite with certain markers.
  • Structure of Tests subdir for improved ordering
  • Structure of .gitignore file for improved ordering
  • Decreased L1 maximal memory limit for CI pipeline tests where compatible thanks to the implementation of Conv2D input tiling support.
  • Reduced size of reshape & skip connection test, for non-tiled Siracusa memory compatibility.
  • Replaced platform-specific tags (*-amd64, *-arm64) with direct digest references in Noelware/docker-manifest-action.
  • mchan HAL is now reduced to bare-bones
  • refactor of the IntrospectiveCodeTransformation to work on the Mako template
  • refactor of memory allocation code transformation passes
  • _ReferenceBuffer accepts an optional offset argument to offset the reference
  • NetworkContext: hoistReference - accepts the actual buffer as reference instead of name, accepts shape, offset, and override_type arguments, and returns the actual buffer, not its name
  • _mangleNodeRep -> _mangleOpRepr - the canonical name we use is OperatorRepresentation. NodeRep and ParseDict are old iterations of the name.
  • rename of permutation functions to follow this convention: permute is an action that permutes something, `permutat...

v0.2.0

08 Jul 13:58
v0.2.0
c7fd2a1

Release v0.2.0 (2025-07-08) #103

This release contains major architectural changes, new platform support, enhanced simulation workflows, floating-point kernel support, training infrastructure for CCT models, memory allocation strategies, and documentation improvements.

List of Pull Requests

  • Prepare v0.2.0 release #102
  • Add Luka as Code Owner #101
  • Fix CI, Docker Files, and Documentation Workflow #100
  • Chimera Platform Integration #96
  • Add Tutorial and Refactor README #97
  • Reduce Mean Float Template #92
  • Reshape Memory Freeing and Generic Float GEMM Fixes #91
  • Prepare for Release and Separate Dependencies #90
  • Fix input offsets calculation #89
  • Move PULP SDK to main branch/fork #88
  • Finite Lifetime for IO Tensors #51
  • Improved Memory Visualization and Multi-Layer Tiling Profiling #56
  • Fix Linting in CI and Reformat C Files #86
  • Fix Broken CMake Flow For pulp-sdk #87
  • Refactor Changelog For Release #85
  • ARM Docker Container and Minor Bug Fix #84
  • Added Kernel for Generic Float DW Conv2D #63
  • Autoselect Self-Hosted Runners if the Action is on Upstream #81
  • TEST_RECENT linking on MacOS #78
  • Add RV32IMF Picolibc support for Siracusa platform #66
  • Improve Documentation and VSCode Support #76
  • Debug Print Topology Pass and Code Transformation #75
  • Find all subdirectories of Deeploy when installing with pip install #70
  • Add milestone issue template #71
  • Bunch of fixes and changes #58
  • Add SoftHier platform #65
  • rv32imf_xpulpv2 ISA support for Siracusa platform #64
  • One LLVM To Compile Them All #60
  • One GVSoC to Simulate Them All #59
  • Add Support for CCT Last Layer Training with Embedding Dim 8-128 #55
  • Add CCT Classifier Training Support #53
  • L3 Bugs: DMA Struct Datatype and Maxpool Margin Error #45
  • DeepQuant Quantized Linear Support #54
  • Implemented Dequant Layer for Generic and Siracusa #52
  • Infinite Lifetime Buffers Considered in Tiling & Memory Allocation (+ Visualization) #44
  • Implemented Quant Layer for Generic and Siracusa #49
  • Increase maximal Mchan DMA transfer sizes from 64KiB to 128KiB #47
  • Add MiniMalloc and Decouple Memory Allocation and Tiling #40
  • Float CCT Bugs on L3 #37
  • Memory Allocation Strategies and Visualization #36
  • Add CODEOWNERS #42
  • Add Tiling Support to All CCT Kernels and Fix CCT Operators on Siracusa Platform for L2 #35
  • Add Fp gemm and Softmax for Snitch platform #31
  • Add Float Kernels for CCT #29
  • documentation deployment #34
  • main.c Float Cast Bugs #28
  • Add Float GEMM on PULP with Tiling #26
  • Add Float Support & Float GEMM for Generic #25
  • GVSOC support for the Snitch Cluster platform #23
  • Snitch Cluster Tiling Support #22
  • Snitch support integration #14
  • Update bibtex citation #20
  • the PR template location, bump min python to 3.10, change install command #17
  • Add pre-commit for python formatting #15
  • FP integration (v2) #12
  • shell for sequential tests of Generic, Cortex, and Mempool platforms #11
  • Add issue templates #10
  • Minor CI and Readme Improvements #8
  • Fix GHCR Link for Docker Build #7
  • neureka's ccache id #6
  • GitHub-based CI/CD Flow #4
  • Generic Softmax Kernel #2
  • Port GitLab CI #1

Added

  • ChimeraDeployer, currently mainly a placeholder
  • Allocate templates for Chimera
  • ChimeraPlatform, using appropriate allocation templates and using the generic Parser + Binding for the Add node
  • Adder CI test for Chimera
  • Install flow for chimera-sdk via Makefile
  • DeeployChimeraMath library
  • Generic FP32 reduce mean bindings, parser, and template
  • New alias list parameter for buffer objects
  • New test, also included in the CI pipeline, for the reshape and skip connection situation
  • 'shape' parameter handling similar to the 'indices' parameter in the generic reshape template
  • Test the correctness of the memory map generated by the tiler
  • Add attribute to VariableBuffer to distinguish I/Os
  • Add proper static memory allocation with finite lifetime for I/Os
  • The memory allocation visualization now displays the allocation for each level used
  • Tutorial section in the documentation
  • Guide on using the debug print topology pass and code transformation
  • VSCode configuration files for improved IDE support
  • Multi-branch GitHub Pages deployment support
  • Test for the DebugPrintTopologyPass.
  • Test for PrintInputGeneration, PrintOutputGeneration, MemoryAwarePrintInputGeneration, MemoryAwarePrintOutputGeneration
  • check for CMAKE variable and fall back to searching for cmake
  • tensor name mangling
  • identity operation removal
  • _unpack_const helper function to NodeParser to allow for node attributes that are direct Constant tensors or direct numpy values
  • load_file_to_local in dory_mem as a way to load values directly into a local memory (not RAM). This is needed for copying values from flash to wmem, which is required for Neureka v2.
  • Add the documentation.yml workflow to deploy doc pages.
  • Improved README with more detailed Getting Started section, a section listing related publications, and a list of supported platforms.
  • Schedule a CI run every 6 days at 2AM CET to refresh the cache (it expires after 7 days if unused).
  • Add the FloatImmediate AbstractType
  • Define fp64, fp32, fp16, and bf16
  • Add float binding for the Adder in the Generic platform
  • Add a FloatAdder test to the CI for Siracusa and Generic platforms
  • Extend testType.py with float tests
  • LIMITATION: The current LLVM compiler does not support bf16 and fp16; these types are commented out in the library header
  • cMake Flow for the Snitch Cluster
  • Added snitch_cluster to Makefile
  • New Snitch platform with testing application
  • Testrunner for tiled and untiled execution (testRunner_snitch.py, testRunner_tiled_snitch.py)
  • Minimal library with CycleCounter and utility function
  • Support for single-buffered tiling from L2.
  • Parsers, Templates, TypeCheckers, Layers, and TCF for the newly supported operators.
  • A code transformation pass to filter DMA cores or compute cores for an ExecutionBlock.
  • A code transformation pass to profile an ExecutionBlock.
  • Test for single kernels, both with and without tiling.
  • Adds the --debug flag to cargo install when installing Banshee to get the possibility of enabling the debug prints.
  • New tests for the snitch_cluster platform.
  • Add macros to main.c to disable printing and testing (convenient when running RTL simulations).
  • gvsoc in the Makefile and dockerfile
  • cmake flow for gvsoc
  • CI tests regarding Snitch run on GVSOC as well
  • Float Support for Constbuffer
  • Simple Float GEMM on Generic and PULP
  • FP...