Pull Request: Machine Learning Integration for DaCe

Overview

This PR adds comprehensive machine learning capabilities to DaCe through three tightly integrated components:

  1. Automatic Differentiation (AD) - Reverse-mode gradient computation for SDFGs
  2. ONNX Integration - Import and execute neural network models
  3. PyTorch Integration - Bidirectional interoperability with PyTorch's autograd system

Together, these components enable DaCe to optimize and accelerate machine learning workloads, particularly neural network training and inference.

High-Level Architecture

PyTorch Model
     ↓
  ONNX Export
     ↓
DaCe SDFG (Forward)
     ↓
Automatic Differentiation
     ↓
DaCe SDFG (Backward)
     ↓
Compiled Code Generation
     ↓
PyTorch Operator (with Autograd)

Component 1: Automatic Differentiation (dace/autodiff/)

Purpose

Provides reverse-mode automatic differentiation for SDFGs, enabling gradient computation for any DaCe program. This is the foundation for neural network training and gradient-based optimization.

Key Capabilities

  • Full SDFG Support: Differentiates maps, tasklets, nested SDFGs, loops, and library nodes
  • Control Flow: Handles loops (LoopRegion) and conditionals
  • ONNX Operations: 50+ backward implementations for ONNX operators
  • Data Forwarding: Flexible strategies (store vs. recompute) for memory/compute tradeoffs
  • Extensible Registry: Plugin-based system for adding backward rules

Core Algorithm

  1. Forward Pass Execution: Run original computation and identify required intermediates
  2. Backward Pass Generation: Traverse computation graph in reverse, accumulating gradients
  3. Node Reversal: Each forward node (Map, Tasklet, ONNXOp) has a registered backward implementation
  4. Gradient Accumulation: Use write-conflict resolution (WCR) for multi-path gradients
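
As a rough sketch of how this could be driven from user code (the module path and the add_backward_pass signature below are assumptions based on this component, not a confirmed API):

import numpy as np
import dace
from dace.autodiff import add_backward_pass  # assumed entry point

@dace.program
def square_sum(X: dace.float64[10], S: dace.float64[1]):
    S[0] = np.sum(X * X)

sdfg = square_sum.to_sdfg()
# Request gradients of S with respect to X; contributions from multiple
# paths are accumulated via WCR memlets (e.g. wcr="lambda a, b: a + b")
add_backward_pass(sdfg, inputs=["X"], outputs=["S"])  # assumed signature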

Key Files

| File | Lines | Purpose |
|------|-------|---------|
| backward_pass_generator.py | ~800 | Core AD engine that orchestrates backward pass generation |
| implementations/onnx_ops.py | ~2000 | Backward implementations for 50+ ONNX operations |
| implementations/dace_nodes.py | ~600 | Backward rules for core SDFG elements (Tasklet, Map, etc.) |
| data_forwarding/manager.py | ~300 | Store vs. recompute strategy coordination |

Component 2: ONNX Integration (dace/libraries/onnx/)

Purpose

Enables importing and executing ONNX neural network models within DaCe. Converts ONNX graphs to optimized DaCe SDFGs for efficient execution on CPU/GPU.

Key Capabilities

  • Model Import: Load ONNX models from files or protobuf objects
  • 100+ Operations: Dynamically generated node classes for all ONNX ops
  • Shape Inference: Automatic symbolic and concrete shape computation
  • Multi-Strategy Implementations: Pure (correctness), optimized (performance), hardware-specific
  • Type Safety: Schema-based validation and type checking

Core Architecture

Dynamic Node Generation:

  • Registry system generates Python classes for all ONNX operations at import time
  • Each operation has schema, properties, connectors, and implementations
  • Example: ONNXConv, ONNXMatMul, ONNXSoftmax (100+ generated classes)
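
For illustration, a generated node class can be placed in an SDFG like any other library node; the wiring below is a sketch (import path assumed), with connector names taken from the ONNX MatMul schema:

import dace
from dace.libraries.onnx import ONNXMatMul  # class generated by the registry; path assumed

sdfg = dace.SDFG("matmul_example")
sdfg.add_array("A", [4, 8], dace.float32)
sdfg.add_array("B", [8, 2], dace.float32)
sdfg.add_array("Y", [4, 2], dace.float32)

state = sdfg.add_state()
node = ONNXMatMul("matmul")
state.add_node(node)
# Connector names (A, B, Y) follow the ONNX operator schema
state.add_edge(state.add_read("A"), None, node, "A", sdfg.make_array_memlet("A"))
state.add_edge(state.add_read("B"), None, node, "B", sdfg.make_array_memlet("B"))
state.add_edge(node, "Y", state.add_write("Y"), None, sdfg.make_array_memlet("Y"))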

Implementation Strategies:

  1. Pure Implementations (pure_implementations.py): Reference implementations in Python/NumPy
  2. Optimized Implementations (img_op_implementations.py): Hand-crafted SDFGs for performance
  3. Hardware-Specific: Future GPU/FPGA specialized implementations
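
Conceptually, a pure implementation is a NumPy-syntax DaCe program for a single operator. A sketch for ONNX Relu (the registration decorator used in pure_implementations.py is omitted here):

import numpy as np
import dace

@dace.program
def relu_pure(X: dace.float32[64], Y: dace.float32[64]):
    # Reference semantics of ONNX Relu
    Y[:] = np.maximum(X, 0)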

Import Pipeline:

ONNX Model → Validation → Shape Inference → Simplification → SDFG Construction → Compilation
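
A sketch of driving this pipeline from Python (the ONNXModel class name and import path are assumptions based on onnx_importer.py):

import onnx
import numpy as np
from dace.libraries.onnx import ONNXModel  # assumed import path

model_proto = onnx.load("model.onnx")            # file or protobuf object
dace_model = ONNXModel("my_model", model_proto)  # validation, shape inference, SDFG construction
# Example invocation; input shape depends on the model
outputs = dace_model(np.random.rand(1, 3, 224, 224).astype(np.float32))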

Key Files

| File | Lines | Purpose |
|------|-------|---------|
| onnx_importer.py | 711 | Main entry point, orchestrates import pipeline |
| op_implementations/pure_implementations.py | 3052 | Reference implementations for 40+ operations |
| nodes/onnx_op_registry.py | 325 | Dynamic node class generation |
| schema.py | 390 | Type system and validation |
| shape_inference/symbolic_shape_infer.py | 1976 | Symbolic shape inference (Microsoft-sourced) |

Component 3: PyTorch Integration (dace/libraries/torch/)

Purpose

Provides bidirectional integration between PyTorch and DaCe. Enables optimizing PyTorch models with DaCe while maintaining PyTorch's autograd compatibility.

Key Capabilities

  • Model Optimization: Convert torch.nn.Module to optimized DaCe SDFGs
  • Autograd Integration: Backward pass generation integrates with PyTorch's autograd
  • Dual Dispatch: C++ extension (performance) or CTypes (flexibility)
  • Zero-Copy Tensors: DLPack protocol for efficient memory sharing
  • Training Support: Full forward + backward pass compilation

Core Architecture

Integration Flow:

PyTorch Model → ONNX Export → DaCe SDFG → Backward Generation → Compilation → PyTorch Operator

Dispatcher Strategies:

  1. C++ Extension (cpp_torch_extension.py): Native PyTorch operator with autograd
    • High performance
    • 64 parameter limit
    • Slower compilation
  2. CTypes Module (ctypes_module.py): Pure Python dispatcher
    • Unlimited parameters
    • Faster compilation
    • Slight overhead
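
The dispatcher could be selected when wrapping the model; the flag name below is hypothetical and only illustrates the choice between the two strategies:

from dace.frontend.python import DaceModule

# compile_torch_extension is a hypothetical flag name; model and dummy_inputs
# are as in the training example further below
dace_model = DaceModule(model, dummy_inputs, backward=True,
                        compile_torch_extension=False)  # use the CTypes dispatcher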

Zero-Copy Memory Sharing:

  • DLPack protocol enables PyTorch tensors to view DaCe memory without copying
  • Bidirectional: DaCe → PyTorch (outputs) and PyTorch → DaCe (inputs)
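
The mechanism can be illustrated with the standard DLPack APIs of recent NumPy/PyTorch versions (the DaCe-side wrapper in dlpack.py is not shown; this only demonstrates the zero-copy property):

import numpy as np
import torch

buffer = np.zeros((4, 4), dtype=np.float32)  # stand-in for a DaCe-owned array
tensor = torch.from_dlpack(buffer)           # zero-copy view over the same memory

tensor[0, 0] = 42.0
assert buffer[0, 0] == 42.0                  # writes are visible on both sides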

Key Files

| File | Lines | Purpose |
|------|-------|---------|
| dispatchers/cpp_torch_extension.py | 717 | C++ code generation for PyTorch operators |
| dispatchers/ctypes_module.py | 230 | CTypes-based dispatcher |
| dlpack.py | 199 | Zero-copy tensor sharing via DLPack |
| environments/pytorch_env.py | 94 | CMake build configuration |

How Components Work Together

Example: Training a PyTorch Model with DaCe

import torch
from dace.frontend.python import DaceModule

# 1. Define PyTorch model (MyNeuralNetwork, criterion, and dataloader are
#    assumed to be defined elsewhere)
model = MyNeuralNetwork()
optimizer = torch.optim.Adam(model.parameters())

# 2. Wrap with DaCe (compiles on first call); dummy_inputs holds example
#    tensors that fix the input shapes for compilation
dace_model = DaceModule(model, dummy_inputs, backward=True)

# 3. Training loop (standard PyTorch code)
for inputs, labels in dataloader:
    optimizer.zero_grad()
    outputs = dace_model(inputs)  # DaCe-optimized forward pass
    loss = criterion(outputs, labels)
    loss.backward()  # DaCe-optimized backward pass
    optimizer.step()

What Happens Internally:

  1. First Call: PyTorch model → ONNX export → DaCe SDFG (via ONNX integration)
  2. Backward Generation: Forward SDFG → Backward SDFG (via autodiff)
  3. Compilation: Both SDFGs compiled to optimized code
  4. Dispatcher: C++ extension or CTypes wrapper created
  5. Forward Pass: DaCe executes optimized forward computation
  6. Backward Pass: DaCe executes generated backward computation
  7. Gradient Return: Gradients flow back to PyTorch optimizer

Data Flow

PyTorch Tensor (input)
    ↓ Zero-copy (DLPack)
DaCe Array
    ↓ Optimized SDFG Execution
DaCe Array (output)
    ↓ Zero-copy (DLPack)
PyTorch Tensor (output)
    ↓ loss.backward()
PyTorch Tensor (grad_output)
    ↓ Zero-copy (DLPack)
DaCe Array (backward pass input)
    ↓ Backward SDFG Execution
DaCe Array (grad_input)
    ↓ Zero-copy (DLPack)
PyTorch Tensor (grad_input)

Testing Strategy

Test Organization

tests/
├── autodiff/                       # AD correctness tests
│   ├── test_single_state.py        # Basic AD operations
│   └── torch/                      # PyTorch integration tests
│       ├── test_training.py        # End-to-end training
│       ├── test_bert_encoder_backward.py    # BERT model
│       └── test_llama_decoder_backward.py   # LLaMA model
│
├── onnx/                           # ONNX import tests
│   ├── test_python_frontend.py     # Basic operations
│   ├── test_bert_subgraphs.py      # Real model subgraphs
│   └── test_input_outputs.py       # I/O handling
│
├── torch/                          # PyTorch integration tests
│   ├── test_lenet.py               # Simple CNN
│   ├── test_bert_encoder.py        # Transformer encoder
│   └── test_llama_decoder.py       # Decoder architecture
│
└── npbench/                        # AD tests on NPBench kernels

Test Coverage

| Component | Test Files | Coverage |
|-----------|------------|----------|
| Autodiff Core | 15+ files | Tasklets, maps, loops, nested SDFGs |
| ONNX Integration | 20+ files | Import, execution, type handling |
| PyTorch Integration | 15+ files | Forward, backward, training loops |

Running Tests

# All basic tests (excluding long-running tests)
pytest -m "(autodiff or torch or onnx) and not long" tests/

# AD tests only
pytest tests/autodiff/

# ONNX tests only
pytest tests/onnx/

# PyTorch tests only
pytest tests/torch/

Known Limitations and Future Work

Current Limitations

  1. Recompute Strategy: Experimental, not production-ready
  2. Control Flow: Conditionals are inlined into the state machine (not reversed as ControlFlowRegions)
  3. Second-Order Gradients: Not yet tested

Documentation

Each component has detailed design documentation. These documents provide:

  • Detailed component descriptions
  • Algorithm explanations
  • Code walkthrough
  • Extension points
  • Implementation notes

Impact on DaCe

Code Additions

| Component | Lines of Code | Files |
|-----------|---------------|-------|
| Autodiff | ~8,000 | 15+ files |
| ONNX | ~7,000 | 20+ files |
| PyTorch | ~1,500 | 10+ files |
| Total | ~16,500 | 45+ files |

Dependencies

New dependencies (already in setup.py):

  • onnx - ONNX model format
  • onnxsim - ONNX graph simplification
  • torch - PyTorch framework (optional)
  • protobuf - Protocol buffers (for ONNX)
  • jax - For numerical validation of gradients in tests
  • transformers - For testing the PyTorch/ONNX frontends
  • efficientnet_pytorch - For testing EfficientNet
