Chapter 1
Before a compiler can optimize a program, it must first understand it. Human experts and traditional compilers achieve this by reasoning about the flow of information: how data moves from a variable's definition to its uses (Data Flow), and the order in which instructions execute (Control Flow).
Standard Machine Learning approaches often treat code as a flat sequence of text, similar to NLP (Natural Language Processing), or as a rigid tree, using ASTs (Abstract Syntax Trees). While useful, these representations tend to overemphasize stylistic choices (like variable naming) while failing to capture the underlying semantics and non-linear flows that define how a program actually executes.
ProGraML (Program Graphs for Machine Learning) is a representation designed to bridge this gap. It treats a program not as raw text, but as a directed multigraph that explicitly encodes the same relational information used by compiler engineers.
ProGraML is a graph-based representation of program semantics designed to be portable and language-independent. Because it operates at the level of IRs (Intermediate Representations), such as LLVM-IR, it can represent programs written in any source language (C, C++, Rust, Swift, Fortran, etc.) that can be compiled into that IR.
Unlike standard IRs, which are designed for compiler transformations, ProGraML is specifically designed for machine learning (Figure 1).
Figure 1: ProGraML sits alongside the compiler pipeline, extracting a graph from the IR to enable deep learning-based optimizations. (Source: Cummins et al., 2021)
At its core, a ProGraML graph consists of two components:

- Vertices ($V$): These represent the fundamental components of the program. Unlike CFGs (Control Flow Graphs), which often group code into coarse "Basic Blocks", ProGraML is fine-grained, with distinct nodes for:
  - Instructions: Operations like `add`, `load`, `br`.
  - Variables: Data carriers like `%1`, `%result`.
  - Constants: Static values like `42`, `3.14`.
- Edges ($E$): These represent the semantic relationships between the vertices. To capture the full behavior of a program, ProGraML uses three distinct types of edges:
  - Control Flow: These edges connect instructions to their successors, modeling the execution path of the program.
  - Data Flow: These edges capture the flow of information. They connect variables/constants to the instructions that use them (operands), and connect instructions to the variables they define (results). These edges are position-aware, encoding the specific order of operands (e.g., to distinguish `a - b` from `b - a`).
  - Call Flow: To enable reasoning across function boundaries (inter-procedural analysis), call edges connect `call` instructions to the entry instructions of the target functions, and return edges link function exits back to the call site.
Definition: A ProGraML graph is a directed multigraph where edges are typed to differentiate Control, Data, and Call flow, and are augmented with position attributes to explicitly encode operand order.
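To make this definition concrete, here is a minimal Python sketch of such a typed, position-attributed multigraph. The class and field names are illustrative only; they do not reflect the actual ProGraML data structures or API.

```python
# Illustrative sketch only; not the real ProGraML data structures.
from dataclasses import dataclass, field
from enum import Enum
from typing import List

class NodeType(Enum):
    INSTRUCTION = "instruction"   # e.g. add, load, br
    VARIABLE = "variable"         # e.g. %1, %result
    CONSTANT = "constant"         # e.g. 42, 3.14

class EdgeType(Enum):
    CONTROL = "control"   # execution order between instructions
    DATA = "data"         # def-use / use-def relationships
    CALL = "call"         # call sites <-> function entries/exits

@dataclass
class Node:
    node_type: NodeType
    text: str   # opcode for instructions, data type for variables/constants

@dataclass
class Edge:
    source: int          # index into ProgramGraph.nodes
    target: int
    edge_type: EdgeType
    position: int = 0    # operand or branch order

@dataclass
class ProgramGraph:
    nodes: List[Node] = field(default_factory=list)
    edges: List[Edge] = field(default_factory=list)
```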
To understand how ProGraML represents a program, let's trace the lifecycle of a simple Recursive Fibonacci function. We will see how a linear sequence of instructions is dismantled and rebuilt into a semantic graph.
ProGraML does not analyze the source code directly. Instead, it analyzes the intermediate representation generated by the compiler. This allows the representation to be independent of the specific syntax or "sugar" of the source language.
In the example below (Figure 2.a), the C code (left) is lowered into a list of instructions (right).
Figure 2: (a) The C source code is compiled into an Intermediate Representation (LLVM-IR in this example); (b) An initial full-flow graph is constructed. Instructions become nodes; edges model the execution path. (Source: Cummins et al., 2021)
The graph construction begins by identifying every instruction in the IR to build a full-flow graph. Each instruction becomes a node, connected to others via Control Flow edges that map the execution path (Figure 2.b).
Notably, these edges are augmented with numeric positions. For instructions with diverging control flow (multiple successors), such as the switch statement in the example, these position attributes explicitly encode the order of the branches. This ensures the model can distinguish between different branch targets (e.g., case 0 vs case 1).
Notice in Figure 2.b above how the switch instruction diverges execution into different paths, while unconditional br (branch) instructions cause the control flow to converge back to common blocks. This structure forms the backbone of the graph.
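As a toy illustration of these positional control-flow edges (the node indices and the two-way switch are hypothetical, not taken from the figure), the diverging and converging structure could be encoded like this:

```python
# Hypothetical node indices: 0 = switch, 1 = case-0 block, 2 = case-1 block, 3 = join block.
# Each control-flow edge carries a position so the branch order is explicit.
control_flow_edges = [
    # (source, target, position)
    (0, 1, 0),   # switch -> case 0 handler (1st successor)
    (0, 2, 1),   # switch -> case 1 handler (2nd successor)
    (1, 3, 0),   # br -> join block (unconditional, single successor)
    (2, 3, 0),   # br -> join block
]
```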
A program is defined by how it manipulates data. While standard IRs often keep this implicit (via registers), ProGraML makes it explicit by inserting new nodes for Variables (ellipses) and Constants (diamonds) (Figure 3.a). This allows us to map Def-Use chains, the lifecycle of every value in the program, directly onto the graph structure using Data Flow edges:
- Def-Use Chains: If an instruction defines a variable (e.g., `%4 = add ...`), we draw an edge from the Instruction to the Variable.
- Use-Def Chains: If an instruction uses a variable as an operand, we draw an edge from the Variable to the Instruction.
These Data Flow edges are also augmented with position attributes. This tells the neural network that in `sub a, b`, `a` is the first operand and `b` is the second. This allows the model to learn the semantics of non-commutative operations naturally.
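A small sketch of what these data-flow edges might look like for a single instruction, using the same position convention as the edge-list table later in this chapter (operands numbered from 1, result at 0); the instruction and node labels are purely illustrative:

```python
# Data-flow edges for a hypothetical instruction `%4 = sub %a, %b`.
# Positions: operands are numbered 1, 2, ...; the defined result uses position 0.
data_flow_edges = [
    # (source, target, position)
    ("%a",  "sub", 1),   # use-def: %a is the first operand
    ("%b",  "sub", 2),   # use-def: %b is the second operand
    ("sub", "%4",  0),   # def-use: the instruction defines %4
]
```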
Figure 3: (a) Variables and constants are added. Data edges track where values are created and used; (b) Functions are linked via Call edges, enabling whole-program reasoning. (Source: Cummins et al., 2021)
Finally, to support Inter-procedural Analysis (reasoning across function boundaries), we connect the functions together using Call Flow edges (Figure 3.b). These edges act as the glue between isolated functions, creating a unified, whole-program graph. They model the program's call stack by linking the call instruction to the entry of the target function, and subsequently linking the function's exit back to the call site.
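In the same spirit, the call-flow glue for a recursive `fib` call might be encoded as a pair of edges per call site (the labels below are illustrative, not actual node identifiers):

```python
# Call-flow edges for a hypothetical recursive call inside fib.
call_flow_edges = [
    # (source, target)
    ("call fib",             "fib entry instruction"),   # call site -> callee entry
    ("fib exit instruction", "call fib"),                # callee exit -> back to the call site
]
```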
Note on SPHINX: While the example above utilizes LLVM-IR, our project applies this exact same methodology to MLIR. By capturing the graph at the MLIR level, we can leverage high-level dialect information (like Affine loops or Linalg tensors) before they are discarded by the lowering process.
A graph gives us rich structural information, but neural networks cannot directly compute expressions like “instruction plus variable.” They operate purely on linear algebra, for example, on tensors, matrices, and vectors, and cannot process discrete symbols such as add, for, or %1. To bridge this gap, we use Embeddings.
In Deep Learning, an Embedding is a mapping from a discrete symbol (like a word or a token) to a continuous vector in a dense numerical space. You can think of it as translating a "Symbol" into a "Coordinate" in that space.
Ideally, this translation preserves semantic meaning. Just as Word2Vec in NLP ensures that the vectors for "King" and "Queen" are mathematically similar, a good program embedding ensures that semantically similar instructions (like add and fadd) are located close to each other in the vector space.
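As a toy illustration of that geometric intuition (the three-dimensional vectors below are invented for the example, not learned embeddings), cosine similarity is one common way to measure such closeness:

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity: 1.0 means the vectors point in the same direction.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up 3-D vectors purely for illustration.
add  = np.array([0.9, 0.1, 0.0])
fadd = np.array([0.8, 0.2, 0.1])
br   = np.array([0.0, 0.1, 0.9])

print(cosine(add, fadd))  # high: integer and floating-point add are related
print(cosine(add, br))    # low: an add and a branch are not
```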
In the specific context of ProGraML, we calculate an initial state vector for every node in the graph. This is achieved via a Vocabulary Lookup: we maintain a fixed-size vocabulary of all known instruction opcodes (e.g., `store`, `br`) and data types (e.g., `i32`, `float`) derived from the training data.
When the system encounters a node, it follows this logic:

- Identify the Key: Determine the semantic identity of the node.
  - For Instructions, we use the opcode (e.g., `add`).
  - For Variables/Constants, we use the data type (e.g., `i32` or `float`).
- Lookup ID: Find the integer index for this key in our vocabulary (e.g., `add` $\rightarrow$ Index 45).
- Retrieve Vector: Pull the corresponding row from an embedding matrix.
These embedding vectors are not static; they are trained jointly with the rest of the model. This allows the network to learn the "meaning" of an i32 versus a float specifically for the task of optimization.
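A minimal sketch of this lookup, assuming an invented vocabulary and a randomly initialized matrix (in a real model the matrix would be a trainable parameter, e.g. a `torch.nn.Embedding`):

```python
import numpy as np

# Invented vocabulary and indices; the real vocabulary is derived from training data.
vocab = {"store": 12, "br": 7, "add": 45, "i32": 3, "float": 4}
embedding_dim = 8
rng = np.random.default_rng(0)
# 100 vocabulary slots; row i is the embedding of the key with index i.
embedding_matrix = rng.normal(size=(100, embedding_dim))

def initial_state(key: str) -> np.ndarray:
    """key = opcode for instructions, data type for variables/constants."""
    idx = vocab.get(key, 0)       # index 0 acts as an "unknown" slot for out-of-vocabulary keys
    return embedding_matrix[idx]  # retrieve the corresponding row

x_add = initial_state("add")   # initial state vector for an `add` instruction node
x_i32 = initial_state("i32")   # initial state vector for an i32-typed variable node
```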
Once we have the graph topology ($A$) and the embeddings ($X$), the program is essentially converted into two matrices that the Machine Learning model can ingest.
Imagine a tiny program fragment: %2 = add %1, 5.
This subgraph has 4 nodes:

- Variable `%1`
- Constant `5`
- Instruction `add`
- Variable `%2`
The node feature matrix ($X$) holds one row per node, representing its semantic meaning (its embedding).
| Node Index | Node Type | Embedding Vector (Simplified) |
|---|---|---|
| 0 (`%1`) | variable | `[0.1, 0.9, 0.0, ...]` |
| 1 (`5`) | constant | `[0.8, 0.1, 0.1, ...]` |
| 2 (`add`) | instruction | `[0.5, 0.5, 0.9, ...]` |
| 3 (`%2`) | variable | `[0.1, 0.9, 0.0, ...]` |
The edge list ($A$) represents the connections. In ProGraML, we actually use an Edge List with attributes (to distinguish control vs. data edges), but conceptually it looks like this:
| Source | Target | Edge Type | Position |
|---|---|---|---|
| 0 (`%1`) | 2 (`add`) | Data Flow | 1 (1st operand) |
| 1 (`5`) | 2 (`add`) | Data Flow | 2 (2nd operand) |
| 2 (`add`) | 3 (`%2`) | Data Flow | 0 (Result) |
| ... | 2 (`add`) | Control Flow | ... |
By combining these two matrices, we obtain a complete numerical description of the program that a graph neural network can ingest (Figure 4).
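Putting it together, the toy fragment above becomes the following pair of arrays (the embedding values are the simplified ones from the tables; real embeddings are learned and much wider):

```python
import numpy as np

# Node feature matrix X: one row per node (%1, 5, add, %2).
X = np.array([
    [0.1, 0.9, 0.0],   # 0: variable %1
    [0.8, 0.1, 0.1],   # 1: constant 5
    [0.5, 0.5, 0.9],   # 2: instruction add
    [0.1, 0.9, 0.0],   # 3: variable %2
])

# Edge list A: (source, target, edge_type, position), with 0 = data flow.
A = np.array([
    [0, 2, 0, 1],   # %1  -> add : data, 1st operand
    [1, 2, 0, 2],   # 5   -> add : data, 2nd operand
    [2, 3, 0, 0],   # add -> %2  : data, result
])
```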
Figure 4: The complete transformation pipeline. Source code is lowered to LLVM IR and constructed into a ProGraML graph. In this state, every individual node (instructions and data) possesses its own learned embedding vector. The final point in the 'Embedding View' represents the aggregation of these vectors, effectively treating the entire graph as a composition of its parts. Note: The embedding space is depicted here in 3 dimensions for visualization purposes; actual production embeddings are typically much denser (e.g., 64 or 128 dimensions).
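The aggregation step mentioned in the caption can be as simple as a pooling reduction over all node vectors; mean pooling below is an illustrative choice, not necessarily the reduction used in practice:

```python
import numpy as np

node_embeddings = np.array([   # the four node vectors from the toy example
    [0.1, 0.9, 0.0],
    [0.8, 0.1, 0.1],
    [0.5, 0.5, 0.9],
    [0.1, 0.9, 0.0],
])
graph_embedding = node_embeddings.mean(axis=0)   # a single vector summarizing the whole graph
```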
In the next chapter, we turn to MLIR and show how SPHINX translates compiler dialects into the ProGraML graphs introduced here.
Next: Chapter 2 - From MLIR to ProGraML ->
Cummins, C., Fisches, Z. V., Ben-Nun, T., Hoefler, T., O'Boyle, M. F. P., & Leather, H. (2021). ProGraML: A Graph-based Program Representation for Data Flow Analysis and Compiler Optimizations. Proceedings of the 38th International Conference on Machine Learning, PMLR 139:2244-2253.