-
Notifications
You must be signed in to change notification settings - Fork 2
/
intromoa.tex~
452 lines (412 loc) · 21.8 KB
/
intromoa.tex~
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
% See the ``Article customise'' template for come common customisations
\newcommand{\cat}{+\!\!\!+}
%\newcommand{\cat}{\mbox{$+\!\!+_s$}}
\newcommand{\Take}{\mbox{Take}}
\newcommand{\take}{\,\bigtriangleup\,}
\newcommand{\Drop}{\mbox{Drop}}
\newcommand{\drop}{\,\bigtriangledown\,}
\newcommand{\rshp}{\,\widehat{\Rho}\;}
\newcommand{\Rho}{\,\rho\,}
\newcommand{\Tau}{\,\tau\,}
\newcommand{\Dim}{\,\delta\,}
\newcommand{\transpose}{\bigcirc\!\!\!\!\!\backslash\;}
\newcommand{\kron}{\bigcirc\,\!\!\!\!\!\!\times\;}
\newcommand{\Ravel}{\,\mbox{\tt rav}\,}
\newcommand{\Gradeup}{\,\mbox{\tt gu}\,}
%\newcommand{\cat}{+\!\!\!\!+}
\newtheorem{definition}{Definition}
%\newtheorem{definition}{Definition}
\newtheorem{corollary}{Corollary}
\newtheorem{theorem}{Theorem}
\noindent
MoA, an acoronym for {\em A Mathematics of Arrays\cite{thesis}}, is a theory of arrays with both an
algebra and a calculus of
indexing based on shapes, i.e. sizes of each dimension. The index calculus, refered to as {\em The Psi Calculus},
gets its names from the {\em Psi} function which is the foundational function in the theory. The name {\em Psi}
was inspired by Dirac's {\em Psi Vectors} and his communications with Einstein to have a mathematical way to reason
about them. These results began with the awareness that Iverson's APL\cite{iverson62}
was inspired by Sylvester's Universal Algebra\cite{sylvester} and the fact that he and Cayley invented matrices.
The programming language, APL, embodied Ken's notation into an interpretive computing environment. It was at this point
that theoretical anamolies were introduced.
Not only did Ken believe that APL's array algebra was the ideal basis for a programming language, he (and others) believed that
an ideal machine language supported array operations. This led to HP building an
APL machine based on Phil Abram's dissertation\cite{abrams70}. Abrams was the first to identify ways
to define an algebra of arrays based on shapes; others attempted to
put closure on his work\cite{lyon,guibas}. Abrams believed that if array operations were based on
indexing with shapes, all indexing in the array expression could be composed
and thus could minimize the creation (and deletion) of temporary arrays,
a huge overhead. One of the reasons HP's APL Machine did not succeed was that this analysis was done at run time.
It was Alan Perlis, and later others\cite{miller,budd84}, who realized that these optimizations should be done at compile time. Attempts at compilation led Perlis to realize that anomalies in the language~\cite{gerhart}
made it difficult to optimize and impossible to verify correctness of
programs leading him to recommend a functional array calculator based on the lambda
calculus\cite{tu2,tu3}. This awareness led to Mullin's collaborations with Klaus Berkling to combine MoA, Psi Calculus, and
the Lambda Calculus, theoretically an ideal enviornment\cite{berkling} and basis for a
functional programming language with
arrays\cite{mul91,budd91,hm92,mul93,mul94,mul96}.
MoA and the Psi Calculus, as a theory, have also been used to describe and verify
hardware\cite{hardy1,hardy2,DBLP:journals/corr/abs-0811-2535}, describe the partitioning and
scheduling of array operations (and their costs)\cite{mdst93,lee95,aachen96,coffin94,ll2,mcmahon95,haleh98},
and the formulation of parallel and distributed processing\cite{Mul03}.
MoA and $\psi$-Calculus, can be used to abstract complex processor memory layouts
for tensor/array expressions, reduce these expressions first to a semantic
normal form (Denotational Normal Form-DNF) then to an Operational Normal Form
(ONF). The ONF describes how to {\em build} the code using
{\em start, stop, stride}, and {\em count}: a universal machine abstraction.
This is possible by {\em abstractly} lifting the dimension of an expression to include all aspects
of a processor, memory, communication topology. For example, to map a 2-d array expression to
n processors we would lift the dimenion to 3-d.
%
%
%
%\bibliographystyle{plain}
%\bibliography{paper1,hpf,workshop}
\subsection*{Elements of the theory}
\subsection*{\label{indshp} Indexing and Shapes}
The central operation of MoA is the indexing function
\[
p \psi A \]
in which a vector of $n$ integers $p$ is used to select an item of
the $n$-dimensional array $A$.
The operation is generalized to select a partition of $A$,
so that if $q$ has only $k < n $ components
then
\[ q \psi A \]
is an array of dimensionality $n-k$ and $q$ selects among the possible
choices for the first $k$ axes. In MoA zero origin indexing is assumed.
For example, if $A$ is the $3$ by $5$ by $4$ array
\small
\[ \left [
\begin{array}{lrrrr}
0 & 1 & 2 & 3 \\
4 & 5 & 6 & 7 \\
8 & 9 & 10 & 11 \\
12 & 13 & 14 & 15 \\
16 & 17 & 18 & 19
\end{array} \right ]
\left [ \begin{array}{rrrr}
20 & 21 & 22 & 23 \\
24 & 25 & 26 & 27 \\
28 & 29 & 30 & 31 \\
32 & 33 & 34 & 35 \\
36 & 37 & 38 & 39
\end{array} \right ]
\left [ \begin{array}{rrrr}
40 & 41 & 42 & 43 \\
44 & 45 & 46 & 47 \\
48 & 49 & 50 & 51 \\
52 & 53 & 54 & 55 \\
56 & 57 & 58 & 59
\end{array} \right ] \]
\normalsize
then
\[
<1> \psi A = \left [ \begin{array}{rrrr}
20 & 21 & 22 & 23 \\
24 & 25 & 26 & 27 \\
28 & 29 & 30 & 31 \\
32 & 33 & 34 & 35 \\
36 & 37 & 38 & 39
\end{array} \right ] \]
\[ <2 \; 1> \psi A \; \; = \; \; < \; 44 \;\; 45 \;\; 46 \;\; 47 \; > \]
\[ <2\; 1\; 3> \psi A \; \; = \; \; 47 \]
Most of the common array manipulation operations found in languages like Fortran90, Matlab, ZPL, etc.,
can be defined from $\psi$ and a few elementary vector operations.
We now introduce notation to permit us to define $\psi$ formally and to develop the
{\em Psi Correspondence Theorem}~\cite{muljen94}, which is central to the effective exploitation of MoA in array
computations. We will use $A, B, ...$ to denote an array of numbers (integer, real, or complex).
An array's dimensionality will be denoted by $d_A$ and will be assumed to
be $n$ if not specified.
The shape of an array $A$, denoted by $s_A$, is a vector of integers of length $d_A$,
each item giving the length of the corresponding axis. The total number of items in an
array, denoted by $t_A$, is equal to the product of the items of the shape. The subscripts
will be omitted in contexts where the meaning is obvious.
A full index is a vector of $n$ integers that describes one position
in an $n$-dimensional
array. Each item of a full index for $A$ is less than the corresponding item of
$s_A$. There are precisely $t_A$ indices for an array(due to a
zero index origin). A partial index of $A$ is a vector of $0 \leq k < n $
integers with each item less than the corresponding item of $s_A$.
We will use a tuple notation (omitting commas) to describe vectors of a fixed length.
For example,
\[ <i \; j \; k> \]
denotes a vector of length three. $<>$ will denote the empty vector which is
also sometimes written as $\Theta$.
For every $n$-dimensional array $A$, there is a vector of the items of $A$,
which we denote by the corresponding lower case letter, here $a$. The length of the vector of
items is $t_A$. A vector is itself a one-dimensional array, whose shape is the one-item
vector holding the length. Thus, for $a$, the vector of items of $A$, the
shape of $a$ is
\[ s_a = \; < t_A> \]
and the number of items or total number of components
$a$\footnote{We also use $\tau a , \delta a,$ and $\rho a$ to denote
total number of components, dimensionality and shape of a. } is
\[ t_a = t_A . \]
The precise mapping of $A$ to $a$ is determined by a one-to-one
ordering function, $gamma$. Although the choice of ordering is arbitrary, it is
essential in the following that a specific one be assumed. By convention
we assume the items of $A$ are placed in $a$ according to the lexicographic
ordering of the indices of $A$. This is often referred to
as $row \; major \; ordering$.
Many programming languages lay out the items of multidimensional
arrays in memory in a contiguous segment using this ordering.
Fortran uses the ordering corresponding to a transposed array in which the
axes are reversed {\em column major}. Scalars are introduced as arrays with an empty shape vector.
There are two equivalent ways of describing an array $A$:
\begin{description}
\item[(1)] by its shape and the vector of items, i.e. $A = \{s_A , a\}$, or
\item[(2)] by its shape and a function that defines the value at every index $p$.
\end{description}
These two forms have been shown to be formally equivalent~\cite{jenk94}.
We wish to use the second form in defining functions on multidimensional
arrays using their Cartesian coordinates (indices).
The first form is used in describing
address manipulations to achieve effective computation.
To complete our notational conventions, we assume that $p,q,...$ will
be used to denote
indices or partial indices and that $u,v,..$ will be used to denote
arbitrary vectors of integers.
In order to describe the $i_{th}$ item of a vector $a$, either $a_i$ or
$a[i]$ will be used. If $u$ is a
vector of $k$ integers all less than $t_A$, then $a[u]$
will denote the vector of length $k$, whose items are the items of $a$ at
positions $u_j$, $j = 0,...,k-1.$
Before presenting the formal definition of the $\psi$ indexing function we
define a few functions on vectors:
\begin{tabbing}
\hspace{.25cm}\= $u \;\cat\; v$ \hspace{1cm} \= \=catentation of vectors $u$ and $v$ \\
\>$u$ + $v$\> \>itemwise vector addition assuming $t_u = t_v$ \\
\>$u \times v$ \> \> itemwise vector multiplication \\
\>$n$ + $u$, $u$ + $n$ \> addition of a scalar to each item of a vector \\
\>$n \times u$, $u \times n$ \> multiplication of each item of a vector by a scalar \\
\>$\iota \; n$ \>\> the vector of the first n integers starting from $0$ \\
%\footnote{$\iota \vec n$ is also defined and produces an array of
%cartesian coordinates~\cite{mul88}.}\\
\>$\pi \; v$ \> \>a scalar which is the product of the components of v \\
\>$k \;\take \;u$ \>\> when $k \geq 0$ the vector of the first $k$ items of $u$,(called {\em take}) \\
\> \> \>and when $k<0$ the vector of the last $k$ items of $u$ \\
\>$k \;\drop \;u$ \> \> when $k \geq 0$ the vector of $t_u - k$ last items of $u$,(called{ \em drop}) \\
\> \> \>and when $k<0$ the vector of the first $t_u - |k|$ items of $u$ \\
\>$k \;\theta \;u$ \> \> when $k \geq 0$ the vector of $(k \drop u) \cat ( k \take u ) $\\
\> \> \>and when $k<0$ the vector or $(k \take u ) \cat ( k \drop u ) $
\end{tabbing}
\begin{definition}
Let A be an n-dimensional array and p a vector of integers.
If p is an index of A,
\[ p \psi A = a[\gamma(s_A,p)], \]
where
\begin{eqnarray}
\gamma(s_A,p) & = & x_{n-1} \;\;\;\;\; \mbox{defined by
the recurrence} \nonumber \\
x_0 & = & p_0, \nonumber \\
x_j & = & x_{j-1} * s_j + p_j, \; \; \; j=1,...,n-1. \nonumber
\end{eqnarray}
If $p$ is partial index of length $k < n,$
\[ p \psi A = B \]
where the shape of $B$ is
\[ s_B = k \drop s_A, \]
and for every index $q$ of $B$,
\[ q \psi B = (p \cat q) \psi A \]
\end{definition}
The definition uses the second form of specifying an array to define the result
of a partial index. For the index case, the function $\gamma(s,p)$
is used to convert an index $p$ to an
integer giving the location of the corresponding item
in the row major order list of items of an array of shape $s$. The
recurrence computation for $\gamma$ is the one used in most compilers
for converting an
index to a memory address~\cite{djr}.
\begin{corollary}
$<> \psi$ A = A.
\end{corollary}
The following theorem shows that a $\psi$ selection with a partial index can
be expressed as a composition of $\psi$ selections.
\begin{theorem}
Let A be an n-dimensional array and p a partial index so
that $p = q \cat r.$ Then
\[ p \psi A = r \psi (q \psi A) . \]
Proof: The proof is a consequence of the fact that for vectors u,v,w
\[ (u \cat v) \cat w = u \cat (v \cat w). \]
If we extend p to a full index by $p \cat p',$ then
\begin{eqnarray}
p' \psi (p \psi A) & = & (p \cat p') \psi A \nonumber \\
& = & ((q \cat r) \cat p') \psi A \nonumber \\
& = & (q \cat (r \cat p')) \psi A \nonumber \\
& = & (r \cat p') \psi (q \psi A) \nonumber \\
& = & p' \psi ( r \psi (q \psi A)) \nonumber \\
p \psi A & = & r \psi (q \psi A) \nonumber
\end{eqnarray}
\end{theorem}
{\em which completes the proof.}
We can now use {\em psi} to define other operations on arrays.
For example, consider definitions of {\em take} and {\em drop} for multidimensional arrays.
\begin{definition}[take: $\take$]
Let A be an n-dimensional array, and k a non-negative integer
such that $0 \leq k < s_0$. Then
\[ k \take A = B \]
where
\[ s_B = <k> \cat (1 \drop s_A) \]
and for every index p of B,
\[ p \psi B = p \psi A . \]
\end{definition}
(In MoA $\take$ is also defined for negative integers and is generalized to any vector u with its
absolute value vector a partial index of A. The details are omitted here.)
\begin{definition}[reverse: $\Phi$]
Let A be an n-dimensional array. Then
\[ s_{\Phi A} = s_A \]
and for every integer i, $0 \leq i < s_0, $
\[ <i> \psi \Phi A = <s_0 - i - 1> \psi A. \]
\end{definition}
This definition of $\Phi$ does a reversal of the 0th axis of A.
Note
also that all operations are over the 0th axis. The operator $\Omega$~\cite{mul88} extends operations over all other dimensions.
\subsubsection*{Example}
Consider the evaluation of the following expression using the
3 by 5 by 4 array, $A$, introduced in Section~\ref{indshp}.
\begin{eqnarray}
<1\;\; 2> \psi (2 \take \Phi A)
\end{eqnarray}
where A is the array given in the previous section.
The shape of the result is
\begin{eqnarray}
& & 2 \drop s_{(2 \take \Phi A)} \nonumber \\
& = & 2 \drop (<2> \cat (1 \drop s_{\Phi A} )) \nonumber \\
& = & 2 \drop (<2> \cat (1 \drop s_A)) \nonumber \\
& = & 2 \drop (<2> \cat <5\; 4>) \nonumber \\
& = & 2 \drop <2\; 5 \;4> \nonumber \\
& = &<4> .
\end{eqnarray}
The expression can be simplified using the definitions:
\begin{eqnarray}
&& <1\; 2> \psi (2 \take \Phi A) \nonumber \\
& = & <1 \; 2> \psi \Phi A \nonumber \\
& = & <2> \psi (<1> \psi \Phi A) \nonumber \\
& = & <2> \psi (<3 - 1 - 1 > \psi A) \nonumber \\
& = & <1 \; 2> \psi A \nonumber \\
\end{eqnarray}
This process of simplifying the expression for the item in terms of
its Cartesian coordinates is called {\em Psi Reduction}. The operations of MoA
have been designed so that all expressions can be reduced to a minimal
normal form~\cite{mul88}.
Some MoA operations defined by {\em psi} are found in Fig.\ref{moa}.
%\break
\begin{figure}[h]
%\scriptsize
\tiny
\begin{tabular}{|l|l|l|}
\hline\hline
{\bf Symbol} & {\bf Name}& {\bf Description} \\ \hline
$\delta$ & Dimensionality & Returns the number of dimensions of an array.\\ \hline
$\rho$ & Shape & Returns a vector of the upper bounds \\
& &or sizes of each dimension in an array. \\ \hline
$\iota \xi^n$ & Iota & When $n=0$(scalar),
returns a vector containing elements\\
& & $0$, to $\xi^0 - 1$. When $n = 1$(vector), returns an \\
& & array of indices defined by the shape vector $\xi^1$ \\ \hline
$\psi$ & Psi & The main indexing function of the Psi Calculus \\
& & which defines all operations in MoA. Returns a scalar \\
& & if a full index is provided, a sub-array otherwise. \\ \hline
\mbox{rav} & Ravel & vectorizes a multi-dimensional array based \\
&&on an array's layout($\gamma_{row},\gamma_{col},\gamma_{sparse}, ...$)\\ \hline
$\gamma$ & Gamma & Translates indices into offsets given a shape. \\ \hline
$\gamma^{'}$ & Gamma Inverse & Translates offsets into indices given a shape. \\ \hline
$\vec s \rshp \xi $ & Reshape& Changes the shape vector of an array,
possibly affecting \\
& &its dimensionality. Reshape depends on
layout($\gamma$). \\ \hline
$\pi \vec x$ & Pi & Returns a scalar and is equivalent to $\prod_{i=0}^{(\tau x)-1}x[i]$ \\ \hline
$\tau$& Tau & Returns the number of components in an array,($\tau \xi \equiv
\pi (\rho \xi) $) \\ \hline
$\xi_l \cat \xi_r $ & Catenate& Concatenates two arrays over their
primary axis. \\ \hline
$\xi_l f \xi_r $ & Point-wise & A data parallel application of $f$ is performed \\
&Extension & between all elements of the arrays.\\ \hline
$\sigma f \xi_r $ & Scalar Extension & $\sigma$ is used with
every component of $\xi_r$ in the data parallel \\
$\xi_l f \sigma$ & &application of $f$. \\ \hline
$\take$&Take& Returns a sub-array from the beginning or end of an array\\
& & based on its argument being positive or negative. \\ \hline
$\drop$ & Drop & The inverse of Take \\ \hline
$_{op} \mbox{red}$&Reduce& Reduce an array's dimension by one by applying \\
& & op over the primary axis of an array. \\ \hline
$\Phi$& Reverse& Reverses the components of an array. \\ \hline
$\Theta $& Rotate & Rotates, or shifts cyclically, components of an
array. \\ \hline
$\transpose$ & Transpose & Transposes the elements of an array based on \\
&&a given permutation vector \\ \hline
$\Omega$& Omega& Applies a unary or binary function to array argument(s) \\
&&given partitioning information. $\Omega$ is used to perform all operations \\&& (defined over the primary axis only) over all dimensions. \\ \hline
\end{tabular}
\caption{\label{moa}Summary of MoA Operations}
\end{figure}
\subsection*{Higher Order Operations}
Thus far operation on arrays, such as concatenation, rotation, etc., have been
performed over their 0th dimensions. We introduce the higher order binary
operation $\Omega$, which is defined when its left argument is a unary or binary
operation and its right argument is a vector describing the dimension upon
which operations are to be performed, or which sub-arrays are used in
operations. The dimension upon which operations are to be performed is often
called the {\it axis} of operation. The result of $\Omega$ is a unary or
binary operation.
\subsection*{From Semantic to Operational Forms}
%Section with Psi Correspondence Theorem and thinking in arrays *\
Consider the expression
\begin{eqnarray}
X & = & (2 \take ( \Phi A)) \times ( 1 \drop (\Phi A))
\end{eqnarray}
That is, take the first two planes after reversing A along the primary axis then multiply that array by the last two
planes after reversing A. We want to concurrently perform both 5 by 4 multiplications.
A is the array previously defined with shape $<3\;5\;4>$.
Suppose also that A is in shared memory and accessible to 2 processors.
First, we {\em Psi-Reduce} our expression to its Denotational Normal Form (DNF) or semantic
normal form. This normal form can be used to prove the equivalence of two array expressions.
Get shape:
\begin{eqnarray}
\rho X & = & \rho (2 \take (\Phi A)) \nonumber \\
& = & 2 \cat ((1\drop (\rho ( \Phi A)))) \nonumber \\
&=& 2 \cat ((1 \drop (\rho A))) \nonumber \\
&=& 2 \cat <\;5\;4> \nonumber \\
&=& <2\;4\;5>
\end{eqnarray}
Now that we have the {\em shape}, we can {Psi-Reduce}.
\[ \forall \;\;\; i,j,k \;\;\; \ni \;\;\; 0 \;\;\;\leq i,j,k < \;\;\;<2\;4\;5> \]
\[<i\;j\;k> \psi X \]
\begin{eqnarray}
& =&<i\;j\;k> \psi (2 \take ( \Phi A)) \times ( 1 \drop (\Phi A)) \nonumber \\
&=&<j\;k> \psi (<i> \psi (2 \take ( \Phi A)) \times ( 1 \drop (\Phi A)) )\nonumber \\
&= &<j\;k> \psi ((<i> \psi ( 2 \take (\Phi A))) \nonumber \\
&&\;\;\;\;\;\times ( <i> \psi ( 1 \drop (\Phi A)))) \nonumber \\
&= &<j\;k> \psi (( <i> \psi (\Phi A)) \times (<i+1> \psi (\Phi A))) \nonumber \\
&=&<j\;k> \psi ((<((\rho A)[0] - 1 - i)> \psi A ) \nonumber \\
&& \;\;\;\;\;\times (<((\rho A)[0] - 1 -( i + 1))> \psi A ) )\nonumber \\
&=&<j\;k> \psi ((< 2-i> \psi A ) \times (<1-i> \psi A) )
\end{eqnarray}
\normalsize
This is the DNF, or semantic normal form. To proceed further we must know more about the data, i.e. layout, and architecture.
\subsection* {Applying the Psi Correspondence Theorem: PCT}
We now want to {\em transform} the DNF above to a {\em Generic} Operational Normal Form (ONF) or way
to {\em build the code}. The PCT~\cite{muljen94} turns cartesian coordinates into start, stops, and strides.
Note that we still use {\em Psi-Reduction} but now just on vectors.
Let Y denote $(<2-i> \psi A ) \times ( <1-i> \psi A)$. Then the shape of Y, i.e.
\begin{eqnarray}
\rho Y= 1 \drop (\rho A) = 1 \drop <3\;5\;4> = <5\;4>
\end{eqnarray}
i.e. the bounds for j and k above.
Continuing with the PCT to create the {\em generic} ONF on p=2 processors, we use the index of the primary axis, i, $\ni\;\;0 \leq i < p$, since $\rho X = <2\;4\;5>$.
Since we are accessing contiguous planes we can collapse j and k into a single index l, $\ni \;\; 0 \leq l <20$.
Using $\gamma_{row}$ for row major ordering, Y is transformed into:
\small
\begin{eqnarray}
a[ (( \gamma(<2-i >; <1 \take (\rho A>))) \times \pi ( 1 \drop (\rho A)) ) + \iota \pi 1 \drop \rho A ] \nonumber \\
\times a[ (( \gamma(<1-i >; <1 \take (\rho A>))) \times \pi ( 1 \drop (\rho A)) ) + \iota \pi 1 \drop \rho A ]
\end{eqnarray}
Reducing, we get
\begin{eqnarray}
a[ (( \gamma(<2-i >; <3>))) \times \pi ( <5\;4>) + \iota \pi <5\;4> ] \nonumber \\
\times a[ (( \gamma(<1-i >; <3>))) \times \pi ( <5\;4>) ) + \iota \pi <5\;4> ]\nonumber \\
a[( (2-i) \times 20) + \iota 20 ] \nonumber \\
\times a[( (i-1) \times 20 ) + \iota 20 ]
\end{eqnarray}
\normalsize
Which breaks the indexing into start, stride, and stop, a universal hardware description and easiliy transformed
to the mnemonics used at any hardware leveDBLP:journals/corr/abs-0811-2535l ~\cite{DBLP:journals/corr/abs-0811-2535}.
We could add more dimensions to analyze prefetching, buffer sizes, vector registers, cache, etc.