@@ -44,6 +44,7 @@ Julian Oppermann, Codeplay +
44
44
Ewan Crawford, Codeplay +
45
45
Ben Tracy, Codeplay +
46
46
John Pennycook, Intel +
47
+ Greg Lueck, Intel +
47
48
48
49
== Dependencies
49
50
@@ -192,8 +193,9 @@ By adding the `insert_barriers` property, a _work-group barrier_ will be
192
193
inserted between the kernels. To achieve a device-wide synchronization, i.e.,
193
194
a synchronization between different work-groups that is implicit between two
194
195
kernels when executed separately, users should leverage the subgraph feature of
195
- the SYCL graph proposal. By creating two subgraphs, fusing each and adding both
196
- to the same graph, a device-wide synchronization between two fused parts can be
196
+ the SYCL graph proposal, as device-wide synchronization inside the fused kernel
197
+ is not achievable. By creating two subgraphs, fusing each and adding both to
198
+ the same graph, a device-wide synchronization between two fused parts can be
197
199
achieved if necessary.
198
200
====
199
201
@@ -233,9 +235,11 @@ the property is specified on an accessor) or in any kernel in the graph (in case
233
235
the property is specified on a buffer or an USM pointer).
234
236
235
237
More concretely, the two shortcuts express the following semantics:
238
+
236
239
* `access_scope_work_group`: Applying this specialization asserts that each
237
240
element in the buffer or allocated device memory is accessed by no more than one
238
241
work-group.
242
+
239
243
* `access_scope_work_item`: Applying this specialization asserts that each
240
244
element in the buffer or allocated device memory is accessed by no more than one
241
245
work-item.
@@ -320,7 +324,7 @@ in the sequence before the command itself.
320
324
321
325
The exact linearization of the dependency DAG (which generally only implies a
322
326
partial order) is implementation defined. The linearization should be
323
- deterministic, i.e. it should yield the same sequence when presented with the
327
+ deterministic, i.e., it should yield the same sequence when presented with the
324
328
same DAG.
325
329
326
330
=== Synchronization in kernels
@@ -336,13 +340,13 @@ barrier can added between each of the kernels being fused by applying the
336
340
As the fusion compiler can reason about the access behavior of the different
337
341
kernels only in a very limited fashion, **it's the user's responsibility to
338
342
make sure no data races occur in the fused kernel**. Data races could in
339
- particular be introduced because the implicit inter-work-group synchronization
343
+ particular be introduced because the implicit device-wide synchronization
340
344
between the execution of two separate kernels is eliminated by fusion. The user
341
345
must ensure that the kernels combined during fusion do not rely on this
342
346
synchronization or introduce appropriate synchronization.
343
347
344
348
Device-wide synchronization can be achieved by splitting the graph into multiple
345
- subgraphs and fusing each separately, as decribed above.
349
+ subgraphs and fusing each separately, as described above.
346
350
347
351
=== Limitations
348
352
@@ -421,7 +425,7 @@ To achieve this result during fusion, a fusion compiler must establish some
421
425
additional context and information.
422
426
423
427
First, the compiler must know that two arguments refer to the same underlying
424
- memory. This is possible during runtime, so no additional user input is
428
+ memory. This can be inferred during runtime, so no additional user input is
425
429
required.
426
430
427
431
For the remaining information that needs to be established, the necessity of
@@ -456,10 +460,12 @@ must be provided by the user by applying the `fusion_internal_memory` property
456
460
to the buffer or allocated device memory as described above.
457
461
458
462
The type of memory that can be used for internalization depends on the memory
459
- access pattern of the fuses kernel. Depending on the access pattern, the buffer
463
+ access pattern of the fused kernel. Depending on the access pattern, the buffer
460
464
or allocated device memory can be classified as:
465
+
461
466
* _Privately internalizable_: If not a single element of the buffer/memory is to
462
467
be accessed by more than one work-item;
468
+
463
469
* _Locally internalizable_: If not a single element of the buffer/memory is to
464
470
be accessed by work items of different work groups.
465
471
@@ -483,10 +489,10 @@ dataflow internalization. Implementations should document the necessary
483
489
properties required to enable internalization in implementation documentation.
484
490
485
491
All internalization-related properties are only _descriptive_, so it is not an
486
- error if an implementation is unable to perform internalization based on the
487
- specified properties. Implementations can provide a diagnostic message in case
488
- the set of specified properties are not sufficient to perform internalization,
489
- but are not required to do so.
492
+ error if an implementation is unable to or for other reasons decides not to
493
+ perform internalization based on the specified properties. Implementations can
494
+ provide a diagnostic message in case the set of specified properties are not
495
+ sufficient to perform internalization, but are not required to do so.
490
496
491
497
[NOTE]
492
498
====
@@ -575,87 +581,86 @@ internalization is performed.
575
581
```c++
576
582
#include <sycl/sycl.hpp>
577
583
578
- using namespace sycl;
584
+ namespace sycl_ext = sycl::ext::oneapi::experimental ;
579
585
580
586
struct AddKernel {
581
- accessor<int, 1> accIn1;
582
- accessor<int, 1> accIn2;
583
- accessor<int, 1> accOut;
587
+ sycl:: accessor<int, 1> accIn1;
588
+ sycl:: accessor<int, 1> accIn2;
589
+ sycl:: accessor<int, 1> accOut;
584
590
585
- void operator()(id<1> i) const { accOut[i] = accIn1[i] + accIn2[i]; }
591
+ void operator()(sycl:: id<1> i) const { accOut[i] = accIn1[i] + accIn2[i]; }
586
592
};
587
593
588
594
int main() {
589
595
constexpr size_t dataSize = 512;
590
596
int in1[dataSize], in2[dataSize], in3[dataSize], out[dataSize];
591
597
592
- queue q{default_selector_v};
598
+ sycl:: queue q{default_selector_v};
593
599
594
600
{
595
- buffer<int> bIn1{in1, range{dataSize}};
601
+ sycl:: buffer<int> bIn1{in1, sycl:: range{dataSize}};
596
602
bIn1.set_write_back(false);
597
- buffer<int> bIn2{in2, range{dataSize}};
603
+ sycl:: buffer<int> bIn2{in2, sycl:: range{dataSize}};
598
604
bIn2.set_write_back(false);
599
- buffer<int> bIn3{in3, range{dataSize}};
605
+ sycl:: buffer<int> bIn3{in3, sycl:: range{dataSize}};
600
606
bIn3.set_write_back(false);
601
607
buffer<int> bTmp1{range{dataSize}};
602
608
// Internalization specified on the buffer
603
- buffer<int> bTmp2{
604
- range{dataSize},
605
- {sycl::ext::oneapi::experimental ::property::access_scope_work_item{},
606
- sycl::ext::oneapi::experimental ::property::fusion_internal_memory{},
607
- no_init}};
609
+ sycl:: buffer<int> bTmp2{
610
+ sycl:: range{dataSize},
611
+ {sycl_ext ::property::access_scope_work_item{},
612
+ sycl_ext ::property::fusion_internal_memory{},
613
+ sycl:: no_init}};
608
614
// Internalization specified on the buffer
609
- buffer<int> bTmp3{
610
- range{dataSize},
611
- {sycl::ext::oneapi::experimental ::property::access_scope_work_item{},
612
- sycl::ext::oneapi::experimental ::property::fusion_internal_memory{},
613
- no_init}};
614
- buffer<int> bOut{out, range{dataSize}};
615
+ sycl:: buffer<int> bTmp3{
616
+ sycl:: range{dataSize},
617
+ {sycl_ext ::property::access_scope_work_item{},
618
+ sycl_ext ::property::fusion_internal_memory{},
619
+ sycl:: no_init}};
620
+ sycl:: buffer<int> bOut{out, sycl:: range{dataSize}};
615
621
bOut.set_write_back(false);
616
622
617
- ext::oneapi::experimental ::command_graph graph{
623
+ sycl_ext ::command_graph graph{
618
624
q.get_context(), q.get_device(),
619
- sycl::ext::oneapi::experimental::property::graph::
620
- assume_buffer_outlives_graph{}};
625
+ sycl_ext::property::graph::assume_buffer_outlives_graph{}};
621
626
622
627
graph.begin_recording(q);
623
628
624
- q.submit([&](handler &cgh) {
629
+ q.submit([&](sycl:: handler &cgh) {
625
630
auto accIn1 = bIn1.get_access(cgh);
626
631
auto accIn2 = bIn2.get_access(cgh);
627
632
// Internalization specified on each accessor.
628
633
auto accTmp1 = bTmp1.get_access(cgh,
629
- sycl::ext::oneapi::experimental ::property::access_scope_work_item{}
630
- sycl::ext::oneapi::experimental ::property::fusion_internal_memory{},
631
- no_init);
634
+ sycl_ext ::property::access_scope_work_item{}
635
+ sycl_ext ::property::fusion_internal_memory{},
636
+ sycl:: no_init);
632
637
cgh.parallel_for<AddKernel>(dataSize, AddKernel{accIn1, accIn2, accTmp1});
633
638
});
634
639
635
- q.submit([&](handler &cgh) {
640
+ q.submit([&](sycl:: handler &cgh) {
636
641
// Internalization specified on each accessor.
637
642
auto accTmp1 = bTmp1.get_access(cgh,
638
- sycl::ext::oneapi::experimental ::property::access_scope_work_item{}
639
- sycl::ext::oneapi::experimental ::property::fusion_internal_memory{},
640
- no_init);
643
+ sycl_ext ::property::access_scope_work_item{}
644
+ sycl_ext ::property::fusion_internal_memory{},
645
+ sycl:: no_init);
641
646
auto accIn3 = bIn3.get_access(cgh);
642
647
auto accTmp2 = bTmp2.get_access(cgh);
643
648
cgh.parallel_for<class KernelOne>(
644
- dataSize, [=](id<1> i) { accTmp2[i] = accTmp1[i] * accIn3[i]; });
649
+ dataSize, [=](sycl:: id<1> i) { accTmp2[i] = accTmp1[i] * accIn3[i]; });
645
650
});
646
651
647
- q.submit([&](handler &cgh) {
652
+ q.submit([&](sycl:: handler &cgh) {
648
653
// Internalization specified on each accessor.
649
654
auto accTmp1 = bTmp1.get_access(cgh,
650
- sycl::ext::oneapi::experimental ::property::access_scope_work_item{}
651
- sycl::ext::oneapi::experimental ::property::fusion_internal_memory{},
652
- no_init);
655
+ sycl_ext ::property::access_scope_work_item{}
656
+ sycl_ext ::property::fusion_internal_memory{},
657
+ sycl:: no_init);
653
658
auto accTmp3 = bTmp3.get_access(cgh);
654
659
cgh.parallel_for<class KernelTwo>(
655
- dataSize, [=](id<1> i) { accTmp3[i] = accTmp1[i] * 5; });
660
+ dataSize, [=](sycl:: id<1> i) { accTmp3[i] = accTmp1[i] * 5; });
656
661
});
657
662
658
- q.submit([&](handler &cgh) {
663
+ q.submit([&](sycl:: handler &cgh) {
659
664
auto accTmp2 = bTmp2.get_access(cgh);
660
665
auto accTmp3 = bTmp3.get_access(cgh);
661
666
auto accOut = bOut.get_access(cgh);
@@ -667,8 +672,7 @@ int main() {
667
672
668
673
// Trigger fusion during finalization.
669
674
auto exec_graph =
670
- graph.finalize({sycl::ext::oneapi::experimental::property::
671
- graph::require_fusion{}});
675
+ graph.finalize({sycl_ext::property::graph::require_fusion{}});
672
676
673
677
q.ext_oneapi_graph(exec_graph);
674
678
@@ -683,8 +687,6 @@ int main() {
683
687
```c++
684
688
#include <sycl/sycl.hpp>
685
689
686
- using namespace sycl;
687
-
688
690
namespace sycl_ext = sycl::ext::oneapi::experimental;
689
691
690
692
int main() {
@@ -693,56 +695,56 @@ int main() {
693
695
694
696
int in1[dataSize], in2[dataSize], in3[dataSize], out[dataSize];
695
697
696
- queue q{default_selector_v};
698
+ sycl:: queue q{default_selector_v};
697
699
698
700
sycl_ext::command_graph graph{q.get_context(), q.get_device()};
699
701
700
702
int *dIn1, dIn2, dIn3, dTmp, dOut;
701
703
702
- dIn1 = malloc_device<int>(q, dataSize);
703
- dIn2 = malloc_device<int>(q, dataSize);
704
- dIn3 = malloc_device<int>(q, dataSize);
705
- dOut = malloc_device<int>(q, dataSize);
704
+ dIn1 = sycl:: malloc_device<int>(q, dataSize);
705
+ dIn2 = sycl:: malloc_device<int>(q, dataSize);
706
+ dIn3 = sycl:: malloc_device<int>(q, dataSize);
707
+ dOut = sycl:: malloc_device<int>(q, dataSize);
706
708
707
- // Specify internalization for an USM pointer
708
- dTmp = malloc_device<int>(q, dataSize)
709
+ // Specify internalization to local memory for an USM pointer
710
+ dTmp = sycl:: malloc_device<int>(q, dataSize)
709
711
auto annotatedTmp = sycl_ext::annotated_ptr(
710
- dTmp, sycl_ext::property::access_scope_work_item {},
712
+ dTmp, sycl_ext::property::access_scope_work_group {},
711
713
sycl_ext::property::fusion_internal_memory{}, no_init);
712
714
713
715
// This explicit memory operation is compatible with fusion, as it can be
714
716
// linearized before any device kernel in the graph.
715
717
auto copy_in1 =
716
- graph.add([&](handler &cgh) { cgh.memcpy(dIn1, in1, numBytes); });
718
+ graph.add([&](sycl:: handler &cgh) { cgh.memcpy(dIn1, in1, numBytes); });
717
719
718
720
// This explicit memory operation is compatible with fusion, as it can be
719
721
// linearized before any device kernel in the graph.
720
722
auto copy_in2 =
721
- graph.add([&](handler &cgh) { cgh.memcpy(dIn2, in2, numBytes); });
723
+ graph.add([&](sycl:: handler &cgh) { cgh.memcpy(dIn2, in2, numBytes); });
722
724
723
725
auto kernel1 = graph.add(
724
- [&](handler &cgh) {
726
+ [&](sycl:: handler &cgh) {
725
727
cgh.parallel_for<class KernelOne>(
726
- dataSize, [=](id<1> i) { annotatedTmp[i] = in1[i] + in2[i]; });
728
+ dataSize, [=](sycl:: id<1> i) { annotatedTmp[i] = in1[i] + in2[i]; });
727
729
},
728
730
{sycl_ext::property::node::depends_on(copy_in1, copy_in2)});
729
731
730
732
// This explicit memory operation is compatible with fusion, as it can be
731
733
// linearized before any device kernel in the graph.
732
734
auto copy_in3 =
733
- graph.add([&](handler &cgh) { cgh.memcpy(dIn3, in3, numBytes); });
735
+ graph.add([&](sycl:: handler &cgh) { cgh.memcpy(dIn3, in3, numBytes); });
734
736
735
737
auto kernel2 = graph.add(
736
- [&](handler &cgh) {
738
+ [&](sycl:: handler &cgh) {
737
739
cgh.parallel_for<class KernelTwo>(
738
- dataSize, [=](id<1> i) { out[i] = annotatedTmp[i] * in3[i]; });
740
+ dataSize, [=](sycl:: id<1> i) { out[i] = annotatedTmp[i] * in3[i]; });
739
741
},
740
742
{sycl_ext::property::node::depends_on(copy_in3, kernel1)});
741
743
742
744
// This explicit memory operation is compatible with fusion, as it can be
743
745
// linearized after any device kernel in the graph.
744
746
auto copy_out =
745
- graph.add([&](handler &cgh) { cgh.memcpy(out, dOut, numBytes); },
747
+ graph.add([&](sycl:: handler &cgh) { cgh.memcpy(out, dOut, numBytes); },
746
748
{sycl_ext::property::node::depends_on(kernel2)});
747
749
748
750
// Trigger fusion during finalization.
@@ -751,11 +753,11 @@ int main() {
751
753
// use queue shortcut for graph submission
752
754
q.ext_oneapi_graph(exec).wait();
753
755
754
- free(dIn1, q);
755
- free(dIn2, q);
756
- free(dIn3, q);
757
- free(dOut, q);
758
- free(dTmp, q);
756
+ sycl:: free(dIn1, q);
757
+ sycl:: free(dIn2, q);
758
+ sycl:: free(dIn3, q);
759
+ sycl:: free(dOut, q);
760
+ sycl:: free(dTmp, q);
759
761
760
762
return 0;
761
763
}
0 commit comments