257 | 257 | "!mpirun -np 2 ./heat 256 256 16000"
258 | 258 | ]
259 | 259 | },
260 |     | - {
261 |     | - "cell_type": "markdown",
262 |     | - "metadata": {},
263 |     | - "source": [
264 |     | - "When using the NVIDIA C++ compiler, we currently need to workaround lack of proper support for `views::cartesian_product` in the parallel algorithms as follows:\n",
265 |     | - "\n",
266 |     | - "```c++\n",
267 |     | - " auto cp = std::views::cartesian_product(xs, ys);\n",
268 |     | - " auto is = std::views::iota((int)0, (int)std::size(cp)); // Create 1D range of ints\n",
269 |     | - " return std::transform_reduce(\n",
270 |     | - " std::execution::par, is.begin(), is.end(), \n",
271 |     | - " 0., std::plus{}, [u_new, u_old, p, ids = cp.begin()](auto i) {\n",
272 |     | - " auto [x, y] = ids[i]; // Use int to advance a cartesian_product Iterator.\n",
273 |     | - " return stencil(u_new, u_old, x, y, p);\n",
274 |     | - " });\n",
275 |     | - "```"
276 |     | - ]
277 |     | - },
278 | 260 | {
279 | 261 | "cell_type": "code",
280 | 262 | "execution_count": null,

284 | 266 | "!rm output || true\n",
285 | 267 | "!rm heat || true\n",
286 | 268 | "!OMPI_CXX=nvc++ mpicxx -std=c++20 -stdpar=multicore -O4 -fast -march=native -Mllvm-fast -DNDEBUG -o heat solutions/exercise1.cpp\n",
287 |     | - "!mpirun -np 2 ./heat 256 256 16000"
    | 269 | + "!mpirun -np 2 ./heat 256 256 16000\n",
    | 270 | + "visualize()"
288 | 271 | ]
289 | 272 | },
290 | 273 | {

305 | 288 | "!rm output || true\n",
306 | 289 | "!rm heat || true\n",
307 | 290 | "!OMPI_CXX=nvc++ mpicxx -std=c++20 -stdpar=gpu -O4 -fast -march=native -Mllvm-fast -DNDEBUG -o heat solutions/exercise1.cpp\n",
308 |     | - "!UCX_RNDV_FRAG_MEM_TYPE=cuda mpirun -np 2 ./heat 256 256 16000"
    | 291 | + "!UCX_RNDV_FRAG_MEM_TYPE=cuda mpirun -np 2 ./heat 256 256 16000\n",
    | 292 | + "visualize()"
309 | 293 | ]
310 | 294 | },
311 | 295 | {

508 | 492 | "!mpirun -np 2 ./heat 256 256 16000\n",
509 | 493 | "visualize()"
510 | 494 | ]
    | 495 | + },
    | 496 | + {
    | 497 | + "cell_type": "markdown",
    | 498 | + "metadata": {},
    | 499 | + "source": [
    | 500 | + "## Exercise 3: Senders & Receivers\n",
    | 501 | + "\n",
    | 502 | + "The goal of this exercise is to simplify the implementation of Exercise 2 - Overlap Communication and Computation - by using Senders & Receivers with a `static_thread_pool` to manage the host threads, while combining this with the C++ parallel algorithms.\n",
    | 503 | + "\n",
    | 504 | + "The implementation of Exercise 2 is quite complex. It requires:\n",
    | 505 | + "\n",
    | 506 | + "```c++\n",
    | 507 | + "// A shared atomic variable to accumulate the energy:\n",
    | 508 | + "std::atomic<double> energy = 0.;\n",
    | 509 | + "\n",
    | 510 | + "// A shared barrier for synchronizing threads:\n",
    | 511 | + "std::barrier bar(3);\n",
    | 512 | + "\n",
    | 513 | + "// User must manually create and start threads:\n",
    | 514 | + "std::thread thread_inner([&] {\n",
    | 515 | + " energy += computation(...);\n",
    | 516 | + " bar.arrive_and_wait();\n",
    | 517 | + " // User must manually create a critical section for MPI rank reduction:\n",
    | 518 | + " MPI_Reduce(...);\n",
    | 519 | + " // User must manually reset the shared state on each iteration:\n",
    | 520 | + " energy = 0;\n",
    | 521 | + " bar.arrive_and_wait();\n",
    | 522 | + " });\n",
    | 523 | + "\n",
    | 524 | + "std::thread thread_prev(...);\n",
    | 525 | + "std::thread thread_next(...);\n",
    | 526 | + "\n",
    | 527 | + "// User must manually join all threads before doing File I/O\n",
    | 528 | + "thread_prev.join();\n",
    | 529 | + "thread_next.join();\n",
    | 530 | + "thread_inner.join();\n",
    | 531 | + "\n",
    | 532 | + "// File I/O\n",
    | 533 | + "```\n",
    | 534 | + "\n",
    | 535 | + "In this exercise, we'll use Senders & Receivers instead to create a graph representing the computation:\n",
    | 536 | + "\n",
    | 537 | + "```c++\n",
    | 538 | + "stde::sender iteration_step(stde::scheduler sch, parameters p, long it,\n",
    | 539 | + " std::vector<double>& u_new, std::vector<double>& u_old) {\n",
    | 540 | + " // TODO: use Senders & Receivers to create a graph representing the computation of a single iteration\n",
    | 541 | + "}\n",
    | 542 | + "```\n",
    | 543 | + "\n",
    | 544 | + "and will then dispatch it to an execution context:\n",
    | 545 | + "\n",
    | 546 | + "```c++\n",
    | 547 | + "stde::static_thread_pool ctx{3}; // Thread Pool with 3 threads\n",
    | 548 | + "stde::scheduler auto sch = ctx.get_scheduler();\n",
    | 549 | + "\n",
    | 550 | + "for (long it = 0; it < p.nit(); ++it) {\n",
    | 551 | + " stde::this_thread::sync_wait(iteration_step(sch, p, it, u_new, u_old));\n",
    | 552 | + "}\n",
    | 553 | + "```\n",
    | 554 | + "\n",
    | 555 | + "### Compilation and run commands\n",
    | 556 | + "\n",
    | 557 | + "[exercise3.cpp]: ./exercise3.cpp\n",
    | 558 | + "\n",
    | 559 | + "The template [exercise3.cpp] compiles and runs as provided, but produces incorrect results due to the incomplete `iteration_step` implementation.\n",
    | 560 | + "\n",
    | 561 | + "After completing it, the following blocks should compile and run correctly:"
    | 562 | + ]
    | 563 | + },
    | 564 | + {
    | 565 | + "cell_type": "markdown",
    | 566 | + "metadata": {},
    | 567 | + "source": [
    | 568 | + "### Solutions Exercise 3\n",
    | 569 | + "\n",
    | 570 | + "The solution is available in the `solutions/` sub-directory as [`solutions/exercise3.cpp`].\n",
    | 571 | + "\n",
    | 572 | + "[`solutions/exercise3.cpp`]: ./solutions/exercise3.cpp\n",
    | 573 | + "\n",
    | 574 | + "The following blocks compile and run the solution for Exercise 3 using different compilers and C++ standard versions.\n",
    | 575 | + "By default, the [`static_thread_pool`] scheduler is used.\n",
    | 576 | + "\n",
    | 577 | + "[`static_thread_pool`]: https://github.com/NVIDIA/stdexec/blob/main/include/exec/static_thread_pool.hpp"
    | 578 | + ]
    | 579 | + },
    | 580 | + {
    | 581 | + "cell_type": "code",
    | 582 | + "execution_count": null,
    | 583 | + "metadata": {},
    | 584 | + "outputs": [],
    | 585 | + "source": [
    | 586 | + "!rm output || true\n",
    | 587 | + "!rm heat || true\n",
    | 588 | + "!OMPI_CXX=g++ mpicxx -std=c++20 -Ofast -march=native -DNDEBUG -o heat solutions/exercise3.cpp -ltbb\n",
    | 589 | + "!mpirun -np 2 ./heat 256 256 16000\n",
    | 590 | + "visualize()"
    | 591 | + ]
    | 592 | + },
    | 593 | + {
    | 594 | + "cell_type": "code",
    | 595 | + "execution_count": null,
    | 596 | + "metadata": {},
    | 597 | + "outputs": [],
    | 598 | + "source": [
    | 599 | + "!rm output || true\n",
    | 600 | + "!rm heat || true\n",
    | 601 | + "!OMPI_CXX=clang++ mpicxx -std=c++20 -Ofast -march=native -DNDEBUG -o heat solutions/exercise3.cpp -ltbb\n",
    | 602 | + "!mpirun -np 2 ./heat 256 256 16000\n",
    | 603 | + "visualize()"
    | 604 | + ]
    | 605 | + },
    | 606 | + {
    | 607 | + "cell_type": "code",
    | 608 | + "execution_count": null,
    | 609 | + "metadata": {},
    | 610 | + "outputs": [],
    | 611 | + "source": [
    | 612 | + "!rm output || true\n",
    | 613 | + "!rm heat || true\n",
    | 614 | + "!OMPI_CXX=nvc++ mpicxx -std=c++20 -stdpar=gpu -O4 -fast -march=native -Mllvm-fast -DNDEBUG -o heat solutions/exercise3.cpp\n",
    | 615 | + "!mpirun -np 2 ./heat 256 256 16000\n",
    | 616 | + "visualize()"
    | 617 | + ]
511 | 618 | }
512 | 619 | ],
513 | 620 | "metadata": {