
Conversation

@anyzelman (Member) commented Feb 9, 2024

Closes #249.

Closes #303.

@anyzelman anyzelman added the bug Something isn't working label Feb 9, 2024
@anyzelman anyzelman added this to the v0.8 milestone Feb 9, 2024
@anyzelman anyzelman self-assigned this Feb 9, 2024
@anyzelman anyzelman linked an issue Feb 9, 2024 that may be closed by this pull request
@anyzelman anyzelman marked this pull request as draft February 12, 2024 11:29
@anyzelman anyzelman marked this pull request as ready for review February 12, 2024 15:33
@anyzelman anyzelman added the enhancement New feature or request label Feb 12, 2024
@anyzelman anyzelman force-pushed the 249-alp-passes-0-as-max_calls-to-lpf-collectives-initialisation branch from 53c2765 to 01c4158 Compare February 12, 2024 15:33
@anyzelman (Member Author) commented:

Crossref #303

@anyzelman (Member Author) commented:

Concept release notes while waiting for tests and reviews:

This MR fixes a bug where lpf_coll_t was not initialised properly. This bug never resulted in a noticeable issue, since the present implementation of the LPF collectives ignores the erroneous argument that ALP was passing in.
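
For illustration only, a minimal, self-contained sketch of the class of bug being fixed; collectives_init below is a mock stand-in rather than the actual LPF routine, and its signature is an assumption based on this MR's linked issue:

    #include <cassert>
    #include <cstddef>

    // mock stand-in for an LPF-style collectives initialisation, which expects an
    // upper bound on the number of collective calls issued between synchronisations
    int collectives_init( const size_t max_calls ) {
        return max_calls == 0 ? 1 /* invalid bound */ : 0 /* success */;
    }

    int main() {
        // the buggy pattern: a zero bound, harmless so far only because the current
        // LPF collectives implementation happens to ignore the argument
        assert( collectives_init( 0 ) != 0 );
        // the fixed pattern: pass the actual maximum number of collective calls
        assert( collectives_init( 2 ) == 0 );
        return 0;
    }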

While fixing the bug, this MR also detaches the implementation of the public grb::collectives<> API for ALP's distributed-memory backends from the implementation of the internal collectives. It furthermore introduces a global collectives instance that is directly managed by the distributed-memory ALP backend, which avoids having to create an lpf_coll_t instance every time a collective is called.

As always, the MR includes both related and unrelated code style fixes.

…ublic collectives and internal collectives. Fixed, together with its code style
@anyzelman anyzelman marked this pull request as draft February 12, 2024 16:43
@anyzelman (Member Author) commented:

Some issues with smoke tests show that the previous pre/post-amble programming pattern for collective communication is wholly incompatible with driving the LPF collectives as their specification requires, sadly to the point that #303 should be resolved now. Getting on with that now...

(At this point though: all unit tests with LPF are OK)

…tionalities: memcpy and folding a raw matrix column-wise into a raw vector.
…ble. Rationale: that coding pattern could not correctly capture the number of calls, number of bytes that may be reduced, nor the number of bytes that may be subject to other collectives. Removed implementation of reduce/allreduce of vector data -- the vector data should first be reduced locally, and then a reduction using a scalar (all)reduce should follow. TODO: remove the disabled code related to the latter removal in bsp1d/vector.hpp, tests/smoke/hook/collectives_blas1.cpp, and tests/smoke/hook/collectives_blas1_raw.cpp
@anyzelman (Member Author) commented:

Handled #303 as well with the latest commits. Running the full test suite with LPF (including perftests) to make sure there are no snafus that snuck in for the BSP1D and hybrid backends.

@anyzelman (Member Author) commented:

There are snafus:

  • the dense spmv unit test fails for P=2, west0497, for Ax and xA^T, which seems to indicate an issue with the allgather;
  • the most annoying ones, however, are deadlocks that seem to appear randomly. gdb traces follow:

Process A:

(gdb) bt
#0  0x00007f52b45a72c0 in MPID_nem_tcp_connpoll () from /usr/lib64/mpich/lib/libmpi.so.12
#1  0x00007f52b4596212 in MPIDI_CH3I_Progress () from /usr/lib64/mpich/lib/libmpi.so.12
#2  0x00007f52b44859ef in PMPI_Sendrecv () from /usr/lib64/mpich/lib/libmpi.so.12
#3  0x00007f52b47783c6 in sparse_all_to_all () from /scratch/lpf-mpich-install/lib/liblpf_core_univ_mpimsg_Release.so
#4  0x00007f52b475de74 in lpf::mpi::SparseAllToAll::exchange(lpf::mpi::Comm const&, bool, int*, int) ()
   from /scratch/lpf-mpich-install/lib/liblpf_core_univ_mpimsg_Release.so
#5  0x00007f52b475915b in lpf::MessageQueue::sync(bool) () from /scratch/lpf-mpich-install/lib/liblpf_core_univ_mpimsg_Release.so
#6  0x00007f52b4769996 in lpf::Interface::abort() () from /scratch/lpf-mpich-install/lib/liblpf_core_univ_mpimsg_Release.so
#7  0x00007f52b474ee4e in lpf::Process::hook(lpf::mpi::Comm const&, lpf::Process&, void (*)(void*, unsigned int, unsigned int, lpf_args), lpf_args) [clone .cold] () from /scratch/lpf-mpich-install/lib/liblpf_core_univ_mpimsg_Release.so
#8  0x00007f52b4767377 in lpf::Process::exec(unsigned int, void (*)(void*, unsigned int, unsigned int, lpf_args), lpf_args) ()
   from /scratch/lpf-mpich-install/lib/liblpf_core_univ_mpimsg_Release.so
#9  0x000000000040cd61 in grb::internal::BaseLpfLauncher<(grb::EXEC_MODE)0>::run_lpf<unsigned long, grb::RC, grb::internal::ExecDispatcher<unsigned long, grb::RC, (grb::EXEC_MODE)0, true, false> > (data_out=0x7ffe85d86c44, in_size=8, data_in=0x7ffe85d86c48, alp_program=<optimized out>, 
    this=<synthetic pointer>) at /home/yzelman/Documents/graphblas/include/graphblas/bsp1d/exec.hpp:620
#10 grb::internal::BaseLpfLauncher<(grb::EXEC_MODE)0>::pack_data_and_run<unsigned long, grb::RC, false> (broadcast=true, data_out=0x7ffe85d86c44, 
    in_size=8, data_in=0x7ffe85d86c48, alp_program=0x40ecc0 <grb_program(unsigned long const&, grb::RC&)>, this=<synthetic pointer>)
    at /home/yzelman/Documents/graphblas/include/graphblas/bsp1d/exec.hpp:702
#11 grb::internal::BaseLpfLauncher<(grb::EXEC_MODE)0>::exec<unsigned long, grb::RC> (broadcast=true, data_out=@0x7ffe85d86c44: grb::SUCCESS, 
    data_in=@0x7ffe85d86c48: 100, alp_program=0x40ecc0 <grb_program(unsigned long const&, grb::RC&)>, this=<synthetic pointer>)
    at /home/yzelman/Documents/graphblas/include/graphblas/bsp1d/exec.hpp:762
#12 main (argc=<optimized out>, argv=0x7ffe85d86ee8) at /home/yzelman/Documents/graphblas/tests/unit/eWiseMatrix.cpp:210

Process B:

#0  0x00007fe2b5662219 in MPIDI_CH3I_Progress () from /usr/lib64/mpich/lib/libmpi.so.12
#1  0x00007fe2b555a133 in MPIR_Wait_impl () from /usr/lib64/mpich/lib/libmpi.so.12
#2  0x00007fe2b55d18da in MPIC_Wait () from /usr/lib64/mpich/lib/libmpi.so.12
#3  0x00007fe2b55d1e43 in MPIC_Recv () from /usr/lib64/mpich/lib/libmpi.so.12
#4  0x00007fe2b558f1e3 in MPIR_Bcast_intra_binomial () from /usr/lib64/mpich/lib/libmpi.so.12
#5  0x00007fe2b54ea4df in MPIR_Bcast_intra_auto () from /usr/lib64/mpich/lib/libmpi.so.12
#6  0x00007fe2b54ea685 in MPIR_Bcast_impl () from /usr/lib64/mpich/lib/libmpi.so.12
#7  0x00007fe2b559079b in MPIR_Bcast_intra_smp () from /usr/lib64/mpich/lib/libmpi.so.12
#8  0x00007fe2b54ea4ff in MPIR_Bcast_intra_auto () from /usr/lib64/mpich/lib/libmpi.so.12
#9  0x00007fe2b54ea685 in MPIR_Bcast_impl () from /usr/lib64/mpich/lib/libmpi.so.12
#10 0x00007fe2b54eaf00 in PMPI_Bcast () from /usr/lib64/mpich/lib/libmpi.so.12
#11 0x00007fe2b582d672 in lpf::mpi::Comm::broadcast(void*, unsigned long, int) const ()
   from /scratch/lpf-mpich-install/lib/liblpf_core_univ_mpimsg_Release.so
#12 0x00007fe2b5833a29 in lpf::Process::slave() () from /scratch/lpf-mpich-install/lib/liblpf_core_univ_mpimsg_Release.so
#13 0x00007fe2b5833aa8 in lpf::Process::Process(lpf::mpi::Comm const&) () from /scratch/lpf-mpich-install/lib/liblpf_core_univ_mpimsg_Release.so
#14 0x00007fe2b58364f0 in lpf::Interface::initRoot(int*, char***) () from /scratch/lpf-mpich-install/lib/liblpf_core_univ_mpimsg_Release.so
#15 0x00007fe2b582205e in lpf::mpi_initializer(int, char**) () from /scratch/lpf-mpich-install/lib/liblpf_core_univ_mpimsg_Release.so
#16 0x00007fe2b586ef2a in call_init.part () from /lib64/ld-linux-x86-64.so.2
#17 0x00007fe2b586f031 in _dl_init () from /lib64/ld-linux-x86-64.so.2
#18 0x00007fe2b586014a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#19 0x0000000000000001 in ?? ()
#20 0x00007ffff52d37ae in ?? ()
#21 0x0000000000000000 in ?? ()

Process B is still in daemon (slave) mode, which process A is supposed to resolve via its call to lpf_exec. The last traceable code block (ALP side) is bsp1d/exec.hpp, around line 620:

                                RC run_lpf(
                                        const lpf_func_t alp_program,
                                        const void * const data_in,
                                        const size_t in_size,
                                        U * const data_out
                                ) const {
                                        // construct LPF I/O args
                                        lpf_args_t args = {
                                                data_in, in_size,
                                                data_out, sizeof( U ),
                                                &alp_program, 1
                                        };

                                        // get LPF function pointer
                                        lpf_spmd_t fun = reinterpret_cast< lpf_spmd_t >(
                                                internal::alp_exec_dispatch< T, U, DispatcherType > );

                                        // execute
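                                        // when no prior LPF context exists (init == LPF_INIT_NONE), the SPMD
                                        // section is spawned via lpf_exec; otherwise the launcher attaches to
                                        // the already-initialised processes via lpf_hook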
                                        const lpf_err_t spmdrc = init == LPF_INIT_NONE
                                                ? lpf_exec( LPF_ROOT, LPF_MAX_P, fun, args )
                                                : lpf_hook( init, fun, args );

while the trace confirms lpf_exec is indeed called.

…casting between input and i/o domains on the one hand, and operator domains on the other hand. Also, the specification of grb::collectives<>::{allreduce,reduce} was broken, in that sometimes a monoid is needed. TODO: bring base specification up to date with this new implementation in BSP1D, after testing it has completed.
@anyzelman (Member Author) commented:

After merging in the latest develop, I found that not all LPF tests complete successfully. This was due to a bug regarding casting behaviour in the new allreduce implementation, which has been fixed by creating a generic allreduce/reduce implementation with corrected casting. Fixing it, however, revealed that sometimes a monoid identity is required for proper behaviour, which leads to a specification change. Before making that change final in the base spec, I am first making sure the tests are OK again. Another TODO remains: applying Alberto's suggested fixes.
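
To illustrate why the identity matters (a hypothetical, self-contained sketch, not ALP's actual monoid API or allreduce implementation): when the I/O type differs from the operator domain, the accumulator lives in the operator domain and must be seeded with the monoid identity before any contributions are folded in.

    #include <iostream>
    #include <vector>

    // hypothetical stand-in monoid: operator domain double, identity 0
    struct PlusMonoidOverDoubles {
        using Domain = double;
        static Domain identity() { return 0.0; }
        static Domain apply( const Domain a, const Domain b ) { return a + b; }
    };

    // IOType may differ from the monoid's operator domain (e.g., int vs. double)
    template< typename IOType, typename Monoid >
    IOType allreduce_sketch( const IOType local, const std::vector< IOType > &remote ) {
        typename Monoid::Domain acc = Monoid::identity();  // seeding requires the identity
        acc = Monoid::apply( acc, static_cast< typename Monoid::Domain >( local ) );
        for( const IOType &x : remote ) {
            acc = Monoid::apply( acc, static_cast< typename Monoid::Domain >( x ) );
        }
        // cast back to the I/O domain only once, at the very end
        return static_cast< IOType >( acc );
    }

    int main() {
        // three processes contributing 1, 2, and 3: prints 6
        std::cout << allreduce_sketch< int, PlusMonoidOverDoubles >( 1, { 2, 3 } ) << std::endl;
        return 0;
    }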

@anyzelman (Member Author) commented:

Quickly before flying out: the last commit is bugged again. The pinnedVector unit test for BSP1D at P=2 deadlocks, while several other unit tests fail. The easiest ones to debug are likely set and capacity.

@anyzelman (Member Author) commented Apr 4, 2024

Updated concept release notes:

This MR fixes a bug where lpf_coll_t was not initialised properly. The bug never resulted in a noticeable issue, since the present implementation of the LPF collectives ignores the erroneous argument that ALP was passing in. Nevertheless, this not-up-to-spec initialisation of the LPF collectives occurred repeatedly throughout the code base. This MR completely overhauls the implementation of the collectives to address the issue.

Within ALP, there are two classes of collectives: those exposed via the public grb::collectives<> API, and raw internal collectives that are used in the implementation of, e.g., the BSP1D and hybrid backends. Prior to this MR, the implementations of these collectives were disjoint; with this MR, both types of collectives are managed by a shared context (internally encapsulated by the BSP1D_Data object). This central management means every process maintains but a single lpf_coll_t instance, which is reused every time a public or internal collective is called.
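
A hedged, heavily simplified sketch of this design follows; the types below are local stand-ins rather than the real LPF or ALP declarations, and BSP1D_Data's actual interface is not reproduced here.

    #include <cstddef>
    #include <stdexcept>

    struct CollHandle { int dummy; };  // stand-in for an lpf_coll_t-like handle

    // per-process data in the spirit of BSP1D_Data: owns the single collectives
    // instance that both the public grb::collectives<> API and the internal
    // collectives reuse, instead of creating a fresh instance on every call
    class SharedCollectives {

        CollHandle handle;
        bool initialised = false;

        public:

            // called once during backend initialisation, with a proper max_calls bound
            void init( const size_t max_calls ) {
                if( max_calls == 0 ) {
                    throw std::invalid_argument( "max_calls must be positive" );
                }
                initialised = true;
            }

            // every public or internal collective retrieves the same handle
            CollHandle & get() {
                if( !initialised ) {
                    throw std::runtime_error( "collectives were not initialised" );
                }
                return handle;
            }

    };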

The revised collectives also make a systematic choice whereby collectives on scalars are orchestrated through the global buffer, so as to avoid registering the scalar memory addresses every time a collective is called. This is likely to cut latency costs significantly compared to the status prior to this MR. For collectives on arrays, the source and/or destination arrays are registered, which incurs a latency cost; assuming large enough arrays, however, that cost is much lower than that of copying the payload to and from the global buffer.

Features this MR introduces in realising the refactored collectives:

  • a notion of a final backend, i.e., the innermost backend that a composed backend ultimately relies on. The FinalBackend class collects a set of minimal, low-level functions that all final backends should provide, such as memcpy; and
  • a utility that translates LPF error codes into corresponding ALP error codes (a sketch of this idea follows the list).
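
The following is only an illustrative sketch of such a translation utility; the enumerators are local stand-ins so that the example is self-contained, while the real lpf_err_t and grb::RC enumerations have more members whose exact names are not reproduced here.

    // local stand-ins for lpf_err_t and grb::RC, respectively
    enum class LpfError { Success, OutOfMemory, Fatal };
    enum class AlpRC { SUCCESS, OUTOFMEM, PANIC };

    // translates an LPF error code into the corresponding ALP error code;
    // anything unrecognised maps conservatively to a panic
    inline AlpRC lpf2alp( const LpfError rc ) noexcept {
        switch( rc ) {
            case LpfError::Success: return AlpRC::SUCCESS;
            case LpfError::OutOfMemory: return AlpRC::OUTOFMEM;
            default: return AlpRC::PANIC;
        }
    }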

Other issues this MR addresses:

  • BSP1D did not seem to handle some border cases of grb::setElement correctly. These would trigger only in out-of-memory conditions or in erroneous use of the primitive, so the impact of the bug is expected to be minor;
  • the grb::setElement and grb::select base specifications did not list all error conditions, or did not do so correctly; herewith fixed;
  • a meta-bug in grb::select: in _DEBUG mode it attempted to print an iterator to stdout instead of first dereferencing it; herewith fixed.

As always, the MR includes both related and unrelated code style fixes.

@anyzelman (Member Author) commented:

Hi @alberto-scolari @KADichev, this one is ready, pending CIs and a manual test run.

@alberto-scolari your comments have been handled in line with the (resolved) discussions above.

If one of you could flag it as approved by tomorrow morning, we can merge it (if tests are indeed all OK :)

@anyzelman (Member Author) commented:

All tests and CIs OK at this point

@alberto-scolari (Contributor) left a comment:

looks good to me, previous issues were indeed addressed

@anyzelman anyzelman merged commit 40f33ca into develop Apr 5, 2024
@anyzelman (Member Author) commented:

Hi @KADichev, this MR has been merged and is closed, but I'll retain the branch so that you can still refer to the above comments in case they are useful for the LPF collectives.


Labels

bug (Something isn't working), enhancement (New feature or request)

Development

Successfully merging this pull request may close these issues: "Make internal collectives rely on global lpf_coll_t" and "ALP passes 0 as max_calls to LPF collectives initialisation...".
