Skip to content

MPI bindings for distributed parallel calculations with GAP (BROKEN, NEEDS NEW MAINTAINER)

License

Notifications You must be signed in to change notification settings

gap-packages/pargap

Repository files navigation

                           The ParGAP package

The ParGAP (Parallel GAP) package provides  a  way  of  writing  parallel
programs using the  GAP  language.  Former  names  of  the  package  were
ParGAP/MPI and GAP/MPI; the word MPI refers to Message Passing Interface,
a well-known standard  for  parallelism.  ParGAP  is  based  on  the  MPI
standard, and this distribution includes a subset implementation of  MPI,
to provide a portable layer with a high level interface to  BSD  sockets.
Since knowledge of MPI is not required for use of this software,  we  now
refer to the package as simply ParGAP. For  more  information  visit  the
author's ParGAP home page at:

  http://www.ccs.neu.edu/home/gene/pargap.html

ParGAP may be obtained as `pargap-XXX.zoo' (for some version number  XXX)
from the same places as GAP.

`pargap' is available for download via the GAP www page at

  http://www-gap.mcs.st-and.ac.uk/Packages/packages.html

or, alternatively, `pargap-XXX.tar.gz' (which is  assured  to  be  the
most recent version) can be obtained from the author's ftp site:

  ftp://ftp.ccs.neu.edu/pub/people/gene/pargap/

ParGAP has been tested on Linux (ELF), Solaris 2.6, OSF 1 (alpha) and OS X.


                           MPI Libraries

The  ParGAP  package uses an MPI (Message Passing Interface)  library  to 
communicate  between  processes.  One  such  library,  called MPINU,   is 
included  with this package, but you can also run ParGAP using a  version 
of MPI  that   is   already  present  on  your  system.  When you install
ParGAP  it will try to find and use a system MPI implementation,  and  if 
not  then  it will use its own MPINU. MPINU only works on  Unix  variants 
such as Linux, Solaris and OS X, but it should be possible to port it  to
Windows under Cygwin.

Using  a system MPI implementation is recommended: they can  have  better 
performance,  support  more  systems  and  are more robust.  Two  popular
implementations that are known to work with ParGAP are

  MPICH2     http://www.mcs.anl.gov/research/projects/mpich2/
  Open MPI   http://www.open-mpi.org/

They  can be downloaded from their website, or may be available via  your 
operating system's standard package management mechanism.

This version of ParGAP has an issue on Macs when when using  both  MPINU2 
and a system MPI implementation, so we recommend using the original MPINU 
library on these systems.


                      Installing the ParGAP package

To  install  the  ParGAP  package,  move  the  file  `pargap-XXX.zoo'  or
`pargap-XXX.tar.gz' into the `pkg' directory in which you plan to install
ParGAP. Usually, this will be the directory `pkg'  in  the  hierarchy  of
your version of GAP 4. Also note that  currently  it is not possible   to 
have the `pkg' directory  separate  from  GAP's  `pkg' directory; we hope 
to  remedy  this  in  future  versions  of ParGAP  (so  that it will also 
possible  to  keep  an   additional  `pkg'  directory   in  your  private 
directories; section "ref:Installing GAP Packages" of the GAP 4 reference 
manual gives details on how to do this, when it's  possible.) (If you are 
not  a system administrator and your system administrator  won't  install 
ParGAP for you on the system and you don't have enough disk space in your 
own directory  to create a whole new GAP, what you can do is  create  the 
illusion of having a complete version of GAP in your own directory  using 
symbolic links (sorry! currently that's all we can offer.)

Now change into the `pkg' directory in which you plan to install  ParGAP.
If you got a `.zoo' file, unpack it with:

  unzoo -x pargap-XXX

If you got a `.tar.gz' file and  your  `tar'  command  supports  the  `z'
option, unpack it with:

  tar zxf pargap-XXX.tar.gz

or otherwise unpack in two steps with:

  gunzip pargap-XXX.tar
  tar xvf pargap-XXX.tar

Whether you got the `.zoo' or `.tar.gz' archive you should now have a new
directory `pargap'. As for a generic GAP package, do:

  cd pargap
  ./configure 
  make

If  you have a system-wide implementation of MPI (such as MPICH2 or  Open
MPI) then GAP will also need to be rebuilt, and configure will stop  with 
a  message about this. If you are content for GAP to be rebuilt  with  no 
special configure options then run the ParGAP configure with 

  ./configure --with-basic-gap-configure

otherwise,  follow  the instructions  given  after  running  the   ParGAP 
./configure.

Your ParGAP should now be ready to use. In the `bin' subdirectory there
will be a script

  pargap.sh

which you should use to start ParGAP. Edit the script if necessary,  copy
it to a standard path and rename it according to how you intend  to  call
ParGAP (e.g. rename it: `pargap'). 


                            Running ParGAP

To run ParGAP when built with a system MPI library, you need to use your
system's  MPI  launcher. With both MPICH and  Open MPI  this  is  called 
mpiexec, and you can type

  mpiexec -n 3 pargap

to  run  three  copies of ParGAP, i.e. one master and two  slaves  (this 
assumes that you have renamed the script as `pargap'). 

If ParGAP was built using MPINU then you should run  ParGAP  by  calling
`pargap'  directly. In this case, it looks for a `procgroup' file  which
defines  the  master and slave processes that will be  used  by  ParGAP. 
A  sample `procgroup' file can be found in the `bin'  subdirectory,  and 
when  ParGAP is started this should be in the current directory, or  the
full  path to the file supplied using the `-p4pg' option.  Thus  if  you 
renamed  your  shell script `pargap', the following are  valid  ways  of 
starting ParGAP:

  pargap

(if current directory contains the file: `procgroup'), or

  pargap -p4pg myprocgroupfile

(where `myprocgroupfile' is the complete path of your  procgroup  file  -
there is no restriction on how you name it).

If you had trouble installing ParGAP, please see the next section of this
file. Otherwise, try it out:

gap> # This assumes your procgroup file includes two slave processes.
gap> PingSlave(1); #a `true' response indicates Slave 1 is alive
true
gap> # Print() on slave appears on standard output 
gap> # i.e. after the master's prompt.
gap> SendMsg( "Print(3+4)" );
gap> 7
gap> # A <return> was input above to get a fresh prompt.
gap> #
gap> # To get special characters (including newline: `\n')
gap> # into a string, escape them with a `\'.
gap> SendMsg( "Print(3+4,\"\\n\")" );
gap> 7

gap> # Again, a <return> was input above after the 7 and new-line
gap> # were printed to get a fresh prompt.
gap> #
gap> # Each SendMsg() is normally balanced by a RecvMsg().
gap> SendMsg( "3+4", 2);
gap> RecvMsg( 2 );
7
gap> # The following is equivalent to the two previous commands.
gap> SendRecvMsg( "3+4", 2);
7
gap> # The two SendMsg() commands that were sent to Slave 1 earlier have
gap> # responses that are waiting in the message queue from that slave.
gap> # Check that there is a message waiting. With some MPI implementations
gap> # the message is not immediately available, but when ProbeMsg() does
gap> # return true then RecvMsg() is guaranteed to succeed. 
gap> ProbeMsgNonBlocking( 1 );
false
gap> ProbeMsgNonBlocking( 1 );
true
gap> # Print() is a `no-value' functions, and so the result of a RecvMsg() 
gap> # in both these cases is "<no_return_val>".
gap> RecvMsg( 1 );
"<no_return_val>"
gap> RecvMsg( 1 );
"<no_return_val>"
gap> # As with Print() the result of Exec() appears on standard
gap> # output, and the result is "<no_return_val>".
gap> SendRecvMsg( "Exec(\"pwd\")" ); # Your pwd will differ :-)
/home/gene
"<no_return_val>"
gap> # Define a variable on a slave
gap> SendRecvMsg( "a:=45; 3+4", 1 );
7
gap> # Note "a" is defined on slave 1, not slave 2.
gap> SendMsg( "a", 2 ); # Slave prints error, output on master
gap>  Variable: 'a' must have a value
gap> # <return> entered to get fresh prompt.
gap> RecvMsg( 2 ); # No value for last SendMsg() command
"<no_return_val>"
gap> RecvMsg( 1 );
45
gap> # Execute analogue of GAP's List() in parallel on slaves.
gap> squares := ParList( [1..100], x->x^2 );
[ 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 
  289, 324, 361, 400, 441, 484, 529, 576, 625, 676, 729, 784, 841, 
  900, 961, 1024, 1089, 1156, 1225, 1296, 1369, 1444, 1521, 1600, 
  1681, 1764, 1849, 1936, 2025, 2116, 2209, 2304, 2401, 2500, 2601, 
  2704, 2809, 2916, 3025, 3136, 3249, 3364, 3481, 3600, 3721, 3844, 
  3969, 4096, 4225, 4356, 4489, 4624, 4761, 4900, 5041, 5184, 5329, 
  5476, 5625, 5776, 5929, 6084, 6241, 6400, 6561, 6724, 6889, 7056, 
  7225, 7396, 7569, 7744, 7921, 8100, 8281, 8464, 8649, 8836, 9025, 
  9216, 9409, 9604, 9801, 10000 ]
gap> # Send a large, local (non-remote) data structure to a slave
gap> Concatenation("x := ", PrintToString([1..10]*2));
"x := [ 2, 4, 6, 8, 10, 12, 14, 16, 18, 20 ]\n\000"
gap> SendMsg( Concatenation("x := ", PrintToString([1..10]*2)) ); 
gap> RecvMsg();
[ 2, 4, 6, 8, 10, 12, 14, 16, 18, 20 ]
gap> # Send a local (non-remote) function to a slave
gap> myfnc := function() return 42; end;;
gap> # Use PrintToString() to define myfnc on all slave processes
gap> BroadcastMsg( PrintToString( "myfnc := ", myfnc ) );
gap> SendRecvMsg( "myfnc()", 1 );
42
gap> # Ensure problem shared data is read into master and slaves.
gap> # Try one of your GAP program files instead.
gap> ParRead( "/home/gene/myprogram.g");


The ParGAP package was designed and written by:

  Gene Cooperman
  College of Computer Science
  Northeastern University, Boston, MA, U.S.A.

If you use ParGAP to solve a problem then please send a  short  email  to
`gene@ccs.neu.edu' about it, and reference the ParGAP package as follows:

\bibitem[Coo99]{Coo99}
      Cooperman, Gene,
      {\sl Parallel GAP/MPI (ParGAP/MPI)}, Version 1,
      College of Computer Science, Northeastern University, 1999,
      \verb|http://www.ccs.neu.edu/home/gene/pargapmpi.html|.

=========================================================================

                             Troubleshooting

                             General problems

0.  If  you  are  using  ParGAP on a Mac with  MPINU2  or  a  system  MPI 
    implementation then {\ParGAP} may consistently crash on  startup.  If 
    this is the case then try using MPINU instead by reconfiguring ParGAP
    with

      ./configure --with-mpi=MPINU

    This is a known issue which will be fixed in a forthcoming version.

1.  Do you have enough swap space to support multiple  GAP  processes?  A
    simple way to check this is with the UNIX command, `top'.  The  Linux
    version of `top' sorts by memory usage if you type `M'.

2.  `make' tries to automatically create:

       pkg/pargap/bin/pargap.sh

    and copy the parameters from `<GAP_ROOT>/bin/gap.sh'. <GAP_ROOT>  was
    specified when  you  executed  `./configure  <GAP_ROOT>'  to  install
    ParGAP. This can be error-prone if your site has an unusual setup. If
    you execute `<GAP_ROOT>/bin/gap.sh', does gap come up? If so, compare
    it   with   `pargap.sh'   and   check   for   correct   settings   in
    `.../pkg/pargap/bin/pargap.sh'?

3.  Were the remote slave processes able to start up? If so,  could  they
    connect back to  the  master?  To  test  connectivity  problems,  try
    manually starting a remote slave by executing a line in  the  script.
    Try a simple `ssh remote_hostname'  to  see  if  the  issue  is  with
    security.

4.  If  the  previous  step  failed  due  to  security  issues,  such  as
    requesting a password, you have several options. `man ssh'  tells you
    the security model at your site.  Then read "Problems  with Passwords
    (Getting Around  Security)" in the ParGAP manual in the `doc' directory.

5.  Is `pargap' listed in `.../pkg/ALLPKG'?
    [It's needed to autostart slaves.]

6.  Inside ParGAP, has MPI been successfully initialized?
    Try:  
    
    gap> MPI_Initialized();

7.  A remote (slave) ParGAP process starts in  your  home  directory  and
    tries to cd to a directory of the same name as your local  directory.
    Check your assumptions about the remote machine. Try:

    gap> SendRecvMsg("Exec(pwd)"); SendRecvMsg("UNIX_Hostname()");
    gap> SendRecvMsg("UNIX_Getpid()");

8.  Every  ParGAP  slave  process displays its  GAP  banner  and  startup 
    messages on the terminal of the master  process.  If  you  have  many 
    slaves  and  do not wish to see these messages, then  pass  the  `-b' 
    and/or  `-q'  switches to {\ParGAP} when it starts,  to  disable  the 
    banner or all messages respectively. See the GAP Reference Manual for 
    further details.


9. Read the documentation for further possible problems.


                      Problems when using MPINU 

1.  Did ParGAP find your `procgroup' file?
    [It looks in the current directory for `procgroup', or for:

          ... -p4pg PATH/procgroup

     on the command line.]

2.  Is the `procgroup' file in your current directory set correctly?
    Test it.  If you are calling it on a remote host, manually type:

       ssh <HOSTNAME> <ParGAP>

    where <HOSTNAME> and <ParGAP> appear exactly as in `procgroup', e.g.
    
       ssh denali.ccs.neu.edu /usr/local/gap4r3/bin/pargap.sh

    In some cases, `exec' is used to save process overhead. Also try:

       ssh <HOSTNAME> exec <ParGAP>

    If you plan to call it on localhost, try just:   <ParGAP>

    Note that if not all the slave processes succeed in connecting
    to the master, then ParGAP writes out a file:

       /tmp/pargapmpi-ssh.$$
       
    where $$ is replaced by the the process id of the ParGAP process.

3.  If the connection dies at random, after some period of time:
    You can experiment with SO_KEEPALIVE and variants.  (man  setsockopt)
    This periodically sends *null messages* so the  remote  machine  does
    not think that the originating  machine  is  dead.  However,  if  the
    remote machine fails to reply, the  local  process  sends  a  SIGPIPE
    signal to notify current processes of a broken  socket,  even  though
    there might have been only a temporary lapse in connectivity.
    `ssh' specifies `KeepAlive yes' by default, but setting `KeepAlive no'
    might get you through some transient lapses in  connectivity  due  to
    high congestion. 
    You may also want to experiment with: `setenv SSH "ssh -n"'

4.  If a host is on multiple networks, it will have multiple IP addresses 
    and usually multiple hostnames. In  this  case,  the  master  process  
    cannot  always  guess  correctly which  IP  address  (which  internet  
    address)  should be passed to the slave process, so  that  the  slave 
    process can call back to the master. In such cases, you may  need  to  
    tell {\ParGAP} which hostname or IP address to use for  the callback. 
    This   is   done   by   setting   the   UNIX   environment  variable, 
    `CALLBACK_HOST', as in the example below.

      # [ in sh/bash/... ]
      CALLBACK_HOST=denali.ccs.neu.edu; export CALLBACK_HOST
      # [ in csh/tcsh/... ]
      setenv CALLBACK_HOST=denali.ccs.neu.edu

    The appropriate line for your shell  can  be  placed  in  your  shell
    initialization file. Alternatively, you can set this up for all users  
    by placing the Bourne shell version (for `sh') somewhere between  the  
    first and last line of `.../pkg/pargap/bin/pargap.sh'.

5.  ParGAP is supplied with two different versions of MPINU: the original
    MPINU  and a later version, MPINU2, and it will also work with  other 
    MPI libraries if they are present on your system. By default, if  you 
    do  not have a system MPI implementation then MPINU2 is used. If  you 
    have  problems which appear to be MPI-related, try rebuilding  ParGAP
    with  a different MPI library. For example, to use MPINU  instead  of 
    MPINU2 then run configure using

      ./configure --with-mpi=MPINU


                 Problems when using a system MPI library


1.  Line  editing  at the GAP command prompt is  unlikely  to  work  when 
    ParGAP is invoked with an MPI launcher, since they tend to  do  their 
    own processing of the terminal I/O (stdin/stdout/stderr)  which  does 
    not work well either the readline library used in newer  versions  of 
    GAP or the in-built terminal editing in earlier versions of GAP.   It 
    may  be  useful  to run ParGAP  through  the  `rlwrap'  utility,   if 
    available. For example, if ParGAP is run using `mpiexec',   then  try

      rlwrap mpiexec -n 3 pargap

    This should restore some of the line editing, although tab completion 
    is  limited to commands that `rlwrap' has already seen you use.   For 
    more information, try `man rlwrap'.

2.  The command `FlushAllMsgs()' is not available when using a system MPI 
    implementation,   since it tests show  that  `ProbeMsgNonBlocking()', 
    which it uses cannot be relied upon to always return `true' the first 
    time that it is called after a message has been sent. If your  system 
    MPI   implementation  does  exhibit   this  desired   behaviour   for 
    `ProbeMsgNonBlocking()' then you can install your own local  copy  of 
    `FlushAllMsgs()'  by  copying  the  code  for  this   function   from 
    `lib/slavelist.g',  removing the  `if'  statement  and  renaming  the 
    function.

3.  The command `ParReset()' (see~"ParReset") is not available when using 
    a system MPI implementation. When using a MPINU library,  the  slaves 
    are launched by ParGAP itself and so can be contacted and  restarted, 
    but with a system MPI library the slaves are  launched  by  `mpiexec' 
    (or  whichever MPI launcher you use) and  so  cannot  be  reset  from 
    within ParGAP. There is no known workaround for this.

4.  GAP and, in particular, the \package{IO} Package install handlers for 
    the SIGCHLD signal. Many implementations of MPI  also  install  their 
    own SIGCHLD handler, which may then conflict with {\ParGAP}.  Testing 
    has revealed no issues, but we cannot guarantee that there will be no 
    interaction  between the two.   In particular,  this  may  result  in 
    temporary files not being cleaned up properly.

5.  The GAP memory manager,  GASMAN,  can run into problems extending the 
    GAP  workspace if external libraries use `malloc' to  allocate  their 
    own memory. MPINU avoids the use of `malloc' as much as possible, but
    system  MPI  implementations  may not be as  careful.   This  can  be 
    resolved by starting ParGAP with the `-s' command-line switch,  which 
    asks ParGAP to pre-allocate memory before it starts.   You can safely 
    pre-allocate  more memory than you will actually need since  physical 
    memory will only be mapped when it is actually used,   so for example 
    you could allocate 3Gb:

      mpiexec -n 3 pargap -s 3g

    The `-a' and `-m' switches can also be used to control memory  usage. 
    See the GAP Reference Manuel for further information.

News of any other issues or solutions would be gratefully accepted.


=========================================================================

                               Final Notes

Note that this package modifies  the  GAP  `src'  and  `bin'  files,  and
creates a  new  GAP  kernel.  This  new  GAP  kernel  can  be  shared  by
traditional users of the old, sequential GAP kernel, and by  those  doing
parallel processing.

The GAP kernel will have identical behavior to the old  GAP  kernel  when
invoked through the gap.sh script or the `bin/@GAParch@/gap' binary.  The
new ParGAP variables will appear to the end user _ONLY_ if the GAP binary
was invoked as `pargapmpi': a symbolic link to the actual GAP binary. The
script, `pargap.sh', does this.

So, in a multi-user environment, traditional users can  continue  to  use
`gap.sh'  without  noticing  any  difference.  Only  an   invocation   as
`pargap.sh' will add the new features.

Comments and contributions to a ParGAP user library, or any other type of
assistance, are gratefully accepted.

							Gene Cooperman
							gene@ccs.neu.edu

About

MPI bindings for distributed parallel calculations with GAP (BROKEN, NEEDS NEW MAINTAINER)

Resources

License

Stars

Watchers

Forks

Packages

No packages published