Index of /dist-ex/udp-conduit

Name	Last modified	Size

Parent Directory		-
Makefile.am	2024-05-23 16:26	3.9K
Makefile.in	2024-05-23 16:27	30K
README	2024-05-23 16:26	20K
conduit.mak.in	2024-05-23 16:26	1.2K
gasnet_core.c	2024-05-23 16:26	42K
gasnet_core.h	2024-05-23 16:26	5.2K
gasnet_core_fwd.h	2024-05-23 16:26	8.4K
gasnet_core_help.h	2024-05-23 16:26	499
gasnet_core_internal.h	2024-05-23 16:26	5.8K
GASNet udp-conduit documentation
Dan Bonachea <[email protected]>

User Information:
-----------------

udp-conduit is a portable implementation of GASNet over UDP, the User Datagram
Protocol that is a standard component of the TCP/IP protocol suite.

Where this conduit runs:
-----------------------

udp-conduit is fully portable and should work correctly on any POSIX-like
system with a functional TCP/IP stack and a working C++98 compiler.  It is
expected to provide competitive performance on standard Ethernet hardware.  
It is not intended for production use on networks that expose a
high-performance network API, where an appropriate GASNet native conduit
implementation should be preferred. It does not explicitly exploit any
RDMA-over-Ethernet capabilities.

In rare cases, systems may require tweaks to an IP firewall to allow UDP
communication between compute nodes on the udp-conduit worker subnet. See more
below.

Optional compile-time settings:
------------------------------

* All the compile-time settings from extended-ref (see the extended-ref README)

* Some OS's (notably Cygwin) default to a very low value for FD_SETSIZE in
  sys/select.h, which can limit job scaling because the master spawning process 
  maintains 1 TCP socket per worker (or 3 with ROUTE_OUTPUT=1).
  Cygwin at least allows this value to be grown by the user at AMUDP library 
  compile-time, eg:
     make MANUAL_DEFINES=-DFD_SETSIZE=1024

* Similarly, most OS's have a limit on the number of file descriptors (aka open
  files) per process, which may need to be raised on the master process to
  support large-scale runs with 1..3 TCP connections per worker.
  This is usually controlled from the shell's limit or ulimit command, subject
  to system-imposed hard limits.

Choosing the udp-conduit spawn mechanism (GASNET_SPAWNFN):
---------------------------------------------------------

udp-conduit is very a portable conduit, requiring only a UNIX-like system with
a reasonable C/C++ compiler and a standard sockets-based TCP/IP stack (which
includes UDP support).  However, the one aspect that remains somewhat
site-specific is the means for spawning the GASNet job across the UDP-connected
worker nodes. Consequently, the primary user-visible configuration decision to
be make when installing/using udp-conduit is the spawning mechanism to use for
starting udp-conduit jobs. udp-conduit includes built-in support for several
spawning mechanisms (including a very portable ssh-based spawner), and an
extensibility option which allows you to plug in your own job spawning command,
if desired.

All udp-conduit jobs should be started from the console (or from wrapper
scripts such as tcrun, upcrun) using a command such as:

         $ ./a.out <num_nodes> [program args...]

where the first argument <num_nodes> is the number of GASNet nodes to spawn,
and any subsequent arguments will be passed to the GASNet client as argc/argv
upon return from gasnet_init().

The GASNET_SPAWNFN environment variable is used to tell udp-conduit which
mechanism to use for spawning the worker node processes for the job, and may be
one of the following values (some of which have additional related environment
variables):

* GASNET_SPAWNFN='L'  (localhost spawn)

Uses a standard UNIX fork/exec to spawn all the worker processes on the local
machine, and UDP traffic between the nodes is sent over the localhost loopback
interface (which usually bypasses network hardware, but not the kernel). Useful
for debugging and testing, but probably not of interest for production jobs.

* GASNET_SPAWNFN='S'  (ssh/rsh-based spawn, the default spawner)

Uses any command-line based ssh or rsh client to connect to worker nodes, which should
be running an ssh/rsh daemon (ie sshd).  Requires that users setup password-less
authentication to the worker nodes (eg. using RSA public-key-based
authentication and/or ssh-agent). Specifically, users need the ability to run a
command such as: "ssh machinename echo hi" and have that command execute on the
remote node without the need for typing any passwords. Finally, any network
firewalls present must be configured such that the worker nodes have the
ability to make TCP connections to the machine that executes the initial spawn
command (used for bootstrapping) and such that the worker nodes can send UDP 
packets to each other (used to implement GASNet communication).

See the ssh-agent tutorial here:  https://upc.lbl.gov/docs/user/sshagent.shtml
or the documentation for your ssh client/daemon for more information on setting
up secure, password-less authentication for ssh. rsh-based spawning is also
supported, although not generally recommended due to the inherent security
flaws in rsh-based authentication (although it may still be appropriate on 
physically secure private networks).

the ssh-based spawner also recognizes the following environment variables
(the SSH_SERVERS value provides the names of the worker nodes running sshd
and is required):

 option      default                   description
----------------------------------------------------------------
SSH_CMD      "ssh"                     ssh command to use. Can also be set to "rsh",
                                       or the name of any other remote shell spawner 
				       program/script with a similar interface.
SSH_SERVERS  none - must be provided   space-delimited list of server names 
                                       to pass to SSH_CMD, one per node, in order of
				       priority (trailing extra server names ignored)
	                               may specify DNS names or IP addresses
SSH_OPTIONS  ""                        additional options to give SSH_CMD client
SSH_REMOTE_PATH  current working dir.  the directory to use on remote machine
                                       must contain a copy of the udp-conduit a.out 
                                       executable to be started

(these variables may also optionally be prefixed with "AMUDP_" or "GASNET_")

So for example, one could use the ssh spawner to start a job for an a.out
executable linked against libgasnet-udp-*.a as follows (assuming a csh-like shell):

$ ssh node0 echo ssh is working
ssh is working
$ setenv GASNET_SPAWNFN 'S'
$ setenv GASNET_SSH_SERVERS 'node0 node1 node2 node3 node4 node5'
$ ./a.out 3 arg1 arg2 arg3
Hello from node0
Hello from node1
Hello from node2
$

* GASNET_SPAWNFN='C'  (custom spawner)

The custom spawner allows the user and/or site installer to provide a custom
command to be used for starting the worker processes across the worker nodes.
This provides spawning extensibility - the custom command can invoke any
arbitrary site-specific spawning command (for example to call to an OS-provided
spawner, a batch system, or a custom-written wrapper script that performs
whatever actions are necessary to start the job). The only required environment 
variable setting is CSPAWN_CMD, which provides to command to be invoked for 
performing the spawn, upon which the following substitutions are performed:

  %N => number of worker nodes requested
  %C => the command that should be run once by each worker node participating in the job
  %D => the current working directory
  %% => %
  %M => optional list of servers taken from CSPAWN_SERVERS (the first nproc are passed)
  %P => the program name as invoked (the original argv[0])

The custom command specified by CSPAWN_CMD is invoked exactly once at startup,
and is responsible for starting all the %N remote worker processes and having
them execute the command passed as %C, in a directory containing the a.out
udp-conduit executable. The worker processes then use information passed within 
%C to connect to the master process on the spawning console and bootstrap the job.

Note that any network firewalls present must be configured such that the worker
nodes have the ability to make TCP connections to the machine that executes the
initial spawn command (used for bootstrapping) and such that the worker nodes
can send UDP packets to each other (used to implement GASNet communication). 


The custom spawner recognizes the following environment variables:

 option           default    description
--------------------------------------------------
CSPAWN_CMD          none     the custom command to be called for spawning, 
                             after replacement of the patterns listed above
			     the command must result in %N invocations of the
			     %C command, once on each worker node
                               
CSPAWN_SERVERS      none     space-delimited list of servers to use - 
                             only required if %M is used in CSPAWN_CMD

CSPAWN_ROUTE_OUTPUT  0       set this variable to request built-in stdout/stderr 
                             routing from worker processes to the console, if your 
			     CSPAWN_CMD doesn't automatically provide that capability.
			     Note GASNET_ROUTE_OUTPUT is ignored for this spawner.

(these variables may also optionally be prefixed with "AMUDP_" or "GASNET_")

So for example, one could use the custom spawner in conjunction with an mpirun
command in a mixed MPI+GASNet/udp-conduit executable to start a job for an a.out
executable as follows (assuming a csh-like shell):

$ setenv GASNET_SPAWNFN 'C'
$ setenv GASNET_CSPAWN_CMD 'mpirun -np %N %C'  
$ ./a.out 3 arg1 arg2 arg3
Hello from node0
Hello from node1
Hello from node2

Similarly, one can use the srun command in SLURM:

$ setenv GASNET_SPAWNFN 'C'
$ setenv GASNET_CSPAWN_CMD 'srun -n %N %C'  
$ setenv GASNET_WORKER_RANK "SLURM_PROCID"    // optional, see docs below

Recognized environment variables:
---------------------------------

* All the standard GASNet environment variables (see top-level README)

* GASNET_NETWORKDEPTH - depth of network buffers to allocate. 
  Specifically, the number of AMRequests that can be outstanding to a
  given peer before further requests to that peer will stall to await reply/ack.
  Use larger values to increase bandwidth on longer-latency networks. 
  Note that internal buffer space grows linearly with depth, and specifying
  a value which is too large can lead to increased packet loss and 
  retransmission costs on OS's with small UDP buffer limits or 
  hardware that drops packets in response to congestion (includes most Ethernet
  hardware).

* GASNET_SPAWNFN - job spawn mechanism to use (see above)

* GASNET_RECVDEPTH - max depth in messages of the shared AM receive queue (shared
  by all incoming Requests and Replies from any peer). This limits the number
  of messages that can be received during a single Poll.  Defaults to a
  reasonable value based on job size and GASNET_NETWORKDEPTH.

* GASNET_SENDDEPTH - max number of AMRequests that each node can have outstanding,
  combined across ALL peers (as opposed to NETWORKDEPTH which is a per-peer limit). 
  This can be used to limit AMRequest send buffer utilization in jobs where nodes
  communicate with a large number of distinct peers.  
  Set to -1 to specify unlimited, which amounts to NETWORKDEPTH*jobsize.
  Defaults to a reasonable value based on job size and GASNET_NETWORKDEPTH.

* GASNET_SOCKETBUFFER_INITIAL, GASNET_SOCKETBUFFER_MAX
  Set the initial and maximum socket buffer sizes (i.e. SO_SNDBUF/SO_RCVBUF) in
  bytes to request for worker sockets. These are the *kernel-side* buffers used
  to hold in-flight datagrams. INITIAL defaults to the lesser of MAX and
  RECVDEPTH * <max-packet-size>.  Note that actual/achieved socket buffer sizes
  are also subject to kernel-imposed defaults and limits (e.g. on Linux
  determined by `/proc/sys/net/core/{r,w}mem_{default,max}`).

* GASNET_REQUESTTIMEOUT_MAX - max timeout in microseconds before a request
  is considered undeliverable (a fatal error). Defaults to 30000000 (30 seconds).
  Can be set to 0 for infinite timeout, which can be useful when freezing 
  processes for long periods of time during debugging sessions.

* GASNET_REQUESTTIMEOUT_INITIAL, GASNET_REQUESTTIMEOUT_BACKOFF
  Advanced options for tweaking the retransmission algorithm (initial timeout in us, 
  and timeout backoff multiplier). Larger timeouts allow more tolerance for 
  network congestion and inattentiveness of remote nodes.  Shorter timeouts
  provide faster recovery from packet loss (which is uncommon in most LANs),
  at a potential overhead cost of more useless retransmissions.
  Most users should probably leave these alone.

* GASNET_ROUTE_OUTPUT
  If non-zero, this option request AMUDP perform explicit forwarding of
  stdout/stderr streams from the workers to the console using TCP socket
  connections. If zero, no explicit forwarding is performed and this duty 
  is left up to the underlying system and job spawner.  The default is
  system-dependent.

* GASNET_LINEBUFFERSZ
  The size (in bytes) of the buffer used for line buffering of stdin/stdout.
  Should be large enough for decent bandwidth on multi-line chunks, but 
  scales master-side memory utilization linearly with workers that generate output.
  Default 1024.  Set to 0 to disable line buffering, which allows partial
  lines to be displayed sooner, but may result in concurrent output from
  multiple workers being interleaved on a single line.
  Has no effect if GASNET_ROUTE_OUTPUT is disabled.

* GASNET_MASTERIP
  Specify the exact IP address which the worker nodes should use to connect
  to the master (spawning) node.  By default the master node will pass the
  result of gethostname() to the worker nodes, which will then resolve that
  to an IP address using gethostbynname().

* GASNET_WORKERIP
  Specify the IP subnet to be used for communication among the worker nodes.
  By default, worker nodes will communicate among themselves using the same
  interface used to connect to the master node (see GASNET_MASTERIP, above).
  Example: GASNET_WORKERIP=192.168.0.0

* GASNET_ENV_CMD
  Specify the full path to the "env" command on the worker nodes.  By default
  the value "env" is used, which is sufficient as long as the command can be
  found in the default search path.

* GASNET_HOST_DETECT - host identification
  In addition to the standard options described in the top-level README,
  udp-conduit supports a "conduit" option.  This option uses the 32-bit IP
  address and a hash of the hostname to form a 64-bit identifier.  This is
  the default, even if configured using '--with-host-detect=...' to specify
  a different one.  However, an explicit value of GASNET_HOST_DETECT in the
  environment is always honored.
  This default generally works well for the common case where all worker nodes
  have a single IP interface on the worker subnet (see `GASNET_WORKERIP` above).
  However if the worker nodes include a system with multiple distinct active IP
  addresses on the worker subnet, then it's recommended to set GASNET_HOST_DETECT
  to "hostid" or "hostname" (whichever is guaranteed to return a value that
  uniquely identifies each actual worker). Otherwise, the default of "conduit"
  could mistakenly identify such multi-IP worker nodes as distinct hosts, and
  thus ineligible for shared-memory communication.  

* GASNET_USE_GETHOSTID
  ** THIS VARIABLE IS DEPRECATED **
  One can use GASNET_HOST_DETECT=gethostid to achieve the behavior
  previously requested using GASNET_USE_GETHOSTID=1.

* GASNET_WORKER_RANK
  May be set to force a particular rank assignment for worker processes.
  If set by any process, then it must be set by all worker processes before
  init to a disjoint set of integers in the range [0..numprocs).
  It may alternatively be set to the name of another environment variable in the
  worker environment from which to retrieve the assignment (e.g. "SLURM_PROCID",
  "PMIX_RANK", "OMPI_COMM_WORLD_RANK", etc).
  Default behavior is arbitrary rank assignment that groups co-located processes.


Known problems:
---------------

* Udp-conduit does not support user-provided values for GASNET_TSC_RATE* (which
  fine-tune timer calibration on x86/Linux).  This is partially due to a
  dependency cycle at startup with propagation of environment variables, but
  more importantly because the retransmission algorithm (and hence all conduit
  comms) rely on gasnet timers to be accurate (at least approximately), so we
  don't allow the user to weaken or disable their calibration.

* See the GASNet Bugzilla server for details on other known bugs:
  https://gasnet-bugs.lbl.gov/

Future work:
------------

* Integrate ssh-spawner to remove spawning scalability issues.

* Push thread-safety down into the AMUDP library, to expose more concurrency
  in GASNET_PAR mode.

==============================================================================

Design Overview:
----------------

The core API implementation is a very thin wrapper around the AMUDP
implementation by Dan Bonachea. See documentation in the other/amudp directory
or the AMUDP webpage (https://gasnet.lbl.gov/amudp) for details.

The udp-conduit directly uses the extended-ref implementation for the extended
API - see the extended-ref directory for details.

After some discussions at SC03, I decided it was worthwhile to provide a GASNet
conduit that runs as-natively-as-possible on gigabit ethernet. Given the
hardware characteristics of ethernet (most significantly, store-and-forward
routing), we don't ever expect this to be a very low-latency target, however
Ethernet is ubiquitous and the bandwidths can be respectable.. and we can
certainly hope to do better than
GASNet-over-AMMPI-over-MPICH-over-TCP-over-Ethernet, which was previously our
only choice on Ethernet. 

Implementing udp-conduit was basically just a matter of adapting my
mpi-conduit-over-AMMPI glue to accommodate my AM-over-UDP implementation
(AMUDP), which Titanium tests show can significantly outperform
AMMPI-over-MPICH-over-TCP (primarily because AM is a natural fit for much
smarter reliability protocols than TCP and the connectionless model of UDP is
more scalable).

The GASNet-level design of udp-conduit is very similar to mpi-conduit. It's a
bare core API implementation using the reference extended API implementation.
All handlers and network access are serialized on client threads. There's no
concurrent handler execution or conduit threads. GASNet segment is mmapped but
not specially registered in any way. Both aligned and unaligned segments are
supported. Unlike AMMPI (which is limited to MPI-1.1's functionality), AMUDP
provides a fully functional job management system (which runs on the spawning
console), which means that udp-conduit provides very robust job exit behavior -
it passes all testexit tests and should never leave zombie processes.

Current limitations:
* AMUDP is implemented in C++ (fairly straightforward C++ with no templates,
should work on any C++ compiler with support for simple exception-handling -
works on every one I've tried). The GASNet configure script checks for C++
support if and only if udp-conduit is enabled - platforms lacking a C++
compiler should configure with --disable-udp.

* job spawning is a little messy because we have to deal with site-specific
wrinkles, especially for acquiring node names. However, AMUDP offers a number
of job spawning options (the most portable being ssh-based spawn), and it's
successfully been deployed on a number of systems. There is also spawner
support for Localhost/fork (for debugging), and a "custom
command" option for extensibility.

* AMUDP was overhauled in June 2016 to greatly improve the scalability of udp-conduit's
internal buffer and metadata utilization. The worker processes now scale nicely
beyond 1000 nodes, and theoretically up to 1M nodes.

However, the master process used by the spawner still defaults to opening 3 TCP
connections with each compute process, which at large scale may exceed file
descriptor limitations for the master process on some OS's (see "compile-time
settings" above). This utilization can be reduced by setting
AMUDP_ROUTE_OUTPUT=0, but this might result in stdout/stderr problems,
depending on the spawner.  Users interested in large-scale applications are
highly recommended to use an appropriate native GASNet conduit, if one is
available.