GASNet ucx-conduit documentation
Mellanox Technologies LTD Copyright (c) 2019-2021.
Portions copyright 2021-2024, The Regents of the University of California.

Boris I. Karasev <[email protected]>
Artem Y. Polyakov <[email protected]>
Paul H. Hargrove <[email protected]>

@ TOC: @
@ Section: Overview @
@ Section: Where this conduit runs @
@ Section: Build-time Configuration @
@ Section: Job Spawning @
@ Section: Runtime Configuration @
@ Section: Multi-rail Support @
@ Section: On-Demand Paging (ODP) Support @
@ Section: HCA Configuration @
@ Section: Known Problems @
@ Section: Design Overview @
@ Section: Graceful exits @
@ Section: References @

@ Section: Overview @

  Ucx-conduit implements GASNet over the Unified Communication X (UCX) framework
  (see for general information on UCX).
  This conduit is feature-complete and supports Active Messages and
  hardware-offload of RMA and Atomic operations.  Performance tuning is
  planned as future work.

  Since this conduit is considered "experimental" pending performance-tuning
  work, it is disabled by default.  It can be enabled explicitly at configure
  time using either `--enable-ucx` or `--enable-ucx=probe`.  Either option will
  enable ucx-conduit if `configure` locates the prerequisites.  The difference
  is that the first option makes failure to locate the prerequisites fatal in

@ Section: Where this conduit runs @

  The conduit is based on the Unified Communication X (UCX) communication
  library (see for general information on UCX).  UCX is an open-source project
  developed in collaboration between industry, laboratories, and academia to
  create an open-source production-grade communication framework for
  data-centric and high-performance applications.  The UCX library can be
  downloaded from repositories (e.g., Fedora/RedHat yum repositories).  The UCX
  library is also part of Mellanox OFED (a.k.a. "MLNX_OFED") and the
  NVIDIA/Mellanox HPC-X Software Toolkit.

  The conduit is tested and known to work with
    - NVIDIA/Mellanox InfiniBand and RoCE devices starting from ConnectX-5
    - UCX library version 1.6 and above
    - Linux platform
  For NVIDIA/Mellanox adapters, the UCX conduit supports all transports: RC, UD
  and DC.  For large-scale applications, the DC transport is recommended for
  scalability reasons.
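  The minimum UCX version above is a dotted version number, so any check
  against it has to compare each field numerically: compared as strings,
  "1.11" would sort before "1.6".  The following is a minimal sketch of such a
  comparison.  The helper name `version_at_least` and the sample version string
  are hypothetical, and the sketch assumes at most three numeric fields; it does
  not query the installed UCX library.

```shell
# Hypothetical helper: succeed if dotted version $1 is >= dotted version $2.
# Assumes each version has at most three numeric fields (major.minor.patch).
version_at_least() {
  IFS=. read -r a1 a2 a3 <<EOF
$1
EOF
  IFS=. read -r b1 b2 b3 <<EOF
$2
EOF
  # Missing minor/patch fields are treated as 0.
  a2=${a2:-0}; a3=${a3:-0}; b2=${b2:-0}; b3=${b3:-0}
  [ "$a1" -gt "$b1" ] && return 0
  [ "$a1" -lt "$b1" ] && return 1
  [ "$a2" -gt "$b2" ] && return 0
  [ "$a2" -lt "$b2" ] && return 1
  [ "$a3" -ge "$b3" ]
}

ucx_version="1.11.0"   # assumed sample value, not read from the system
if version_at_least "$ucx_version" 1.6; then
  echo "UCX $ucx_version meets the tested minimum of 1.6"
else
  echo "UCX $ucx_version is older than the tested minimum of 1.6"
fi
```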

@ Section: Build-time Configuration @
  In order to enable the conduit, the option '--enable-ucx' must be passed to configure.
  By default, ucx-conduit will not be built.
  See `configure --help` for additional ucx-related configure options.
  See the extended-ref README for the GASNet general-purpose configuration parameters.

  --with-ucx-max-medium=[value]
    By default gex_AM_LUB{Request,Reply}Medium() return 3992, which is a
    4KiB buffer minus overheads such as handler arguments.  This configure
    option sets the default value for the GASNET_UCX_MAX_MEDIUM environment
    variable, which determines the LUB size at runtime.
    The value cannot be less than 512.
    The default value is 3992.
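
  As an illustration, the two configure options named in this README
  (`--enable-ucx` and `--with-ucx-max-medium=[value]`) can be combined as
  sketched below.  The helper name `ucx_configure_args` and the value 2048 are
  hypothetical; the only constraint encoded is the documented lower bound of
  512.  The command line is printed rather than executed.

```shell
# Hypothetical helper: print configure arguments that enable ucx-conduit.
# $1 = optional default for GASNET_UCX_MAX_MEDIUM (illustrative).
ucx_configure_args() {
  max_medium=${1:-}
  args="--enable-ucx"
  if [ -n "$max_medium" ]; then
    # This README states that the value cannot be less than 512.
    if [ "$max_medium" -lt 512 ]; then
      echo "error: max medium must be at least 512" >&2
      return 1
    fi
    args="$args --with-ucx-max-medium=$max_medium"
  fi
  echo "$args"
}

# Dry run: show the configure command instead of running it.
echo "./configure $(ucx_configure_args 2048)"
```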

@ Section: Job Spawning @

If using UPC++, Chapel, etc. the language-specific commands should be used
to launch applications.  Otherwise, applications can be launched using
the gasnetrun_ucx utility:
+ usage summary:
gasnetrun_ucx -n <n> [options] [--] prog [program args]
  -n <n>                 number of processes to run (required)
  -N <N>                 number of nodes to run on (not supported by all MPIs)
  -E <VAR1[,VAR2...]>    list of environment vars to propagate
  -v                     be verbose about what is happening
  -t                     test only, don't execute anything (implies -v)
  -k                     keep any temporary files created (implies -v)
  -spawner=(ssh|mpi|pmi) force use of a specific spawner (if available)
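
As an illustration of the options in the usage summary, the sketch below
assembles a command line for a 4-process run.  The program name `./my_app` and
the list of variables passed to `-E` are hypothetical, and whether a given
variable needs to be listed may depend on the spawner in use.  The command is
only printed, and it includes `-t` so that a real invocation would itself be a
test run.

```shell
# Hypothetical program name, process count, and environment variable list.
prog=./my_app
nprocs=4
env_vars=UCX_TLS,GASNET_UCX_MAX_MEDIUM

# Assemble the command from the options in the usage summary.
# -t asks gasnetrun_ucx to test only, without executing anything.
cmd="gasnetrun_ucx -n $nprocs -E $env_vars -spawner=ssh -t -- $prog"
echo "$cmd"
```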

There are as many as three possible methods (ssh, mpi and pmi) by which one
can launch a ucx-conduit application.  Ssh-based spawning is always
available, and mpi- and pmi-based spawning are available if the respective
support was located at configure time.  The default is established at
configure time (see section "Build-time Configuration").

The choice of spawner only affects the protocol used for parallel job setup and
teardown; in particular it is NOT used to implement any part of the
steady-state GASNet communication operations. As such, the selected protocol
needs to be stable and co-exist with GASNet communication, but its performance
efficiency is usually not a practical consideration.

To select a non-default spawner one may either use the "-spawner=" command-
line argument or set the environment variable GASNET_UCX_SPAWNER to "ssh",
"mpi" or "pmi".  If both are used, then the command line argument takes

@ Section: Runtime Configuration @

Ucx-conduit supports all of the standard GASNet environment variables
and the optional GASNET_EXITTIMEOUT family of environment variables.
See GASNet's top-level README for documentation.

Additional environment variables:


    This Boolean setting determines if the logic for transfer of AM Long
    payloads may assume UCX is using an in-order transport.  This results in
    faster transfers, but could yield data corruption if the payload and header
    are reordered.
    Setting to 1 forces this optimization, while 0 disables it.
    The default is to enable this setting if the conduit can determine that
    GASNet's PSHM (which guarantees ordering) is used for all intra-node
    communication, as opposed to UCX shared-memory transport which does not
    guarantee ordering (see bug 4155).


+ GASNET_UCX_MAX_MEDIUM
    This setting determines the size of buffers used for AM Mediums, and
    thus the return value of the gex_AM_LUB{Request,Reply}Medium() queries.
    The value cannot be less than 512.
    Unless a different value was set using --with-ucx-max-medium=[value]
    at configure time, the default value is 3992.
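
    The order in which the value is determined can be modeled as follows.
    This is a sketch of the documented behavior, not conduit code: the helper
    name `resolve_max_medium` is hypothetical, and the values 2048 and 1024 are
    illustrative.

```shell
builtin_default=3992   # default stated in this README

# Hypothetical helper: print the AM Medium buffer size that would apply.
# $1 = value given to --with-ucx-max-medium at configure time (optional).
resolve_max_medium() {
  configured=${1:-$builtin_default}
  # The environment variable, when set, overrides the configured default.
  echo "${GASNET_UCX_MAX_MEDIUM:-$configured}"
}

unset GASNET_UCX_MAX_MEDIUM
resolve_max_medium          # neither set: built-in default
resolve_max_medium 2048     # configure-time default only
GASNET_UCX_MAX_MEDIUM=1024
resolve_max_medium 2048     # environment setting takes precedence
```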

  UCX parameters (e.g. device or transport selection) are controlled through
  environment variables.
  The most commonly used variables are described on the UCX Wiki page.
  For the full list of tuning knobs supported by a particular UCX version,
  see the output of `ucx_info -f`.
  Previous versions of this document described a "personalized prefix" feature
  of the UCX library which resulted in a recommendation to use "UCX_GASNET_" or
  "GASNET_UCX_", depending on the UCX version in use.  It has since been
  determined that this UCX feature is not reliable.  Therefore, that prior
  recommendation is withdrawn.  One should use such variable names as "UCX_TLS"
  and *not* variants such as "UCX_GASNET_TLS" or "GASNET_UCX_TLS".

  The "UCX_TLS" environment variable controls transport selection and is
  one of the most commonly used UCX parameters. Available options are:

    $ ucx_info -f | grep UCX_TLS -B 23 | head -n 20
      # Comma-separated list of transports to use. The order is not meaningful.
      #  - all     : use all the available transports.
      #  - sm/shm  : all shared memory transports (mm, cma, knem).
      #  - mm      : shared memory transports - only memory mappers.
      #  - ugni    : ugni_smsg and ugni_rdma (uses ugni_udt for bootstrap).
      #  - ib      : all infiniband transports (rc/rc_mlx5, ud/ud_mlx5, dc_mlx5).
      #  - rc_v    : rc verbs (uses ud for bootstrap).
      #  - rc_x    : rc with accelerated verbs (uses ud_mlx5 for bootstrap).
      #  - rc      : rc_v and rc_x (preferably if available).
      #  - ud_v    : ud verbs.
      #  - ud_x    : ud with accelerated verbs.
      #  - ud      : ud_v and ud_x (preferably if available).
      #  - dc/dc_x : dc with accelerated verbs.
      #  - tcp     : sockets over TCP/IP.
      #  - cuda    : CUDA (NVIDIA GPU) memory support.
      #  - rocm    : ROCm (AMD GPU) memory support.
      #  Using a \ prefix before a transport name treats it as an explicit transport name
      #  and disables aliasing.

  In order to ensure ucx-conduit is being used with supported NVIDIA/Mellanox
  devices, the following transport set is recommended to exclude unsupported
  networks: "UCX_TLS=ib,sm,self".  To leverage UCX support for GPU device
  memory for GASNet-EX Memory Kinds, one may append ",cuda" or ",rocm".
  However, please see "Known Problems" for issues with UCX support for GPU
  device memory.
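
  A minimal sketch of building the recommended setting, optionally with one
  of the GPU suffixes mentioned above.  The helper name `ucx_tls_value` is
  hypothetical.  The result is exported under the plain name "UCX_TLS", not a
  prefixed variant.

```shell
# Hypothetical helper: print a UCX_TLS value based on the recommended set.
# $1 = optional GPU memory transport to append: "cuda" or "rocm".
ucx_tls_value() {
  tls="ib,sm,self"
  case "${1:-}" in
    "")        ;;                      # no GPU memory support requested
    cuda|rocm) tls="$tls,$1" ;;
    *) echo "error: unsupported GPU transport: $1" >&2
       return 1 ;;
  esac
  echo "$tls"
}

UCX_TLS=$(ucx_tls_value cuda)
export UCX_TLS
echo "UCX_TLS=$UCX_TLS"
```

  Calling `ucx_tls_value` with no argument yields the base recommendation
  "ib,sm,self".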

@ Section: Multi-rail Support @

  UCX supports multi-rail configurations, and ucx-conduit has been successfully
  tested on such systems.  Please consult the online UCX documentation for
  information on configuring the use of multiple network rails.

@ Section: On-Demand Paging (ODP) Support @

  The conduit supports ODP through UCX; see the "UCX_IB_REG_METHODS" UCX
  environment variable.

@ Section: Known Problems @

+ This conduit does not support NVIDIA/Mellanox ConnectX-4 adapters, or older.
  This is unlikely to change.

+ Support for segment-everything mode is currently disabled due to unbounded
  memory use in AMLong (especially as used in the "amref" implementation of RMA).
  Restored support for `--enable-segment-everything` is an area for future work.

+ Bug 4165
  The implementation of AM Mediums is known to have anomalous performance
  for payloads above about 1KiB on multiple systems.
  The most up-to-date information on this bug is maintained at:

+ Bug 4172
  In hybrid MPI applications there are conditions which can trigger a crash at
  exit time due to a call to MPI_Barrier() by ucx-conduit after MPI_Finalize()
  by the application.
  The most up-to-date information on this bug is maintained at:

+ Bug 4301
  As described in the "Runtime Configuration" section, UCX versions 1.9 and
  newer use a prefix of "GASNET_UCX" for private (per client of UCX)
  configuration.  However, this same prefix is used by ucx-conduit for its
  own environment settings.  The result is that, by default, the UCX library
  may issue warnings which include text like the following:
   "UCX  WARN  unused env variable: GASNET_UCX_SPAWNER"
   "(set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)"
  The work-around is to set UCX_WARN_UNUSED_ENV_VARS=n as the message
  suggests.
  The most up-to-date information on this bug is maintained at:

+ Bug 4337
  Some Active Message traffic patterns may induce ucx-conduit to perform
  excessive buffering within the UCX library.  This can lead to unexpectedly
  large memory consumption, which in turn may lead to multiple fatal error
  The most up-to-date information on this bug is maintained at:

+ Bug 4384
  Use of GASNet-EX "CUDA_UVA" memory kinds with ucx-conduit may crash due
  to incorrect default algorithm selection inside UCX for RMA transfers
  involving device memory.
  For UCX versions 1.11.0 and newer, there are work-arounds based on
  setting environment variables to modify the algorithm selection to be
  correct for device memory at the expense of slowing of host memory RMA.
  Currently, the most reliable work-around is to set the following
  The MLX5_SCATTER_TO_CQE=0 may or may not be required, depending on the
  underlying InfiniBand software stack.  Omitting it may provide a small
  performance benefit on systems where it is not required for correctness.
  There is no known work-around for versions of UCX prior to 1.11.0.
  The most up-to-date information on this bug is maintained at:

+ Bug 4393
  The use by ucx-conduit of UCX's support for CUDA devices currently lacks
  the manipulation of CUDA contexts required for correct multi-device
  Use of the recommended "UCX_TLS=ib,sm,self" setting is an effective
  work-around.  Others are described in the bug report.
  The most up-to-date information on this bug is maintained at:

+ Bug 4474
  When using hugepages together with NVIDIA/Mellanox InfiniBand and RoCE devices,
  it may be necessary to set `RDMAV_HUGEPAGES_SAFE=1` in the environment.
  Failure to do so may result in a self-explanatory error message:
    UCX  ERROR ibv_reg_mr(address=0x7f49c5a02800, length=2048, access=0xf) failed:
    Invalid argument. Application is using HUGE pages. Please set environment
  The most up-to-date information on this bug is maintained at:

+ See the GASNet Bugzilla server for details on other known bugs:

@ Section: Design Overview @

  The UCX conduit implements the GASNet core and extended APIs for
  UCX-compatible network adapters, where:
  (a) the core API includes conduit resource initialization and cleanup
      functionality along with Active Message communication;
  (b) the extended API includes support for one-sided and atomic operations.

  + UCX API documentation: (most recent) (oldest supported)

  Resource allocation and initialization:
    During GASNet initialization, every process creates a UCX worker, participates
    in the exchange of worker addresses (with Allgather communication pattern), and
    creates UCP endpoints representing connections to all other processes in the
    GASNet job. In addition, a set of service buffers for Active Messages is
    allocated and all required UCP memory registrations and memory key exchanges
    are performed.

  Active Message:
    The UCX conduit implementation uses pools of pre-allocated AM header buffers
    that are used for all types of AMs.  Sending of Requests and Replies each use
    a distinct pool, while reception of AMs uses a third pool.
    For AM Medium messages, the payload is always copied to the buffer
    holding the header, to reduce latency in the absence of local completion
    options ("lc_opt") support.  This behavior will be reconsidered in the future.
    For AM Long messages, payload transfer is implemented using an RDMA Put
    operation.

    Implementation of RDMA Put and Atomic operations is a thin layer on top of UCP
    primitives. In addition, "ucp_ep_flush_nb" is used to implement remote completion

    UCX directly supports most atomic operations on integer types, while the
    remaining operations are handled by the generic GASNet implementation.  The
    following table indicates the pairs of atomic operation and data type that are
    eligible for execution via UCX.  The UCX atomics will be used when the arguments
    to gex_AD_Create() express only supported pairs (unless suppressed by use of
    spanning a single shared memory nbrhd).

                I32/U32/I64/U64   FLT/DBL
    ---         ---------------   ------
    SET:        Y                 N
    GET:        Y                 N
    SWAP:       Y                 N
    (F)CAS:     Y                 N
    (F)ADD:     Y                 N
    (F)SUB:     Y                 N
    (F)INC:     Y                 N
    (F)DEC:     Y                 N
    (F)MULT:    N                 N
    (F)MIN:     N                 N
    (F)MAX:     N                 N
    (F)AND:     Y                 N
    (F)OR:      Y                 N
    (F)XOR:     Y                 N

    Y = offload is supported
    N = offload is not supported
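
    The table and the gex_AD_Create() rule can be encoded as a small lookup,
    sketched below.  The helper names `ucx_offload_supported` and
    `ucx_ad_offloadable` are hypothetical and are not part of GASNet or UCX.
    The sketch covers only the table itself and ignores any other condition
    that may suppress use of UCX atomics.

```shell
# Hypothetical helper: succeed if the (operation, type) pair is marked Y.
# Fetching forms share a row with their non-fetching forms (e.g. FADD and ADD).
ucx_offload_supported() {
  _op=$1
  _dtype=$2
  case "$_dtype" in
    I32|U32|I64|U64) ;;
    *) return 1 ;;                     # FLT and DBL: N for every operation
  esac
  case "$_op" in
    SET|GET|SWAP) return 0 ;;
    CAS|FCAS|ADD|FADD|SUB|FSUB|INC|FINC|DEC|FDEC) return 0 ;;
    AND|FAND|OR|FOR|XOR|FXOR) return 0 ;;
    *) return 1 ;;                     # MULT, MIN, MAX (and fetching forms): N
  esac
}

# Hypothetical helper: succeed only if every listed operation is offloadable
# for the given type, mirroring the "only supported pairs" rule.
# Usage: ucx_ad_offloadable TYPE OP...
ucx_ad_offloadable() {
  dtype=$1
  shift
  for op in "$@"; do
    ucx_offload_supported "$op" "$dtype" || return 1
  done
}

if ucx_ad_offloadable U64 FADD CAS SWAP; then
  echo "U64 {FADD,CAS,SWAP}: eligible for UCX atomics"
fi
if ! ucx_ad_offloadable U64 FADD MIN; then
  echo "U64 {FADD,MIN}: not eligible (MIN is not offloaded)"
fi
```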

@ Section: Graceful exits @

  Graceful exits are supported by the conduit.  The design is based on the
  ibv-conduit design (see the ibv-conduit README).

@ Section: Future work @

  + Implement non-trivial support for local completion options ("lc_opt")
  + Implement native support for Negotiated-Payload Active Message (NPAM)
  + Additional round of performance analysis and tuning.
  + Investigate older UCX versions and hardware compatibility and update README.
  + Support Graceful exits if rank 0 has failed.
  + Restore `--enable-segment-everything` mode:
    - Fix robustness problems with AMLong in `--enable-segment-everything` mode.
    - Apply ODP to optimize `--enable-segment-everything` mode.