GASNet ucx-conduit documentation
Mellanox Technologies LTD Copyright (c) 2019-2021.
Portions copyright 2021-2024, The Regents of the University of California.
Boris I. Karasev
Artem Y. Polyakov
Paul H. Hargrove

@ TOC: @
@ Section: Overview @
@ Section: Where this conduit runs @
@ Section: Build-time Configuration @
@ Section: Job Spawning @
@ Section: Runtime Configuration @
@ Section: Multi-rail Support @
@ Section: On-Demand Paging (ODP) Support @
@ Section: Known Problems @
@ Section: Design Overview @
@ Section: Graceful exits @
@ Section: Future work @

@ Section: Overview @

Ucx-conduit implements GASNet over the Unified Communication X (UCX) framework
(see https://www.openucx.org/ for general information on UCX).

This conduit is feature-complete and supports Active Messages and
hardware-offload of RMA and Atomic operations. Performance tuning is planned
as future work.

Since this conduit is considered "experimental" pending performance-tuning
work, it is disabled by default. It can be enabled explicitly at configure time
using either `--enable-ucx` or `--enable-ucx=probe`. Either option will enable
ucx-conduit if `configure` locates the prerequisites. The difference is that
the first option makes failure to locate the prerequisites fatal in
`configure`.

@ Section: Where this conduit runs @

The conduit is based on the Unified Communication X (UCX) communication library
(see https://www.openucx.org/ for general information on UCX). UCX is an
open-source project developed in collaboration between industry, laboratories,
and academia to create an open-source production-grade communication framework
for data-centric and high-performance applications.

The UCX library can be downloaded from repositories (e.g., Fedora/RedHat yum
repositories). The UCX library is also part of Mellanox OFED (a.k.a.
"MLNX_OFED") and the NVIDIA/Mellanox HPC-X Software Toolkit.

The conduit is tested and known to work with:
  - NVIDIA/Mellanox InfiniBand and RoCE devices starting from ConnectX-5
  - UCX library version 1.6 and above
  - Linux platform

For NVIDIA/Mellanox adapters, UCX conduit supports all transports: RC, UD and
DC. For large-scale applications, it is recommended to use DC transport for
scalability reasons.

@ Section: Build-time Configuration @

In order to enable the conduit, the option '--enable-ucx' must be passed to
configure. By default, ucx-conduit will not be built.

See `configure --help` for additional ucx-related configure options.
See the extended-ref README for the GASNet general-purpose configuration
parameters.

--with-ucx-max-medium=[value]
  By default gex_AM_LUB{Request,Reply}Medium() return 3992, which is a 4KiB
  buffer minus overheads such as handler arguments. This configure option sets
  the default value for the GASNET_UCX_MAX_MEDIUM environment variable, which
  determines the LUB size at runtime.
  The value cannot be less than 512. The default value is 3992.

@ Section: Job Spawning @

If using UPC++, Chapel, etc., the language-specific commands should be used
to launch applications. Otherwise, applications can be launched using
the gasnetrun_ucx utility:

  + usage summary:
    gasnetrun_ucx -n <n> [options] [--] prog [program args]
    options:
      -n <n>                 number of processes to run (required)
      -N <N>                 number of nodes to run on (not supported by all MPIs)
      -E <VAR1[,VAR2...]>    list of environment vars to propagate
      -v                     be verbose about what is happening
      -t                     test only, don't execute anything (implies -v)
      -k                     keep any temporary files created (implies -v)
      -spawner=(ssh|mpi|pmi) force use of a specific spawner (if available)

There are as many as three possible methods (ssh, mpi and pmi) by which one
can launch a ucx-conduit application. Ssh-based spawning is always available,
and mpi- and pmi-based spawning are available if the respective support was
located at configure time. The default is established at configure time (see
section "Build-time Configuration").
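
As an illustration of the usage summary above, the following is a minimal
sketch of how a gasnetrun_ucx command line might be composed. The helper name
build_launch_cmd, the program ./a.out, and the process counts are hypothetical
and chosen only for this example. The sketch prints the command rather than
executing it, so it can be tried without a GASNet installation.

```shell
#!/bin/sh
# Hypothetical helper (not part of GASNet): print, but do not execute,
# a gasnetrun_ucx command line.
# Arguments: <nprocs> <spawner, or "" for the default> <prog> [program args]
build_launch_cmd() {
    nprocs=$1
    spawner=$2
    shift 2
    cmd="gasnetrun_ucx -n $nprocs"
    # -spawner= is optional; omitting it uses the configure-time default.
    if [ -n "$spawner" ]; then
        cmd="$cmd -spawner=$spawner"
    fi
    # "--" separates gasnetrun_ucx options from the program and its arguments.
    echo "$cmd -- $*"
}

# prints: gasnetrun_ucx -n 4 -spawner=ssh -- ./a.out arg1
build_launch_cmd 4 ssh ./a.out arg1
# prints: gasnetrun_ucx -n 8 -- ./a.out
build_launch_cmd 8 "" ./a.out
```

The printed line can be run directly once its contents look right. Adding the
documented -t option asks gasnetrun_ucx itself to test without executing.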

The choice of spawner only affects the protocol used for parallel job setup and
teardown; in particular it is NOT used to implement any part of the
steady-state GASNet communication operations. As such, the selected protocol
needs to be stable and co-exist with GASNet communication, but its performance
efficiency is usually not a practical consideration.

To select a non-default spawner one may either use the "-spawner=" command-line
argument or set the environment variable GASNET_UCX_SPAWNER to "ssh", "mpi" or
"pmi". If both are used, then the command-line argument takes precedence.

@ Section: Runtime Configuration @

Ucx-conduit supports all of the standard GASNet environment variables and the
optional GASNET_EXITTIMEOUT family of environment variables. See GASNet's
top-level README for documentation.

Additional environment variables:

+ GASNET_UCX_AM_ORDERED_TLS
  This Boolean setting determines if the logic for transfer of AM Long payloads
  may assume UCX is using an in-order transport. This results in faster
  transfers, but could yield data corruption if the payload and header are
  reordered. Setting to 1 forces this optimization, while 0 disables it.
  The default is to enable this setting if the conduit can determine that
  GASNet's PSHM (which guarantees ordering) is used for all intra-node
  communication, as opposed to UCX shared-memory transport which does not
  guarantee ordering (see bug 4155).

+ GASNET_UCX_MAX_MEDIUM
  This setting determines the size of buffers used for AM Mediums, and thus the
  return value of the gex_AM_LUB{Request,Reply}Medium() queries.
  The value cannot be less than 512. Unless a different value was set using
  --with-ucx-max-medium=[value] at configure time, the default value is 3992.

In order to control UCX parameters (e.g., device or transport selection),
environment variables are used.
The most commonly used variables are described on the UCX Wiki page
    https://github.com/openucx/ucx/wiki/UCX-environment-parameters
For the full list of tuning knobs supported by a particular UCX version, see
the output of `ucx_info -f`.

Previous versions of this document described a "personalized prefix" feature of
the UCX library which resulted in a recommendation to use "UCX_GASNET_" or
"GASNET_UCX_", depending on the UCX version in use. It has since been
determined that this UCX feature is not reliable. Therefore, that prior
recommendation is withdrawn. One should use such variable names as "UCX_TLS"
and *not* variants such as "UCX_GASNET_TLS" or "GASNET_UCX_TLS".

The "UCX_TLS" environment variable controls transport selection and is one of
the most commonly used UCX parameters. Available options are:

    $ ucx_info -f | grep UCX_TLS -B 23 | head -n 20
    #
    # Comma-separated list of transports to use. The order is not meaningful.
    #  - all     : use all the available transports.
    #  - sm/shm  : all shared memory transports (mm, cma, knem).
    #  - mm      : shared memory transports - only memory mappers.
    #  - ugni    : ugni_smsg and ugni_rdma (uses ugni_udt for bootstrap).
    #  - ib      : all infiniband transports (rc/rc_mlx5, ud/ud_mlx5, dc_mlx5).
    #  - rc_v    : rc verbs (uses ud for bootstrap).
    #  - rc_x    : rc with accelerated verbs (uses ud_mlx5 for bootstrap).
    #  - rc      : rc_v and rc_x (preferably if available).
    #  - ud_v    : ud verbs.
    #  - ud_x    : ud with accelerated verbs.
    #  - ud      : ud_v and ud_x (preferably if available).
    #  - dc/dc_x : dc with accelerated verbs.
    #  - tcp     : sockets over TCP/IP.
    #  - cuda    : CUDA (NVIDIA GPU) memory support.
    #  - rocm    : ROCm (AMD GPU) memory support.
    #  Using a \ prefix before a transport name treats it as an explicit transport name
    #  and disables aliasing.
    #

In order to ensure ucx-conduit is being used with supported NVIDIA/Mellanox
devices, the following transport set is recommended to exclude unsupported
networks: "UCX_TLS=ib,sm,self".
To leverage UCX support for GPU device memory for GASNet-EX Memory Kinds, one
may append ",cuda" or ",rocm". However, please see "Known Problems" for issues
with UCX support for GPU device memory.

@ Section: Multi-rail Support @

UCX supports multi-rail configurations, and ucx-conduit has been successfully
tested on such systems. Please consult the online UCX documentation at
https://www.openucx.org/ for information on configuring the use of multiple
network rails.

@ Section: On-Demand Paging (ODP) Support @

The conduit supports ODP through UCX; see the "UCX_IB_REG_METHODS" UCX
environment variable.

@ Section: Known Problems @

+ This conduit does not support NVIDIA/Mellanox ConnectX-4 adapters, or older.
  This is unlikely to change.

+ Support for segment-everything mode is currently disabled due to unbounded
  memory use in AMLong (especially as used in the "amref" implementation of
  RMA). Restored support for `--enable-segment-everything` is an area for
  future work.

+ Bug 4165
  The implementation of AM Mediums is known to have anomalous performance for
  payloads above about 1KiB on multiple systems.
  The most up-to-date information on this bug is maintained at:
    https://gasnet-bugs.lbl.gov/bugzilla/show_bug.cgi?id=4165

+ Bug 4172
  In hybrid MPI applications there are conditions which can trigger a crash at
  exit time due to a call to MPI_Barrier() by ucx-conduit after MPI_Finalize()
  by the application.
  The most up-to-date information on this bug is maintained at:
    https://gasnet-bugs.lbl.gov/bugzilla/show_bug.cgi?id=4172

+ Bug 4301
  As described in the "Runtime Configuration" section, UCX versions 1.9 and
  newer use a prefix of "GASNET_UCX" for private (per client of UCX)
  configuration. However, this same prefix is used by ucx-conduit for its own
  environment settings.
  The result is that, by default, the UCX library may issue warnings
  which include text like the following:
    "UCX WARN unused env variable: GASNET_UCX_SPAWNER"
  and
    "(set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)"
  The work-around is to set UCX_WARN_UNUSED_ENV_VARS=n as the message
  recommends.
  The most up-to-date information on this bug is maintained at:
    https://gasnet-bugs.lbl.gov/bugzilla/show_bug.cgi?id=4301

+ Bug 4337
  Some Active Message traffic patterns may induce ucx-conduit to perform
  excessive buffering within the UCX library. This can lead to unexpectedly
  large memory consumption, which in turn may lead to multiple fatal error
  modes.
  The most up-to-date information on this bug is maintained at:
    https://gasnet-bugs.lbl.gov/bugzilla/show_bug.cgi?id=4337

+ Bug 4384
  Use of GASNet-EX "CUDA_UVA" memory kinds with ucx-conduit may crash due to
  incorrect default algorithm selection inside UCX for RMA transfers involving
  device memory. For UCX versions 1.11.0 and newer, there are work-arounds
  based on setting environment variables to modify the algorithm selection to
  be correct for device memory at the expense of slowing host memory RMA.
  Currently, the most reliable work-around is to set the following:
    MLX5_SCATTER_TO_CQE=0
    UCX_IB_TX_INLINE_RESP=0
    UCX_ZCOPY_THRESH=0
  The MLX5_SCATTER_TO_CQE=0 setting may or may not be required, depending on
  the underlying InfiniBand software stack. Omitting it may provide a small
  performance benefit on systems where it is not required for correctness.
  There is no known work-around for versions of UCX prior to 1.11.0.
  The most up-to-date information on this bug is maintained at:
    https://gasnet-bugs.lbl.gov/bugzilla/show_bug.cgi?id=4384

+ Bug 4393
  The use by ucx-conduit of UCX's support for CUDA devices currently lacks the
  manipulation of CUDA contexts required for correct multi-device operation.
  Use of the recommended "UCX_TLS=ib,sm,self" setting is an effective
  work-around. Others are described in the bug report.
  The most up-to-date information on this bug is maintained at:
    https://gasnet-bugs.lbl.gov/bugzilla/show_bug.cgi?id=4393

+ Bug 4474
  When using hugepages together with NVIDIA/Mellanox InfiniBand and RoCE
  devices, it may be necessary to set `RDMAV_HUGEPAGES_SAFE=1` in the
  environment. Failure to do so may result in a self-explanatory error message:
    UCX ERROR ibv_reg_mr(address=0x7f49c5a02800, length=2048, access=0xf) failed: Invalid argument. Application is using HUGE pages. Please set environment variable RDMAV_HUGEPAGES_SAFE=1
  The most up-to-date information on this bug is maintained at:
    https://gasnet-bugs.lbl.gov/bugzilla/show_bug.cgi?id=4474

+ See the GASNet Bugzilla server for details on other known bugs:
    https://gasnet-bugs.lbl.gov/

@ Section: Design Overview @

The UCX conduit implements the GASNet core and extended APIs for UCX-compatible
network adapters, where:
  (a) Core API includes conduit resource initialization and cleanup
      functionality along with Active Message communication;
  (b) Extended API includes one-sided and atomic operations support.

References:
  + UCX API documentation:
      https://openucx.github.io/ucx/api/latest/html/index.html (most recent)
      https://openucx.github.io/ucx/api/v1.6/html/index.html (oldest supported)

Resource allocation and initialization:
During GASNet initialization, every process creates a UCX worker, participates
in the exchange of worker addresses (with Allgather communication pattern), and
creates UCP endpoints representing connections to all other processes in the
GASNet job. In addition, a set of service buffers for Active Messages is
allocated and all required UCP memory registrations and memory key exchanges
are performed.

Active Message:
The UCX conduit implementation uses pools of pre-allocated AM header buffers
that are used for all types of AMs. Sending of Requests and Replies each use a
distinct pool, while reception of AMs uses a third pool.
For AM Medium messages, the payload is always copied to the buffer holding the
header, to reduce the latency in absence of local completion options ("lc_opt")
support. This behavior will be reconsidered in the future.
For AM Long messages, payload transfer is implemented using an RDMA Put
operation.

RDMA/AMO:
Implementation of RDMA Put and Atomic operations is a thin layer on top of UCP
primitives. In addition, "ucp_ep_flush_nb" is used to implement remote
completion tracking.

UCX directly supports most atomic operations on integer types, while the
remaining operations are handled by the generic GASNet implementation.
The following table indicates the pairs of atomic operation and data type that
are eligible for execution via UCX. The UCX atomics will be used when the
arguments to gex_AD_Create() express only supported pairs (unless suppressed by
use of GEX_FLAG_AD_FAVOR_MY_RANK or GEX_FLAG_AD_FAVOR_MY_NBRHD flags, or in a
run spanning a single shared memory nbrhd).

                I32/U32/I64/U64   FLT/DBL
    ---         ---------------   ------
    SET:               Y             N
    GET:               Y             N
    SWAP:              Y             N
    (F)CAS:            Y             N
    (F)ADD:            Y             N
    (F)SUB:            Y             N
    (F)INC:            Y             N
    (F)DEC:            Y             N
    (F)MULT:           N             N
    (F)MIN:            N             N
    (F)MAX:            N             N
    (F)AND:            Y             N
    (F)OR:             Y             N
    (F)XOR:            Y             N

    Y = offload is supported
    N = offload is not supported

@ Section: Graceful exits @

Graceful exits are supported by the conduit. The design is based on the
ibv-conduit description (see ibv-conduit README).

@ Section: Future work @

+ Implement non-trivial support for local completion options ("lc_opt")
+ Implement native support for Negotiated-Payload Active Message (NPAM)
+ Additional round of performance analysis and tuning.
+ Investigate older UCX versions and hardware compatibility and update README.
+ Support Graceful exits if rank 0 has failed.
+ Restore `--enable-segment-everything` mode:
  - Fix robustness problems with AMLong in `--enable-segment-everything` mode.
  - Apply ODP to optimize `--enable-segment-everything` mode.