GASNet ofi-conduit documentation Copyright 2015-2017, Intel Corporation Portions copyright 2018-2023, The Regents of the University of California. User Information: ---------------- This is the README for ofi-conduit. OpenFabrics Interfaces (OFI) is a framework focused on exporting fabric communication services to applications. OFI is best described as a collection of libraries and applications used to export fabric services. See more details at: https://ofiwg.github.io/libfabric/ This conduit is feature-complete and passes tests in recommended configurations. Performance tuning is planned as future work. Use of ofi-conduit is only recommended on certain networks (see next section). Therefore, it is disabled by default in most systems. It can be enabled explicitly at configure time using either `--enable-ofi` or `--enable-ofi=probe`. Either option will enable ofi-conduit if `configure` locates the prerequisites. The difference is that the first option makes failure to locate the prerequisites fatal in `configure`. Where this conduit runs: ----------------------- The GASNet OFI conduit requires libfabric 1.5 or newer. Use with some providers requires higher minimum versions (see below). The GASNet OFI conduit can run on any OFI provider that matches its requirements. However, at this time, it is recommended *only* for the following networks. + Intel or Cornelis Networks Omni-Path fabric, via the "psm2" provider + HPE Slingshot-10 (100 Gb/s ConnectX-5 NIC), via the "verbs;ofi_rxm" provider + HPE Slingshot-11 (200 Gb/s Cassini NIC), via the "cxi" provider The following providers have been used successfully on networks where other GASNet conduits are a more appropriate choice. Therefore we do not recommend use of these providers in general. + TCP, via the "sockets" provider + UDP, via the "udp;ofi_rxd" provider + Cray Aries, via the "gni" provider The following providers have not met our expectations (with respect to performance or stability) when tested on the listed networks and therefore are not currently recommended at all. + TCP, via the "tcp;ofi_rxm" provider + InfiniBand, via "verbs;ofi_rxm" provider Independent of recommendations above for or against use of any given provider, the following are minimum libfabric versions below which ofi-conduit will reject certain providers due to known correctness defects in older versions impacting GASNet stability: + "udp;ofi_rxd" 1.7 + "verbs;ofi_rxm" 1.12 (1.11 with HPE Slingshot-10) + "gni" 1.14 + "tcp;ofi_rxm" 1.15 Providers not listed above are accepted from the higher of 1.5 or the first libfabric release in which they appear. The following providers have passed correctness and performance sanity checks at some point in time. However, due to lack of access for periodic testing, we cannot consider these to be supported at this time: + AWS Elastic Fabric Adapter, via "efa" provider The ofi-conduit is currently known to lack support for the following providers as of libfabric v1.14.0. If these providers are important to you, please contact us at gasnet-staff@lbl.gov . + "psm" provider for the (end-of-life) True Scale products + "opx" provider, an alternative to "psm2" for the Omni-Path fabric The fi_info command installed with libfabric can be used to query the available OFI providers on your system. It is worth noting that the available providers on a cluster compute node may differ from those on a front-end node. Users with NVIDIA/Mellanox InfiniBand hardware should use the GASNet ibv-conduit, rather than ofi-conduit with the verbs OFI provider, as the former may provide better performance. Users with InfiniBand hardware with an InfiniPath HCA (PathScale or Qlogic) or True Scale HCA (Intel) should use the GASNet mpi-conduit, rather than ofi-conduit with the verbs OFI provider, as the former may provide better performance. Users with Ethernet hardware are encouraged to consider the GASNet udp-conduit as an alternative to ofi-conduit with the sockets or udp;ofi_rxd providers, as the former may provide more favorable performance. Users with Cray XC hardware are encouraged to consider the GASNet aries-conduit as an alternative to ofi-conduit with the gni provider, since the former may provide better performance. Additionally, while the GASNet maintainers have seen good results from preliminary testing of this provider with GASNet's ofi-conduit, nobody has thoroughly validated it. Therefore, use of the gni provider with ofi-conduit should be considered experimental. The ofi-conduit includes support for memory kinds (`GEX_MK_CLASS_CUDA_UVA` and `GASNET_HAVE_MK_CLASS_HIP`) for certain providers and subject to minimum versions of libfabric. See docs/memory_kinds.md in the GASNet-EX sources for details. OFI provider requirements to run this conduit: ------------------------------------- The provider MUST support all of the following: * API version 1.5 or newer * API version 1.9 or newer if the client will use `gex_MK_Create()` * endpoint type FI_EP_RDM * capability FI_RMA * capability FI_MSG for at least two endpoints per process * capability FI_HMEM if the client will use `gex_MK_Create()` * capability FI_MULTI_RECV unless GASNET_OFI_RECEIVE_BUFF_SIZE=recv * threading mode FI_THREAD_SAFE and/or FI_THREAD_DOMAIN (more info below) The provider MUST NOT require any of the following: * FI_MR_LOCAL * FI_MR_ALLOCATED in GASNET_SEGMENT_EVERYTHING mode * FI_CONTEXT for FI_RMA endpoints * FI_CONTEXT2 * FI_MSG_PREFIX * FI_RESTRICTED_COMP The provider MAY require any of the following: * address vector type FI_AV_TABLE and/or FI_AV_MAP * FI_MR_ENDPOINT * FI_MR_VIRT_ADDR (performance may differ with or without) * FI_MR_PROV_KEY (performance may differ with or without) * FI_RX_CQ_DATA (no operations for which this matters) * FI_ASYNC_IOV (no operations for which this matters) * that at most one endpoint be associated with any given completion queue * (this list is incomplete) Threading mode: FI_THREAD_SAFE and/or FI_THREAD_DOMAIN * In GASNET_SEQ and GASNET_PARSYNC modes, ofi-conduit uses FI_THREAD_DOMAIN since these modes ensure serialization of calls into libfabric. * By default in GASNET_PAR mode, FI_THREAD_SAFE is used with most providers to allow concurrent calls into libfabric. The sole exception is the psm2 provider which defaults to FI_THREAD_DOMAIN due to a lack of FI_THREAD_SAFE support. See the description of --{en,dis}able-ofi-thread-domain, below, for the means to override these defaults. Building ofi-conduit -------------------- Libfabric is a core component of OFI. To use ofi-conduit, the user firsts needs to locate or build an install of libfabric. The source code of libfabric can be found at: https://github.com/ofiwg/libfabric To build GASNet with ofi-conduit enabled, the minimum requirement is [path to]/configure --enable-ofi The following configure options are also available: * --with-ofi-home=INSTALL_PREFIX Used to provide the libfabric install prefix. If not used, the default is based on the location of fi_info in $PATH. If fi_info is not found in $PATH, then this option is required. * --with-ofi-provider=PROVIDER_NAME Since different providers expose different feature sets, ofi-conduit will use compile-time knowledge of the intended provider to elide unneeded code. This option can be used to name the provider to optimize for. If this option is not present, the configure script will attempt to detect an available provider using the fi_info utility, if available. If fi_info is not available, or no supported provider is listed by fi_info, or if the argument to this option is "generic", then provider features will be detected at runtime. While this is flexible as to which providers it supports, it will add branches in critical paths that may increase software overheads. This case is referred to as "generic provider" below. Note that some providers, such as "verbs;ofi_rxm" include a semi-colon in their name. To avoid issues with shell quoting, the configure script will allow substitution of a colon for the semi-colon (e.g. "verbs:ofi_rxm") as well as use of just the first word (e.g. "verbs") if no ambiguity would result. * --with-ofi-num-completions=VALUE Specifies the maximum number of transmit completions to be read from the transmit CQ during each call to the polling function. Default: 64 * --with-ofi-max-medium=VALUE This option specifies the default value of the environment variable GASNET_OFI_MAX_MEDIUM, which in turn determines the value returned by gex_AM_LUB{Request,Reply}Medium(). The value cannot be less than 512. Default: 8192 The following configure flags override options set by provider selection via auto-detection or by the --with-ofi-provider=PROVIDER_NAME option. These are not recommended to be used as tuning options. * --enable-ofi-thread-domain Forces the use of FI_THREAD_DOMAIN in GASNET_PAR mode. This results in one global lock to protect all calls into libfabric. This is the default for psm2 provider. * --disable-ofi-thread-domain Forces the use of FI_THREAD_SAFE in GASNET_PAR mode. This is the default for most providers and for the generic provider case. In GASNET_SEQ and GASNET_PARSYNC modes, the two options above have no effect. * --enable-ofi-mr-virt-addr Indicates that FI_MR_VIRT_ADDR memory registration support will be compiled into the conduit. * --disable-ofi-mr-virt-addr Indicates that FI_MR_VIRT_ADDR memory registration support not will be compiled into the conduit. Currently only "verbs;ofi_rxm" and "gni" providers default to use of FI_MR_VIRT_ADDR. For the generic provider case, the determination is made at runtime. * --enable-ofi-mr-prov-key Indicates that FI_MR_PROV_KEY memory registration support will be compiled into the conduit. * --disable-ofi-mr-prov-key Indicates that FI_MR_PROV_KEY memory registration support not will be compiled into the conduit. Currently only "verbs;ofi_rxm" and "gni" providers default to use of FI_MR_PROV_KEY. For the generic provider case, the determination is made at runtime. * --enable-ofi-mr-scalable [DEPRECATED] This deprecated option is an alias for --disable-ofi-mr-virt-addr --disable-ofi-mr-prov-key and corresponds to the legacy FI_MR_SCALABLE memory registration mode. * --disable-ofi-mr-scalable [DEPRECATED] This deprecated option is an alias for --enable-ofi-mr-virt-addr --enable-ofi-mr-prov-key and corresponds to the legacy FI_MR_BASIC memory registration mode. These options are IGNORED in the presence of any of the following options: --disable-ofi-mr-virt-addr --enable-ofi-mr-virt-addr --disable-ofi-mr-prov-key --enable-ofi-mr-prov-key These options may be removed in a future release. * --enable-ofi-av-map Indicates that only support for address vector type FI_AV_MAP will be compiled into the conduit. * --disable-ofi-av-map Indicates that only support for address vector type FI_AV_TABLE will be compiled into the conduit. Currently all providers default to use of FI_AV_TABLE. For the generic provider case, the determination is made at runtime. * --enable-ofi-multi-cq Indicates that the provider requires the use of distinct completion queues for the transmit completions of each endpoint. This is the default option for the cxi provider. * --disable-ofi-multi-cq Indicates that the provider supports the use of a single completion queue for the transmit completions of multiple endpoints. This is the default (and recommended) option for most providers. For the generic provider case, the logic for Cq creation and polling adapts at runtime to use a single completion queue if possible, and uses multiple Cqs only if necessary. * --enable-ofi-retry-recvmsg Enables logic in ofi-conduit to handle the provider returning `-FI_EAGAIN` from `fi_recvmsg()`. This is the default option for the cxi and udp;ofi_rxd providers and for the generic provider case. * --disable-ofi-retry-recvmsg Disable logic to handle `-FI_EAGAIN` returns from `fi_recvmsg()`. This is the default (and recommended) option for most providers. The default spawner to be used by the gasnetrun_ofi utility can be selected by configuring '--with-ofi-spawner=VALUE', where VALUE is one of 'mpi', 'pmi' or 'ssh'. If this option is not used, mpi is the default when available, and ssh otherwise. Here are some things to consider when selecting a default spawner: + The choice of spawner only affects the protocol used for parallel job setup and teardown; in particular it is NOT used to implement any part of the steady-state GASNet communication operations. As such, the selected protocol needs to be stable and co-exist with GASNet communication, but its performance efficiency is usually not a practical consideration. + mpi-spawner is the default when MPI is available precisely because it is so frequently present on systems where GASNet is to be installed. Additionally, very little (if any) configuration is required and the behavior is highly reliable. + pmi-spawner uses the same "Process Management Interface" which forms the basis for many mpirun implementations. When support is available, this spawner can be as easy to use and as reliable as mpi-spawner, but without the overheads of initializing an MPI runtime. + ssh-spawner depends only on the availability of a remote shell command such as ssh. For this reason ssh-spawner support is always compiled. However, it can be difficult (or impossible) to use on a cluster which was not setup to allow ssh to (and among) its compute nodes. For more information on configuration and use of these spawners, see README-{ssh,mpi,pmi}-spawner (installed) or other/{ssh,mpi,pmi}-spawner/README (source). Depending on the libfabric provider in use, there may be restrictions on how mpi-based spawning is used. For instance, the psm2 provider has the property that each process may only open the network adapter once. Additionally, if the MPI implementation also uses libfabric for communication, then there is a risk it may adjust settings in a way incompatible with ofi-conduit. If you wish to use mpi-spawner, please consult its README for advice on how to set your MPIRUN_CMD to use native TCP/IP to avoid these potential problems. Job Spawning ------------ If using UPC++, Chapel, etc. the language-specific commands should be used to launch applications. Otherwise, applications can be launched using the gasnetrun_ofi utility: + usage summary: gasnetrun_ofi -n [options] [--] prog [program args] options: -n number of processes to run (required) -N number of nodes to run on (not supported by all MPIs) -E list of environment vars to propagate -v be verbose about what is happening -t test only, don't execute anything (implies -v) -k keep any temporary files created (implies -v) -spawner=(ssh|mpi|pmi) force use of a specific spawner (if available) There are as many as three possible methods (ssh, mpi and pmi) by which one can launch an ofi-conduit application. Ssh-based spawning is always available, and mpi- and pmi-based spawning are available if the respective support was located at configure time. The default is established at configure time (see section "Building ofi-conduit", above). To select a non-default spawner one may either use the "-spawner=" command- line argument or set the environment variable GASNET_OFI_SPAWNER to "ssh", "mpi" or "pmi". If both are used, then the command line argument takes precedence. Recognized environment variables: --------------------------------- * GASNET_OFI_SPAWNER To override the default spawner for ofi-conduit jobs, one may set this environment variable as described in the section "Job Spawning", above. There are additional settings which control behaviors of the various spawners, as described in the respective READMEs (listed in section "Building ofi-conduit", above). * GASNET_QUIET - set to 1 to silence the startup warning indicating the provider in use may deliver suboptimal performance. * GASNET_OFI_RMA_POLL_FREQ - In order to ensure efficient progress, the conduit polls the RMA transmit completion queue once for every GASNET_OFI_RMA_POLL_FREQ RMA injections. Default: 32 * GASNET_OFI_NUM_BBUFS, GASNET_OFI_BBUF_SIZE, GASNET_OFI_BBUF_THRESHOLD - See the "Non-bulk, Non-blocking Put Functions" section for detail on these environment variables. * GASNET_OFI_NUM_RECEIVE_BUFFS and GASNET_OFI_RECEIVE_BUFF_SIZE - Active Message receive resources. These two settings control, respectively, the number and size of the "multi-receive" buffers allocated for the reception of in-bound AMs. Depending on the given provider's implementation of `FI_MULTI_RECV`, increasing these may improve AM throughput or have no impact. On some providers, reducing these may severely reduce AM throughput or even lead to crashes. Defaults: 8 and 1M See also: Bug 4461 and Bug 4517 under "Known problems" * GASNET_OFI_AM_INJECT_LIMIT - limit on size of fi_inject() payload. GASNet Active Messages of total size (at the ofi level) no larger than this parameter value will be sent using fi_inject(), and larger messages are sent using fi_send(). Use of fi_inject() reduces the overall complexity of the operation by ensuring the data is consumed synchronously, usually at the cost of an extra copy. Meanwhile, fi_send() consumes the data asynchronously, eliminating the extra copy at the expense of signaling an asynchronous local completion. This setting does not affect payload local completion semantics at the GASNet API level. It too large a value is requested, it will be reduced to the maximum which the OFI provider can support. Default: largest size supported by the provider. NOTE: If this parameter is not set, then "GASNET_OFI_INJECT_LIMIT" is accepted as an alias. * GASNET_OFI_RMA_INJECT_LIMIT - limit on size of RMA with FI_INJECT GASNet RMA put operations no larger than this size may be issued using the `FI_INJECT` flag to ensure the data is consumed synchronously. This can reduce the overall complexity of operations using `GEX_EVENT_NOW`, typically at cost of an extra copy. This setting does not affect local completion semantics at the GASNet API level. It too large a value is requested, it will be reduced to the maximum which the OFI provider can support. Default: largest size supported by the provider. * GASNET_OFI_TX_CQ_SIZE and GASNET_OFI_RX_CQ_SIZE - limits on completion queue length. These settings control the length of the completion queues (CQs) used for transmit and receive operations, respectively. The default of zero allows the provider to determine the length. Use of this default is recommended unless debugging a possible Cq overrun. * GASNET_OFI_MAX_REQUEST_BUFFS and GASNET_OFI_MAX_REPLY_BUFFS - limits on the number of buffers allocated for the construction of out-bound AM Requests and Replies, respectively. The default for both settings is 1024. * GASNET_OFI_NUM_INITIAL_REQUEST_BUFFS and GASNET_OFI_NUM_INITIAL_REPLY_BUFFS - the number of buffers to allocate at startup for the construction of out-bound AM Requests and Replies, respectively. If these values are lower than the corresponding "MAX" values (described immediately above) then the difference *may* be allocated dynamically, but only as needed. The default for each setting is the smaller of 256 or the respective MAX. In other words: with all defaults, 256 buffers are allocated to each pool and each poll is permitted to grow as large as 1024 buffers if the demand exists. * GASNET_OFI_MAX_MEDIUM - maximum AM Medium payload size This setting determines the size of buffers used for AM Mediums, and thus the return value of the gex_AM_LUB{Request,Reply}Medium() queries. The value cannot be less than 512. Unless a different value was set using --with-ofi-max-medium=[value] at configure time, the default value is 8192. * GASNET_OFI_SET_UNIVERSE_SIZE - enable automatic setting of FI_UNIVERSE_SIZE Setting GASNET_OFI_SET_UNIVERSE_SIZE=1 *allows* ofi-conduit to set FI_UNIVERSE_SIZE to the job size *if* it is not already set. Under no circumstances does ofi-conduit overwrite a pre-existing value. The default value is 1, unless using a libfabric version older than 1.15.2.0 together with the "cxi" or "generic" provider (due to bug 4413). * GASNET_OFI_DEVICE By default, ofi-conduit will open and use the first device enumerated by libfabric as matching the specified provider and required capabilities. This setting can be used to specify a device to be used in place of this default. See 'GASNET_OFI_LIST_DEVICES' and 'GASNET_OFI_LIST_DEVICES_NODES' for how to enumerate available devices. See also 'GASNET_OFI_DEVICE_*', immediately below. * GASNET_OFI_DEVICE_* The environment variable 'GASNET_OFI_DEVICE', described immediately above, provides only a single setting and unless one uses some external means to give per-process settings, this cannot provide per-process control. This can make it difficult to get the best performance from multi-rail systems with multiple processes per node and architectural locality properties that affect PCI/adapter access efficiency. See "topology-aware environment variable families" in GASNet's top-level README for a description of how 'GASNET_OFI_DEVICE_TYPE' can be used to establish process-specific values for 'GASNET_OFI_DEVICE'. As a concrete example, on OLCF's Crusher there are four NICs -- one per NUMA Node. Examining the corresponding Quick Start Guide one can construct the following settings to ensure processes bound to cores in a single NUMA Node will use the topologically nearest NIC: GASNET_OFI_DEVICE_TYPE=Node GASNET_OFI_DEVICE_0=cxi2 GASNET_OFI_DEVICE_1=cxi1 GASNET_OFI_DEVICE_2=cxi3 GASNET_OFI_DEVICE_3=cxi0 These specific recommendations are appropriate to the specific composition of a node of OLCF's Crusher, and should not be considered as generic advice for use of all multi-NIC systems. Of course, even on the same system, your mileage may vary. By default 'GASNET_OFI_DEVICE_TYPE' is "Socket" and all other variables in this family are unset. * GASNET_OFI_LIST_DEVICES The value is a boolean: "0" to disable or "1" to enable the reporting of all supportable provider/device pairs. In this context, "supportable" may be constrained by such factors as the configured provider or use of environment variables which constrain provider enumeration. See 'GASNET_OFI_LIST_DEVICES_NODES' for how to limit which nodes report. The default is "0" (no report). * GASNET_OFI_LIST_DEVICES_NODES If GASNET_OFI_LIST_DEVICES is enabled, then this setting may be used to limit which nodes report providers/devices pairs. The value is a list which may contain one or more integers or ranges separated by commas, such as "0,2-4,6". If unset, empty, or equal to "*" then all nodes will report (if enabled by GASNET_OFI_LIST_DEVICES). The default is no limit on which nodes report. * FI_PROVIDER - set to a provider name to override the default OFI provider selection Note that only in the "generic provider" case (described with the configure options) can this actually be used to change the provider to one different than selected at configure time. * FI_UNIVERSE_SIZE - This libfabric variable conveys the number of ranks / peers a given process endpoint expects to communicate with (default: 256). This setting is not directly consumed by GASNet ofi-conduit, but affects the behavior of the underlying providers it relies upon. Libfabric documentation encourages users to set this high enough to prevent internal CQ overruns for non-scalable communication patterns, but not so high as to waste provider-internal memory. See "GASNET_OFI_SET_UNIVERSE_SIZE" documentation for information on when this value is set automatically by ofi-conduit. Most notably, ofi-conduit will not overwrite a value set by the user. * FI_MR_CACHE_MAX_{SIZE,COUNT} - These libfabric variables control the cache of memory registrations used in some providers. These settings are not directly consumed by GASNet ofi-conduit, but affect the behavior of some underlying providers it relies upon. When using a provider known to experience degraded RMA performance with the defaults, ofi-conduit may set these values to "-1" to remove all limits on the memory registration cache. See Bug 4676 under "Known problems" for information on when these variables are set by ofi-conduit. Most notably, ofi-conduit will not overwrite a value set by the user. * FI_PSM2_LOCK_LEVEL - This variable only applies to the psm2 provider where it controls the internal locking state of the provider and can be set to the following three values: - FI_PSM2_LOCK_LEVEL=0: All locks inside of the provider will be disabled. - FI_PSM2_LOCK_LEVEL=1: Some locks inside of the provider will be disabled, and is suitable for programs that limit the access to each PSM2 context to a single thread. - FI_PSM2_LOCK_LEVEL=2: All locks inside of the provider are enabled. The default setting of this variable is 2. The original author recommends using a value of 0 when using GASNet in GASNET_SEQ mode and a value of 1 when running in GASNET_PARSYNC mode. For GASNET_PAR mode, a value of 1 should only be used if GASNet was configured with --enable-ofi-thread-domain, which only will allow a single thread to make calls into libfabric at a time. Otherwise, the default value of 2 should be used. This variable should be used by power users only and should be used at your own risk. * All the environment variables provided by libfabric (see `fi_info -e`) Note that many of these influence provider-specific behaviors, and in some cases setting non-default values may negatively affect the operation of ofi-conduit. * The environment variables described in the "Non-bulk, Non-blocking Put Functions" section of this file. * All the standard GASNet environment variables (see top-level README) Non-bulk, Non-blocking Put Functions ------------------------------------ For some calls to gex_RMA_Put{NB,NBI}() GASNet-EX requires the source buffer to be able to be reused as soon as the function returns. The ofi-conduit implements this by a hybrid approach that uses the FI_INJECT flag, bounce buffers, and blocking operations. * If the nbytes parameter is less than or equal to the chosen provider's inject_size (see the fi_endpoint(3) man page), the FI_INJECT flag will be used. * If the nbytes parameter is greater than the inject size but less than or equal to a user specifiable threshold (default is 4 times the bounce buffer size) then bounce buffers will be used. * If the nbytes parameter is greater than the threshold, the operation will be implemented as a fully-blocking operation. The following GASNet statistics counters show the number of times each of these code paths are entered: NB_PUT_INJECT, NB_PUT_BOUNCE, NB_PUT_BLOCK. The following environment variables may be used to tune this behavior: * GASNET_OFI_BBUF_SIZE - The size of each bounce buffer. Default is GASNET_PAGESIZE. * GASNET_OFI_NUM_BBUFS - The number of bounce buffers to be allocated at initialization. Default: 64 * GASNET_OFI_BBUF_THRESHOLD - Payload sizes above GASNET_OFI_BBUF_THRESHOLD will be transferred as a blocking operation. This is useful for when the overheads of bounce buffering become too great. Default: (4 * GASNET_OFI_BBUF_SIZE). * The following condition must hold: GASNET_OFI_NUM_BBUFS >= GASNET_OFI_BBUF_THRESHOLD/GASNET_OFI_BBUF_SIZE * These defaults are chosen to optimize for a 256K L2 cache, assuming 4K pages. It is recommended to modify these variables so that GASNET_PAGESIZE * GASNET_OFI_NUM_BBUFS == L2-cache-size. * These parameters only apply to gex_RMA_Put{NB,NBI}() with (lc_opt != GEX_EVENT_DEFER). They do not apply to gets, nor to blocking or value-based puts. Known problems: --------------- * Bug 4267 Multi-threaded applications in which one or more threads are using GASNet concurrent with another thread calling gasnet_exit() may experience "random crashes" as the exiting thread destroys resources in use by the others. This includes otherwise benign calls to gasnet_AMPoll(). The most up-to-date information on this bug is maintained at: https://gasnet-bugs.lbl.gov/bugzilla/show_bug.cgi?id=4267 * Bugs 4420 and 4503 At least verbs;ofi_rxm and cxi providers are known to fail when using read-only memory as the source of an RMA Put and/or AM Long operation. The most up-to-date information on these bugs are maintained at the follow: https://gasnet-bugs.lbl.gov/bugzilla/show_bug.cgi?id=4420 (verbs) https://gasnet-bugs.lbl.gov/bugzilla/show_bug.cgi?id=4503 (cxi) * Bug 4422 By default, multi-threaded applications run on a single core when using psm2 provider (for Omni-Path networks). There are two known work-arounds available: + The undesired pinning to a single core will not take place if the process is already pinned to a strict subset of the host's CPU cores. In many cases there are other motivations (such as NUMA affinity) for such pinning, making this our preferred recommendation when practical. However, the best means to accomplish this may depend on the batch system. An approach using hwloc-bind (independent of batch system) is demonstrated in the bug report (see link below). + If pinning to a strict subset of the host's CPU cores is either undesirable or impractical, one may instead set HFI_NO_CPUAFFINITY=1 in the environment to disable the undesired single-core pinning behavior. The most up-to-date information on this bug is maintained at: https://gasnet-bugs.lbl.gov/bugzilla/show_bug.cgi?id=4422 * Bug 4461 Under certain AM traffic patterns, the cxi provider (for the Slingshot-11 network on some HPE Cray EX systems) has been observed to fail in various ways, including crashes and dropped messages. At the time of this writing, the best known work-around is to set GASNET_OFI_RECEIVE_BUFF_SIZE=recv in your environment. This setting disables use of the "multi-receive" feature of libfabric, and the default for GASNET_OFI_NUM_RECEIVE_BUFFS is adjusted to compensate. Empirical evidence shows this prevents the failures in applications which exhibit them (which is *not* every application). However, this measurably increases the latency of small AM operations. So, this should only be used if it is suspected that bug 4461 is impacting your application. The most up-to-date information on this bug is maintained at: https://gasnet-bugs.lbl.gov/bugzilla/show_bug.cgi?id=4461 * Bug 4517 In libfabric 1.11 and earlier, the verbs;ofi_rxm provider used on the Slingshot-10 network of some HPE Cray EX systems has been observed to experience corruption of AMs, which (at least sometimes) are diagnosed with a fatal error message like one of the following: GASNet node [...] received an AM message from node [...] for a handler index with no associated AM handler function registered OR *** FATAL ERROR: Assertion failure (proc 0): in gasnetc_ofi_handle_am() at [...]/ofi-conduit/gasnet_ofi.c:1341: isreq == header->isreq op1 : 1 (0x00000001) == isreq op2 : 0 (0x00000000) == header->isreq At the time of this writing, the best known work-around is to update libfabric to version 1.12 or newer. However, replacing the vendor's version on a system like the HPE Cray EX is generally not advisable. So, for such systems an alternative work-around is to set GASNET_OFI_RECEIVE_BUFF_SIZE=recv in your environment. This setting disables ofi-conduit's use of libfabric's "multi-receive" feature, and adjusts the default value for GASNET_OFI_NUM_RECEIVE_BUFFS to some unspecified larger value to compensate. Empirical evidence shows this prevents the failures in applications which exhibit them (which is *not* every application). However, this measurably increases the latency of small AM operations. So, this should only be used if it is suspected that bug 4517 is impacting your application. NOTE: On networks other than Slingshot-10, ofi-conduit enforces a minimum libfabric version of 1.12 for this provider and this bug should not arise. The most up-to-date information on this bug is maintained at: https://gasnet-bugs.lbl.gov/bugzilla/show_bug.cgi?id=4517 * Bug 4554 On an HPE Cray EX system with Nvidia GPUs, Slingshot-11 network, and the 2.0.0 release of the Slingshot host software, it has been observed that tests using `GEX_MK_CLASS_CUDA_UVA` would crash in RMA Put operations of 224 bytes or less. An example backtrace would include something like the following, where the "cuda_gdrcopy_*" frames are the key ones: [3] #7 [3] #8 cuda_gdrcopy_impl (handle=0, devptr=0x148e53200000, hostptr=0x17d27f0, len=4, dir=dir@entry=GDRCOPY_FROM_DEVICE) at src/hmem_cuda_gdrcopy.c:265 [3] #9 0x0000148e9e82e9c4 in cuda_gdrcopy_from_dev (handle=, hostptr=, devptr=, len=) at src/hmem_cuda_gdrcopy.c:291 [3] #10 0x0000148e9e82d5dd in cuda_copy_from_dev (device=, dst=, src=, size=) at src/hmem_cuda.c:176 [3] #11 0x0000148e9e82c1ed in ofi_copy_from_hmem (size=4, src=, If one experiences crashes similar to the backtrace above with NVIDIA GPUs and the Slingshot-11 network, check if the command `rpm -qa libfabric` reports a version containing "SSHOT2.0.0". If it does, then this bug is present and the only known work-around for a user is to set `FI_HMEM_CUDA_USE_GDRCOPY=no` in one's environment. We also recommend asking your system administrator to upgrade to Slingshot 2.0.1 or newer. The most up-to-date information on this bug is maintained at: https://gasnet-bugs.lbl.gov/bugzilla/show_bug.cgi?id=4554 * Bug 4565 Use of explicit --with-ofi-provider=generic with the Cray Aries network (gni provider) selects a memory registration mode for which the provider claims support, but which does not work in practice. There are currently two known work-arounds at configure time: + The following overrides the dynamic selection of memory registration mode: --with-ofi-provider=generic --enable-ofi-virt-addr --enable-ofi-mr-prov-key The resulting GASNet will only work with the gni and verbs providers, despite otherwise preserving use of the generic provider. + The following specializes for gni, disabling dynamic feature detection: --with-ofi-provider=gni The resulting GASNet will only work with gni provider. The most up-to-date information on this bug, including known work-arounds, is maintained at: https://gasnet-bugs.lbl.gov/bugzilla/show_bug.cgi?id=4565 * Bug 4581 Currently, ofi-conduit's implementation of `gasnet_getMaxLocalSegmentSize()` does not take into consideration any limits a given provider may impose on the maximum length of a memory registration. This may result in warnings like the following (often followed by a fatal error in the client): *** WARNING (proc 0): Unexpected error -12 (Cannot allocate memory) from fi_mr_regattr() when binding segment [0x.., 0x...) to EP 0 The only known work-around at this time is to allocate a smaller segment. The most up-to-date information on this bug, including known work-arounds, is maintained at: https://gasnet-bugs.lbl.gov/bugzilla/show_bug.cgi?id=4581 * Bug 4652 There exist versions and/or configurations of Slurm on HPE Cray EX systems which prevent the launch of single-node ofi-conduit jobs with the cxi provider. Failure messages (one per-process) look like the following: *** FATAL ERROR (proc 0): in gasnetc_ofi_init() at [...]/gasnet_ofi.c:1164: fi_domain failed: -38(Function not implemented) For certain versions/configurations, one can work around this issue either by adding `--network=single_node_vni` to one's `srun` command or by setting `SLURM_NETWORK=single_node_vni` in the environment. However, there are also versions/configurations for which there is no known work-around. The most up-to-date information on this bug, including known work-arounds, is maintained at: https://gasnet-bugs.lbl.gov/bugzilla/show_bug.cgi?id=4652 * Bug 4676 It is known that the default values of the FI_MR_CACHE_MAX_COUNT and FI_MR_CACHE_MAX_SIZE environment variables can adversely impact RMA performance when using the cxi or verbs providers. Therefore, by default when configured to use either of these two providers (or the "generic" provider with FI_PROVIDER set to "cxi" or "verbs") ofi-conduit will set these variables to "-1" in order to remove the limits on the memory registration cache. However, the conduit will never overwrite a user-provided setting. To request the libfabric default values for either variable, one may set them to an empty string, which instructs ofi-conduit to unset the variable. Note that if libfabric is initialized prior to ofi-conduit, such as by an libfabric-based MPI used for job launch, then neither the automatic behavior nor the use of an empty string to suppress it will occur early enough to have the desired effect. In such cases, it is recommended to set these two environment variables to "-1" manually (to remove the limits) or to leave them unset (to use libfabric's default values). The most up-to-date information on this bug is maintained at: https://gasnet-bugs.lbl.gov/bugzilla/show_bug.cgi?id=4676 * CXI provider and large procs-per-node (includes bugs 4478 and 4480) There have been observations of multiple failure modes early in startup when using the "cxi" provider with 128 or more processes per node. These include fatal errors (see bug 4480) and in some cases hangs (see bug 4478). The most up-to-date information on these issues is maintained at: https://gasnet-bugs.lbl.gov/bugzilla/show_bug.cgi?id=4478 and https://gasnet-bugs.lbl.gov/bugzilla/show_bug.cgi?id=4480 * See docs/memory_kinds.md in the GASNet-EX sources for current information regarding known issues with respect to memory kinds support in ofi-conduit. * See the GASNet Bugzilla server for details on additional known bugs: https://gasnet-bugs.lbl.gov/ * Limits to MPI interoperability (includes bugs 4455 and 4638) Depending on the libfabric provider in use, it may not be possible to have both MPI and GASNet using the native network API in the same application. There are known issues with the psm2 and cxi providers, though the latter's issue also represents a general concern. The psm2 provider has the property that each process may only open the network adapter once. If you wish to use MPI and GASNet in the same application on the Omni-Path (OPA) fabric, then there are two options: 1. GASNet may be configured to use an alternative transport. Options include mpi- and udp-conduits. 2. MPI may be configured to use an alternative transport (most likely TCP). The relative performance implications of these options depends strongly on the properties of each application. The cxi provider requires use of `FI_` environment variables to configure the provider to reliably handle Active Messages under heavy loads. It has been observed that HPE Cray MPI uses some of the same settings, but with values which differ from ofi-conduit's validated values. As a result, initializing MPI prior to GASNet *might* yield less reliable Active Messages and possible hangs or crashes. Therefore, by default, when an ofi-conduit application is launched using `srun`, care is taken to avoid initializing MPI as the spawner. This is the recommended practice for applications which do not also use MPI. The most up-to-date information on this issue is maintained at: https://gasnet-bugs.lbl.gov/bugzilla/show_bug.cgi?id=4455 While there is a conflict between the cxi provider settings preferred by HPE Cray MPICH and GASNet-EX's ofi-conduit, it *is* possible to run hybrid applications which use both. However, if GASNet-EX is initialized before MPI, one may see failures of MPI_init() with messages including text like the following: create_endpoint(1311).......: OFI EP enable failed (ofi_init.c:1311:create_endpoint:Address already in use) This can be resolved by ensuring that MPI initializes first. That can be accomplished either explicitly in the application source code, or implicitly by setting `GASNET_OFI_SPAWNER=mpi` in the environment at application run time. Unfortunately, that will lead to lengthy warnings which begin with the following text: *** WARNING: ofi-conduit failed to configure FI_CXI_* environment variables due to prior conflicting settings. The warning message ends with text directing the reader here. The most up-to-date information on this issue, including recommendations regarding the "conflicting settings" warning, is maintained at: https://gasnet-bugs.lbl.gov/bugzilla/show_bug.cgi?id=4638 With the gni provider, when Cray MPI is initialized prior to GASNet (including via use of MPI for job launch), a fatal error is seen at startup which includes the following text: fi_domain failed: -16(Device or resource busy) Meanwhile, hybrid applications which initialize GASNet before MPI have been observed to run correctly. Therefore, the following recommendations apply to gni provider: + Do not use MPI for job launch: Configure using `--with-ofi-spawner=pmi --with-pmi-version=cray` to ensure the `gasnetrun_ofi` script defaults to use of PMI-based job spawning, and that configure will fail if the necessary Cray PMI library cannot be located. This latter point is important even if one is not using the `gasnetrun_ofi` script. + Hybrid applications should initialize GASNet before MPI. As noted above, the specific issue with the cxi provider also represents a more general concern that an MPI implementation may configure libfabric in a way which is incompatible with GASNet's ofi-conduit OR that GASNet's settings could cause problems for the MPI implementation. However, there are currently no known cases of either such interaction other than with the cxi and gni providers as described above. Future work: ------------ The OFI community is working on increasing the number of providers that can support GASNet. * Implement AMLong via target-side reassembly using fi_writedata() * Implement asynchronous LC support for Long, most likely as part of the work to implement target-side reassembly. * Improve non-trivial support for `GEX_FLAG_IMMEDIATE` with AM Long, once target-side reassembly removes the "sync" of "put-sync-send". * Improve asynchronous LC support for Put. Like libibverbs, libfabric does not provide distinct events for local versus remote completion. Therefore, the current support for `GEX_EVENT_{DEFER,GROUP}` signals local completion immediately before remote/operation completion. However, use of `FI_INJECT` could be used to achieve synchronous completion of small Puts in the same manner that ibv-conduit utilizes inline sends. * Implement native support for Negotiated-Payload Active Message (NPAM) * Eliminate a buffer copy for AM Medium (and AM Long below the RMA threshold) by using (iov_count > 1) on providers which support it, along with the corresponding asynchronous LC support for AM Medium. * Implement native atomics via FI_ATOMIC * Implement native collectives via FI_COLLECTIVE * Seek a better barrier than RDMADISSEM via FI_COLLECTIVE