GASNet Extended API Notes for Conduit Writers

The GASNet Extended API is a means of leveraging high-level (and eventually collective) operations from specialized network hardware. This document presents an overview of the requirements that GASNet puts and gets impose on networks and attempts to disambiguate the many flavours of puts/gets currently in the interface. The remainder of this document assumes familiarity with the following:

GASNet spec (specifically sections on the GASNet segment, GASNet puts and of course familiarity with the GASNet core)
Network-specific API documentation (LAPI, GM, Elan, SCI, ...)
Pinning networks should consider using the firehose interface (link to be posted...)

The most important aspect of GASNet puts and gets, which unfortunately exposes restrictions in much of the network hardware available today, is the ability of the network to complete one-sided operations without requiring programmers to set up RDMA-capable memory at runtime. For this reason, the following distinction between networks is necessary:

Hardware assisted RDMA. Puts and gets for any reasonable size can be completed without interrupting the remote host processor. This approach requires assistance from the NIC to handle page faults and replicate some of the data structures present in the host's Memory Management Unit. These networks are typically more expensive but are preferred for Global Address Space languages.
Explicit Pinning RDMA. The network-level programmer must explicitly set up virtual memory regions that are pinned (locked against being paged out). For other programming models, pinning small portions of memory may be sufficient. Since GAS languages typically read and write remote memory regions that may be larger than the amount of physical memory, explicit pinning must be handled carefully -- mainly because of the limits on how much physical memory can be pinned, but also because of the implications of interrupting remote host processors.

The ideal GASNet put/get operation should exhibit the following characteristics:

Non-blocking and zero-copy, primitives which should be in any high performance communication system.
Fully one-sided, in that the target host processor is not interrupted and no mechanism requires the programmer to explicitly pin areas of virtual memory at runtime.
Expose the entire Virtual Address Space, so puts/gets can be completed without explicit pinning, or any ad-hoc mechanism to be managed by the conduit programmer.
Low software overhead, meaning the combined costs of the runtime system, the communications layer and the network layer should be at a minimum when issuing communication calls -- especially small operations.
Low Latency, High Bandwidth: although high bandwidth dominates in many bulk synchronous messaging paradigms, small-message latency is just as important for GASNet.
Efficient completion mechanism, such that completions for remote operations are both expressive and low-cost in software overhead in order to sync non-blocking and implicit non-blocking GASNet operations.
Support for RDMA puts/gets of any size and alignment, without the need for read-modify-write mechanisms and without introducing other consistency problems.
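
The distinction between syncing explicit-handle ("non-blocking") and implicit-handle ("implicit non-blocking") operations can be illustrated with a minimal sketch. All names below (op_handle_t, nbi_region_t, op_start_nb() and so on) are hypothetical and are not part of GASNet; the network is simulated by calling the completion functions directly. The point is only that an explicit handle tracks one operation, while implicit operations can share a single counter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical explicit handle: one completion flag per operation. */
typedef struct { int done; } op_handle_t;

/* Hypothetical implicit-handle state: a count of operations in flight. */
typedef struct { size_t outstanding; } nbi_region_t;

/* Initiate an explicit-handle operation; the caller keeps h to sync later. */
static void op_start_nb(op_handle_t *h) { h->done = 0; }

/* Initiate an implicit-handle operation; only a counter is incremented. */
static void op_start_nbi(nbi_region_t *r) { r->outstanding++; }

/* Invoked by the (simulated) network when an explicit operation finishes. */
static void op_complete_nb(op_handle_t *h) { h->done = 1; }

/* Invoked by the (simulated) network when an implicit operation finishes. */
static void op_complete_nbi(nbi_region_t *r) {
    assert(r->outstanding > 0);
    r->outstanding--;
}

/* Non-blocking completion tests; a real conduit would also poll here. */
static int op_try_sync_nb(const op_handle_t *h) { return h->done; }
static int op_try_sync_nbi(const nbi_region_t *r) { return r->outstanding == 0; }
```

In this sketch the implicit path costs one increment and one decrement per operation, which is the kind of low software overhead the list above asks for; how completions are actually detected depends on the network.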

Explicit Pinning Idiosyncrasies

Since one-sided operation over pinning networks cannot possibly be guaranteed for every put/get, the alternatives for sending puts and gets to a remote node are the following:

Rendezvous: A request to pin a remote region is sent to the target node. Once the initiator receives the reply confirming that the remote region is pinned, the one-sided operation can be initiated.
- Pros: Resulting one-sided operation has good large message bandwidth and doesn't interrupt remote host processor
- Cons: The rendezvous request does interrupt the remote host processor, and each put effectively requires two network roundtrips (three if the remote region must be unpinned).
Active Messages: The GASNet core provides AM messaging capability which is sufficiently general for puts/gets although discouraged for efficiency.
- Pros: Easy to use, general enough for functional operation
- Cons: Not one-sided, not zero-copy (possible RDMA in AM Long Asyncs still requires an ad hoc mechanism for pinning remote memory -- see Firehose).
Firehose: Firehose clients can expose one-sided, zero-copy communication as the common case while minimizing the number of host-level synchronizations (such as those needed by the rendezvous approach). Additionally, the cost of synchronization and pinning is amortized over multiple remote memory operations.
- Pros: One-sided zero-copy in the common case, low synchronization overhead and has a flexible API.
- Cons: Something else to implement for a conduit writer.
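
To make the amortization argument concrete, here is a toy model that counts how many pin handshakes are needed when the initiator remembers which remote pages are already pinned. This is not the Firehose algorithm or its API: pin_cache_t, pin_cache_acquire(), the cache size and the round-robin eviction are all assumptions made for illustration, and the handshake is only simulated by a counter.

```c
#include <assert.h>
#include <stddef.h>

#define PIN_CACHE_SLOTS 8   /* hypothetical capacity */

/* Toy table of remote pages the initiator believes are already pinned. */
typedef struct {
    size_t pages[PIN_CACHE_SLOTS];
    size_t used;        /* number of valid entries */
    size_t next;        /* next slot to overwrite (round-robin) */
    size_t handshakes;  /* pin requests that had to go to the target */
} pin_cache_t;

/* Returns 1 if the page was already known to be pinned, so the put could
 * be issued one-sided right away; returns 0 if a handshake came first. */
static int pin_cache_acquire(pin_cache_t *c, size_t page) {
    assert(c != NULL);
    for (size_t i = 0; i < c->used; i++)
        if (c->pages[i] == page)
            return 1;
    c->handshakes++;                 /* simulated round trip to the target */
    c->pages[c->next] = page;        /* remember it, evicting round-robin */
    c->next = (c->next + 1) % PIN_CACHE_SLOTS;
    if (c->used < PIN_CACHE_SLOTS)
        c->used++;
    return 0;
}
```

Under this model, 100 puts to the same remote page cost one handshake in total, whereas a per-operation rendezvous would pay one for each put.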

If the network supports more than one send mechanism, it is always useful to know what performance can be expected from each mechanism (possibly as a function of message size).
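
One way to use such measurements is a small dispatch function that picks a mechanism from the message size. The sketch below is only an illustration: send_mech_t, choose_mechanism() and the 4096-byte crossover are hypothetical, and a real conduit would substitute thresholds measured on its own network.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
    SEND_AM,          /* payload carried in an Active Message */
    SEND_RDMA,        /* one-sided transfer, remote region already pinned */
    SEND_RENDEZVOUS   /* pin request first, then a one-sided transfer */
} send_mech_t;

/* Hypothetical crossover size; replace with a measured value. */
#define RENDEZVOUS_MIN_BYTES ((size_t)4096)

/* Pick a send mechanism from the message size and from whether the
 * remote region is already known to be pinned. */
static send_mech_t choose_mechanism(size_t nbytes, int remote_is_pinned) {
    assert(nbytes > 0);
    if (remote_is_pinned)
        return SEND_RDMA;            /* no handshake needed */
    if (nbytes < RENDEZVOUS_MIN_BYTES)
        return SEND_AM;              /* handshake would cost more than it saves */
    return SEND_RENDEZVOUS;          /* large enough to amortize the handshake */
}
```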

GASNet Puts/Gets

Puts non-bulk

gasnet_put_nb(), gasnet_put_nbi() non-blocking and gasnet_put() blocking calls.
Typically used for small aligned lengths, in the range of 1 byte to 512 bytes with a special emphasis on powers of two.
Performance critical for GAS languages (low latency is a must!).
Source memory for a put can be changed once the function returns. Unless the underlying network guarantees that the source region can be reused once the put call returns, the conduit must make a copy of the data to send (possibly into preallocated bounce buffers). Such a guarantee may sometimes hold for small sizes, where the source data is sent to the NIC using PIO. In contrast, when DMA is used, the memory transfer can happen after the function returns, so the data must be copied and sent from a bounce buffer.
- See comments below on memory copying
Pinning networks: If the network only supports sending data from pinned memory, the conduit may have to allocate a set of prepinned bounce buffers. Conduit writers should examine the cost of pinning remote memory for small messages: while the performance benefit of using a rendezvous approach to pin memory is apparent for large messages, this is likely not the case for small messages.
Firehose clients: Following temporal and spatial locality principles, firehose clients are encouraged to use firehose_remote_pin() as much as possible unless there is a good reason for not doing so, in which case firehose_try_remote_pin() and firehose_try_local_pin() may be preferred. One possible good reason is that the underlying network can send very small puts without pinning.
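
The bounce-buffer requirement for non-bulk puts can be sketched as follows. The pool layout and the names bounce_pool_t, bounce_stage() and bounce_release() are hypothetical; a real conduit would allocate the pool from prepinned memory and hand the staged buffer to the network's DMA engine, which is left out here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BOUNCE_BUF_SIZE  512  /* upper end of the small-put range above */
#define BOUNCE_BUF_COUNT 4    /* hypothetical pool depth */

typedef struct {
    unsigned char data[BOUNCE_BUF_COUNT][BOUNCE_BUF_SIZE];
    int in_use[BOUNCE_BUF_COUNT];
} bounce_pool_t;

/* Copy src into a free bounce buffer and return its index, or -1 if the
 * pool is exhausted (the caller would poll for completions and retry).
 * Once this returns >= 0, the caller may overwrite src freely. */
static int bounce_stage(bounce_pool_t *p, const void *src, size_t nbytes) {
    assert(nbytes <= BOUNCE_BUF_SIZE);
    for (int i = 0; i < BOUNCE_BUF_COUNT; i++) {
        if (!p->in_use[i]) {
            memcpy(p->data[i], src, nbytes);
            p->in_use[i] = 1;
            return i;
        }
    }
    return -1;
}

/* Return a buffer to the pool once the (simulated) DMA has completed. */
static void bounce_release(bounce_pool_t *p, int idx) {
    assert(idx >= 0 && idx < BOUNCE_BUF_COUNT && p->in_use[idx]);
    p->in_use[idx] = 0;
}
```

The extra memcpy() is the price of the non-bulk semantics: the staged copy stays intact even if the caller modifies the source immediately after the put call returns.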

Gets non-bulk

gasnet_get_nb(), gasnet_get_nbi() non-blocking and gasnet_get() blocking calls.
Typically used for small aligned lengths, in the range of 1 byte to 512 bytes with a special emphasis on powers of two.
Performance critical for GAS languages (low latency is a must!).
Pinning networks: Comments are deferred to firehose clients.
Firehose clients: GASNet clients may make use of stack variables as the destination for gets, making the stack a good candidate for a pinning optimization. Following the same locality principles described for non-bulk puts, gets are encouraged to make use of firehose_remote_pin() and firehose_local_pin(). However, if the network supports ordered delivery of DMA with respect to core Active Messages, a firehose_remote_callback may be run on the remote node once the remote node has pinned the requested memory and before the firehose reply is sent. This allows the remote node to send an RDMA put to the requester of the get operation.

Puts/Gets bulk

gasnet_put_nb_bulk(), gasnet_put_nbi_bulk() non-blocking and gasnet_put_bulk() blocking put calls.
gasnet_get_nb_bulk(), gasnet_get_nbi_bulk() non-blocking and gasnet_get_bulk() blocking get calls.
Typically used for large unaligned sizes, over 1K.
Source memory for a put must not be changed after the function returns, until the operation has completed. This implies that the conduit can defer the memory copy to the DMA engine and need not make a copy of the source data.
Pinning networks: RDMA should be used, even when the overhead of pinning the source and destination is included.
Firehose clients: Encouraged to call both firehose_local_pin() and firehose_remote_pin() in order to expose one-sided zero-copy operations in the common case.
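
Because pinning works on whole pages, an unaligned bulk region has to be widened to page boundaries before it can be pinned. The helper below is a sketch of that arithmetic only; pin_range_t, pin_range_for() and the 4096-byte page size are assumptions, and the actual pin call is network-specific and omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed page size; a real conduit would query the system. */
#define PAGE_SIZE_BYTES ((uintptr_t)4096)

typedef struct {
    uintptr_t start;   /* page-aligned start address */
    size_t    length;  /* multiple of the page size */
} pin_range_t;

/* Smallest page-aligned range covering [addr, addr + nbytes). */
static pin_range_t pin_range_for(uintptr_t addr, size_t nbytes) {
    assert(nbytes > 0);
    uintptr_t mask  = ~(PAGE_SIZE_BYTES - 1);
    uintptr_t first = addr & mask;                                  /* round down */
    uintptr_t end   = (addr + nbytes + PAGE_SIZE_BYTES - 1) & mask; /* round up */
    pin_range_t r = { first, (size_t)(end - first) };
    return r;
}
```

For example, a 5000-byte region starting 100 bytes into a page spans two pages, so 8192 bytes would be pinned under these assumptions.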

Memory copying

1 to word size: If the memory is aligned, a store may be sufficient. If not, see below.
word size - ...: For larger sizes, various approaches may be possible.
- Standard libc memcpy() (or bcopy()) is sufficient and may be the only alternative on some systems.
- Architecture-level optimizations of memcpy() may be available and may provide better bandwidth, such as MMX-aware memcpy() used in video-playback software (this is currently absent from GASNet).
- Some networks may be on high-bandwidth buses that allow copies to be carried out over the adapter. If RDMA is involved, this approach has the advantage of not dirtying the CPU cache although it may be difficult to guarantee write completion ordering (which memcpy() implicitly provides upon return).
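
The small-size case can be sketched with a size-dispatched copy. The function name small_copy() is hypothetical. Instead of casting to wider integer types and storing directly, this sketch uses memcpy() with a constant size, which compilers commonly lower to a single load/store pair where the target allows it, and which avoids making assumptions about alignment or aliasing in the source code itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy nbytes from src to dst, special-casing the power-of-two sizes
 * up to 8 bytes so the compiler sees a constant length. */
static void small_copy(void *dst, const void *src, size_t nbytes) {
    assert(dst != NULL && src != NULL);
    switch (nbytes) {
    case 1:  memcpy(dst, src, 1); break;
    case 2:  memcpy(dst, src, 2); break;
    case 4:  memcpy(dst, src, 4); break;
    case 8:  memcpy(dst, src, 8); break;
    default: memcpy(dst, src, nbytes); break;  /* general libc path */
    }
}
```

Whether this beats a plain memcpy() call depends on the compiler and architecture, so it is worth measuring before adopting it in a conduit.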
