GASNet Extended API Notes for Conduit Writers
The GASNet Extended API is a means of leveraging high-level (and eventually
collective) operations from specialized network hardware. This document
presents an overview of the requirements that GASNet puts and gets impose
on networks and attempts to disambiguate the many flavours of puts/gets
currently in the interface. The remainder of this document assumes familiarity
with the following:
- GASNet spec (specifically the sections on the GASNet segment and GASNet
puts, and of course familiarity with the GASNet core)
- Network-specific API documentation (LAPI, GM, Elan, SCI, ...)
- Pinning networks should consider using the firehose interface
(link to be posted...)
The most important aspect of GASNet puts and gets, which unfortunately exposes
restrictions in much of the network hardware available today, is the
ability of the network to complete one-sided operations without requiring
programmers to set up memory for RDMA at runtime. For this reason,
the following distinction between networks is necessary:
- Hardware assisted RDMA. Puts and gets for any reasonable size can be
completed without interrupting the remote host processor. This approach
requires assistance from the NIC to handle page faults and replicate some
of the data structures present in the host's Memory Management Unit.
These networks are typically more expensive but are preferred for Global
Address Space languages.
- Explicit Pinning RDMA. The network-level programmer must explicitly set up
virtual memory regions that are pinned (locked from being paged out).
For other programming models, pinning small portions of memory may be a
sufficient approach. Since GAS languages typically read and write to
remote memory regions that may be larger than the amount of physical
memory, explicit pinning must be handled carefully -- mainly because of
the limits on pinning all of physical memory, but also
because of the implications of interrupting remote host processors.
The ideal GASNet put/get operation should exhibit the following
characteristics:
- Non-blocking and zero-copy: primitives that should be present in any
high-performance communication system.
- Fully one-sided, in that the target host processor is not interrupted
and no mechanism requires the programmer to explicitly pin areas of virtual
memory at runtime.
- Expose the entire Virtual Address Space, so puts/gets can be
completed without explicit pinning, or any ad-hoc mechanism to be managed
by the conduit programmer.
- Low software overhead, meaning the combined costs of the runtime
system, the communications layer and the network layer should be at a
minimum when issuing communication calls -- especially small operations.
- Low latency, high bandwidth: while high bandwidth is the dominant concern
in many bulk synchronous messaging paradigms, small message latency is just
as important for GASNet.
- Efficient completion mechanism, such that completions for remote
operations are both expressive and low-cost in software overhead in order
to sync non-blocking and implicit non-blocking GASNet operations.
- Support for RDMA puts/gets of any size and alignment, without the need
for read-modify-write mechanisms and without other consistency problems.
Since one-sided operation over pinning networks cannot possibly be guaranteed
for every put/get, the alternatives for sending puts and gets to a remote node
are the following:
- Rendezvous: A request to pin a remote region is sent to the target node.
Once the initiator receives the reply confirming that the remote region
is pinned, the one-sided operation can be initiated.
- Pros: Resulting one-sided operation has good large message bandwidth
and doesn't interrupt remote host processor
- Cons: The rendezvous request does interrupt the remote host
processor, and each put effectively requires two network
roundtrips (three if the remote region must be unpinned).
- Active Messages: The GASNet core provides AM messaging capability which
is sufficiently general for puts/gets, although its use is discouraged for
efficiency reasons.
- Pros: Easy to use, general enough for functional operation
- Cons: Not one-sided, not zero-copy (any RDMA used for AM Long Asyncs
still requires an ad hoc mechanism for pinning remote memory --
see Firehose).
- Firehose: Firehose clients have the ability to expose one-sided, zero-copy
communication as a common case, while minimizing the number of host-level
synchronizations (similar to those of the rendezvous approach). Additionally,
the cost of synchronization and pinning is amortized over multiple remote
memory operations.
- Pros: One-sided zero-copy in the common case, low synchronization
overhead and a flexible API.
- Cons: Something else to implement for a conduit writer.
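As a rough illustration of the trade-off above, the following sketch counts network roundtrips for a sequence of puts. The rendezvous figures (two per put, three with an unpin) come from the list above; the firehose model (one roundtrip per put plus one extra pin request per miss) is only an assumption made for this example, and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Rendezvous: every put pays the pin request/reply plus the RDMA put itself,
 * and one more roundtrip if the remote region must be unpinned afterwards. */
size_t rendezvous_roundtrips(size_t nputs, int must_unpin)
{
    return nputs * (must_unpin ? 3u : 2u);
}

/* Assumed firehose model: each put costs one roundtrip for the RDMA
 * operation; only puts that miss an already-pinned region pay one
 * additional roundtrip to pin the remote memory. */
size_t firehose_roundtrips(size_t nputs, size_t nmisses)
{
    assert(nmisses <= nputs);
    return nputs + nmisses;
}
```

Under these assumptions, 100 puts with 5 pinning misses cost 105 roundtrips, against 200 for rendezvous without unpinning.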
If the network supports more than one send mechanism, it is always useful to
know what performance can be expected from each mechanism (possibly as a
function of message size).
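One way a conduit might act on such measurements is a size-based dispatch. The sketch below is hypothetical: the mechanism names, the threshold structure and any cutoff values are assumptions for illustration and would have to come from benchmarks on the actual network.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical send mechanisms a conduit could choose between. */
typedef enum {
    MECH_ACTIVE_MESSAGE,   /* payload carried inside an AM */
    MECH_BOUNCE_BUFFER,    /* copy into a prepinned buffer, then RDMA */
    MECH_PINNED_RDMA       /* pin source/destination, zero-copy RDMA */
} send_mechanism_t;

/* Crossover points, assumed to be measured per network. */
typedef struct {
    size_t am_max;         /* largest size for which AM is fastest */
    size_t bounce_max;     /* largest size for which copying still wins */
} send_thresholds_t;

send_mechanism_t choose_send_mechanism(const send_thresholds_t *t, size_t nbytes)
{
    assert(t->am_max <= t->bounce_max);
    if (nbytes <= t->am_max)
        return MECH_ACTIVE_MESSAGE;
    if (nbytes <= t->bounce_max)
        return MECH_BOUNCE_BUFFER;
    return MECH_PINNED_RDMA;
}
```

Keeping the thresholds in a structure rather than in constants lets them be tuned per network, or even measured at startup.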
GASNet Puts/Gets
- gasnet_put_nb(), gasnet_put_nbi() non-blocking and gasnet_put() blocking put calls.
- Typically used for small aligned lengths, in the range of 1 byte to 512
bytes with a special emphasis on powers of two.
- Performance critical for GAS languages (low latency is a must!).
- The source memory for a put may be modified as soon as the function returns.
Unless the underlying network guarantees that the source region of a put
can be reused once the call returns, the conduit must copy the data to be
sent (possibly into preallocated bounce buffers). The network may provide
this guarantee for small sizes, where the source data is sent to the NIC
using PIO. In contrast, when DMA is used, the memory transfer can happen after
the function returns, so the data must be copied to and sent from
a bounce buffer.
- See comments below on memory copying
- Pinning networks: If the network only supports sending data from pinned
memory, the conduit may have to allocate a set of prepinned bounce buffers.
Conduit writers should examine the cost of pinning remote memory for small
messages: while the performance benefit of using a rendezvous approach to
pin memory is apparent for large messages, that is likely not the case for
small messages.
- Firehose clients: Following temporal and spatial locality principles,
firehose clients are encouraged to use firehose_remote_pin() as much as
possible unless there is a good reason for not doing so, in which case
firehose_try_remote_pin() and firehose_try_local_pin() may be preferred.
One possible good reason is that the underlying network can send very
small puts without pinning.
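The bounce-buffer approach for non-bulk puts described above can be sketched as follows. This is a minimal illustration only: the pool layout, the sizes and the function names are hypothetical, and a real conduit would allocate the buffers from prepinned memory and release them from the network's completion path.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BOUNCE_COUNT 4     /* hypothetical number of preallocated buffers */
#define BOUNCE_SIZE  512   /* matches the small-put range discussed above */

typedef struct {
    unsigned char data[BOUNCE_COUNT][BOUNCE_SIZE];
    int in_use[BOUNCE_COUNT];
} bounce_pool_t;

/* Copy the put source into a free bounce buffer so that the caller may
 * modify its own memory as soon as the put call returns. Returns the buffer
 * index, or -1 if every buffer is busy (the caller would then have to poll
 * for completions before retrying). */
int bounce_acquire(bounce_pool_t *pool, const void *src, size_t nbytes)
{
    assert(nbytes <= BOUNCE_SIZE);
    for (int i = 0; i < BOUNCE_COUNT; i++) {
        if (!pool->in_use[i]) {
            pool->in_use[i] = 1;
            memcpy(pool->data[i], src, nbytes);
            return i;
        }
    }
    return -1;
}

/* Called once the network reports that the DMA from buffer idx is done. */
void bounce_release(bounce_pool_t *pool, int idx)
{
    assert(idx >= 0 && idx < BOUNCE_COUNT && pool->in_use[idx]);
    pool->in_use[idx] = 0;
}
```

The DMA would be issued from pool->data[idx] rather than from the caller's source, which is what allows the caller to overwrite the source immediately.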
- gasnet_get_nb(), gasnet_get_nbi() non-blocking and gasnet_get() blocking get calls.
- Typically used for small aligned lengths, in the range of 1 byte to 512
bytes with a special emphasis on powers of two.
- Performance critical for GAS languages (low latency is a must!).
- Pinning networks: See the comments for firehose clients.
- Firehose clients: GASNet clients may use stack variables as the
destination for gets, making the stack a good candidate for a pinning
optimization. Following the same locality principles explained for non-bulk
puts, gets are encouraged to use firehose_remote_pin() and
firehose_local_pin(). However, if the network supports ordered delivery
of DMA with respect to core Active Messages, a firehose_remote_callback
may be run on the remote node once the remote node has pinned the
requested memory and before the firehose reply is sent. This allows an
RDMA put to be sent to the requester of the get operation.
- gasnet_put_nb_bulk(), gasnet_put_nbi_bulk() non-blocking and gasnet_put_bulk() blocking put calls.
- gasnet_get_nb_bulk(), gasnet_get_nbi_bulk() non-blocking and gasnet_get_bulk() blocking get calls.
- Typically used for large unaligned sizes, over 1K.
- The source memory for a put must not be modified once the function returns,
until the operation has completed. This implies that the conduit can defer
the memory copy to the DMA engine and need not make a copy of the source data.
- Pinning networks: RDMA operations, including the overhead of pinning
source and destination, should be used.
- Firehose clients: Encouraged to call both firehose_local_pin() and
firehose_remote_pin() in order to expose one-sided zero-copy operations in
the common case.
Memory copying
- 1 byte to word size: If the memory is aligned, a store may be sufficient.
If not, see below.
- Larger than word size: Various approaches may be possible.
- Standard libc memcpy() (or bcopy()) is sufficient and may be the only
alternative on some systems.
- Architecture-level optimizations of memcpy() may be available and may
provide better bandwidth, such as the MMX-aware memcpy() used in
video-playback software (this is currently absent from GASNet).
- Some networks may be on high-bandwidth buses that allow copies to be
carried out over the adapter. If RDMA is involved, this approach has
the advantage of not dirtying the CPU cache, although it may be
difficult to guarantee write completion ordering (which memcpy()
implicitly provides upon return).
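The first two cases above can be sketched as a single copy routine: an aligned store for power-of-two sizes up to 8 bytes, and memcpy() otherwise. The function name local_copy is hypothetical, and the 8-byte upper bound is an assumption standing in for the word size of the target machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy nbytes from src to dst, using a single store when the size is
 * 1, 2, 4 or 8 bytes and both pointers are naturally aligned.
 * The casts assume the memory really is accessed as these integer types;
 * otherwise memcpy() is the safe choice. */
void local_copy(void *dst, const void *src, size_t nbytes)
{
    assert(dst != NULL && src != NULL);
    uintptr_t bits = (uintptr_t)dst | (uintptr_t)src;

    if (nbytes != 0 && bits % nbytes == 0) {
        switch (nbytes) {
        case 1: *(uint8_t  *)dst = *(const uint8_t  *)src; return;
        case 2: *(uint16_t *)dst = *(const uint16_t *)src; return;
        case 4: *(uint32_t *)dst = *(const uint32_t *)src; return;
        case 8: *(uint64_t *)dst = *(const uint64_t *)src; return;
        default: break;   /* other sizes fall through to memcpy() */
        }
    }
    memcpy(dst, src, nbytes);   /* unaligned or larger copies */
}
```

On many compilers a fixed-size memcpy() is already lowered to a single store, so whether the explicit fast path helps is something to measure rather than assume.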