GASNet extended-ref documentation
Dan Bonachea <bonachea@cs.berkeley.edu>

User Information:
-----------------

The extended-ref directory is not intended for direct use by end users.
It contains common code used to implement parts of the GASNet Extended 
API within many conduits.

Recognized environment variables:
---------------------------------

* All the standard GASNet environment variables (see top-level README)

Optional compile-time settings:
------------------------------

* GASNETE_GETPUT_MEDIUM_LONG_THRESHOLD - size threshold where gets/puts stop
 using medium messages and start using longs (if/when enabled, see below)

* GASNETE_USE_LONG_GETS - true if we should try to use Long replies in gets
 (only possible if dest falls in segment)

* GASNETE_USE_LONG_PUTS - true if we should try to use Long requests in puts

* GASNETI_THREADINFO_OPT - optimize thread discovery using hidden local variable
 
* GASNETI_LAZY_BEGINFUNCTION - postpone thread discovery to first use

Known problems:
---------------

* See the GASNet Bugzilla server for details on known bugs:
  https://gasnet-bugs.lbl.gov/

Future work:
------------

===============================================================================

Design Overview:
----------------

The extended-ref directory provides a reference implementation of the extended
API, written solely in terms of the GASNet core API. This provides an easy
porting path for new networks - once the core API is implemented, the
extended-ref files provide working implementations of everything else in
GASNet. Conduits for networks with good hardware support can subsequently
customize and optimize selected functions from the extended API using hardware
support.

gex_Event_t is a pointer to a gasnet_op_t, which can be either an "eop"
type (handle for explicit non-blocking op) or an "iop" type (handle for an
implicit non-blocking op or access region). eops are allocated from a
thread-specific pool to reduce allocation and locking costs, and basically just
contain a flag block to track completion state. iops contain initiation and
completion counters for puts and gets that are incremented atomically as
operations complete. iops are also kept in a free list once freed to avoid
reallocation costs (only created/freed for access regions). Each thread has an
iop which is active for any new iop operations. 

Puts and gets are implemented on AM as follows:

gasnet_put(_bulk) is translated to a gasnete_put_nb(_bulk) + sync
gasnet_get(_bulk) is translated to a gasnete_get_nb(_bulk) + sync

gasnete_put_nb(_bulk) translates to
   if nbytes < GASNETE_GETPUT_MEDIUM_LONG_THRESHOLD
     AMMedium(payload) 
   else if nbytes < AMMaxLongRequest
     AMLongRequest(payload) 
   else
     gasnete_put_nbi(_bulk)(payload) 

gasnete_get_nb(_bulk) translates to
   if nbytes < GASNETE_GETPUT_MEDIUM_LONG_THRESHOLD
     AMSmall request + AMMedium(payload) reply
   else
     gasnete_get_nbi(_bulk)() 

gasnete_put_nbi(_bulk) translates to
   if nbytes < GASNETE_GETPUT_MEDIUM_LONG_THRESHOLD
     AMMedium(payload) 
   else if nbytes < AMMaxLongRequest
     AMLongRequest(payload) 
   else
     chunks of AMMaxLongRequest with AMLongRequest() 
     AMLongRequestAsync is used instead of AMLongRequest for put_bulk

gasnete_get_nbi(_bulk) translates to
   if nbytes < GASNETE_GETPUT_MEDIUM_LONG_THRESHOLD
     AMSmall request + AMMedium(payload) reply
   else
     chunks of AMMaxMedium with AMSmall request + AMMedium() reply

The current implementation uses AMLongs for large puts because the 
destination is guaranteed to fall within the registered GASNet segment.
The spec allows gets to be received anywhere into the virtual memory space,
so we can only use AMLong when the destination happens to fall within the 
segment - GASNETE_USE_LONG_GETS indicates whether or not we should try to do
this.  (conduits which can support AMLongs to areas outside the segment
could improve on this through the use of this conduit-specific information).


Barrier:

There are two implementations available to choose between at runtime via
the GASNET_BARRIER environment variable.  The compile-time constant
GASNETE_BARRIER_DEFAULT can be used to set the default barrier.

 "AMCENTRAL" - a silly, centralized barrier implementation:
  everybody sends notifies to a single node, where we count them up
  central node eventually notices the barrier is complete (probably
  when it calls wait) and then it broadcasts the completion to all the nodes
 The main problem is the need for the master to call wait before the barrier can
  make progress - we really need a way for the "last thread" to notify all 
  the threads when completion is detected, but AM semantics don't provide a 
  simple way to do this.
 The centralized nature also makes it non-scalable - we really want to use 
  a tree-based barrier or pairwise exchange algorithm for scalability
  (but these impose even greater potential delays due to the lack of
  attentiveness to barrier progress)

 "AMDISSEM" - an AM-based Dissemination barrier implementation:
  With N nodes, the barrier takes ceil(lg(N)) steps (lg = log-base-2).
 At step i (i=0..):
  node n first sends to node ((n + 2^i) mod N)
  then node n waits to receive (from node ((n + N - 2^i) mod N))
  once we receive for step i, we can move the step i+1 (or finish)
 The distributed nature makes this barrier more scalable than a centralized
  barrier, but also more sensitive to any lack of attentiveness to the
  network.
 We use a static allocation, limiting us to 2^GASNETE_AMBARRIER_MAXSTEP nodes.
 Algorithm is described in section 3.3 of
  John M. Mellor-Crummey and Michael L. Scott. "Algorithms for scalable
  synchronization on shared-memory multiprocessors." ACM ToCS, 9(1):21 65, 1991.