Overview
========

RDMA support for the LAPI conduit is currently implemented in #ifdef
regions that provide the required functionality.  RDMA availability is
detected automatically at configure time.

Setup (mostly in gasnet_core.c:gasnetc_attach())
================================================

In this routine we take the regular segment (SEGMENT_FAST only)
and split it up into a number of 16MB regions, each with its
own PVO (LAPI_Util with LAPI_XLATE_ADDRESS).  These PVOs are
then exchanged (loop of LAPI_Address_init64() calls).  In addition,
lapi_remote_ctxt_t objects are obtained (LAPI_Util with
LAPI_RDMA_ACQUIRE).  At this point a node has enough information to
identify and manipulate data in the shared segment on any other node.
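
The region arithmetic implied by this setup can be sketched as follows.
This is a minimal illustration, not the conduit's code: it assumes that
regions are laid out back to back starting at the segment base, and the
names PVO_REGION_SIZE, pvo_region_count(), pvo_region_index() and
pvo_bytes_to_boundary() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Region size taken from the text: one PVO covers 16MB of the segment. */
#define PVO_REGION_SIZE ((size_t)16 * 1024 * 1024)

/* Number of regions (and hence PVOs) needed to cover a segment.
 * The last region may be only partially used. */
static size_t pvo_region_count(size_t segment_size)
{
    /* Guard the round-up below against overflow. */
    assert(segment_size <= SIZE_MAX - (PVO_REGION_SIZE - 1));
    return (segment_size + PVO_REGION_SIZE - 1) / PVO_REGION_SIZE;
}

/* Index of the region containing the byte at `offset` from the
 * segment base (assumption: region 0 starts at the base). */
static size_t pvo_region_index(size_t offset)
{
    return offset / PVO_REGION_SIZE;
}

/* Bytes from `offset` up to the end of the region that contains it. */
static size_t pvo_bytes_to_boundary(size_t offset)
{
    return PVO_REGION_SIZE - (offset % PVO_REGION_SIZE);
}
```

With these helpers, a remote address is described by the PVO at
pvo_region_index(offset) plus the offset within that region.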


GET and PUT Operations
======================

Shared segment to shared segment
--------------------------------

The only issue here is decomposing the transfer so that no RDMA
operation crosses a PVO boundary on either side.  The strategy taken
here is to loop over source regions, splitting the part of the transfer
in each source region into one or two operations, depending on whether
that part crosses a PVO boundary at the target.
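
The decomposition described above might look like the sketch below.
It is an illustration under stated assumptions rather than the
conduit's implementation: offsets are relative to each segment's base,
regions are 16MB and start at the base, and rdma_piece_t,
bytes_to_boundary() and split_transfer() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define PVO_REGION_SIZE ((size_t)16 * 1024 * 1024)

/* One RDMA operation confined to a single PVO on each side. */
typedef struct {
    size_t src_off;  /* offset into the source segment */
    size_t dst_off;  /* offset into the target segment */
    size_t len;      /* bytes moved by this operation  */
} rdma_piece_t;

/* Bytes from `off` up to the end of the region that contains it. */
static size_t bytes_to_boundary(size_t off)
{
    return PVO_REGION_SIZE - (off % PVO_REGION_SIZE);
}

/* Split a transfer of `len` bytes into pieces that stay inside one
 * region at both ends.  Returns the number of pieces written, or
 * (size_t)-1 if `max_pieces` is too small. */
static size_t split_transfer(size_t src_off, size_t dst_off, size_t len,
                             rdma_piece_t *pieces, size_t max_pieces)
{
    size_t n = 0;

    assert(pieces != NULL || max_pieces == 0);
    while (len > 0) {
        /* Part of the transfer inside the current source region. */
        size_t src_chunk = bytes_to_boundary(src_off);
        if (src_chunk > len) src_chunk = len;

        /* That part is at most one region long, so it touches at
         * most two target regions: this loop runs once or twice. */
        while (src_chunk > 0) {
            size_t piece = bytes_to_boundary(dst_off);
            if (piece > src_chunk) piece = src_chunk;
            if (n == max_pieces) return (size_t)-1;

            pieces[n].src_off = src_off;
            pieces[n].dst_off = dst_off;
            pieces[n].len = piece;
            n++;

            src_off += piece;
            dst_off += piece;
            src_chunk -= piece;
            len -= piece;
        }
    }
    return n;
}
```

Each resulting piece would then be issued as one RDMA operation using
the PVO of its source region and the PVO of its target region.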


Out of shared segment to shared segment
---------------------------------------

Bounce buffers are used for transfers shorter than
gasnete_pin_threshold.  For longer transfers, one of two schemes
is used, depending on whether firehose is enabled:
1) Without firehose, the regions are lined up, i.e. the source PVO
boundaries are matched to target PVO boundaries.  The source regions
are then pinned using this decomposition and the LAPI_Xfer() calls
are made.
2) With firehose, local firehose is used to pin the source region.

For SEGMENT_EVERYTHING, both local and remote firehose are used.
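
For the SEGMENT_FAST case, the choice among the three paths above can
be summarized by the following sketch.  The enum and the function
choose_path() are hypothetical names introduced for illustration; the
conduit's own control flow may be organized differently.

```c
#include <stddef.h>

/* How an out-of-segment source is made available to the adapter. */
typedef enum {
    XFER_BOUNCE_BUFFER,   /* copy through a pre-pinned network buffer */
    XFER_PIN_ALIGNED,     /* pin source regions matched to target PVOs */
    XFER_LOCAL_FIREHOSE   /* pin the source through local firehose */
} xfer_path_t;

/* Select a path from the transfer length, the pin threshold
 * (gasnete_pin_threshold in the text) and the firehose setting. */
static xfer_path_t choose_path(size_t len, size_t pin_threshold,
                               int use_firehose)
{
    if (len < pin_threshold)
        return XFER_BOUNCE_BUFFER;
    return use_firehose ? XFER_LOCAL_FIREHOSE : XFER_PIN_ALIGNED;
}
```

A transfer exactly at the threshold takes a pinning path here, since
the text reserves bounce buffers for transfers shorter than the
threshold.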

Completion
----------

Here the origin counter is sufficient.  Upon completion, all
out-of-segment source registrations are freed (if firehose isn't used).

Managing GASNet Handles
=======================

Only the origin counter and the list of locally pinned regions are
needed.  In the non-firehose case, the send completion handler unpins
whatever had to be pinned for the transfer.


Managing Network Buffers
========================

For short transfers we use pre-pinned network buffers.  Each is
gasnete_pin_threshold bytes in size, and there are gasnete_num_nb of
them.  They are managed with a free list (lock-free on PPC).
Initiators get an entry from the free list, blocking until one is
available, and use the buffer for the transfer.  Upon completion of a
transfer, the entry is placed back on the free list.
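
The get/put cycle on the free list can be sketched as below.  This is
a deliberately simplified, single-threaded illustration: it omits the
lock-free PPC implementation and the blocking behavior (an empty list
returns NULL instead), and net_buf_t, free_list_t and the three
functions are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* One pre-pinned network buffer. */
typedef struct net_buf {
    struct net_buf *next;  /* link, meaningful only while on the list */
    void *data;            /* pinned storage (not allocated here) */
} net_buf_t;

/* LIFO free list of available buffers. */
typedef struct {
    net_buf_t *head;
} free_list_t;

/* Return a buffer to the list once its transfer has completed. */
static void free_list_put(free_list_t *fl, net_buf_t *buf)
{
    assert(fl != NULL && buf != NULL);
    buf->next = fl->head;
    fl->head = buf;
}

/* Take a buffer for a new transfer; NULL means none is available
 * (the conduit blocks at this point instead, per the text). */
static net_buf_t *free_list_get(free_list_t *fl)
{
    net_buf_t *buf;

    assert(fl != NULL);
    buf = fl->head;
    if (buf != NULL) {
        fl->head = buf->next;
        buf->next = NULL;
    }
    return buf;
}

/* Place all `count` buffers of `bufs` on an initially empty list. */
static void free_list_init(free_list_t *fl, net_buf_t *bufs, size_t count)
{
    size_t i;

    assert(fl != NULL);
    fl->head = NULL;
    for (i = 0; i < count; i++)
        free_list_put(fl, &bufs[i]);
}
```

An initiator would call free_list_get() before a short transfer and
the completion path would call free_list_put() afterwards.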

Tuning parameters
=================

GASNET_LAPI_RCTXTS_PER_NODE - INTEGER (default: 4)
A (small) power of 2 specifying how many remote contexts to open up to
each peer.

GASNET_LAPI_USE_RDMA - BOOLEAN (default: yes if available)
Whether or not to use RDMA with this conduit.  If RDMA is detected
at configure time, it is enabled by default.  Note that the job must
have the required RDMA resource allocated (see "Examples", below) or
else the conduit will degrade to the older conduit code and print a
warning.

GASNET_LAPI_PIN_THRESHOLD - INTEGER (default: 4096)
The message size (in bytes) at which get and put operations cross over
from using network buffers to dynamic pinning.  Must be <= 16384.
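
The default and the upper bound above can be expressed as a small
validation routine.  This sketch is not how the conduit reads its
environment; parse_pin_threshold() and the two macros are hypothetical,
and rejecting an out-of-range value (rather than clamping it) is an
assumption.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

#define PIN_THRESHOLD_DEFAULT 4096UL   /* documented default */
#define PIN_THRESHOLD_MAX     16384UL  /* documented upper bound */

/* Parse the text of GASNET_LAPI_PIN_THRESHOLD.  A NULL or empty
 * string selects the default.  Returns 0 on success and stores the
 * value in *out; returns -1 if the text is not a decimal integer or
 * exceeds the maximum, leaving *out untouched. */
static int parse_pin_threshold(const char *text, size_t *out)
{
    char *end;
    unsigned long value;

    if (text == NULL || *text == '\0') {
        *out = PIN_THRESHOLD_DEFAULT;
        return 0;
    }
    errno = 0;
    value = strtoul(text, &end, 10);
    if (errno != 0 || end == text || *end != '\0')
        return -1;  /* overflow, no digits, or trailing characters */
    if (value > PIN_THRESHOLD_MAX)
        return -1;
    *out = (size_t)value;
    return 0;
}
```

A caller might pass getenv("GASNET_LAPI_PIN_THRESHOLD") as the first
argument and fall back to the default when the function returns -1.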

GASNET_LAPI_USE_FIREHOSE - BOOLEAN (default: yes)
Whether or not to use the firehose dynamic memory registration scheme for
SEGMENT_FAST.  Firehose is unconditionally enabled for SEGMENT_EVERYTHING.
 
Examples
========

When running jobs via POE or LoadLeveler (LL), certain additional
arguments/settings are required to allocate RDMA resources to the job.
Otherwise, the lapi-rdma support is disabled automatically and a
warning is printed.

For POE the following command line options are recommended, in addition to
any others you may normally require:
  "-rdma_count 2"
     This is required to allocate RDMA resources.  Larger values are possible,
     but we've yet to determine what value is optimal.
  "-msg_api LAPI" or "-msg_api LAPI,MPI"
     One of these is required for lapi-conduit, regardless of whether RDMA
     support is used.  Use "-msg_api LAPI" when a job uses only GASNet, or use
     "-msg_api LAPI,MPI" if the job uses both GASNet and MPI.

For LoadLeveler job scripts, the following line is required, in addition to
any others you may normally require:
  #@ network.LAPI = sn_all,not_shared,us,,,rcxtblocks=2
You should modify the first three fields of this example to match what is
normally used at your site.  However, the ",,,rcxtblocks=2" should not be
omitted.  As with the POE example, values larger than 2 may also be used
for rcxtblocks.
If using both GASNet and MPI in a given LL job, use this line in ADDITION to
any "#@ network.MPI = ..." line you may normally use.

Additional information may be found in IBM's POE and LoadLeveler documentation
or in the online documentation at some computing centers with IBM SP machines.

Known Bugs
==========

