GASNet inter-Process SHared Memory (PSHM) design
---------------------------------------------

Document by:
    Dan Bonachea <bonachea@cs.berkeley.edu>
    Paul H. Hargrove <PHHargrove@lbl.gov>
    Filip Blagojevic <FBlagojevic@lbl.gov>
Implementation by:
    Jason Duell
    Filip Blagojevic <FBlagojevic@lbl.gov>
    Paul H. Hargrove <PHHargrove@lbl.gov>

Goal:
----
Provide GASNet with a mechanism to communicate through shared memory among
processes on the same compute node.  This is expected to be more robust than
pthreads (which greatly complicates the Berkeley UPC runtime, and prevents
linking to any numeric libraries that that are not thread-safe).  It is also
expected to display lower latency than use of a network API's loopback
capabilities (though the network hardware might provide other benefits such
as asynchronous bulk memory copy w/o cache pollution).

We appreciate your feedback related to PSHM (both positive and
negative) and would be happy to work with you to improve PSHM.

To use:
------
In the current release, GASNet's PSHM support is enabled by default only on
Linux.  On all other platforms, one must pass --enable-pshm if PSHM support
is desired.  On Linux PSHM can be disabled by passing --disable-pshm at
configure time.

The PSHM support in GASNet can operate via three generic mechanisms: POSIX
shared memory, SystemV shared memory, or mmap()ed disk files.  When PSHM is
enabled, the default is for the configure step to probe only for support
via POSIX shared memory (except on MacOS, where it is known to be broken).
If no POSIX shared memory support if found, there is no automatic fallback
to any other mechanism.  So, if one wishes to use SystemV shared memory or
mmap()ed files, one should explicitly disable the POSIX support and enable
the desired mechanism:

	Usage Summary (flags to be passed to the configure script):
	----------------------------------------------------------
          OFF: --disable-pshm
        POSIX: --enable-pshm
         SYSV: --enable-pshm --disable-pshm-posix --enable-pshm-sysv
         FILE: --enable-pshm --disable-pshm-posix --enable-pshm-file

        On Linux "--enable-pshm" is the default.
        On all other platforms "--disable-pshm" is the default.

There are also implementations which are specifc to certain platforms:

For Cray XE, XK and XC systems, PSHM-over-XPMEM is enabled automatically
by the corresponding cross-configure scripts.  However, one can also
optionally enable PSHM-over-XPMEM on an SGI Altix as follows:
        XPMEM: --enable-pshm --disable-pshm-posix --enable-pshm-xpmem

For BlueGene/Q, PSHM via the "common heap" is enabled automatically by
the corresponding cross-configure scripts.  However, one can also use:
        GHEAP: --enable-pshm --disable-pshm-posix --enable-pshm-gheap

For Linux systems with libhugetlbfs and a hugetlbfs mount, there is also
experimental support for mmap()ed files on hugetlbfs, which can be
enabled with:
    HUGETLBFS: --enable-pshm --disable-pshm-posix --enable-pshm-hugetlbfs

PSHM includes an environment variable for controlling use of memory for
intra-memory AM traffic:

  GASNET_PSHMNET_QUEUE_DEPTH
     Minimum number of PSHM-AMs a process must be capable of sending before
     it may stall (default 32).  The shared memory allocated on a compute
     node with P processes is roughly (2 * P * Depth * MaxMsgSz).

Parameters Setting:
------------------
Although recommended as the first option, POSIX shared memory is not
available on all systems, even systems running Linux may not be configured
to support it.  In the absence of the POSIX shared memory, users are
advised to use the SystemV shared memory as the next-best option.

In the absence of both, POSIX and SystemV shared memory, a user may try
using mmap()ed disk files. However, on some systems we see significant
performance degradation when using files (apparently due to committing the
changes from memory to disk).

On most operating systems the amount of available SystemV shared memory
and the number of shared memory segments is controlled by the kernel
parameters: shmmax, shmall and shmmni.
   shmmax = largest size of a shared memory segment (in bytes)
   shmall = total amount of memory allocatable as shared (in pages)
   shmmni = maximum number of shared memory segments

Insufficient amount of SystemV shared memory will lead to failures at
start-up of any application using a runtime configured to use PSHM over
SystemV.  Setting these parameters is system-specific and requires
administrator privileges.

* Examples to set:

   - Linux:	
	sudo /sbin/sysctl -w kernel.shmmax=<large value>
	sudo /sbin/sysctl -w kernel.shmall=<large value>
	sudo /sbin/sysctl -w kernel.shmmni=<larger than number of
					    PSHM processes + 1>
   - MacOS:
	sudo /sbin/sysctl -w kern.sysv.shmmax=<large value>
	sudo /sbin/sysctl -w kern.sysv.shmall=<large value>
	sudo /sbin/sysctl -w kern.sysv.shmmni=<larger than number of
					 PSHM processes + 1>

   - Various BSD flavors:
        FreeBSD: follow MacOS example, replacing "sysv" with "ipc".
        NetBSD: follow MacOS example, replacing "sysv" with "ipc".
        OpenBSD: follow MacOS example, replacing "sysv" with "shminfo".

   - Solaris:
        Depends on version.  Please see the Sun/Oracle documentation.

   - Cygwin:
        See Cygwin under "change parameters permanently", below.

* To change parameters permanently:

   - Linux:
	Add the following lines to /etc/sysctl.conf (sudo required):
	kernel.shmmax=<large value>
	kernel.shmall=<large value>
	kernel.shmmni=<larger than number of
	               PSHM processes + 1>

	To reload the new settings:
	sudo /sbin/sysctl -p /etc/sysctl.conf

   - MacOS:
	Add the following lines to /etc/sysctl.conf (sudo required):
	kern.sysv.shmmax=<large value>
	kern.sysv.shmall=<large value>
	kern.sysv.shmmni=<larger than number of
	                  PSHM processes + 1>

	To reload the new settings: reboot the machine.

   - Various BSD flavors:
        FreeBSD: follow MacOS example, replacing "sysv" with "ipc".
        NetBSD: follow MacOS example, replacing "sysv" with "ipc".
        OpenBSD: follow MacOS example, replacing "sysv" with "shminfo".

   - Solaris:
        Depends on version.  Please see the Sun/Oracle documentation.

   - Cygwin:
        If you have not done so yet, please Read the Cygwin documentation
        on "cygserver-config" to create an initial /etc/cygsever.conf
        and start the server as a Windows service (optional).
        You may then edit /etc/cygserver to edit
          kern.ipc.shmmaxpgs
          kern.ipc.shmmni
          kern.ipc.shmseg
        Except for shmmaxpgs, the defaults are often large enough.
        These configuration values are only read when cygserver starts.
        So, read the Cygwin documentation to determine if/how to restart
        the cygserver service.
        Under Cygwin-1.5 you may also need to add "server" to the value
        of the CYGWIN environment variable.  Again, you should see the
        Cygwin documentation for more information on this subject.

IMPORTANT, SYSTEM CLEANING:
--------------------------
If a GASNet application using PSHM is terminated before ending the
initialization phase, there is a possibility that the shared memory objects
will remain in the system.  A large amount of memory or disk space can
remain allocated, preventing users from fully utilizing all available
hardware resources.

In the SystemV case, the allocated (but not released) shared memory
segments can be listed via the "ipcs" command, and can be removed via the
"ipcrm" command.  Note that on the systems with a batch scheduler, the
"ipcs" and "ipcrm" instructions need to be run on the compute nodes.

In the mmap()ed file case, the allocated but not released shared memory files
can be found in the directory pointed by the TMPDIR environment variable
(default: /tmp). These files are named with the prefix GASNT (the lack of an
'E' is not a typo), and can be deleted using the "rm" command.

The case of mmap()ed file on hugetlbfs is like the regular mmap()ed file
case except that the directory will be the hugetlbfs mount point.  This is
typically something like /dev/hugepages.  If uncertain of the mount point,
the command "mount|grep hugetlbfs" should locate it for you.

In the case of POSIX shared memory, the implementation is system-specific.
In the case of Linux and Solaris, POSIX shared memory objects are visible
in the file system.  For Linux the default location is /dev/shm and on
Solaris the default is /tmp.

Scope:
-----
* GASNet segment via PSHM only supported for SEGMENT_FAST or SEGMENT_LARGE
  (not meaningful for SEGMENT_EVERYTHING mode)
* May eventually support AM-over-PSHM for SEGMENT_EVERYTHING (but not yet)
* Applicable both w/ and w/o pthreads

Terminology:
-----------
* node: each UNIX process running GASNet
* supernode: 1 or more nodes with cross-mapped segments using PSHM support
* supernode peers: nodes which share a supernode

Interface notes:
---------------
* All node processes call gasnet_init(), each is a separate GASNet node
* PSHM is enabled/disabled at configure time and GASNETI_PSHM_ENABLED is
  #defined to either 1 or 0.  Each conduit can then #define GASNET_PSHM
  to 1 if it implements PSHM support.
* gasnetc_init() performs super-node discovery, using OS-appropriate (or
  conduit-specific) mechanisms to figure out which nodes are capable of
  sharing memory with which other nodes:
   - unconditionally calls gasneti_nodemapInit() (to drive "discovery")
   - calls gasneti_pshm_init() only if PSHM support enabled (to setup data)
* MaxLocal/Global return values reflecting the amount of segment space divided
  evenly among the supernode peers, and each node passes a size to
  gasnet_attach reflecting the per-node segment size they want. 
* gasnet_attach takes care of mapping each processor's segments as usual, but
  also maps the segments of supernode peers into each nodes VM space using
  OS-appropriate mechanisms. (shm_open()+mmap(), shmget()+shmat(), etc.).
* Nodes on a supernode typically have different virtual address map of the
  segments on that supernode.  They are typically not contiguous either.
* Client calls gasnet_getSegmentInfo() to get the location of his segment and
  those of other nodes (as always)
* Client calls gasnet_getNodeInfo() to get the compute node rank, and the
  supernode rank (which need not be the same when GASNET_SUPERNODE_MAXSIZE
  is non-zero) for each node.  Only nodes with the same supernode rank as
  the caller are addressable via PSHM.  When PSHM is enabled the "offset"
  field givies the difference between X's address for its own segment and the
  address (if any) at which that segment is mapped in the caller's address
  space.
* Client may directly load/store into the segments of any node sharing their
  supernode (currently implemented in Berkeley UPC runtime library)
* remotely-addressable segment restrictions on gasnet_put/get/AMLong apply to
  the individual segments - ie gasnet_put() to an address in the segment of
  node X must give node X as the target node, not some other supernode peer

Restrictions:
------------
* gasnet_hsl_t's are node-local and while they might reside in the segment,
  they may not be accessed by more than one node in a supernode
  - we can/should add a debug-mode check for this (also applies to shmem-conduit)
* Use of GASNet atomics in the segment is allowed, but they must not be weak
  atomics (which means using the explicitly "strong" ones in client code).

Closed (previously "Open") questions:
------------------------------------
Q1) Do we need a separate build or separate configure of libgasnet and/or
    libupcr with PSHM enabled/disabled?
A1) Since the set of conduits supported by PSHM was initially a small subset
    of the total list, we chose not to complicate the UPC compiler with this.
    Thus we've chosen to configure everything (UPCR+gasnet) w/ --enable-pshm
    or w/o.  The number of conduits supporting PSHM is now irrelevant since in
    a PSHM-enabled build of GASNet any conduits not supporting PSHM are simply
    built w/o it (as opposed to not built at all as was once the case).

Q2) If we want to use the same build, then how should GASNET_ALIGNED_SEGMENTS
    definition behave?  Never true when any supernode contains more than one
    node, but don't know that until runtime.
A2) We assume that you don't use PSHM unless also using > 1 proc/node.
    May also revisit if we don't configure PSHM as a distinct build.

Q3) Can we get away with always connecting segments after all processes are
    created, or do we need to fork after setting up shared memory segments?
    Will drivers & spawners even allow that?
    If we decide that a fork is required after job launch, then it should
    definitely be done by the conduit, not the client code. But how would the
    interface look? (this would very likely break MPI interoperability)
A3) All supported conduits are attaching to segments in gasnet_attach().  We
    don't need to work about fork() at all (except that smp-conduit now has a
    fork-based spawner inside gasnetc_init()).

Q4) Does the client code between init/attach need to know the supernode
    associations? (eg to make segsize decision)
A4) So far we have not seen a need for this (though internal to GASNet we do).

Q5) Can/do we still get allocate on first write mapping for the segment?
    - If so, who's responsible for establishing processor/memory affinity
      with first touch? (probably the client)
A5) We have each node mmap() its own segment before any cross-mapping is done
    which should ensure locality if the OS does allocation at mmap() time.
    We currently have the client doing first-touch to deal with the case that
    the OS does page frame allocation on touch, rather than mmap().

Q6) Do we ever want to allow supernodes to share a physical node?
    (eg to increase segment size or to leverage NUMA affinity)
    - if so, need an interface to specify this (probably environment variables)
A6) The GASNET_SUPERNODE_MAXSIZE env var bounds the number of processes which
    will be joined into a single supernode (with 0 meaning no bound).  When
    there are more than this number of processes per node, then multiple
    supernodes will be constructed.

Open questions:
--------------
* How do we handle 8 or 16-way SMPs on 32-bit platforms where VM space is
  already tight, or OS's where the limit on sharable memory is small? This
  design would make our per-node segsizes rather small. Do we want a mode
  where segments are not cross-mapped, but the gasnet_put/get can bypass the
  NIC using a two-copy scheme through bounce buffers?
  - This bounce buffer mode could potentially also help for EVERYTHING mode
    (without pshm segments), although due to attentiveness issues, it may be
    slower than using loopback RDMA
  - Is this mode just the extended-ref using AM-over-PSHM?
* Will there be contention with MPI for resources (and should we care)?
  
Known Problems / To do:
----------------------
* The mechanism we are using to probe for maximum segment size works fine on a
  system with plenty of memory, but dies on systems with less.  The work
  around is to set the GASNET_MAX_SEGSIZE small enough for a given system.
* There are still error cases that will leak shared memory.

Status:
------
* The entire GASNet and Berkeley UPC test suites are run on the platforms
  which support PSHM, and it is considered stable on Linux, AIX, Solaris,
  {Free,Net,Open}BSD and BG/Q.
* IBM BG/Q platform-specific notes:
  - The PAMI conduit supports PSHM, and the provided cross-configure script
    will --enable-pshm-gheap
  - One must set BG_MAPCOMMONHEAP=1 in the compute node environment.
    This is handled automatically by the gasnetrun_mpi and gasnetrun_pami
    scripts, but must be handled manually when using runjob directly (or
    a wrapper like qsub from ANL's Cobalt suite).
* Cray XE, XK and XC platform-specific notes:
  - The Gemini and Aries conduits support PSHM, and the provided cross-
    configure scripts --enable-pshm-xpmem
* MacOS X platform-specific notes:
  - MacOS X with POSIX shared memory is NOT supported because we appear to 
    trigger a kernel memory leak.
  - SystemV shared memory is a valid choice:
       --enable-pshm --enable-pshm-sysv
  - Use of mmap()ed files has been seen to cause VERY slow start-up.
* {Free,Open,Net}BSD platform-specific notes:
  - FreeBSD supports POSIX shared memory and has been well tested
  - OpenBSD and NetBSD do not support POSIX shared memory, but do
    support SystemV:
       --enable-pshm --enable-pshm-sysv

* GASNet conduits known NOT to work:
  - SHMEM conduit does not support PSHM, but there is no reason to think
    that doing so would be constructive.
  Keep in mind that if you use one of these conduits on a platform with the
  necessary support for PSHM, you may still configure with --enable-pshm to
  get PSHM support in other conduits (e.g. SMP and MPI), and non-PSHM
  conduits will still build (they will simply be missing PSHM support).
