Working with numerical data in shared memory (memmaping)
========================================================

By default the workers of the pool are real Python processes forked using the
``multiprocessing`` module of the Python standard library when ``n_jobs != 1``.
The arguments passed as input to the ``Parallel`` call are serialized and
reallocated in the memory of each worker process.

This can be problematic for large arguments as they will be reallocated
``n_jobs`` times by the workers.

As this problem can often occur in scientific computing with ``numpy``
based datastructures, :class:`joblib.Parallel` provides a special
handling for large arrays to automatically dump them on the filesystem
and pass a reference to the worker to open them as memory map
on that file using the ``numpy.memmap`` subclass of ``numpy.ndarray``.
This makes it possible to share a segment of data between all the
worker processes.

.. note::

  The following only applies with the default ``"multiprocessing"`` backend. If
  your code can release the GIL, then using ``backend="threading"`` is even
  more efficient.


Automated array to memmap conversion
------------------------------------

The automated array to memmap conversion is triggered by a configurable
threshold on the size of the array::

  >>> import numpy as np
  >>> from joblib import Parallel, delayed
  >>> from joblib.pool import has_shareable_memory

  >>> Parallel(n_jobs=2, max_nbytes=1e6)(
  ...     delayed(has_shareable_memory)(np.ones(int(i)))
  ...     for i in [1e2, 1e4, 1e6])
  [False, False, True]

By default the data is dumped to the ``/dev/shm`` shared-memory partition if it
exists and writeable (typically the case under Linux). Otherwise the operating
system's temporary folder is used. The location of the temporary data files can
be customized by passing a ``temp_folder`` argument to the ``Parallel``
constructor.

Passing ``max_nbytes=None`` makes it possible to disable the automated array to
memmap conversion.


Manual management of memmaped input data
----------------------------------------

For even finer tuning of the memory usage it is also possible to
dump the array as an memmap directly from the parent process to
free the memory before forking the worker processes. For instance
let's allocate a large array in the memory of the parent process::

  >>> large_array = np.ones(int(1e6))

Dump it to a local file for memmaping::

  >>> import tempfile
  >>> import os
  >>> from joblib import load, dump

  >>> temp_folder = tempfile.mkdtemp()
  >>> filename = os.path.join(temp_folder, 'joblib_test.mmap')
  >>> if os.path.exists(filename): os.unlink(filename)
  >>> _ = dump(large_array, filename)
  >>> large_memmap = load(filename, mmap_mode='r+')

The ``large_memmap`` variable is pointing to a ``numpy.memmap``
instance::

  >>> large_memmap.__class__.__name__, large_array.nbytes, large_array.shape
  ('memmap', 8000000, (1000000,))

  >>> np.allclose(large_array, large_memmap)
  True

We can free the original array from the main process memory::

  >>> del large_array
  >>> import gc
  >>> _ = gc.collect()

It it possible to slice ``large_memmap`` into a smaller memmap::

  >>> small_memmap = large_memmap[2:5]
  >>> small_memmap.__class__.__name__, small_memmap.nbytes, small_memmap.shape
  ('memmap', 24, (3,))

Finally we can also take a ``np.ndarray`` view backed on that same
memory mapped file::

  >>> small_array = np.asarray(small_memmap)
  >>> small_array.__class__.__name__, small_array.nbytes, small_array.shape
  ('ndarray', 24, (3,))

All those three datastructures point to the same memory buffer and
this same buffer will also be reused directly by the worker processes
of a ``Parallel`` call::

  >>> Parallel(n_jobs=2, max_nbytes=None)(
  ...     delayed(has_shareable_memory)(a)
  ...     for a in [large_memmap, small_memmap, small_array])
  [True, True, True]

Note that here we used ``max_nbytes=None`` to disable the auto-dumping
feature of ``Parallel``. The fact that ``small_array`` is still in
shared memory in the worker processes is a consequence of the fact
that it was already backed by shared memory in the parent process.
The pickling machinery of ``Parallel`` multiprocessing queues are
able to detect this situation and optimize it on the fly to limit
the number of memory copies.


Writing parallel computation results in shared memory
-----------------------------------------------------

If you open your data using the ``w+`` or ``r+`` mode in the main program, the
worker will have ``r+`` mode access hence will be able to write results
directly to it alleviating the need to serialization to communicate back the
results to the parent process.

Here is an example script on parallel processing with preallocated
``numpy.memmap`` datastructures:

.. literalinclude:: ../examples/parallel_memmap.py
   :language: python
   :linenos:

.. warning::

  Having concurrent workers write on overlapping shared memory data segments
  for instance by using inplace operators and assignments on a `numpy.memmap`
  instance can lead to data corruption as numpy does not offer atomic
  operations. The previous example does not risk that issue as each task is
  updating an exclusive segment of the shared result array.

  Some C/C++ compilers offer lock-free atomic primitives such as add-and-fetch
  or compare-and-swap that could be exposed to Python via CFFI_ for instance.
  However providing numpy-aware atomic constructs is outside of the scope
  of the joblib project.


.. _CFFI: https://cffi.readthedocs.org


A final note: don't forget to clean up any temporary folder when you are done
with the computation::

  >>> import shutil
  >>> try:
  ...     shutil.rmtree(temp_folder)
  ... except OSError:
  ...     pass  # this can sometimes fail under Windows
