.. default-domain:: chpl

.. module:: HDFS
   :synopsis: Support for Hadoop Distributed Filesystem

HDFS
====


Support for Hadoop Distributed Filesystem

This module implements support for the
`Hadoop <http://hadoop.apache.org/>`_
`Distributed Filesystem <http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html>`_ (HDFS).

Dependencies
------------

The HDFS functionality in Chapel is dependent upon both Hadoop and Java being
installed.  Your ``HADOOP_INSTALL``, ``JAVA_INSTALL`` and ``CLASSPATH``
environment variables must be set as described below in
:ref:`setting-up-hadoop`.  Without this it will not compile with HDFS, even if
the flags are set. As well, the HDFS functionality is also dependent upon the
``CHPL_AUXIO_INCLUDE`` and ``CHPL_AUXIO_LIBS`` environment variables being set
properly. For more information on how to set these properly, see
doc/technotes/auxIO.rst in a Chapel release.


Enabling HDFS Support
---------------------

Once you have ensured that Hadoop and Java are installed and have the
five variables above, defined, set the environment variable
CHPL_AUX_FILESYS to 'hdfs' to enable HDFS support:

.. code-block:: sh

  export CHPL_AUX_FILESYS=hdfs

Then, rebuild Chapel by executing 'make' from $CHPL_HOME.

.. code-block:: sh

  make

.. note::

  If HDFS support is not enabled (which is the default), all
  features described below will compile successfully but will result in
  an error at runtime such as: "No HDFS Support".


Using HDFS Support in Chapel
----------------------------

There are three ways provided to open HDFS files within Chapel.

Using an HDFS filesystem with open(url="hdfs://...")
****************************************************

.. code-block:: chapel

  // Open a file on HDFS connecting to the default HDFS instance
  var f = open(mode=iomode.r, url="hdfs://host:port/path");

  // Open up a reader and read from the file
  var reader = f.reader();

  // ...

  reader.close();

  f.close();


Explicitly Using Replicated HDFS Connections and Files
******************************************************

.. code-block:: chapel

  use HDFS;

  // Connect to HDFS via the default username (or whichever you want)
  //
  var hdfs = hdfsChapelConnect("default", 0);

  //
  // Create a file per locale
  //
  var gfl  = hdfs.hdfsOpen("/user/johnDoe/isThisAfile.txt", iomode.r);

  ...
  //
  // On any given locale, you can get the local file for the locale that
  // the task is currently running on via:
  //
  var fl = gfl.getLocal();

  // This file can be used as with a traditional file in Chapel, by
  // creating reader channels on it.

  // When you are done and want to close the files and disconnect from
  // HDFS, use:

  gfl.hdfsClose();
  hdfs.hdfsChapelDisconnect();

Explicitly Using Local HDFS Connections and Files
*************************************************

The HDFS module file also supports non-replicated values across
locales. So if you only wanted to connect to HDFS and open a file on
locale 1 you could do:

.. code-block:: chapel

  on Locales[1] {
    var hdfs = hdfs_chapel_connect("default", 0);
    var fl = hdfs.hdfs_chapel_open("/user/johnDoe/myFile.txt", iomode.cw);
    ...
    var read = fl.reader();
    ...
    fl.close();
    hdfs.hdfs_chapel_disconnect();
  }

The only stipulations are that you cannot open a file in both read and
write mode at the same time. (i.e iomode.r and iomode.cw are the only
modes that are supported, due to HDFS limitations).


.. _setting-up-hadoop:

Setting up Hadoop
-----------------

If you have a working installation of Hadoop already, you can skip
this section, other than to set up your CLASSPATH environment
variable.  This section is written so that people without sudo
permission can install and use HDFS.  If you do have sudo permissions,
you can usually install all of these via a package manager.

The general outline for these instructions is:

  1. Install and point to the jdk to provide code Chapel needs to
     compile against libhdfs (:ref:`Step 1 <setup-hadoop-1>`)
  2. Install Hadoop (:ref:`Step 2 <setup-hadoop-2>`)
  3. Set up Hadoop on (a) the local host or (b) a cluster of hosts
     (:ref:`Step 3 <setup-hadoop-3>`)
  4. Start up HDFS (:ref:`Step 4 <setup-hadoop-4>`)
  5. Stop HDFS when you're done (:ref:`Step 5 <setup-hadoop-5>`)
  6. Set up Chapel to run in distributed mode (:ref:`Step 6 <setup-hadoop-6>`)

First reflect your directory structure and version numbers (etc) in the
:ref:`sample .bashrc <setup-hadoop-bashrc>` and put it in your .bashrc (or
.bash_profile -- your choice) and source whichever one you put it into.

.. _setup-hadoop-1:

1. Make sure you have a SERVER edition of the jdk installed and
   point JAVA_INSTALL to it (see the same .bashrc below)

.. _setup-hadoop-2:

2. Install Hadoop

   * Download the latest version of Hadoop and unpack it

   * Now in the unpacked directory, open conf/hadoop-env.sh and edit:

     * set ``JAVA_INSTALL`` to be the part before ``bin/``... when you do:

        .. code-block:: sh

          which java

     * set ``HADOOP_CLASSPATH=$HADOOP_HOME/""*:$HADOOP_HOME/lib/""*:``

   * Now in conf/hdfs-site.xml put the replication number that you
     want for the field ``dfs.replication`` (this will set the
     replication of blocks of the files in HDFS)

   * Now set up passwordless ssh, if you haven't yet:

     .. code-block:: sh

       ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
       cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

.. _setup-hadoop-3:

3. Set up Hadoop

   a. For the local host - See the
      `Hadoop website <http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html>`_
      for good documentation on how to do this.

   b. For a cluster of hosts. If you want to run Hadoop over a cluster, there
      are good tutorials online. Although it is usually as easy as making
      edits to the following files in ``$HADOOP_HOME/conf``:

      * adding the name of the nodes to ``slaves``
      * putting what you want to be the namenode in ``masters``
      * putting the master node in ``core-site.xml`` and ``mapred-site.xml``
      * running:

        .. code-block:: sh

         hadoop-daemon.sh start datanode
         hadoop-daemon.sh start tasktracker

      After this go to your datanode site and you should see a new
      datanode.

      A good online tutorial for this as well can be found here:
      http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/ClusterSetup.html

.. _setup-hadoop-4:

4. Start HDFS

   * Now all we need to do is format the namenode and start things up:

     .. code-block:: sh

       hadoop namenode -format
       start-all.sh  # (This will start hdfs and the tasktracker/jobtracker)

   * In general, hadoop has the same type of commands as bash,
     usually in the form:

     .. code-block:: sh

         hadoop dfs -<command> <regular args to that command>

   * At this point, you can compile and run Chapel programs using HDFS

   * You can check the status of Hadoop via http, for example on a local
     host (e.g., for :ref:`3a above <setup-hadoop-3>`), using:

       *  ``http://localhost:50070/``
       *  ``http://localhost:50030/``

     For cluster mode (:ref:`3b <setup-hadoop-3>`), you'll use the name of the
     master host in the URL and its port (see the web for details).

.. _setup-hadoop-5:

5. Shut things down:

   .. code-block:: sh

     stop-all.sh   # (This will stop hdfs and mapreduce)

.. _setup-hadoop-6:

6. Set up Chapel to run in distributed mode:

   * You'll need to set up your Chapel environment to target multiple
     locales in the standard way (see multilocale.rst and the
     "Settings to run Chapel on multiple nodes" section of the
     .bashrc below).

   * After this you should be able to run Chapel code with HDFS over
     a cluster of computers the same way as you normally would.


.. _setup-hadoop-bashrc:

Here is a sample .bashrc for using Hadoop within Chapel:


.. code-block:: sh

  #
  # For Hadoop
  #
  export HADOOP_INSTALL=<Place where you have Hadoop installed>
  export HADOOP_HOME=$HADOOP_INSTALL
  export HADOOP_VERSION=<Your Hadoop version number>
  #
  # Note that the following environment variables might contain more paths than
  # those listed below if you have more than one IO system enabled. These are all
  # that you will need in order to use HDFS (only)
  #
  export CHPL_AUXIO_INCLUDE="-I$JAVA_INSTALL/include -I$JAVA_INSTALL/include/linux  -I$HADOOP_INSTALL/src/c++/libhdfs"
  export CHPL_AUXIO_LIBS="-L$JAVA_INSTALL/jre/lib/amd64/server -L$HADOOP_INSTALL/c++/Linux-amd64-64/lib"

  #
  # So we can run things such as start-all.sh etc. from anywhere and
  # don't need to be in $HADOOP_INSTALL
  #
  export PATH=$PATH:$HADOOP_INSTALL/bin

  #
  # Point to the JDK installation
  #
  export JAVA_INSTALL=<Place where you have the jdk installed>

  #
  # Add Hadoop directories to the Java class path
  #
  export CLASSPATH=$CLASSPATH:$HADOOP_HOME/""*:$HADOOP_HOME/lib/""*:$HADOOP_HOME/conf/""*:$(hadoop classpath):

  #
  # So we don't have to "install" these things
  #
  export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_HOME/c++/Linux-amd64-64/lib:$HADOOP_HOME/src/c++/libhdfs:$JAVA_INSTALL/jre/lib/amd64/server:$JAVA_INSTALL:$HADOOP_HOME/lib:$JAVA_INSTALL/jre/lib/amd64:$CLASSPATH

  #
  # Settings to run Chapel on multiple nodes
  #
  export GASNET_SPAWNFN=S
  export SSH_SERVERS=<the names of the computers in your cluster>
  export SSH_CMD=ssh
  export SSH_OPTIONS=-x
  export GASNET_ROUTE_OUTPUT=0

HDFS Support Types and Functions
--------------------------------


 

.. record:: hdfsChapelFile

   Holds a file per locale 


.. record:: hdfsChapelFileSystem

   Holds a connection to HDFS per locale 


.. function:: proc hdfsChapelConnect(out error: syserr, path: c_string, port: int): c_void_ptr

   Make a connection to HDFS for a single locale 

.. function:: proc hdfsChapelConnect(path: string, port: int): hdfsChapelFileSystem

   Connect to HDFS and create a filesystem ptr per locale 

.. method:: proc hdfsChapelFileSystem.hdfsChapelDisconnect()

   Diconnect from the configured HDFS filesystem on each locale 

.. method:: proc hdfsChapelFileSystem.hdfsOpen(path: string, mode: iomode, hints: iohints = IOHINT_NONE, style: iostyle = defaultIOStyle()): hdfsChapelFile

   Open a file on each locale 

.. method:: proc hdfsChapelFile.hdfsClose()

   Close a file opened with :proc:`hdfsChapelFileSystem.hdfsOpen` 

.. method:: proc hdfsChapelFileSystem_local.hdfs_chapel_open(path: string, mode: iomode, hints: iohints = IOHINT_NONE, style: iostyle = defaultIOStyle()): file

   Open a file on an HDFS filesystem for a single locale 

.. function:: proc hdfs_chapel_connect(path: string, port: int): hdfsChapelFileSystem_local

   Connect to an HDFS filesystem on a single locale 

