Blaze AIR (Array Intermediate Representation)
=============================================

 * [Blaze Execution System](blaze-execution.md)

Blaze AIR represents deferred computations for the purposes of transformation,
optimization and execution. These deferred operations are generated by the
user, through the application of blaze functions to data descriptions (arrays).

What follows is a description of the intermediate representation, the
transformations and the execution system.

We will start out in a local context and later describe how distributed
and out-of-core computation may fit in.


The goals of AIR are to:

 * Transport blaze expressions across the network, aka "moving code to data".
 * Represent blaze's array programming abstractions explicitly. This means
   converting implicit broadcasting to explicit broadcasting, picking
   particular type signatures of kernels within blaze functions, etc.
 * Be understandable by third-party systems, and participate in the
   development/discussion (https://groups.google.com/forum/#!topic/numfocus/tX7fRfwiFkI).

Intermediate Representation
---------------------------
The intermediate representation AIR uses is similar to a simple linear
three-address code. However, our operations may take an arbitrary number
of operands.

Each operation initially corresponds to a sub-expression in the expression
graph. The encoding is simply the result of a post-order traversal of the
graph, starting at the root. The root is determined by the expression that
was passed to blaze.eval:

```python
    root = a + (b * c)
    result = blaze.eval(root)
```

The initial AIR generated by the above expression would be similar to
the following:

```
    function expr(%a, %b, %c) {
        %0 = kernel(mul, %b, %c)
        %1 = kernel(add, %a, %0)
        ret %1
    }
```

As can be seen above, each operation produces a value in a virtual register,
e.g. `%0`. Such a register might be referenced multiple times without
re-evaluating the subexpression. This allows us to encode arbitrary DAGs.
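
The post-order encoding and register reuse described above can be sketched as follows. This is a minimal illustration, not the blaze implementation: the `Node` class and the `linearize` function are hypothetical names, and operations are shown as plain `(register, kernel, operands)` tuples.

```python
class Node:
    """Hypothetical expression-graph node: a kernel name plus child nodes."""
    def __init__(self, name, *children):
        self.name = name
        self.children = children


def linearize(root):
    """Post-order traversal emitting (register, kernel, operands) tuples."""
    ops = []
    registers = {}  # id(node) -> register, so each node is emitted only once

    def visit(node):
        if id(node) in registers:
            return registers[id(node)]       # shared sub-expression: reuse
        if not node.children:
            reg = "%" + node.name            # leaf: a function argument
        else:
            operands = [visit(child) for child in node.children]
            reg = "%" + str(len(ops))        # next free virtual register
            ops.append((reg, node.name, operands))
        registers[id(node)] = reg
        return reg

    ret = visit(root)
    return ops, ret


a, b, c = Node("a"), Node("b"), Node("c")
prod = Node("mul", b, c)
root = Node("add", Node("add", a, prod), prod)   # prod is used twice

ops, ret = linearize(root)
for reg, kernel, operands in ops:
    print(reg, "=", kernel, operands)
print("ret", ret)
```

Because `prod` is memoized by identity, it is emitted once as `%0` and then referenced twice, which is how a linear sequence can encode a DAG rather than only a tree.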

The representation does not understand control flow, which is all internal to
the functions themselves. We merely represent a high-level function composition
along with the data flow between computations. A (naive) evaluator could
simply walk through these instructions and apply the kernels successively:

```
    for each op in expr do
        kernel = op.args[0]
        args = [values[arg] for arg in op.args[1:]]
        values[op] = kernel(*args)
```
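
A runnable version of this naive evaluator might look like the sketch below. The `Op` tuple and the `evaluate` function are hypothetical stand-ins for illustration, and Python's `operator` functions play the role of kernels; none of this is the actual blaze or pykit API.

```python
from collections import namedtuple
import operator

# Hypothetical stand-in for an AIR operation: the result register, the
# kernel to call, and the names of the registers it reads.
Op = namedtuple("Op", ["result", "kernel", "args"])


def evaluate(ops, inputs, ret):
    """Apply each kernel in order, storing results by register name."""
    values = dict(inputs)                 # start with the function arguments
    for op in ops:
        args = [values[name] for name in op.args]
        values[op.result] = op.kernel(*args)
    return values[ret]


# expr(%a, %b, %c) = a + (b * c)
expr = [
    Op("%0", operator.mul, ["%b", "%c"]),
    Op("%1", operator.add, ["%a", "%0"]),
]

result = evaluate(expr, {"%a": 1, "%b": 2, "%c": 3}, ret="%1")
print(result)  # 7
```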

However, we want to apply optimizations such as fusion, and we want semantics
such as broadcasting, which complicates our story.

We first describe our IR in more detail.

pykit
-----
We use pykit to encode our intermediate representation. All operations are
instances of `pykit.ir.Operation`. This is a class that tracks:

    * a result register (`result`)
    * a type for the result (`type`)
    * an arbitrary string opcode (`opcode`)
    * a (nested) list of arguments (`args`)

Types may be anything; in blaze we use datashape types. The only part of pykit
that is likely to be of interest for the purposes of AIR is `pykit.ir`.

All these operations are sequenced into basic blocks. We only ever use a single
basic block, since we do not support control flow at this level. These basic
blocks are arranged into functions, such as our function `expr` above.

We generate one such function for each compound expression that is evaluated
by the user.
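
To make the nesting concrete, here is a small sketch of the structure described in this section. The classes below are illustrative stand-ins written for this document, not pykit's actual classes, and the string `"float64"` is only a placeholder where blaze would use a datashape type.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class Operation:
    """Illustrative stand-in mirroring the four fields listed above."""
    result: str       # result register, e.g. "%0"
    type: Any         # result type; blaze would use a datashape here
    opcode: str       # arbitrary string opcode, e.g. "kernel"
    args: List[Any]   # (possibly nested) list of arguments


@dataclass
class Function:
    """A function holding a single basic block of operations."""
    name: str
    argnames: List[str]
    block: List[Operation] = field(default_factory=list)


expr = Function("expr", ["%a", "%b", "%c"])
expr.block.append(Operation("%0", "float64", "kernel", ["mul", "%b", "%c"]))
expr.block.append(Operation("%1", "float64", "kernel", ["add", "%a", "%0"]))

for op in expr.block:
    print(op.result, "=", op.opcode, op.args)
```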


Transformations
---------------
After correctness, our main goal is efficiency. Hence we do not want to use
Python functions to execute over large amounts of data. Furthermore, we must
support open-ended extension where new backends may be registered externally.

The job of the transformations is to leverage the implementation kernels
that were registered for the blaze functions. E.g. to support local execution,
we may have to JIT a function, and to support an operation over data in a
relational database we may have to generate SQL.

This burden is placed on the kernels. This allows a clean separation of
function specification, kernel implementation and backend composition and
optimization.

Our goal is to have backend passes generate efficient compositions from these
kernels, encapsulated in backend-specific data. Passes consume the sub-graphs
of sub-expressions they want to handle, based on heuristics, cost models,
assumptions or simply because they can. Sub-graphs are contracted to single
nodes, which are either execution-ready or processed by some later pass.
Decisions are often based on sub-expression-specific metadata, such as type
or expected data location (covered below).
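
As an illustration of contracting a sub-graph to a single node, the sketch below fuses a multiply feeding an add into one `fma` node. The `Op` tuple and the `fuse_mul_add` pass are hypothetical names, and the single-use check is a deliberately simple stand-in for the heuristics and cost models a real pass would use.

```python
from collections import namedtuple

# Hypothetical operation: result register, opcode, and [kernel, operands...].
Op = namedtuple("Op", ["result", "opcode", "args"])


def fuse_mul_add(ops, ret):
    """Contract add(x, mul(y, z)) into a single fma(x, y, z) node.

    The mul is only absorbed when its result has exactly one use, so no
    other consumer (including the return value) loses its input.
    """
    uses = {ret: 1}
    for op in ops:
        for arg in op.args[1:]:
            uses[arg] = uses.get(arg, 0) + 1
    defs = {op.result: op for op in ops}

    fused, dead = [], set()
    for op in ops:
        kernel, *operands = op.args
        if kernel == "add" and len(operands) == 2:
            x, y = operands
            producer = defs.get(y)
            if (producer is not None and producer.args[0] == "mul"
                    and uses[y] == 1):
                dead.add(y)   # the mul is now part of the fused node
                op = Op(op.result, "pykernel", ["fma", x, *producer.args[1:]])
        fused.append(op)
    return [op for op in fused if op.result not in dead]


expr = [
    Op("%0", "kernel", ["mul", "%b", "%c"]),
    Op("%1", "kernel", ["add", "%a", "%0"]),
]
fused_expr = fuse_mul_add(expr, ret="%1")
print(fused_expr)
```

This sketch keeps the original result register of the add; renumbering registers after contraction would be a separate, cosmetic step.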


Pipeline and Environment
------------------------
Our transformations are composed in a single pipeline, which is a series of
transformations to be applied in succession. Each transformation takes two
arguments: the function (which encodes the expression) and the environment.

The function, which we have already discussed above, holds the latest encoding
of the expression. The environment is a simple dictionary mapping string keys
to values. This allows passes to communicate any metadata down the pipeline.
This metadata can be local to each sub-expression, for instance by mapping
each Operation to some value, or global to the expression. What this data
means is specific to a sub-set of passes and is unrestricted by the system.
However, it is good practice to document environment keys and their purpose
in `air/environment.py`.
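
A minimal sketch of such a pipeline is shown below. It assumes, purely for illustration, that each pass returns the (possibly rewritten) function together with the environment; the pass names, the `"air.types"` key, and the use of a plain list of strings as the "function" are all hypothetical.

```python
def drop_noops(func, env):
    """Rewrite the function; later passes see the new encoding."""
    return [op for op in func if op != "noop"], env


def annotate_types(func, env):
    """Record per-operation metadata under a string key in the environment."""
    env["air.types"] = {op: "float64" for op in func}
    return func, env


def run_pipeline(func, env, passes):
    """Apply each pass in succession, threading function and environment."""
    for transform in passes:
        func, env = transform(func, env)
    return func, env


func, env = run_pipeline(["mul", "noop", "add"], {}, [drop_noops, annotate_types])
print(func)
print(env)
```

Here `annotate_types` only sees the operations that survived `drop_noops`, which shows why the order of passes in the pipeline matters.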


Backends
--------
Different backends perform different operations and need to produce different
execution kernels. To allow mixing and matching of different backends, the
result of all transformations is a simple sequence of `"pykernel"` operations,
e.g. assuming we were able to fuse our `expr` example:

```
    function expr(%a, %b, %c) {
        %0 = pykernel(fma, %a, %b, %c)
        ret %0
    }
```

The result can be interpreted straightforwardly. Depending on the backend,
these kernels may be produced by a series of transformations or by a single
transformation. For instance, the JIT infrastructure currently produces
ckernels, which are composed in a later pass. A final pass can produce a
`pykernel` from conglomerated ckernels.


Out-of-core
-----------
Out-of-core computations can be handled in many ways, depending on the
nature of the kernel. Notably, we can identify two approaches:

    * implement the OOC feature outside of the execution pipeline, wrapping
      the entire process
    * implement the OOC feature inside the pipeline

In the simplest case, trivially data-parallel kernels operating over
in-memory data can be wrapped using either mechanism, i.e. by wrapping
the entire pipeline or individual kernels with OOC kernels.

In the most general case, however, wrapping the entire pipeline in an opaque
manner may not work, since we have no knowledge of data access patterns.
In this case Blaze functions must be ready to accept a data descriptor for
consumption. How this is handled depends on how the kernel is implemented.
This may work well if kernels and data descriptors are jitted.
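
For the simple data-parallel case, wrapping an individual in-memory kernel with an OOC kernel could be sketched as below. The `out_of_core` wrapper is a hypothetical name, and ordinary Python sequences stand in for data descriptors backed by storage; the point is only that one chunk per operand is handled at a time.

```python
def out_of_core(kernel, chunk_size):
    """Wrap an elementwise in-memory kernel so it runs chunk by chunk."""
    def wrapped(*columns):
        n = len(columns[0])
        for start in range(0, n, chunk_size):
            stop = start + chunk_size
            # Only one chunk per operand is materialized at a time.
            chunks = [col[start:stop] for col in columns]
            yield kernel(*chunks)
    return wrapped


def add_kernel(xs, ys):
    """An in-memory elementwise kernel that knows nothing about chunking."""
    return [x + y for x, y in zip(xs, ys)]


ooc_add = out_of_core(add_kernel, chunk_size=2)
result = [v for chunk in ooc_add(range(5), range(10, 15)) for v in chunk]
print(result)  # [10, 12, 14, 16, 18]
```

This only works because an elementwise kernel touches each position independently. A kernel with a different access pattern, such as a sort, cannot be wrapped this way without knowing how it reads its inputs.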


Distributed Computation
-----------------------
Distributed computation is again a different beast. We have the same problem
as with OOC computation, namely that we cannot assume knowledge of the data
access requirements of an arbitrary kernel. It makes sense to treat
"out-of-core" and "distributed" each as a separate "implementation kind",
requiring a different implementation to support execution over the respective
kinds of data sources.
Of course, many cases can be generalized at the kernel implementation level,
such as OOC or distributed execution for elementwise ufuncs, reductions, etc.


Metadata
--------
To have some notion of kernel properties, such as commutativity or associativity,
we support per-op metadata. This can be represented directly on the operation
or in a mapping in the expression-global environment.

Such properties can be expressed in the blaze function, by passing in
additional keyword arguments which become part of the sub-expression's metadata.

We further allow kernel-specific metadata to be set for individual kernels.
This may be specific to the backend in question that is supposed to handle
the assemblage of kernels of that kind.
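
The sketch below shows one way keyword arguments could become metadata that a pass later consults. The `blaze_function` decorator here is a hypothetical stand-in written for this example, not the real blaze API, and `canonicalize` is an invented pass that uses the `commutative` property to put operands in a canonical order.

```python
def blaze_function(**metadata):
    """Hypothetical decorator attaching keyword metadata to a kernel."""
    def decorate(fn):
        fn.metadata = dict(metadata)
        return fn
    return decorate


@blaze_function(commutative=True, associative=True)
def add(x, y):
    return x + y


@blaze_function()
def sub(x, y):
    return x - y


def canonicalize(kernel, operands):
    """Sort operands of commutative kernels so equal expressions match."""
    if kernel.metadata.get("commutative"):
        return sorted(operands)
    return list(operands)


print(canonicalize(add, ["%b", "%a"]))  # ['%a', '%b']
print(canonicalize(sub, ["%b", "%a"]))  # ['%b', '%a']
```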
