Laplace 3D

This is a simple 7-point stencil on a 3D grid. It is memory-bandwidth-bound.
The C version reaches 60 GB/s on a 2-socket SNB-EP (Sandy Bridge-EP) system.
The code doesn't handle the edge cases at the grid boundaries, so each
dimension must be of the form 4n+2 for SSE or 8n+2 for AVX.
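
For orientation, here is a minimal scalar sketch of one sweep of a 7-point
stencil. It is not the code in this repository: the Jacobi-style update
(each interior point becomes the average of its six face neighbours), the
float data type, the i-fastest memory layout, and the names idx and
laplace3d_sweep are all assumptions made for illustration.

```c
#include <stddef.h>

/* Linear index of grid point (i, j, k), with i varying fastest.
 * The layout is an assumption; the real code may order axes differently. */
static size_t idx(size_t i, size_t j, size_t k, size_t nx, size_t ny)
{
    return i + nx * (j + ny * k);
}

/* One sweep over the interior of an nx*ny*nz grid (hypothetical helper).
 * Reads the six face neighbours of each interior point from `in` and
 * writes their average to the same point in `out`. Boundary points of
 * `out` are left untouched. */
void laplace3d_sweep(const float *in, float *out,
                     size_t nx, size_t ny, size_t nz)
{
    for (size_t k = 1; k + 1 < nz; k++) {
        for (size_t j = 1; j + 1 < ny; j++) {
            for (size_t i = 1; i + 1 < nx; i++) {
                float sum = in[idx(i - 1, j, k, nx, ny)]
                          + in[idx(i + 1, j, k, nx, ny)]
                          + in[idx(i, j - 1, k, nx, ny)]
                          + in[idx(i, j + 1, k, nx, ny)]
                          + in[idx(i, j, k - 1, nx, ny)]
                          + in[idx(i, j, k + 1, nx, ny)];
                out[idx(i, j, k, nx, ny)] = sum / 6.0f;
            }
        }
    }
}
```

Each update does only a handful of additions per value loaded and stored,
which is consistent with the kernel being limited by memory bandwidth
rather than arithmetic. The "+2" in the size constraints presumably
accounts for the two boundary points in each dimension, leaving an
interior that is a whole number of 4-wide (SSE) or 8-wide (AVX) vectors.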

E.g.:

$ ./laplace3d 258 258 258 1000 avx 0
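
The size 258 used above satisfies both constraints, since
258 = 4*64+2 = 8*32+2. A small check of that form could look like the
sketch below. The function name dim_ok is hypothetical, and requiring at
least one full vector of interior points is an assumption; the program
itself may not perform any such check.

```c
#include <stdbool.h>
#include <stddef.h>

/* True if n can be written as width*m + 2 with m >= 1.
 * Use width = 4 for the SSE constraint and width = 8 for AVX.
 * The m >= 1 requirement is an assumption of this sketch. */
bool dim_ok(size_t n, size_t width)
{
    return n >= width + 2 && (n - 2) % width == 0;
}
```

For instance, dim_ok(258, 8) holds, while dim_ok(256, 8) does not, and
262 = 4*65+2 is acceptable for SSE but not for AVX.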

The Matlab/Octave versions are much slower than the C and Julia
versions, so they run only 1/10 as many iterations.

