* Add support for TPP pivot_mode
* Handle middle of tree better (adapt small subtree kernel to allow inputs?)
* Optimize small subtree factorization code
* Figure out how to improve root node performance on many cores
* Optimize/parallelize TPP code [or pass straight to parent?]
* Other optimizations around delayed pivots
* Write report on code
* Sort out test deck
* Sort out documentation
* Parallel solve?
* Add note that hwloc needs CUDA support at compile time to work correctly for us?
* Document rb_write and add C interface
