--------------------------------------------------------------------
DONE: When PROFILING is enabled,
    print the timing breakdown of INA at ncmpio_close()
    print the timing breakdown of two-phase I/O at ad_close()

--------------------------------------------------------------------
DONE: sequential test - add an option to use MPI-IO or not

--------------------------------------------------------------------
DONE * Move ROMIO files into a separate folder
X * Move all MPI-IO calls to a single file, so it can be used to
    call either the internal ROMIO or the external ROMIO
X * Use the API name prefix PNC_; internally, use the file system type
    to decide which ROMIO to call
DONE * For the internal ROMIO, set_fileview can be made independent, so
        PnetCDF's collective/independent data mode switching is no
        longer needed.

* Wish list: support Lustre and GPFS only

* One big advantage of using the internal ROMIO
  + can choose more than one process per node as an aggregator
  + on Lustre, this shows a significant performance improvement
  + on GPFS, this still needs evaluation

* set intra-node aggregation on by default (automatic mode)
* set Lustre mode as the default. Users can disable it at configure time
  with --disable-romio


WRF
        intra-node aggregation does not improve performance (no longer
        true; June 21, 2025)
        One possible reason is that the cost of heap-merge/sort is not as
        high as in E3SM. In the E3SM F case, the collective write cost is
        10x the pwrite cost. This may be mainly spent on memcpy and
        sorting. NEED to measure the communication cost.
        For example,
        collw 17.31 - pwrite  9.36 - comm  1.83 = 6.12 seconds

intra-node  0 0.00 0.00 0.00 0.00 0.00 = 0.00 nsort        0 collw 17.31 pwrite  9.36 comm  1.83 nsenders   255 nprocs 256
intra-node 64 0.00 2.89 0.72 1.47 1.19 = 6.27 nsort 29212364 collw 17.52 pwrite  9.26 comm  1.52 nsenders   127 nprocs 256
intra-node 16 0.00 2.35 1.40 1.48 1.64 = 6.87 nsort 94809012 collw 18.25 pwrite  9.64 comm  1.41 nsenders    31 nprocs 256
intra-node  0 0.00 0.00 0.00 0.00 0.00 = 0.00 nsort        0 collw 12.62 pwrite  7.73 comm  0.99 nsenders   255 nprocs 256
intra-node 64 0.00 2.90 0.70 1.48 1.19 = 6.27 nsort 29212364 collw 13.48 pwrite  7.81 comm  1.13 nsenders   127 nprocs 256
intra-node 16 0.00 2.35 1.41 1.48 1.64 = 6.88 nsort 94809012 collw 14.66 pwrite  8.09 comm  1.91 nsenders    31 nprocs 256


June 28, 2025
After some re-runs, the results for a problem size of 2600 x 3800 show
that INA improves performance in all cases.

--------------------------
June 21, 2025

Done July 4, 2025
    PnetCDF only calls MPI-IO APIs with explicit offsets.
    All implementation checks for ADIO_INDIVIDUAL can be removed.
    #define ADIO_EXPLICIT_OFFSET     100
    #define ADIO_INDIVIDUAL          101

PnetCDF does not enable I/O atomicity.
All implementation checks for atomicity can be removed.


--------------------------------------------------------------------

Add a module that makes use of my own ROMIO for lustre

* currently PnetCDF makes use of only the following MPI-IO APIs
    MPI_File_open MPI_File_close

    MPI_File_write_all MPI_File_write_at_all MPI_File_write MPI_File_write_at
    MPI_File_read_all MPI_File_read_at_all MPI_File_read MPI_File_read_at

    MPI_File_f2c MPI_File_c2f

    MPI_File_set_view MPI_File_seek
    MPI_File_sync MPI_File_delete MPI_File_get_info
    MPI_File_get_position MPI_File_set_size

* No need for a new driver in PnetCDF, as I can add an option to call
  either the MPI-IO or the PNC-IO APIs (when the file system is Lustre)

* The most critical work is to construct ADIO_File, which contains the
  necessary metadata.


--------------------------------------------------------------------
Check the following scenarios

1. When using MPI-IO
    a. When intra-node aggregation is enabled
       Create a new MPI communicator.
       Non-aggregators' fh is MPI_FILE_NULL.
    b. When intra-node aggregation is NOT enabled
       Same as the old PnetCDF behavior.

2. When using the internal ADIO driver
    a. When intra-node aggregation is enabled
       Offsets and lengths are created. (Check the groups where INA is
       disabled !!!)
       Pass offsets and lengths to ADIO, without setting the fileview.
    b. When intra-node aggregation is NOT enabled
       Offsets and lengths are NOT created.
       Must construct a filetype and set the fileview.
    2.1 When using Lustre
    2.2 When NOT using Lustre






--------------------------------------------------------------------



https://github.com/ufs-community/ufs-weather-model/issues/2347
For serial I/O, someone said NetCDF is faster than PnetCDF.
This may be because of the MPI-IO collective overhead, or
the hash table size being too small.
* Check whether the communicator size is 1; if so, automatically switch
  to independent I/O mode.

----- done in PR 149

The hash table size has been increased to 256.

