Chapter 13. THE VULCAN PARALLELIZATION ALGORITHM
The VULCAN CFD code was parallelized using generic MPI (Message Passing Interface) libraries in a data-parallel fashion. The structured multi-block formulation of the software offered a natural domain decomposition methodology which was directly exploited in the parallel algorithm. Function-parallel capabilities using shared memory directives (OpenMP, etc.) can be combined at some later time with the current data-parallel algorithm, but this level of parallelism was beyond the scope of the present effort. In the current data-parallel paradigm, MPI library calls are utilized to handle inter-processor communication, as well as control the flow of the program.
The serial version of the VULCAN software stored data globally via arrays of the form:
V(nj,nk,ni,nv,nblk,nlev) -> (npts*nv*nblk*nlev)
where nj, nk, ni are the grid dimensions in the "j", "k", and "i" directions, nv is the number of variables in the array, nblk is the number of grid blocks, and nlev is the number of grid levels present. At first glance, this 1-D array storage ordering appears less than ideal (from a load balancing perspective) for parallel computing because the grid level is the outermost dimension rather than the block number. However, this shortcoming can be overcome (and actually turned into an advantage) by redefining what "nblk" represents.
The idea behind the parallelization strategy was to redefine nblk as the number of blocks to be stored on a given processor. A new parameter, nbtot, was introduced to represent the total number of grid blocks, i.e.
where np is the number of processors used in the simulation. By defining a processor dependent block number, the core solver did not have to be altered. The mapping of the processor dependent block ordering to the global block ordering is only required when swapping data between processors and when performing data input/output. This greatly reduced the parallelization effort while maintaining all of the advantages of the original array ordering
at the processor (computational) level.
An outline of how this strategy was implemented into the VULCAN software is given below:
-
Given the number of processors and the total number of grid blocks (both of which are specified in the VULCAN input file), the VULCAN pre-processor determines the "best" distribution of grid blocks for each processor in order to satisfy the load balancing constraint of minimizing the grid cell count difference between processors. Since VULCAN regions are solved sequentially, this algorithm was invoked on a region by region basis. The algorithm also assumes that all processors employed in a given simulation are equivalent to one another. Two mapping arrays result from this load balancing operation. The first mapping array maps each grid block (using global block ordering indices) to a processor. The second mapping array maps the global grid block number to a local (processor dependent) block number. Note that this second mapping array is a processor dependent mapping (i.e. processor "A" has no knowledge of the local grid block mapping of processor "B"). These mapping arrays, as well as various load balancing statistics, are written to the file MPI_MAP.DAT and are read later by the solver and post-processor.
-
Once the grid blocks have been distributed across the processors, the VULCAN pre-processor then determines the parameters that define memory allocation on a given processor rather than in a "global" sense. The values of each parameter are compared across processors, and the largest values that result from this comparison are the values written out to the temporary files used for run time memory allocation by the executable code.
-
Appropriate MPI calls were placed at various locations in the source code. Places where these calls are required are at any point where block to block communication/coordination is required, (e.g. I/O, cut conditions, residual error output, minimum global time step calculation, minimum distance to walls, etc.) MPI calls were also placed at the beginning of the main program elements (VULCAN.f, PREPRC.f, etc.) to set up various global MPI parameters used to establish communication between grid blocks. A description of each parallel modification can be found in the source code by searching for the string ROB-MPI.
-
Restart files were modified to allow the I/O to be done in parallel. This required rather extensive modifications to convert the flow data (_F) files from a region by region basis to a block by block basis. The original (serial) format can still be read in if the user specifies the option 'SERIAL INPUT RESTART FORMAT' in the VULCAN input deck. This was done to maintain backwards compatibility.
The following input parameters were added to the VULCAN input file to control the parallel version of the code.
-
'PROCESSORS' X.X : Defines the number of processors (X.X) to be used in the simulation. It is possible to have more processors than blocks (this may occur for instance in multi-region simulations).
-
'MESSAGE MODE' X.X : Defines the message passing mode. A value of 0.0 will invoke the standard blocking MPI sends and receives. A value of 1.0 will invoke a buffered send with standard receive strategy. Typically, the standard mode will work best (fastest) for parallel_mpi (true MPI) installs, while the buffered mode tends to give better performance for non-proprietary MPI (e.g. MPICH) installs. The buffered send option is available only when all regions in the simulation are elliptic. The code will invoke standard sends and receives if a space marching region exists.
-
'SERIAL INPUT RESTART FORMAT' : Designates that the input restart format is that of the serial version of the code (numbered by regions rather than blocks). If this option is specified, then the output restart files must have a different name than the input restart files.