A computational problem is traditionally solved step by step by a single processor; in this case we say that the problem is solved sequentially. In contrast, when a set of processors is connected (e.g., through a network) in order to solve a problem collectively, we speak of distributed or parallel computation. The development of a program that operates under distributed computation comprises three basic steps.
The first one consists in a detailed examination of the problem at hand,
searching for and separating those (generally small) parts suitable for
simultaneous processing. The specific way this is worked out is
intimately related not only to the physical problem to be solved, but
also to the kind of numerical algorithms to be used. In the case of an
N-body problem, the splitting may be made quite simply. Each
processor is given the data of the entire set of particles, so that each of
them can generate the whole tree structure. However, each processor computes
the accelerations and updates the positions and velocities of only a
subset of particles. At the end
of each time-step, the processors interchange information, receive the
new data of all the particles, construct the new tree structure, and
continue the integration of the fresh set of particles each one was
assigned to work with. This parallelization scheme may not seem optimal
(indeed, it is not); however, the CPU time spent generating the whole tree
anew is a small fraction of the CPU time invested in the computation of the
accelerations. For example, for the particle numbers used in our runs,
this fraction is approximately 1/120. It is worth noting that, if a
leap-frog integration algorithm is used, the time
needed for updating the positions and velocities is also negligible with
respect to the computation of the accelerations.
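The text above does not give the exact assignment rule; as a minimal sketch, assuming a simple contiguous-block split of the particles among the processors (the names split_range, first and count are ours, not taken from poctgrav), the division could be made as follows:

/* Hypothetical sketch: assign a contiguous block of particles to each
   processor.  Every processor stores all the particles (so it can build
   the full tree), but computes accelerations only for the range
   [first, first+count). */
void split_range(int nbodies, int nproc, int rank, int *first, int *count)
{
    int base = nbodies / nproc;   /* minimum particles per processor */
    int rest = nbodies % nproc;   /* leftover particles              */
    *count = base + (rank < rest ? 1 : 0);
    *first = rank*base + (rank < rest ? rank : rest);
}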
Besides that, the environment in which our processors work is far from ideal: a few ix86 PCs, not dedicated exclusively to the integration, connected by a 10 Mbit/s Ethernet network shared with more than 100 other PCs. Therefore, the load on each machine shifts continuously. Unfortunately, there is no easy way to improve the distribution of tasks under such circumstances; refined algorithms that try to balance the load based solely on the dynamics of the problem will not improve the overall performance. To make the best of the situation, our program balances the distribution of particles dynamically, based on the actual load of each processor, aiming at an even distribution of real (wall-clock) times, as opposed to CPU times, on each one.
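The balancing criterion just described might be sketched as follows; this is only an illustration under our own assumptions (the new share is made proportional to the number of particles each processor managed to integrate per second of real time during the previous step; the names rebalance, wallclock, oldcount and newcount are hypothetical), not the actual routine of poctgrav:

/* Hypothetical sketch of wall-clock based balancing: processors that
   were slower during the last step receive proportionally fewer
   particles for the next one. */
void rebalance(int nbodies, int nproc, const double *wallclock,
               const int *oldcount, int *newcount)
{
    double total = 0.0;
    int i, assigned = 0;

    /* "speed" of each processor: particles integrated per second of
       real time during the previous step */
    for (i = 0; i < nproc; i++)
        total += oldcount[i] / wallclock[i];

    /* new share proportional to that speed */
    for (i = 0; i < nproc; i++) {
        newcount[i] = (int)(nbodies * (oldcount[i] / wallclock[i]) / total);
        assigned += newcount[i];
    }
    /* rounding remainder goes to the first processor */
    newcount[0] += nbodies - assigned;
}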
The second step in developing a parallel program is the choice of the programming methodology, which in turn determines the logical structure of the program. There are two basic models: the master-slave and the fork-join schemes (Geist et al. [1994]). In the first one, a master program supervises the running of a group of slave programs and also controls their interchange of information; the slaves do the actual computation. Thus, two essentially distinct programs must be maintained. The fork-join scheme, on the other hand, makes use of a single program, replicated on each processor. Depending on how it was first called, this program acts as the parent program on one of the processors, or as a child on the others. The children receive data from the parent and compute; the parent not only distributes tasks, but may also compute if desired. We have chosen the latter alternative, mainly because it allows us to maintain and upgrade the software more efficiently than the former method.
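As an aside, a fork-join task can also find out its role directly from PVM: pvm_parent() returns PvmNoParent for the task that was started by hand, and the parent's task identifier for the spawned copies. The snippet below is only an illustration of the general idea; as the skeleton further on shows, poctgrav uses a group identifier for the same purpose.

#include <pvm3.h>

/* Illustration only: distinguish the parent from the children in a
   fork-join program by asking PVM who spawned this task. */
int i_am_parent(void)
{
    return (pvm_parent() == PvmNoParent);
}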
The third step is to select the tool with which the processors will communicate with one another. At present, there are two main paradigms: the Parallel Virtual Machine (PVM) and the Message Passing Interface (MPI) (Geist et al. [1996]). Although MPI is recognized as a future standard, at the time our program started to be assembled there were several versions of it at hand, differing appreciably in installation procedures and operation. The PVM, on the other hand, has had a single version, permanently maintained and upgraded up to now. Moreover, its operation is simple and flexible, allowing a more comfortable implementation and use. Thus the PVM was our choice.
Here we describe the logical structure of our program poctgrav, pointing out the aspects related to its parallel features.
When running on one processor only, poctgrav works as a normal sequential program. However, even in this case the PVM software must be properly installed for the program to run, because the user might want to add other processors later.
When running on more than one processor, poctgrav is launched from one of them: the mainhost. This task, having contacted the PVM daemon, learns that it is the parent. Once the parent is running, the PVM checks for the presence of every other processor requested by the parent and launches on each of them a previously stored copy of poctgrav, i.e., the children. Should a processor fail to be contacted at this starting phase, the PVM issues a warning and everything is stopped. Next, each child communicates with its local PVM daemon to learn whether it is the parent or a child (finding out, of course, that it is a child). Thus, every task knows whether it has to give orders, or to wait for and execute orders. Once the integration has begun, the failure of a processor other than the parent (e.g., a machine breaking down) does not stop the program. Instead, poctgrav resumes execution from the last time step, taking into account which child was lost.
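The routine activate_hosts_and_task used in the skeleton below is not listed here; a hedged sketch of what the start-up of the children could look like with PVM (the number of children, the host list and the error handling are our assumptions) is:

#include <stdio.h>
#include <pvm3.h>

#define NCHILD 4                     /* hypothetical number of children */

int child_tid[NCHILD];

/* Sketch only: spawn one copy of poctgrav on each requested host, and
   ask PVM to notify the parent if any of them exits unexpectedly, so
   that the integration can be resumed without the lost processor. */
int start_children(char *hosts[])
{
    int i, started = 0;

    for (i = 0; i < NCHILD; i++) {
        /* PvmTaskHost: run the task on the named host */
        if (pvm_spawn("poctgrav", NULL, PvmTaskHost,
                      hosts[i], 1, &child_tid[started]) == 1)
            started++;
        else
            fprintf(stderr, "could not start a child on %s\n", hosts[i]);
    }
    /* message tag 99 is arbitrary */
    pvm_notify(PvmTaskExit, 99, started, child_tid);
    return started;
}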
The following is a pseudo-code skeleton of the main routine of the program.
int main(int argc, char* argv[])
{
    /*-------- Initialisation --------*/
    /* Contact the local PVM (mytid is an internal variable) */
    mytid = pvm_mytid();
    /* Stop if the PVM is not running */
    if (mytid < 0) return -1;
    /* Learn whether it is the parent (mygid=0) or a child (mygid>0).
       The string GROUPNAME is defined by poctgrav, and it is the same
       for all the tasks involved. */
    mygid = pvm_joingroup(GROUPNAME);
    /* if this task is the parent */
    if (mygid == 0) {
        /* Read control parameters for the simulation.  Also, instruct
           the task whether it is a parallel run and which processors
           will be its children. */
        read_params();
        /* Verify that a PVM daemon is alive on each host.  Start any
           which is not.  Then start the children processes. */
        activate_hosts_and_task();
        /* Read initial conditions. */
        read_bodies();
        /* If in parallel mode, transmit data to the children. */
        if (childrenactive) {
            /* control parameters */
            send_params();
            /* initial conditions */
            send_bodies();
        }
    }
    /* if this task is a child */
    else {
        /* receive control parameters */
        receive_params();
        /* receive initial conditions */
        receive_bodies();
    }
    /*-------- End initialisation --------*/

    /*-- Begin integration of the equations of motion --*/
    /* Create and go over the tree; compute potential and acceleration
       only for the particles assigned to this processor. */
    calculate_phi();
    /* If operating in parallel mode, interchange data */
    if (childrenactive) {
        /* if parent... */
        if (mygid == 0)
            /* ...collect information. */
            receive_bodies();
        /* if child... */
        else
            /* ...send information. */
            send_bodies();
    }
    /* Print data */
    if (mygid == 0) print_statistic();
    /* Do the actual integration */
    leapfrog();
}
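The routines send_bodies and receive_bodies appear above only by name; a rough guess at their structure, assuming a simple particle record and PVM packing (the struct body, the tag TAG_SYNC and the suffix _sketch are ours, not taken from poctgrav), could be:

#include <pvm3.h>

#define TAG_SYNC 20                      /* hypothetical message tag */

struct body {                            /* hypothetical particle record */
    double pos[3], vel[3], mass;
};

/* Sketch: pack the locally updated particles and send them to desttid. */
void send_bodies_sketch(int desttid, struct body *b, int first, int count)
{
    int i;
    pvm_initsend(PvmDataDefault);        /* portable XDR encoding */
    pvm_pkint(&first, 1, 1);
    pvm_pkint(&count, 1, 1);
    for (i = first; i < first + count; i++) {
        pvm_pkdouble(b[i].pos, 3, 1);
        pvm_pkdouble(b[i].vel, 3, 1);
        pvm_pkdouble(&b[i].mass, 1, 1);
    }
    pvm_send(desttid, TAG_SYNC);
}

/* Sketch: receive such a block from any task and store it in place. */
void receive_bodies_sketch(struct body *b)
{
    int i, first, count;
    pvm_recv(-1, TAG_SYNC);              /* -1 = from any sender */
    pvm_upkint(&first, 1, 1);
    pvm_upkint(&count, 1, 1);
    for (i = first; i < first + count; i++) {
        pvm_upkdouble(b[i].pos, 3, 1);
        pvm_upkdouble(b[i].vel, 3, 1);
        pvm_upkdouble(&b[i].mass, 1, 1);
    }
}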
The leapfrog routine deserves a closer look:
int leapfrog()
{
    time = initialtime;
    /* Proceed until the integration is complete */
    while (time <= finaltime) {
        /* Update positions and velocities.  If in parallel mode, do it
           only for the local particles. */
        new_vel(deltatime/2);
        new_pos(deltatime);
        /* Each processor needs to know the new coordinates of all the
           particles, in order to update its own tree. */
        if (childrenactive) {
            /* The parent... */
            if (mygid == 0) {
                /* ...receives the lists from each child, */
                receive_bodies();
                /* and sends the complete list to every child. */
                send_bodies();
                /* Here is where the bodies are redistributed according
                   to the actual load on each processor. */
            }
            /* Each child... */
            else {
                /* ...sends its list to the parent, */
                send_bodies();
                /* and receives the complete list. */
                receive_bodies();
            }
        }
        /* Rebuild the tree, and compute potential and acceleration for
           the local particles. */
        calculate_phi();
        /* Synchronize positions and velocities.  The parent does all
           the statistics, so interchange information for this purpose. */
        new_vel(deltatime/2);
        if (childrenactive) {
            if (mygid == 0)
                receive_bodies();
            else
                send_bodies();
        }
        /* Print statistics and update time */
        if (mygid == 0) print_statistic();
        time += deltatime;
    }
    return 0;
}
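For completeness, the structure of the loop above is the kick-drift-kick form of the leap-frog scheme,

\[
\mathbf{v}_{n+1/2} = \mathbf{v}_n + \frac{\Delta t}{2}\,\mathbf{a}_n, \qquad
\mathbf{r}_{n+1} = \mathbf{r}_n + \Delta t\,\mathbf{v}_{n+1/2}, \qquad
\mathbf{v}_{n+1} = \mathbf{v}_{n+1/2} + \frac{\Delta t}{2}\,\mathbf{a}_{n+1},
\]

where the accelerations $\mathbf{a}_{n+1}$ are those recomputed by calculate_phi between the two half-step velocity updates.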