A computational problem is traditionally solved step by step by a single processor; in this case we say that the problem is solved sequentially. In contrast, when a set of processors is connected (e.g., through a network) in order to solve a problem collectively, we speak of distributed or parallel computation. The development of a program that operates under distributed computation comprises three basic steps.
The first one consists in a detailed examination of the problem at hand,
searching for and separating those (generally small) parts suitable for
simultaneous processing. The specific way this is worked out is
intimately related not only to the physical problem to be solved, but
also to the kind of numerical algorithms to be used. In the case of an
N-body problem, the splitting may be made quite simply. Each
processor is given the data of the entire set of particles, so that each of
them can generate the whole tree structure. However, each processor computes
the accelerations and updates the positions and velocities of only a
subset of particles. At the end
of each time-step, the processors interchange information, receive the
new data of all the particles, construct the new tree structure, and
continue the integration of the fresh set of particles each one was
assigned to work with. This parallelization scheme may not seem optimal
(indeed, it is not); however, the CPU time spent generating the whole tree
anew is a small fraction of the CPU time invested in the computation of the
accelerations. For example, for the particle numbers used in our runs,
this fraction is approximately 1/120. It is worth noting that, if a
leap-frog integration algorithm is used, the time
needed for updating the positions and velocities is also negligible with
respect to the computation of the accelerations.
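The text above does not give the exact assignment rule; as a minimal sketch, assuming a simple contiguous-block split of the particles among the processors (the names split_range, first and count are ours, not taken from poctgrav), the division could be made as follows:

/* Hypothetical sketch: assign a contiguous block of particles to each
   processor.  Every processor stores all the particles (so it can build
   the full tree), but computes accelerations only for the range
   [first, first+count). */
void split_range(int nbodies, int nproc, int rank, int *first, int *count)
{
    int base = nbodies / nproc;   /* minimum particles per processor */
    int rest = nbodies % nproc;   /* leftover particles              */
    *count = base + (rank < rest ? 1 : 0);
    *first = rank*base + (rank < rest ? rank : rest);
}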
Besides that, the environment in which our processors work is far from ideal: a few ix86 PCs, not dedicated exclusively to the integration, connected by a 10 Mbit/s Ethernet network shared with more than 100 other PCs. Therefore, the load on each machine shifts continuously. Unfortunately, there is no easy way to improve the distribution of tasks under such circumstances; refined algorithms that try to balance the load based solely on the dynamics of the problem will not improve the overall performance. To make the best of the situation, our program balances the distribution of particles dynamically, based on the actual load of each processor, aiming at an even distribution of real (wall-clock) times, as opposed to CPU times, on each one.
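The balancing criterion just described might be sketched as follows; this is only an illustration under our own assumptions (the new share is made proportional to the number of particles each processor managed to integrate per second of real time during the previous step; the names rebalance, wallclock, oldcount and newcount are hypothetical), not the actual routine of poctgrav:

/* Hypothetical sketch of wall-clock based balancing: processors that
   were slower during the last step receive proportionally fewer
   particles for the next one. */
void rebalance(int nbodies, int nproc, const double *wallclock,
               const int *oldcount, int *newcount)
{
    double total = 0.0;
    int i, assigned = 0;

    /* "speed" of each processor: particles integrated per second of
       real time during the previous step */
    for (i = 0; i < nproc; i++)
        total += oldcount[i] / wallclock[i];

    /* new share proportional to that speed */
    for (i = 0; i < nproc; i++) {
        newcount[i] = (int)(nbodies * (oldcount[i] / wallclock[i]) / total);
        assigned += newcount[i];
    }
    /* rounding remainder goes to the first processor */
    newcount[0] += nbodies - assigned;
}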
The second step in developing a parallel program is the choice of the programming methodology, which in turn determines the logical structure of the program. There are two basic models: the master-slave and the fork-join schemes (Geist et al. [1994]). In the first one, a master program supervises the running of a group of slave programs and also controls their interchange of information; the slaves do the actual computation. Thus, two essentially distinct programs must be maintained. The fork-join scheme, on the other hand, makes use of a single program, replicated on each processor. Depending on how it was first called, this program acts as the parent program on one of the processors, or as a child on the others. The children receive data from the parent and compute; the parent not only distributes tasks, but may also compute if desired. We have chosen the latter alternative, mainly because it allows us to maintain and upgrade the software more efficiently than the former method.
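As an aside, a fork-join task can also find out its role directly from PVM: pvm_parent() returns PvmNoParent for the task that was started by hand, and the parent's task identifier for the spawned copies. The snippet below is only an illustration of the general idea; as the skeleton further on shows, poctgrav uses a group identifier for the same purpose.

#include <pvm3.h>

/* Illustration only: distinguish the parent from the children in a
   fork-join program by asking PVM who spawned this task. */
int i_am_parent(void)
{
    return (pvm_parent() == PvmNoParent);
}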
The third step is to select the tool with which the processors will communicate with one another. At present, there are two main paradigms: the Parallel Virtual Machine (PVM) and the Message Passing Interface (MPI) (Geist et al. [1996]). Although MPI is recognized as a future standard, at the time our program started to be assembled there were several versions of it at hand, differing appreciably in installation procedures and operation. The PVM, on the other hand, has had a single version, permanently maintained and upgraded up to now. Moreover, its operation is simple and flexible, allowing a more comfortable implementation and use. Thus the PVM was our choice.
Here we describe the logical structure of our program poctgrav, pointing out the aspects related to its parallel features.
When running on one processor only, poctgrav works as a normal sequential program. However, even in this case the PVM software must be properly installed for the program to run, because the user might want to add other processors later.
When running on more than one processor, poctgrav is launched from one of them: the mainhost. This task, having contacted the PVM daemon, learns that it is the parent. Once the parent is running, the PVM checks for the presence of every other processor requested by the parent and launches on each of them a previously stored copy of poctgrav, i.e., the children. Should a processor fail to be contacted at this starting phase, the PVM issues a warning and everything is stopped. Next, each child communicates with its local PVM daemon to learn whether it is the parent or a child (finding out, of course, that it is a child). Thus, every task knows whether it has to give orders, or to wait for and execute orders. Once the integration has begun, the failure of a processor other than the parent (e.g., a machine breaking down) does not stop the program. Instead, poctgrav resumes execution from the last time step, taking into account which child was lost.
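The routine activate_hosts_and_task used in the skeleton below is not listed here; a hedged sketch of what the start-up of the children could look like with PVM (the number of children, the host list and the error handling are our assumptions) is:

#include <stdio.h>
#include <pvm3.h>

#define NCHILD 4                     /* hypothetical number of children */

int child_tid[NCHILD];

/* Sketch only: spawn one copy of poctgrav on each requested host, and
   ask PVM to notify the parent if any of them exits unexpectedly, so
   that the integration can be resumed without the lost processor. */
int start_children(char *hosts[])
{
    int i, started = 0;

    for (i = 0; i < NCHILD; i++) {
        /* PvmTaskHost: run the task on the named host */
        if (pvm_spawn("poctgrav", NULL, PvmTaskHost,
                      hosts[i], 1, &child_tid[started]) == 1)
            started++;
        else
            fprintf(stderr, "could not start a child on %s\n", hosts[i]);
    }
    /* message tag 99 is arbitrary */
    pvm_notify(PvmTaskExit, 99, started, child_tid);
    return started;
}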
The following is a pseudo-code skeleton of the main routine of the program.
int main(int argc, char* argv[])
{
    /*-------- Initialisation --------*/
    /* Contact the local PVM (mytid is an internal variable) */
    mytid = pvm_mytid();
    /* Stop if the PVM is not running */
    if (mytid < 0) return -1;
    /* Learn whether it is the parent (mygid=0) or a child (mygid>0).
       The string GROUPNAME is defined by poctgrav, and it is the same
       for all the tasks involved. */
    mygid = pvm_joingroup(GROUPNAME);
    /* if this task is the parent */
    if (mygid == 0) {
        /* Read control parameters for the simulation.  Also, instruct
           the task whether it is a parallel run and which processors
           will be its children. */
        read_params();
        /* Verify that a PVM daemon is alive on each host.  Start any
           which is not.  Then start the children processes. */
        activate_hosts_and_task();
        /* Read initial conditions. */
        read_bodies();
        /* If in parallel mode, transmit data to the children. */
        if (childrenactive) {
            /* control parameters */
            send_params();
            /* initial conditions */
            send_bodies();
        }
    }
    /* if this task is a child */
    else {
        /* receive control parameters */
        receive_params();
        /* receive initial conditions */
        receive_bodies();
    }
    /*-------- End initialisation --------*/

    /*-- Begin integration of the equations of motion --*/
    /* Create and go over the tree; compute potential and acceleration
       only for the particles assigned to this processor. */
    calculate_phi();
    /* If operating in parallel mode, interchange data */
    if (childrenactive) {
        /* if parent... */
        if (mygid == 0)
            /* ...collect information. */
            receive_bodies();
        /* if child... */
        else
            /* ...send information. */
            send_bodies();
    }
    /* Print data */
    if (mygid == 0) print_statistic();
    /* Do the actual integration */
    leapfrog();
}
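The routines send_bodies and receive_bodies appear above only by name; a rough guess at their structure, assuming a simple particle record and PVM packing (the struct body, the tag TAG_SYNC and the suffix _sketch are ours, not taken from poctgrav), could be:

#include <pvm3.h>

#define TAG_SYNC 20                      /* hypothetical message tag */

struct body {                            /* hypothetical particle record */
    double pos[3], vel[3], mass;
};

/* Sketch: pack the locally updated particles and send them to desttid. */
void send_bodies_sketch(int desttid, struct body *b, int first, int count)
{
    int i;
    pvm_initsend(PvmDataDefault);        /* portable XDR encoding */
    pvm_pkint(&first, 1, 1);
    pvm_pkint(&count, 1, 1);
    for (i = first; i < first + count; i++) {
        pvm_pkdouble(b[i].pos, 3, 1);
        pvm_pkdouble(b[i].vel, 3, 1);
        pvm_pkdouble(&b[i].mass, 1, 1);
    }
    pvm_send(desttid, TAG_SYNC);
}

/* Sketch: receive such a block from any task and store it in place. */
void receive_bodies_sketch(struct body *b)
{
    int i, first, count;
    pvm_recv(-1, TAG_SYNC);              /* -1 = from any sender */
    pvm_upkint(&first, 1, 1);
    pvm_upkint(&count, 1, 1);
    for (i = first; i < first + count; i++) {
        pvm_upkdouble(b[i].pos, 3, 1);
        pvm_upkdouble(b[i].vel, 3, 1);
        pvm_upkdouble(&b[i].mass, 1, 1);
    }
}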
The leapfrog routine deserves a closer look:
int leapfrog()
{
    time = initialtime;
    /* Proceed until the integration is complete */
    while (time <= finaltime) {
        /* Update positions and velocities.  If in parallel mode, do it
           only for the local particles. */
        new_vel(deltatime/2);
        new_pos(deltatime);
        /* Each processor needs to know the new coordinates of all the
           particles, in order to update its own tree. */
        if (childrenactive) {
            /* The parent... */
            if (mygid == 0) {
                /* ...receives the lists from each child, */
                receive_bodies();
                /* and sends the complete list to every child. */
                send_bodies();
                /* Here is where the bodies are redistributed according
                   to the actual load on each processor. */
            }
            /* Each child... */
            else {
                /* ...sends its list to the parent, */
                send_bodies();
                /* and receives the complete list. */
                receive_bodies();
            }
        }
        /* Rebuild the tree, and compute potential and acceleration for
           the local particles. */
        calculate_phi();
        /* Synchronize positions and velocities.  The parent does all
           the statistics, so interchange information for this purpose. */
        new_vel(deltatime/2);
        if (childrenactive) {
            if (mygid == 0)
                receive_bodies();
            else
                send_bodies();
        }
        /* Print statistics and update time */
        if (mygid == 0) print_statistic();
        time += deltatime;
    }
    return 0;
}
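For completeness, the structure of the loop above is the kick-drift-kick form of the leap-frog scheme,

\[
\mathbf{v}_{n+1/2} = \mathbf{v}_n + \frac{\Delta t}{2}\,\mathbf{a}_n, \qquad
\mathbf{r}_{n+1} = \mathbf{r}_n + \Delta t\,\mathbf{v}_{n+1/2}, \qquad
\mathbf{v}_{n+1} = \mathbf{v}_{n+1/2} + \frac{\Delta t}{2}\,\mathbf{a}_{n+1},
\]

where the accelerations $\mathbf{a}_{n+1}$ are those recomputed by calculate_phi between the two half-step velocity updates.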