Welcome to RSPt Site
How to stop a parallel program?
 Tuesday, February 16 2010 @ 10:56 PM CET
Hello everybody!

I have discovered a problem with the parallel code as written. The main troublemaker here is the routine schsky0.F, but there may well be many other places where we are potentially in trouble.

The problem is that if one process gets into so much trouble that it has to stop, it calls stoprspt, which then calls mpi_finalize. BUT! The routine mpi_finalize does not kill all processes; it only shuts down MPI gracefully for the process that calls it. So if one process stops (for example because of a linear dependency of the basis in schsky0 at one k-point), the other processes will carry on with whatever they are doing until the next collective call, and there they will wait forever for the dead process. Not good.
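
To make the hang concrete, here is a small model of it in plain Python, with no MPI involved. It is only an analogy: threads stand in for MPI processes, a `threading.Barrier` stands in for the next collective call, and the names `run_workers` and `failing_rank` are made up for this sketch. A timeout is used so the demo terminates; a real collective call has no such timeout and would simply block.

```python
import threading


def run_workers(n_workers, failing_rank, timeout=0.5):
    """Model n_workers ranks meeting at one collective call (a barrier).

    The rank equal to failing_rank returns early, as if it had called
    stoprspt and mpi_finalize. Pass failing_rank=-1 for a run with no failure.
    Returns a dict mapping rank -> outcome string.
    """
    barrier = threading.Barrier(n_workers)
    outcome = {}
    lock = threading.Lock()

    def worker(rank):
        if rank == failing_rank:
            # This rank "stops gracefully" and never reaches the collective.
            result = "stopped early"
        else:
            try:
                # Stand-in for the next collective call.
                barrier.wait(timeout=timeout)
                result = "passed collective"
            except threading.BrokenBarrierError:
                # The missing rank never arrived; in real MPI there is no
                # timeout, so this wait would never return.
                result = "stuck at collective"
        with lock:
            outcome[rank] = result

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcome


if __name__ == "__main__":
    for rank, result in sorted(run_workers(4, failing_rank=3).items()):
        print(rank, result)
```

With four workers and rank 3 failing, the three surviving workers all end up stuck at the barrier, which is the situation described above.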

A quick fix that I thought of for schsky0 is that a process that hits a linear dependency creates a new file with its process number in the name (cholesky_error.3 for process 3, and so on). That way we can at least detect what has happened, but it does not stop the program.
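
The marker-file idea could look roughly like this, written in Python rather than in the Fortran of the actual code. Only the file name pattern `cholesky_error.<process number>` comes from the description above; the function names, the message text and the `directory` argument are assumptions made for this sketch.

```python
import glob
import os


def write_error_marker(rank, message, directory="."):
    """Create cholesky_error.<rank> so the failure is visible after the run."""
    path = os.path.join(directory, f"cholesky_error.{rank}")
    with open(path, "w") as fh:
        fh.write(message + "\n")
    return path


def find_error_markers(directory="."):
    """Return the sorted process numbers that left a marker file behind."""
    ranks = []
    for path in glob.glob(os.path.join(directory, "cholesky_error.*")):
        suffix = os.path.basename(path).rsplit(".", 1)[1]
        if suffix.isdigit():  # ignore unrelated files such as backups
            ranks.append(int(suffix))
    return sorted(ranks)
```

A wrapper script could call something like `find_error_markers` after the run, or while it is still hanging, to see which processes reported a linear dependency. As noted, this only detects the problem; it does nothing to stop the remaining processes.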

Questions:
1. What is a good way to handle things like this?
2. Are we in trouble in many places? This problem is potentially there as soon as there is a call to stoprspt in some part where the processes actually perform different calculations.

/Torbjörn

 Thursday, February 18 2010 @ 01:07 PM CET
Followup on the previous problem.

I have now fixed the immediate problem with the eigensolver by removing the "local" calls to stoprspt and passing error flags from the Cholesky factorization and the regular eigenvalue solvers back up the call tree. The error flags for all k-points are then collected in eigen.F and analyzed, and if anything has gone wrong the program prints an error message and stops. So far so good.
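
In outline, the flag handling works something like the following Python sketch. This is not the code in eigen.F: the names `merge_flags`, `failed_kpoints`, `check_or_stop` and `EigensolverError` are invented here, the convention that a flag of 0 means success is an assumption, and the list of per-process dictionaries stands in for whatever MPI communication actually gathers the flags.

```python
class EigensolverError(RuntimeError):
    """Raised once, after all flags are known, instead of stopping locally."""


def merge_flags(flags_per_process):
    """Combine the {k-point: flag} dicts reported by each process."""
    merged = {}
    for local_flags in flags_per_process:
        merged.update(local_flags)
    return merged


def failed_kpoints(flags):
    """Return the k-points whose flag is nonzero (assumed to mean failure)."""
    return sorted(k for k, flag in flags.items() if flag != 0)


def check_or_stop(flags_per_process):
    """Analyze all flags together and stop with one informative message."""
    flags = merge_flags(flags_per_process)
    failed = failed_kpoints(flags)
    if failed:
        details = ", ".join(f"k-point {k} (flag {flags[k]})" for k in failed)
        raise EigensolverError(f"eigensolver failed at: {details}")
    return flags
```

Because every process takes part in the collection and sees the same merged result, they all reach the same decision and stop together, and the message can list every failing k-point instead of only the first one that was hit.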

Meanwhile, Patrik has worked out that the MPI routine that does stop everything is MPI_ABORT. This raises the question: should we change the MPI_FINALIZE in stoprspt to MPI_ABORT?

I suggest that we don't, and that we instead identify problematic places as they appear and construct solutions similar to the one described above for the eigensolver. That approach forces you to gather a lot more information before stopping on an error, so if we simply output that information we will get much more informative error messages, especially for parallel runs.
I think that it is preferable to not have a nice utility routine that stops everything, because it makes it too easy to be lazy and use that instead of keeping track of your error messages...

Opinions anyone?

Torbjörn
