How to stop a parallel program?
Torbjorn
Registered: 11/03/06
Posts: 75
Tuesday, February 16 2010 @ 10:56 PM CET
Hello everybody!
I have discovered a problem with the parallel code as written. The main troublemaker here is the routine schsky0.F, but there may well be many other places where we are potentially in trouble.
The problem is that if one process gets into so much trouble that it has to stop, it calls stoprspt, which then calls mpi_finalize. BUT! The routine mpi_finalize does not kill all processes; it merely ends the current process gracefully. So if one process stops (for example because of a linear dependency of the basis in schsky0 at one k-point), the other processes will go on with whatever they are doing until the next collective call, and there they will wait forever for the dead process. Not good.
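To make the hang concrete, here is a minimal sketch in plain Python. It is only an analogy and does not use MPI: threads stand in for processes, a `threading.Barrier` stands in for the next collective call, and the function name `run_ranks` is made up for the illustration. The timeout exists only so that the demonstration terminates; in a real MPI run there is no such timeout and the surviving processes would wait indefinitely.

```python
import threading

def run_ranks(n_ranks, failing_rank, timeout=0.5):
    """Simulate n_ranks workers that all have to meet at one barrier.

    The worker whose index equals failing_rank returns early, like a
    process that shuts itself down alone, and never reaches the barrier.
    Pass failing_rank=None to let every worker reach the barrier.
    """
    barrier = threading.Barrier(n_ranks)
    outcome = {}

    def worker(rank):
        if rank == failing_rank:
            outcome[rank] = "stopped early"
            return
        try:
            # Stand-in for the next collective call.
            barrier.wait(timeout=timeout)
            outcome[rank] = "passed barrier"
        except threading.BrokenBarrierError:
            # Only reached because of the artificial timeout.
            outcome[rank] = "stuck at barrier"

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcome

if __name__ == "__main__":
    print(run_ranks(3, failing_rank=1))
```

With one worker missing, none of the remaining workers gets past the barrier, which is the situation described above.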
A quick thing that I thought of in the case of schsky0 is that a process that gets a linear dependency opens a new file with its process number in the name (cholesky_error.3 for process 3, and so on). That way we can at least detect what has happened, but it does not stop the program.
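The marker-file idea could look roughly like the following sketch, written in plain Python rather than in the Fortran of the actual code. Only the file name pattern cholesky_error.N comes from the post; the function names, the directory argument and the text written into the file are assumptions made for the illustration.

```python
import os
import tempfile

def report_cholesky_error(directory, rank, kpoint):
    """Write a marker file named cholesky_error.<rank> (hypothetical helper)."""
    path = os.path.join(directory, f"cholesky_error.{rank}")
    with open(path, "w") as fh:
        # The file content is an assumption; the post only specifies the name.
        fh.write(f"linear dependency in basis at k-point {kpoint}\n")
    return path

def find_failed_ranks(directory):
    """Return the sorted process numbers that left a marker file behind."""
    prefix = "cholesky_error."
    ranks = []
    for name in os.listdir(directory):
        suffix = name[len(prefix):]
        if name.startswith(prefix) and suffix.isdigit():
            ranks.append(int(suffix))
    return sorted(ranks)

if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    report_cholesky_error(workdir, rank=3, kpoint=7)
    print(find_failed_ranks(workdir))
```

As noted, this only records that something went wrong; the other processes still have to be stopped by some other means.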
Questions:
1. What is the good way to handle things like this?
2. Are we in trouble in many places? This problem is potentially there as soon as there is a call to stoprspt in some part where the processes actually perform different calculations.
/Torbjörn
Torbjorn
Registered: 11/03/06
Posts: 75
Thursday, February 18 2010 @ 01:07 PM CET
Followup on the previous problem.
I have now fixed the immediate problem with the eigensolver by removing the "local" calls to stoprspt and sending error flags from the Cholesky factorization and regular eigenvalue solvers back up in the call tree. Then the error flags for all k-points are collected in eigen.F, analyzed and if anything has gone wrong the program prints an error message and stops. So far so good.
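The pattern can be sketched as follows, in plain Python and without MPI. This is not the code in eigen.F: the flag values, the function name `summarize_errors` and the message texts are all hypothetical. In an MPI program the step that brings the flags from all processes together would be a collective operation such as MPI_Gather or MPI_Allreduce; here a list with one dictionary per process plays that role.

```python
# Hypothetical flag values; the real solvers may use different codes.
OK = 0
CHOLESKY_FAILED = 1
EIGENSOLVER_FAILED = 2

MESSAGES = {
    CHOLESKY_FAILED: "Cholesky factorization failed",
    EIGENSOLVER_FAILED: "eigenvalue solver failed",
}

def summarize_errors(per_rank_flags):
    """Turn per-process error flags into a list of error messages.

    per_rank_flags[r] maps each k-point handled by process r to the
    flag returned by its solver (OK means no error).
    """
    failures = []
    for rank, flags in enumerate(per_rank_flags):
        for kpoint, flag in flags.items():
            if flag != OK:
                failures.append((kpoint, rank, flag))
    failures.sort()
    return [
        f"k-point {kpoint} (rank {rank}): {MESSAGES.get(flag, 'unknown error')}"
        for kpoint, rank, flag in failures
    ]

if __name__ == "__main__":
    flags = [{1: OK, 2: CHOLESKY_FAILED}, {3: OK, 4: EIGENSOLVER_FAILED}]
    errors = summarize_errors(flags)
    if errors:
        # Every process sees the same summary, so all can stop together.
        raise SystemExit("\n".join(errors))
```

Because the decision to stop is taken after the flags have been collected, every process reaches the same conclusion, and the error message can name all failing k-points at once.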
Meanwhile, Patrik has worked out that the MPI routine that does stop everything is MPI_ABORT. This raises the question: should we change the MPI_FINALIZE in stoprspt to MPI_ABORT?
I suggest that we don't, and that we identify problematic places as they appear and construct solutions similar to the one described above for the eigensolver. It forces you to gather a lot more information before stopping for an error, so if we just output said information we will get a lot more informative error messages, especially for parallel runs.
I think that it is preferable to not have a nice utility routine that stops everything, because it makes it too easy to be lazy and use that instead of keeping track of your error messages...
Opinions anyone?
Torbjörn