Clean OpenMPI exit
Moved from SVN. Adrien Michel, Jan 16, 2019
When an exception is raised in only one MPI worker, the other workers are not killed. As a result, the application keeps running and ends up in a deadlock (waiting indefinitely on the dead worker). This consumes CPU hours, which is highly problematic on HPC facilities, where we pay for and/or only have a limited budget of core hours. It would be important to find a way to cleanly kill everything, which means catching all exceptions from SNOWPACK or MeteoIO (to keep this manageable, we should use very broad catch blocks).
Comment 1 by Adrien Michel, Nov 26, 2020
Actually this is caused by AlpineMain, which catches all exceptions and then calls exit(1). But exit() is apparently not noticed by the other MPI workers, so if only the master dies (e.g. due to an input reading error), the other workers keep waiting forever. The solution is to change all the exit(1) calls in AlpineMain to throw. The process then terminates with an uncaught exception (whose message is printed) and all the MPI workers are cleanly killed, which saves a lot of node hours on clusters... I'll commit the fix soon.
Status: Started
Comment 2 by Adrien Michel, Jan 6, 2021
The commit is now in the git version and will be moved to the SLF GitLab soon. However, with this implementation cleanDestroyAll() is not called on all the MPI instances; some more work is required for a truly clean implementation.