Corefile Resume
The execution of a G95-compiled program can be suspended and resumed.
If you interrupt a program by sending it the QUIT signal, which is usually bound to control-backslash, the program will write an executable file named 'dump' to the current directory.
Running this file resumes execution of your program from the point at which the dump was written.
andy@fulcrum:~/g95/g95 % cat tst.f90
b = 0.0
do i=1, 10
   do j=1, 3000000
      call random_number(a)
      a = 2.0*a - 1.0
      b = b + sin(sin(sin(a)))
   enddo
   print *, i, b
enddo
end
andy@fulcrum:~/g95/g95 % g95 tst.f90
andy@fulcrum:~/g95/g95 % a.out
 1 70.01749
 2 830.63153
 3 987.717
 4 316.48703
 5 -426.53815
 6 25.407673
(control-\ hit)
Process dumped
 7 -694.2718
 8 -425.95465
 9 -413.81763
 10 -882.66223
andy@fulcrum:~/g95/g95 % ./dump
Restarting ............Jumping
 7 -694.2718
 8 -425.95465
 9 -413.81763
 10 -882.66223
andy@fulcrum:~/g95/g95 %
Any files that were open when the dump was written must be present and in the same places as in the original process. If the program is linked against code written in other languages, resuming may not work.
While the main use is allowing you to preserve the state of a run across a reboot, other possibilities include pushing a long job through a short queue or moving a running process to another machine.
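For the short-queue case, each job submission has to decide whether to start the program from the beginning or to continue from an earlier dump. The following is a minimal shell sketch of that decision, not something provided by G95. It assumes the program is called a.out and that the checkpoint is the executable file 'dump' in the same directory; the function name pick_command is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the command a queue job should run.
# Assumes the checkpoint is the executable file 'dump' in directory $1
# (default: current directory) and the original program is 'a.out' there.
pick_command() {
    dir=${1:-.}
    if [ -x "$dir/dump" ]; then
        echo "$dir/dump"      # a checkpoint exists: resume from it
    else
        echo "$dir/a.out"     # no checkpoint yet: start from the beginning
    fi
}
```

A job script could then finish with a line such as exec "$(pick_command .)", so that resubmitting the same script continues the run where the previous job left off.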
Automatic checkpointing of your program can be enabled by setting the environment variable G95_CHECKPOINT to the number of seconds to wait between dumps.
New checkpoint files overwrite old checkpoint files.
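Because each new checkpoint replaces the previous one, it can be worth keeping a copy of the last dump before starting a run. The sketch below is an illustration only: both function names are hypothetical, the backup name 'dump.prev' is an arbitrary choice, and the variable is set just for the command being launched rather than for the whole session.

```shell
#!/bin/sh
# Hypothetical helper: keep a copy of an existing checkpoint as 'dump.prev'
# so that the next automatic checkpoint does not destroy it.
save_previous_dump() {
    dir=${1:-.}
    if [ -f "$dir/dump" ]; then
        cp -p "$dir/dump" "$dir/dump.prev"
    fi
}

# Hypothetical helper: run a command with G95_CHECKPOINT set to the given
# number of seconds, leaving the caller's environment unchanged.
run_with_checkpoints() {
    seconds=$1
    shift
    env G95_CHECKPOINT="$seconds" "$@"
}
```

For example, save_previous_dump . followed by run_with_checkpoints 600 ./a.out would preserve any earlier dump and then ask the program to write a new one every ten minutes.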
A program can cause a checkpoint to be written with the following code:
subroutine dump
  interface
    function getpid() bind(c, name='getpid')
      integer :: getpid
    end function getpid
    ! Dummy arguments follow the order of the C routine: kill(pid, signal)
    subroutine kill(pid, signal) bind(c, name='kill')
      integer, value :: pid, signal
    end subroutine kill
  end interface
  call kill(getpid(), 3)   ! send SIGQUIT (signal 3) to this process
end subroutine dump