Resume long-running jobs after interruptions
checkpoint-resume-long-job | skill | setup L3 | ★64
Tibsfox/gsd-skill-creator
What it does
Pauses and resumes long-running jobs without losing state.
Best for
Batch data processing, model training, or simulation, when the cost of a full re-run exceeds the overhead of checkpointing.
Inputs
- job state
- checkpoint interval
- failure handler
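One common starting point for the checkpoint interval is Young's first-order approximation, which balances time spent writing checkpoints against work lost on failure. The sketch below is illustrative only: the function name `young_interval` and both parameters are assumptions, not part of this skill, and the formula assumes the checkpoint write time is small compared with the mean time between failures.

```python
import math


def young_interval(checkpoint_cost_s, mtbf_s):
    """Approximate interval between checkpoints, in seconds.

    checkpoint_cost_s: time to write one checkpoint (assumed input).
    mtbf_s: mean time between failures of the job's environment (assumed input).
    """
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


# Example: a 30 s checkpoint and one failure per day on average
# suggests checkpointing roughly every 38 minutes.
interval_s = young_interval(30, 24 * 3600)
```

If the suggested interval is close to the total job length, checkpointing is unlikely to pay for itself.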
Outputs
- resumed computation
- final result
Requires
- filesystem or database for checkpoint storage
- job queue or background worker
Preconditions
Job state is serializable; work is idempotent at the resume boundary, so repeating the steps after the last checkpoint causes no duplicate effects.
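A minimal sketch of how these preconditions fit together, assuming a JSON-serializable state and a local filesystem. All names here (`save_checkpoint`, `load_checkpoint`, `run_job`, and the `fail_at` test hook) are hypothetical and not taken from the skill itself. The checkpoint is written to a temporary file and renamed into place, so an interruption during the write leaves the previous checkpoint intact.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: temp file in the same directory, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomic when both paths share a filesystem


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"next_index": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def run_job(items, path, checkpoint_interval=100, fail_at=None):
    """Sum items, saving a checkpoint every `checkpoint_interval` items.

    `fail_at` is a hypothetical hook that simulates an interruption.
    """
    state = load_checkpoint(path)
    for i in range(state["next_index"], len(items)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated interruption")
        state["total"] += items[i]
        state["next_index"] = i + 1
        if state["next_index"] % checkpoint_interval == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

Because the running total lives inside the checkpointed state, work done after the last checkpoint is discarded on resume and recomputed, which keeps the result identical to an uninterrupted run. Steps with external side effects (sending messages, writing to other systems) would need their own idempotency guard.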
Failure modes
Checkpoint overhead outweighs the time saved on resume; a corrupted checkpoint goes undetected; orphaned processes hold resources
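One way to make corruption detectable is to store a digest alongside the payload and verify it before resuming. The sketch below is an assumption about how that could look, not the skill's actual format; `encode_checkpoint` and `decode_checkpoint` are hypothetical names, and the layout (hex SHA-256 digest, newline, JSON payload) is chosen only for illustration.

```python
import hashlib
import json


def encode_checkpoint(state):
    """Serialize state as: hex SHA-256 digest, newline, JSON payload."""
    payload = json.dumps(state, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest().encode("ascii")
    return digest + b"\n" + payload


def decode_checkpoint(blob):
    """Return the state, or raise ValueError if the digest does not match."""
    digest, sep, payload = blob.partition(b"\n")
    actual = hashlib.sha256(payload).hexdigest().encode("ascii")
    if not sep or actual != digest:
        raise ValueError("checkpoint is corrupted or truncated")
    return json.loads(payload)
```

On a `ValueError`, a failure handler could fall back to an older checkpoint or restart from scratch rather than resume from bad state.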
Trust signals
- Checkpoint-restart is an established pattern in HPC, proven at petabyte scale
- The idempotency requirement prevents duplicate side effects on resume