cyberneticlibrary

Resume long-running jobs after interruptions

checkpoint-resume-long-job  skill  setup  L364
Tibsfox/gsd-skill-creator
What it does

Pause and resume long-running jobs without losing state

Best for

Batch data processing, model training, or simulation where the cost of a full re-run exceeds the cumulative overhead of checkpointing.
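One common way to reason about this trade-off is Young's first-order approximation, which sets the checkpoint interval to sqrt(2 × checkpoint cost × mean time between failures). The sketch below is illustrative only: the function names and the example numbers are assumptions, not part of this skill.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the checkpoint interval, in seconds.

    checkpoint_cost_s: time to write one checkpoint (assumed constant).
    mtbf_s: mean time between failures of the job's environment.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def overhead_fraction(checkpoint_cost_s: float, interval_s: float) -> float:
    """Share of wall-clock time spent writing checkpoints (ignores failures)."""
    return checkpoint_cost_s / (interval_s + checkpoint_cost_s)


# Hypothetical numbers: a 2 s checkpoint and one failure every 100 s on average.
interval = young_interval(2.0, 100.0)
```

If the overhead fraction at the suggested interval is close to the fraction of work a failure would destroy, checkpointing is unlikely to pay off for that job.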

Inputs
  • job state
  • checkpoint interval
  • failure handler
Outputs
  • resumed computation
  • final result
Requires
  • filesystem or database for checkpoint storage
  • job queue or background worker
Preconditions

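Job state must be serializable, and work must be idempotent at the resume boundary: repeating the steps after the last checkpoint must not duplicate side effects.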
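Both preconditions can be seen in a minimal sketch, assuming a filesystem checkpoint stored as JSON. The names `save_checkpoint`, `load_checkpoint`, and `run_job`, the state layout, and the `fail_at` parameter (which only simulates an interruption) are hypothetical and not taken from this skill.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: temp file in the same directory, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # readers see either the old or the new file
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"next_index": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def run_job(items, path, interval=100, fail_at=None):
    """Sum items, checkpointing every `interval` items and resuming if possible."""
    state = load_checkpoint(path)
    for i in range(state["next_index"], len(items)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated interruption")
        state["total"] += items[i]
        state["next_index"] = i + 1
        if state["next_index"] % interval == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

Progress (`next_index`) and the partial result (`total`) are saved together, so a resumed run repeats only the items after the last checkpoint. That repetition is harmless here because adding to an in-memory total has no external side effect. A step that writes to a database or sends a message would need its own deduplication.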

Failure modes

Checkpoint overhead outweighs the time saved on restart; a corrupted checkpoint goes undetected; orphaned processes keep holding resources.
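The undetected-corruption case can be mitigated by storing a digest alongside the payload and verifying it on load. The following is a sketch under assumed conventions (a SHA-256 hex digest on the first line, a JSON payload after it). The names `encode_checkpoint`, `decode_checkpoint`, and `CorruptCheckpoint` are hypothetical.

```python
import hashlib
import json


class CorruptCheckpoint(Exception):
    """Raised when a checkpoint fails its integrity check."""


def encode_checkpoint(state) -> bytes:
    """Serialize state as '<sha256 hex>\\n<json payload>'."""
    payload = json.dumps(state, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest().encode("ascii")
    return digest + b"\n" + payload


def decode_checkpoint(blob: bytes):
    """Verify the digest before trusting the payload."""
    digest, sep, payload = blob.partition(b"\n")
    actual = hashlib.sha256(payload).hexdigest().encode("ascii")
    if not sep or actual != digest:
        raise CorruptCheckpoint("checksum mismatch; fall back to an older checkpoint")
    return json.loads(payload)
```

A failure handler can catch `CorruptCheckpoint` and fall back to the previous checkpoint or a clean start, rather than resuming from damaged state.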

Trust signals
  • Checkpoint-restart pattern used in HPC; proven at petabyte scale
  • Idempotency requirement prevents duplicate side effects