1572477 : Detecting preemption

Created: 2026-01-30T09:31:22Z - current status: new

Hello Philipp,

Here’s a summarized and anonymized version of your query:


Summary of the Issue

A user wants to make their job scripts more resilient to preemption (e.g., due to higher-priority jobs) by implementing a mechanism to restart the job automatically if it is terminated via SIGTERM. However, they face two challenges:

  1. Requeue is not always possible (e.g., on certain partitions where --requeue is disabled).
  2. SIGTERM from preemption cannot be distinguished from manual cancellation (e.g., via scancel), as both trigger the same signal, risking unintended restarts.


Possible Solution

To avoid an endless restart loop when manually canceling a job, you can use the following approaches:

1. Check the Reason for Termination

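The SIGTERM itself does not say who sent it, but Slurm tracks the state of the job, and the handler can query that state to distinguish between:

  • Preemption (e.g., due to higher-priority jobs).
  • Manual cancellation (e.g., via scancel).

Note that SLURM_JOB_EXIT_CODE is only available to epilog scripts, not inside the running job script, so the handler has to ask the controller (scontrol show job) instead.

Example in a job script:

term_handler() {
    # Query the controller for the current job state.
    # Assumption: after scancel the controller reports JobState=CANCELLED, while a
    # job in its preemption grace period is still RUNNING. Verify this on your cluster.
    local state
    state=$(scontrol show job "$SLURM_JOB_ID" | grep -o 'JobState=[A-Z_]*' | cut -d= -f2)
    if [[ "$state" == "CANCELLED" ]]; then
        echo "Job was manually canceled. Exiting without restart."
        exit 0
    fi

    # Otherwise, assume preemption and ask Slurm to requeue the job
    echo "Job was preempted. Attempting restart..."
    scontrol requeue "$SLURM_JOB_ID"
}

trap 'term_handler' TERM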
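
One caveat with the trap above: the shell runs a trap handler only after the current foreground command has finished, so a long-running workload can delay the handler until after the grace period. A minimal sketch of the usual workaround, starting the workload in the background and blocking in wait, could look like this. The names run_with_trap, on_term, got_term, and child_pid are illustrative and not part of Slurm.

```shell
#!/bin/bash
# Sketch only: run_with_trap, on_term, got_term and child_pid are
# illustrative names, not Slurm features.

got_term=0
child_pid=""

on_term() {
    got_term=1
    # Forward SIGTERM so the workload can checkpoint or shut down.
    if [ -n "$child_pid" ]; then
        kill -TERM "$child_pid" 2>/dev/null || true
    fi
}

run_with_trap() {
    trap on_term TERM
    "$@" &                      # start the workload in the background
    child_pid=$!
    status=0
    # wait is interrupted (status > 128) as soon as a trapped signal arrives
    wait "$child_pid" || status=$?
    if [ "$got_term" -eq 1 ]; then
        # Reap the workload after the signal has been forwarded to it
        wait "$child_pid" 2>/dev/null || true
    fi
    return "$status"
}
```

In a job script, the workload would then be started as run_with_trap followed by the actual command, and the restart decision can be made afterwards based on got_term.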

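2. Do Not Rely on --no-requeue for Manual Cancellation

scancel has no --no-requeue option. --requeue and --no-requeue are sbatch options that set whether the job is eligible for requeueing at submission time. To keep a manual cancellation final, the handler itself has to recognize the cancellation before it attempts a restart.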

3. Limit Restart Attempts

To prevent infinite loops, track the number of restarts (e.g., via Restarts in scontrol show job) and stop after a threshold:

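max_restarts=3
# Read the restart counter from the job record; [0-9]* also matches multi-digit values
restarts=$(scontrol show job "$SLURM_JOB_ID" | grep -o 'Restarts=[0-9]*' | cut -d= -f2)

# Treat a missing value as 0 so the comparison cannot fail on an empty string
if [[ ${restarts:-0} -ge $max_restarts ]]; then
    echo "Max restarts reached. Exiting."
    exit 0
fi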
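
Slurm also exports the restart count to the batch job as the environment variable SLURM_RESTART_COUNT, which avoids parsing scontrol output. The following is a minimal sketch under the assumption that the variable may be unset on the first run; may_restart and MAX_RESTARTS are hypothetical names chosen for this example.

```shell
# Sketch only: may_restart and MAX_RESTARTS are illustrative names.
MAX_RESTARTS=${MAX_RESTARTS:-3}

may_restart() {
    # SLURM_RESTART_COUNT may be unset before the first requeue, so default to 0.
    count=${SLURM_RESTART_COUNT:-0}
    [ "$count" -lt "$MAX_RESTARTS" ]
}
```

The handler could then call may_restart before requeueing and exit without a restart when it returns a non-zero status.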

4. Alternative: Use sacct to Check Job State

Query the job’s history to see if it was canceled manually:

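# -X limits the output to the job allocation (no steps); --parsable2 avoids the
# fixed-width column that would truncate "CANCELLED by <uid>" to "CANCELLED+"
job_state=$(sacct -j "$SLURM_JOB_ID" -X --noheader --parsable2 --format=State)
if [[ "$job_state" == CANCELLED* ]]; then
    echo "Job was canceled. Exiting."
    exit 0
fi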
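
Because sacct reports a manual cancellation as "CANCELLED by <uid>" and a preemption as PREEMPTED, the state string is best matched by prefix. A small sketch of such a classification follows; classify_state is a hypothetical helper name, and the accounting database may lag slightly behind the controller, so the result should be treated as a hint.

```shell
# Sketch only: classify_state is an illustrative helper, not a Slurm command.
# It maps a sacct State string to one of: cancelled, preempted, other.
classify_state() {
    case "$1" in
        CANCELLED*) echo cancelled ;;   # also matches "CANCELLED by <uid>"
        PREEMPTED*) echo preempted ;;
        *)          echo other ;;
    esac
}

# Usage inside a job (requires job accounting to be enabled):
#   state=$(sacct -j "$SLURM_JOB_ID" -X --noheader --parsable2 --format=State)
#   classify_state "$state"
```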

Key Takeaways

  • Preemption vs. manual cancellation: Use the job state reported by scontrol or sacct to distinguish between the two; the job's exit code is not available inside the running job script.
  • Avoid loops: Limit restart attempts and check for manual cancellation before requeueing.
  • Test thoroughly: Ensure your script behaves as expected in both scenarios.
