1572477 : Detecting preemption
Created: 2026-01-30T09:31:22Z - current status: new
Hello Philipp,
Here’s a summarized and anonymized version of your query:
Summary of the Issue
A user wants to make their job scripts more resilient to preemption (e.g., due to higher-priority jobs) by implementing a mechanism to restart the job automatically if terminated via SIGTERM. However, they face two challenges:
1. Requeue is not always possible (e.g., on certain partitions where --requeue is disabled).
2. Distinguishing between SIGTERM from preemption vs. manual cancellation (e.g., via scancel), as both trigger the same signal, risking unintended restarts.
Possible Solution
To avoid an endless restart loop when manually canceling a job, you can use the following approaches:
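1. Check the Reason for Termination
Slurm records why a job ended, but little of that is visible while the job script is still running: there is no exit code yet, and SLURM_JOB_EXIT_CODE is only set for epilog scripts, not inside the job. Within a SIGTERM handler you can instead query the current job state with scontrol and use it to distinguish between:
- Preemption (e.g., due to higher-priority jobs).
- Manual cancellation (e.g., via scancel).
Example in a job script:
term_handler() {
    # Assumption: during the preemption grace time the job is still reported
    # as RUNNING, while after scancel it is already CANCELLED or COMPLETING.
    # Verify this on your cluster before relying on it.
    state=$(scontrol show job "$SLURM_JOB_ID" | grep -o 'JobState=[A-Z_]*' | cut -d= -f2)
    if [[ "$state" != "RUNNING" ]]; then
        echo "Job was manually canceled (state: ${state:-unknown}). Exiting without restart."
        exit 0
    fi
    # Otherwise, assume preemption and restart (only works if requeue is allowed)
    echo "Job was preempted. Attempting restart..."
    scontrol requeuehold "$SLURM_JOB_ID"
    scontrol release "$SLURM_JOB_ID"
}
trap 'term_handler' TERM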
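One practical detail when relying on a TERM trap: the shell runs a trap only after the current foreground command returns, so a long-running workload is usually started in the background and awaited with wait, which returns early when a trapped signal arrives. The following is a minimal sketch of that pattern, not a definitive implementation; run_with_term_trap, on_term, and TERM_SEEN are hypothetical names introduced for illustration.

```shell
#!/usr/bin/env bash
# Sketch: run a workload so that a TERM trap fires promptly.
# run_with_term_trap, on_term and TERM_SEEN are hypothetical names.

TERM_SEEN=0

on_term() {
    # Only record the signal here; decide about restarting afterwards.
    TERM_SEEN=1
}

run_with_term_trap() {
    trap 'on_term' TERM
    "$@" &                  # start the workload in the background
    workload_pid=$!
    status=0
    # wait returns early with a status above 128 when TERM is trapped
    wait "$workload_pid" || status=$?
    if [ "$TERM_SEEN" -eq 1 ]; then
        # The workload may still be running: stop it and reap it.
        kill "$workload_pid" 2>/dev/null || true
        wait "$workload_pid" 2>/dev/null || true
    fi
    return "$status"
}
```

In a job script this would be called as, for example, run_with_term_trap srun ./my_program (a placeholder command), followed by a check of TERM_SEEN to decide whether to requeue.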
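2. Use --no-requeue to Rule Out Requeueing
--no-requeue is an sbatch option, not an scancel option: a job submitted with sbatch --no-requeue is never requeued by Slurm. Slurm also does not requeue a job that was ended with scancel, so a manual cancellation stays final unless the script's own handler requeues or resubmits the job.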
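Before attempting a requeue, a script can check whether the job is requeueable at all. The sketch below assumes that scontrol show job prints a Requeue=0 or Requeue=1 field; job_is_requeueable is a hypothetical helper that reads that output on stdin, so it can be tried without a Slurm controller.

```shell
#!/usr/bin/env bash
# Sketch: read the Requeue flag from `scontrol show job` output.
# job_is_requeueable is a hypothetical helper name.

job_is_requeueable() {
    # Extract the number after "Requeue=" (1 = requeue allowed, 0 = not).
    flag=$(grep -o 'Requeue=[0-9]*' | head -n 1 | cut -d= -f2)
    # A missing field is treated as "not requeueable".
    [ "${flag:-0}" -eq 1 ]
}

# Intended use inside a job script (requires Slurm):
#   if scontrol show job "$SLURM_JOB_ID" | job_is_requeueable; then
#       scontrol requeue "$SLURM_JOB_ID"
#   fi
```

Treating a missing field as "not requeueable" is a deliberate choice here: when in doubt, the job ends instead of restarting.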
3. Limit Restart Attempts
To prevent infinite loops, track the number of restarts (e.g., via Restarts in scontrol show job) and stop after a threshold:
max_restarts=3
# Match the whole number after "Restarts=", not only its first digit
restarts=$(scontrol show job "$SLURM_JOB_ID" | grep -o 'Restarts=[0-9]*' | cut -d= -f2)
# Treat a missing value as 0 so the comparison never fails on an empty string
if [[ ${restarts:-0} -ge $max_restarts ]]; then
    echo "Max restarts reached. Exiting."
    exit 0
fi
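The same limit can be checked without parsing scontrol output. This sketch assumes the SLURM_RESTART_COUNT environment variable, which sbatch sets for requeued or restarted jobs and which is unset on the first run; restarts_left is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: restart budget based on SLURM_RESTART_COUNT.
# restarts_left is a hypothetical helper name.

restarts_left() {
    # $1 = maximum number of restarts allowed.
    # SLURM_RESTART_COUNT is unset on the first run, so default to 0.
    [ "${SLURM_RESTART_COUNT:-0}" -lt "$1" ]
}

# Intended use inside a job script:
#   if ! restarts_left 3; then
#       echo "Max restarts reached. Exiting."
#       exit 0
#   fi
```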
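4. Alternative: Use sacct to Check Job State
Query the job's accounting record to see if it was canceled manually. sacct reports such jobs as "CANCELLED by" followed by the user ID, so match on the prefix rather than on the exact word. The record may not be updated until the job has ended, so this check is most reliable when run afterwards, for example from a wrapper or a follow-up job:
# -X limits the output to the job allocation; --parsable2 avoids padded columns
job_state=$(sacct -j "$SLURM_JOB_ID" -X --format=State --noheader --parsable2 | head -n1)
if [[ "$job_state" == CANCELLED* ]]; then
    echo "Job was canceled. Exiting."
    exit 0
fi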
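To act on more than one outcome, the state string can be mapped to a small set of categories. A minimal sketch, assuming sacct reports PREEMPTED for preempted jobs and a "CANCELLED by" prefix for canceled ones; classify_job_state is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: map a sacct State string to a coarse category.
# classify_job_state is a hypothetical helper name.

classify_job_state() {
    # $1 = State string as printed by sacct, e.g. "CANCELLED by 1000".
    case "$1" in
        CANCELLED*) echo "cancelled" ;;
        PREEMPTED*) echo "preempted" ;;
        *)          echo "other" ;;
    esac
}

# Intended use (requires Slurm accounting):
#   state=$(sacct -j "$SLURM_JOB_ID" -X -n -P -o State | head -n 1)
#   classify_job_state "$state"
```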
Key Takeaways
- Preemption vs. manual cancellation: Use job metadata (exit codes, sacct, or scontrol) to distinguish between the two.
- Avoid loops: Limit restart attempts and respect --no-requeue flags.
- Test thoroughly: Ensure your script behaves as expected in both scenarios.