1572632 : Jobs being preempted on Maxwell cluster (cfel queue)

Created: 2026-01-30T16:16:54Z - current status: new


Summary of Issue

A user reports recurring job preemptions on the Maxwell cluster, specifically in the cfel partition. Jobs start normally but are terminated before completion due to preemption, despite adhering to walltime and resource limits. The user seeks clarification on whether this behavior is expected or if adjustments to job configuration could mitigate it.

Key Observations:

  • Jobs were submitted to the cfel partition with flags like --no-requeue (preventing requeueing after preemption).
  • Preemption occurred at varying times (early or after several hours).
  • Error logs confirm cancellation due to preemption (e.g., DUE TO PREEMPTION).
  • No violations of walltime or memory limits were detected.
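
These observations can be checked against the job output files and the accounting records. The following is a minimal sketch, not a site-provided tool: the helper names preempted_logs and preempted_jobs_since and the log directory are hypothetical, and the sacct wrapper assumes that accounting records preempted jobs with the state PREEMPTED.

```shell
# List files under a directory that contain Slurm's preemption notice.
# "$1" is a hypothetical log directory, e.g. where slurm-<jobid>.out files land.
preempted_logs() {
    grep -rl -- "DUE TO PREEMPTION" "$1"
}

# Ask the accounting database for jobs that ended in state PREEMPTED
# since a given start date (format YYYY-MM-DD). Needs a node with Slurm clients.
preempted_jobs_since() {
    sacct -X --starttime "$1" --state=PREEMPTED \
          --format=JobID,JobName%20,Partition,State,Elapsed
}
```

For example, preempted_logs "$HOME/joblogs" prints the names of the log files that mention preemption, and preempted_jobs_since 2026-01-01 lists the matching jobs together with their partition and elapsed time.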

Analysis & Solution

Why Preemption Occurred:

  1. Partition-Specific Preemption Rules:
     • The cfel partition has preemption enabled with the cancel mode (not requeue).
     • Jobs in higher-priority partitions (cfel-cdi, cfel-cmi, cfel-ux) can terminate jobs in cfel (and allcpu/allgpu).
     • The --no-requeue flag in the submission script ensures preempted jobs are cancelled rather than requeued.

  2. Expected Behavior:
     • Preemption is intentional for the cfel partition to prioritize jobs in cfel-cdi, cfel-cmi, or cfel-ux.
     • This aligns with the cluster’s design to allocate resources dynamically based on partition priority.
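
The partition settings described above can be read from the scheduler itself. The sketch below assumes the one-record-per-line output of scontrol -o show partition and extracts only the PartitionName, PriorityTier, and PreemptMode fields. The helper name preempt_summary is hypothetical, and the values printed on Maxwell may differ from what this ticket describes.

```shell
# Summarize partition name, priority tier, and preemption mode.
# Reads `scontrol -o show partition` records (one partition per line) on stdin.
preempt_summary() {
    awk '{
        name = tier = mode = "?"          # "?" marks a field that was not found
        for (i = 1; i <= NF; i++) {
            split($i, kv, "=")            # each token looks like Key=Value
            if (kv[1] == "PartitionName") name = kv[2]
            else if (kv[1] == "PriorityTier") tier = kv[2]
            else if (kv[1] == "PreemptMode") mode = kv[2]
        }
        print name, tier, mode
    }'
}
```

Typical use on a login node would be scontrol -o show partition | preempt_summary. A partition with a higher PriorityTier can preempt jobs in a lower tier when preemption is based on partition priority.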

Recommended Actions:

  1. Check Partition Access:
     • Verify whether the user’s account/group has access to a higher-priority partition (e.g., cfel-cdi). If so, submit jobs there to avoid preemption.
     • Use sacctmgr show user [USERNAME] withassoc format=user,account,partition%30 to list the partitions tied to the user’s associations.

  2. Adjust Preemption Mode:
     • If requeueing is acceptable, remove --no-requeue so that preempted jobs can be requeued and restarted automatically.
     • Example: sbatch --requeue --partition=cfel job_script.sh

  3. Use Reservations for Critical Jobs:
     • For jobs requiring uninterrupted execution, request a reservation (or use a non-preemptable partition if available).
     • Reservations bypass preemption entirely.

  4. Monitor Higher-Priority Jobs:
     • Preemption is triggered by jobs in cfel-cdi, cfel-cmi, or cfel-ux. Check for pending or scheduled high-priority workloads in those partitions during the job’s runtime.

  5. Alternative Partitions:
     • If preemption is frequent, consider submitting to non-preemptable partitions (e.g., cfel-cdi if permitted) or partitions with longer walltimes.
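
If jobs are allowed to requeue, the job script has to cope with being restarted from the top. The following is a minimal sketch of a requeue-tolerant batch script, assuming the workload can be split into numbered steps and that preempted jobs in cfel are in fact requeued once --no-requeue is dropped. The checkpoint file, the step count, the walltime, and the function names are hypothetical placeholders.

```shell
#!/bin/bash
# Submit to the preemptable partition and allow requeueing.
#SBATCH --partition=cfel
#SBATCH --requeue
# Hypothetical walltime; adjust to the real job.
#SBATCH --time=04:00:00

# Hypothetical checkpoint file holding the number of the last finished step.
CKPT="${CKPT:-checkpoint.txt}"

# Print the last finished step, or 0 when no checkpoint exists yet.
last_done_step() {
    if [ -s "$1" ]; then cat "$1"; else echo 0; fi
}

# Run steps (last finished + 1)..total, recording progress after each one.
run_from_checkpoint() {
    ckpt=$1
    total=$2
    step=$(last_done_step "$ckpt")
    while [ "$step" -lt "$total" ]; do
        step=$((step + 1))
        # ... the real work for this step would go here (placeholder) ...
        echo "$step" > "$ckpt"
    done
}

# Only start work when running inside a Slurm job.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    echo "restart count: ${SLURM_RESTART_COUNT:-0}"
    run_from_checkpoint "$CKPT" 10   # 10 steps is an arbitrary example
fi
```

Because progress is written after every step, a requeued run resumes after the last completed step instead of starting over. Only the step that was interrupted is lost.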

Sources

  1. Maxwell Documentation: Preemption
  2. Maxwell Documentation: Partitions
  3. SLURM Preemption Reference