1572632 : Jobs being preempted on Maxwell cluster (cfel queue)
Created: 2026-01-30T16:16:54Z - current status: new
Summary of Issue
A user reports recurring job preemptions on the Maxwell cluster, specifically in the cfel partition. Jobs start normally but are terminated before completion due to preemption, despite adhering to walltime and resource limits. The user seeks clarification on whether this behavior is expected or if adjustments to job configuration could mitigate it.
Key Observations:
- Jobs were submitted to the `cfel` partition with flags like `--no-requeue` (preventing requeueing after preemption).
- Preemption occurred at varying times (early or after several hours).
- Error logs confirm cancellation due to preemption (e.g., `DUE TO PREEMPTION`).
- No violations of walltime or memory limits were detected.
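To confirm which jobs were affected, the error logs can be scanned for the preemption marker quoted above. The following is a minimal sketch; the function name `preempted_logs` and the log file names are hypothetical, and it assumes the job's stderr files are readable from the submit host.

```bash
#!/usr/bin/env bash
# List the log files that contain the preemption marker.
# Usage: preempted_logs FILE...
preempted_logs() {
    # -l prints only the names of matching files;
    # -F treats the pattern as a fixed string, not a regular expression.
    grep -lF -- "DUE TO PREEMPTION" "$@"
}
```

For example, `preempted_logs slurm-*.err` would print the stderr files of preempted jobs. The accounting record should agree: `sacct -j <jobid> --format=JobID,Partition,State,Elapsed,Timelimit` is expected to report the state `PREEMPTED` for such a job.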
Analysis & Solution
Why Preemption Occurred:
- Partition-Specific Preemption Rules:
  - The `cfel` partition has preemption enabled with the `cancel` mode (not `requeue`).
  - Jobs in higher-priority partitions (`cfel-cdi`, `cfel-cmi`, `cfel-ux`) can terminate jobs in `cfel` (and `allcpu`/`allgpu`).
  - The `--no-requeue` flag in the submission script ensures preempted jobs are cancelled rather than requeued.
- Expected Behavior:
  - Preemption is intentional for the `cfel` partition to prioritize jobs in `cfel-cdi`, `cfel-cmi`, or `cfel-ux`.
  - This aligns with the cluster’s design to allocate resources dynamically based on partition priority.
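The preemption settings described above can be checked directly, since `scontrol show partition` prints them as `KEY=VALUE` fields such as `PreemptMode` and `PriorityTier`. A small sketch for pulling out a single field follows; the helper name `partition_field` is hypothetical, and the actual values depend on the cluster configuration.

```bash
#!/usr/bin/env bash
# Print the value of one KEY=VALUE field from `scontrol show partition`
# output read on stdin.
# Usage: scontrol show partition cfel | partition_field PreemptMode
partition_field() {
    # Put every KEY=VALUE pair on its own line, then print the value
    # of the pair whose key matches the first argument exactly.
    tr ' ' '\n' | sed -n "s/^$1=//p"
}
```

Comparing `PriorityTier` across `cfel` and the `cfel-cdi`, `cfel-cmi`, and `cfel-ux` partitions should show which partitions are able to preempt which.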
Recommended Actions:
- Check Partition Access:
  - Verify if the user’s account/group has access to higher-priority partitions (e.g., `cfel-cdi`). If so, submit jobs there to avoid preemption.
  - Use `sacctmgr show user [USERNAME] format=partition%30` to confirm allowed partitions.
- Adjust Preemption Mode:
  - If requeueing is acceptable, remove `--no-requeue` to allow jobs to restart automatically after preemption. This only helps if the partition's preemption mode permits requeueing; under a strict `cancel` mode, Slurm cancels preempted jobs regardless of the flag.
  - Example: `sbatch --requeue --partition=cfel job_script.sh`
- Use Reservations for Critical Jobs:
  - For jobs requiring uninterrupted execution, request a reservation (or use a non-preemptable partition if available).
  - Reservations bypass preemption entirely.
- Monitor Higher-Priority Jobs:
  - Preemption is triggered by jobs in `cfel-cdi`, `cfel-cmi`, or `cfel-ux`. Check for scheduled high-priority workloads during the job’s runtime.
- Alternative Partitions:
  - If preemption is frequent, consider submitting to non-preemptable partitions (e.g., `cfel-cdi` if permitted) or partitions with longer walltimes.
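If requeueing is chosen, the job script has to tolerate being restarted from the beginning. The sketch below shows one possible shape, assuming the partition actually requeues preempted jobs and that the application can resume from a checkpoint. The checkpoint path, the `app_args` helper, and the `--resume`/`--fresh` application flags are all hypothetical placeholders.

```bash
#!/bin/bash
#SBATCH --partition=cfel
#SBATCH --requeue
#SBATCH --open-mode=append   # keep output from earlier attempts when the job restarts

# Hypothetical checkpoint path; replace with whatever the application writes.
CHECKPOINT=${CHECKPOINT:-checkpoint.dat}

# Print the arguments for the (hypothetical) application:
# resume if a checkpoint file exists, otherwise start fresh.
app_args() {
    if [ -f "$CHECKPOINT" ]; then
        echo "--resume $CHECKPOINT"
    else
        echo "--fresh"
    fi
}

# SLURM_RESTART_COUNT is set by Slurm once a job has been requeued;
# it is unset on the first attempt, so default it to 0.
echo "attempt $(( ${SLURM_RESTART_COUNT:-0} + 1 )), args: $(app_args)"
# ./my_app $(app_args)   # hypothetical payload, left commented out
```

Without some form of checkpointing, a requeued job simply repeats all work done before the preemption, so frequent preemptions may still prevent it from finishing.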