1581246 : Unusually slow Maxwell nodes

Created: 2026-03-03T14:43:28Z - current status: new

Summary of Issue

A user reports performance inconsistencies when running multiple jobs using the pycalibration package (used for detector calibration at [RESEARCH_FACILITY]) on the Maxwell cluster. While some jobs complete within the expected timeframe (~30–60 minutes), others on specific nodes take significantly longer (5–8 hours). The issue persists regardless of memory allocation (tested with 500 GB and 700 GB) and across different datasets.

The user observed that slow performance correlates with specific nodes (visible in attached screenshots, now anonymized as [NODE_1], [NODE_2], etc.).


Possible Causes & Solutions

  1. Hardware Heterogeneity

     The Maxwell cluster includes nodes with varying hardware (e.g., different CPU models, memory speeds, or interconnects). Slower nodes may have older CPUs (e.g., the Intel Xeon E5-2640 v4, which is known to cause MPI errors) or less performant memory subsystems.

     Solution:

    • Explicitly request nodes with uniform hardware using --constraint in the Slurm script. For example:

      ```bash
      #SBATCH --constraint='Gold-6240|EPYC-7402'  # Target newer, homogeneous hardware
      ```

    • Verify node specifications with sinfo -o "%N %c %m %f" to identify outliers.
  2. Memory Contention

     While the user requested 700 GB, the cluster does not enforce consumable memory. Other jobs on the same node may compete for memory bandwidth, causing slowdowns.

     Solution:

    • Use the script template from random-samples -> Ensuring minimum memory per core to dynamically adjust cores based on available memory:

      ```bash
      mem_per_core=$((40*1024))  # 40 GB per core (sinfo reports memory in MB)
      for node in $(srun hostname -s | sort -u); do
        slots=$(( $(sinfo -n $node --noheader -o '%m') / $mem_per_core ))
        echo "$node slots=$slots" >> $HOSTFILE  # $HOSTFILE must point to a writable file
      done
      mpirun --hostfile $HOSTFILE ...
      ```
  3. I/O Bottlenecks

     Pycalibration may involve heavy I/O (e.g., reading/writing large datasets). Slower nodes might have degraded storage performance or network latency.

     Solution:

    • Ensure jobs use /beegfs (high-performance storage) and avoid /tmp or home directories.
    • Add --exclusive to the Slurm script to prevent resource sharing:

      ```bash
      #SBATCH --exclusive
      ```
  4. MPI/Threading Issues

     If pycalibration uses MPI or multithreading, misconfiguration (e.g., oversubscribing cores) could cause slowdowns.

     Solution:

    • Limit threads per MPI rank (e.g., 2–4 threads per rank) and ensure the total core count matches the node's physical cores:

      ```bash
      total_cores=$(nproc)
      np=$(( $total_cores / 4 ))  # Example: 4 threads per rank
      mpirun -np $np --map-by node ...
      ```
  5. Node-Specific Problems

     Some nodes may have underlying hardware issues (e.g., failing memory, overheating).

     Solution:

    • Report the slow nodes ([NODE_1], [NODE_2], etc.) to Maxwell support for diagnostics.
    • Exclude problematic nodes using --exclude in the Slurm script:

      ```bash
      #SBATCH --exclude=[NODE_1],[NODE_2]
      ```
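
The MPI/Threading suggestion above can be expanded into a small runnable sketch. The value of 4 threads per rank and the OMP_NUM_THREADS export are illustrative assumptions; pycalibration may use a different threading model:

```bash
#!/usr/bin/env bash
# Sketch: derive an MPI rank count so that ranks x threads-per-rank
# does not oversubscribe the node's physical cores.
total_cores=$(nproc)
threads_per_rank=4                      # example value, tune per workload
np=$(( total_cores / threads_per_rank ))
if (( np < 1 )); then np=1; fi          # never request zero ranks on small nodes
export OMP_NUM_THREADS=$threads_per_rank
echo "ranks=$np threads_per_rank=$threads_per_rank"
# mpirun -np $np --map-by node ...      # actual launch, shown for context
```

On a 40-core Gold-6240 node this yields 10 ranks of 4 threads each, so the job never asks the scheduler for more compute than the node physically has.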

Recommended Next Steps

  1. Test with Homogeneous Hardware

     Submit a test job with --constraint='Gold-6240' to isolate hardware-related slowdowns.

  2. Monitor Resource Usage

     Use sacct or seff <JOBID> to check CPU/memory utilization during slow runs. Look for:

    • High %mem or %cpu usage by other jobs on the same node.
    • I/O wait times (iowait in top).

  3. Consult the pycalibration Documentation

     Verify whether the package has known issues with specific MPI implementations or threading models.

  4. Contact Support

     If the issue persists, provide:

    • The exact Slurm script used.
    • Output of scontrol show job <JOBID> for slow jobs.
    • Logs showing resource usage (e.g., top snapshots).
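
Since support will ask for the exact batch script, a consolidated sketch combining the Slurm suggestions above may serve as a starting point. The partition name, time limit, and working-directory path are placeholders, not Maxwell defaults, and the pycalibration invocation is illustrative:

```bash
#!/bin/bash
# Sketch only: partition, time, and paths below are assumed placeholders.
#SBATCH --partition=maxcpu                   # assumed partition name
#SBATCH --time=02:00:00
#SBATCH --constraint='Gold-6240|EPYC-7402'   # homogeneous, newer nodes only
#SBATCH --exclusive                          # no sharing with other jobs
#SBATCH --exclude=[NODE_1],[NODE_2]          # skip the known-slow nodes

# Keep heavy I/O on the high-performance file system, not $HOME or /tmp.
WORKDIR=/beegfs/$USER/calib                  # assumed path under /beegfs
cd "$WORKDIR"

srun python -m pycalibration ...             # placeholder invocation
```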

Sources

  1. Maxwell Documentation: Ensuring Minimum Memory per Core
  2. Maxwell Documentation: Recommended Batch Script
  3. Maxwell Documentation: Hardware Constraints (for node specifications)