1581886 : Slow clone() on some Maxwell nodes

Created: 2026-03-05T10:51:36Z - current status: new

Summary of Issue

A user reports that process creation (the clone() system call) is unusually slow (~1.4 seconds per call) on certain Maxwell cluster nodes, while other nodes run the same code normally. The delay occurs when launching a worker pool (1 worker per core), adding ~3 minutes of overhead for process startup alone to a job that normally runs in ~35 minutes. The issue appears intermittently and is tied to specific nodes (e.g., max-exfl509 for job [JOB_ID]).
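To confirm that the slowdown lies in clone() itself (on Linux, Python's os.fork() is implemented via clone()), a minimal timing sketch such as the following can be run on a suspect node. The helper name and iteration count are illustrative, not from the original report:

```python
import os
import time


def time_fork(iterations=5):
    """Time fork() (clone() under the hood on Linux) to check
    whether process creation on this node is abnormally slow."""
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        pid = os.fork()
        if pid == 0:
            # Child: exit immediately so only the clone() cost is measured.
            os._exit(0)
        os.waitpid(pid, 0)
        samples.append(time.perf_counter() - start)
    return samples


if __name__ == "__main__":
    for i, s in enumerate(time_fork(), 1):
        print(f"fork {i}: {s * 1000:.1f} ms")
```

On a healthy node each fork should complete in well under a millisecond; values near the reported 1.4 s would reproduce the problem in isolation.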


Possible Causes & Investigation Steps

  1. Resource Contention:
    • The Maxwell cluster allocates nodes exclusively to jobs (no shared resources). However, background system processes (e.g., monitoring, logging) or high memory/CPU usage on the node could delay clone().
    • Check: Run top, htop, or sar on the affected node to monitor system load, memory pressure, or I/O bottlenecks.

  2. Memory Overcommitment:
    • If the job requests less memory than required, the kernel’s OOM killer may intervene, causing delays (see Example 3).
    • Check: Verify the job’s memory allocation (--mem) matches the workload’s needs. Use sacct -j [JOB_ID] --format=MaxRSS to confirm memory usage.

  3. Kernel/Filesystem Latency:
    • Slow clone() can stem from filesystem delays (e.g., /tmp or /dev/shm contention) or kernel scheduling issues.
    • Check: Test with strace -T on a minimal process (e.g., sleep 1) to isolate the delay. Compare dmesg logs between fast and slow nodes.

  4. NUMA/CPU Affinity:
    • If workers are bound to specific cores (e.g., via taskset or MPI), cross-NUMA memory access could slow process creation.
    • Check: Use numactl --hardware to inspect the NUMA topology and ensure workers are pinned to local memory.

  5. Slurm Configuration:
    • The cluster’s non-consumable memory policy (see Ensuring Minimum Memory per Core) may lead to uneven performance if nodes have varying memory/core ratios.
    • Check: Compare sinfo -o "%n %m %c" for affected vs. unaffected nodes.
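As a rough aid for the memory check in item 2, Slurm reports MaxRSS with K/M/G/T suffixes (binary multiples). A small hypothetical helper (parse_rss is not from the original report) to normalize those values to bytes for comparison against the job's --mem request might look like:

```python
def parse_rss(value: str) -> int:
    """Convert a Slurm MaxRSS string (e.g. '15360K', '2.5G') to bytes.

    Slurm suffixes K/M/G/T denote binary multiples (1K = 1024 bytes).
    A bare number is taken as bytes.
    """
    units = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}
    value = value.strip()
    if value and value[-1].upper() in units:
        return int(float(value[:-1]) * units[value[-1].upper()])
    return int(float(value)) if value else 0


if __name__ == "__main__":
    # Example: flag jobs whose peak RSS is close to the requested limit.
    requested = parse_rss("8G")          # e.g. --mem=8G
    peak = parse_rss("7.9G")             # MaxRSS from sacct output
    print(f"headroom: {(requested - peak) / 1024**2:.0f} MiB")
```

If the headroom is consistently near zero, the OOM-related delay in item 2 becomes a plausible explanation.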

Suggested Solutions

  1. Short-Term Workarounds:
    • Pre-warm the worker pool: Launch all workers at once (e.g., via mpirun or multiprocessing.Pool) to amortize the clone() cost.
    • Request more memory: Increase --mem to avoid OOM-related delays (e.g., --mem=8G for 4 cores).
    • Avoid overloading cores: Reduce workers to nproc/2 (physical cores only) to minimize contention.

  2. Long-Term Investigation:
    • Node-Specific Debugging: Run a test job on max-exfl509 with strace -f -T to trace all clone() calls, and compare perf stat metrics (e.g., context switches, cache misses) between fast and slow nodes.
    • Contact Support: Provide the job ID ([JOB_ID]) and node name (max-exfl509) to cluster admins for deeper analysis (e.g., kernel logs, hardware health).
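The pre-warming workaround above can be sketched with multiprocessing.Pool: the clone() cost is paid once when the pool starts, instead of once per task. Here worker and the task range are placeholders for the actual workload:

```python
import multiprocessing as mp
import time


def worker(task):
    # Placeholder workload; replace with the real per-task function.
    return task * task


if __name__ == "__main__":
    start = time.perf_counter()
    # All workers are forked up front, so any per-clone() delay is
    # incurred once at pool creation rather than per task.
    with mp.Pool(processes=mp.cpu_count()) as pool:
        startup = time.perf_counter() - start
        results = pool.map(worker, range(100))
    print(f"pool startup: {startup:.2f}s")
```

On an affected node, an abnormally large startup time here would localize the problem to process creation rather than the workload itself.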

References

  1. Maxwell: Running Non-Demanding Batch Jobs (Example 3)
  2. Maxwell: Ensuring Minimum Memory per Core
  3. Slurm Documentation: Resource Allocation