1581886: Slow clone() on some Maxwell nodes
Created: 2026-03-05T10:51:36Z - current status: new
Summary of Issue
A user reports that process creation (the `clone()` system call) is unusually slow (~1.4 seconds per call) on certain Maxwell cluster nodes, while other nodes run the same code at normal speed. The delay occurs when launching a worker pool (one worker per core), adding roughly 3 minutes of overhead for process startup alone, against a normal execution time of about 35 minutes. The issue appears intermittently but is tied to specific nodes (e.g., max-exfl509 for job [JOB_ID]).
Possible Causes & Investigation Steps
- Resource Contention:
  - The Maxwell cluster allocates nodes exclusively to jobs (no shared resources). However, background system processes (e.g., monitoring, logging) or high memory/CPU usage on the node could delay `clone()`.
  - Check: Run `top`, `htop`, or `sar` on the affected node to monitor system load, memory pressure, or I/O bottlenecks.
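As a quick first pass on the suspect node, the load and memory checks above can be scripted along these lines (a sketch; `vmstat` and `sar` are only present where procps/sysstat are installed):

```shell
# Snapshot load and memory on the affected node (run via ssh or inside the job).
uptime                             # load averages; compare against the core count
free -m                            # memory pressure and swap usage, in MiB
vmstat 1 3 2>/dev/null \
  || echo "vmstat not available"   # run queue and I/O wait sampled over 3 seconds
```

If the load average far exceeds the core count, or swap is in use on a supposedly exclusive node, that alone can explain slow process creation.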
- Memory Overcommitment:
  - If the job requests less memory than required, the kernel's OOM killer may intervene, causing delays (see Example 3).
  - Check: Verify the job's memory allocation (`--mem`) matches the workload's needs. Use `sacct -j [JOB_ID] --format=MaxRSS` to confirm memory usage.
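To see whether the OOM killer actually fired, the accounting data can be cross-checked against the kernel log (a sketch; `sacct` requires a Slurm environment, and reading the kernel ring buffer may need elevated privileges on some systems):

```shell
# Accounting side (needs Slurm; [JOB_ID] stays a placeholder):
#   sacct -j [JOB_ID] --format=JobID,ReqMem,MaxRSS,State
# Kernel side, on the node itself: any OOM kill leaves a trace in the log.
dmesg -T 2>/dev/null | grep -iE "out of memory|oom-killer" \
  || echo "no OOM events visible in the kernel ring buffer"
```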
- Kernel/Filesystem Latency:
  - Slow `clone()` can stem from filesystem delays (e.g., `/tmp` or `/dev/shm` contention) or kernel scheduling issues.
  - Check: Test with `strace -T` on a minimal process (e.g., `sleep 1`) to isolate the delay. Compare `dmesg` logs between fast and slow nodes.
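The `strace -T` check can be wrapped so it degrades gracefully where strace is not installed or ptrace is restricted (a sketch; `/tmp/clone_trace.log` is an arbitrary scratch path):

```shell
# -T appends the time spent inside each syscall; -f follows child processes.
strace -f -T -o /tmp/clone_trace.log sleep 1 2>/dev/null \
  && grep -E "clone|execve" /tmp/clone_trace.log | head \
  || echo "strace could not run here"
```

On a healthy node the bracketed times after each `clone`/`execve` line are fractions of a millisecond; anything near the reported 1.4 s stands out immediately.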
- NUMA/CPU Affinity:
  - If workers are bound to specific cores (e.g., via `taskset` or MPI), cross-NUMA memory access could slow process creation.
  - Check: Use `numactl --hardware` to inspect NUMA topology and ensure workers are pinned to local memory.
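The NUMA layout can be inspected even where `numactl` is missing, since `lscpu` reports the same node-to-CPU mapping (a sketch):

```shell
# Prefer numactl for the full picture (node distances, free memory per node);
# fall back to lscpu, which at least shows the NUMA node / CPU layout.
numactl --hardware 2>/dev/null \
  || lscpu | grep -i "numa" \
  || echo "no NUMA information reported"
```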
- Slurm Configuration:
  - The cluster's non-consumable memory policy (see "Ensuring minimum memory per core") may lead to uneven performance if nodes have varying memory/core ratios.
  - Check: Compare `sinfo -o "%n %m %c"` for affected vs. unaffected nodes.
Suggested Solutions
- Short-Term Workarounds:
  - Pre-warm the worker pool: Launch all workers at once (e.g., via `mpirun` or `multiprocessing.Pool`) to amortize the `clone()` cost.
  - Request more memory: Increase `--mem` to avoid OOM-related delays (e.g., `--mem=8G` for 4 cores).
  - Avoid overloading cores: Reduce workers to `nproc/2` (physical cores only) to minimize contention.
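The workarounds above can be combined in a batch script along these lines (a sketch; the resource values and the `./worker` binary are placeholders, and the actual launch requires Slurm):

```shell
#!/bin/bash
#SBATCH --mem=8G                      # generous request to keep the OOM killer away
#SBATCH --nodelist=max-exfl509        # pin to the slow node when reproducing

WORKERS=$(( $(nproc) / 2 ))           # half the hardware threads, per the advice above
echo "launching $WORKERS workers"
# srun --ntasks="$WORKERS" ./worker   # hypothetical worker binary
```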
- Long-Term Investigation:
  - Node-Specific Debugging:
    - Run a test job on `max-exfl509` with `strace -f -T` to trace all `clone()` calls.
    - Compare `perf stat` metrics (e.g., context switches, cache misses) between fast and slow nodes.
  - Contact Support: Provide the job ID ([JOB_ID]) and node name (max-exfl509) to cluster admins for deeper analysis (e.g., kernel logs, hardware health).
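For the node-to-node comparison, a minimal timing loop gives a single number that can be collected on a fast node and on max-exfl509 without needing strace or perf (a sketch assuming GNU `date` with nanosecond support):

```shell
# Average fork+exec cost over N iterations; each /bin/true spawns one process.
# A healthy node typically lands in the low single-digit milliseconds at most,
# nowhere near the 1.4 s per clone() from the report.
N=50
start=$(date +%s%N)
for i in $(seq 1 "$N"); do /bin/true; done
end=$(date +%s%N)
echo "avg fork+exec: $(( (end - start) / N / 1000 )) us"
```

Running this inside a test job on both node classes makes the per-node difference easy to quantify in the support ticket.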