1581886 : Slow clone() on some Maxwell nodes

Created: 2026-03-05T10:51:36Z - current status: new

Summary of Issue

A user reports that process creation (the clone() system call) is unusually slow (~1.4 seconds per call) on certain Maxwell cluster nodes, while other nodes run the same code normally. The delay occurs when launching a worker pool (1 worker per core), adding ~3 minutes of overhead for process startup alone to a job that normally runs in ~35 minutes. The issue appears intermittently and is tied to specific nodes (e.g., max-exfl509 for job [JOB_ID]).
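To confirm that the slowdown lies in clone() itself (on Linux, Python's os.fork() is implemented via clone()), a minimal timing sketch such as the following can be run on a suspect node. The helper name and iteration count are illustrative, not from the original report:

```python
import os
import time


def time_fork(iterations=5):
    """Time fork() (clone() under the hood on Linux) to check
    whether process creation on this node is abnormally slow."""
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        pid = os.fork()
        if pid == 0:
            # Child: exit immediately so only the clone() cost is measured.
            os._exit(0)
        os.waitpid(pid, 0)
        samples.append(time.perf_counter() - start)
    return samples


if __name__ == "__main__":
    for i, s in enumerate(time_fork(), 1):
        print(f"fork {i}: {s * 1000:.1f} ms")
```

On a healthy node each fork should complete in well under a millisecond; values near the reported 1.4 s would reproduce the problem in isolation.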


Possible Causes & Investigation Steps

  1. Resource Contention:
    • The Maxwell cluster allocates nodes exclusively to jobs (no shared resources). However, background system processes (e.g., monitoring, logging) or high memory/CPU usage on the node could delay clone().
    • Check: Run top, htop, or sar on the affected node to monitor system load, memory pressure, or I/O bottlenecks.

  2. Memory Overcommitment:
    • If the job requests less memory than required, the kernel’s OOM killer may intervene, causing delays (see Example 3).
    • Check: Verify the job’s memory allocation (--mem) matches the workload’s needs. Use sacct -j [JOB_ID] --format=MaxRSS to confirm memory usage.

  3. Kernel/Filesystem Latency:
    • Slow clone() can stem from filesystem delays (e.g., /tmp or /dev/shm contention) or kernel scheduling issues.
    • Check: Test with strace -T on a minimal process (e.g., sleep 1) to isolate the delay. Compare dmesg logs between fast and slow nodes.

  4. NUMA/CPU Affinity:
    • If workers are bound to specific cores (e.g., via taskset or MPI), cross-NUMA memory access could slow process creation.
    • Check: Use numactl --hardware to inspect the NUMA topology and ensure workers are pinned to local memory.

  5. Slurm Configuration:
    • The cluster’s non-consumable memory policy (see Ensuring Minimum Memory per Core) may lead to uneven performance if nodes have varying memory/core ratios.
    • Check: Compare sinfo -o "%n %m %c" for affected vs. unaffected nodes.
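As a rough aid for the memory check in item 2, Slurm reports MaxRSS with K/M/G/T suffixes (binary multiples). A small hypothetical helper (parse_rss is not from the original report) to normalize those values to bytes for comparison against the job's --mem request might look like:

```python
def parse_rss(value: str) -> int:
    """Convert a Slurm MaxRSS string (e.g. '15360K', '2.5G') to bytes.

    Slurm suffixes K/M/G/T denote binary multiples (1K = 1024 bytes).
    A bare number is taken as bytes.
    """
    units = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}
    value = value.strip()
    if value and value[-1].upper() in units:
        return int(float(value[:-1]) * units[value[-1].upper()])
    return int(float(value)) if value else 0


if __name__ == "__main__":
    # Example: flag jobs whose peak RSS is close to the requested limit.
    requested = parse_rss("8G")          # e.g. --mem=8G
    peak = parse_rss("7.9G")             # MaxRSS from sacct output
    print(f"headroom: {(requested - peak) / 1024**2:.0f} MiB")
```

If the headroom is consistently near zero, the OOM-related delay in item 2 becomes a plausible explanation.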

Suggested Solutions

  1. Short-Term Workarounds:
    • Pre-warm the worker pool: Launch all workers at once (e.g., via mpirun or multiprocessing.Pool) to amortize the clone() cost.
    • Request more memory: Increase --mem to avoid OOM-related delays (e.g., --mem=8G for 4 cores).
    • Avoid overloading cores: Reduce workers to nproc/2 (physical cores only) to minimize contention.

  2. Long-Term Investigation:
    • Node-Specific Debugging: Run a test job on max-exfl509 with strace -f -T to trace all clone() calls, and compare perf stat metrics (e.g., context switches, cache misses) between fast and slow nodes.
    • Contact Support: Provide the job ID ([JOB_ID]) and node name (max-exfl509) to cluster admins for deeper analysis (e.g., kernel logs, hardware health).
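The pre-warming workaround above can be sketched with multiprocessing.Pool: the clone() cost is paid once when the pool starts, instead of once per task. Here worker and the task range are placeholders for the actual workload:

```python
import multiprocessing as mp
import time


def worker(task):
    # Placeholder workload; replace with the real per-task function.
    return task * task


if __name__ == "__main__":
    start = time.perf_counter()
    # All workers are forked up front, so any per-clone() delay is
    # incurred once at pool creation rather than per task.
    with mp.Pool(processes=mp.cpu_count()) as pool:
        startup = time.perf_counter() - start
        results = pool.map(worker, range(100))
    print(f"pool startup: {startup:.2f}s")
```

On an affected node, an abnormally large startup time here would localize the problem to process creation rather than the workload itself.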

References

  1. Maxwell: Running Non-Demanding Batch Jobs (Example 3)
  2. Maxwell: Ensuring Minimum Memory per Core
  3. Slurm Documentation: Resource Allocation