1591781 : resource monitoring (post job)
Created: 2026-04-13T14:46:53Z - current status: new
Here’s a structured response to your query about extracting job resource usage details on the Maxwell cluster, particularly focusing on GPU metrics and other advanced statistics similar to the Hummel cluster report you shared.
1. Summary of the Issue

The user seeks detailed post-job resource usage statistics on the Maxwell cluster, including:
- GPU usage (load, memory, power).
- CPU core-level breakdown (per-socket/core usage).
- I/O metrics (filesystem reads/writes, inter-node communication).
- Memory usage (RSS, virtual memory, swap).
- Command-level resource attribution (e.g., per-process CPU/memory).
While sacct provides basic CPU and memory usage, it lacks GPU and granular I/O/communication metrics. The Hummel cluster’s report offers a more comprehensive breakdown, which the user wants to replicate on Maxwell.
2. Available Tools on Maxwell

A. sacct (Basic Metrics)

- What it provides:
  - Job ID, state, elapsed time, CPU usage (CPUTime, ReqCPUS), memory (MaxRSS), and exit codes.
- Example:

  ```bash
  sacct -j <JOBID> --format=JobID,Elapsed,CPUTime,MaxRSS,ReqCPUS,NodeList
  ```

- Limitations:
  - No GPU metrics (load, memory, power).
  - No per-core CPU usage or I/O statistics.
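One small post-processing step that sacct output does benefit from: MaxRSS is printed with a unit suffix (e.g. 102400K), which makes jobs awkward to compare at a glance. A minimal helper to normalize such values to MiB — `rss_to_mib` is a hypothetical name, and the single-letter K/M/G/T suffix handling is an assumption about sacct's usual output format:

```shell
# rss_to_mib VALUE
# Convert a sacct-style memory value (e.g. 102400K, 512M, 2G) to MiB.
# Assumes the usual single-letter K/M/G/T suffixes; a bare number or an
# "M" suffix is treated as already being in MiB.
rss_to_mib() {
    awk -v v="$1" 'BEGIN {
        n = v + 0                        # numeric prefix of the value
        u = substr(v, length(v), 1)      # trailing unit letter, if any
        if      (u == "K") n /= 1024
        else if (u == "G") n *= 1024
        else if (u == "T") n *= 1024 * 1024
        printf "%.1f\n", n
    }'
}
```

For example, `rss_to_mib 102400K` prints `100.0`.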
B. Grafana (Real-Time Monitoring)
- What it provides:
- Real-time GPU metrics (utilization, memory usage, power) as graphs.
- Node-level CPU, memory, and network I/O.
- Limitations:
- Data is not summarized post-job (only 5-minute snapshots).
- No command-level breakdown.
C. seff (Job Efficiency Report)

- What it provides:
  - Summary of CPU and memory efficiency (e.g., "CPU Efficiency: 50%").
- Example:

  ```bash
  seff <JOBID>
  ```

- Limitations:
  - No GPU or I/O metrics.
D. Custom Scripts (Advanced Metrics)

Maxwell does not natively provide a tool like Hummel's report, but you can manually extract some metrics using:

1. Slurm Logs:
   - Check job output files (specified via #SBATCH --output) for application-specific logs (e.g., NVIDIA nvidia-smi logs).
2. sstat (During Job Execution):
   - Real-time metrics for running jobs (CPU, memory, I/O), but not GPU.
   - Example:

     ```bash
     sstat -j <JOBID> --format=JobID,MaxRSS,AveCPU
     ```

3. nvidia-smi Logs:
   - If your job logs GPU usage (e.g., via nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv), you can parse these logs post-job.
   - Example snippet to log GPU stats from inside a job script:

     ```bash
     #SBATCH --gres=gpu:1
     nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used --format=csv -l 60 > gpu_usage.log
     ```

4. darshan (I/O Profiling):
   - Maxwell supports Darshan for I/O profiling, but it must be enabled before job submission.
   - Example:

     ```bash
     module load darshan
     export DARSHAN_ENABLE_NONMPI=1
     sbatch your_script.sh
     ```

   - Post-job, generate a report:

     ```bash
     darshan-parser /path/to/darshan/logs/<JOBID>.darshan > io_report.txt
     ```
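Once a job has written such a CSV log, the per-sample rows can be reduced to a post-job summary. A sketch, assuming the default `--format=csv` layout that nvidia-smi emits (a header line, then rows of the form `<timestamp>, <util> %, <mem> MiB`); `summarize_gpu_log` is a hypothetical helper name:

```shell
# summarize_gpu_log LOGFILE
# Reduce a CSV log written by
#   nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used --format=csv -l 60
# to sample count, average GPU utilization, and peak memory used.
# Assumes the default csv layout: header line, then "<ts>, <util> %, <mem> MiB".
summarize_gpu_log() {
    awk -F',' 'NR > 1 {
        gsub(/[ %]/, "", $2)             # strip spaces and "%" from utilization
        sub(/ MiB/, "", $3)              # strip the MiB unit from memory.used
        util += $2; n++
        if ($3 + 0 > peak) peak = $3 + 0
    }
    END {
        if (n) printf "samples=%d avg_util=%.1f%% peak_mem=%dMiB\n", n, util / n, peak
    }' "$1"
}
```

Run it as `summarize_gpu_log gpu_usage.log` after the job finishes.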
3. Missing Metrics on Maxwell

| Metric | Available on Maxwell? | Tool/Workaround |
|---|---|---|
| GPU load/memory/power | ❌ No (via sacct) | Parse nvidia-smi logs or use Grafana. |
| Per-core CPU usage | ❌ No | Use mpstat or pidstat during job. |
| Inter-node communication | ❌ No | Use sar or application-specific tools. |
| Filesystem I/O | ⚠️ Limited | Enable darshan or parse /proc. |
| Command-level breakdown | ❌ No | Use ps/top logs during job. |
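If mpstat/pidstat are not installed on the compute nodes, /proc/stat itself can be snapshotted at job start and end to recover per-core utilization. A sketch under that assumption — `core_busy_pct` is a hypothetical name, and the assumed field layout is the standard `cpuN user nice system idle iowait irq softirq steal guest guest_nice`:

```shell
# core_busy_pct SNAP1 SNAP2
# Given two copies of /proc/stat taken some interval apart, print the busy
# percentage of each core over that interval. Assumes the standard field
# layout: cpuN user nice system idle iowait irq softirq steal guest guest_nice.
core_busy_pct() {
    paste "$1" "$2" | awk '/^cpu[0-9]/ {
        half = NF / 2                     # first snapshot: $1..$half, second: the rest
        t1 = 0; for (i = 2; i <= half; i++) t1 += $i
        t2 = 0; for (i = half + 2; i <= NF; i++) t2 += $i
        idle1 = $5 + $6                   # idle + iowait, first snapshot
        idle2 = $(half + 5) + $(half + 6)
        printf "%s %.1f%%\n", $1, 100 * (1 - (idle2 - idle1) / (t2 - t1))
    }'
}
```

Typical use inside a job script: `grep '^cpu[0-9]' /proc/stat > start.stat`, run the application, `grep '^cpu[0-9]' /proc/stat > end.stat`, then `core_busy_pct start.stat end.stat`.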
4. Suggested Solutions

Option 1: Log GPU Metrics During Job

Add this to your Slurm script to log GPU usage:

```bash
#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --output=job_%j.out

# Log GPU stats every 60 seconds in the background
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used,power.draw --format=csv -l 60 > gpu_usage_${SLURM_JOB_ID}.log &
LOGGER_PID=$!   # capture the logger's PID right away

# Run your application
your_command_here

# Stop the background logger
kill $LOGGER_PID
```
Option 2: Use darshan for I/O Metrics

Enable Darshan in your job script:

```bash
#!/bin/bash
#SBATCH --output=job_%j.out

module load darshan
export DARSHAN_ENABLE_NONMPI=1
your_command_here
```

Post-job, generate a report:

```bash
darshan-parser /beegfs/desy/group/it/darshan-logs/<JOBID>.darshan > io_report.txt
```
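The darshan-parser text report is verbose; the byte totals can be pulled out with a short filter. This is a sketch under two assumptions: that the report uses the POSIX module's `POSIX_BYTES_READ`/`POSIX_BYTES_WRITTEN` counter names, and that each counter name is immediately followed by its value on the row; `darshan_bytes` is a hypothetical helper name:

```shell
# darshan_bytes REPORT
# Sum the POSIX_BYTES_READ / POSIX_BYTES_WRITTEN counters across all ranks
# and files in a darshan-parser text report. Assumes each counter name is
# immediately followed by its value on the same row.
darshan_bytes() {
    awk '{ for (i = 1; i < NF; i++)
               if ($i ~ /^POSIX_BYTES_(READ|WRITTEN)$/) sum[$i] += $(i + 1) }
         END { for (c in sum) printf "%s %d\n", c, sum[c] }' "$1"
}
```

For example, `darshan_bytes io_report.txt` prints one total line per counter.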
Option 3: Parse /proc for CPU/Memory

Add this to your script to log CPU/memory usage:

```bash
#!/bin/bash
#SBATCH --output=job_%j.out

# Snapshot the process table every 60 seconds in the background
while true; do
    ps aux >> cpu_mem_${SLURM_JOB_ID}.log
    sleep 60
done &
LOGGER_PID=$!   # capture the logger's PID right away

your_command_here

kill $LOGGER_PID
```
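The resulting log is a series of `ps aux` snapshots, so command-level peaks can be recovered from it afterwards. A sketch, assuming the standard `ps aux` column order (RSS in column 6, the command starting in column 11); `peak_rss` is a hypothetical helper name:

```shell
# peak_rss LOGFILE PATTERN
# Scan a log of repeated `ps aux` snapshots and print the peak RSS (KiB) seen
# for processes whose command field matches PATTERN.
# Assumes the standard ps aux columns, i.e. RSS in $6 and COMMAND in $11.
peak_rss() {
    awk -v cmd="$2" '$11 ~ cmd && $6 + 0 > peak { peak = $6 + 0 }
                     END { print peak + 0 }' "$1"
}
```

For example, `peak_rss cpu_mem_<JOBID>.log python` prints the largest RSS any matching python process reached during the job.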
Option 4: Request Maxwell-Specific Tools
If you frequently need such reports, consider requesting the Maxwell team to:
- Add GPU metrics to sacct (e.g., via Slurm’s gres accounting).
- Provide a post-job summary tool (similar to Hummel’s report).
5. Example Workflow for Maxwell

1. Submit a job with logging:

   ```bash
   sbatch --gres=gpu:1 --output=job_%j.out job_script.sh
   ```

2. Post-job analysis:
   - Parse gpu_usage_<JOBID>.log for GPU metrics.
   - Parse cpu_mem_<JOBID>.log for CPU/memory.
   - Use darshan-parser for I/O metrics.