1591781 : resource monitoring (post job)

Created: 2026-04-13T14:46:53Z - current status: new

Here’s a structured response to your query about extracting job resource usage details on the Maxwell cluster, particularly focusing on GPU metrics and other advanced statistics similar to the Hummel cluster report you shared.


1. Summary of the Issue

The user seeks detailed post-job resource usage statistics on the Maxwell cluster, including:

  • GPU usage (load, memory, power).
  • CPU core-level breakdown (per-socket/core usage).
  • I/O metrics (filesystem reads/writes, inter-node communication).
  • Memory usage (RSS, virtual memory, swap).
  • Command-level resource attribution (e.g., per-process CPU/memory).

While sacct provides basic CPU and memory usage, it lacks GPU and granular I/O/communication metrics. The Hummel cluster’s report offers a more comprehensive breakdown, which the user wants to replicate on Maxwell.


2. Available Tools on Maxwell

A. sacct (Basic Metrics)

  • What it provides: job ID, state, elapsed time, CPU usage (CPUTime, ReqCPUS), memory (MaxRSS), and exit codes.
  • Example:

    sacct -j <JOBID> --format=JobID,Elapsed,CPUTime,MaxRSS,ReqCPUS,NodeList

  • Limitations: no GPU metrics (load, memory, power); no per-core CPU usage or I/O statistics.

B. Grafana (Real-Time Monitoring)

  • What it provides: real-time GPU metrics (utilization, memory usage, power) as graphs; node-level CPU, memory, and network I/O.
  • Limitations: data is not summarized post-job (only 5-minute snapshots); no command-level breakdown.

C. seff (Job Efficiency Report)

  • What it provides: a summary of CPU and memory efficiency (e.g., "CPU Efficiency: 50%").
  • Example:

    seff <JOBID>

  • Limitations: no GPU or I/O metrics.

D. Custom Scripts (Advanced Metrics)

Maxwell does not natively provide a tool like Hummel’s report, but you can manually extract some metrics:

  1. Slurm logs: check job output files (specified via #SBATCH --output) for application-specific logs (e.g., nvidia-smi output).
  2. sstat (during job execution): real-time metrics for running jobs (CPU, memory, I/O), but not GPU.

     sstat -j <JOBID> --format=JobID,MaxRSS,AveCPU

  3. nvidia-smi logs: if your job logs GPU usage (e.g., via nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv), you can parse these logs post-job. Example logging snippet for a job script:

     #SBATCH --gres=gpu:1
     nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used --format=csv -l 60 > gpu_usage.log

  4. darshan (I/O profiling): Maxwell supports Darshan for I/O profiling, but it must be enabled before job submission.

     module load darshan
     export DARSHAN_ENABLE_NONMPI=1
     sbatch your_script.sh

     Post-job, generate a report:

     darshan-parser /path/to/darshan/logs/<JOBID>.darshan > io_report.txt
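To turn such a CSV log into a post-job summary, a short awk pass is enough. The sample lines below are hypothetical (the exact header and unit strings depend on the nvidia-smi version), so check your actual log and adjust the field positions if needed:

```shell
# Hypothetical excerpt of nvidia-smi CSV output; real logs may differ slightly
cat > gpu_usage_sample.log <<'EOF'
timestamp, utilization.gpu [%], memory.used [MiB]
2026/04/13 14:00:00.000, 40 %, 1024 MiB
2026/04/13 14:01:00.000, 80 %, 2048 MiB
2026/04/13 14:02:00.000, 60 %, 1536 MiB
EOF

# Average GPU utilization and peak memory, skipping the header line.
# "$2 + 0" coerces "40 %" to the number 40; likewise for "1024 MiB".
awk -F', ' 'NR > 1 { util += $2 + 0; if ($3 + 0 > peak) peak = $3 + 0; n++ }
            END { printf "avg util: %.1f%%, peak mem: %d MiB\n", util / n, peak }' gpu_usage_sample.log
```

For the sample above this prints an average of 60.0% utilization and a peak of 2048 MiB.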


3. Missing Metrics on Maxwell

Metric                     Available on Maxwell?       Tool/Workaround
GPU load/memory/power      ❌ No (not via sacct)       Parse nvidia-smi logs or use Grafana.
Per-core CPU usage         ❌ No                       Use mpstat or pidstat during the job.
Inter-node communication   ❌ No                       Use sar or application-specific tools.
Filesystem I/O             ⚠️ Limited                  Enable darshan or parse /proc.
Command-level breakdown    ❌ No                       Use ps/top logs during the job.
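As a minimal illustration of the "parse /proc" workaround in the table, the snippet below reads the memory and I/O counters of the current shell, as a stand-in for a job process. This is Linux-only and an assumption about what you want to sample; in a job script you would substitute your application's PID:

```shell
# Linux-only sketch: read RSS and I/O counters for one PID from /proc.
# $$ (the shell's own PID) stands in for a real job process here.
pid=$$

# Resident set size, e.g. "RSS: 3456 kB"
awk '/^VmRSS:/ { print "RSS:", $2, $3 }' /proc/${pid}/status

# Cumulative bytes read/written by this process
awk -F': *' '/^(read_bytes|write_bytes)/ { print $1 "=" $2 }' /proc/${pid}/io
```

Appending such a snippet to a periodic loop (as in Option 3 below) gives a crude time series without any extra tooling.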

4. Suggested Solutions

Option 1: Log GPU Metrics During Job

Add this to your Slurm script to log GPU usage:

#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --output=job_%j.out

# Log GPU stats every 60 seconds in the background
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used,power.draw --format=csv -l 60 > gpu_usage_${SLURM_JOB_ID}.log &
LOGGER_PID=$!

# Run your application
your_command_here

# Stop the background logger ($! would be clobbered if the application
# spawned further background jobs, so the PID is saved explicitly above)
kill $LOGGER_PID

Option 2: Use darshan for I/O Metrics

Enable Darshan in your job script:

#!/bin/bash
#SBATCH --output=job_%j.out

module load darshan
export DARSHAN_ENABLE_NONMPI=1

your_command_here

Post-job, generate a report:

darshan-parser /beegfs/desy/group/it/darshan-logs/<JOBID>.darshan > io_report.txt

Option 3: Parse /proc for CPU/Memory

Add this to your script to log CPU/memory usage:

#!/bin/bash
#SBATCH --output=job_%j.out

# Take a process snapshot every 60 seconds in the background
while true; do
  ps aux >> cpu_mem_${SLURM_JOB_ID}.log
  sleep 60
done &
LOGGER_PID=$!

your_command_here

# Stop the background logging loop
kill $LOGGER_PID
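The resulting log can be reduced post-job with awk. The sample below is a hypothetical excerpt of ps aux output (the column layout is standard, but your command names will differ); it finds the snapshot line with the highest CPU share:

```shell
# Hypothetical excerpt of a ps-aux log like the one Option 3 produces
cat > cpu_mem_sample.log <<'EOF'
USER  PID %CPU %MEM    VSZ   RSS TTY STAT START TIME COMMAND
user 1001 95.0  2.5 400000 51200 ?   R    14:00 5:00 ./simulate
user 1002  5.0  0.1  20000  2048 ?   S    14:00 0:01 bash
EOF

# Track the maximum %CPU seen (field 3); header lines are skipped
# because their third field ("%CPU") is not numeric.
awk '$3 ~ /^[0-9.]+$/ && $3 + 0 > max { max = $3; cmd = $11 }
     END { printf "hottest command: %s at %.1f%% CPU\n", cmd, max }' cpu_mem_sample.log
```

For the sample above this reports ./simulate at 95.0% CPU. Summing or averaging per COMMAND is a small extension of the same pattern.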

Option 4: Request Maxwell-Specific Tools

If you frequently need such reports, consider asking the Maxwell team to:

  • Add GPU metrics to sacct (e.g., via Slurm’s gres accounting).
  • Provide a post-job summary tool (similar to Hummel’s report).


5. Example Workflow for Maxwell

  1. Submit a job with logging:

     sbatch --gres=gpu:1 --output=job_%j.out job_script.sh

  2. Post-job analysis:
  • Parse gpu_usage_<JOBID>.log for GPU metrics.
  • Parse cpu_mem_<JOBID>.log for CPU/memory.
  • Use darshan-parser for I/O metrics.

Sources

  1. Maxwell Slurm Documentation
  2. Darshan I/O Profiler
  3. NVIDIA nvidia-smi Documentation