1591781 : resource monitoring (post job)

Created: 2026-04-13T14:46:53Z - current status: new

Here’s a structured response to your query about extracting job resource usage details on the Maxwell cluster, particularly focusing on GPU metrics and other advanced statistics similar to the Hummel cluster report you shared.


1. Summary of the Issue

The user seeks detailed post-job resource usage statistics on the Maxwell cluster, including:

  • GPU usage (load, memory, power).
  • CPU core-level breakdown (per-socket/core usage).
  • I/O metrics (filesystem reads/writes, inter-node communication).
  • Memory usage (RSS, virtual memory, swap).
  • Command-level resource attribution (e.g., per-process CPU/memory).

While sacct provides basic CPU and memory usage, it lacks GPU and granular I/O/communication metrics. The Hummel cluster’s report offers a more comprehensive breakdown, which the user wants to replicate on Maxwell.


2. Available Tools on Maxwell

A. sacct (Basic Metrics)

  • What it provides: job ID, state, elapsed time, CPU usage (CPUTime, ReqCPUS), memory (MaxRSS), and exit codes.
  • Example:

    sacct -j <JOBID> --format=JobID,Elapsed,CPUTime,MaxRSS,ReqCPUS,NodeList

  • Limitations: no GPU metrics (load, memory, power); no per-core CPU usage or I/O statistics.

B. Grafana (Real-Time Monitoring)

  • What it provides: real-time GPU metrics (utilization, memory usage, power) as graphs; node-level CPU, memory, and network I/O.
  • Limitations: data is not summarized post-job (only 5-minute snapshots); no command-level breakdown.

C. seff (Job Efficiency Report)

  • What it provides: a summary of CPU and memory efficiency (e.g., "CPU Efficiency: 50%").
  • Example:

    seff <JOBID>

  • Limitations: no GPU or I/O metrics.

D. Custom Scripts (Advanced Metrics)

Maxwell does not natively provide a tool like Hummel’s report, but you can manually extract some metrics:

  1. Slurm logs: check job output files (specified via #SBATCH --output) for application-specific logs (e.g., nvidia-smi output).
  2. sstat (during job execution): real-time metrics for running jobs (CPU, memory, I/O), but not GPU.

     sstat -j <JOBID> --format=JobID,MaxRSS,AveCPU

  3. nvidia-smi logs: if your job logs GPU usage (e.g., via nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv), you can parse these logs post-job. Example logging snippet for a job script:

     #SBATCH --gres=gpu:1
     nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used --format=csv -l 60 > gpu_usage.log

  4. darshan (I/O profiling): Maxwell supports Darshan for I/O profiling, but it must be enabled before job submission.

     module load darshan
     export DARSHAN_ENABLE_NONMPI=1
     sbatch your_script.sh

     Post-job, generate a report:

     darshan-parser /path/to/darshan/logs/<JOBID>.darshan > io_report.txt
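To turn such a CSV log into a post-job summary, a short awk pass is enough. The sample lines below are hypothetical (the exact header and unit strings depend on the nvidia-smi version), so check your actual log and adjust the field positions if needed:

```shell
# Hypothetical excerpt of nvidia-smi CSV output; real logs may differ slightly
cat > gpu_usage_sample.log <<'EOF'
timestamp, utilization.gpu [%], memory.used [MiB]
2026/04/13 14:00:00.000, 40 %, 1024 MiB
2026/04/13 14:01:00.000, 80 %, 2048 MiB
2026/04/13 14:02:00.000, 60 %, 1536 MiB
EOF

# Average GPU utilization and peak memory, skipping the header line.
# "$2 + 0" coerces "40 %" to the number 40; likewise for "1024 MiB".
awk -F', ' 'NR > 1 { util += $2 + 0; if ($3 + 0 > peak) peak = $3 + 0; n++ }
            END { printf "avg util: %.1f%%, peak mem: %d MiB\n", util / n, peak }' gpu_usage_sample.log
```

For the sample above this prints an average of 60.0% utilization and a peak of 2048 MiB.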


3. Missing Metrics on Maxwell

Metric                     Available on Maxwell?       Tool/Workaround
GPU load/memory/power      ❌ No (not via sacct)       Parse nvidia-smi logs or use Grafana.
Per-core CPU usage         ❌ No                       Use mpstat or pidstat during the job.
Inter-node communication   ❌ No                       Use sar or application-specific tools.
Filesystem I/O             ⚠️ Limited                  Enable darshan or parse /proc.
Command-level breakdown    ❌ No                       Use ps/top logs during the job.
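As a minimal illustration of the "parse /proc" workaround in the table, the snippet below reads the memory and I/O counters of the current shell, as a stand-in for a job process. This is Linux-only and an assumption about what you want to sample; in a job script you would substitute your application's PID:

```shell
# Linux-only sketch: read RSS and I/O counters for one PID from /proc.
# $$ (the shell's own PID) stands in for a real job process here.
pid=$$

# Resident set size, e.g. "RSS: 3456 kB"
awk '/^VmRSS:/ { print "RSS:", $2, $3 }' /proc/${pid}/status

# Cumulative bytes read/written by this process
awk -F': *' '/^(read_bytes|write_bytes)/ { print $1 "=" $2 }' /proc/${pid}/io
```

Appending such a snippet to a periodic loop (as in Option 3 below) gives a crude time series without any extra tooling.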

4. Suggested Solutions

Option 1: Log GPU Metrics During Job

Add this to your Slurm script to log GPU usage:

#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --output=job_%j.out

# Log GPU stats every 60 seconds in the background
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used,power.draw --format=csv -l 60 > gpu_usage_${SLURM_JOB_ID}.log &
LOGGER_PID=$!

# Run your application
your_command_here

# Stop the background logger ($! would be clobbered if the application
# spawned further background jobs, so the PID is saved explicitly above)
kill $LOGGER_PID

Option 2: Use darshan for I/O Metrics

Enable Darshan in your job script:

#!/bin/bash
#SBATCH --output=job_%j.out

module load darshan
export DARSHAN_ENABLE_NONMPI=1

your_command_here

Post-job, generate a report:

darshan-parser /beegfs/desy/group/it/darshan-logs/<JOBID>.darshan > io_report.txt

Option 3: Parse /proc for CPU/Memory

Add this to your script to log CPU/memory usage:

#!/bin/bash
#SBATCH --output=job_%j.out

# Take a process snapshot every 60 seconds in the background
while true; do
  ps aux >> cpu_mem_${SLURM_JOB_ID}.log
  sleep 60
done &
LOGGER_PID=$!

your_command_here

# Stop the background logging loop
kill $LOGGER_PID
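The resulting log can be reduced post-job with awk. The sample below is a hypothetical excerpt of ps aux output (the column layout is standard, but your command names will differ); it finds the snapshot line with the highest CPU share:

```shell
# Hypothetical excerpt of a ps-aux log like the one Option 3 produces
cat > cpu_mem_sample.log <<'EOF'
USER  PID %CPU %MEM    VSZ   RSS TTY STAT START TIME COMMAND
user 1001 95.0  2.5 400000 51200 ?   R    14:00 5:00 ./simulate
user 1002  5.0  0.1  20000  2048 ?   S    14:00 0:01 bash
EOF

# Track the maximum %CPU seen (field 3); header lines are skipped
# because their third field ("%CPU") is not numeric.
awk '$3 ~ /^[0-9.]+$/ && $3 + 0 > max { max = $3; cmd = $11 }
     END { printf "hottest command: %s at %.1f%% CPU\n", cmd, max }' cpu_mem_sample.log
```

For the sample above this reports ./simulate at 95.0% CPU. Summing or averaging per COMMAND is a small extension of the same pattern.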

Option 4: Request Maxwell-Specific Tools

If you frequently need such reports, consider asking the Maxwell team to:

  • Add GPU metrics to sacct (e.g., via Slurm’s gres accounting).
  • Provide a post-job summary tool (similar to Hummel’s report).


5. Example Workflow for Maxwell

  1. Submit a job with logging:

     sbatch --gres=gpu:1 --output=job_%j.out job_script.sh

  2. Post-job analysis:
  • Parse gpu_usage_<JOBID>.log for GPU metrics.
  • Parse cpu_mem_<JOBID>.log for CPU/memory.
  • Use darshan-parser for I/O metrics.

Sources

  1. Maxwell Slurm Documentation
  2. Darshan I/O Profiler
  3. NVIDIA nvidia-smi Documentation