View Job Results
A job can finish or stop for a variety of reasons, such as:
- Completed (Successfully)
- Failed
- Node Failure
- Timeout
- Canceled
When your job stops, you'll want to know its status, the results of your calculations, and any errors that appeared.
Output Files
Beyond output files created by your program itself, Slurm can write standard output and error text to files for you to view in real time. This is the same text you would see if you ran the program's commands manually at the command prompt.
#SBATCH --output=output-%j.txt # Standard Output Text
#SBATCH --error=error-%j.txt # Standard Error Text
Using Variables?
The "%j" in the example above will automatically inject the id of your Slurm job into the file name. This will allow you to retain past Slurm output logs rather than overwriting them each time, which is very beneficial for troubleshooting.
For other variables you can use in your #SBATCH lines, check out this guide on Slurm's website.
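As a minimal sketch, a batch script using these directives might look like the following; the job name, time limit, task count, and program command are placeholders you would replace with your own values:
#!/bin/bash
#SBATCH --job-name=my-analysis       # Placeholder job name
#SBATCH --output=output-%j.txt       # Standard output, tagged with the Slurm job ID
#SBATCH --error=error-%j.txt         # Standard error, tagged with the Slurm job ID
#SBATCH --time=01:00:00              # Example time limit (1 hour)
#SBATCH --ntasks=1                   # Example: a single task

# Replace this with the commands your job actually runs
./my_program input.dat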
Resource Usage
To view how much CPU and memory your job uses, we have the jobeff command available. Slurm's efficiency program was built to give users a look at how well their job used the resources it requested after the job stops; however, we have extended it to support near real-time metrics by linking to our Grafana dashboards.
jobeff <jobid>
Note: jobeff works best for long-running jobs and may not collect enough information for anything that runs for just a few minutes.
[user@bose g16]$ jobeff 65931
---- Slurm Efficiency Report ----
Job ID: 65931
Cluster: bose
User/Group: user/SFU_Users
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 64
CPU Utilized: 00:10:26
CPU Efficiency: 30.57% of 00:34:08 core-walltime
Job Wall-clock time: 00:00:32
Memory Utilized: 1.02 GB # (1)!
Memory Efficiency: 1.60% of 64.00 GB
---- Grafana Links ----
cn43 | https://metrics.hpc.uwec.edu/d/aaba6Ahbauquag/job-performance?orgId=1&theme=default&from=1704734501400&to=1704734534600&var-cluster=BOSE&var-host=cn43&var-jobid=6593
- Note that memory usage statistics are not perfect and are only captured every 30 seconds or so. This may cause a discrepancy, especially for jobs that quickly result in an "OUT_OF_MEMORY" error. Use this as a guideline, and check out the Grafana links to see how usage changes over time.
You can visit your job's metrics in our online Grafana dashboard by clicking the link (you may have to hold Ctrl or Cmd while clicking) or by copying and pasting it into your web browser. Once on the page, you'll see graphs of your CPU and memory usage over the lifetime of your job, as well as metrics for the shared node as a whole.
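If you no longer remember the ID of a finished job, the standard Slurm accounting command sacct can list it for you; the fields shown below are just one reasonable selection, and the final jobeff call reuses the job ID from the example above:
# List your jobs from today with their state and elapsed time
sacct --user=$USER --starttime=today --format=JobID,JobName,State,Elapsed

# Then pass the job ID you are interested in to jobeff
jobeff 65931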