Scheduled Maintenance
The Blugold Center for High-Performance Computing performs scheduled maintenance at least twice a year, typically at the beginning or end of the winter and summer terms, when usage is at its lowest. Announcements will go out in advance to all active users, but there may be times when emergency work needs to be performed with minimal warning.
Upcoming Dates
The following dates are the current plans for upcoming maintenance and are subject to change.
Date(s) | Cluster/Service | Status |
---|---|---|
Aug 26-27, 2024 | BGSC, WebMO | Complete |
Aug 28-30, 2024 | BOSE, OnDemand | Complete |
Week of Dec 23rd, 2024 | BOSE, OnDemand | Upcoming |
Upcoming maintenance, along with the current status of our services, can also be found on our public website under Resources --> System Status.
Why Prepare?
One nice thing about a cluster environment is that we can perform work on individual compute nodes throughout the year, but there will be times when we need to take core infrastructure offline. This includes the login node, storage, and the high-speed network, all of which impact usage of the cluster. When those are being worked on, we'd prefer not to have any users logged in or running jobs.
We perform tasks such as:
- Security and Firmware Updates
- Infrastructure Software Updates (such as Slurm and OnDemand)
- Impactful configuration changes
- Rollout of new features and tools
How To Prepare
When we approach a scheduled maintenance period, an email will go out to all HPC-affiliated faculty, staff, and active research students. In addition, a notice will be displayed when you log into the cluster and at the top of Open OnDemand.
We recommend:
- Download any files you need in advance in case access is blocked
- Make sure any new jobs are scheduled to finish beforehand (see below)
- Alert HPC Staff as soon as possible if maintenance is going to cause any concerns
Submitting Jobs
All jobs must finish before the scheduled maintenance begins; otherwise they are subject to suspension or termination. Most jobs default to a maximum time limit of 7 days, so a job submitted with the default limit will not start if maintenance is less than a week away.
To allow your job to still run, you can lower its time limit with Slurm's --time flag, as long as the job will finish before maintenance begins.
Identifying Remaining Time
Type motd (Message of the Day) and look at the reservation list to see the remaining time before maintenance starts. Your job's time limit must be less than what's specified next to "For Slurm".
Example Message of the Day
Node Reservations
----------------------------------------------------------------
The nodes listed below have been reserved for a period of time and are not accessible
while the reservation is active. They also will not accept any jobs that will not be completed by the Start Date.
Along with the time is what you can use with Slurm's #SBATCH --time=DD-HH:MM:SS setting to fit jobs in before
upcoming reservations.
Start Date End Date Node(s)
------------------------------------------------------
Aug 28 @ 8:00AM ==> Aug 30 @ 5:00PM cn[01-56],dev01,gpu[01-04],lm[01-03] (UPCOMING)
----------------------------------------------------------------
Note the start date, end date, and nodes that will not be available for use. If you see a full list such as the one above, the entire cluster will be unavailable for running jobs.
Once we are less than a week out, the "(UPCOMING)" will be changed to show the max time limit you can place in your Slurm script to allow your job to still run within the remaining time.
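The limit shown in the Message of the Day is the value to rely on. Purely to illustrate the arithmetic behind it, here is a minimal bash sketch that turns the seconds remaining before a maintenance start into a DD-HH:MM:SS string. The function names and the one-hour safety margin are assumptions made for this example and are not part of the cluster's tooling.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; function names are hypothetical.

# Format a number of seconds as a Slurm-style days-hours:minutes:seconds string.
format_slurm_time() {
    local total=$1
    local days=$(( total / 86400 ))
    local hours=$(( (total % 86400) / 3600 ))
    local minutes=$(( (total % 3600) / 60 ))
    local seconds=$(( total % 60 ))
    printf '%d-%02d:%02d:%02d\n' "$days" "$hours" "$minutes" "$seconds"
}

# Print the longest time limit that still ends before maintenance starts.
# Arguments: current time and maintenance start, both as Unix epoch seconds,
# plus an optional safety margin in seconds (default 3600, an assumption).
time_before_maintenance() {
    local now=$1 start=$2 margin=${3:-3600}
    local remaining=$(( start - now - margin ))
    if (( remaining <= 0 )); then
        echo "no time left before maintenance" >&2
        return 1
    fi
    format_slurm_time "$remaining"
}

# Possible usage with GNU date (replace the date with the real start time):
#   start=$(date -d '2024-12-23 08:00' +%s)
#   time_before_maintenance "$(date +%s)" "$start"
```

The printed value could then be passed to --time, provided it does not exceed the limit listed in the Message of the Day.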
Before Submitting Your Job
If you haven't submitted your job yet, you can change the #SBATCH --time=DD-HH:MM:SS line in your Slurm script.
You can also override the time limit set in your script by specifying it on the command line when you submit your job: sbatch --time=DD-HH:MM:SS run-script.sh
- DD = Days
- HH = Hours
- MM = Minutes
- SS = Seconds
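To sanity-check a limit before submitting, the DD-HH:MM:SS format can be converted to seconds and compared against the time remaining. The bash sketch below is one possible way to do that. Both function names are hypothetical, and the values in the example are made up.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; function names are hypothetical.

# Convert a DD-HH:MM:SS string to a number of seconds.
slurm_time_to_seconds() {
    local spec=$1
    if [[ $spec =~ ^([0-9]+)-([0-9]{2}):([0-9]{2}):([0-9]{2})$ ]]; then
        # The 10# prefix forces base 10 so fields like "08" are not read as octal.
        local d=$(( 10#${BASH_REMATCH[1]} ))
        local h=$(( 10#${BASH_REMATCH[2]} ))
        local m=$(( 10#${BASH_REMATCH[3]} ))
        local s=$(( 10#${BASH_REMATCH[4]} ))
        echo $(( d * 86400 + h * 3600 + m * 60 + s ))
    else
        echo "expected DD-HH:MM:SS, got: $spec" >&2
        return 1
    fi
}

# Succeed when the requested limit ($1) fits within the remaining time ($2).
fits_before_maintenance() {
    local requested remaining
    requested=$(slurm_time_to_seconds "$1") || return 1
    remaining=$(slurm_time_to_seconds "$2") || return 1
    (( requested <= remaining ))
}

# Example with made-up values: 2 days 12 hours requested, 3 days remaining.
if fits_before_maintenance 02-12:00:00 03-00:00:00; then
    echo "limit fits"
fi
```

Slurm accepts other time formats as well; this sketch only handles the DD-HH:MM:SS form used on this page.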
Job Already Submitted?
First, find your job's ID by running myjobs or squeue.
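You can change your job's current time limit by using the command: scontrol update JobId=JOBID TimeLimit=DD-HH:MM:SS (where JOBID is your job's ID)
Example changing a job's time limit to 5 days: scontrol update JobId=JOBID TimeLimit=5-00:00:00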
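As a hedged sketch, a submitted job's limit can be changed with Slurm's scontrol update command, wrapped here in a small bash helper with a dry-run switch. The helper name, the DRY_RUN variable, and the job ID 123456 are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; the helper name and DRY_RUN switch are hypothetical.

# Change a submitted job's time limit with `scontrol update`.
# With DRY_RUN=1 the command is printed instead of being executed.
set_job_time_limit() {
    local job_id=$1 limit=$2
    local cmd=(scontrol update "JobId=${job_id}" "TimeLimit=${limit}")
    if [[ ${DRY_RUN:-0} == 1 ]]; then
        printf '%s\n' "${cmd[*]}"
    else
        "${cmd[@]}"
    fi
}

# Preview the command for a made-up job ID and a 5-day limit.
DRY_RUN=1 set_job_time_limit 123456 5-00:00:00
```

Dropping DRY_RUN=1 runs the command for real. On most Slurm installations regular users can lower a job's time limit but cannot raise it, so contact HPC staff if a job needs more time.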
Otherwise, your job will just sit in pending status and run once the maintenance period is over.
Have Questions? Concerns?
If you have any questions about the maintenance process, or have any concerns about an upcoming date, just let us know.