
Scheduled Maintenance

The Blugold Center for High-Performance Computing performs scheduled maintenance at least twice a year, typically at the beginning or end of the winter and summer terms when usage is at its lowest. Announcements go out in advance to all active users, but there may be times when emergency work must be performed with minimal warning.

Upcoming Dates

The following dates reflect current plans for upcoming maintenance and are subject to change.

Date(s)                  Cluster/Service   Status
Aug 26-27, 2024          BGSC, WebMO       Complete
Aug 28-30, 2024          BOSE, OnDemand    Complete
Week of Dec 23rd, 2024   BOSE, OnDemand    Upcoming

Upcoming maintenance, along with the current status of our services, can also be found on our public website under Resources --> System Status.

Why Prepare?

One nice thing about a cluster environment is that we can perform work on individual compute nodes throughout the year, but there are times we need to take core infrastructure offline. This includes servers such as the login node, storage, and the high-speed network, all of which impact use of the cluster. When those are being worked on, we'd prefer not to have any users logged in or running jobs.

We perform tasks such as:

  • Security and Firmware Updates
  • Infrastructure Software Updates (such as Slurm and OnDemand)
  • Impactful Configuration Changes
  • New Features and Tools

How To Prepare

When we approach a scheduled maintenance period, an email will go out to all HPC-affiliated faculty, staff, and active research students. In addition, a notice will be displayed when you log into the cluster and at the top of Open OnDemand.

We recommend:

  • Downloading any files you need in advance in case access is blocked
  • Making sure any new jobs are scheduled to finish beforehand (see below)
  • Alerting HPC Staff as soon as possible if maintenance will cause any concerns

Submitting Jobs

All jobs must finish before the scheduled maintenance begins; otherwise they are subject to suspension or termination. Most jobs default to a maximum time limit of 7 days, which can prevent jobs from starting once there is less than a week left before maintenance.

To allow your job to still run, you can lower the --time limit in Slurm so that the job will finish before the maintenance window begins.
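For example, if maintenance begins in 3 days, a job submitted with --time=2-00:00:00 (2 days) can still start right away, while a job left at the default 7-day limit will be held until the maintenance period is over.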

Identifying Remaining Time

Type motd (Message of the Day) and look at the reservation list to see the remaining time before maintenance starts. Your job's time limit must be less than the value shown next to "For Slurm".

Example Message of the Day
Node Reservations
----------------------------------------------------------------
The nodes listed below have been reserved for a period of time and are not accessible
while the reservation is active. They also will not accept any jobs that will not be
completed by the Start Date.

Along with the time is what you can use with Slurm's #SBATCH --time=DD-HH:MM:SS setting
to fit jobs in before upcoming reservations.

Start Date              End Date             Node(s)
------------------------------------------------------
Aug 28 @  8:00AM   ==>  Aug 30 @  5:00PM     cn[01-56],dev01,gpu[01-04],lm[01-03] (UPCOMING)
----------------------------------------------------------------

Note the start date, end date, and nodes that will not be available for use. If you see a full node list like the one above, the entire cluster will be unavailable and no jobs may run during that window.
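The same reservation details are also available directly from Slurm's standard query command:

scontrol show reservation

This prints each reservation's name, start and end times, and node list.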

Once we are less than a week out, the "(UPCOMING)" will be changed to show the max time limit you can place in your Slurm script to allow your job to still run within the remaining time.
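For example, if the reservation above starts Aug 28 @ 8:00 AM and you submit a job at 8:00 AM on Aug 26, the largest limit that still fits is --time=2-00:00:00 (exactly 2 days); in practice, leave a little headroom.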

Before Submitting Your Job

If you haven't submitted your job yet, you can change the #SBATCH --time=DD-HH:MM:SS in your Slurm script.

You can also override the time limit in your script by specifying it at submission time: sbatch --time=DD-HH:MM:SS run-script.sh.

  • DD = Days
  • HH = Hours
  • MM = Minutes
  • SS = Seconds
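As a minimal sketch of a submission script (the job name and program here are hypothetical placeholders), a job that must finish before an Aug 28 @ 8:00 AM reservation could cap itself at 2 days like this:

#!/bin/bash
#SBATCH --job-name=my-analysis   # hypothetical job name
#SBATCH --time=2-00:00:00        # 2 days; job must end before the maintenance window begins

# Your normal job commands follow.
srun ./my_program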

Job Already Submitted?

If you already submitted a job without adjusting its time limit, you may see "(ReqNodeNotAvail, Reserved for maintenance)" next to it when you run myjobs or squeue.

You can change your job's current time limit by using the command:

scontrol update JobId=YourJobID TimeLimit=DD-HH:MM:SS

Example changing a job's time limit to 5 days:

scontrol update JobId=12345 TimeLimit=5-00:00:00
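To confirm the change took effect, you can inspect the job's current limit (12345 is the example job ID from above):

scontrol show job 12345 | grep TimeLimit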

Otherwise, your job will just sit in pending status and run once the maintenance period is over.

Have Questions? Concerns?

If you have any questions about the maintenance process, or have any concerns about an upcoming date, just let us know.

Contact the HPC Team