
H100 Usage

Work-In-Progress

This guide, and the overall setup of the server, is temporary while the usage and configuration of the H100 node are being fleshed out. Any major changes in usage will be announced in advance.

Current Setup

Our new H100 node, known as h1gpu01, has two H100 cards that are split in multiple ways to provide a mix of performance versus capacity. This is done using the Multi-Instance GPU (MIG) feature available on certain NVIDIA cards; the current split is listed under GPU + MIG Setup below.

Node Specs

CPU: AMD EPYC 9354 32-Core (x2)
CPU Cores (Total): 64
CPU Clock (Base/Boost): 3.25 GHz / 3.75 GHz
Memory: 384 GB
Local Scratch: 800 GB

GPU + MIG Setup

Quantity  VRAM   Slurm Setting
7         11 GB  --gres=gpu:h100_11gb
1         94 GB  --gres=gpu:h100_94gb
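
As a quick check, a short interactive job can confirm which MIG slice Slurm assigns; the account name and time limit below are placeholders, so substitute your own approved account.

# Request a single 11 GB MIG slice interactively (placeholder account).
srun --partition=h1gpu --account=my-research-group \
     --gres=gpu:h100_11gb --time=00:10:00 --pty bash

# Inside the session, list the devices visible to the job;
# the MIG slice appears as a "MIG" entry under the parent H100.
nvidia-smi -L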

Last Updated: 9/08/2025

Capacity vs Performance - Changes Often

The configuration of this server changes frequently based on performance versus capacity needs. Depending on what faculty need for classes and projects, the H100 may be adjusted at any time to match those needs.

Before you use the H100, check the table above for the current configuration and adjust your scripts accordingly. You can also query the live configuration directly from Slurm, as shown below.
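
This is only a sketch of how to query the current GRES layout; the node name h1gpu01 comes from the section above, and the sinfo format string is just one reasonable choice.

# Show the GRES (GPU/MIG) layout the node currently advertises.
scontrol show node h1gpu01 | grep -i gres

# Summarize the nodes and GRES available in the h1gpu partition.
sinfo -p h1gpu -o "%N %G"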

Limited Access

The following accounts are currently approved to use the H100 server. Use myaccounts to see which accounts your user account belongs to (see the example after the list below). If you need access to this server for your research, please contact us to discuss options.

  • gomesr_reu
  • gomesr_pdac_scans
  • 2261.cs.426.001 (CS 426 - Fall 2025)
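
If myaccounts is not available in your shell, a standard Slurm accounting query along these lines should show the same associations; the format string is only a suggestion.

# List the Slurm accounts associated with your user.
sacctmgr show associations user=$USER format=Account%30,Partition%15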

Slurm Settings

Using the H100 GPU requires a few Slurm settings that differ from those used on other GPU nodes.

Partition: #SBATCH --partition=h1gpu

Account: #SBATCH --account=ABC (Only approved accounts are supported)

GRES: #SBATCH --gres=gpu:h100_11gb (Instead of --gpus=)

Time: Max 2 Days

Slurm-Script.sh
#!/bin/bash
#SBATCH --account=my-research-group   # replace with an approved account
#SBATCH --partition=h1gpu
#SBATCH --gres=gpu:h100_94gb          # or gpu:h100_11gb for a smaller MIG slice
#SBATCH --time=2-00:00:00             # 2 days is the maximum

# Load the Python environment and run the workload.
module load python-libs
conda activate my-env
python script.py
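
Once saved, the script is submitted and monitored with the usual Slurm commands:

# Submit the job, then check its state in the h1gpu partition.
sbatch Slurm-Script.sh
squeue -u $USER -p h1gpu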

Known Issues

Full usage of the H100 is still in development and requires several changes that are being worked out. Some of these are specific to the use of MIG on the H100, and some relate to other components in this compute node that are newer than those in other servers in our cluster.

  1. The 11 GB option is not yet available to other groups automatically. (The 94 GB option will remain restricted.)
  2. Jupyter support is currently not available, but is being worked on.
  3. The metrics dashboard does not currently show usage metrics for the GPU cards due to how NVIDIA reports stats for MIG-based devices.
  4. Slurm does not track GPU utilization or memory for statistics.
  5. The high-speed Slingshot network connecting the node to the storage servers is not yet available, which impacts heavy read/write operations (see the staging sketch below).
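
Until the Slingshot link is available, heavy I/O can often be softened by staging data onto the node's 800 GB local scratch at the start of the job and copying results back at the end. This is only a sketch: the $TMPDIR path and the script flags are assumptions, so confirm the actual scratch location with us first.

# Stage inputs to local scratch ($TMPDIR is an assumed path; confirm the real one).
cp -r /path/to/input "$TMPDIR"/

# Run against local scratch (the --data/--out flags are hypothetical and belong to your own script).
python script.py --data "$TMPDIR"/input --out "$TMPDIR"/results

# Copy results back to shared storage before the job ends.
cp -r "$TMPDIR"/results /path/to/project/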