Example: Building A Basic Machine Learning Model With Python

Machine learning is a large field of computer science that often makes use of HPC. This guide focuses on the setup needed to run a Python script that trains a model, not on how the script or machine learning in general works.

For this basic example we will use the well-known Iris dataset.

CPU Only

Building the Program

The CPU is the general-purpose processing hardware in a computer. It handles most tasks on the system, including basic machine learning. We can create a Python script and save it as train.py.

train.py
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.datasets import load_iris
from joblib import parallel_backend

# Load the Iris dataset from sklearn
data = load_iris()
X = data.data  # Features (sepal length, sepal width, etc.)
y = data.target  # Labels (species of Iris)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# We use Random Forest for this example
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Use joblib's parallel_backend to enable parallelism using multiple cores
with parallel_backend('loky', n_jobs=-1):  # Use all available CPU cores
    scores = cross_val_score(model, X_train, y_train, cv=5, n_jobs=-1)

print("Cross-validation scores: ", scores)
print("Average cross-validation score: ", np.mean(scores))

# Train the model on the full training set
model.fit(X_train, y_train)

# Evaluate the model on the test set
test_accuracy = model.score(X_test, y_test)
print(f"Test Accuracy: {test_accuracy:.4f}")

Building the Job Script

We now have a program that builds the machine learning model. Next we need to tell Slurm to run the program and what resources it needs. We can create a new file called train-model.sh.

#!/bin/bash

#SBATCH --time=00-01:00:00
#SBATCH --ntasks=16

#SBATCH --error=error-%j.txt

module load python-libs

python train.py

In this bash script we tell Slurm a few key properties of our job using #SBATCH directives:

  • The job will run for at most 1 hour, set with --time
  • It needs access to 16 CPU cores, requested with --ntasks=16 (see the sketch after these lists)
  • It should write any errors to error-xxxxx.txt (where xxxxx is the job ID), set with --error

Then we include the instructions for running the job:

  • module load python-libs makes conda available, the software we use to access Python libraries. Loading this module also activates the base conda environment, which is managed by the admins and includes the common Python libraries this script needs
  • We run our training script with python train.py
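
A note on n_jobs=-1: it tells joblib to use every core it can detect on the node, which may be more than Slurm actually granted the job. A minimal sketch of how the allocated core count could be read instead, assuming a Linux node where Slurm exports SLURM_NTASKS (the name allocated_cores is just for illustration):

import os

# Cores Slurm granted this job; fall back to the cores this process may
# actually run on (sched_getaffinity is Linux-only).
allocated_cores = int(os.environ.get("SLURM_NTASKS", len(os.sched_getaffinity(0))))
print("Cores available to this job:", allocated_cores)

In train.py you could then pass n_jobs=allocated_cores instead of n_jobs=-1, so the script never asks for more cores than the job was given.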

Running the Job

Now that we have our program and the sbatch script, we can submit the job with sbatch train-model.sh.

You should get an output similar to this:

Submitted batch job xxxxx

Reading the Results

If you read the contents of the slurm-xxxxx.out file, you will see the output of your script, which should look similar to

Cross-validation scores:  [0.95833333 0.95833333 0.83333333 1.         0.95833333]
Average cross-validation score:  0.9416666666666667
Test Accuracy: 1.0000

If there are any errors in the program, they will be redirected to error-xxxxx.txt, as specified in our job script.
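
The job only writes metrics to standard output, so the trained model itself is gone once the job finishes. If you also want to keep the model, one option is a small addition at the end of train.py, sketched below using the joblib package the script already imports (the filename iris-model.joblib is arbitrary):

from joblib import dump, load

# Persist the fitted RandomForest (the `model` variable from train.py).
dump(model, "iris-model.joblib")

# The saved model can be reloaded later and used without retraining:
reloaded = load("iris-model.joblib")
print("Reloaded test accuracy:", reloaded.score(X_test, y_test))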

You have successfully built a machine learning model with Python and run it on the cluster!

With GPU

GPUs excel at workloads like machine learning, where large amounts of data can be processed in parallel. We have GPUs available for use on both clusters. To use them, we need to adjust our Python and sbatch scripts.

Building the Program

We can create a new file called gpu-train.py.

gpu-train.py
import tensorflow as tf
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow.keras import layers, models
import numpy as np

# Ensure that the script recognizes the available GPUs
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))

iris = datasets.load_iris()
X = iris.data
y = iris.target

# Standardizing the features (important for neural networks)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Define the model architecture
model = models.Sequential([
    layers.InputLayer(input_shape=(X_train.shape[1],)),
    layers.Dense(64, activation='relu'),
    layers.Dense(32, activation='relu'),
    layers.Dense(3, activation='softmax')  # 3 classes in the Iris dataset
])

# Compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

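# Train the network; the held-out test split is reused here as validation data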
model.fit(X_train, y_train, epochs=20, batch_size=16, validation_data=(X_test, y_test))

# Evaluate the model
test_loss, test_acc = model.evaluate(X_test, y_test, verbose=2)
print(f"Test accuracy: {test_acc}")

Building the Job Script

Because we are now training on a GPU, the program above switches from the Random Forest model to a neural network built with TensorFlow. We can now create our GPU sbatch script, gpu-train-model.sh.

#!/bin/bash

#SBATCH --partition=GPU
#SBATCH --gpus=1
#SBATCH --time=00-01:00:00
#SBATCH --ntasks=1

#SBATCH --error=error-%j.txt

module load python-libs

conda activate tensorflow-2.11-gpu

python gpu-train.py

In addition to the settings we used last time, we include instructions for Slurm to

  • submit the job to the GPU partition with --partition=GPU
  • allocate one GPU for the job with --gpus=1
  • use a single task with --ntasks=1, since the training runs in one process

We also activate the tensorflow-2.11-gpu environment with conda, giving us access to TensorFlow's machine learning libraries built to run on GPUs.
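
If you want to confirm that the job really sees the GPU Slurm allocated, a small check can be added to gpu-train.py. On many Slurm setups the allocated card is exposed through the CUDA_VISIBLE_DEVICES environment variable, though this depends on the cluster configuration:

import os
import tensorflow as tf

# The card(s) Slurm handed to this job (configuration-dependent).
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
# The card(s) TensorFlow can actually use; the two should agree.
print("GPUs visible to TensorFlow:", len(tf.config.list_physical_devices('GPU')))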

Running the Job

We can run it with sbatch gpu-train-model.sh.

You should again get an output similar to

Submitted batch job xxxxx

Reading the Output

And in the slurm-xxxxx.out file you should see something like

Loading python-libs/3.0
  Loading requirement: gcc/10.3.0 cuda/11.4 openblas/0.3.15
Num GPUs Available:  1
Epoch 1/20
8/8 [==============================] - 1s 20ms/step - loss: 1.0665 - accuracy: 0.3750 - val_loss: 0.9331 - val_accuracy: 0.7333
Epoch 2/20
8/8 [==============================] - 0s 4ms/step - loss: 0.8940 - accuracy: 0.6917 - val_loss: 0.7912 - val_accuracy: 0.7667
Epoch 3/20
8/8 [==============================] - 0s 4ms/step - loss: 0.7783 - accuracy: 0.8000 - val_loss: 0.6844 - val_accuracy: 0.8333
Epoch 4/20
8/8 [==============================] - 0s 4ms/step - loss: 0.6792 - accuracy: 0.8250 - val_loss: 0.5986 - val_accuracy: 0.8667
Epoch 5/20
8/8 [==============================] - 0s 4ms/step - loss: 0.6031 - accuracy: 0.8167 - val_loss: 0.5210 - val_accuracy: 0.9000
Epoch 6/20
8/8 [==============================] - 0s 4ms/step - loss: 0.5344 - accuracy: 0.8250 - val_loss: 0.4542 - val_accuracy: 0.9000
Epoch 7/20
8/8 [==============================] - 0s 4ms/step - loss: 0.4771 - accuracy: 0.8333 - val_loss: 0.3972 - val_accuracy: 0.9000
Epoch 8/20
8/8 [==============================] - 0s 4ms/step - loss: 0.4287 - accuracy: 0.8500 - val_loss: 0.3531 - val_accuracy: 0.9000
Epoch 9/20
8/8 [==============================] - 0s 4ms/step - loss: 0.3918 - accuracy: 0.8417 - val_loss: 0.3188 - val_accuracy: 0.9000
Epoch 10/20
8/8 [==============================] - 0s 4ms/step - loss: 0.3623 - accuracy: 0.8500 - val_loss: 0.2901 - val_accuracy: 0.9000
Epoch 11/20
8/8 [==============================] - 0s 4ms/step - loss: 0.3386 - accuracy: 0.8583 - val_loss: 0.2668 - val_accuracy: 0.9333
Epoch 12/20
8/8 [==============================] - 0s 4ms/step - loss: 0.3180 - accuracy: 0.8667 - val_loss: 0.2498 - val_accuracy: 0.9333
Epoch 13/20
8/8 [==============================] - 0s 4ms/step - loss: 0.2997 - accuracy: 0.8833 - val_loss: 0.2327 - val_accuracy: 0.9333
Epoch 14/20
8/8 [==============================] - 0s 4ms/step - loss: 0.2835 - accuracy: 0.8833 - val_loss: 0.2169 - val_accuracy: 0.9333
Epoch 15/20
8/8 [==============================] - 0s 4ms/step - loss: 0.2689 - accuracy: 0.9000 - val_loss: 0.2025 - val_accuracy: 0.9333
Epoch 16/20
8/8 [==============================] - 0s 4ms/step - loss: 0.2534 - accuracy: 0.9000 - val_loss: 0.1920 - val_accuracy: 0.9333
Epoch 17/20
8/8 [==============================] - 0s 4ms/step - loss: 0.2425 - accuracy: 0.8833 - val_loss: 0.1830 - val_accuracy: 0.9333
Epoch 18/20
8/8 [==============================] - 0s 4ms/step - loss: 0.2308 - accuracy: 0.9167 - val_loss: 0.1712 - val_accuracy: 0.9333
Epoch 19/20
8/8 [==============================] - 0s 4ms/step - loss: 0.2184 - accuracy: 0.9417 - val_loss: 0.1625 - val_accuracy: 0.9333
Epoch 20/20
8/8 [==============================] - 0s 4ms/step - loss: 0.2098 - accuracy: 0.9417 - val_loss: 0.1508 - val_accuracy: 0.9667
1/1 - 0s - loss: 0.1508 - accuracy: 0.9667 - 16ms/epoch - 16ms/step
Test accuracy: 0.9666666388511658

You have successfully built a machine learning model with Python and run it on the cluster with a GPU!