
Ollama

Ollama is a tool for running Large Language Models (LLMs) locally. It provides a command line tool and starts a server that can be used from scripts.

Availability

Cluster   Module/Version
BOSE      ollama/0.11.4
          ollama/0.12.0
          ollama/0.12.10
BGSC      Not Available

Warning

Due to how often new releases of Ollama come out to support newer models, we recommend using module load ollama instead of specifying a version unless absolutely necessary.

This way you always have the most recent version when we update Ollama on the cluster.

CLI Usage

Ollama provides a command line interface for terminal usage. Use man ollama or this online man page for a full list of commands.

Option   Description
run      Run a model. Note that new models must first be installed by the HPC Team.
list     List all installed models.
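
For example, once you are on a compute node, a typical session might look like the sketch below (the model name is only an example; pick one from Available Models below):

Command Line
module load ollama    # Load Ollama (this also starts an Ollama server)
ollama list           # Show which models are installed
ollama run qwen3:4b   # Open an interactive prompt with a model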

Interactive Prompts

To run prompts directly and see the output in real time, we recommend using Interactive Mode on the BOSE cluster. You will have to adjust your requested CPU cores, memory, and GPUs depending on the model you want to use (see Available Models below).

Command Line
sinteract --ntasks=64    # Request 64 Cores and log into a compute server
Output
---------------------------------------------------------
Starting Interactive Session...

Please hold while we try to find an available node with your requested resources within 30 seconds.
---------------------------------------------------------
salloc: Pending job allocation 100000
salloc: job 100000 queued and waiting for resources
salloc: job 100000 has been allocated resources
salloc: Granted job allocation 100000
salloc: Waiting for resource configuration
salloc: Nodes cn08 are ready for job

[username@cn08 ~]$
Command Line
module load ollama
ollama run qwen3:4b
Output
>>> Hello World
Thinking...
Okay, the user sent "Hello World /think". That's a common greeting, but I need to figure out the best way to respond. Since they might be testing if I'm alive or just starting, I should acknowledge their message warmly.

I should make sure my response is friendly and open-ended. Maybe add an emoji to keep it light. Also, ask how I can assist them. Let me check if there's any specific context I'm missing, but since they just said "Hello World", it's probably a simple greeting. I'll
keep it simple and welcoming.
...done thinking.

Hello! 🌟 How can I assist you today?

API Usage

When loaded with module load ollama, Ollama starts an OpenAI-compatible API server.

Ollama also provides official Python and JavaScript libraries.

Note

An Ollama server will start automatically when you load Ollama with module load ollama.

Note

Each Ollama server starts on a random port instead of the standard Ollama port in order to avoid conflicts when multiple Ollama instances run on the same machine. The URL of the started Ollama server is stored in the environment variable OLLAMA_HOST.
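
As a minimal sketch of API usage, you can send a request to that server with curl through the OpenAI-compatible chat completions endpoint. This assumes OLLAMA_HOST holds a full URL (for example http://127.0.0.1:<port>); if it only contains host:port, prepend http://. The model name is just an example from the table below.

Command Line
module load ollama    # Starts an Ollama server and sets OLLAMA_HOST

# Send a single chat request to the OpenAI-compatible endpoint
curl "$OLLAMA_HOST/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen3:4b",
        "messages": [{"role": "user", "content": "Hello World"}]
      }'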

Available Models

Model Limitations

The HPC Team limits which models are available to run through Ollama due to licensing concerns and to prevent large models from taking up a lot of space when used by multiple researchers. If you would like to request another model to be made available on BOSE, please contact the HPC Team.

These are the currently documented models available on our cluster, along with the minimum recommended requirements to run them:

Name                      Performance w/ Minimum Requirements   File Size   Requirements
deepseek-r1:1.5b          97 tk/s                               1.1 GB      1xh100_11GB GPU
deepseek-r1:7b            37 tk/s                               4.7 GB      1xh100_11GB GPU
deepseek-r1:8b            34 tk/s                               5.2 GB      1xh100_11GB GPU
deepseek-r1:14b           22 tk/s                               9.0 GB      1xh100_11GB GPU
deepseek-r1:32b           26 tk/s                               19 GB       1xV100S GPU
deepseek-r1:70b           16 tk/s                               42 GB       2xV100S GPU
deepseek-r1:671b          N/A                                   404 GB      Not enough
gpt-oss:20b               65 tk/s                               13 GB       1xV100S GPU
gpt-oss:120b              44 tk/s                               65 GB       3xV100S GPU
mxbai-embed-large:335m    Fast                                  669 MB      1xh100_11GB GPU
nomic-embed-text:latest   Fast                                  274 MB      1xh100_11GB GPU
qwen3:0.6b                226 tk/s                              522 MB      1xh100_11GB GPU
qwen3:1.7b                167 tk/s                              1.4 GB      1xV100S GPU
qwen3:4b                  108 tk/s                              2.6 GB      1xV100S GPU
qwen3:8b                  83 tk/s                               5.2 GB      1xV100S GPU
qwen3:14b                 57 tk/s                               9.3 GB      1xV100S GPU
qwen3:30b                 77 tk/s                               18 GB       1xV100S GPU
qwen3:32b                 27 tk/s                               20 GB       1xV100S GPU
qwen3:235b                N/A                                   142 GB      Not enough

You can view the list of all installed models when you load the Ollama module, or by running ollama list afterwards.

Model Taking Too Long?

If a model takes a long time to load, try increasing the CPU cores and memory available to Ollama. Note that we are not able to accommodate the largest models at fast token speeds with the hardware available on our cluster, so some models are best run through Slurm to get an answer at a later time.

Sample Slurm Script

submit.sh
#!/bin/bash
# -- SLURM SETTINGS -- #
#SBATCH --ntasks-per-node=6     # How many CPU cores you want to request
#SBATCH --mem=10GB              # How much memory you want to request
#SBATCH --partition=GPU         # What partition of machines you want to run on
#SBATCH --gpus=1                # How many GPUs you want to request

# -- SCRIPT COMMANDS -- #
module load ollama                 # Load ollama
ollama run qwen3:1.7b "What is a Blugold?"
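
Submit the script with sbatch and read the model's answer from the Slurm output file once the job finishes (the file name below assumes Slurm's default slurm-<jobid>.out naming, and the job ID shown is just an example):

Command Line
sbatch submit.sh        # Submit the job to the scheduler
cat slurm-100000.out    # View the model's answer after the job completes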

Side note: Apparently a Blugold is a nickname for the University of South Carolina (USC), according to this model. Who knew we were on the wrong campus this entire time!

Real Example

Has your research group used Ollama in a project? Contact the HPC Team and we'd be glad to feature your work.

Resources