Ollama
Ollama is a tool for running Large Language Models (LLMs) locally. It provides a command line tool and starts a server that can be used from scripts.
Availability
| Cluster | Module/Version |
|---|---|
| BOSE | ollama/0.11.4, ollama/0.12.0, ollama/0.12.10 |
| BGSC | Not Available |
Warning
Because new releases of Ollama come out frequently to support newer models, we recommend using module load ollama instead of specifying a version unless absolutely necessary.
This way you always have the most recent version when we update Ollama on the cluster.
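For example, both of the following forms work, but the unversioned form is preferred:
module load ollama            # recommended: always picks up the newest installed version
module load ollama/0.12.10    # pin a specific version only if you truly need it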
CLI Usage
Ollama provides a command line interface for terminal usage. Use man ollama or the online man page for a full list of options.
| Option | Description |
|---|---|
| run | Run a model. Note that new models must first be installed by the HPC Team. |
| list | List all installed models |
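For example, after loading the module you can check which models are installed and send a single prompt to one of them (the model name here is only an illustration; see Available Models below for what is installed):
module load ollama                          # load Ollama and start its server
ollama list                                 # show which models are installed
ollama run qwen3:1.7b "What is Ollama?"     # run one prompt against a model and print the answer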
Interactive Prompts
To directly run prompts where you can see the output in real time, we recommend using Interactive Mode on the BOSE cluster. You will have to adjust your requested CPU cores, memory, and use of GPUs depending on the model you want to use (see Available Models below).
Output
---------------------------------------------------------
Starting Interactive Session...
Please hold while we try to find an available node with your requested resources within 30 seconds.
---------------------------------------------------------
salloc: Pending job allocation 100000
salloc: job 100000 queued and waiting for resources
salloc: job 100000 has been allocated resources
salloc: Granted job allocation 100000
salloc: Waiting for resource configuration
salloc: Nodes cn08 are ready for job
[username@cn08 ~]$
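Once the interactive session starts on the compute node, load the module and start a model of your choice (the model shown here is only an example; pick one from Available Models below):
module load ollama       # load Ollama; this also starts the Ollama server
ollama run qwen3:1.7b    # open an interactive prompt with the model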
Output
>>> Hello World
Thinking...
Okay, the user sent "Hello World /think". That's a common greeting, but I need to figure out the best way to respond. Since they might be testing if I'm alive or just starting, I should acknowledge their message warmly.
I should make sure my response is friendly and open-ended. Maybe add an emoji to keep it light. Also, ask how I can assist them. Let me check if there's any specific context I'm missing, but since they just said "Hello World", it's probably a simple greeting. I'll
keep it simple and welcoming.
...done thinking.
Hello! 🌟 How can I assist you today?
API Usage
When loaded with module load ollama, Ollama starts a server that exposes an OpenAI-compatible API.
Ollama also provides official Python and JavaScript libraries.
Note
An Ollama server will start automatically when you load Ollama with module load ollama.
Note
Each Ollama server is started on a random port instead of the standard Ollama port in order to avoid conflicts when multiple Ollama instances are started on one machine. The URL of the started Ollama server is in the environment variable OLLAMA_HOST.
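As a minimal sketch of how to query that server from within the same job, assuming OLLAMA_HOST holds a bare host:port set by the module (drop the http:// prefix if it already includes a scheme) and that the model named below is installed, you can call Ollama's OpenAI-compatible chat completions endpoint with curl:
# Query the local Ollama server through its OpenAI-compatible API
curl "http://${OLLAMA_HOST}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3:1.7b", "messages": [{"role": "user", "content": "Hello World"}]}'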
Available Models
Model Limitations
The HPC Team limits which models are available to run through Ollama due to licensing concerns and to prevent large models from taking up a lot of space when used by multiple researchers. If you would like to request another model to be made available on BOSE, please contact the HPC Team.
These are the currently documented models available on our cluster, with the recommended minimum requirements to run them:
| Name | Performance (w/ minimum requirements) | File Size | Minimum Requirements |
|---|---|---|---|
| deepseek-r1:1.5b | 97 tk/s | 1.1 GB | 1xh100_11GB GPU |
| deepseek-r1:7b | 37 tk/s | 4.7 GB | 1xh100_11GB GPU |
| deepseek-r1:8b | 34 tk/s | 5.2 GB | 1xh100_11GB GPU |
| deepseek-r1:14b | 22 tk/s | 9.0 GB | 1xh100_11GB GPU |
| deepseek-r1:32b | 26 tk/s | 19 GB | 1xV100S GPU |
| deepseek-r1:70b | 16 tk/s | 42 GB | 2xV100S GPU |
| deepseek-r1:671b | N/A | 404 GB | Not enough |
| gpt-oss:20b | 65 tk/s | 13 GB | 1xV100S GPU |
| gpt-oss:120b | 44 tk/s | 65 GB | 3xV100S GPU |
| mxbai-embed-large:335m | Fast | 669 MB | 1xh100_11GB GPU |
| nomic-embed-text:latest | Fast | 274 MB | 1xh100_11GB GPU |
| qwen3:0.6b | 226 tk/s | 522 MB | 1xh100_11GB GPU |
| qwen3:1.7b | 167 tk/s | 1.4 GB | 1xV100S GPU |
| qwen3:4b | 108 tk/s | 2.6 GB | 1xV100S GPU |
| qwen3:8b | 83 tk/s | 5.2 GB | 1xV100S GPU |
| qwen3:14b | 57 tk/s | 9.3 GB | 1xV100S GPU |
| qwen3:30b | 77 tk/s | 18 GB | 1xV100S GPU |
| qwen3:32b | 27 tk/s | 20 GB | 1xV100S GPU |
| qwen3:235b | N/A | 142 GB | Not enough |
You can view the full list of installed models in the message printed when you load the Ollama module, or by running ollama list afterwards.
Model Taking Too Long?
If a model takes a long time to load, try increasing the CPU cores and memory available to Ollama. Note that we are not able to accommodate the largest models with fast token speeds due to the hardware available on our cluster, so some models are best run through Slurm to get an answer at a later time.
Sample Slurm Script
#!/bin/bash
# -- SLURM SETTINGS -- #
#SBATCH --ntasks-per-node=6 # How many CPU cores you want to request
#SBATCH --mem=10GB # How much memory you want to request
#SBATCH --partition=GPU # What partition of machines you want to run on
#SBATCH --gpus=1 # How many GPUs you want to request
# -- SCRIPT COMMANDS -- #
module load ollama # Load ollama
ollama run qwen3:1.7b "What is a Blugold?"
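Save the script (the filename below is only an example) and submit it with sbatch; the model's answer will be written to the job's Slurm output file:
sbatch ollama_job.sh    # submit the job; output appears in slurm-<jobid>.out by default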
Side note: Apparently a Blugold is a nickname for the University of South Carolina (USC) according to this model. Who knew we were on the wrong campus this entire time!
Real Example
Has your research group used ollama in a project? Contact the HPC Team and we'd be glad to feature your work.