llama.cpp
llama.cpp is a highly efficient open-source tool written in C/C++ for running large language model (LLM) inference on a variety of locally hosted models. In addition to a CLI for receiving direct responses to prompts, it includes a server that can be used for chatting in a web browser when using Desktop Mode.
Availability
| Cluster | Module/Version |
|---|---|
| BOSE | llama.cpp/b1726 |
| BGSC | Not Available |
Warning
Due to how often new releases of llama.cpp come out and how they are labeled, we recommend using module load llama.cpp instead of specifying a version unless absolutely necessary. This way you always have the most recent version when we update llama.cpp on the cluster.
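For example, on BOSE the two ways of loading the module look like this (the versioned form uses the build listed in the table above):

```bash
# Recommended: always tracks the newest llama.cpp build installed on the cluster
module load llama.cpp

# Only if you absolutely need to pin a specific build
module load llama.cpp/b1726
```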
Pre-Installed Models
The Blugold Center for HPC keeps a select set of GGUF model files available for cluster users so you don't have to download your own copies. This is especially useful for large models with 20+ billion parameters.
Current Models:
These are the model files we currently have available for you to reference. For a real-time list of what's available, which may contain newer models than those listed below, you can run ls $MODELS_DB after loading the llama.cpp module, as shown after the table.
| Model | File Path |
|---|---|
| gpt-oss-20b-GGUF | $MODELS_DB/gpt-oss-20b-mxfp4.gguf |
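If you'd like to check for yourself, a quick way to list the repository contents from a cluster shell is shown below; the output reflects whatever models are installed at the time you run it.

```bash
module load llama.cpp
ls $MODELS_DB
```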
How to use:
When using a model in our repository, you simply prefix the name of the model file with $MODELS_DB.
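For example, a minimal sketch of running the pre-installed gpt-oss-20b model interactively with llama-cli (the prompt is just a placeholder):

```bash
module load llama.cpp
llama-cli -m $MODELS_DB/gpt-oss-20b-mxfp4.gguf -p "Hello World!"
```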
Downloading New Models
Models and datasets are available to download through services such as Hugging Face. Our llama.cpp installation does not allow you to download models directly through llama.cpp itself, so you'll need to download them separately.
For example, to download gpt-oss-20b-GGUF, the steps would be:
- Search for gpt-oss-20b-GGUF on Hugging Face
- Direct Url: https://huggingface.co/ggml-org/gpt-oss-20b-GGUF
- Click on "Files and versions"
- Find and click on the desired .gguf file (gpt-oss-20b-mxfp4.gguf)
- Click on "Copy download link"
- In a shell on the cluster, run wget YOUR-URL-HERE to download the file, as shown below.
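Putting the steps together, the download would look roughly like the following; the URL shown here is illustrative, so use the exact link you copied from the "Copy download link" button.

```bash
# Illustrative only -- substitute the link you copied from Hugging Face
wget https://huggingface.co/ggml-org/gpt-oss-20b-GGUF/resolve/main/gpt-oss-20b-mxfp4.gguf
```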
Using llama-server
The llama-server command provides an OpenAI-compatible server with a chat interface that can be accessed in the Firefox web browser when using Desktop Mode.
Using Desktop Mode:
- Log into Open OnDemand - https://ondemand.hpc.uwec.edu
- Click "Desktop" on the dashboard, or by first clicking "Interactive Apps" in the top bar.
- Fill out your required resources to the best of your ability.
- Wait for the job to start, then click "Launch Desktop"
- Start the terminal by clicking on the black square icon in the top bar, or by going to Applications --> System Tools --> MATE Terminal
- Type: module load llama.cpp
- Run your llama.cpp commands, such as llama-server -m model-file.gguf, as in the example below.
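For example, a minimal session in the MATE Terminal using the pre-installed gpt-oss-20b model from the shared repository might look like this:

```bash
module load llama.cpp
llama-server -m $MODELS_DB/gpt-oss-20b-mxfp4.gguf
```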
If you run into an error about binding, the port that was automatically chosen for you when you ran module load may have since been taken by another user. To resolve this, run get-llama-port before starting llama-server again; it will generate a new port.
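In that case, the recovery sequence is simply to regenerate the port and relaunch; the model path below is the pre-installed gpt-oss-20b file and is only an example.

```bash
get-llama-port    # regenerates the port that llama-server will use
llama-server -m $MODELS_DB/gpt-oss-20b-mxfp4.gguf
```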
Sample Slurm Script
#!/bin/bash
# -- SLURM SETTINGS -- #
# [..] other settings here [..]
# The following settings are for the overall request to Slurm
#SBATCH --ntasks-per-node=32 # How many CPU cores do you want to request
#SBATCH --nodes=1 # How many nodes do you want to request
# -- SCRIPT COMMANDS -- #
# Load the needed modules
module load llama.cpp
llama-cli -m $MODELS_DB/gpt-oss-20b-mxfp4.gguf -p "Hello World!"
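To run the batch job, save the script to a file and submit it with sbatch; the filename llama_job.sh below is just a placeholder.

```bash
sbatch llama_job.sh
```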
Real Example
Has your research group used llama.cpp in a project? Contact the HPC Team and we'd be glad to feature your work.
Citation
llama.cpp has no documented citation that we could find; it is released under the MIT License.