
llama.cpp

llama.cpp is a highly efficient open-source tool written in C/C++ for running inference on a variety of locally hosted large language models (LLMs). In addition to a CLI that returns direct responses to prompts, it includes a server that can be used for chatting within a web browser when using Desktop Mode.

Availability

Cluster    Module/Version
BOSE       llama.cpp/b1726
BGSC       Not Available

Warning

Because new releases of llama.cpp come out frequently and their version labels change often, we recommend using module load llama.cpp instead of specifying a version unless absolutely necessary.

This way you always have the most recent version when we update llama.cpp on the cluster.
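
For example, to see which llama.cpp versions are installed and then load the default (a quick sketch using standard module commands):

module avail llama.cpp    # list the installed llama.cpp versions
module load llama.cpp     # load the default (most recent) version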

Pre-Installed Models

The Blugold Center for HPC keeps a select list of GGUF files available for cluster users so they don't have to download their own. This is especially useful for large models with 20+ billion parameters.

Current Models:

These are the files we currently have available for you to reference. For a real-time list of what's available, which may contain newer models than those listed below, you can run ls $MODELS_DB after loading the llama.cpp module.

Model               File Path
gpt-oss-20b-GGUF    $MODELS_DB/gpt-oss-20b-mxfp4.gguf

How to use:

When using a model in our repository, you simply prefix the name of the model file with $MODELS_DB.

llama-cli -m $MODELS_DB/gpt-oss-20b-mxfp4.gguf
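
For example, to run a one-off prompt against the pre-installed model (the -p and -n flags below are standard llama-cli options for the prompt text and the number of tokens to generate; the prompt itself is only illustrative):

llama-cli -m $MODELS_DB/gpt-oss-20b-mxfp4.gguf -p "Briefly explain what a GGUF file is." -n 128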

Downloading New Models

Models and datasets are available to download through services such as Hugging Face. Our llama.cpp installation does not allow models to be downloaded directly through llama.cpp itself, so you'll need to download them separately.

For example, to download gpt-oss-20b-GGUF, the steps would be:

  1. Search for gpt-oss-20b-GGUF on Hugging Face
  2. Click on "Files and versions"
  3. Find and click on the desired .gguf file (gpt-oss-20b-mxfp4.gguf)
  4. Click on "Copy download link"
  5. In the shell on the cluster, run wget YOUR-URL-HERE to download the file.
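
As a sketch, the download command from step 5 looks like the following; the exact URL depends on the repository and file you chose, so treat the organization name below as a placeholder rather than a guaranteed path:

wget https://huggingface.co/<organization>/gpt-oss-20b-GGUF/resolve/main/gpt-oss-20b-mxfp4.gguf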

Using llama-server

The llama-server command provides an OpenAI-compatible server that may also include a chat interface accessible within the Firefox web browser when using Desktop Mode.

Using Desktop Mode:

  1. Log into Open OnDemand - https://ondemand.hpc.uwec.edu
  2. Click "Desktop" on the dashboard, or find it by first clicking "Interactive Apps" in the top bar.
  3. Fill out your required resources to the best of your ability. Unsure what to use?
  4. Wait for the job to start, then click "Launch Desktop"
  5. Start the terminal by clicking the black square icon in the top bar, or by going to Applications --> System Tools --> MATE Terminal.
  6. Type: module load llama.cpp
  7. Run your llama.cpp commands, such as llama-server -m model-file.gguf.
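
For example, to serve the pre-installed gpt-oss-20b model from the Desktop Mode terminal (a minimal sketch using the model from $MODELS_DB):

module load llama.cpp
llama-server -m $MODELS_DB/gpt-oss-20b-mxfp4.gguf

When it starts, llama-server prints the address it is listening on; open that address in Firefox inside the Desktop session to reach the chat interface.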

If you run into an error about binding, the automatically chosen port may have been taken by another user after you initially ran module load. To resolve this, run get-llama-port before starting llama-server again; it will generate a new port.
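
Put together, the recovery looks like this (get-llama-port is our site-specific helper, so exact behavior may differ from other clusters):

get-llama-port
llama-server -m $MODELS_DB/gpt-oss-20b-mxfp4.gguf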

Sample Slurm Script

submit.sh
#!/bin/bash
# -- SLURM SETTINGS -- #
# [..] other settings here [..]

# The following settings are for the overall request to Slurm
#SBATCH --ntasks-per-node=32     # How many CPU cores do you want to request
#SBATCH --nodes=1                # How many nodes do you want to request

# -- SCRIPT COMMANDS -- #

# Load the needed modules
module load llama.cpp
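# Run a prompt against a GGUF file in the job's working directory;
# use $MODELS_DB/gpt-oss-20b-mxfp4.gguf instead to point at the pre-installed copy.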
llama-cli -m gpt-oss-20b-mxfp4.gguf -p "Hello World!"
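
Submit the script to the scheduler with sbatch:

sbatch submit.sh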

Real Example

Has your research group used llama.cpp in a project? Contact the HPC Team and we'd be glad to feature your work.

Citation

llama.cpp has no documented citation that we could find, and it is released under the MIT License.

Resources