A practical configuration for a 32 GB M5 Mac that needs to remain usable for everyday work
Running large language models locally has become surprisingly practical on Apple Silicon. With a modern Mac, Ollama, and a carefully quantized GGUF model, it is possible to run models that only a short time ago would have felt out of reach for a personal machine.
This post collects the practical findings from configuring Gemma 4 31B on a 32 GB Apple Silicon Mac with an M5 processor, using Ollama and a highly compressed GGUF quantization.
The goal is not to squeeze every last token per second out of the machine. The goal is more realistic:
Run a capable 31B local model while keeping the Mac usable for normal work: browser, IDE, terminal, notes, chat apps, and light development tools.
That distinction matters. A configuration that works for a benchmark is not necessarily a configuration you want to live with all day.
The model
The model used in this setup is:
gemma-4-31B-it-UD-IQ3_XXS.gguf
This is an aggressively quantized GGUF build of Gemma 4 31B. The IQ3_XXS quantization makes the model small enough to fit on machines that would otherwise be unable to run a 31B model at all.
The trade-off is obvious: this is not the highest quality quantization, but it gives access to a much larger model class on consumer hardware.
The Ollama Modelfile starts from the local GGUF file:
FROM ./gemma-4-31B-it-UD-IQ3_XXS.gguf
Understanding the main Ollama parameters
Before tuning the configuration, it helps to understand what the key parameters actually control.
num_ctx
num_ctx controls the maximum context window used by the model.
A larger context means the model can keep more conversation, documents, code, or instructions in memory. But it also increases memory usage, especially through the KV cache.
For this model, useful values are:
6144 (conservative)
8192 (balanced)
12288 (aggressive)
For daily use on a 32 GB Mac, 8192 is a good target.
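If you want to try a different context size without rebuilding the model, the Ollama HTTP API accepts per-request options. A minimal sketch, assuming the default local port and the model tag created later in this post; the prompt is just an example:
# One-off request with an explicit context window.
curl -s http://localhost:11434/api/generate -d '{
  "model": "gemma4-31b-iq3xxs-32gb",
  "prompt": "Summarize unified memory in one sentence.",
  "stream": false,
  "options": { "num_ctx": 8192 }
}'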
num_batch
num_batch affects how many tokens are processed together during prompt ingestion.
It mostly affects how fast the model reads the input prompt, rather than the speed at which it generates the answer token by token.
Higher values can improve responsiveness with longer prompts, but they also increase temporary memory pressure.
Good values for this setup are:
32 (conservative)
64 (balanced)
96 (aggressive)
For a daily driver configuration, 64 is a reasonable compromise. If the Mac becomes sluggish or the runner crashes, this is one of the first values to reduce.
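To compare two batch sizes without touching the main model, a throwaway variant works well. A small sketch, assuming the Modelfile created later in this post; the lowbatch tag is just an example name:
# Derive a low-batch variant from the existing Modelfile and register it.
sed 's/num_batch 64/num_batch 32/' Modelfile > Modelfile.lowbatch
ollama create gemma4-31b-iq3xxs-lowbatch -f ./Modelfile.lowbatch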
num_gpu
In Ollama, num_gpu does not mean “number of GPUs”. It means how many model layers are offloaded to the GPU.
For Gemma 4 31B, the theoretical maximum is:
num_gpu 60
because the model has 60 layers.
However, full offload is not always the best practical choice. On a machine that must remain usable for other work, leaving some margin is often better than maximizing GPU offload.
For a 32 GB M5 Mac, a good range is:
50 (conservative)
55 (balanced)
60 (aggressive, full offload)
The recommended daily value is 55.
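A rough estimate of what each offloaded layer costs helps put these numbers in perspective. The arithmetic below assumes the GGUF file is about 12 GB; check the real size first, since the per-layer figure scales with it:
# Check the actual size of the quantized model file.
ls -lh gemma-4-31B-it-UD-IQ3_XXS.gguf
# Assuming ~12 GB of weights spread over 60 layers, each layer holds roughly
# 200 MB, so num_gpu 55 instead of 60 keeps about 1 GB of weights off the GPU
# (those layers then run on the CPU instead).
echo $(( 12 * 1024 / 60 ))   # ~204 MB per layer under the 12 GB assumption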
The recommended daily configuration
This is the configuration I would use as a balanced daily driver.
It gives a useful context window, keeps prompt processing reasonably fast, and avoids pushing the system too close to the edge.
FROM ./gemma-4-31B-it-UD-IQ3_XXS.gguf
PARAMETER num_ctx 8192
PARAMETER num_batch 64
PARAMETER num_gpu 55
PARAMETER temperature 1.0
PARAMETER top_p 0.95
PARAMETER top_k 64
Create or replace the model with:
cd ~/ollama-models/gemma4-31b-iq3
cat > Modelfile <<'EOF'
FROM ./gemma-4-31B-it-UD-IQ3_XXS.gguf
PARAMETER num_ctx 8192
PARAMETER num_batch 64
PARAMETER num_gpu 55
PARAMETER temperature 1.0
PARAMETER top_p 0.95
PARAMETER top_k 64
EOF
# Stop and remove any previous version of this model, then create it again from the Modelfile.
ollama stop gemma4-31b-iq3xxs-32gb
ollama rm gemma4-31b-iq3xxs-32gb
ollama create gemma4-31b-iq3xxs-32gb -f ./Modelfile
Then test it:
ollama run gemma4-31b-iq3xxs-32gb "Write a Python hello world."
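Two further checks are worth doing at this point: confirming that the parameters were actually baked into the created model, and getting a rough timing number. Both use standard Ollama CLI commands:
# Show the Modelfile stored with the created model, including the PARAMETER lines.
ollama show --modelfile gemma4-31b-iq3xxs-32gb
# Run a prompt with timing statistics (prompt eval rate vs. generation eval rate).
ollama run --verbose gemma4-31b-iq3xxs-32gb "Write a Python hello world."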
Environment variables
For this setup, Flash Attention and quantized KV cache are useful.
If using the Ollama macOS app, set the variables with launchctl:
launchctl setenv OLLAMA_FLASH_ATTENTION "1"
launchctl setenv OLLAMA_KV_CACHE_TYPE "q8_0"
launchctl setenv OLLAMA_CONTEXT_LENGTH "8192"
launchctl setenv OLLAMA_KEEP_ALIVE "5m"
Then quit and reopen Ollama from the menu bar.
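The values can be read back at any time to confirm they are actually set in the launchd environment:
# Read back the values that GUI apps launched afterwards will inherit.
launchctl getenv OLLAMA_FLASH_ATTENTION
launchctl getenv OLLAMA_KV_CACHE_TYPE
launchctl getenv OLLAMA_CONTEXT_LENGTH
launchctl getenv OLLAMA_KEEP_ALIVE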
If running Ollama manually from the terminal:
OLLAMA_FLASH_ATTENTION=1 \
OLLAMA_KV_CACHE_TYPE=q8_0 \
OLLAMA_CONTEXT_LENGTH=8192 \
OLLAMA_KEEP_ALIVE=5m \
ollama serve
What OLLAMA_KV_CACHE_TYPE=q8_0 does
The KV cache stores the internal attention state for tokens that have already been processed. It grows with the active context length.
Using:
OLLAMA_KV_CACHE_TYPE=q8_0
asks Ollama to store the KV cache in an 8-bit quantized format.
The practical effect is:
lower memory usage
better support for longer context windows
usually minimal quality loss compared to f16 KV cache
It is especially useful when running large models with larger context windows.
However, it does not reduce the memory used by the model weights themselves. It only helps with the memory consumed by the context.
For this reason, q8_0 is useful, but it is not a magic fix for every memory issue.
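To make the saving concrete, here is a back-of-the-envelope estimate of the KV cache at 8192 tokens of context. The layer count matches the model above, but the KV head count and head dimension are purely illustrative assumptions, not the real Gemma values:
# Rough KV cache size: 2 (K and V) * layers * kv_heads * head_dim * context
# * bytes per element. kv_heads=8 and head_dim=128 are illustrative guesses.
awk 'BEGIN {
  layers = 60; kv_heads = 8; head_dim = 128; ctx = 8192
  f16 = 2 * layers * kv_heads * head_dim * ctx * 2    # 2 bytes per value
  q8  = 2 * layers * kv_heads * head_dim * ctx * 1    # ~1 byte per value
  printf "f16 KV cache: ~%.1f GB, q8_0 KV cache: ~%.1f GB\n", f16 / 2^30, q8 / 2^30
}'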
Flash Attention
Flash Attention should be enabled whenever the quantized KV cache is used; in Ollama, KV cache quantization only takes effect when Flash Attention is on:
OLLAMA_FLASH_ATTENTION=1
It helps reduce memory usage and can improve efficiency with larger contexts.
For this model, I would keep it enabled.
Keeping the Mac usable
Apple Silicon uses unified memory. The CPU and GPU share the same memory pool.
That means a local LLM can directly compete with everything else on the Mac:
browser tabs
IDE
Docker
database containers
video calls
Slack or Teams
file indexing
other development tools
This is why the best configuration is not necessarily the most aggressive one.
The aim is to leave enough headroom that the system does not constantly swap or become unresponsive.
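A quick way to check whether the system is actually under pressure is to look at memory pressure and swap activity rather than raw free memory. Both tools below ship with macOS:
# Snapshot of current system-wide memory pressure.
memory_pressure
# Page-out and swap counters; steadily climbing numbers mean the system is swapping.
vm_stat | grep -E "Pageouts|Swapins|Swapouts"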
Wired memory limit
Some users increase the Metal wired memory limit with:
sudo sysctl iogpu.wired_limit_mb=22000
On a 32 GB Mac, I would use one of these values:
20000 (conservative workstation mode)
22000 (balanced daily mode)
24000 (LLM-priority mode)
For daily work, I would start with:
sudo sysctl iogpu.wired_limit_mb=22000
I would avoid pushing this too close to total system memory. Values like 28000 or 30000 may help a benchmark but can make the machine unpleasant to use for normal work.
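Note that this sysctl does not survive a reboot, so it has to be set again after restarting. The current value can be read back at any time:
# Read the current wired memory limit (0 usually means the built-in default).
sysctl iogpu.wired_limit_mb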
Three practical profiles
1. Workstation profile
Use this when you also need Docker, an IDE, many browser tabs, or other heavy tools.
FROM ./gemma-4-31B-it-UD-IQ3_XXS.gguf
PARAMETER num_ctx 6144
PARAMETER num_batch 32
PARAMETER num_gpu 50
PARAMETER temperature 1.0
PARAMETER top_p 0.95
PARAMETER top_k 64
Recommended environment:
sudo sysctl iogpu.wired_limit_mb=20000
launchctl setenv OLLAMA_FLASH_ATTENTION "1"
launchctl setenv OLLAMA_KV_CACHE_TYPE "q8_0"
launchctl setenv OLLAMA_CONTEXT_LENGTH "6144"
launchctl setenv OLLAMA_KEEP_ALIVE "2m"
This is the safest profile for a Mac that must remain responsive.
2. Balanced daily profile
This is the recommended default.
FROM ./gemma-4-31B-it-UD-IQ3_XXS.gguf
PARAMETER num_ctx 8192
PARAMETER num_batch 64
PARAMETER num_gpu 55
PARAMETER temperature 1.0
PARAMETER top_p 0.95
PARAMETER top_k 64
Recommended environment:
sudo sysctl iogpu.wired_limit_mb=22000
launchctl setenv OLLAMA_FLASH_ATTENTION "1"
launchctl setenv OLLAMA_KV_CACHE_TYPE "q8_0"
launchctl setenv OLLAMA_CONTEXT_LENGTH "8192"
launchctl setenv OLLAMA_KEEP_ALIVE "5m"
This is the profile I would use most of the time.
3. LLM-priority profile
Use this when the Mac is mostly dedicated to local inference.
FROM ./gemma-4-31B-it-UD-IQ3_XXS.gguf
PARAMETER num_ctx 12288
PARAMETER num_batch 96
PARAMETER num_gpu 60
PARAMETER temperature 1.0
PARAMETER top_p 0.95
PARAMETER top_k 64
Recommended environment:
sudo sysctl iogpu.wired_limit_mb=24000
launchctl setenv OLLAMA_FLASH_ATTENTION "1"
launchctl setenv OLLAMA_KV_CACHE_TYPE "q8_0"
launchctl setenv OLLAMA_CONTEXT_LENGTH "12288"
launchctl setenv OLLAMA_KEEP_ALIVE "10m"
This is more aggressive. It may be useful for focused LLM sessions, but I would not use it while doing normal development work.
How to verify what is happening
After running the model, check:
ollama ps
This shows whether the model is loaded, how much memory it is using, and how the load is split between CPU and GPU.
To inspect logs:
grep -iE "offload|layers|gpu|metal|memory" ~/.ollama/logs/server.log | tail -n 80
Look for messages indicating how many layers were offloaded to the GPU.
For example:
offloaded 55/60 layers to GPU
That confirms that num_gpu 55 is actually being applied.
Troubleshooting instability
If Ollama crashes or the Mac becomes sluggish, reduce parameters in this order.
First reduce num_batch:
64 -> 32
Then reduce num_ctx:
8192 -> 6144
Then reduce num_gpu:
55 -> 50
Finally reduce the wired memory limit:
22000 -> 20000
A stable machine is more useful than a theoretical maximum configuration that crashes during real work.
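If stepping the values down one at a time is not enough, it is often faster to fall back to the workstation profile in one go and work back up from there. A small sketch reusing the values from that profile; the Modelfile.safe name is just an example:
# Recreate the model with the conservative workstation-profile values.
cd ~/ollama-models/gemma4-31b-iq3
cat > Modelfile.safe <<'EOF'
FROM ./gemma-4-31B-it-UD-IQ3_XXS.gguf
PARAMETER num_ctx 6144
PARAMETER num_batch 32
PARAMETER num_gpu 50
PARAMETER temperature 1.0
PARAMETER top_p 0.95
PARAMETER top_k 64
EOF
ollama create gemma4-31b-iq3xxs-32gb -f ./Modelfile.safe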
About Metal crashes
During experimentation, one possible failure mode is a Metal backend crash, with logs similar to:
ggml-metal-device.m:608: GGML_ASSERT([rsets->data count] == 0) failed
panic during panic
When this happens, it is often better to reset the setup rather than keep pushing the same configuration.
A practical recovery sequence is:
# Kill both the menu bar app and any CLI server instance.
killall Ollama
killall ollama
# Clear the environment overrides so the next launch starts clean.
launchctl unsetenv OLLAMA_FLASH_ATTENTION
launchctl unsetenv OLLAMA_KV_CACHE_TYPE
launchctl unsetenv OLLAMA_CONTEXT_LENGTH
# Drop the wired memory limit back to a conservative value.
sudo sysctl iogpu.wired_limit_mb=20000
Then reboot the Mac and restart from a conservative profile.
Final recommendation
For a 32 GB M5 Mac that should remain useful as a normal workstation, I would use this:
sudo sysctl iogpu.wired_limit_mb=22000
launchctl setenv OLLAMA_FLASH_ATTENTION "1"
launchctl setenv OLLAMA_KV_CACHE_TYPE "q8_0"
launchctl setenv OLLAMA_CONTEXT_LENGTH "8192"
launchctl setenv OLLAMA_KEEP_ALIVE "5m"
And this Modelfile:
FROM ./gemma-4-31B-it-UD-IQ3_XXS.gguf
PARAMETER num_ctx 8192
PARAMETER num_batch 64
PARAMETER num_gpu 55
PARAMETER temperature 1.0
PARAMETER top_p 0.95
PARAMETER top_k 64
This is not the most extreme configuration. It is the one I would actually want to use.
It gives enough context for serious work, enough GPU offload for acceptable performance, and enough memory headroom to keep the Mac usable while doing other things.
That is usually the sweet spot for local LLMs: not maximum throughput, but sustainable performance.