Run GLM-4.7 Locally: Optimizing Z.ai's Thinking Model
- Olivia Johnson

- Dec 25, 2025
- 4 min read

Z.ai’s release of GLM-4.7 has set a new benchmark for open-weights "thinking models," boasting state-of-the-art scores on SWE-bench (73.8%) and Terminal Bench 2.0 (41.0%). However, the base model is a behemoth: a 355-billion-parameter giant requiring 400GB of disk space. For most people, that is simply out of reach.
But the landscape has changed. Thanks to Unsloth’s Dynamic 2.0 quantization, you can now run GLM-4.7 locally, compressed down to ~135GB with minimal loss in reasoning capability. By leveraging Mixture-of-Experts (MoE) offloading strategies in llama.cpp, the model is now accessible on high-end workstations and certain GPU+RAM configurations.
This guide details the exact flags, regex commands, and hardware setups required to tame this model.
Field Notes: Hardware Reality and the Unsloth Solution

To run GLM-4.7 locally, you must understand its architecture. It is an MoE (Mixture of Experts) model. Unlike dense models, it doesn't use every parameter for every token. This architecture allows for aggressive optimization if you know how to split the load between your GPU (VRAM) and CPU (System RAM).
Validated Hardware Tiers
1. The "Golden" Consumer Setup (2-Bit Dynamic)
Hardware: 1x 24GB GPU (RTX 3090/4090) + 128GB DDR4/DDR5 RAM.
Model: Unsloth Dynamic 2-bit (UD-Q2_K_XL).
Performance: Usable. You must use MoE offloading (explained below) to move the "expert" layers to system RAM while keeping the active processing on the GPU.
2. The High-Fidelity Setup (4-Bit)
Hardware: 1x 40GB+ GPU (A6000/Mac Studio) + 205GB+ System RAM.
Model: Q4_K_XL or Q4_K_M.
Performance: Expect around 5 tokens per second (t/s) if you have 205GB of unified memory or combined VRAM+RAM. Below 205GB, you will face significant slowdowns due to swapping.
3. The Mac Silicon Tier
Requirement: 192GB Unified Memory is the realistic floor for decent performance. Ideally, 205GB+ gives the 4-bit quants room to breathe.
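Before committing to a tier, it is worth confirming what you actually have. A quick check on Linux (assuming an NVIDIA GPU and the standard nvidia-smi and free utilities):

```bash
# Total VRAM per GPU, in MiB
nvidia-smi --query-gpu=memory.total --format=csv,noheader

# Total and available system RAM, in GiB
free -g
```

The rough rule from the tiers above: combined VRAM + RAM should exceed the size of the quant you plan to run (~135GB for UD-Q2_K_XL), or you will spill into disk swap.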
Critical Configuration for llama.cpp
Before you execute a single command, there are two non-negotiable settings you must respect. Ignoring these will result in garbage output or broken chat loops.
1. The --jinja Flag
Standard llama.cpp builds may struggle with GLM's specific chat templates. You must append the --jinja flag to your launch command. This forces the engine to use the Jinja2 template embedded in the model, ensuring the conversation history and "thinking" tags are parsed correctly.
2. Temperature Sensitivity
GLM-4.7 is a "thinking" model that behaves differently depending on the task.
Default / Agent / Multi-turn: Set Temperature = 1.0 and Top_p = 0.95. This is the sweet spot for general reasoning.
Benchmarks / Strict Coding: Set Temperature = 0.7 and Top_p = 1.0. Use this for SWE-bench style tasks where precision outweighs creativity.
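If you serve the model with llama-server, you can also set these per request through its OpenAI-compatible endpoint instead of baking them into the launch command. A minimal sketch, assuming a llama-server instance on its default port 8080 (the prompt is just an illustration):

```bash
# "Strict coding" profile: temperature 0.7, top_p 1.0
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "messages": [{"role": "user", "content": "Write a binary search in Python."}],
          "temperature": 0.7,
          "top_p": 1.0
        }'
```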
The Secret Sauce: MoE Offloading Strategies
This is the most important part of this guide. If you are trying to run this on a 24GB card, you cannot simply load all layers to GPU (-ngl 99). You need to selectively offload the MoE "expert" layers to your system RAM.
llama.cpp allows this via the -ot (override tensor) flag using regex.
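If you want to sanity-check what a pattern will catch before sitting through a 135GB model load, you can test it offline against a few tensor names. A sketch using grep; the names below follow llama.cpp's usual blk.N.ffn_*_exps naming for MoE experts, but check your own GGUF's tensor list to be sure:

```bash
# Only the expert tensors should survive the filter
printf '%s\n' \
    "blk.3.attn_q.weight" \
    "blk.3.ffn_gate_exps.weight" \
    "blk.12.ffn_up_exps.weight" \
    "blk.12.ffn_down_exps.weight" |
grep -E ".ffn_.*_exps."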
Strategy A: Maximum Offloading (Lowest VRAM)
This moves all MoE layers to CPU RAM. This is the safest bet for a 24GB card.

```bash
-ot ".ffn_.*_exps.=CPU"
```

Strategy B: Balanced Offloading (Medium VRAM)
If you have more VRAM headroom, you can offload just the Up/Down projection layers:

```bash
-ot ".ffn_(up|down)_exps.=CPU"
```

Strategy C: Hybrid Offloading (Custom)
You can even offload layers starting from a specific depth (e.g., layer 6 onwards), keeping the early layers on the GPU for speed:

```bash
-ot "\.(6|7|8|9|[0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU"
```

Step-by-Step Guide to Run GLM-4.7

We recommend using llama.cpp for the best control over memory splitting.
1. Install Dependencies
Get the transfer tools ready for a large download.
```bash
pip install huggingface_hub hf_transfer
```

2. Download the Model
We recommend the UD-Q2_K_XL version. It balances size (135GB) with Unsloth's "Dynamic 2.0" accuracy, which maintains SOTA performance on MMLU.
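If you would rather download from the shell, huggingface-cli can do the same filtered download (a sketch; it assumes a reasonably recent huggingface_hub install). Otherwise, use the Python snippet below.

```bash
huggingface-cli download unsloth/GLM-4.7-GGUF \
    --include "*UD-Q2_K_XL*" \
    --local-dir unsloth/GLM-4.7-GGUF
```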
```python
import os
from huggingface_hub import snapshot_download

os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0"  # Set to 0 to avoid rate limits

snapshot_download(
    repo_id = "unsloth/GLM-4.7-GGUF",
    local_dir = "unsloth/GLM-4.7-GGUF",
    allow_patterns = ["*UD-Q2_K_XL*"],
)
```

3. The Launch Command
Here is the robust command for a Linux/WSL environment (assuming a 24GB GPU + 128GB RAM setup).
```bash
./llama-cli \
    --model unsloth/GLM-4.7-GGUF/UD-Q2_K_XL/GLM-4.7-UD-Q2_K_XL-00001-of-00003.gguf \
    --n-gpu-layers 99 \
    --jinja \
    --threads 32 \
    --ctx-size 16384 \
    --flash-attn \
    --temp 1.0 \
    --top-p 0.95 \
    -ot ".ffn_.*_exps.=CPU"
```

Note: Even though we set --n-gpu-layers 99, the -ot flag overrides this for the massive expert layers, forcing them to CPU while keeping the attention heads on your GPU.
Advanced Tuning: Long Contexts and Ollama Setup
Squeezing Context (KV Cache Quantization)
GLM-4.7 supports a 128k context window, but that eats RAM alive. To fit longer documents, you should quantize the K and V caches. Adding --cache-type-k q4_1 and --cache-type-v q4_1 can significantly reduce memory overhead with negligible accuracy loss.
Note: You must compile llama.cpp with -DGGML_CUDA_FA_ALL_QUANTS=ON to use Flash Attention with quantized V-cache.
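Concretely, that means a CUDA build with the extra flag, then the cache-type flags appended to your launch command. A sketch, assuming a CUDA GPU and the standard CMake build (the 64k context here is just an example):

```bash
# Build llama.cpp with quantized-KV Flash Attention support
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON
cmake --build build --config Release -j

# Add the cache quantization flags to the launch command from earlier
./llama-cli \
    --model unsloth/GLM-4.7-GGUF/UD-Q2_K_XL/GLM-4.7-UD-Q2_K_XL-00001-of-00003.gguf \
    --n-gpu-layers 99 \
    --jinja \
    --flash-attn \
    --ctx-size 65536 \
    --cache-type-k q4_1 \
    --cache-type-v q4_1 \
    -ot ".ffn_.*_exps.=CPU"
```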
Running in Ollama
Unsloth has optimized specific quants for Ollama.
The 1-bit Native: The UD-TQ1 quant runs natively.
```bash
OLLAMA_MODELS=unsloth ollama run hf.co/unsloth/GLM-4.7-GGUF:TQ1_0
```
The 2-bit Workaround: To run the larger 2-bit quant in Ollama, you first need to merge the GGUF chunks using llama-gguf-split and then point Ollama to the single file using a Modelfile or the path.
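A sketch of that workflow, assuming the sharded Q2 files from the download step and an existing llama.cpp build (the merged filename and the model name glm47-q2 are illustrative):

```bash
# 1. Merge the sharded GGUF into a single file
./llama-gguf-split --merge \
    unsloth/GLM-4.7-GGUF/UD-Q2_K_XL/GLM-4.7-UD-Q2_K_XL-00001-of-00003.gguf \
    GLM-4.7-UD-Q2_K_XL-merged.gguf

# 2. Point Ollama at the merged file via a Modelfile
cat > Modelfile <<'EOF'
FROM ./GLM-4.7-UD-Q2_K_XL-merged.gguf
PARAMETER temperature 1.0
PARAMETER top_p 0.95
EOF

# 3. Create and run the model
ollama create glm47-q2 -f Modelfile
ollama run glm47-q2
```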
FAQ: Troubleshooting GLM-4.7
Q: I get an error about "chat template" or incorrect generation.
A: You forgot the --jinja flag. GLM-4.7 requires this specific flag in llama.cpp to process its system prompts and turn-taking correctly.
Q: My generation speed is extremely slow (under 1 t/s).
A: You are likely swapping to your hard drive (SSD/NVMe). Ensure your combined VRAM + Physical RAM exceeds the model size (135GB for 2-bit). If you are relying on disk swap, the performance will tank.
Q: Can I use the 4-bit version on a Mac Studio?
A: Yes, but only if you have the 192GB memory configuration (M2/M3 Ultra). If you have 128GB, you must stick to the Dynamic 2-bit (UD-Q2_K_XL) version.
Q: What is the regex to save the most VRAM?
A: Use -ot ".ffn_.*_exps.=CPU". This moves all expert layers to the CPU, leaving only the "backbone" of the model on the GPU.
Q: Does 2-bit quantization make the model stupid?
A: Not with Unsloth's Dynamic 2.0 method. They prioritize keeping the "smart" layers at higher precision. Benchmarks show the 2-bit version performs nearly identically to the full precision model in coding and logic tasks, unlike older static quantization methods.


