
How-to Guide

How to deploy GLM-5.2 locally for Chinese-language tasks

GLM-5.2 is a 753B-parameter open-weights model that is reportedly strong on Chinese. Learn how to deploy it locally with quantisation for translation, summarisation, and code generation tasks.


TL;DR

GLM-5.2 is Z.ai's 753B-parameter open-weights model, released under an MIT licence in mid-June 2026. It's a Mixture-of-Experts design (roughly 40B parameters active per token, not a dense 753B), and it's reportedly strong on Chinese-language work alongside its headline coding and agentic abilities. This guide walks through running it locally with 4-bit quantisation: installation, loading the model, speeding up inference, and serving it behind an API for Chinese NLP, translation, and bilingual code generation.

Key takeaways

  • Architecture: GLM (General Language Model), Mixture-of-Experts, 753B total params (~40B active per token)
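  • Quantisation: 4-bit (NF4 via bitsandbytes in this guide) brings the weights to roughly 376GB (estimate: 753B params at 0.5 bytes each); use model parallelism across multiple GPUs
  • Pricing: Reportedly ~$1.40/$4.40 via the official API, or ~$1.20/$3.20 via OpenRouter; no per-token fee if self-hosted (hardware and power costs still apply)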
  • Strengths: Long-horizon coding and agentic tasks; reportedly Chinese text generation, translation, bilingual coding
  • Context: 1M tokens, with max output around 131K tokens

Analysis

Z.ai shipped GLM-5.2 in mid-June 2026, and the early reaction from people who track open models was hard to miss. Simon Willison called it probably the most powerful text-only open-weights LLM available, and the full weights weigh in at about 1.51TB. That is a lot of model to put on your own hardware.

The pitch for an Australian team is simple. The weights are open under an MIT licence, so once you have the hardware, there's no per-token bill and your data never leaves your servers. For anyone handling client documents or bilingual content they'd rather not pipe through a third-party API, that matters.

The catch is the size. A 1.51TB model in full precision needs more VRAM than most businesses will ever rack up. The workaround is quantisation: squeeze the weights down to 4 bits each, drop the footprint to roughly a quarter of the original, and run it on a multi-GPU box you can actually buy. That's what the rest of this guide does, step by step, ending with a working API server.

One thing worth saying up front: GLM-5.2 is positioned by its makers as a general-purpose, long-horizon coding and agentic model, and it tops several intelligence and coding leaderboards. Its strength specifically on Chinese-language tasks is reported rather than independently benchmarked, so treat the translation and bilingual claims below as a use case to test on your own data, not a settled fact.

Prerequisites

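  • Linux server with enough combined VRAM for roughly 376GB of 4-bit weights plus working memory, for example 8x A100 (80GB) as a rough estimate. 2x A100 (80GB) and 4x RTX 4090 (24GB) offer only 160GB and 96GB, which is short of the 4-bit weight size
  • 512GB system RAM minimum
  • Python 3.10+, CUDA 12.1+
  • About 1.6TB free disk space for the full-precision download (the weights are about 1.51TB)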
  • transformers >= 4.40.0

A note before you buy anything: 753B params at 4 bits works out to roughly 376GB of weights, which a single 80GB A100 can't hold on its own. The single-GPU numbers later in this guide are unverified estimates, and at least one of them doesn't square with that math. Plan for multiple GPUs.
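The sizing arithmetic above can be sketched in a few lines. This is a back-of-envelope estimate that counts weights only (no KV cache, activations, or quantisation metadata); the function names and the 80GB card size are illustrative assumptions.

```python
import math

def weight_size_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in decimal GB (weights only)."""
    return n_params * bits_per_param / 8 / 1e9

def min_gpus(weights_gb: float, vram_per_gpu_gb: float) -> int:
    """Lower bound on GPU count; real deployments need extra headroom."""
    return math.ceil(weights_gb / vram_per_gpu_gb)

TOTAL_PARAMS = 753e9  # total parameter count quoted in this guide

bf16_gb = weight_size_gb(TOTAL_PARAMS, 16)
q4_gb = weight_size_gb(TOTAL_PARAMS, 4)

print(f"16-bit weights: {bf16_gb:.0f} GB")
print(f"4-bit weights:  {q4_gb:.1f} GB")
print(f"80GB cards needed for 4-bit weights alone: {min_gpus(q4_gb, 80)}")
```

Five 80GB cards is the floor for the weights alone, which is why a one- or two-card setup does not add up.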

Step-by-Step Framework

Step 1: Install Dependencies

# Create environment
python3 -m venv glm-env
source glm-env/bin/activate

# Install with CUDA support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install transformers accelerate bitsandbytes sentencepiece
pip install fastapi uvicorn  # For API serving
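
Before moving on, it can help to confirm the environment actually contains what Step 1 installed. The check below is a minimal sketch using only the standard library; the helper names are made up for this guide, and it tests only whether each package can be found, not whether CUDA works.

```python
import importlib.util
import sys

# Import names of the packages installed in Step 1
REQUIRED = ["torch", "transformers", "accelerate", "bitsandbytes",
            "sentencepiece", "fastapi", "uvicorn"]

def missing_packages(names):
    """Return the subset of `names` that cannot be found on the import path."""
    return [n for n in names if importlib.util.find_spec(n) is None]

def python_ok(minimum=(3, 10)):
    """True when the running interpreter meets the minimum version."""
    return sys.version_info[:2] >= minimum

if __name__ == "__main__":
    print("Python 3.10+:", python_ok())
    print("Missing packages:", missing_packages(REQUIRED) or "none")
```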

Step 2: Download the Model

GLM-5.2 lives on Hugging Face under the zai-org organisation. Note the repo path: it's zai-org/GLM-5.2, not the older THUDM org that hosts the legacy ChatGLM models. Older guides sometimes point at the wrong repo, so double-check the id before you kick off the download. The full-precision weights are about 1.51TB, so confirm you have the disk space first.

# download.py
import os

from huggingface_hub import login, snapshot_download

# Log in with a token from https://huggingface.co/settings/tokens,
# exported beforehand as the HF_TOKEN environment variable
login(token=os.environ["HF_TOKEN"])

# Interrupted downloads resume automatically when the script is re-run
snapshot_download(
    repo_id="zai-org/GLM-5.2",
    local_dir="./glm-5.2",
)

If you're pulling from inside China, ModelScope is usually faster. Confirm the exact ModelScope id on the model's own page before relying on it; the path below is unconfirmed:

from modelscope import snapshot_download
snapshot_download("ZhipuAI/GLM-5.2", cache_dir="./glm-5.2")
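
Whichever source you use, a quick free-space check before starting can save a failed multi-hour download. This is a minimal sketch: the helper name and the 10% safety margin are assumptions, and 1510GB is simply the roughly 1.51TB full-weights figure quoted earlier in this guide.

```python
import shutil

def has_room(path: str, required_gb: float, margin: float = 1.1) -> bool:
    """True if the filesystem holding `path` has required_gb * margin free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb * margin

if __name__ == "__main__":
    # Check the directory you plan to download into
    print("Enough space for full weights:", has_room(".", 1510))
```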

Step 3: Load with 4-Bit Quantisation

# load_model.py
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

MODEL_PATH = "./glm-5.2"

# 4-bit quantisation config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,      # Nested quantisation
    bnb_4bit_quant_type="nf4"            # Normalised float 4-bit
)

print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)

print("Loading model with 4-bit quantisation...")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    quantization_config=bnb_config,
    device_map="auto",                    # Auto-distribute across GPUs
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    low_cpu_mem_usage=True                # Stream weights from disk
)

print(f"Model loaded. GPUs: {torch.cuda.device_count()}")
for i in range(torch.cuda.device_count()):
    print(f"  GPU {i}: {torch.cuda.get_device_name(i)}")

The nf4 quant type and double quantisation are what keep quality close to the full-precision model while cutting memory hard. device_map="auto" hands the layout to Accelerate, which spreads layers across whatever GPUs it finds.
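
To see why double quantisation is worth switching on at this scale, the sketch below estimates the memory spent on quantisation scales alone. It assumes the block sizes described in the QLoRA paper (one 32-bit scale per 64 weights, re-quantised to 8 bits with one 32-bit constant per 256 scales); the defaults in your bitsandbytes version may differ, and not every tensor is necessarily quantised, so read the result as illustrative.

```python
def scale_overhead_bits(block=64, double_quant=False, dq_block=256):
    """Extra bits per weight spent on quantisation scales (assumed block sizes)."""
    if not double_quant:
        return 32 / block                        # one fp32 scale per block
    # 8-bit scales, plus one fp32 constant per dq_block scales
    return 8 / block + 32 / (block * dq_block)

def overhead_gb(n_params, bits_per_weight):
    """Convert a per-weight overhead in bits to decimal GB."""
    return n_params * bits_per_weight / 8 / 1e9

N = 753e9  # total parameter count quoted in this guide
plain = scale_overhead_bits()
nested = scale_overhead_bits(double_quant=True)
print(f"plain:  {plain:.3f} bits/weight, {overhead_gb(N, plain):.1f} GB")
print(f"nested: {nested:.3f} bits/weight, {overhead_gb(N, nested):.1f} GB")
```

Under these assumptions the scales cost about 47GB without nesting and about 12GB with it, on top of the 4-bit weights themselves.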

Step 4: Run Inference

GLM-5.2 expects its own chat format, so build messages and pass them through the tokenizer's chat template rather than concatenating raw strings:

# inference.py
import torch

# Reuses the tokenizer and model objects created in load_model.py (Step 3)
from load_model import model, tokenizer

chat = [
    {"role": "system", "content": "You are a helpful assistant specialised in Chinese-English translation."},
    {"role": "user", "content": "Translate this to Chinese: The transformer architecture revolutionised NLP."}
]

inputs = tokenizer.apply_chat_template(
    chat,
    add_generation_prompt=True,   # Append the assistant turn marker so the model replies
    tokenize=True,
    return_tensors="pt",
    return_dict=True
).to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.1
    )

# Drop the prompt tokens and decode only the newly generated ones
response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
print(response)
# Example output (sampling means the exact wording can vary):
# Transformer架构彻底改变了自然语言处理领域。

Slicing the output by the input length is what strips the prompt tokens back out, so you print only the model's reply.
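
The slicing step is easier to see with plain lists. The toy example below uses made-up token ids and a hypothetical helper name; it only illustrates that generate() returns the prompt followed by the completion for decoder-only models.

```python
def strip_prompt(output_ids, prompt_len):
    """Return only the newly generated token ids."""
    return output_ids[prompt_len:]

prompt_ids = [101, 7592, 2088]           # made-up ids standing in for the prompt
output_ids = prompt_ids + [3000, 3001]   # prompt followed by the completion
reply_ids = strip_prompt(output_ids, len(prompt_ids))
print(reply_ids)  # [3000, 3001]
```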

Step 5: Multi-GPU Model Parallelism

For the full 753B model, reach for DeepSpeed or Accelerate. The pattern below loads an empty shell first, then streams the checkpoint onto your GPUs:

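# multi_gpu.py
import torch
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM

MODEL_PATH = "./glm-5.2"

# Build the architecture from its config only, so no weights are allocated yet
config = AutoConfig.from_pretrained(MODEL_PATH, trust_remote_code=True)
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)

# Stream the checkpoint shards onto the available GPUs
model = load_checkpoint_and_dispatch(
    model,
    MODEL_PATH,
    device_map="auto",
    no_split_module_classes=["GLMBlock"],  # Assumed layer class name; check the model's code
    dtype=torch.bfloat16
)

The no_split_module_classes line keeps each GLMBlock on one device, which avoids the slow cross-GPU chatter you'd get from splitting a single layer in half. GLMBlock is the class name used in this guide's snippet; confirm it against the model's own modelling code before relying on it.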
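
If device_map="auto" packs the GPUs too tightly, both from_pretrained and load_checkpoint_and_dispatch also accept a max_memory mapping that caps what each device may hold. The helper below only builds that mapping; its name and the 10% headroom are assumptions rather than tuned values.

```python
def build_max_memory(gpu_gib, cpu_gib=0, headroom_pct=10):
    """Build a max_memory dict: GPU index -> 'NGiB', plus an optional 'cpu' entry.

    gpu_gib: per-GPU capacities in whole GiB, e.g. [80, 80].
    headroom_pct: share of each GPU left free for activations and KV cache.
    """
    limits = {
        i: f"{cap * (100 - headroom_pct) // 100}GiB"
        for i, cap in enumerate(gpu_gib)
    }
    if cpu_gib:
        limits["cpu"] = f"{cpu_gib}GiB"
    return limits

max_memory = build_max_memory([80, 80], cpu_gib=400)
print(max_memory)  # {0: '72GiB', 1: '72GiB', 'cpu': '400GiB'}
```

Pass the result as max_memory=max_memory alongside device_map="auto".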

Step 6: Create an API Server

Wrapping the model in FastAPI gives you an OpenAI-style endpoint your other services can call:

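# api_server.py
from typing import List

import torch
from fastapi import FastAPI
from pydantic import BaseModel

# Reuse the tokenizer and model loaded in load_model.py (Step 3)
from load_model import model, tokenizer

app = FastAPI(title="GLM-5.2 Local API")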

class ChatMessage(BaseModel):
    role: str
    content: str

class ChatRequest(BaseModel):
    messages: List[ChatMessage]
    max_tokens: int = 1024
    temperature: float = 0.7
    top_p: float = 0.9

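@app.post("/v1/chat/completions")
def chat_completions(request: ChatRequest):
    # Plain def (not async): FastAPI runs it in a worker thread, so the
    # blocking generate() call does not stall the event loop
    chat = [{"role": m.role, "content": m.content} for m in request.messages]

    inputs = tokenizer.apply_chat_template(
        chat,
        add_generation_prompt=True,   # Append the assistant turn marker
        tokenize=True,
        return_tensors="pt",
        return_dict=True
    ).to(model.device)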

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=request.max_tokens,
            temperature=request.temperature,
            top_p=request.top_p,
            do_sample=True
        )

    response_text = tokenizer.decode(
        outputs[0][inputs['input_ids'].shape[1]:],
        skip_special_tokens=True
    )

    return {
        "choices": [{"message": {"role": "assistant", "content": response_text}}],
        "model": "glm-5.2",
        "usage": {
            "prompt_tokens": inputs['input_ids'].shape[1],
            "completion_tokens": len(outputs[0]) - inputs['input_ids'].shape[1]
        }
    }

@app.get("/health")
async def health():
    return {"status": "ok", "model": "glm-5.2", "gpus": torch.cuda.device_count()}

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

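The /v1/chat/completions path mirrors the OpenAI route on purpose, so existing client code can often point at your local box with little more than a base-URL change. The response here carries only a subset of the OpenAI fields (choices, model, usage), so clients that validate the full schema may need fields such as id, object, and created added. The /health route gives you something to poll from a load balancer.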
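
A client call against this server can be sketched with the standard library alone. The helper name, the localhost URL, and the sample prompt are assumptions for illustration; the request is only built here, and the commented lines show how you might send it once the server is running.

```python
import json
import urllib.request

def build_chat_request(base_url, messages, max_tokens=256):
    """Build a POST request for the /v1/chat/completions route."""
    payload = {"messages": messages, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request(
    "http://localhost:8000",
    [{"role": "user", "content": "把这句话翻译成英文:你好,世界。"}],
)

# To send it once the server is up:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```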

Step 7: Optimise Inference Speed

| Technique | Speedup | Quality Impact |
| --- | --- | --- |
| 4-bit quantisation | 2.5x | Minimal |
| Flash Attention 2 | 1.5x | None |
| vLLM serving | 5-10x | None |
| Speculative decoding | 2-3x | None |

Flash Attention 2 is the easy win on long inputs. Install it, then switch it on at load time:

# In your shell
pip install flash-attn --no-build-isolation

# In load_model.py
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    attn_implementation="flash_attention_2",
    # ... other args
)

For anything resembling production traffic, vLLM is the bigger lever: it handles batching and KV-cache management far better than a hand-rolled loop.
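
Speedup figures like these are worth measuring on your own hardware. The timing helper below is a minimal sketch: its name and the generate_fn contract (a callable that returns how many new tokens it produced) are assumptions, and the stand-in generator exists only so the example runs without a GPU.

```python
import time

def tokens_per_second(generate_fn, prompt, runs=3):
    """Average generated tokens per second over `runs` calls.

    generate_fn(prompt) must return the number of new tokens it produced.
    """
    total_tokens = 0
    start = time.perf_counter()
    for _ in range(runs):
        total_tokens += generate_fn(prompt)
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed if elapsed > 0 else float("inf")

# Stand-in generator; replace with a wrapper around model.generate
def fake_generate(prompt):
    time.sleep(0.01)
    return 20

print(f"{tokens_per_second(fake_generate, 'hello'):.0f} tok/s")
```

Run it once before and once after each optimisation, with the same prompt and max_new_tokens, to get a like-for-like comparison.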

Step 8: Bilingual Code Generation

One use case worth testing on your own work: GLM-5.2 can generate code with Chinese comments and documentation in a single pass.

chat = [
    {"role": "user", "content": "写一个Python函数,用快速排序算法对列表进行排序。添加中文注释。"}
]
# Generates:
# def quick_sort(arr):
#     '''快速排序算法实现'''
#     if len(arr) <= 1:
#         return arr
#     pivot = arr[len(arr) // 2]
#     left = [x for x in arr if x < pivot]  # 小于基准值的元素
#     middle = [x for x in arr if x == pivot]  # 等于基准值的元素
#     right = [x for x in arr if x > pivot]  # 大于基准值的元素
#     return quick_sort(left) + middle + quick_sort(right)

For a team maintaining a codebase that's documented in both English and Chinese, that saves a translation round-trip. Run it against a few real examples before you trust it on anything important.
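
One way to run that check is to parse the generated code, execute it, and compare it against a known-good sort. The checker below is a sketch with a hypothetical helper name; it uses exec, so only feed it code you have reviewed or are running in a sandbox. The sample source is the quicksort shown above.

```python
import ast
import re

CJK = re.compile(r"[\u4e00-\u9fff]")  # common Chinese characters

def check_generated_sort(source: str, func_name: str = "quick_sort") -> dict:
    """Parse, run, and spot-check model-generated sorting code."""
    ast.parse(source)              # raises SyntaxError if the code is malformed
    namespace = {}
    exec(source, namespace)        # defines the function inside an isolated dict
    func = namespace[func_name]
    sample = [3, 1, 2, 3, -5, 0]
    return {
        "has_chinese": bool(CJK.search(source)),
        "sorts_correctly": func(list(sample)) == sorted(sample),
    }

generated = '''
def quick_sort(arr):
    """快速排序算法实现"""
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
    left = [x for x in arr if x < pivot]      # 小于基准值的元素
    middle = [x for x in arr if x == pivot]   # 等于基准值的元素
    right = [x for x in arr if x > pivot]     # 大于基准值的元素
    return quick_sort(left) + middle + quick_sort(right)
'''

print(check_generated_sort(generated))
```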

Do/Don't

| Do | Don't |
| --- | --- |
| Use Q4 quantisation for inference | Attempt FP16 without 1.5TB+ VRAM |
| Enable Flash Attention 2 | Use default attention on long sequences |
| Use trust_remote_code=True | Skip it; GLM requires custom model code |
| Test Chinese tokenisation separately | Assume standard tokeniser works |
| Use device_map="auto" | Manually specify layer placement |
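
Testing Chinese tokenisation separately can start with a simple round-trip check: encode each sample, decode it, and flag any text that changes. The helper and the character-level stand-in codec below are assumptions so the sketch runs anywhere; with the real model you would pass the tokenizer's own encode and decode methods, and some tokenisers normalise whitespace or punctuation, so a flagged sample needs a human look before you call it a bug.

```python
def round_trip_failures(encode, decode, samples):
    """Return the samples whose text changes after encode -> decode."""
    return [s for s in samples if decode(encode(s)) != s]

# Stand-in codec: one id per character
def toy_encode(text):
    return [ord(ch) for ch in text]

def toy_decode(ids):
    return "".join(chr(i) for i in ids)

samples = ["你好,世界", "机器学习模型 v2.0", "繁體字與简体字混排"]
print(round_trip_failures(toy_encode, toy_decode, samples))  # []
```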

Hardware Requirements

The table below is a set of estimates, not vendor-published figures, and the numbers haven't been independently verified. Read them as a rough starting point. The single-A100 row in particular is hard to reconcile with the ~376GB Q4 weight size, so don't bank on fitting the full model on one 80GB card.

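| Setup | VRAM | RAM | Storage | Speed |
| --- | --- | --- | --- | --- |
| Q4, single A100 80GB | 76GB | 128GB | 400GB | 8 tok/s |
| Q4, 2x A100 80GB | 152GB | 256GB | 400GB | 15 tok/s |
| Q4, 4x RTX 4090 | 96GB | 256GB | 400GB | 12 tok/s |
| vLLM serving | 80GB | 128GB | 400GB | 40+ tok/s |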

Conclusion

GLM-5.2 is one of the strongest open-weights models you can self-host right now, and the MIT licence means you can run it without a per-token bill. Quantise it to 4 bits and it fits on a multi-GPU box rather than a data centre; add Flash Attention and vLLM and it's quick enough for real traffic. The reported bilingual abilities make it worth a serious look for any team working across English and Chinese codebases and documentation. Just benchmark it on your own material before you build on the Chinese-specialisation claims, since those are reported rather than proven.


AI Kick Start is an Illawarra-based AI studio in Figtree, helping businesses across Wollongong, Shellharbour and Kiama and right across Australia put AI to work.
