Analysis
Google shipped Gemma 4 on 2 April 2026: a family of open-weight models, Apache 2.0 licensed, built from the same research as Gemini 3 (Google blog). Open weights matter because you can download the model and run it on your own hardware, which is exactly what this guide is about.
For Australian teams the appeal is simple. If a model runs on the laptop in front of you, the data it processes never touches someone else's server. No API bill that scales with usage, no questions about where customer information ends up. That changes the maths for anyone handling sensitive records under the Privacy Act.
The hardware that makes this practical is the M-series Mac. Apple's chips share one pool of RAM between the CPU and GPU, so a well-specced Mac can load a model that would otherwise demand a dedicated graphics card. Pair that with MLX, Apple's own machine-learning framework, and you get a setup that runs decent models locally at a usable speed.
One caveat up front, and it's an important one. The Gemma 4 sizes used throughout this guide (4B, 9B, 27B) and their Hugging Face repository names do not match the models Google actually published. The real Gemma 4 line ships as E2B, E4B, 12B, a 26B mixture-of-experts variant, and a 31B dense model (Gemma 4 model card). Treat the size labels, repo IDs, and the performance tables here as worked examples of the workflow rather than copy-paste-ready values. The MLX setup steps are sound; swap in a real variant name before you run them.
Prerequisites
- Mac with Apple Silicon (M1/M2/M3/M4, Pro/Max/Ultra recommended)
- macOS 14.0 (Sonoma) or newer
- 16GB unified memory minimum (36GB+ for the largest models)
- Xcode Command Line Tools (install with `xcode-select --install`)
- Homebrew installed
Gemma models are gated on Hugging Face, so you'll need to accept the licence and generate an access token before anything downloads.
Step-by-Step Framework
Step 1: Install MLX and Dependencies
# Create a dedicated Python environment
python3 -m venv ~/mlx-env
source ~/mlx-env/bin/activate
# Upgrade pip and install core packages
pip install --upgrade pip
pip install mlx-lm transformers huggingface_hub
# Install optional tools
pip install mlx-vlm # For vision tasks
pip install mlx-whisper # For speech
Check that MLX can see your GPU:
python3 -c "import mlx.core as mx; print(f'Metal GPU: {mx.metal.is_available()}'); print(f'Default device: {mx.default_device()}')"
You want output like this:
Metal GPU: True
Default device: Device(gpu, 0)
Step 2: Download Gemma 4
# download_gemma.py
import os
from huggingface_hub import snapshot_download
# Available variants: 4b, 9b, 27b (placeholder labels, see the note below)
MODEL_SIZE = "9b" # Good balance for 24GB Macs
model_id = f"google/gemma-4-{MODEL_SIZE}-it"
local_dir = os.path.expanduser(f"~/models/gemma-4-{MODEL_SIZE}")
# Download (requires Hugging Face token for Gemma)
# Get token from https://huggingface.co/settings/tokens
snapshot_download(
    repo_id=model_id,
    local_dir=local_dir,  # plain folder, so load() in Step 3 can read it directly
    token=os.environ["HF_TOKEN"],
)
print(f"Model downloaded to {local_dir}")
export HF_TOKEN=hf_your_token_here
python3 download_gemma.py
A flag worth repeating: the MODEL_SIZE values and the google/gemma-4-9b-it-style repo IDs in this script are placeholders. The published repositories are named differently, for example `google/gemma-4-31B-it`. Substitute a real variant ID from the Gemma 4 model card and the download will go through.
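Before loading anything, it can help to confirm that the download folder actually holds a complete model. The check below is a minimal sketch: the function name `missing_model_files` is hypothetical, and it assumes the usual Hugging Face layout of a `config.json` next to one or more `.safetensors` weight files.

```python
import os
from pathlib import Path

def missing_model_files(model_dir):
    """Return a list of problems found in a downloaded model folder."""
    root = Path(os.path.expanduser(model_dir))
    if not root.is_dir():
        return [f"{root} is not a directory"]
    problems = []
    # Assumption: the model config sits at the top level of the folder.
    if not (root / "config.json").is_file():
        problems.append("config.json is missing")
    # Assumption: weights are stored as one or more .safetensors shards.
    if not any(root.glob("*.safetensors")):
        problems.append("no .safetensors weight files found")
    return problems

# An empty list means the folder passed both checks.
print(missing_model_files("~/models/gemma-4-9b") or "Folder looks complete")
```

An interrupted download is the usual cause of a non-empty result; rerunning the download script should pick up where it stopped.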
Step 3: Run Inference
# chat.py
import os
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler
# Load model from the folder created in Step 2 (load() does not expand "~")
model, tokenizer = load(os.path.expanduser("~/models/gemma-4-9b"))
# Sampling settings; recent mlx-lm releases take these through a sampler object
sampler = make_sampler(temp=0.7, top_p=0.9)
# Chat loop
messages = []
print("Gemma 4 Chat, type 'quit' to exit\n")
while True:
    user_input = input("You: ")
    if user_input.lower() == 'quit':
        break
    messages.append({"role": "user", "content": user_input})
    # Format for Gemma
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    response = generate(
        model,
        tokenizer,
        prompt=prompt,
        max_tokens=1024,
        sampler=sampler,
        verbose=False
    )
    print(f"Gemma: {response}\n")
    messages.append({"role": "assistant", "content": response})
python3 chat.py
# You: Explain how transformers work
# Gemma: Transformers are a type of neural network architecture introduced in 2017...
The load() and generate() calls here are the real mlx-lm API (mlx-lm GitHub). That part you can rely on, with one version note: older releases accepted temp and top_p directly on generate(), while newer ones expect a sampler as shown.
Step 4: Quantise for Faster Inference
Quantisation shrinks the model with little drop in output quality:
# quantise.py
from mlx_lm.utils import quantize_model
# Quantise to 4-bit (Q4)
quantize_model(
    model_path="~/models/gemma-4-9b",
    output_path="~/models/gemma-4-9b-q4",
    q_group_size=64,
    q_bits=4
)
print("Quantised model saved")
A heads-up on the snippet above: mlx-lm does ship a quantize_model() and full quantisation support, but the exact call signature shown here is illustrative. In practice conversion usually runs through convert(..., quantize=True) (mlx-lm conversion docs). Check the current API before you wire this into anything.
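For reference, the convert(..., quantize=True) route could look roughly like the sketch below. The helper names and the `-q4` output folder naming are hypothetical choices for this guide; the keyword names passed to `convert` follow the mlx-lm conversion docs as of writing and should be checked against your installed version.

```python
import os

def build_convert_kwargs(src_dir, bits=4, group_size=64):
    """Assemble keyword arguments for mlx_lm.convert.

    The "<source>-q<bits>" output folder is a naming choice made here,
    not something mlx-lm requires.
    """
    src = os.path.expanduser(src_dir).rstrip("/")
    return {
        "hf_path": src,                # folder (or repo ID) to read from
        "mlx_path": f"{src}-q{bits}",  # folder to write; should not exist yet
        "quantize": True,
        "q_bits": bits,
        "q_group_size": group_size,
    }

def quantise(src_dir, bits=4):
    # Imported lazily so this file still loads on machines without mlx-lm.
    from mlx_lm import convert
    convert(**build_convert_kwargs(src_dir, bits))

# On a Mac with mlx-lm installed:
# quantise("~/models/gemma-4-9b", bits=4)
```

Keeping the argument assembly separate from the call makes it easy to print and inspect the paths before committing to a conversion that can take several minutes.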
How quantisation trades size for speed (figures below are illustrative, tied to variants that don't ship under these exact names):
| Variant | Size | Speed (tok/s) | Quality |
|---|---|---|---|
| FP16 | 18GB | 12 | Baseline |
| Q8 | 9GB | 20 | 99% |
| Q4 | 5GB | 32 | 97% |
| Q2 | 3GB | 45 | 92% |
The direction is well established: 8-bit and 4-bit quantisation keep most of the quality while cutting the footprint hard. The precise percentages, though, are unsourced and shift depending on the model and benchmark you're measuring against.
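The size column is mostly arithmetic: parameters times bits per weight, divided by eight. The sketch below reproduces it for a hypothetical 9-billion-parameter model. It assumes group-wise quantisation stores one 16-bit scale and one 16-bit bias per group of 64 weights, and it counts weights only, so the KV cache and activations come on top.

```python
def weights_gb(params_billion, bits, group_size=None):
    """Approximate weight storage in decimal gigabytes."""
    bits_per_weight = bits
    if group_size:
        # Assumption: each group carries a 16-bit scale and a 16-bit bias.
        bits_per_weight += 32 / group_size
    return params_billion * bits_per_weight / 8

for label, bits, group in [("FP16", 16, None), ("Q8", 8, 64), ("Q4", 4, 64), ("Q2", 2, 64)]:
    print(f"{label}: {weights_gb(9, bits, group):.1f} GB")
# FP16: 18.0 GB
# Q8: 9.6 GB
# Q4: 5.1 GB
# Q2: 2.8 GB
```

These figures land close to the table, which suggests its sizes were derived the same way. Swap in the parameter count of a real variant to estimate whether it fits your Mac.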
Step 5: Fine-Tune with LoRA
# finetune.py
import json
from mlx_lm import load, generate, train
# Load base model
model, tokenizer = load("~/models/gemma-4-9b")
# Prepare training data
# JSONL format: {"text": "### Instruction:...\n### Response:..."}
train_data = [
    {"text": "### Instruction: Convert this to JavaScript\nprint('hello')\n### Response: console.log('hello');"},
    {"text": "### Instruction: Refactor this\nfor i in range(len(items)): print(items[i])\n### Response: for item in items: print(item)"},
    # ... more examples
]
# Save training data
with open("train.jsonl", "w") as f:
    for ex in train_data:
        f.write(json.dumps(ex) + "\n")
# Run LoRA fine-tuning (illustrative call, see the note after merge.py)
train(
    model=model,
    tokenizer=tokenizer,
    train_data="train.jsonl",
    val_data=None,
    batch_size=4,
    learning_rate=1e-4,
    lora_rank=8,
    steps=500,
    save_every=100,
    output_dir="./lora-adapters"
)
Merge adapters:
# merge.py
from mlx_lm.utils import merge_lora
merge_lora(
    base_model="~/models/gemma-4-9b",
    lora_path="./lora-adapters",
    output_path="~/models/gemma-4-9b-finetuned"
)
MLX genuinely supports on-device LoRA fine-tuning (mlx-lm LoRA docs). The top-level train() import and the merge_lora() signature shown here don't match the documented public interface, though: training runs through mlx_lm.lora and adapter fusion through mlx_lm.fuse. Read the current docs and adjust the calls to match.
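Whichever entry point you end up calling, the data preparation is plain Python. The sketch below assumes the mlx_lm.lora trainer reads a folder containing `train.jsonl` and `valid.jsonl`, each line a JSON object with a `text` field; that matches the LoRA docs at the time of writing, but confirm it for your version. The function name `write_lora_dataset` and the 10% validation split are hypothetical choices.

```python
import json
import random
from pathlib import Path

def write_lora_dataset(examples, data_dir, valid_fraction=0.1, seed=0):
    """Shuffle examples and write train.jsonl / valid.jsonl into data_dir."""
    rows = list(examples)
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split reproducible
    n_valid = max(1, int(len(rows) * valid_fraction))
    splits = {"valid.jsonl": rows[:n_valid], "train.jsonl": rows[n_valid:]}
    out = Path(data_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, split in splits.items():
        with open(out / name, "w", encoding="utf-8") as f:
            for row in split:
                f.write(json.dumps(row, ensure_ascii=False) + "\n")
    # Report how many rows went into each file.
    return {name: len(split) for name, split in splits.items()}

# Example: write_lora_dataset(train_data, "./lora-data")
```

Holding back even a small validation set gives you a loss figure that isn't measured on the training rows, which is the quickest way to spot overfitting on a tiny dataset.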
Step 6: Create a Local API Server
# server.py
import os
from flask import Flask, request, jsonify
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler
app = Flask(__name__)
# Load once at startup so every request reuses the same model in memory
model, tokenizer = load(os.path.expanduser("~/models/gemma-4-9b-q4"))
@app.route('/v1/chat/completions', methods=['POST'])
def chat_completions():
    data = request.json
    messages = data['messages']
    max_tokens = data.get('max_tokens', 1024)
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    response = generate(
        model, tokenizer, prompt,
        max_tokens=max_tokens,
        sampler=make_sampler(temp=data.get('temperature', 0.7)),
        verbose=False
    )
    return jsonify({
        "choices": [{"message": {"role": "assistant", "content": response}}],
        "model": "gemma-4-9b-q4"
    })
@app.route('/health', methods=['GET'])
def health():
    return jsonify({"status": "ok", "model": "gemma-4-9b-q4"})
if __name__ == '__main__':
    # Bind to loopback only; use host='0.0.0.0' if other machines on your network need access
    app.run(host='127.0.0.1', port=8080)
pip install flask
python3 server.py
Test it:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 100}'
The endpoint mimics the OpenAI chat format, so most tools that already talk to OpenAI can point at localhost:8080 instead with little or no change.
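The same request can be made from Python with nothing beyond the standard library. This is a minimal sketch: the helper names are hypothetical, and it assumes the server above is listening on localhost:8080 and returns the `choices[0].message.content` shape shown in server.py.

```python
import json
import urllib.request

def build_chat_request(prompt, base_url="http://localhost:8080", max_tokens=100):
    """Build a POST request for the local chat completions endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def extract_reply(response_body):
    """Pull the assistant text out of the JSON response body."""
    return json.loads(response_body)["choices"][0]["message"]["content"]

def ask(prompt):
    # Generous timeout: a local model can take a while on long prompts.
    with urllib.request.urlopen(build_chat_request(prompt), timeout=120) as resp:
        return extract_reply(resp.read())

# With server.py running:
# print(ask("Hello"))
```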
Do/Don't
| Do | Don't |
|---|---|
| Use Q4 quantisation for most tasks | Run FP16 unless you have 64GB+ RAM |
| Use LoRA rank 8 for quick experiments | Use rank >32 without more training data |
| Monitor memory with vm_stat 1 | Ignore swap usage, it kills performance |
| Batch process when possible | Send one request at a time |
| Cache model in memory | Reload the model on every request |
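To watch swap from a script rather than by eye, the one-shot form of `vm_stat` (no interval argument) can be parsed into numbers. The sketch below is hedged on two counts: the function name `parse_vm_stat` is hypothetical, and the `SAMPLE` text uses invented figures in the layout `vm_stat` normally prints, so check it against real output on your machine.

```python
import re

def parse_vm_stat(text):
    """Turn one-shot `vm_stat` output into {counter name: integer}."""
    stats = {}
    size = re.search(r"page size of (\d+) bytes", text)
    if size:
        stats["page size"] = int(size.group(1))
    for line in text.splitlines():
        # Counter lines look like 'Pages free:      5120.'; some names are quoted.
        match = re.match(r'^"?([^:"]+)"?:\s+(\d+)\.?\s*$', line)
        if match:
            stats[match.group(1).strip()] = int(match.group(2))
    return stats

# Invented sample figures, for illustration only.
SAMPLE = """Mach Virtual Memory Statistics: (page size of 16384 bytes)
Pages free:                                5120.
Pages active:                            400000.
Swapins:                                      0.
Swapouts:                                  2048.
"""

stats = parse_vm_stat(SAMPLE)
free_gb = stats["Pages free"] * stats["page size"] / 1e9
print(f"Free: {free_gb:.2f} GB, swapouts so far: {stats['Swapouts']}")
```

On a real machine, feed it `subprocess.run(["vm_stat"], capture_output=True, text=True).stdout`. A swapout counter that keeps climbing while the model generates is the signal to drop to a smaller variant or a lower-bit quantisation.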
Hardware Recommendations
Before you read the table: the size labels below (4B, 9B, 27B) match the placeholder variants used throughout this guide, not the models Google actually shipped. Gemma 4 ships as E2B, E4B, 12B, a 26B mixture-of-experts variant, and a 31B dense model (Gemma 4 model card). The RAM tiers themselves are reasonable rules of thumb; map them onto a real variant of similar size.
| Model | Min RAM | Recommended Mac |
|---|---|---|
| Gemma-4-4B | 8GB | M2 8GB |
| Gemma-4-9B | 16GB | M3 Pro 18GB |
| Gemma-4-27B | 36GB | M3 Max 36GB |
| Gemma-4-27B Q4 | 18GB | M3 Pro 18GB |
It's worth knowing what the real models bring to the table: Gemma 4 supports a context window up to 256K tokens, fluency across more than 140 languages, and native vision and audio input on the smaller sizes (Google Cloud blog).
Conclusion
For developers, an M-series Mac with MLX is one of the easiest ways into local AI. Because the CPU and GPU share one memory pool, your "VRAM" is just your system RAM: a 36GB M3 Max can hold a large model that, by some accounts, would otherwise call for a graphics card costing several thousand dollars on a Linux box. (That price comparison is a rough rhetorical figure, not a sourced benchmark.) Pick a real Gemma 4 variant that fits your machine, quantise it, point your existing OpenAI-compatible tooling at the local server, and use LoRA fine-tuning when you need the model to specialise.


