How-to Guide

How to set up local AI: M-series Mac + Gemma 4 + MLX.

Build a powerful local AI development environment on Apple Silicon using Google's Gemma 4, Apple's MLX framework, and a fully open-source toolchain for inference and fine-tuning.

TL;DR

Apple Silicon Macs paired with MLX make local AI genuinely usable. This guide walks through running Google's [Gemma 4](https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/) on an M-series Mac with the [MLX framework](https://github.com/ml-explore/mlx-lm): fast on-device inference and local fine-tuning, with none of your data leaving the machine for an external API. (Note: the specific Gemma 4 sizes and tok/s figures below are illustrative; see the variant warning in the Hardware section.)

Key takeaways

  • MLX: Apple's ML framework optimised for unified memory architecture
  • Gemma 4: Google's latest open model family (real sizes: E2B, E4B, 12B, 26B MoE and 31B, *not* the 4B-27B labels used below)
  • Performance: Quantised models run at usable speeds on M3 Max; the exact tok/s figures below are illustrative
  • Unified memory: CPU and GPU share one pool of RAM, so there is no separate VRAM ceiling; model size is bounded by total system memory
  • Fine-tuning: MLX supports LoRA fine-tuning on-device

Analysis

Google shipped Gemma 4 on 2 April 2026: a family of open-weight models, Apache 2.0 licensed, built from the same research as Gemini 3 (Google blog). Open weights matter because you can download the model and run it on your own hardware, which is exactly what this guide is about.

For Australian teams the appeal is simple. If a model runs on the laptop in front of you, the data it processes never touches someone else's server. No API bill that scales with usage, no questions about where customer information ends up. That changes the maths for anyone handling sensitive records under the Privacy Act.

The hardware that makes this practical is the M-series Mac. Apple's chips share one pool of RAM between the CPU and GPU, so a well-specced Mac can load a model that would otherwise demand a dedicated graphics card. Pair that with MLX, Apple's own machine-learning framework, and you get a setup that runs decent models locally at a usable speed.

One caveat up front, and it's an important one. The Gemma 4 sizes used throughout this guide (4B, 9B, 27B) and their Hugging Face repository names do not match the models Google actually published. The real Gemma 4 line ships as E2B, E4B, 12B, a 26B mixture-of-experts variant, and a 31B dense model (Gemma 4 model card). Treat the size labels, repo IDs, and the performance tables here as worked examples of the workflow rather than copy-paste-ready values. The MLX setup steps are sound; swap in a real variant name before you run them.

Prerequisites

  • Mac with Apple Silicon (M1/M2/M3/M4, Pro/Max/Ultra recommended)
  • macOS 14.0 (Sonoma) or newer
  • 16GB unified memory minimum (36GB+ for the largest models)
  • Xcode Command Line Tools: xcode-select --install
  • Homebrew installed
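
The first two prerequisites can be checked from Python before installing anything. The sketch below is a hedged illustration rather than part of the original guide: the function name `check_prereqs` is hypothetical, and it relies only on the standard-library `platform` module. Note that a Python binary running under Rosetta reports `x86_64` even on Apple Silicon, so a failure here may point at the interpreter rather than the hardware.

```python
import platform


def check_prereqs(system, machine, mac_version, min_major=14):
    """Return a list of problems; an empty list means the machine qualifies."""
    problems = []
    if system != "Darwin":
        problems.append(f"expected macOS (Darwin), found {system}")
    if machine != "arm64":
        problems.append(f"expected Apple Silicon (arm64), found {machine}")
    # mac_ver() returns an empty string on non-macOS systems
    major = int(mac_version.split(".")[0]) if mac_version else 0
    if major < min_major:
        problems.append(f"expected macOS {min_major}+, found {mac_version or 'unknown'}")
    return problems


if __name__ == "__main__":
    issues = check_prereqs(platform.system(), platform.machine(), platform.mac_ver()[0])
    print("OK" if not issues else "\n".join(issues))
```

Memory is left out on purpose: the standard library has no portable call for total RAM, so check that figure under About This Mac.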

Gemma models are gated on Hugging Face, so you'll need to accept the licence and generate an access token before anything downloads.
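
A missing token tends to surface as a confusing error halfway through a download, so it can help to fail early with a clear message. This is a minimal sketch, assuming the token lives in the `HF_TOKEN` environment variable as in the steps that follow; `require_hf_token` is a hypothetical helper name, not a Hugging Face API.

```python
import os


def require_hf_token(env=os.environ):
    """Return the Hugging Face token from the environment or raise a clear error."""
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set. Accept the model licence on Hugging Face, "
            "create an access token, then run: export HF_TOKEN=<your token>"
        )
    return token
```

Calling `require_hf_token()` at the top of a download script turns a bare `KeyError` into an instruction the reader can act on.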

Step-by-Step Framework

Step 1: Install MLX and Dependencies

# Create a dedicated Python environment
python3 -m venv ~/mlx-env
source ~/mlx-env/bin/activate

# Upgrade pip and install core packages
pip install --upgrade pip
pip install mlx-lm transformers huggingface_hub

# Install optional tools
pip install mlx-vlm  # For vision tasks
pip install mlx-whisper  # For speech

Check that MLX can see your GPU:

python3 -c "import mlx.core as mx; print(f'Metal GPU: {mx.metal.is_available()}'); print(f'Default device: {mx.default_device()}')"

You want output like this:

Metal GPU: True
Default device: Device(gpu, 0)

Step 2: Download Gemma 4

# download_gemma.py
import os

from huggingface_hub import snapshot_download

# Placeholder size labels used throughout this guide: 4b, 9b, 27b
MODEL_SIZE = "9b"  # Good balance for 24GB Macs

model_id = f"google/gemma-4-{MODEL_SIZE}-it"
# Expand "~" up front so the later steps can load from the same path
local_dir = os.path.expanduser(f"~/models/gemma-4-{MODEL_SIZE}")

# Download (requires Hugging Face token for Gemma)
# Get token from https://huggingface.co/settings/tokens
# local_dir writes the files straight into the folder, not into a nested cache layout
snapshot_download(
    repo_id=model_id,
    local_dir=local_dir,
    token=os.environ["HF_TOKEN"]
)

print(f"Model downloaded to {local_dir}")

Set your token and run the script:

export HF_TOKEN=hf_your_token_here
python3 download_gemma.py

A flag worth repeating: the MODEL_SIZE values and the google/gemma-4-9b-it-style repo IDs in this script are placeholders. The published repositories are named differently, for example `google/gemma-4-31B-it`. Substitute a real variant ID from the Gemma 4 model card and the download will go through.

Step 3: Run Inference

# chat.py
import os

from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

# Load model; expand "~" explicitly, since a literal "~" is not resolved to your home folder
model, tokenizer = load(os.path.expanduser("~/models/gemma-4-9b"))

# Recent mlx-lm releases take sampling settings as a sampler object
# rather than temp/top_p keyword arguments on generate()
sampler = make_sampler(temp=0.7, top_p=0.9)

# Chat loop
messages = []
print("Gemma 4 Chat, type 'quit' to exit\n")

while True:
    user_input = input("You: ")
    if user_input.strip().lower() == 'quit':
        break

    messages.append({"role": "user", "content": user_input})

    # Format for Gemma
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

    response = generate(
        model,
        tokenizer,
        prompt=prompt,
        max_tokens=1024,
        sampler=sampler,
        verbose=False
    )

    print(f"Gemma: {response}\n")
    messages.append({"role": "assistant", "content": response})

Run it:

python3 chat.py
# You: Explain how transformers work
# Gemma: Transformers are a type of neural network architecture introduced in 2017...

The load() and generate() calls here are the real mlx-lm API (mlx-lm GitHub). That part you can rely on.
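
One practical wrinkle: the chat loop appends every turn to `messages`, so the prompt grows with each exchange until it hits the context or memory limit. A minimal sketch of one way to cap it follows. The helper name `trim_history` and the character budget are assumptions for illustration; characters are only a rough stand-in for tokens, and the user-turn-first rule reflects how earlier Gemma chat templates behave, which may not hold for every variant.

```python
def trim_history(messages, max_chars=8000):
    """Keep the most recent messages whose combined content fits in max_chars."""
    kept, total = [], 0
    # Walk backwards so the newest turns are kept first
    for msg in reversed(messages):
        total += len(msg["content"])
        # Always keep the newest message, even if it alone exceeds the budget
        if kept and total > max_chars:
            break
        kept.append(msg)
    kept.reverse()
    # Assumption: the chat template expects the conversation to open with a user turn
    while kept and kept[0]["role"] != "user":
        kept.pop(0)
    return kept
```

In the loop above, this would slot in just before the template call, for example `prompt = tokenizer.apply_chat_template(trim_history(messages), ...)`, leaving the full `messages` list untouched for logging.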

Step 4: Quantise for Faster Inference

Quantisation shrinks the model with little drop in output quality:

# quantise.py
import os

from mlx_lm import convert

# Quantise to 4-bit (Q4); convert() writes a new model folder at mlx_path
convert(
    hf_path=os.path.expanduser("~/models/gemma-4-9b"),
    mlx_path=os.path.expanduser("~/models/gemma-4-9b-q4"),
    quantize=True,
    q_group_size=64,
    q_bits=4
)

print("Quantised model saved")

A heads-up on the snippet above: it goes through convert(..., quantize=True), the usual conversion route (mlx-lm conversion docs). mlx-lm also ships a lower-level quantize_model(), but that works on a model already loaded in memory rather than on paths. Argument names can shift between releases, so check the current API before you wire this into anything.

How quantisation trades size for speed (figures below are illustrative, tied to variants that don't ship under these exact names):

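| Variant | Size | Speed (tok/s) | Quality (vs FP16) |
|---------|------|---------------|-------------------|
| FP16    | 18GB | 12            | Baseline          |
| Q8      | 9GB  | 20            | 99%               |
| Q4      | 5GB  | 32            | 97%               |
| Q2      | 3GB  | 45            | 92%               |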

The direction is well established: 8-bit and 4-bit quantisation keep most of the quality while cutting the footprint hard. The precise percentages, though, are unsourced and shift depending on the model and benchmark you're measuring against.
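
The size column, at least, can be sanity-checked with arithmetic: weights take roughly parameters × bits ÷ 8 bytes. The sketch below applies that to the guide's placeholder 9B model. It assumes each quantised group of 64 weights also stores a 16-bit scale and a 16-bit bias, which adds about half a bit per weight; treat that overhead figure, and the function name `weight_footprint_gb`, as assumptions rather than documented values.

```python
def weight_footprint_gb(params_billion, bits, group_size=64):
    """Approximate weight size in GB (10^9 bytes) for a given bit width."""
    # Assumption: quantised groups carry a 16-bit scale and a 16-bit bias each
    overhead_bits = 32 / group_size if bits < 16 else 0.0
    return params_billion * (bits + overhead_bits) / 8


for label, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4), ("Q2", 2)]:
    print(f"{label}: {weight_footprint_gb(9, bits):.1f} GB")
```

For 9 billion parameters this gives roughly 18, 9.6, 5.1 and 2.8 GB, in line with the table. It covers weights only; the KV cache and activations need extra memory on top, and that share grows with context length.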

Step 5: Fine-Tune with LoRA

# prepare_data.py
import json
import os

# Training examples in JSONL "text" format:
# {"text": "### Instruction:...\n### Response:..."}
train_data = [
    {"text": "### Instruction: Convert this to JavaScript\nprint('hello')\n### Response: console.log('hello');"},
    {"text": "### Instruction: Refactor this\nfor i in range(len(items)): print(items[i])\n### Response: for item in items: print(item)"},
    # ... more examples
]

# Held-out examples in the same format; must not be empty when training
valid_data = [
    # ... validation examples
]

# mlx_lm.lora reads train.jsonl and valid.jsonl from a single data folder
os.makedirs("data", exist_ok=True)
for name, rows in [("train", train_data), ("valid", valid_data)]:
    with open(f"data/{name}.jsonl", "w") as f:
        for ex in rows:
            f.write(json.dumps(ex) + "\n")

Run LoRA fine-tuning:

python3 prepare_data.py
mlx_lm.lora \
  --model ~/models/gemma-4-9b \
  --train \
  --data ./data \
  --batch-size 4 \
  --learning-rate 1e-4 \
  --iters 500 \
  --save-every 100 \
  --adapter-path ./lora-adapters

Merge adapters:

mlx_lm.fuse \
  --model ~/models/gemma-4-9b \
  --adapter-path ./lora-adapters \
  --save-path ~/models/gemma-4-9b-finetuned

MLX genuinely supports on-device LoRA fine-tuning (mlx-lm LoRA docs). Training runs through mlx_lm.lora and adapter fusion through mlx_lm.fuse, which is why the commands above use those entry points rather than a top-level train() import or a merge_lora() helper. The LoRA rank (8 in this walkthrough) is set through a YAML config passed with --config rather than a dedicated flag. Flag names can shift between releases, so check mlx_lm.lora --help against your installed version and adjust the calls to match.
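
To see why a low rank is cheap, count the parameters. LoRA leaves a weight matrix of shape d_out × d_in frozen and trains two thin matrices of shapes d_out × r and r × d_in, so each adapted matrix adds r × (d_in + d_out) trainable values. The sketch below works that through for a hypothetical 4096 × 4096 projection; the dimensions are illustrative and not taken from any Gemma 4 variant.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    return rank * (d_in + d_out)


def full_params(d_in, d_out):
    """Parameters in the frozen weight matrix itself."""
    return d_in * d_out


d = 4096  # hypothetical hidden size, for illustration only
for rank in (8, 32):
    added = lora_params(d, d, rank)
    share = added / full_params(d, d)
    print(f"rank {rank}: {added:,} trainable params ({share:.2%} of the matrix)")
```

At rank 8 the adapter is about 0.39% of the matrix it modifies, and at rank 32 about 1.56%. Trainable parameters grow linearly with rank, so a higher rank needs more data to fit well.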

Step 6: Create a Local API Server

# server.py
import os

from flask import Flask, request, jsonify
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

app = Flask(__name__)

# Load once at startup so every request reuses the model already in memory
model, tokenizer = load(os.path.expanduser("~/models/gemma-4-9b-q4"))

@app.route('/v1/chat/completions', methods=['POST'])
def chat_completions():
    data = request.get_json()
    messages = data['messages']
    max_tokens = data.get('max_tokens', 1024)

    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

    # Recent mlx-lm releases take the temperature through a sampler object
    response = generate(
        model, tokenizer, prompt,
        max_tokens=max_tokens,
        sampler=make_sampler(temp=data.get('temperature', 0.7)),
        verbose=False
    )

    return jsonify({
        "choices": [{"message": {"role": "assistant", "content": response}}],
        "model": "gemma-4-9b-q4"
    })

@app.route('/health', methods=['GET'])
def health():
    return jsonify({"status": "ok", "model": "gemma-4-9b-q4"})

if __name__ == '__main__':
    # Bind to loopback so only this machine can reach the server;
    # switch to host='0.0.0.0' only if other devices on your network need access
    app.run(host='127.0.0.1', port=8080)

Install Flask and start the server:

pip install flask
python3 server.py

Test it:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 100}'

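The endpoint mimics the basic shape of the OpenAI chat format, so tools that already talk to OpenAI and only need a plain, non-streaming completion can point at localhost:8080 instead with little or no change. Clients that rely on streaming, token usage counts, or other response fields will need those added to the handler.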
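
The same request can be made from Python with nothing beyond the standard library. This is a minimal sketch that assumes the server above is running on port 8080 and returns the response shape shown in server.py; the helper names `build_chat_request`, `extract_reply`, and `ask` are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # port used by server.py in this guide


def build_chat_request(prompt, max_tokens=100, base_url=BASE_URL):
    """Build a POST request in the chat-completions shape the local server accepts."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(body):
    """Pull the assistant text out of a parsed response body."""
    return body["choices"][0]["message"]["content"]


def ask(prompt, **kwargs):
    """Send one prompt to the local server and return the reply text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        return extract_reply(json.load(response))
```

With the server running, `print(ask("Hello"))` prints the model's reply. The generous timeout is deliberate, since the first request after startup can be slow.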

Do/Don't

| Do | Don't |
|----|-------|
| Use Q4 quantisation for most tasks | Run FP16 unless you have 64GB+ RAM |
| Use LoRA rank 8 for quick experiments | Use rank >32 without more training data |
| Monitor memory with vm_stat 1 | Ignore swap usage; it kills performance |
| Batch process when possible | Send one request at a time |
| Cache model in memory | Reload the model on every request |
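
If you would rather log memory from a script than watch a terminal, a single `vm_stat` snapshot is easy to parse. The sketch below is a hedged illustration: `parse_vm_stat` and `snapshot` are hypothetical helper names, and the parser assumes the one-shot output format (a header with the page size, then `Name: value.` lines), not the column layout printed by `vm_stat 1`.

```python
import re
import subprocess


def parse_vm_stat(text):
    """Parse one-shot `vm_stat` output into (page size in bytes, {counter: value})."""
    page_match = re.search(r"page size of (\d+) bytes", text)
    page_size = int(page_match.group(1)) if page_match else None
    counters = {}
    for line in text.splitlines():
        # Lines look like: Pages free:   12345.   (some names are quoted)
        match = re.match(r'^"?([^":]+)"?:\s+(\d+)\.?$', line.strip())
        if match:
            counters[match.group(1)] = int(match.group(2))
    return page_size, counters


def snapshot():
    """Run vm_stat once (macOS only) and return the parsed result."""
    output = subprocess.run(["vm_stat"], capture_output=True, text=True, check=True).stdout
    return parse_vm_stat(output)
```

The swap counters are cumulative since boot, so the useful signal is the change between two snapshots: take one before loading the model and one during generation, then compare the `Swapouts` values. A rising count means the model no longer fits comfortably in memory.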

Hardware Recommendations

Before you read the table: the size labels below (4B, 9B, 27B) match the placeholder variants used throughout this guide, not the models Google actually shipped. Gemma 4 ships as E2B, E4B, 12B, a 26B mixture-of-experts variant, and a 31B dense model (Gemma 4 model card). The RAM tiers themselves are reasonable rules of thumb; map them onto a real variant of similar size.

| Model | Min RAM | Recommended Mac |
|-------|---------|-----------------|
| Gemma-4-4B | 8GB | M2 8GB |
| Gemma-4-9B | 16GB | M3 Pro 18GB |
| Gemma-4-27B | 36GB | M3 Max 36GB |
| Gemma-4-27B Q4 | 18GB | M3 Pro 18GB |

It's worth knowing what the real models bring to the table: Gemma 4 supports a context window up to 256K tokens, fluency across more than 140 languages, and native vision and audio input on the smaller sizes (Google Cloud blog).

Conclusion

For developers, an M-series Mac with MLX is one of the easiest ways into local AI. Because the CPU and GPU share one memory pool, your "VRAM" is just your system RAM: a 36GB M3 Max can hold a large model that, by some accounts, would otherwise call for a graphics card costing several thousand dollars on a Linux box. (That price comparison is a rough rhetorical figure, not a sourced benchmark.) Pick a real Gemma 4 variant that fits your machine, quantise it, point your existing OpenAI-compatible tooling at the local server, and use LoRA fine-tuning when you need the model to specialise.

What to do next

  1. Pick the smallest useful workflow that proves the pattern.
  2. Write down the owner, data boundary, review point, and success measure.
  3. Review the result after the first real run and decide whether to scale, change, or stop.

AI Kick Start is an Illawarra-based AI studio in Figtree, helping businesses across Wollongong, Shellharbour and Kiama and right across Australia put AI to work.
