
How-to Guide

How to evaluate LLMs with private benchmarks.

Create custom, proprietary evaluation suites that test LLMs on your specific tasks, data, and criteria, going far beyond public leaderboards to measure real business impact.


Decision

Start narrow

Use the article to decide the smallest useful workflow worth testing before expanding the system.

Risk to watch

Hype drift

Avoid turning a practical adoption step into a broad transformation promise nobody can verify.

Proof to collect

Business signal

Write down the owner, data boundary, review point, and measurable outcome before the first build.

TL;DR

Public benchmarks like [MMLU](https://en.wikipedia.org/wiki/MMLU) and HumanEval measure general capability, not business value. Private benchmarks test LLMs on your actual tasks, with your actual data, against your actual success criteria. This guide builds a complete private evaluation framework with automated scoring, regression tracking, and model comparison.

Key takeaways

  • Test data: Use real anonymised production data, not synthetic examples
  • Metrics: Task-specific: accuracy, relevance, latency, cost, safety
  • Regression: Track scores over time; flag degradation
  • Comparison: Head-to-head evaluation of 3-5 models per task
  • Automation: Run evaluations nightly on the latest model versions

Analysis

Every few weeks another model lands with a press release full of benchmark scores. It tops the leaderboard on MMLU, it nudges ahead on HumanEval, and the chart on the announcement page goes a little further to the right than the last one. So you swap it into production, and your support replies get worse. Or your code generation starts ignoring the house style. Or nothing changes at all and you've burned a week of engineering time chasing a number.

The problem isn't that those benchmarks are fake. They're real, and they tell you something. They just don't tell you the thing you actually need to know, which is whether the model is good at *your* job, on *your* data, judged by *your* standards.

This is the gap a private benchmark closes. Instead of asking "how smart is this model in general," you ask "does it answer our customers correctly, follow our procedures, and stay inside our cost and latency budget." The answer to that question is the one you can take to a deployment decision.

What follows is a working framework you can stand up and run before your next model upgrade. It covers how to define what "good" means with the people who actually know, how to score it automatically, and how to catch the day a model quietly gets worse.


Prerequisites

  • Python 3.10+
  • API keys for all models being evaluated
  • Anonymised test dataset
  • Evaluation criteria defined by domain experts

Step-by-Step Framework

Step 1: Define Evaluation Criteria

Sit down with the people who do the work and pin down what "good" actually means. Their answers become your scoring criteria:

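# evaluation/criteria.py
from dataclasses import dataclass
from typing import Callable

# Scorers such as semantic_similarity_scorer and execution_scorer come from
# Step 2. The others (empathy_scorer, style_scorer, and so on) are not shown
# in this guide and need to be written before CRITERIA is constructed.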

@dataclass
class EvaluationCriterion:
    name: str
    description: str
    weight: float  # 0-1, sum of all weights = 1
    scorer: Callable[[str, str], float]  # (expected, actual) -> score

CRITERIA = {
    "customer_support": [
        EvaluationCriterion(
            name="answer_accuracy",
            description="Does the answer correctly address the customer's question?",
            weight=0.40,
            scorer=semantic_similarity_scorer
        ),
        EvaluationCriterion(
            name="empathy",
            description="Is the tone empathetic and understanding?",
            weight=0.20,
            scorer=empathy_scorer
        ),
        EvaluationCriterion(
            name="procedure_compliance",
            description="Does the answer follow company procedures?",
            weight=0.25,
            scorer=procedure_scorer
        ),
        EvaluationCriterion(
            name="conciseness",
            description="Is the answer appropriately concise?",
            weight=0.15,
            scorer=conciseness_scorer
        )
    ],
    "code_generation": [
        EvaluationCriterion(
            name="correctness",
            description="Does the code produce correct output?",
            weight=0.50,
            scorer=execution_scorer
        ),
        EvaluationCriterion(
            name="style_compliance",
            description="Does it follow team coding standards?",
            weight=0.20,
            scorer=style_scorer
        ),
        EvaluationCriterion(
            name="efficiency",
            description="Is the solution algorithmically efficient?",
            weight=0.20,
            scorer=efficiency_scorer
        ),
        EvaluationCriterion(
            name="documentation",
            description="Are functions documented?",
            weight=0.10,
            scorer=documentation_scorer
        )
    ]
}

The weights matter. For support, accuracy carries 40% and tone carries 20%, because a polite wrong answer still fails the customer. For code, correctness is half the score on its own. Set these numbers with your experts, not by gut feel.
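Because the weighted total only means something when each task's weights sum to 1, a quick check is worth having. The sketch below is a minimal illustration rather than part of the framework: `check_weights` and `weighted_score` are hypothetical helper names, and the Step 1 criteria are reduced to a plain name-to-weight dict.

```python
import math

# Hypothetical, simplified view of the Step 1 weights: criterion name -> weight
SUPPORT_WEIGHTS = {
    "answer_accuracy": 0.40,
    "empathy": 0.20,
    "procedure_compliance": 0.25,
    "conciseness": 0.15,
}

def check_weights(weights: dict[str, float]) -> None:
    """Raise ValueError unless every weight is in 0-1 and they sum to 1."""
    for name, weight in weights.items():
        if not 0.0 <= weight <= 1.0:
            raise ValueError(f"weight for {name!r} is outside 0-1: {weight}")
    total = sum(weights.values())
    # isclose avoids false alarms from floating-point rounding
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"weights sum to {total:.3f}, expected 1.0")

def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-criterion scores (each 0-1) into one weighted number."""
    check_weights(weights)
    return sum(scores[name] * weight for name, weight in weights.items())

if __name__ == "__main__":
    # A polite but wrong reply: full marks for tone, low marks for accuracy
    polite_but_wrong = {
        "answer_accuracy": 0.2,
        "empathy": 1.0,
        "procedure_compliance": 0.5,
        "conciseness": 1.0,
    }
    print(round(weighted_score(polite_but_wrong, SUPPORT_WEIGHTS), 3))
```

Running the check once when the criteria are loaded catches a mistyped weight before it skews every comparison that follows.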

Step 2: Build Custom Scorers

Each criterion needs a function that turns a response into a number. Some you can measure mechanically; others need a judgement call:

# evaluation/scorers.py
import os
import subprocess
import sys
import tempfile

import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer('all-MiniLM-L6-v2')
llm_client = OpenAI()

def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_similarity_scorer(expected: str, actual: str) -> float:
    """Score based on semantic embedding similarity."""
    emb1 = embedder.encode(expected)
    emb2 = embedder.encode(actual)
    # Cosine similarity can dip below zero; clamp so the score stays in 0-1
    return max(0.0, cosine_similarity(emb1, emb2))

def llm_judge_scorer(expected: str, actual: str, criteria: str) -> float:
    """Use a strong LLM as a judge."""
    response = llm_client.chat.completions.create(
        model="claude-sonnet-4.6",
        messages=[{
            "role": "user",
            "content": f"""Rate how well the actual response meets the expected standard.

Criteria: {criteria}

Expected standard: {expected[:500]}

Actual response: {actual[:500]}

Rate 0.0 to 1.0. Respond with ONLY a number."""
        }]
    )
    try:
        score = float(response.choices[0].message.content.strip())
    except (AttributeError, TypeError, ValueError):
        # The judge returned nothing, or something that is not a bare number
        return 0.0
    return min(1.0, max(0.0, score))

def execution_scorer(expected: str, code: str) -> float:
    """Execute generated code and check output.

    This runs model-written code on the local machine, so run it inside a
    sandbox or container.
    """
    # Write and close the file first so the child process can open it on any OS
    with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            text=True,
            timeout=10
        )
    except (subprocess.TimeoutExpired, OSError):
        return 0.0
    finally:
        # Always remove the temp file, whatever happened above
        os.unlink(path)

    if result.returncode != 0:
        return 0.0

    # Compare output with expected
    return semantic_similarity_scorer(expected, result.stdout)

The semantic_similarity_scorer leans on the all-MiniLM-L6-v2 embedding model from the sentence-transformers library, which is a solid, lightweight default for comparing meaning rather than exact wording. The execution_scorer actually runs the generated code in a temp file and checks what comes out, which beats eyeballing the source for whether it works.

One thing to watch in this snippet: the llm_judge_scorer creates an OpenAI() client but then asks for claude-sonnet-4.6. Claude doesn't run through the default OpenAI client out of the box, so in a real setup you'd point it at Anthropic's own SDK or an OpenAI-compatible gateway. Treat the judge call as illustrative, not copy-paste ready.
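Two gaps are worth closing whichever judge backend you pick: judges do not always reply with a bare number, and the judge takes three arguments while the criteria in Step 1 expect a two-argument scorer. The sketch below shows one possible way to handle both. `parse_judge_score` and `make_judge_scorer` are hypothetical names, not part of any SDK, and the judge here is a stand-in function rather than a real model call.

```python
import re
from typing import Callable

# Any function with the three-argument shape of llm_judge_scorer from Step 2
JudgeFn = Callable[[str, str, str], float]

# Matches an optional sign, then a number such as "1", "0.8", or ".5"
_NUMBER_RE = re.compile(r"-?(?:\d+\.?\d*|\.\d+)")

def parse_judge_score(reply: str) -> float:
    """Pull the first number out of a judge reply and clamp it to 0-1.

    Returns 0.0 when the reply contains no number. A reply like "7/10"
    would clamp to 1.0, so the prompt still needs to ask for a 0-1 value.
    """
    match = _NUMBER_RE.search(reply)
    if match is None:
        return 0.0
    return min(1.0, max(0.0, float(match.group())))

def make_judge_scorer(judge: JudgeFn, criteria: str) -> Callable[[str, str], float]:
    """Bind a criteria string so the judge fits the (expected, actual) shape."""
    def scorer(expected: str, actual: str) -> float:
        return judge(expected, actual, criteria)
    return scorer

if __name__ == "__main__":
    # Stand-in judge for illustration only; a real one would call an LLM
    def fake_judge(expected: str, actual: str, criteria: str) -> float:
        return parse_judge_score("Score: 0.8")

    empathy_scorer = make_judge_scorer(fake_judge, "Is the tone empathetic?")
    print(empathy_scorer("expected reply", "actual reply"))
```

Binding the criteria text this way lets one judge function serve several subjective criteria, such as empathy or procedure compliance, without changing the `EvaluationCriterion` signature.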

Step 3: Create Test Dataset

This is the part people skimp on, and it's the part that decides whether the whole exercise is worth anything. Pull from real production traffic, strip the personal data, and keep the messiness:

# evaluation/dataset.py
from dataclasses import dataclass
from typing import Optional
import json

@dataclass
class TestCase:
    id: str
    task_type: str  # "customer_support", "code_generation", etc.
    input: str
    expected_output: str
    expected_intermediate_steps: Optional[list] = None
    difficulty: str = "medium"  # easy, medium, hard
    tags: Optional[list] = None  # None means "no tags"

def load_test_dataset(path: str) -> list[TestCase]:
    with open(path) as f:
        data = json.load(f)
    return [TestCase(**item) for item in data]

# Example test case format (JSON):
# {
#   "id": "cs-001",
#   "task_type": "customer_support",
#   "input": "My order #12345 hasn't arrived. It's been 3 weeks.",
#   "expected_output": "I'm sorry to hear about the delay with order #12345...",
#   "difficulty": "medium",
#   "tags": ["shipping", "delay", "order_lookup"]
# }
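Stripping personal data from production traffic can start with something as simple as pattern-based redaction. The sketch below is a deliberately crude illustration: `redact` is a hypothetical helper, the two regular expressions are assumptions that only catch email addresses and phone-like digit runs, and names or street addresses would pass straight through. Real production data usually calls for a dedicated PII tool plus a human review.

```python
import re

# Assumed patterns for illustration only; they are not exhaustive
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# A digit, at least seven more digits, spaces, or hyphens, then a final digit
PHONE_RE = re.compile(r"\+?\d[\d\s-]{7,}\d")

def redact(text: str) -> str:
    """Replace email addresses and phone-like digit runs with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)

if __name__ == "__main__":
    raw = "Contact jane@example.com or 0412 345 678 about order #12345."
    print(redact(raw))
```

Short identifiers such as the order number are left alone here, which keeps the test case realistic. Whether an order number counts as personal data is a call for whoever owns your data boundary.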

Tag your cases by difficulty and topic. When a model's overall score looks fine but it's quietly failing every "hard" shipping query, the tags are how you'll spot it.
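A per-tag breakdown is one way to surface that kind of hidden failure. The sketch below assumes you already have one overall score per test case, keyed by case id. `scores_by_tag` is a hypothetical helper, and the cases are plain dicts in the JSON format shown above.

```python
from collections import defaultdict
from statistics import mean

def scores_by_tag(cases: list[dict], scores: dict[str, float]) -> dict[str, float]:
    """Average the per-case score for every tag and difficulty label."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for case in cases:
        score = scores[case["id"]]
        # Treat difficulty as one more label so "hard" cases get their own row
        difficulty = case.get("difficulty", "medium")
        labels = list(case.get("tags") or []) + [f"difficulty:{difficulty}"]
        for label in labels:
            buckets[label].append(score)
    return {label: mean(values) for label, values in sorted(buckets.items())}

if __name__ == "__main__":
    cases = [
        {"id": "cs-001", "tags": ["shipping"], "difficulty": "hard"},
        {"id": "cs-002", "tags": ["shipping", "billing"], "difficulty": "easy"},
        {"id": "cs-003", "tags": ["billing"], "difficulty": "hard"},
    ]
    per_case = {"cs-001": 0.3, "cs-002": 0.9, "cs-003": 0.5}
    for label, avg in scores_by_tag(cases, per_case).items():
        print(f"{label}: {avg:.2f}")
```

Sorting the output by average score puts the weakest tags at the top, which is usually where the next round of test cases should come from.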

Step 4: Run Evaluation

The runner walks each model through every test case, scores the responses, and rolls the criteria up into one weighted number per task:

# evaluation/runner.py
from typing import Dict, List

import pandas as pd

from evaluation.dataset import TestCase  # dataclass from Step 3

class EvaluationRunner:
    def __init__(self, models: List[str], criteria_map: Dict):
        self.models = models
        self.criteria_map = criteria_map
        self.results = []

    async def evaluate_model(self, model: str, test_cases: List[TestCase]) -> dict:
        scores_by_criterion = {c.name: [] for c in self.criteria_map[test_cases[0].task_type]}

        for case in test_cases:
            # Get model response
            response = await self.call_model(model, case.input)

            # Score against each criterion
            for criterion in self.criteria_map[case.task_type]:
                score = criterion.scorer(case.expected_output, response)
                scores_by_criterion[criterion.name].append(score)

        # Calculate weighted score
        criteria = self.criteria_map[test_cases[0].task_type]
        weighted_score = sum(
            sum(scores_by_criterion[c.name]) / len(scores_by_criterion[c.name]) * c.weight
            for c in criteria
        )

        return {
            "model": model,
            "task_type": test_cases[0].task_type,
            "weighted_score": weighted_score,
            "criterion_scores": {
                name: sum(scores) / len(scores)
                for name, scores in scores_by_criterion.items()
            },
            "raw_scores": scores_by_criterion
        }

    async def run_comparison(self, test_cases: List[TestCase]) -> pd.DataFrame:
        results = []

        for model in self.models:
            print(f"Evaluating {model}...")
            result = await self.evaluate_model(model, test_cases)
            results.append(result)

        # Create comparison table
        df = pd.DataFrame([
            {
                "Model": r["model"],
                "Overall": f"{r['weighted_score']:.3f}",
                **{k: f"{v:.3f}" for k, v in r["criterion_scores"].items()}
            }
            for r in results
        ])

        return df

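    async def call_model(self, model: str, prompt: str) -> str:
        # Route to the appropriate API. The call_* helpers are not shown in
        # this guide; each is assumed to wrap one provider's SDK.
        if "claude" in model:
            return await call_anthropic(model, prompt)
        elif "gpt" in model:
            return await call_openai(model, prompt)
        elif "minimax" in model:
            return await call_minimax(model, prompt)
        elif "gemini" in model:
            # Needed for the Gemini model in the Step 6 comparison list;
            # call_gemini is a hypothetical helper like the others
            return await call_gemini(model, prompt)
        else:
            raise ValueError(f"Unknown model: {model}")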

run_comparison hands you a pandas table with one row per model and a column per criterion. That's your scorecard: not "which model is smartest" but "which model wins on the work you care about."
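One caveat: `evaluate_model` reads the task type from the first test case, so a dataset that mixes customer support and code generation cases would not be scored correctly in a single call. A simple workaround, sketched here under that assumption, is to split the cases by task type first and run the comparison once per group. `group_by_task_type` is a hypothetical helper and `Case` is a cut-down stand-in for the Step 3 `TestCase`.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Case:
    """Stand-in for TestCase with only the fields this sketch needs."""
    id: str
    task_type: str

def group_by_task_type(cases: list[Case]) -> dict[str, list[Case]]:
    """Split a mixed dataset into one list of cases per task type."""
    groups: dict[str, list[Case]] = defaultdict(list)
    for case in cases:
        groups[case.task_type].append(case)
    return dict(groups)

if __name__ == "__main__":
    mixed = [
        Case("cs-001", "customer_support"),
        Case("cg-001", "code_generation"),
        Case("cs-002", "customer_support"),
    ]
    for task_type, group in group_by_task_type(mixed).items():
        # In the real pipeline, each group would go to run_comparison separately
        print(task_type, [case.id for case in group])
```

Running each group separately also gives you one scorecard per task, which matches how the criteria are defined in Step 1.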

Step 5: Regression Tracking

A model that scored well last month can slip after a provider-side update, and you won't notice from the changelog. So keep a history and compare against it:

# evaluation/tracking.py
import json
from datetime import datetime

class RegressionTracker:
    def __init__(self, history_file: str = "evaluation_history.json"):
        self.history_file = history_file

    def record(self, results: dict):
        entry = {
            "timestamp": datetime.now().isoformat(),
            "results": results
        }

        try:
            with open(self.history_file) as f:
                history = json.load(f)
        except FileNotFoundError:
            history = []

        history.append(entry)

        with open(self.history_file, 'w') as f:
            json.dump(history, f, indent=2)

    def detect_regression(self, model: str, current_score: float) -> dict:
        # No history file yet means this is the first run, not an error
        try:
            with open(self.history_file) as f:
                history = json.load(f)
        except FileNotFoundError:
            return {"regression": False, "reason": "Insufficient history"}

        # Get last 5 scores for this model
        model_scores = [
            e["results"][model]["weighted_score"]
            for e in history[-5:]
            if model in e["results"]
        ]

        if len(model_scores) < 2:
            return {"regression": False, "reason": "Insufficient history"}

        # current_score has not been recorded yet, so every stored score
        # counts as a previous run
        avg_previous = sum(model_scores) / len(model_scores)
        threshold = avg_previous * 0.95  # 5% tolerance

        if current_score < threshold:
            return {
                "regression": True,
                "previous_avg": avg_previous,
                "current": current_score,
                "drop_percent": (avg_previous - current_score) / avg_previous * 100
            }

        return {"regression": False}

The 5% tolerance is a starting point. Scoring has some natural noise, so you don't want an alarm every time a number wobbles. Tune the threshold once you've seen a few weeks of your own variance.
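Once that history exists, one option is to derive the threshold from the observed spread instead of a fixed percentage. The sketch below is an assumption about how you might do that, not the tracker's actual behaviour. `regression_threshold` and `is_regression` are hypothetical names, and the choice of two standard deviations is a common starting point, not a recommendation from the framework.

```python
from statistics import mean, stdev

def regression_threshold(history: list[float], k: float = 2.0,
                         fallback_ratio: float = 0.95) -> float:
    """Lowest score still treated as normal noise for this model."""
    if len(history) < 3:
        # Too few runs to estimate noise, so fall back to the fixed 5% tolerance
        return mean(history) * fallback_ratio
    return mean(history) - k * stdev(history)

def is_regression(history: list[float], current: float, k: float = 2.0) -> bool:
    """True when the current score falls below the noise-aware threshold."""
    if not history:
        return False  # nothing to compare against yet
    return current < regression_threshold(history, k)

if __name__ == "__main__":
    nightly_scores = [0.80, 0.82, 0.78, 0.80]
    print(is_regression(nightly_scores, 0.79))  # inside the usual wobble
    print(is_regression(nightly_scores, 0.70))  # well below it
```

A noisy model gets a wider band and a stable one gets a tighter band, so the alarm tracks how much each model actually varies on your data.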

Step 6: Wire It Together and Run

Wire it together and point it at the models you're weighing up:

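# evaluate.py
import asyncio

from evaluation.criteria import CRITERIA
from evaluation.dataset import load_test_dataset
from evaluation.runner import EvaluationRunner
from evaluation.tracking import RegressionTracker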

async def main():
    test_cases = load_test_dataset("tests/evaluation/dataset.json")

    runner = EvaluationRunner(
        models=[
            "claude-sonnet-4.6",
            "gpt-5.5",
            "minimax-m3",
            "gemini-3.5-flash"
        ],
        criteria_map=CRITERIA
    )

    results = await runner.run_comparison(test_cases)
    print("\n=== Evaluation Results ===")
    print(results.to_string(index=False))

    # Check for regressions
    tracker = RegressionTracker()
    for model in runner.models:
        score = float(results[results["Model"] == model]["Overall"].values[0])
        regression = tracker.detect_regression(model, score)
        if regression["regression"]:
            print(f"⚠ REGRESSION DETECTED for {model}: {regression['drop_percent']:.1f}% drop")

    tracker.record({
        model: {
            "weighted_score": float(results[results["Model"] == model]["Overall"].values[0]),
            "criterion_scores": {
                k: float(v) for k, v in
                results[results["Model"] == model].drop("Model", axis=1).to_dict('records')[0].items()
                if k != "Overall"
            }
        }
        for model in runner.models
    })

if __name__ == "__main__":
    asyncio.run(main())

The four models in the list are the ones being compared in this example: Claude Sonnet 4.6, GPT-5.5, MiniMax M3, and Gemini 3.5 Flash. The short identifiers above are for readability; the exact API strings each provider expects may carry version suffixes, so check the current docs when you wire up the calls. Schedule the script to run nightly and you'll have a standing answer to "should we switch" instead of a guess.

Do/Don't

| Do | Don't |
| --- | --- |
| Use real (anonymised) production data | Rely solely on synthetic test cases |
| Define criteria with domain experts | Guess at what "good" looks like |
| Track scores over time | Run one-off evaluations without history |
| Use LLM-as-judge for subjective criteria | Try to programmatically score empathy |
| Run evaluations before every model change | Deploy new models without evaluation |

Conclusion

A private benchmark is how you find out whether a model is good for your work, not the industry's. Public scores measure general capability; your scores measure whether customers got the right answer and the generated code actually ran. Build the test set from real traffic, set the criteria with the people who do the job, automate the run, and watch the history for slips. Do that and the next model upgrade stops being a leap of faith.


What to do next

  1. Pick the smallest useful workflow that proves the pattern.
  2. Write down the owner, data boundary, review point, and success measure.
  3. Review the result after the first real run and decide whether to scale, change, or stop.


AI Kick Start is an Illawarra-based AI studio in Figtree, helping businesses across Wollongong, Shellharbour and Kiama and right across Australia put AI to work.
