Analysis
Every few weeks another model lands with a press release full of benchmark scores. It tops the leaderboard on MMLU, it nudges ahead on HumanEval, and the chart on the announcement page goes a little further to the right than the last one. So you swap it into production, and your support replies get worse. Or your code generation starts ignoring the house style. Or nothing changes at all and you've burned a week of engineering time chasing a number.
The problem isn't that those benchmarks are fake. They're real, and they tell you something. They just don't tell you the thing you actually need to know, which is whether the model is good at *your* job, on *your* data, judged by *your* standards.
This is the gap a private benchmark closes. Instead of asking "how smart is this model in general," you ask "does it answer our customers correctly, follow our procedures, and stay inside our cost and latency budget." The answer to that question is the one you can take to a deployment decision.
What follows is a working framework you can stand up and run before your next model upgrade. It covers how to define what "good" means with the people who actually know, how to score it automatically, and how to catch the day a model quietly gets worse.
Prerequisites
- Python 3.10+
- API keys for all models being evaluated
- Anonymised test dataset
- Evaluation criteria defined by domain experts
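For the anonymised dataset, a first pass can be as simple as pattern-based redaction before a human review. The sketch below is a minimal example under stated assumptions: `redact`, `EMAIL_RE`, and `PHONE_RE` are hypothetical names, the patterns only catch emails and phone-like digit runs, and they will miss names, addresses, and anything domain-specific.

```python
import re

# Hypothetical, deliberately simple patterns; real PII removal needs more than this.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace emails and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


sample = "Mail jo.smith@example.com or call +1 (555) 123-4567 about order #12345."
print(redact(sample))
```

Short identifiers such as the order number are left alone on purpose, since test cases often depend on them. Treat the output as a draft for manual review, not as proof the data is clean.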
Step-by-Step Framework
Step 1: Define Evaluation Criteria
Sit down with the people who do the work and pin down what "good" actually means. Their answers become your scoring criteria:
# evaluation/criteria.py
from dataclasses import dataclass
from typing import Callable

# Scorers live in evaluation/scorers.py (Step 2). Only semantic_similarity_scorer
# and execution_scorer are shown there; the remaining ones are yours to supply.
from evaluation.scorers import (
    semantic_similarity_scorer, empathy_scorer, procedure_scorer,
    conciseness_scorer, execution_scorer, style_scorer,
    efficiency_scorer, documentation_scorer,
)


@dataclass
class EvaluationCriterion:
    name: str
    description: str
    weight: float  # 0-1, sum of all weights = 1
    scorer: Callable[[str, str], float]  # (expected, actual) -> score


CRITERIA = {
    "customer_support": [
        EvaluationCriterion(
            name="answer_accuracy",
            description="Does the answer correctly address the customer's question?",
            weight=0.40,
            scorer=semantic_similarity_scorer
        ),
        EvaluationCriterion(
            name="empathy",
            description="Is the tone empathetic and understanding?",
            weight=0.20,
            scorer=empathy_scorer
        ),
        EvaluationCriterion(
            name="procedure_compliance",
            description="Does the answer follow company procedures?",
            weight=0.25,
            scorer=procedure_scorer
        ),
        EvaluationCriterion(
            name="conciseness",
            description="Is the answer appropriately concise?",
            weight=0.15,
            scorer=conciseness_scorer
        )
    ],
    "code_generation": [
        EvaluationCriterion(
            name="correctness",
            description="Does the code produce correct output?",
            weight=0.50,
            scorer=execution_scorer
        ),
        EvaluationCriterion(
            name="style_compliance",
            description="Does it follow team coding standards?",
            weight=0.20,
            scorer=style_scorer
        ),
        EvaluationCriterion(
            name="efficiency",
            description="Is the solution algorithmically efficient?",
            weight=0.20,
            scorer=efficiency_scorer
        ),
        EvaluationCriterion(
            name="documentation",
            description="Are functions documented?",
            weight=0.10,
            scorer=documentation_scorer
        )
    ]
}

The weights matter. For support, accuracy carries 40% and tone carries 20%, because a polite wrong answer still fails the customer. For code, correctness is half the score on its own. Set these numbers with your experts, not by gut feel.
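Since every weighted score depends on the weights summing to 1, it can help to check that automatically whenever the criteria change. The sketch below is one possible check; `validate_weights` is a hypothetical helper, and the dataclass is repeated here only so the block runs on its own.

```python
import math
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvaluationCriterion:  # stand-in copy of the dataclass from criteria.py
    name: str
    description: str
    weight: float
    scorer: Callable[[str, str], float]


def validate_weights(criteria_map: dict[str, list[EvaluationCriterion]]) -> None:
    """Raise ValueError if a task's weights are out of range or do not sum to 1."""
    for task_type, criteria in criteria_map.items():
        for c in criteria:
            if not 0.0 <= c.weight <= 1.0:
                raise ValueError(f"{task_type}/{c.name}: weight {c.weight} is outside 0-1")
        total = sum(c.weight for c in criteria)
        # Float sums such as 0.40 + 0.20 + 0.25 + 0.15 may not be exactly 1.0.
        if not math.isclose(total, 1.0, abs_tol=1e-9):
            raise ValueError(f"{task_type}: weights sum to {total:.3f}, expected 1.0")
```

Calling `validate_weights(CRITERIA)` at the bottom of criteria.py, or in a unit test, would catch a mistyped weight before it skews a comparison.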
Step 2: Build Custom Scorers
Each criterion needs a function that turns a response into a number. Some you can measure mechanically; others need a judgement call:
# evaluation/scorers.py
import os
import subprocess
import sys
import tempfile

import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer('all-MiniLM-L6-v2')
llm_client = OpenAI()


def semantic_similarity_scorer(expected: str, actual: str) -> float:
    """Score based on semantic embedding similarity."""
    emb1 = embedder.encode(expected)
    emb2 = embedder.encode(actual)
    similarity = cosine_similarity(emb1, emb2)
    # Cosine similarity can dip below 0; clamp so scores stay in 0-1.
    return max(0.0, float(similarity))


def llm_judge_scorer(expected: str, actual: str, criteria: str) -> float:
    """Use a strong LLM as a judge."""
    response = llm_client.chat.completions.create(
        model="claude-sonnet-4.6",
        messages=[{
            "role": "user",
            "content": f"""Rate how well the actual response meets the expected standard.
Criteria: {criteria}
Expected standard: {expected[:500]}
Actual response: {actual[:500]}
Rate 0.0 to 1.0. Respond with ONLY a number."""
        }]
    )
    try:
        return float(response.choices[0].message.content.strip())
    except (AttributeError, TypeError, ValueError):
        # The judge returned no text, or something that is not a number.
        return 0.0


def execution_scorer(expected: str, code: str) -> float:
    """Execute generated code and check output."""
    # Generated code is untrusted: run this inside a sandbox or container.
    with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False) as f:
        f.write(code)
    # The file is closed here, so the child process can read it on any platform.
    try:
        result = subprocess.run(
            [sys.executable, f.name],
            capture_output=True,
            text=True,
            timeout=10
        )
        if result.returncode != 0:
            return 0.0
        # Compare output with expected
        return semantic_similarity_scorer(expected, result.stdout)
    except subprocess.TimeoutExpired:
        return 0.0
    finally:
        # Always remove the temp file, whether the run passed, failed, or timed out.
        os.unlink(f.name)


def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

The semantic_similarity_scorer leans on the all-MiniLM-L6-v2 embedding model from the sentence-transformers library, which is a solid, lightweight default for comparing meaning rather than exact wording. The execution_scorer actually runs the generated code in a temp file and checks what comes out, which beats eyeballing the source for whether it works. Because that code is untrusted, run the scorer in a sandbox rather than on a machine you care about.
One thing to watch in this snippet: the llm_judge_scorer creates an OpenAI() client but then asks for claude-sonnet-4.6. Claude doesn't run through the default OpenAI client out of the box, so in a real setup you'd point it at Anthropic's own SDK or an OpenAI-compatible gateway. Treat the judge call as illustrative, not copy-paste ready.
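There is a second mismatch worth handling: the judge takes three arguments (expected, actual, criteria), while a criterion's scorer takes two. One way to bridge that, sketched below, is a small factory that binds the criteria text and parses the reply defensively. `parse_judge_score` and `make_judge_scorer` are hypothetical names, and `judge` stands for any function that returns the judge model's raw text, whichever SDK or gateway sits behind it.

```python
import re
from typing import Callable

# Matches an optional sign, then either "0.8" / "1" / "1." or ".8".
NUMBER_RE = re.compile(r"-?(?:\d+\.?\d*|\.\d+)")


def parse_judge_score(reply: str) -> float:
    """Take the first number in a judge reply and clamp it to 0.0-1.0."""
    match = NUMBER_RE.search(reply)
    if match is None:
        return 0.0  # no number at all: treat as a failed judgement
    return max(0.0, min(1.0, float(match.group())))


def make_judge_scorer(
    judge: Callable[[str, str, str], str], criteria: str
) -> Callable[[str, str], float]:
    """Bind the criteria text so the result fits the (expected, actual) scorer shape."""
    def scorer(expected: str, actual: str) -> float:
        return parse_judge_score(judge(expected, actual, criteria))
    return scorer
```

With this shape, something like `empathy_scorer = make_judge_scorer(my_judge, "Is the tone empathetic?")` could be dropped into the criteria list, and swapping the judge backend would not touch the criteria at all.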
Step 3: Create Test Dataset
This is the part people skimp on, and it's the part that decides whether the whole exercise is worth anything. Pull from real production traffic, strip the personal data, and keep the messiness:
# evaluation/dataset.py
from dataclasses import dataclass, field
from typing import Optional
import json


@dataclass
class TestCase:
    id: str
    task_type: str  # "customer_support", "code_generation", etc.
    input: str
    expected_output: str
    expected_intermediate_steps: Optional[list] = None
    difficulty: str = "medium"  # easy, medium, hard
    tags: list = field(default_factory=list)  # empty list when the JSON omits tags


def load_test_dataset(path: str) -> list[TestCase]:
    with open(path) as f:
        data = json.load(f)
    return [TestCase(**item) for item in data]


# Example test case format (JSON):
# {
#   "id": "cs-001",
#   "task_type": "customer_support",
#   "input": "My order #12345 hasn't arrived. It's been 3 weeks.",
#   "expected_output": "I'm sorry to hear about the delay with order #12345...",
#   "difficulty": "medium",
#   "tags": ["shipping", "delay", "order_lookup"]
# }

Tag your cases by difficulty and topic. When a model's overall score looks fine but it's quietly failing every "hard" shipping query, the tags are how you'll spot it.
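To make the tags pay off, the per-case scores need to be sliced by them. The sketch below shows one way to do that; `mean_score_by_slice` and the `TaggedCase` stand-in are hypothetical, and the function only assumes each case has `difficulty` and `tags` attributes, as the dataclass above does.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class TaggedCase:  # stand-in with only the fields this sketch needs
    difficulty: str = "medium"
    tags: list = field(default_factory=list)


def mean_score_by_slice(cases, scores) -> dict[str, float]:
    """Average per-case scores for each difficulty level and each tag."""
    buckets = defaultdict(list)
    for case, score in zip(cases, scores):
        buckets[f"difficulty:{case.difficulty}"].append(score)
        for tag in case.tags:
            buckets[f"tag:{tag}"].append(score)
    return {name: sum(vals) / len(vals) for name, vals in buckets.items()}


cases = [
    TaggedCase("easy", ["shipping"]),
    TaggedCase("hard", ["shipping"]),
    TaggedCase("hard", ["refund"]),
]
breakdown = mean_score_by_slice(cases, [1.0, 0.25, 0.75])
for name in sorted(breakdown):
    print(f"{name}: {breakdown[name]:.3f}")
```

An overall mean can hide a weak slice; here the hard cases average 0.5 even though the easy one scores 1.0.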
Step 4: Run Evaluation
The runner walks each model through every test case, scores the responses, and rolls the criteria up into one weighted number per task:
# evaluation/runner.py
from typing import Dict, List

import pandas as pd

from evaluation.dataset import TestCase


class EvaluationRunner:
    def __init__(self, models: List[str], criteria_map: Dict):
        self.models = models
        self.criteria_map = criteria_map
        self.results = []

    async def evaluate_model(self, model: str, test_cases: List[TestCase]) -> dict:
        # Assumes every case in test_cases shares one task_type (taken from the first case).
        scores_by_criterion = {c.name: [] for c in self.criteria_map[test_cases[0].task_type]}
        for case in test_cases:
            # Get model response
            response = await self.call_model(model, case.input)
            # Score against each criterion
            for criterion in self.criteria_map[case.task_type]:
                score = criterion.scorer(case.expected_output, response)
                scores_by_criterion[criterion.name].append(score)
        # Calculate weighted score: mean per criterion, multiplied by its weight
        criteria = self.criteria_map[test_cases[0].task_type]
        weighted_score = sum(
            sum(scores_by_criterion[c.name]) / len(scores_by_criterion[c.name]) * c.weight
            for c in criteria
        )
        return {
            "model": model,
            "task_type": test_cases[0].task_type,
            "weighted_score": weighted_score,
            "criterion_scores": {
                name: sum(scores) / len(scores)
                for name, scores in scores_by_criterion.items()
            },
            "raw_scores": scores_by_criterion
        }

    async def run_comparison(self, test_cases: List[TestCase]) -> pd.DataFrame:
        results = []
        for model in self.models:
            print(f"Evaluating {model}...")
            result = await self.evaluate_model(model, test_cases)
            results.append(result)
        # Create comparison table
        df = pd.DataFrame([
            {
                "Model": r["model"],
                "Overall": f"{r['weighted_score']:.3f}",
                **{k: f"{v:.3f}" for k, v in r["criterion_scores"].items()}
            }
            for r in results
        ])
        return df

    async def call_model(self, model: str, prompt: str) -> str:
        # Route to appropriate API. The call_* helpers are thin async wrappers
        # around each provider's SDK; they are not shown here.
        if "claude" in model:
            return await call_anthropic(model, prompt)
        elif "gpt" in model:
            return await call_openai(model, prompt)
        elif "minimax" in model:
            return await call_minimax(model, prompt)
        elif "gemini" in model:
            # Needed because Step 6 includes a Gemini model in the comparison.
            return await call_gemini(model, prompt)
        else:
            raise ValueError(f"Unknown model: {model}")

run_comparison hands you a pandas table with one row per model and a column per criterion. That's your scorecard: not "which model is smartest" but "which model wins on the work you care about."
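Note that evaluate_model reads the task type from the first test case, so a dataset that mixes support and code cases needs splitting before it reaches the runner. The sketch below is one way to do that; `group_by_task_type` is a hypothetical helper, and `SimpleNamespace` objects stand in for real test cases.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_task_type(test_cases) -> dict[str, list]:
    """Split a mixed dataset into one list of cases per task_type."""
    grouped = defaultdict(list)
    for case in test_cases:
        grouped[case.task_type].append(case)
    return dict(grouped)


cases = [
    SimpleNamespace(id="cs-001", task_type="customer_support"),
    SimpleNamespace(id="cg-001", task_type="code_generation"),
    SimpleNamespace(id="cs-002", task_type="customer_support"),
]
grouped = group_by_task_type(cases)
for task_type, group in grouped.items():
    print(task_type, [c.id for c in group])
```

Each group can then go through its own `run_comparison` call, which gives one scorecard per task type instead of a single blended number.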
Step 5: Regression Tracking
A model that scored well last month can slip after a provider-side update, and you won't notice from the changelog. So keep a history and compare against it:
# evaluation/tracking.py
import json
from datetime import datetime


class RegressionTracker:
    def __init__(self, history_file: str = "evaluation_history.json"):
        self.history_file = history_file

    def _load_history(self) -> list:
        # A missing file just means nothing has been recorded yet.
        try:
            with open(self.history_file) as f:
                return json.load(f)
        except FileNotFoundError:
            return []

    def record(self, results: dict):
        entry = {
            "timestamp": datetime.now().isoformat(),
            "results": results
        }
        history = self._load_history()
        history.append(entry)
        with open(self.history_file, 'w') as f:
            json.dump(history, f, indent=2)

    def detect_regression(self, model: str, current_score: float) -> dict:
        history = self._load_history()
        # Get last 5 recorded scores for this model. Filter first, then slice,
        # so runs that skipped this model don't shrink the window.
        model_scores = [
            e["results"][model]["weighted_score"]
            for e in history
            if model in e["results"]
        ][-5:]
        if len(model_scores) < 2:
            return {"regression": False, "reason": "Insufficient history"}
        # The current run has not been recorded yet, so every stored score is a
        # previous one and belongs in the baseline.
        avg_previous = sum(model_scores) / len(model_scores)
        threshold = avg_previous * 0.95  # 5% tolerance
        if current_score < threshold:
            return {
                "regression": True,
                "previous_avg": avg_previous,
                "current": current_score,
                "drop_percent": (avg_previous - current_score) / avg_previous * 100
            }
        return {"regression": False}

The 5% tolerance is a starting point. Scoring has some natural noise, so you don't want an alarm every time a number wobbles. Tune the threshold once you've seen a few weeks of your own variance.
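Once a few weeks of scores exist, the threshold can be derived from the observed spread instead of a fixed percentage. The sketch below is one possible approach, not the tracker's actual logic: `noise_aware_threshold` is a hypothetical helper, and the choices of `k=2.0` standard deviations and a five-run minimum are assumptions to tune.

```python
import statistics


def noise_aware_threshold(
    history: list[float],
    k: float = 2.0,
    min_runs: int = 5,
    fallback_tolerance: float = 0.05,
) -> float:
    """Return the score below which a run counts as a regression."""
    if not history:
        raise ValueError("history must contain at least one score")
    mean = statistics.fmean(history)
    if len(history) < min_runs:
        # Too few runs to estimate noise; fall back to the fixed 5% tolerance.
        return mean * (1 - fallback_tolerance)
    # Flag only drops larger than k sample standard deviations below the mean.
    return mean - k * statistics.stdev(history)


recent = [0.80, 0.82, 0.78, 0.81, 0.79]
print(f"alert below {noise_aware_threshold(recent):.3f}")
```

A noisy scorer such as an LLM judge will widen the band on its own, while a deterministic one such as code execution will tighten it.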
Step 6: Run the Evaluation
Wire it together and point it at the models you're weighing up:
# evaluate.py
import asyncio

# Module paths follow the file names used in Steps 1-5.
from evaluation.criteria import CRITERIA
from evaluation.dataset import load_test_dataset
from evaluation.runner import EvaluationRunner
from evaluation.tracking import RegressionTracker


async def main():
    test_cases = load_test_dataset("tests/evaluation/dataset.json")
    runner = EvaluationRunner(
        models=[
            "claude-sonnet-4.6",
            "gpt-5.5",
            "minimax-m3",
            "gemini-3.5-flash"
        ],
        criteria_map=CRITERIA
    )
    results = await runner.run_comparison(test_cases)
    print("\n=== Evaluation Results ===")
    print(results.to_string(index=False))

    # Check for regressions before recording, so the baseline holds only earlier runs
    tracker = RegressionTracker()
    for model in runner.models:
        score = float(results[results["Model"] == model]["Overall"].values[0])
        regression = tracker.detect_regression(model, score)
        if regression["regression"]:
            print(f"⚠ REGRESSION DETECTED for {model}: {regression['drop_percent']:.1f}% drop")

    # Append this run to the history file
    tracker.record({
        model: {
            "weighted_score": float(results[results["Model"] == model]["Overall"].values[0]),
            "criterion_scores": {
                k: float(v) for k, v in
                results[results["Model"] == model].drop("Model", axis=1).to_dict('records')[0].items()
                if k != "Overall"
            }
        }
        for model in runner.models
    })


if __name__ == "__main__":
    asyncio.run(main())

The four models in the list here are the candidates in this comparison: Claude Sonnet 4.6, GPT-5.5, MiniMax M3, and Gemini 3.5 Flash. The short identifiers above are for readability; the exact API strings each provider expects may carry version suffixes, so check the current docs when you wire up the calls. Schedule this to run nightly and you'll have a standing answer to "should we switch" instead of a guess.
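For a nightly job, a printed warning is easy to miss; a non-zero exit code lets cron or a CI runner flag the failure for you. The sketch below assumes the per-model dicts have the shape detect_regression returns; `gate_exit_code` is a hypothetical helper.

```python
def gate_exit_code(regressions: dict[str, dict]) -> int:
    """Return 1 if any model regressed, else 0, printing one line per regression."""
    failing = sorted(m for m, r in regressions.items() if r.get("regression"))
    for model in failing:
        drop = regressions[model].get("drop_percent", 0.0)
        print(f"REGRESSION: {model} dropped {drop:.1f}%")
    return 1 if failing else 0


checks = {
    "model-a": {"regression": False},
    "model-b": {"regression": True, "drop_percent": 7.5},
}
code = gate_exit_code(checks)
print("exit code:", code)
```

In the script itself, the collected results could be passed to `sys.exit(gate_exit_code(...))` after the history has been recorded, so a failed gate still leaves a record behind.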
Do/Don't
| Do | Don't |
|---|---|
| Use real (anonymised) production data | Rely solely on synthetic test cases |
| Define criteria with domain experts | Guess at what "good" looks like |
| Track scores over time | Run one-off evaluations without history |
| Use LLM-as-judge for subjective criteria | Try to programmatically score empathy |
| Run evaluations before every model change | Deploy new models without evaluation |
Conclusion
A private benchmark is how you find out whether a model is good for your work, not the industry's. Public scores measure general capability; your scores measure whether customers got the right answer and the code ran correctly. Build the test set from real traffic, set the criteria with the people who do the job, automate the run, and watch the history for slips. Do that and the next model upgrade stops being a leap of faith.


