How-to Guide

How to create a prompt engineering pipeline.

Systematise prompt development with version control, A/B testing, automated evaluation, and regression detection, turning prompt engineering from art into engineering.

Decision

Start narrow

Use the article to decide the smallest useful workflow worth testing before expanding the system.

Risk to watch

Hype drift

Avoid turning a practical adoption step into a broad transformation promise nobody can verify.

Proof to collect

Business signal

Write down the owner, data boundary, review point, and measurable outcome before the first build.

TL;DR

Prompt engineering becomes systematic with version control, A/B testing, automated evaluation, and regression detection. This guide builds a complete pipeline: prompts live in git, changes go through automated evaluation, winners deploy via feature flags, and regressions block deployment.

Key takeaways

  • Version control: Every prompt version stored in git with semantic versioning
  • A/B testing: Deploy multiple prompt variants; measure against KPIs
  • Evaluation: Automated scoring against known test cases on every change
  • Regression: Detect score drops > 5% and block deployment
  • Feature flags: Deploy prompts independently of code releases

Analysis

Most teams treat the prompt that drives their AI feature like a sticky note. It sits buried in the codebase, someone tweaks a line to fix a complaint, and nobody can say later what changed or whether it actually helped. When the bot starts giving worse answers, the trail is cold.

That is a strange way to run something customers talk to every day. The text inside a prompt shapes the tone of your support replies, the accuracy of your code reviews, the quality of every answer your AI produces. It deserves the same care as any other code you ship.

This guide walks through treating prompts that way. The prompts get stored in git like real files. Every change runs through automated tests before it goes live. Promising variants get tested on real users through feature flags, and if a change quietly makes things worse, the system catches the drop and stops the release. None of it is exotic. It is the same discipline software teams already apply to code, pointed at the words you feed your model.

Prerequisites

  • Git repository for prompts
  • Test dataset of known inputs and expected outputs
  • A feature flag system. LaunchDarkly is the common commercial pick; Unleash is the open-source one, or roll your own
  • CI/CD pipeline (GitHub Actions, GitLab CI, etc.)

Step-by-Step Framework

Step 1: Prompt Version Control

Keep prompts as files, not strings buried in your code:

prompts/
├── customer-support/
│   ├── v1.0.0-system.txt
│   ├── v1.1.0-system.txt
│   └── v2.0.0-beta-system.txt
├── code-review/
│   ├── v1.0.0-system.txt
│   └── v1.0.1-system.txt
└── metadata.yaml

# prompts/metadata.yaml
prompts:
  customer-support:
    current_version: "1.1.0"
    versions:
      "1.0.0":
        file: "v1.0.0-system.txt"
        description: "Initial support prompt"
        model: "claude-sonnet-4.6"
      "1.1.0":
        file: "v1.1.0-system.txt"
        description: "Added empathy instructions"
        model: "claude-sonnet-4.6"
      "2.0.0-beta":
        file: "v2.0.0-beta-system.txt"
        description: "Testing Claude Opus 4.8"
        model: "claude-opus-4.8"
        status: "experimental"

The metadata tracks which model each prompt version runs on. Versions 1.0.0 and 1.1.0 target Claude Sonnet 4.6, and the beta entry points an experimental prompt at Claude Opus 4.8, a higher-end model, so you can trial a heavier setup without touching what is in production. Check the exact identifier strings against your provider's current model list before relying on them, since API model IDs are often formatted differently from product names.
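Because the metadata and the prompt files are edited by hand, they can drift apart. The following is a minimal sketch of a consistency check, assuming the metadata has already been parsed into a dict with the shape shown above; the function name validate_metadata is hypothetical and not part of the original pipeline.

```python
from pathlib import Path
import tempfile


def validate_metadata(metadata: dict, prompts_dir: Path) -> list:
    """Return a list of problems; an empty list means metadata and files agree."""
    problems = []
    for name, config in metadata.get("prompts", {}).items():
        versions = config.get("versions", {})
        current = config.get("current_version")
        # current_version must point at a version that is actually declared
        if current not in versions:
            problems.append(f"{name}: current_version {current!r} is not listed under versions")
        for version, entry in versions.items():
            # Every declared version needs its prompt file on disk
            path = prompts_dir / name / entry["file"]
            if not path.is_file():
                problems.append(f"{name} {version}: missing file {path.name}")
    return problems


# Demo on a throwaway tree where the 1.1.0 file is deliberately missing
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "customer-support").mkdir()
    (root / "customer-support" / "v1.0.0-system.txt").write_text("You are a support agent.")
    demo_metadata = {
        "prompts": {
            "customer-support": {
                "current_version": "1.1.0",
                "versions": {
                    "1.0.0": {"file": "v1.0.0-system.txt"},
                    "1.1.0": {"file": "v1.1.0-system.txt"},
                },
            }
        }
    }
    for problem in validate_metadata(demo_metadata, root):
        print(problem)
```

Running a check like this in CI before the evaluation step would turn a typo in metadata.yaml into a readable error instead of a KeyError at request time.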

Step 2: Prompt Loader with Feature Flags

# prompt_loader.py
import yaml
from pathlib import Path
from typing import Optional

class PromptManager:
    def __init__(self, prompts_dir: str = "prompts"):
        self.prompts_dir = Path(prompts_dir)
        with open(self.prompts_dir / "metadata.yaml") as f:
            self.metadata = yaml.safe_load(f)

    def get_prompt(self, name: str, version: Optional[str] = None,
                   feature_flag_client=None, user_id: Optional[str] = None) -> str:
        prompt_config = self.metadata["prompts"][name]

        # Check if user is in A/B test
        if feature_flag_client and user_id:
            variant = feature_flag_client.get_variant(
                flag_key=f"prompt-{name}",
                user_id=user_id,
                default=prompt_config["current_version"]
            )
            version = variant

        # Fall back to specified or current version
        version = version or prompt_config["current_version"]
        version_config = prompt_config["versions"][version]

        prompt_file = self.prompts_dir / name / version_config["file"]
        return prompt_file.read_text()

    def list_versions(self, name: str) -> list:
        return list(self.metadata["prompts"][name]["versions"].keys())

The loader reads the metadata, asks your feature flag client which version this user should see, and returns the right prompt text. If there is no flag client or no user ID, it falls back to the version you asked for, or to the current version if you did not ask for one. Note that a flag result overrides an explicitly requested version. That one method is what lets you point different users at different prompts without redeploying anything.
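The loader only needs an object with a get_variant(flag_key, user_id, default) method. For local testing, or if you roll your own flags, a sketch like the one below may be enough. It is a hypothetical stand-in, not the API of LaunchDarkly or Unleash, whose SDKs use their own method names; the class name and the rollout format are assumptions made for illustration.

```python
import hashlib


class HashBucketFlagClient:
    """Hypothetical stand-in for a feature flag SDK, using deterministic hashing."""

    def __init__(self, rollouts: dict):
        # rollouts maps flag_key -> list of (variant, percentage) pairs
        # whose percentages sum to at most 100
        self.rollouts = rollouts

    def get_variant(self, flag_key: str, user_id: str, default: str) -> str:
        allocation = self.rollouts.get(flag_key)
        if not allocation:
            return default
        # Hash flag and user together so a user always lands in the same bucket
        # for a given flag, while different flags bucket independently
        digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % 100
        threshold = 0
        for variant, percentage in allocation:
            threshold += percentage
            if bucket < threshold:
                return variant
        return default  # users outside the rollout keep the default version


# Send roughly 10% of users to the beta prompt; everyone else stays on 1.1.0
client = HashBucketFlagClient({"prompt-customer-support": [("2.0.0-beta", 10)]})
print(client.get_variant("prompt-customer-support", "user-42", default="1.1.0"))
```

Hashing rather than random assignment matters here: a user who saw the beta prompt yesterday sees it again today, which keeps conversations consistent and keeps A/B groups stable.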

Step 3: Automated Evaluation on PR

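# evaluate_prompt.py
from dataclasses import dataclass
from typing import List

# call_llm(model, system, user) -> str is assumed to be your own wrapper
# around your model provider's SDK; it is not defined in this file.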

@dataclass
class TestCase:
    id: str
    input: str
    expected_contains: List[str]  # Response should contain these
    expected_not_contains: List[str]  # Response should NOT contain these
    min_length: int = 50
    max_length: int = 500

def evaluate_prompt(prompt_text: str, test_cases: List[TestCase], model: str) -> dict:
    results = []

    for case in test_cases:
        # Call LLM with prompt
        response = call_llm(model, system=prompt_text, user=case.input)

        # Check criteria
        checks = {
            "contains_all": all(s in response for s in case.expected_contains),
            "excludes_all": not any(s in response for s in case.expected_not_contains),
            "length_ok": case.min_length <= len(response) <= case.max_length,
            "response": response
        }

        checks["passed"] = all([
            checks["contains_all"],
            checks["excludes_all"],
            checks["length_ok"]
        ])

        results.append({"case": case.id, **checks})

    # Guard against an empty test set so the evaluator never divides by zero
    pass_rate = sum(1 for r in results if r["passed"]) / len(results) if results else 0.0

    return {
        "pass_rate": pass_rate,
        "total": len(results),
        "passed": sum(1 for r in results if r["passed"]),
        "failed": sum(1 for r in results if not r["passed"]),
        "details": results
    }

Each test case says what a good answer should contain, what it must not contain, and how long it can run. The evaluator calls the model with the prompt, checks the response against those rules, and hands back a pass rate. The 50-to-500 character bounds and the rest of the defaults are starting points; set them to whatever your own answers should look like.
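The evaluator above takes test cases as Python objects, while the CI step in the next section calls the script with --prompt and --output. One way to bridge the two is sketched below. The JSON layout, the --cases-dir flag, and the tests/prompts folder are assumptions for illustration, and TestCase is redefined with empty-list defaults so the block runs on its own.

```python
import argparse
import json
from dataclasses import dataclass, field
from pathlib import Path
from typing import List


@dataclass
class TestCase:
    id: str
    input: str
    expected_contains: List[str] = field(default_factory=list)
    expected_not_contains: List[str] = field(default_factory=list)
    min_length: int = 50
    max_length: int = 500


def parse_test_cases(text: str) -> List[TestCase]:
    """Parse a JSON array of objects whose keys match the TestCase fields."""
    return [TestCase(**item) for item in json.loads(text)]


def load_test_cases(cases_dir: Path, prompt_name: str) -> List[TestCase]:
    # Assumed convention: one <prompt-name>.json file per prompt
    return parse_test_cases((cases_dir / f"{prompt_name}.json").read_text())


def parse_args(argv=None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Evaluate a prompt against its test cases")
    parser.add_argument("--prompt", required=True, help="Prompt folder name, e.g. customer-support")
    parser.add_argument("--output", default="results.json", help="Where to write the results JSON")
    parser.add_argument("--cases-dir", default="tests/prompts", help="Folder holding the case files")
    return parser.parse_args(argv)


# Example: what the CI step would pass on the command line
args = parse_args(["--prompt", "customer-support", "--output", "results.json"])
sample = '[{"id": "refund-1", "input": "I want a refund", "expected_contains": ["refund"]}]'
print(args.prompt, len(parse_test_cases(sample)))
```

Keeping test cases in JSON next to the prompts means a reviewer sees the prompt change and the cases it was judged against in the same pull request.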

Step 4: GitHub Actions Integration

# .github/workflows/prompt-eval.yml
name: Prompt Evaluation

on:
  pull_request:
    paths:
      - 'prompts/**'

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
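      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # fetch full history so origin/main exists for the diff

      - name: Evaluate changed prompts
        run: |
          # Unique prompt folders touched by this PR; prompts/metadata.yaml is skipped
          CHANGED=$(git diff --name-only origin/main...HEAD \
            | grep -E "^prompts/[^/]+/" | cut -d/ -f2 | sort -u || true)

          for PROMPT_NAME in $CHANGED; do
            echo "Evaluating $PROMPT_NAME..."
            python evaluate_prompt.py --prompt "$PROMPT_NAME" --output results.json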

            PASS_RATE=$(jq '.pass_rate' results.json)
            if (( $(echo "$PASS_RATE < 0.85" | bc -l) )); then
              echo "FAIL: Pass rate $PASS_RATE below threshold (0.85)"
              exit 1
            fi

            echo "PASS: $PROMPT_NAME - $PASS_RATE"
          done
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

This workflow fires only when a pull request touches the prompts/ folder. It finds the changed prompts, runs the evaluator on each one, and fails the build if any drops below an 85% pass rate. That number is yours to set. The point is that a prompt change can no longer merge on a hunch; it has to clear the same gate as code.
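The threshold check in the workflow relies on jq and bc being available on the runner. If you would rather keep the gate in Python, a sketch along these lines does the same job and also names the failing cases. The file name gate.py and both function names are hypothetical; the results dict is assumed to have the shape returned by evaluate_prompt in Step 3.

```python
import json
from pathlib import Path


def gate(results: dict, threshold: float = 0.85) -> int:
    """Return a process exit code: 0 if the pass rate clears the threshold, else 1."""
    pass_rate = results["pass_rate"]
    if pass_rate < threshold:
        # List the failing case ids so the CI log says what to look at
        failed = [d["case"] for d in results.get("details", []) if not d["passed"]]
        print(f"FAIL: pass rate {pass_rate:.2f} below threshold ({threshold:.2f}); failing cases: {failed}")
        return 1
    print(f"PASS: pass rate {pass_rate:.2f}")
    return 0


def main(argv: list) -> int:
    # argv[0] is the path to the results JSON written by the evaluator
    return gate(json.loads(Path(argv[0]).read_text()))


# In a real gate.py you would end with: sys.exit(main(sys.argv[1:]))
print(gate({"pass_rate": 0.80, "details": [{"case": "refund-1", "passed": False}]}))
```

A non-zero exit code is all GitHub Actions needs to fail the step, so the shell loop could call this script in place of the jq and bc lines.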

Step 5: A/B Testing Framework

# ab_testing.py
class PromptABTest:
    def __init__(self, flag_client, analytics):
        self.flags = flag_client
        self.analytics = analytics

    def get_prompt_for_user(self, test_name: str, user_id: str,
                            control_prompt: str, treatment_prompt: str) -> str:
        variant = self.flags.get_variant(
            flag_key=f"prompt-test-{test_name}",
            user_id=user_id,
            default="control"
        )
        return treatment_prompt if variant == "treatment" else control_prompt

    def record_outcome(self, test_name: str, user_id: str, variant: str,
                       metrics: dict):
        """Record outcome for A/B test analysis."""
        self.analytics.track({
            "event": "prompt_ab_test",
            "test_name": test_name,
            "user_id": user_id,
            "variant": variant,
            **metrics
        })

    def analyse_results(self, test_name: str, min_samples: int = 100) -> dict:
        """Analyse A/B test results. Returns winner or 'inconclusive'."""
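        # Query analytics for test data. test_name is interpolated into the SQL
        # string, so it must come from trusted code, never from user input;
        # prefer parameterised queries if your analytics client supports them.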
        control_data = self.analytics.query(
            f"SELECT * FROM events WHERE test_name='{test_name}' AND variant='control'"
        )
        treatment_data = self.analytics.query(
            f"SELECT * FROM events WHERE test_name='{test_name}' AND variant='treatment'"
        )

        if len(control_data) < min_samples or len(treatment_data) < min_samples:
            return {"status": "insufficient_data", "control_n": len(control_data),
                    "treatment_n": len(treatment_data)}

        # Compare key metric (e.g., task completion rate)
        control_rate = sum(1 for d in control_data if d["completed"]) / len(control_data)
        treatment_rate = sum(1 for d in treatment_data if d["completed"]) / len(treatment_data)

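        # Relative-margin check, not a statistical significance test
        if treatment_rate > control_rate * 1.05:
            # Avoid dividing by zero when the losing variant never completed
            improvement = (treatment_rate - control_rate) / control_rate if control_rate else None
            return {"winner": "treatment", "improvement": improvement}
        elif control_rate > treatment_rate * 1.05:
            improvement = (control_rate - treatment_rate) / treatment_rate if treatment_rate else None
            return {"winner": "control", "improvement": improvement}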

        return {"winner": "inconclusive", "control_rate": control_rate, "treatment_rate": treatment_rate}

This is where flag-based assignment earns its keep. The flag decides who gets the control prompt and who gets the new one (an even split is the usual starting point), and you log how each group does on a metric you care about, like whether the task got finished. The analyse_results method waits until each variant has at least 100 recorded outcomes, then declares a winner only if its rate beats the other's by more than 5% in relative terms. Anything closer comes back inconclusive, which is the honest answer. Note that this margin check is not a significance test; for high-stakes calls, lean on a proper statistical method instead.
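As one example of a more rigorous method, a two-proportion z-test asks how likely a gap this large would be if both prompts actually performed the same. The sketch below uses only the standard library; the function name is hypothetical, and it assumes each outcome is an independent yes/no completion with samples large enough for the normal approximation to hold.

```python
import math


def two_proportion_z_test(successes_a: int, n_a: int, successes_b: int, n_b: int) -> dict:
    """Two-sided z-test for a difference between two completion rates."""
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both groups need at least one observation")
    rate_a = successes_a / n_a
    rate_b = successes_b / n_b
    # Pooled rate under the null hypothesis that both variants perform the same
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        # Everyone completed, or nobody did: no evidence of a difference
        return {"z": 0.0, "p_value": 1.0, "rate_a": rate_a, "rate_b": rate_b}
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return {"z": z, "p_value": p_value, "rate_a": rate_a, "rate_b": rate_b}


# Control: 40 of 100 completed. Treatment: 60 of 100 completed.
result = two_proportion_z_test(40, 100, 60, 100)
print(f"z = {result['z']:.2f}, p = {result['p_value']:.4f}")
```

A common convention is to call the result significant when the p-value is below 0.05, but pick that cut-off before the test starts, and avoid checking repeatedly and stopping the moment it dips under the line.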

Step 6: Regression Detection

# regression_detector.py
import json
from pathlib import Path

class RegressionDetector:
    def __init__(self, history_dir: str = ".prompt_history"):
        self.history_dir = Path(history_dir)
        self.history_dir.mkdir(exist_ok=True)

    def save_baseline(self, prompt_name: str, scores: dict):
        """Save baseline scores for a prompt version."""
        path = self.history_dir / f"{prompt_name}_baseline.json"
        with open(path, 'w') as f:
            json.dump(scores, f, indent=2)

    def check_regression(self, prompt_name: str, current_scores: dict,
                        threshold: float = 0.05) -> dict:
        """Check if current scores regressed from baseline."""
        path = self.history_dir / f"{prompt_name}_baseline.json"

        if not path.exists():
            self.save_baseline(prompt_name, current_scores)
            return {"status": "new_baseline", "regression": False}

        with open(path) as f:
            baseline = json.load(f)

        regressions = []
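        for metric, current in current_scores.items():
            baseline_val = baseline.get(metric, 0)
            # Skip non-numeric entries such as the "details" list. Pass only
            # higher-is-better metrics: a falling "failed" count would
            # otherwise be flagged as a regression.
            if not isinstance(current, (int, float)) or not isinstance(baseline_val, (int, float)):
                continue
            if baseline_val > 0: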
                drop = (baseline_val - current) / baseline_val
                if drop > threshold:
                    regressions.append({
                        "metric": metric,
                        "baseline": baseline_val,
                        "current": current,
                        "drop": drop
                    })

        return {
            "regression": len(regressions) > 0,
            "regressions": regressions,
            "threshold": threshold
        }

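This is the safety net. The first time it sees a prompt, it records the scores as a baseline. After that, every new run gets compared against that saved baseline, and any metric that drops more than 5% relative to its baseline value gets flagged as a regression. The baseline file has to survive between runs, so in CI either commit .prompt_history or store it somewhere persistent; on a fresh checkout the detector would otherwise record a new baseline every time and never flag anything. The detail that matters: you compare against a fixed, saved baseline, not against last week's numbers. Compare against a moving target and a slow decline never trips the alarm because each step looks small.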
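The difference between a fixed and a moving baseline is easy to see with numbers. The sketch below is a simplified stand-in for the detector, not the class above: it takes a made-up series of weekly pass rates that slip by two points each week and reports which runs a 5% threshold would flag under each strategy.

```python
def relative_drop(baseline: float, current: float) -> float:
    """Fraction by which current has fallen below baseline."""
    return (baseline - current) / baseline


def flagged_runs(scores: list, threshold: float = 0.05, moving: bool = False) -> list:
    """Return indexes of runs flagged as regressions.

    With moving=False every run is compared with the first score.
    With moving=True every run is compared with the run before it.
    """
    baseline = scores[0]
    flagged = []
    for index, score in enumerate(scores[1:], start=1):
        if relative_drop(baseline, score) > threshold:
            flagged.append(index)
        if moving:
            baseline = score  # moving target: the bar follows the decline down
    return flagged


weekly_pass_rates = [0.90, 0.88, 0.86, 0.84, 0.82]  # illustrative data only
print(flagged_runs(weekly_pass_rates))               # fixed baseline -> [3, 4]
print(flagged_runs(weekly_pass_rates, moving=True))  # moving baseline -> []
```

Against the fixed baseline, the third week's 0.84 is a 6.7% drop and gets flagged. Against the moving baseline, no single step exceeds about 2.4%, so an overall fall from 0.90 to 0.82 passes unnoticed.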

Do/Don't

Do | Don't
Version every prompt change | Edit prompts without tracking
Run automated tests before deployment | Deploy untested prompt changes
A/B test significant changes | Roll out major changes to 100% immediately
Save baselines for regression detection | Compare against moving targets
Use feature flags for prompt deployment | Hard-code prompts in application code

Conclusion

Put these six pieces together and prompt work stops being guesswork. Git holds the history, so you can always see what changed and roll back if you need to. The CI gate keeps a weak prompt from reaching customers. A/B tests tell you which version actually performs, rather than which one felt right in a meeting. And the baseline check catches the slow slide that nobody notices until the complaints start. The prompt is one of the most load-bearing pieces of text in your product. Worth treating it that way.

What to do next

  1. Pick the smallest useful workflow that proves the pattern.
  2. Write down the owner, data boundary, review point, and success measure.
  3. Review the result after the first real run and decide whether to scale, change, or stop.

AI Kick Start is an Illawarra-based AI studio in Figtree, helping businesses across Wollongong, Shellharbour and Kiama and right across Australia put AI to work.
