Back to news

Code

Prompt Engineering for Agents: System Prompts That Work.

System prompts are the highest-leverage optimisation in agentic coding. The six-section framework that consistently improves output quality by 30-50% across Claude Code, Hermes, and OpenClaw.

AI Kick Start editorial image for Prompt Engineering for Agents: System Prompts That Work.

Decision

Start narrow

Use the article to decide the smallest useful workflow worth testing before expanding the system.

Risk to watch

Hype drift

Avoid turning a practical adoption step into a broad transformation promise nobody can verify.

Proof to collect

Business signal

Write down the owner, data boundary, review point, and measurable outcome before the first build.

TL;DR

TL;DR: System prompts are the highest-leverage optimisation in agentic coding. The six-section framework that consistently improves output quality by 30-50% across Claude Code, Hermes, and OpenClaw.

Key takeaways

  • Briefing: System prompts are the most underrated part of agent performance.
  • The Anatomy of an Effective System Prompt: A system prompt that earns its keep tends to have six sections.
  • Anti-Patterns: These habits reliably drag agent performance down: **Over-constraining**: Pile on too many rules and the agent starts optimising for compliance instead of quality **Under-constraining**: Too few rules and the output drifts from one run to the next **Conflicting instructions**: Rules that contradict each other just confuse the agent **Vague language**: "Be careful" and "do good work" give the agent nothing to act on **All-caps shouting**: AGENTS DO NOT NEED TO BE YELLED AT **Negative framing**: "Do not use X" lands weaker than "Use Y instead"
  • Testing System Prompts: Test your system prompts the way you test code.
  • Prompt Length Trade-offs: A longer prompt carries more context but eats into your token budget.

Briefing

System prompts are the most underrated part of agent performance. Practitioners often report that a well-written system prompt does more for output quality than swapping in a bigger model, and a sloppy one can make even a top-tier model like Claude Opus 4.8 turn out average work. What follows is drawn from running these agents in production over several months.

Here is the part most teams miss. When you wire up a coding agent or an internal assistant, the instinct is to reach for the most capable model and assume the rest takes care of itself. It usually does not. The system prompt, that block of standing instructions the agent reads before it sees your actual task, quietly shapes every decision it makes.

Get it right and the agent behaves like a senior hire who already knows your codebase and your standards. Get it wrong and you get a clever generalist who keeps guessing at things you never told it. The good news for Australian teams watching their tool spend: tuning a prompt costs nothing, takes minutes to test, and you can do it without touching your model contract.

This piece walks through the structure of a prompt that holds up under real work, the mistakes that quietly drag performance down, and a way to test prompts like you'd test code.

The Anatomy of an Effective System Prompt

A system prompt that earns its keep tends to have six sections.

1. Role Definition

Define what the agent is, not just what it does:

You are a senior TypeScript engineer specialising in API design. You value
type safety, explicit error handling, and composable architectures. You
dislike clever one-liners that sacrifice readability.

This sets the persona, the expertise, and the values. It nudges the agent toward sensible choices without you having to spell out a rule for every situation.

2. Constraint Specification

Explicit constraints head off the usual failure modes:

CONSTRAINTS:
- Never use `any` type. Use `unknown` with type guards instead.
- Never use `eval()` or dynamic code execution.
- All database queries must use the repository layer in src/repositories/.
- All public functions must have JSDoc comments.
- Prefer early returns over nested conditionals.
- Use neverthrow for error handling, not try/catch in new code.

Constraints work when they're specific and enforceable. "Write good code" is not a constraint. "Use neverthrow for error handling" is, because the agent can either follow it or not, and you can check.

3. Context Provision

Give the agent the context it needs to make good calls:

CONTEXT:
- This is a Fastify-based REST API using Prisma ORM.
- Authentication uses JWT tokens with refresh token rotation.
- The codebase supports Node 18+ and uses native fetch (not axios).
- We are migrating from REST to GraphQL (in progress, ~30% complete).
- Performance target: p95 response time < 200ms for all endpoints.

Context stops the agent from falling back on defaults that clash with your setup. (Fastify and Prisma here are just stand-ins for whatever your real stack happens to be.)

4. Output Format Specification

Spell out exactly how you want the output formatted:

OUTPUT FORMAT:
For code changes:
1. Show the complete file content, not just the diff
2. Include JSDoc for all new or modified functions
3. Flag any breaking changes with [BREAKING] prefix
4. Suggest test cases for new functionality

For explanations:
1. Start with a one-sentence summary
2. Provide details in bullet points
3. Include code examples where relevant
4. Note any trade-offs or alternatives considered

5. Error Handling Instructions

Tell the agent what to do when things go sideways:

ERROR HANDLING:
- If you cannot complete a task, explain why and what you tried
- If you are uncertain about a requirement, ask for clarification
- If you encounter a pattern that violates the constraints, flag it
- Never silently skip steps or ignore errors
- If a file is too large to read at once, read it in sections

6. Chain-of-Thought Trigger

For complex work, tell the agent to reason through it step by step:

For tasks rated medium or high complexity:
1. First, analyse the codebase to understand current patterns
2. Second, identify the minimal set of changes needed
3. Third, consider edge cases and error scenarios
4. Fourth, implement the changes
5. Fifth, verify with tests
Show your reasoning at each step.

Anti-Patterns

These habits reliably drag agent performance down:

  • Over-constraining: Pile on too many rules and the agent starts optimising for compliance instead of quality
  • Under-constraining: Too few rules and the output drifts from one run to the next
  • Conflicting instructions: Rules that contradict each other just confuse the agent
  • Vague language: "Be careful" and "do good work" give the agent nothing to act on
  • All-caps shouting: AGENTS DO NOT NEED TO BE YELLED AT
  • Negative framing: "Do not use X" lands weaker than "Use Y instead"

Testing System Prompts

Test your system prompts the way you test code. Keep a suite of representative tasks and score the output:

def test_system_prompt():
    test_cases = [
        "Add a new REST endpoint for user preferences",
        "Refactor this callback-heavy code to use async/await",
        "Fix this race condition in the caching layer",
    ]
    for task in test_cases:
        output = agent.run(task, system_prompt=candidate_prompt)
        score = evaluate(output, criteria=[
            "type_safety", "error_handling", "readability", "test_coverage"
        ])
        assert score > 0.8, f"Failed on {task}: {score}"

Prompt Length Trade-offs

A longer prompt carries more context but eats into your token budget. The practical ceiling is the model's context window minus the space the actual task needs. With Claude Opus 4.8, a roughly 2,000-token system prompt still leaves plenty of room for involved work. On smaller models, trim the prompt to 500-1,000 tokens by keeping the constraints and the output format and cutting the rest.

System prompt engineering is one of the highest-leverage moves available in agentic coding. Iterating costs nothing, a test cycle takes minutes, and teams that put the work in commonly report a noticeable lift in output quality. It's worth your time here before you reach for a bigger model or bolt on more tools.

Source trail

Primary references to keep this briefing grounded

AI and automation information changes quickly. Use these official or primary references to verify the claims, pricing, product behaviour, and compliance details before committing budget or production data.

What to do next

  1. Pick the smallest useful workflow that proves the pattern.
  2. Write down the owner, data boundary, review point, and success measure.
  3. Review the result after the first real run and decide whether to scale, change, or stop.

Want help applying this? Explore AI agent design systems.

AI Kick Start is an Illawarra-based AI studio in Figtree, helping businesses across Wollongong, Shellharbour and Kiama and right across Australia put AI to work.

Explore with AI

Use the article as a decision prompt

Summarise this AI Kick Start article for an Australian business owner. Focus on the useful decision, the risks, and the first practical next step: Prompt Engineering for Agents: System Prompts That Work

Turn this into a practical roadmap.

Use the guide as a starting point, then map the first workflow worth building.

Book an AI strategy call