A pattern from the last twelve months of client work. A team has an agent doing something useful, but it's flaky. Some days it produces the right answer; other days it drifts. The team's reflex is to "improve the prompt". They add bullet points. They add examples. They add a system message in all caps. The behaviour improves for a week and then drifts again. The team starts to talk about prompt engineering as an art rather than a craft.

The diagnosis, almost every time, is the same: the prompt is being asked to do work that a service would do better. The fix is to move that work out of the prompt and into the service, leaving the prompt with only the things a language model is uniquely good at.

It helps to name what each side is good at.

What belongs in a service

A service is anything you write in code that runs deterministically. Functions. Database calls. API integrations. Validation. Retrieval. Tools the agent can invoke.

The services in your stack should absorb:

  - Parsing and structured extraction that a regex or a schema check can do.
  - Lookups against databases, CRMs, and external APIs.
  - Validation of inputs and outputs.
  - Routing and control flow — anything that is a switch statement in disguise.

The test for "should this be in a service" is the same as the test for "should I write a unit test for this". If you'd write a unit test, it belongs in a service. Models are not unit-testable in the same way; services are.
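A minimal sketch of that test in practice, using a hypothetical enquiry-parsing service (the field names and regex are illustrative, not from any real system):

```python
import re
from dataclasses import dataclass

@dataclass
class Enquiry:
    email: str
    company_domain: str

def parse_enquiry(raw: str) -> Enquiry:
    """Deterministic extraction: either the contact email is there or it isn't."""
    match = re.search(r"[\w.+-]+@([\w-]+\.[\w.]+)", raw)
    if not match:
        raise ValueError("no contact email found")
    return Enquiry(email=match.group(0), company_domain=match.group(1))

# The unit test is exactly the test for "this belongs in a service":
def test_parse_enquiry():
    e = parse_enquiry("Hi, I'm jane@acme.io and we need 50 seats.")
    assert e.company_domain == "acme.io"
```

If the extraction rules get fuzzier than a regex can hold, that is the signal the step is drifting towards the prompt side of the line — not a reason to grow the regex forever.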

What belongs in a prompt

A prompt — and the model behind it — is good at the things services are bad at. The prompt should carry:

  - Classification against fuzzy categories, where the boundary is a judgment call.
  - Drafting text in a particular voice or tone.
  - Interpreting ambiguous input that resists a fixed schema.

The test for "should this be in a prompt" is roughly the inverse of the unit-test test. If two competent humans would write subtly different answers to the question, the answer probably belongs in a prompt.

A worked example

A team is building an agent that triages incoming sales enquiries. The agent reads the enquiry, classifies it (qualified / not qualified / needs more info), looks up the company in the CRM, drafts a response, and routes the conversation to the right human.

In a prompt-heavy design, the team writes a long prompt that asks the model to do all five steps. The classification drifts. The CRM lookup sometimes hallucinates a company that doesn't exist. The drafted response occasionally contradicts the classification.

In a service-heavy design, the same agent is decomposed:

  1. Service: parse the enquiry into structured fields. (Regex or a structured-extract call.)
  2. Prompt: classify the enquiry against three categories, given the structured fields. (One small call. Eval-able.)
  3. Service: look up the company in the CRM. (Deterministic API call. Either finds it or doesn't.)
  4. Prompt: draft a response, given the classification and the CRM record. (One model call. Voice-tunable.)
  5. Service: route the conversation based on the classification. (Switch statement.)

The same agent. Three model calls instead of one. Each call doing the thing models are good at, with the boundaries enforced by services. Drift is now isolated — if the draft starts going off the rails, only the draft prompt needs attention.
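The decomposition above can be sketched as a single function, with the model calls stubbed out. Everything here is illustrative — the function names, prompts, and the `model` callable are stand-ins, not a real API — but the shape is the point: prompts do the fuzzy steps, code enforces the boundaries.

```python
import re

CATEGORIES = {"qualified", "not_qualified", "needs_more_info"}

def triage(raw_enquiry: str, model, crm_lookup, route):
    """One pass through the five steps. `model` is any prompt -> text callable."""
    # 1. Service: parse into structured fields (deterministic).
    match = re.search(r"[\w.+-]+@([\w-]+\.[\w.]+)", raw_enquiry)
    fields = {"email": match.group(0), "domain": match.group(1)} if match else {}

    # 2. Prompt: one small classification call; the boundary is enforced in code.
    category = model(f"Classify this enquiry as one of {sorted(CATEGORIES)}: {fields}").strip()
    if category not in CATEGORIES:
        category = "needs_more_info"

    # 3. Service: the CRM lookup either finds the company or it doesn't.
    record = crm_lookup(fields.get("domain", ""))

    # 4. Prompt: draft the reply, tunable in isolation.
    draft = model(f"Draft a reply. Classification: {category}. CRM record: {record}")

    # 5. Service: routing is a lookup, not a judgment call.
    return route(category, draft)
```

Note that the model is injected as a plain callable. That is what makes the pipeline testable: in a unit test you pass a fake `model`, and the deterministic skeleton around it is exercised like any other code.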

The cost of getting it wrong

Putting deterministic logic into prompts looks cheap because the first version works. The cost shows up in week six, when the system is in production and the behaviour starts drifting in ways the team can't reproduce. Debugging then is brutal: there's no stack trace, no error log, no failing unit test. There's just "it used to work and now it doesn't, sometimes".

Putting ambiguous reasoning into services looks cheap because you can write the code. The cost shows up six months later, when the team has built a thousand-line classifier full of special cases that a five-line model call could now handle better. The team has spent half a year manually doing what the model is now capable of doing for them.

The middle path — services for the deterministic parts, prompts for the ambiguous parts — is the one that ages well. It also makes the system testable. Every prompt has an eval. Every service has a unit test. When something breaks, you know which side of the line to look on.
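One shape "every prompt has an eval" can take — a small labelled set and an accuracy threshold. The cases and the threshold here are illustrative; in practice the cases come from real enquiries the team has hand-labelled.

```python
def eval_classifier(classify, cases, threshold=0.9):
    """`classify` is the prompt-wrapped model call; `cases` is (text, expected) pairs."""
    hits = sum(classify(text) == expected for text, expected in cases)
    accuracy = hits / len(cases)
    return accuracy, accuracy >= threshold

# Labelled sample — in production, pulled from real enquiries:
cases = [
    ("We need 200 seats and budget is approved.", "qualified"),
    ("Do you offer a student discount?", "not_qualified"),
    ("Can you tell me more about pricing?", "needs_more_info"),
]
```

Run this in CI the same way you run unit tests. When the draft prompt changes, the classification eval tells you immediately whether the change leaked across the boundary.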

A practical heuristic

When you're about to add a new step to the prompt, pause for thirty seconds and ask: would this be one if-statement in code? If yes, the step does not belong in the prompt.

Move it to the service that wraps the model. That single discipline, applied consistently for a few weeks, turns most flaky agents into reliable ones.
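As a concrete instance of the heuristic: a rule like "if the enquiry mentions a competitor, route it to the competitive team" is one if-statement, so it belongs in code, not in a bullet point the model is asked to remember. The competitor names and route labels below are illustrative:

```python
COMPETITORS = {"rivalco", "othervendor"}  # illustrative names

def route_override(enquiry_text: str, default_route: str) -> str:
    """A rule that was living in the prompt, moved into the service that wraps the model."""
    words = set(enquiry_text.lower().split())
    if words & COMPETITORS:
        return "competitive_team"
    return default_route
```

The prompt no longer needs to know the rule exists, which is exactly why it stops drifting.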


If your agent is flaky and you're tired of "prompt engineering" your way out of it, we can usually find the line in an afternoon. Get in touch →