A pattern from the last twelve months of client work. A team has an agent doing something useful, but it's flaky. Some days it produces the right answer; other days it drifts. The team's reflex is to "improve the prompt". They add bullet points. They add examples. They add a system message in all caps. The behaviour improves for a week and then drifts again. The team starts to talk about prompt engineering as an art rather than a craft.

The diagnosis, almost every time, is the same: the prompt is being asked to do work that a service would do better. The fix is to move that work out of the prompt and into the service, leaving the prompt with only the things a language model is uniquely good at.

It helps to name what each side is good at.

What belongs in a service

A service is anything you write in code that runs deterministically. Functions. Database calls. API integrations. Validation. Retrieval. Tools the agent can invoke.

The services in your stack should absorb:

  - Parsing and structured extraction that a regex or a schema check can do.
  - Lookups against databases, CRMs, and external APIs.
  - Validation of inputs and outputs.
  - Routing and control flow — anything that is a switch statement in disguise.

The test for "should this be in a service" is the same as the test for "should I write a unit test for this". If you'd write a unit test, it belongs in a service. Models are not unit-testable in the same way; services are.
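A minimal sketch of that test in practice, using a hypothetical enquiry-parsing service (the field names and regex are illustrative, not from any real system):

```python
import re
from dataclasses import dataclass

@dataclass
class Enquiry:
    email: str
    company_domain: str

def parse_enquiry(raw: str) -> Enquiry:
    """Deterministic extraction: either the contact email is there or it isn't."""
    match = re.search(r"[\w.+-]+@([\w-]+\.[\w.]+)", raw)
    if not match:
        raise ValueError("no contact email found")
    return Enquiry(email=match.group(0), company_domain=match.group(1))

# The unit test is exactly the test for "this belongs in a service":
def test_parse_enquiry():
    e = parse_enquiry("Hi, I'm jane@acme.io and we need 50 seats.")
    assert e.company_domain == "acme.io"
```

If the extraction rules get fuzzier than a regex can hold, that is the signal the step is drifting towards the prompt side of the line — not a reason to grow the regex forever.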

What belongs in a prompt

A prompt — and the model behind it — is good at the things services are bad at. The prompt should carry:

  - Classification against fuzzy categories, where the boundary is a judgment call.
  - Drafting text in a particular voice or tone.
  - Interpreting ambiguous input that resists a fixed schema.

The test for "should this be in a prompt" is roughly the inverse of the unit-test test. If two competent humans would write subtly different answers to the question, the answer probably belongs in a prompt.

A worked example

A team is building an agent that triages incoming sales enquiries. The agent reads the enquiry, classifies it (qualified / not qualified / needs more info), looks up the company in the CRM, drafts a response, and routes the conversation to the right human.

In a prompt-heavy design, the team writes a long prompt that asks the model to do all five steps. The classification drifts. The CRM lookup sometimes hallucinates a company that doesn't exist. The drafted response occasionally contradicts the classification.

In a service-heavy design, the same agent is decomposed:

  1. Service: parse the enquiry into structured fields. (Regex or a structured-extract call.)
  2. Prompt: classify the enquiry against three categories, given the structured fields. (One small call. Eval-able.)
  3. Service: look up the company in the CRM. (Deterministic API call. Either finds it or doesn't.)
  4. Prompt: draft a response, given the classification and the CRM record. (One model call. Voice-tunable.)
  5. Service: route the conversation based on the classification. (Switch statement.)

The same agent. Three model calls instead of one. Each call doing the thing models are good at, with the boundaries enforced by services. Drift is now isolated — if the draft starts going off the rails, only the draft prompt needs attention.
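The decomposition above can be sketched as a single function, with the model calls stubbed out. Everything here is illustrative — the function names, prompts, and the `model` callable are stand-ins, not a real API — but the shape is the point: prompts do the fuzzy steps, code enforces the boundaries.

```python
import re

CATEGORIES = {"qualified", "not_qualified", "needs_more_info"}

def triage(raw_enquiry: str, model, crm_lookup, route):
    """One pass through the five steps. `model` is any prompt -> text callable."""
    # 1. Service: parse into structured fields (deterministic).
    match = re.search(r"[\w.+-]+@([\w-]+\.[\w.]+)", raw_enquiry)
    fields = {"email": match.group(0), "domain": match.group(1)} if match else {}

    # 2. Prompt: one small classification call; the boundary is enforced in code.
    category = model(f"Classify this enquiry as one of {sorted(CATEGORIES)}: {fields}").strip()
    if category not in CATEGORIES:
        category = "needs_more_info"

    # 3. Service: the CRM lookup either finds the company or it doesn't.
    record = crm_lookup(fields.get("domain", ""))

    # 4. Prompt: draft the reply, tunable in isolation.
    draft = model(f"Draft a reply. Classification: {category}. CRM record: {record}")

    # 5. Service: routing is a lookup, not a judgment call.
    return route(category, draft)
```

Note that the model is injected as a plain callable. That is what makes the pipeline testable: in a unit test you pass a fake `model`, and the deterministic skeleton around it is exercised like any other code.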

The cost of getting it wrong

Putting deterministic logic into prompts looks cheap because the first version works. The cost shows up in week six, when the system is in production and the behaviour starts drifting in ways the team can't reproduce. Debugging then is brutal: there's no stack trace, no error log, no failing unit test. There's just "it used to work and now it doesn't, sometimes".

Putting ambiguous reasoning into services looks cheap because you can write the code. The cost shows up six months later, when the team has built a thousand-line classifier full of special cases that a five-line model call could now handle better. The team has spent half a year manually doing what the model is now capable of doing for them.

The middle path — services for the deterministic parts, prompts for the ambiguous parts — is the one that ages well. It also makes the system testable. Every prompt has an eval. Every service has a unit test. When something breaks, you know which side of the line to look on.
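One shape "every prompt has an eval" can take — a small labelled set and an accuracy threshold. The cases and the threshold here are illustrative; in practice the cases come from real enquiries the team has hand-labelled.

```python
def eval_classifier(classify, cases, threshold=0.9):
    """`classify` is the prompt-wrapped model call; `cases` is (text, expected) pairs."""
    hits = sum(classify(text) == expected for text, expected in cases)
    accuracy = hits / len(cases)
    return accuracy, accuracy >= threshold

# Labelled sample — in production, pulled from real enquiries:
cases = [
    ("We need 200 seats and budget is approved.", "qualified"),
    ("Do you offer a student discount?", "not_qualified"),
    ("Can you tell me more about pricing?", "needs_more_info"),
]
```

Run this in CI the same way you run unit tests. When the draft prompt changes, the classification eval tells you immediately whether the change leaked across the boundary.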

A practical heuristic

When you're about to add a new step to the prompt, pause for thirty seconds and ask: would this be one if-statement in code? If yes, the step does not belong in the prompt.

Move it to the service that wraps the model. That single discipline, applied consistently for a few weeks, turns most flaky agents into reliable ones.
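As a concrete instance of the heuristic: a rule like "if the enquiry mentions a competitor, route it to the competitive team" is one if-statement, so it belongs in code, not in a bullet point the model is asked to remember. The competitor names and route labels below are illustrative:

```python
COMPETITORS = {"rivalco", "othervendor"}  # illustrative names

def route_override(enquiry_text: str, default_route: str) -> str:
    """A rule that was living in the prompt, moved into the service that wraps the model."""
    words = set(enquiry_text.lower().split())
    if words & COMPETITORS:
        return "competitive_team"
    return default_route
```

The prompt no longer needs to know the rule exists, which is exactly why it stops drifting.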


If your agent is flaky and you're tired of "prompt engineering" your way out of it, we can usually find the line in an afternoon. Get in touch →