What a governance wrapper actually looks like
Rate limits, guard rails, plan-gated data access, semantic caching: all enforced at the gateway without touching a single line of agent code. A full technical walkthrough of the RNLI AI Agent demo.
This is the second article in a three-part series. The first argued that governance belongs in the infrastructure, not the policy document. This article shows what building that infrastructure actually looks like.
My previous post argued that AI governance has to reach beyond the tools people can see, down into the APIs, event streams and integration layers where agents now operate. But what does that actually look like to build?
This is the answer.
Over the past few weeks I built a full AI agent demo, starting from the excellent foundation Dorian Blanc -- fellow Field CTO at Gravitee -- had already laid. The scenario is a fictional RNLI station finder. I spent ten years as Head of Data at the RNLI, so the context felt honest rather than borrowed. More usefully, it gave me a real-world framing with genuine stakes: emergency services, operational data and time-pressured queries. The kind of environment where governance is not optional.
What I was trying to prove
One central claim: you can enforce security, observability, rate limiting and policy at the infrastructure layer without touching agent code. The agent does its job. The gateway handles everything else.
I wanted to build something that either proved or broke that in practice.
The scenario
A user asks: "Find me the nearest lifeboat stations to Edinburgh and tell me what the sea conditions are like."
Underneath that simple question, several things need to happen. The query "nearest to Edinburgh" has to be resolved against a dataset of 238 RNLI stations. A structured tool call fetches station data. The right coordinates are identified. A specialist weather agent is delegated to for live marine data from Open-Meteo. The results are merged and streamed back token by token.
All of that needs to be observable, rate-limited, screened for bad inputs and auditable. None of that logic should live inside the agent itself.
What is running
Fourteen APIs registered in Gravitee 4.10.7, several Docker containers and three deployment options: local via Docker Compose, an Azure VM and an AKS cluster with horizontal pod autoscaling. The demo was built against 4.10.7. The governance patterns described here are stable across recent major versions; check the changelog for any policy API changes if upgrading.
The agents: two Python A2A agents. The main stations agent handles the conversational loop, tool discovery and LLM calls. It delegates sea conditions queries to a specialist weather agent via an agent-to-agent call. Both agents talk to each other through the gateway, not directly. Every hop is governed and logged.
The gateway: Gravitee sitting in front of everything. Every API call, every LLM request, every agent-to-agent communication goes through it. This is the governance wrapper. It enforces rate limits, screens payloads for injection patterns, runs the toxicity classifier and caches responses.
The event layer: Redpanda -- a Kafka-compatible event streaming platform -- running live streams. Lifeboat launches poll every 30 seconds from the RNLI's public operational feed. Sea conditions are published by the weather agent on demand. A separate enrichment agent subscribes to the launches topic, calls the weather agent for each launch and publishes enriched events autonomously.
The inspector: a custom Node.js service that listens on a TCP socket for events from the Gravitee TCP Reporter and rebroadcasts them over WebSocket to the browser. The result is a real-time animated sequence diagram showing exactly what the agent is doing at every step. No instrumentation in the agent code. The gateway does the reporting.
The governance layer in detail
Here is what Gravitee actually does in this stack, in the order it fires on an inbound request.
Authentication and authorisation. Every request is checked for valid credentials before it reaches the agent. OAuth 2.0 and OIDC handled by Gravitee Access Management. Three access tiers, each with different entitlements. The agent never sees credentials. The gateway decides who gets in.
Injection threat screening. Every request body is checked for SQL injection, XSS patterns and prompt injection signatures before it reaches the agent. If it matches, the request is blocked with a 400 and the event is logged. The agent never sees it.
Rate limiting. Per-role limits enforced at the gateway. An anonymous user hits a limit after two LLM requests in five minutes. This lives in the API definition, not in agent logic. Adding a new tier means changing a gateway policy, not redeploying agent code.
AI guard rails. LLM requests are screened by Gravitee's own DistilBERT multilingual toxicity classifier, running as an ONNX model inside the gateway inference service. Requests above the threshold are blocked with a 400 and a toxicity score in the response. The model runs locally. Nothing leaves the infrastructure for classification.
Response caching. Identical LLM prompts are cached for five minutes at the gateway. The agent sends its full prompt; the gateway checks whether it has seen this exact prompt recently and returns the cached response if so. In testing, cached responses arrived roughly six times faster than live LLM calls. The LLM is never called on a cache hit. The agent has no idea this happened.
Observability. Every request is reported to Elasticsearch for analytics and to the TCP Reporter for the live inspector. Response times, role breakdowns, blocked request counts and cache hit rates are all visible in the APIM console. The agent emits nothing special. The gateway does the work.
Making any existing API agent-ready without touching the backend
The Lifeboat API is a plain REST service with no awareness of AI agents. Gravitee exposes MCP tools from its OpenAPI spec at the gateway entrypoint -- transforming existing REST endpoints into agent-discoverable tools without modifying the underlying service. MCP, Model Context Protocol, is the emerging standard that lets AI agents query structured data sources directly rather than scraping or guessing.
The agent discovers these tools at runtime, reasons about which to call and constructs the right parameters.
The practical implication: any REST API can be made agent-callable without touching the backend. For organisations with legacy REST services and no budget to re-engineer them, this matters considerably.
The same question, three different answers
This is the clearest demonstration of the governance wrapper argument in practice.
The demo has three access tiers drawn from realistic RNLI roles:
Anonymous user -- no credentials, public access only. Two LLM requests per five minutes. One tidal event returned. Wind speed only.
Station volunteer -- authenticated with an API key. Five LLM requests per minute. Two tidal events returned. Full weather conditions.
Coastguard liaison -- authenticated via JWT through Gravitee AM. Uncapped LLM requests. Four tidal events returned. Full conditions plus air and sea temperatures. Live Kafka event streams via SSE.
A coastguard liaison asking "what are the sea conditions near Poole?" gets four tidal events, full weather data and temperatures, and a live stream of active launches. An anonymous user asking the exact same question gets one tidal event and wind speed.
Same agent. Same prompt. Same tool call. The gateway decides what comes back.
Here is how it works. The agent sends the same request regardless of who is calling it. The gateway intercepts it, checks the role and either blocks the call at the rate limit or passes role context downstream so the backend knows what to return. The agent never reasons about what access tier the user is on. It just gets back what it gets back.
The policy is on the path, not in the agent.
What surprised me
The cold-start problem is real. On Apple Silicon, Gravitee's ONNX model takes up to 60 seconds to warm up on first load. I added a dummy request to the startup sequence to pre-warm it. In AKS I added a mock LLM service because the real Ollama instance is unreachable from the cluster network. These are the things that never appear in architecture diagrams.
Multi-turn context is harder than it looks. Getting the agent to resolve "there" from a previous turn -- "find stations near Edinburgh... what are the sea conditions there?" -- required storing the last five conversation turns in session state and injecting them into the LLM context on each call. Not complex, but easy to get wrong in subtle ways.
The inspector is the governance layer made visible. Build it before you run a demo. People will ask about it before they ask about anything else, and what they are really asking is: how do I know what the agent is doing? The inspector answers that question in real time.
Governance without agent changes is achievable but requires discipline. Every time I was tempted to add governance logic to the agent I forced myself to put it in the gateway instead. It requires more upfront thought about API design and policy ordering, but the result is an agent that is genuinely portable. You could swap the gateway implementation and the agent would not care.
What I would do differently
The enrichment agent runs in a tight polling loop at startup and occasionally gets ahead of the rest of the stack during initialisation. I would replace this with a proper readiness check and a backoff strategy. Fine for a demo, wrong for production.
The role-gated data access is implemented via different API configurations per role, which works but means duplicated policy configuration. A cleaner approach would be a single API with role-aware policy conditions. That is the next iteration.
The voice interface uses the browser's Web Speech API, which works well in Chrome and is inconsistent elsewhere. For production I would route audio through a proper STT/TTS service behind the gateway.
If you are starting with limited capacity, you do not need fourteen APIs and a Kafka cluster. One agent, one gateway, rate limiting and request logging. That is the minimum viable governance wrapper. Confirm you can see every call the agent makes before you build anything else. Add the rest when you need it.
The thing I kept having to resist while building this was the temptation to put governance logic in the agent. It always felt easier in the moment. The agent knows the context. The agent can reason about it. Let the agent handle it.
That reasoning is wrong and it will cost you. An agent that reasons about its own governance is an agent you cannot audit, cannot update and cannot trust across contexts. The moment you need to change a policy – tighten a rate limit, add a new role tier, block a new category of prompt – you are redeploying agent code instead of changing a gateway configuration. At scale, in a production system, that difference is the difference between a five-minute change and a release cycle.
Governance belongs in the infrastructure. Not because it is tidier. Because it is the only version of governance that holds when the agent does something you did not anticipate.
Start with the gateway. Not the agent. Put a single API behind it, add rate limiting and request logging, and confirm you can see every call the agent makes before you build anything else.
The code is on GitHub. The demo runs on request. If you are building something in this space and want to talk through the architecture, the contact link is below.
Sam Prodger is Field CTO at Gravitee and spent ten years as Head of Data at the RNLI.
Continue this conversation
Open a pre-loaded prompt in your preferred AI. Edit it before you send.
Pre-loaded with context from this article. Opens in a new tab.