What a governance wrapper actually looks like in code

Rate limits, guard rails, plan-gated data access, semantic caching: all enforced at the gateway without touching a single line of agent code. A full technical walkthrough of the RNLI AI Agent demo.
My previous post argued that AI governance has to reach beyond the tools people can see, down into the APIs, event streams and integration layers where agents now operate. A few people came back with a reasonable challenge: fine, but what does that actually look like to build?

This is the answer.

Over the past few weeks, I built a full AI agent demo, building on an excellent starter project by Dorian Blanc, the other Field CTO at Gravitee. The scenario is a fictional RNLI station finder. I spent nine years as Head of Data at the RNLI, so the context felt honest rather than borrowed. More usefully, it gave me a real-world framing with genuine stakes: emergency services, operational data and time-pressured queries. The kind of environment where governance is not optional.

This post walks through the architecture, the key decisions and the things that surprised me.


What I was trying to prove

One central claim: you can enforce security, observability, rate limiting and policy at the infrastructure layer without touching agent code. The agent does its job. The gateway handles everything else.

I wanted to build something that would either prove or break that claim in practice, with the Gravitee gateway doing the enforcing.


The scenario

A user asks: "Find me the nearest lifeboat stations to Edinburgh and tell me what the sea conditions are like."

Underneath, that single question requires resolving "nearest to Edinburgh" against a dataset of 238 RNLI stations, making a structured tool call to fetch station data, identifying the right coordinates, delegating to a specialist weather agent to fetch live marine data from Open-Meteo, merging the results and streaming the response back token by token via SSE.
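The "nearest to Edinburgh" step is a small piece of geometry worth making concrete. A minimal sketch using the haversine formula, with a hypothetical four-station slice of the dataset (the station names are real, the coordinates approximate, and neither reflects the demo's actual data file):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

# Hypothetical slice of the 238-station dataset (coordinates approximate).
STATIONS = [
    {"name": "Queensferry", "lat": 55.990, "lon": -3.398},
    {"name": "Kinghorn",    "lat": 56.068, "lon": -3.175},
    {"name": "Dunbar",      "lat": 56.004, "lon": -2.516},
    {"name": "Poole",       "lat": 50.712, "lon": -1.987},
]

def nearest_stations(lat, lon, n=3):
    """Sort stations by distance from the query point and keep the closest n."""
    return sorted(STATIONS, key=lambda s: haversine_km(lat, lon, s["lat"], s["lon"]))[:n]

# Edinburgh is roughly (55.953, -3.188).
nearest = [s["name"] for s in nearest_stations(55.953, -3.188)]
```

In the demo the agent does not run this itself; it makes a tool call and the station data comes back through the gateway. But some resolution of this shape has to happen somewhere on the path.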

All of that needs to be observable, rate-limited, screened for bad inputs and auditable. None of that logic should live inside the agent itself.


What is running

Fourteen APIs registered in Gravitee, several Docker containers and three deployment options: local via Docker Compose, an Azure VM and an AKS cluster with horizontal pod autoscaling. The core pieces:

The agents. Two Python A2A agents. The main stations agent handles the conversational loop, tool discovery and LLM calls. It delegates sea conditions queries to a specialist weather agent via an agent-to-agent call. Both agents talk to each other through the gateway, not directly. That means every hop is governed and logged.

The gateway. Gravitee 4.10.7 sitting in front of everything. Every API call, every LLM request, every agent-to-agent communication goes through it. This is the governance wrapper. It enforces rate limits, screens payloads for injection patterns, runs the toxicity classifier and caches responses.

The event layer. Redpanda (Kafka-compatible) running live streams: lifeboat launches polling every 30 seconds from the RNLI's public feed and sea conditions published by the weather agent on demand. A separate enrichment agent subscribes to the launches topic, calls the weather agent for each launch and publishes enriched events. This runs autonomously with no user trigger.
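The enrichment agent's core step is a pure merge, which is what makes it easy to reason about. A sketch of that step with the Kafka plumbing omitted and a stub in place of the weather agent (the field names here are illustrative, not the demo's actual event schema):

```python
def enrich_launch(launch: dict, fetch_conditions) -> dict:
    """Merge a launch event with sea conditions for its station.

    `fetch_conditions` stands in for the call to the weather agent,
    which in the real stack goes through the gateway like every
    other hop.
    """
    conditions = fetch_conditions(launch["station"])
    return {**launch, "sea_conditions": conditions, "enriched": True}

# Stub weather agent for illustration.
def stub_weather(station: str) -> dict:
    return {"wind_speed_kts": 14, "wave_height_m": 0.8}

event = enrich_launch({"station": "Poole", "lifeboat": "D-class"}, stub_weather)
```

The real agent wraps this in a consume-enrich-publish loop against the launches topic; keeping the enrichment itself side-effect-free is what lets it run autonomously without a user trigger.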

The inspector. A custom Node.js service that listens on a TCP socket for events from the Gravitee TCP Reporter and rebroadcasts them over WebSocket to the browser. The result is a real-time animated sequence diagram showing exactly what the agent is doing at every step. No instrumentation in the agent code. The gateway does the reporting.

Access Management. Gravitee AM handling OAuth 2.0 and OIDC. Three access tiers, which I will come back to.


The governance layer in detail

Here is what Gravitee actually does in this stack, in the order it fires on an inbound request.

Injection threat screening. Every request body is checked for SQL injection, XSS patterns and prompt injection signatures before it reaches the agent. If it matches, the request is blocked with a 400 and the event is logged. The agent never sees it.
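The screening logic reduces to pattern matching on the request body before it is forwarded. A minimal sketch, with the caveat that these are illustrative signatures only; the gateway policy's actual rule set is richer and not reproduced here:

```python
import re

# Illustrative signatures; the real policy ships its own rule set.
INJECTION_PATTERNS = [
    re.compile(r"(?i)\b(union\s+select|drop\s+table|or\s+1\s*=\s*1)\b"),  # SQL injection
    re.compile(r"(?i)<\s*script\b"),                                       # XSS
    re.compile(r"(?i)ignore\s+(all\s+)?previous\s+instructions"),          # prompt injection
]

def should_block(body: str) -> bool:
    """Return True if the request should be rejected with a 400 and logged."""
    return any(p.search(body) for p in INJECTION_PATTERNS)
```

The point of putting this on the gateway rather than in the agent is that the agent never sees the hostile payload at all, and the rule set can be updated without an agent redeploy.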

Rate limiting. Per-plan limits enforced at the gateway. Free users hit a 429 after two LLM requests in five minutes. This lives in the API definition, not in agent logic. Adding a new tier means changing a gateway policy, not redeploying agent code.
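The Free-tier behaviour (two LLM requests per five minutes, then a 429) is a sliding-window check. A sketch of the mechanism, assuming the limits in the demo's plan table; the Gold window value here is arbitrary since Gold is uncapped:

```python
from collections import defaultdict, deque

# (max requests, window in seconds) per plan, mirroring the demo's tiers.
LIMITS = {"free": (2, 300), "silver": (5, 60), "gold": (float("inf"), 60)}

class RateLimiter:
    """Sliding-window limiter of the kind the gateway enforces per plan."""

    def __init__(self):
        self.hits = defaultdict(deque)

    def allow(self, key: str, plan: str, now: float) -> bool:
        limit, window = LIMITS[plan]
        q = self.hits[key]
        while q and now - q[0] > window:  # evict hits outside the window
            q.popleft()
        if len(q) >= limit:
            return False                   # caller responds 429
        q.append(now)
        return True
```

Because this lives in the gateway's API definition, "add a new tier" is a policy change, not a code change, which is exactly the property the post is arguing for.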

AI guard rails. LLM requests are screened by a DistilBERT toxicity classifier running as an ONNX model inside the gateway inference service. Requests above the threshold are blocked with a 400 and a toxicity score in the response. The model runs locally. Nothing leaves the infrastructure for classification.
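The decision the guard rail makes is simple once the classifier has scored the prompt. A sketch of that gate with a stub in place of the DistilBERT/ONNX model, and with the 0.7 threshold as an assumption (the demo's actual threshold is not stated above):

```python
TOXICITY_THRESHOLD = 0.7  # assumed value for illustration

def guard(prompt: str, classify) -> dict:
    """Block or pass a prompt based on a toxicity score in [0, 1].

    `classify` stands in for the ONNX classifier running in the
    gateway's inference service; nothing leaves the infrastructure.
    """
    score = classify(prompt)
    if score >= TOXICITY_THRESHOLD:
        return {"status": 400, "blocked": True, "toxicity_score": round(score, 3)}
    return {"status": 200, "blocked": False, "toxicity_score": round(score, 3)}

result = guard("what are the sea conditions near Poole?", lambda p: 0.02)
```

The interesting part is not the threshold check but where it runs: inside the gateway, before the LLM, so a blocked prompt costs no tokens and leaks nowhere.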

Response caching. Identical LLM prompts are cached for five minutes at the gateway. The agent sends its full prompt; the gateway checks whether it has seen this exact prompt recently and returns the cached response if so. On a cache hit the response is roughly six times faster and the LLM is never called. The agent has no idea this happened.
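Since the cache keys on identical prompts, the mechanism is an exact-match lookup with a TTL. A sketch of the shape of it (hashing and storage details here are assumptions, not Gravitee's implementation):

```python
import hashlib

TTL_SECONDS = 300  # the demo's five-minute cache window

class PromptCache:
    """Exact-match prompt cache of the kind the gateway sits behind the agent with."""

    def __init__(self):
        self.store = {}

    def _key(self, prompt: str) -> str:
        return hashlib.sha256(prompt.encode()).hexdigest()

    def get(self, prompt: str, now: float):
        hit = self.store.get(self._key(prompt))
        if hit and now - hit[0] < TTL_SECONDS:
            return hit[1]  # cache hit: the LLM is never called
        return None

    def put(self, prompt: str, response: str, now: float):
        self.store[self._key(prompt)] = (now, response)
```

The agent sends its full prompt either way; whether the LLM actually runs is the gateway's decision, which is why the roughly six-times speedup shows up with zero agent changes.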

Observability. Every request is reported to Elasticsearch for analytics and to the TCP Reporter for the live inspector. Response times, plan breakdowns, blocked request counts and cache hit rates are all visible in the APIM console. The agent emits nothing special. The gateway does the work.


The MCP piece

The Lifeboat API is a plain REST service with no MCP awareness. Gravitee auto-generates MCP tools from the OpenAPI spec at the gateway entrypoint. The agent discovers these tools at runtime, reasons about which to call and constructs the right parameters.
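The spec-to-tool transform is mechanical, which is why it can happen at the entrypoint with no backend changes. A sketch of the idea against a hypothetical slice of the Lifeboat API's OpenAPI spec; Gravitee's generated tool schema will differ in detail, and this is not its implementation:

```python
def tools_from_openapi(spec: dict) -> list:
    """Turn OpenAPI operations into MCP-style tool descriptors."""
    tools = []
    for path, methods in spec["paths"].items():
        for method, op in methods.items():
            tools.append({
                "name": op["operationId"],
                "description": op.get("summary", ""),
                "inputSchema": {
                    "type": "object",
                    "properties": {
                        p["name"]: {"type": p["schema"]["type"]}
                        for p in op.get("parameters", [])
                    },
                },
            })
    return tools

SPEC = {  # hypothetical slice of the Lifeboat API spec
    "paths": {
        "/stations": {
            "get": {
                "operationId": "listStations",
                "summary": "List RNLI lifeboat stations",
                "parameters": [
                    {"name": "lat", "in": "query", "schema": {"type": "number"}},
                    {"name": "lon", "in": "query", "schema": {"type": "number"}},
                ],
            }
        }
    }
}

tools = tools_from_openapi(SPEC)
```

The agent then discovers these descriptors at runtime and reasons over the names and schemas; the REST backend never learns it has become agent-callable.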

The practical implication: you can make any REST API agent-callable without modifying the backend. For organisations sitting on legacy REST services with no budget to re-engineer them, that matters quite a lot.


Plan-gated data: the same question, three different answers

This is the clearest demonstration of the governance wrapper argument in practice, so it is worth being specific about how it works.

The demo has three access tiers:

|                       | Free            | Silver          | Gold                                          |
|-----------------------|-----------------|-----------------|-----------------------------------------------|
| Authentication        | Anonymous       | API key         | JWT via Gravitee AM                           |
| LLM requests          | 2 per 5 min     | 5 per min       | Uncapped                                      |
| Tidal events returned | 1               | 2               | 4                                             |
| Weather data          | Wind speed only | Full conditions | Full conditions plus air and sea temperatures |

A Gold user asking "what are the sea conditions near Poole?" gets four tidal events, full weather data and temperatures. A Free user asking the exact same question gets one tidal event and wind speed. Same agent, same prompt, same tool call. The gateway decides what comes back.

Here is how it hangs together. The agent sends the same request regardless of who is calling it. The gateway intercepts it, checks the plan and either blocks the call at the rate limit or passes plan context downstream so the backend knows what to return. The agent never reasons about what tier the user is on. It just gets back what it gets back.
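On the backend side, the shaping step can be sketched as a filter over the full response, driven by whatever plan context the gateway passes downstream. The field lists for Silver and Gold are assumptions here (the post only specifies wind speed, "full conditions" and the temperature fields), and the rule table is illustrative:

```python
# Plan rules mirroring the demo's tier table; Silver/Gold field
# lists are illustrative assumptions.
PLAN_RULES = {
    "free":   {"tidal_events": 1, "fields": ["wind_speed"]},
    "silver": {"tidal_events": 2, "fields": ["wind_speed", "wave_height", "visibility"]},
    "gold":   {"tidal_events": 4, "fields": ["wind_speed", "wave_height", "visibility",
                                             "air_temp", "sea_temp"]},
}

def shape_response(full: dict, plan: str) -> dict:
    """Trim the full backend response to what the caller's plan allows."""
    rules = PLAN_RULES[plan]
    return {
        "tidal_events": full["tidal_events"][: rules["tidal_events"]],
        "weather": {k: v for k, v in full["weather"].items() if k in rules["fields"]},
    }

FULL = {
    "tidal_events": ["high 04:12", "low 10:33", "high 16:40", "low 22:58"],
    "weather": {"wind_speed": 12, "wave_height": 0.6, "visibility": "good",
                "air_temp": 9.5, "sea_temp": 8.1},
}
```

The agent's tool call and prompt are identical for every caller; only this shaping step, keyed off gateway-supplied plan context, differs.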

That is the governance wrapper argument made concrete. The policy is on the path, not in the agent.


What surprised me

The cold-start problem is real. On Apple Silicon, the ONNX model takes up to 60 seconds to warm up on first load. I added a dummy request to the startup sequence to pre-warm it. In AKS I added a mock LLM service because the real Ollama instance is unreachable from the cluster network. These are the things that never appear in architecture diagrams.

Multi-turn context is harder than it looks. Getting the agent to resolve "there" from a previous turn ("find stations near Edinburgh... what are the sea conditions there?") required storing the last five conversation turns in session state and injecting them into the LLM context on each call. Not complex, but easy to get wrong in subtle ways.
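The mechanism itself fits in a few lines, which is part of why the subtle failure modes are easy to miss. A sketch of the approach described above, not the demo's exact session code:

```python
from collections import deque

class SessionContext:
    """Keep the last five turns and inject them into each LLM call,
    so a follow-up like 'there' can resolve against an earlier question."""

    def __init__(self, max_turns: int = 5):
        self.turns = deque(maxlen=max_turns)  # oldest turn evicted automatically

    def add(self, role: str, text: str):
        self.turns.append({"role": role, "content": text})

    def messages(self, new_user_text: str) -> list:
        """History plus the new question, in the order the LLM sees them."""
        return list(self.turns) + [{"role": "user", "content": new_user_text}]

ctx = SessionContext()
ctx.add("user", "find stations near Edinburgh")
ctx.add("assistant", "Queensferry and Kinghorn are closest.")
msgs = ctx.messages("what are the sea conditions there?")
```

The subtle failures live around the edges: which turns count toward the five, whether tool-call results are stored as turns, and what happens when eviction drops the turn that "there" referred to.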

The inspector stole the show. In every demo I have run since, it is the first thing people ask about. Seeing the agent reasoning, calling tools, delegating to a specialist and getting a response, animated in real time in the browser, is more convincing than any explanation of what AI governance means. Build the observability layer. People remember it.

Governance without agent changes is achievable but requires discipline. Every time I was tempted to add governance logic to the agent I forced myself to put it in the gateway instead. It requires more upfront thought about API design and policy ordering, but the result is an agent that is genuinely portable. You could swap the gateway implementation and the agent would not care.


What I would do differently

The enrichment agent runs in a tight polling loop at startup and occasionally gets ahead of the rest of the stack during initialisation. I would replace this with a proper readiness check and a backoff strategy. It is fine for a demo and wrong for production.

The plan-gated data access is implemented via different API configurations per plan, which works but means duplicated policy configuration. A cleaner approach would be a single API with plan-aware policy conditions. That is the next iteration.

The voice interface uses the browser's Web Speech API, which works well in Chrome and is inconsistent elsewhere. For production I would route audio through a proper STT/TTS service behind the gateway.


Where to find it

The full demo is on GitHub. You can run it locally in about 15 minutes with Docker Compose. The README covers three scenarios in order of complexity, which is the order I would recommend if you are coming to it fresh.

There is a live version running on AKS that I demo on request. If you want a walkthrough, the Demos page has a booking link.

If you are building something similar and want to talk through the architecture, the contact link is at the bottom.


Sam Prodger is Field CTO at Gravitee and spent nine years as Head of Data at the RNLI.
