The governance plane for every AI agent your team runs

One proxy in front of Claude Code, Copilot, ChatGPT, Gemini, and your custom agents. Shared semantic cache, per-team budgets, cryptographically signed audit trail. 85% of calls never reach the model. 75% lower cost. Two environment variables to adopt.

pip install agentmesh-proxy sentence-transformers

Zero code changes. Two env vars.

Point your existing tools at AgentMesh and watch 85% of calls return from cache — with full governance headers on every response.

setup.sh

cache_hit.py

audit_query.py

# Install and start the proxy
$ pip install agentmesh-proxy sentence-transformers
$ agentmesh start --port 8080
✓ AgentMesh proxy listening on localhost:8080
✓ Semantic cache loaded  sentence-transformers/all-MiniLM-L6-v2
✓ Exact-match cache ready
✓ Circuit breaker armed · $500/day budget
✓ Audit log  Ed25519 · hash-chained

# That's it. Now point your tools at the proxy:
$ export ANTHROPIC_BASE_URL=http://localhost:8080
$ export OPENAI_BASE_URL=http://localhost:8080

# Claude Code, Copilot, ChatGPT, Gemini, CrewAI, LangGraph:
# all routed through the same cache and budget.

# 5 developers ask similar questions; only 1 hits the model
import anthropic

# The client reads ANTHROPIC_BASE_URL from the environment, so it talks to the proxy
client = anthropic.Anthropic()

# Dev 1: cache miss, calls the model
r1 = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,  # required by the Messages API
    messages=[{"role": "user", "content": "Summarize Q2 revenue trends"}],
)

# Dev 2: semantically similar prompt, served from the cache
r2 = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Give me a summary of Q2 revenue"}],
)

X-AgentMesh-Cache: MISS → model called · 847ms · $0.0024
X-AgentMesh-Cache: HIT → semantic · 12ms · $0.0000
X-AgentMesh-Similarity: 0.94 threshold 0.85
X-AgentMesh-Saved: $0.0024 this request

Session total 20 requests · 15 cache hits · $0.031 spent vs $0.121 list price
74.4% cost reduction · zero accuracy loss · identical responses

# Tamper-evident audit: every request, every team, every dollar
from agentmesh import AuditLog

log = AuditLog.open("audit.jsonl")
entries = log.query(team="platform-eng", days=30)

Audit query team=platform-eng · last 30 days
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
am_4f2a91  2026-06-14 09:11  HIT      semantic · $0.0000 · similarity=0.91
am_7c3b02  2026-06-14 09:09  MISS     model call · $0.0024 · claude-sonnet-4-6
am_1d8e44  2026-06-13 16:32  BLOCKED  quota exceeded · team budget $500/day
am_9a0f77  2026-06-13 14:15  HIT      exact · $0.0000 · 3ms
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
chain valid · Ed25519 verified · 847 entries this month
total spend: $14.20 · list price: $56.80 · 75.0% saved
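The savings percentages in the sample output follow directly from spend versus list price. A minimal sketch of that arithmetic, using the figures printed above (the function name `percent_saved` is just for illustration and is not part of AgentMesh):

```python
def percent_saved(spent: float, list_price: float) -> float:
    """Share of the list price that was not spent, as a percentage."""
    if list_price <= 0:
        raise ValueError("list_price must be positive")
    return (1 - spent / list_price) * 100


# Figures taken from the sample session and the 30-day audit query.
session_saved = percent_saved(0.031, 0.121)
month_saved = percent_saved(14.20, 56.80)

print(f"session: {session_saved:.1f}% saved")
print(f"month:   {month_saved:.1f}% saved")
```

Rounded to one decimal place, these come out to the 74.4% and 75.0% shown in the output.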

Nine stages. One request.

Every LLM call passes through all nine stages in sequence. Most never make it past stage 4 — that's the 85%.

STAGE 01

Circuit Breaker

Stops runaway agent loops before they explode your budget. A single undetected recursive loop can generate $47,000 in API charges overnight. The circuit breaker trips at your configured threshold.
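One way to picture a spend-based circuit breaker is a sliding window over recent costs. The sketch below is an assumption about how such a breaker could work, not AgentMesh's actual implementation; the class name, the window length, and the injectable clock are all hypothetical.

```python
import time
from collections import deque


class SpendCircuitBreaker:
    """Refuses new calls once spend inside a sliding window reaches a limit."""

    def __init__(self, limit_usd, window_seconds, clock=time.monotonic):
        self.limit_usd = limit_usd
        self.window_seconds = window_seconds
        self.clock = clock          # injectable so the sketch can be tested
        self.events = deque()       # (timestamp, cost_usd), oldest first
        self.total = 0.0

    def _evict(self, now):
        # Drop spend records that have aged out of the window.
        while self.events and now - self.events[0][0] >= self.window_seconds:
            _, cost = self.events.popleft()
            self.total -= cost

    def allow(self):
        self._evict(self.clock())
        return self.total < self.limit_usd

    def record(self, cost_usd):
        now = self.clock()
        self._evict(now)
        self.events.append((now, cost_usd))
        self.total += cost_usd


# Simulate a runaway loop with a fake clock.
fake_now = [0.0]
breaker = SpendCircuitBreaker(limit_usd=1.00, window_seconds=60,
                              clock=lambda: fake_now[0])
for _ in range(4):
    if breaker.allow():
        breaker.record(0.30)

tripped = not breaker.allow()   # spend has passed the $1.00 limit
fake_now[0] = 61.0              # the window slides past the old spend
recovered = breaker.allow()
```

A real breaker would probably also look at request rate and repeated prompts, since a loop can be detected before it becomes expensive.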

STAGE 02

Quota Validation

Per-team, per-agent budget enforcement before any model call. For example, the finance team might have a $200/day ceiling while platform engineering has $500. Requests over budget are blocked with a 429 and logged.
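The budget check can be sketched as a lookup followed by a comparison. This is an illustrative sketch only: the `TeamBudget` and `QuotaValidator` names are hypothetical, and the 403 for an unknown team is an assumption. Only the 429 for an exceeded budget comes from the description above.

```python
from dataclasses import dataclass


@dataclass
class TeamBudget:
    daily_limit_usd: float
    spent_today_usd: float = 0.0


class QuotaValidator:
    def __init__(self, budgets):
        self.budgets = budgets  # team name -> TeamBudget

    def check(self, team, estimated_cost_usd):
        """Return (status_code, reason) before any model call is made."""
        budget = self.budgets.get(team)
        if budget is None:
            return 403, "unknown team"      # assumption, not from the source
        if budget.spent_today_usd + estimated_cost_usd > budget.daily_limit_usd:
            return 429, "quota exceeded"
        return 200, "ok"

    def charge(self, team, cost_usd):
        # Called after a model call completes, with the actual cost.
        self.budgets[team].spent_today_usd += cost_usd


validator = QuotaValidator({
    "finance": TeamBudget(daily_limit_usd=200.0, spent_today_usd=199.999),
    "platform-eng": TeamBudget(daily_limit_usd=500.0, spent_today_usd=120.0),
})

finance_result = validator.check("finance", 0.0024)
platform_result = validator.check("platform-eng", 0.0024)
```

Checking against an estimated cost before the call, then charging the actual cost afterwards, keeps a single large request from overshooting the ceiling unnoticed.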

STAGE 03

Exact-Match Cache

Identical prompts return instantly from an in-memory hash lookup: the lookup itself is sub-millisecond and costs nothing. It catches repeated automated queries, regression tests, and CI pipelines that call the same prompt hundreds of times.

~10% hit rate on its own
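An exact-match cache of this kind can be sketched as a dictionary keyed by a hash of the request. The key layout below (model plus messages, serialized as canonical JSON) is an assumption made for illustration; the class name `ExactMatchCache` is hypothetical.

```python
import hashlib
import json


class ExactMatchCache:
    """In-memory cache keyed by a SHA-256 hash of the request payload."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(model, messages):
        # Canonical JSON so that dict key order does not change the hash.
        payload = json.dumps({"model": model, "messages": messages},
                             sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, model, messages):
        value = self._store.get(self.key(model, messages))
        if value is None:
            self.misses += 1
        else:
            self.hits += 1
        return value

    def put(self, model, messages, response):
        self._store[self.key(model, messages)] = response


cache = ExactMatchCache()
model = "claude-sonnet-4-6"
prompt = [{"role": "user", "content": "Summarize Q2 revenue trends"}]

first = cache.get(model, prompt)            # miss: nothing stored yet
cache.put(model, prompt, "cached answer")
second = cache.get(model, prompt)           # hit: identical request
```

Any change to the prompt text, even a single character, produces a different key, which is why this stage alone catches only repeated identical calls.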

STAGE 04

Semantic Similarity Cache

Prompts are normalized first — persona scaffolding stripped, markdown formatting removed, spelling variations collapsed — then embedded and compared against the cache. Similarity above the threshold returns the cached answer instantly.

85% combined hit rate — this is where the savings come from
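The embed-and-compare step can be sketched with cosine similarity against stored entries. To stay self-contained, the sketch swaps in a toy bag-of-words embedding; a real deployment would presumably use the sentence-transformers model named in the startup log. The `SemanticCache` class and its methods are hypothetical names.

```python
import math
from collections import Counter


def toy_embed(text):
    # Stand-in for a real sentence embedding: word counts as a sparse vector.
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold=0.85, embed=toy_embed):
        self.threshold = threshold
        self.embed = embed
        self.entries = []  # list of (vector, response)

    def store(self, prompt, response):
        # Every stored miss becomes a candidate for future hits.
        self.entries.append((self.embed(prompt), response))

    def lookup(self, prompt):
        """Return (response or None, best similarity score)."""
        query = self.embed(prompt)
        best_score, best_response = 0.0, None
        for vector, response in self.entries:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        if best_score >= self.threshold:
            return best_response, best_score
        return None, best_score


cache = SemanticCache(threshold=0.85)
cache.store("summarize q2 revenue trends", "Q2 revenue rose.")

hit, hit_score = cache.lookup("summarize q2 revenue trends please")
miss, miss_score = cache.lookup("write a haiku about rust")
```

The linear scan is fine for a sketch; at scale an approximate nearest-neighbor index would be the usual choice.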

STAGE 05

Vendor Routing

For cache misses, route to the cheapest capable model that satisfies the request's complexity profile. Route haiku-tier requests to Haiku, not Sonnet. Use the right model for the task, not just the default.
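"Cheapest capable model" can be sketched as a filter followed by a minimum. The model names, capability tiers, and prices below are invented placeholders for illustration, not real vendor pricing, and how AgentMesh scores request complexity is not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    name: str
    capability: int            # hypothetical tier: higher handles harder tasks
    cost_per_mtok_usd: float   # placeholder price, not real pricing


def route(options, required_capability):
    """Pick the cheapest model whose tier meets the request's needs."""
    capable = [m for m in options if m.capability >= required_capability]
    if not capable:
        raise LookupError("no model satisfies the required capability")
    return min(capable, key=lambda m: m.cost_per_mtok_usd)


OPTIONS = [
    ModelOption("small", capability=1, cost_per_mtok_usd=1.0),
    ModelOption("medium", capability=2, cost_per_mtok_usd=3.0),
    ModelOption("large", capability=3, cost_per_mtok_usd=15.0),
]

simple_choice = route(OPTIONS, required_capability=1)
harder_choice = route(OPTIONS, required_capability=2)
```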

STAGE 06

Provider-Level Prompt Caching

For long system prompts, activate the provider's own prompt caching (Anthropic's cache_control breakpoints, OpenAI's automatic prompt caching). This stacks on top of AgentMesh's own cache for maximum savings on multi-turn agents.
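On the Anthropic side, prompt caching is requested by marking a content block with `cache_control`. The sketch below only builds the request arguments; the helper name `with_prompt_caching` is hypothetical, and whether AgentMesh rewrites requests this way is an assumption.

```python
def with_prompt_caching(system_prompt, messages, model, max_tokens=1024):
    """Build Messages API arguments with the system prompt marked cacheable."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        # The system prompt is sent as a list of blocks so that the block
        # can carry a cache_control marker.
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": messages,
    }


request = with_prompt_caching(
    system_prompt="You are a code review assistant. <long shared instructions>",
    messages=[{"role": "user", "content": "Review this diff."}],
    model="claude-sonnet-4-6",
)
# With the anthropic SDK this would be sent as client.messages.create(**request).
```

Providers apply a minimum prompt length before caching takes effect, so this mainly pays off for long, stable system prompts.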

STAGE 07

LLM Call

Only 15% of requests reach here. The actual model call — with governance headers tracking which team, which agent, which model, and what the expected cost is before the response arrives.

STAGE 08

Response Caching

The model's response is normalized, embedded, and stored so the next semantically similar request hits Stage 4 instead of Stage 7. Each miss seeds future hits, so the hit rate climbs as the cache fills.

STAGE 09

Tamper-Evident Audit Log

Every request — hit or miss, allowed or blocked — produces a hash-chained, Ed25519-signed audit entry. Forensic answer to "who called what, when, and how much did it cost?" for every team, every agent, every day.
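Hash chaining is what makes the log tamper-evident: each entry commits to the hash of the one before it. The sketch below shows only the chaining with SHA-256 from the standard library. The Ed25519 signature is left out because it needs a third-party library, and the entry layout is an assumption rather than AgentMesh's actual format.

```python
import copy
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(prev_hash, record):
    body = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(chain, record):
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"record": record, "prev": prev,
                  "hash": entry_hash(prev, record)})


def verify(chain):
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev = GENESIS
    for entry in chain:
        if entry["prev"] != prev:
            return False
        if entry["hash"] != entry_hash(prev, entry["record"]):
            return False
        prev = entry["hash"]
    return True


chain = []
append(chain, {"id": "am_9a0f77", "result": "HIT", "cost": 0.0})
append(chain, {"id": "am_1d8e44", "result": "BLOCKED", "cost": 0.0})
append(chain, {"id": "am_7c3b02", "result": "MISS", "cost": 0.0024})

tampered = copy.deepcopy(chain)
tampered[2]["record"]["cost"] = 0.0  # someone hides a charge

chain_ok = verify(chain)
tampered_ok = verify(tampered)
```

Signing the latest hash with a private key would additionally prove who wrote the log, which chaining alone does not.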

Why the hit rate is 85%, not 10%

Traditional exact-match caches top out around 10%. AgentMesh's preprocessing step is what makes 85% possible.
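The preprocessing described in Stage 04 (stripping persona scaffolding, markdown formatting, and filler before embedding) can be sketched with a few regular expressions. The specific patterns and the filler word list are assumptions chosen for illustration, and spelling-variation handling is left out.

```python
import re

# Hypothetical patterns; a production normalizer would be far more thorough.
PERSONA_PREFIX = re.compile(r"^\s*you are [^.]*\.\s*", re.IGNORECASE)
MARKDOWN = re.compile(r"[*`#>]+")
FILLER = re.compile(r"\b(?:please|can you|could you|kindly)\b", re.IGNORECASE)
PUNCTUATION = re.compile(r"[^\w\s]")


def normalize(prompt):
    text = PERSONA_PREFIX.sub("", prompt)    # drop "You are a ..." scaffolding
    text = MARKDOWN.sub(" ", text)           # drop emphasis and heading marks
    text = FILLER.sub(" ", text)             # drop polite filler phrases
    text = PUNCTUATION.sub(" ", text.lower())
    return " ".join(text.split())            # collapse whitespace


a = normalize("You are a helpful analyst. **Summarize** Q2 revenue, please!")
b = normalize("Can you summarize Q2 revenue?")
```

Both prompts reduce to the same string, so they would share one cache entry even before embeddings are compared.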

Normalize before embedding

"Summarize Q2 revenue" and "Can you give me a summary of our Q2 revenue trends?" are semantically identical. AgentMesh strips persona prefixes, markdown formatting, filler phrases, and spelling variations before computing the embedding — so both prompts land in the same cache bucket.

Shared cache across all tools

Without a proxy, Claude Code, Copilot, and your custom CrewAI agent each maintain separate API connections. The same question asked in three tools generates three model calls. AgentMesh collapses them into one — the shared cache sees the full picture your individual tools can't.

Configurable similarity threshold

The default threshold (cosine similarity ≥ 0.85) is tuned for code and factual queries where near-identical prompts have near-identical ideal answers. You can tighten it for creative tasks or loosen it for summarization — per route, per team, per agent type.
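Per-route, per-team, and per-agent thresholds imply some lookup order when several overrides apply. The sketch below assumes the most specific scope wins (agent type, then team, then route). That precedence and the configuration layout are assumptions; only the 0.85 default comes from the text above.

```python
DEFAULT_THRESHOLD = 0.85


def resolve_threshold(overrides, route=None, team=None, agent_type=None):
    """Return the first matching override, falling back to the default."""
    for scope, name in (("agent_type", agent_type),
                        ("team", team),
                        ("route", route)):
        value = overrides.get(scope, {}).get(name)
        if value is not None:
            return value
    return DEFAULT_THRESHOLD


# Hypothetical configuration: looser for summarization, tighter for creative work.
OVERRIDES = {
    "route": {"/summarize": 0.80},
    "team": {"creative": 0.95},
}

summarize_threshold = resolve_threshold(OVERRIDES, route="/summarize")
creative_threshold = resolve_threshold(OVERRIDES, route="/summarize",
                                       team="creative")
fallback_threshold = resolve_threshold(OVERRIDES, route="/chat")
```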

Works with every tool your team uses

Any tool that calls Anthropic or OpenAI-compatible APIs routes through AgentMesh with two environment variables. No SDK changes, no code changes.

🤖 Claude Code: ANTHROPIC_BASE_URL
✨ GitHub Copilot: OPENAI_BASE_URL
💬 ChatGPT / GPT-4: OPENAI_BASE_URL
🔷 Google Gemini: OpenAI-compat endpoint
🕸️ CrewAI: Works via OPENAI_BASE_URL
🔗 LangGraph / LangChain: Works via base_url override
🐍 Custom Python agents: SDK base_url parameter
🌐 Web tools (Chrome ext.): Proxy for browser-based AI

What makes AgentMesh different

|  | AgentMesh | No proxy (direct API) | Vendor-specific caching |
| Semantic cache (cross-tool) | ✓ 85% hit rate across all tools | 0%: every call hits the model | Exact-match only, ~10% |
| Shared budget across tools | ✓ Per-team, per-agent quotas | No shared visibility | Per-tool silos |
| Unified audit trail | ✓ Ed25519 signed, hash-chained | Per-vendor, fragmented | Basic logs, not cross-tool |
| Circuit breaker (runaway loops) | ✓ Configurable threshold | No protection |  |
| Zero code changes to adopt | ✓ Two env vars | ✓ (nothing to change) | SDK refactoring required |
| Vendor neutrality | ✓ Anthropic · OpenAI · Gemini | Depends on your code | Tied to one provider |

85% of Our LLM Calls Never Reach the Model

The architecture behind AgentMesh's semantic cache — why normalize-then-embed produces 85% hit rates while traditional exact-match tops out at 10%, and what "govern at the wire, not in the code" means in practice.

The launch writing

85% of Our LLM Calls Never Reach the Model

I Put One Proxy in Front of Every AI Tool My Team Uses

75% Lower AI Costs — Zero Code Changes

Govern at the wire, not in the code.