Open source — Apache 2.0

The governance plane for every AI agent your team runs

One proxy in front of Claude Code, Copilot, ChatGPT, Gemini, and your custom agents. Shared semantic cache, per-team budgets, cryptographically signed audit trail. 85% of calls never reach the model. 75% lower cost. Two environment variables to adopt.

pip install agentmesh-proxy sentence-transformers
85% cache hit rate: requests that never reach the model
75% reduction in AI API costs, measured across real team usage
2 environment variables to adopt, no code changes required
9 pipeline stages on every request path, from circuit breaker to audit log
Apache 2.0 | Python 3.10+ | Anthropic · OpenAI · Gemini | CrewAI · LangGraph · LangChain | Ed25519 audit signatures | Semantic + exact-match cache

Zero code changes. Two env vars.

Point your existing tools at AgentMesh and watch 85% of calls return from cache — with full governance headers on every response.

setup.sh
cache_hit.py
audit_query.py
# Install and start the proxy
$ pip install agentmesh-proxy sentence-transformers
$ agentmesh start --port 8080
AgentMesh proxy listening on localhost:8080
Semantic cache loaded sentence-transformers/all-MiniLM-L6-v2
Exact-match cache ready
Circuit breaker armed · $500/day budget
Audit log Ed25519 · hash-chained

# That's it. Now point your tools at the proxy:
$ export ANTHROPIC_BASE_URL=http://localhost:8080
$ export OPENAI_BASE_URL=http://localhost:8080
# Claude Code, Copilot, ChatGPT, Gemini, CrewAI, and LangGraph are now
# all routed through the same cache and budget.
# 5 developers ask similar questions — only 1 hits the model
import anthropic
client = anthropic.Anthropic()
# Dev 1 — cache miss, calls the model
r1 = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,  # required by the Anthropic SDK
    messages=[{"role": "user", "content": "Summarize Q2 revenue trends"}],
)
# Dev 2 — semantically similar → cache hit
r2 = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,  # required by the Anthropic SDK
    messages=[{"role": "user", "content": "Give me a summary of Q2 revenue"}],
)

X-AgentMesh-Cache: MISS → model called · 847ms · $0.0024
X-AgentMesh-Cache: HIT → semantic · 12ms · $0.0000
X-AgentMesh-Similarity: 0.94 threshold 0.85
X-AgentMesh-Saved: $0.0024 this request
Session total 20 requests · 15 cache hits · $0.031 spent vs $0.121 list price
74.4% cost reduction · zero accuracy loss · identical responses
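To act on those headers in your own tooling, you can read them off any response. A minimal sketch, assuming the raw header values are bare tokens such as `HIT`, `0.94`, and `$0.0024` (the transcript above shows an annotated display, so the exact wire format is an assumption). The helper name `parse_governance_headers` is hypothetical and not part of AgentMesh.

```python
def parse_governance_headers(headers: dict[str, str]) -> dict[str, object]:
    """Extract cache status, similarity, and savings from response headers.

    Assumed value formats: "HIT" / "MISS", a decimal similarity, and a
    dollar-prefixed amount. Adjust if the real proxy formats them differently.
    """
    # HTTP header names are case-insensitive, so compare in lowercase.
    lowered = {name.lower(): value.strip() for name, value in headers.items()}
    similarity = lowered.get("x-agentmesh-similarity")
    saved = lowered.get("x-agentmesh-saved", "$0").lstrip("$")
    return {
        "cache_hit": lowered.get("x-agentmesh-cache", "").upper() == "HIT",
        "similarity": float(similarity) if similarity else None,
        "saved_usd": float(saved),
    }


info = parse_governance_headers({
    "X-AgentMesh-Cache": "HIT",
    "X-AgentMesh-Similarity": "0.94",
    "X-AgentMesh-Saved": "$0.0024",
})
print(info)
```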
# Tamper-evident audit — every request, every team, every dollar
from agentmesh import AuditLog
log = AuditLog.open("audit.jsonl")
entries = log.query(team="platform-eng", days=30)

Audit query team=platform-eng · last 30 days
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
am_4f2a91 2026-06-14 09:11 HIT semantic · $0.0000 · similarity=0.91
am_7c3b02 2026-06-14 09:09 MISS model call · $0.0024 · claude-sonnet-4-6
am_1d8e44 2026-06-13 16:32 BLOCKED quota exceeded · team budget $500/day
am_9a0f77 2026-06-13 14:15 HIT exact · $0.0000 · 3ms
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
chain valid · Ed25519 verified · 847 entries this month
total spend: $14.20 · list price: $56.80 · 75.0% saved
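The savings percentages in the transcripts above come from one ratio: one minus spend over list price. A quick sketch of that arithmetic using the session and monthly totals shown (the function name `cost_reduction` is ours, for illustration only):

```python
def cost_reduction(spent: float, list_price: float) -> float:
    """Return the percentage saved relative to list price."""
    if list_price <= 0:
        raise ValueError("list_price must be positive")
    return (1 - spent / list_price) * 100


session = cost_reduction(0.031, 0.121)   # session totals from the cache demo
monthly = cost_reduction(14.20, 56.80)   # monthly totals from the audit query
print(f"session: {session:.1f}%  monthly: {monthly:.1f}%")
```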

Nine stages. One request.

Every request enters the same nine-stage pipeline, in order. Most are answered from cache at stage 3 or 4 and skip straight to the audit log. That's the 85%.

STAGE 01
Circuit Breaker
Stops runaway agent loops before they explode your budget. A single undetected recursive loop can generate $47,000 in API charges overnight. The circuit breaker trips at your configured threshold.
STAGE 02
Quota Validation
Per-team, per-agent budget enforcement before any model call. Finance team has a $200/day ceiling; platform engineering has $500. Requests over budget are blocked with a 429 and logged.
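A budget gate like this can be very small. Here is a sketch assuming per-team daily limits held in memory; the `QuotaGate` class and its fields are hypothetical, and a real deployment would also reset counters daily and persist them.

```python
from dataclasses import dataclass, field


@dataclass
class QuotaGate:
    daily_limits: dict[str, float]
    spent_today: dict[str, float] = field(default_factory=dict)

    def check(self, team: str, estimated_cost: float) -> int:
        """Return an HTTP status: 200 if within budget, 429 if over."""
        limit = self.daily_limits.get(team, 0.0)  # unknown teams get no budget
        spent = self.spent_today.get(team, 0.0)
        if spent + estimated_cost > limit:
            return 429
        self.spent_today[team] = spent + estimated_cost
        return 200


gate = QuotaGate({"finance": 200.0, "platform-eng": 500.0})
print(gate.check("finance", 150.0), gate.check("finance", 60.0))
```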
STAGE 03
Exact-Match Cache
Identical prompts return instantly from an in-memory hash lookup — sub-millisecond, zero cost. Catches repeated automated queries, regression tests, and CI pipelines that call the same prompt hundreds of times.
~10% hit rate on its own
STAGE 04
Semantic Similarity Cache
Prompts are normalized first — persona scaffolding stripped, markdown formatting removed, spelling variations collapsed — then embedded and compared against the cache. Similarity above the threshold returns the cached answer instantly.
85% combined hit rate — this is where the savings come from
STAGE 05
Vendor Routing
For cache misses, route to the cheapest capable model that satisfies the request's complexity profile. Route haiku-tier requests to Haiku, not Sonnet. Use the right model for the task, not just the default.
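How the complexity profile is computed is not spelled out above, so the following is only a toy heuristic to make the idea concrete. The word-count cutoff, the `needs_tools` flag, and the tier labels are all hypothetical.

```python
def route_model(prompt: str, needs_tools: bool = False, long_prompt_words: int = 200) -> str:
    """Pick a model tier from a crude complexity estimate (illustrative only)."""
    word_count = len(prompt.split())
    if needs_tools or word_count > long_prompt_words:
        return "sonnet-tier"  # larger model for long or tool-using requests
    return "haiku-tier"       # cheapest capable model for short, simple asks


print(route_model("Rename this variable to snake_case"))
```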
STAGE 06
Provider-Level Prompt Caching
For long system prompts, activate the provider's own prompt caching (Anthropic's cache_control markers, OpenAI's automatic prompt caching). This stacks on top of AgentMesh's own cache for maximum savings on multi-turn agents.
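On the Anthropic side, this means marking the system prompt as a cacheable block in the request body. A sketch of such a payload, based on Anthropic's documented `cache_control` field; the helper name is hypothetical, and whether AgentMesh rewrites requests exactly this way is an assumption.

```python
def with_prompt_caching(system_prompt: str, messages: list[dict],
                        model: str, max_tokens: int = 1024) -> dict:
    """Build a Messages API body whose system prompt is marked cacheable."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                # Asks the provider to cache the prompt prefix up to this block.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": messages,
    }


body = with_prompt_caching(
    "You are a code reviewer. <long style guide here>",
    [{"role": "user", "content": "Review this diff"}],
    model="claude-sonnet-4-6",
)
print(body["system"][0]["cache_control"])
```

Providers only cache prompts above a minimum length, so short system prompts gain nothing from this marker.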
STAGE 07
LLM Call
Only 15% of requests reach here. The actual model call — with governance headers tracking which team, which agent, which model, and what the expected cost is before the response arrives.
STAGE 08
Response Caching
The model's response is normalized, embedded, and stored so the next semantically similar request hits Stage 4 instead of Stage 7. Each miss seeds future hits — the cache gets smarter over time.
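The write-back is what turns one miss into many future hits. A read-through sketch follows, using a plain dictionary keyed on a lowercased, whitespace-collapsed prompt as a stand-in for the embedding store (that substitution is ours, to keep the example self-contained).

```python
from typing import Callable


def read_through(prompt: str, cache: dict[str, str],
                 call_model: Callable[[str], str]) -> tuple[str, bool]:
    """Return (response, was_cache_hit); store the response on a miss."""
    key = " ".join(prompt.lower().split())
    if key in cache:
        return cache[key], True
    response = call_model(prompt)
    cache[key] = response  # this miss seeds the next hit
    return response, False


calls = []

def fake_model(prompt: str) -> str:
    calls.append(prompt)
    return f"answer to: {prompt}"

store: dict[str, str] = {}
first = read_through("Summarize Q2 revenue", store, fake_model)
second = read_through("summarize  q2 revenue", store, fake_model)
print(first[1], second[1], len(calls))
```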
STAGE 09
Tamper-Evident Audit Log
Every request — hit or miss, allowed or blocked — produces a hash-chained, Ed25519-signed audit entry. Forensic answer to "who called what, when, and how much did it cost?" for every team, every agent, every day.
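The "hash-chained" part means each entry commits to the one before it, so editing any past entry breaks every hash after it. A stdlib-only sketch of that chaining follows. The Ed25519 signature over each hash is deliberately left out because it needs a third-party library, and the field names here are hypothetical, not AgentMesh's log schema.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, record: dict) -> str:
    body = json.dumps({"prev": prev_hash, "record": record}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_entry(chain: list[dict], record: dict) -> dict:
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    entry = {"prev": prev_hash, "record": record, "hash": _digest(prev_hash, record)}
    chain.append(entry)
    return entry


def verify_chain(chain: list[dict]) -> bool:
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev"] != prev_hash:
            return False  # link to the previous entry is broken
        if entry["hash"] != _digest(entry["prev"], entry["record"]):
            return False  # entry contents were altered after logging
        prev_hash = entry["hash"]
    return True


log: list[dict] = []
append_entry(log, {"id": "am_7c3b02", "result": "MISS", "cost": 0.0024})
append_entry(log, {"id": "am_4f2a91", "result": "HIT", "cost": 0.0})
print(verify_chain(log))
```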

Why the hit rate is 85%, not 10%

Traditional exact-match caches top out around 10%. AgentMesh's preprocessing step is what makes 85% possible.

01
Normalize before embedding
"Summarize Q2 revenue" and "Can you give me a summary of our Q2 revenue trends?" are semantically identical. AgentMesh strips persona prefixes, markdown formatting, filler phrases, and spelling variations before computing the embedding — so both prompts land in the same cache bucket.
02
Shared cache across all tools
Without a proxy, Claude Code, Copilot, and your custom CrewAI agent each maintain separate API connections. The same question asked in three tools generates three model calls. AgentMesh collapses them into one — the shared cache sees the full picture your individual tools can't.
03
Configurable similarity threshold
The default threshold (cosine similarity ≥ 0.85) is tuned for code and factual queries where near-identical prompts have near-identical ideal answers. You can tighten it for creative tasks or loosen it for summarization — per route, per team, per agent type.
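Two of the ideas above, sketched in a few lines: a normalization pass applied before embedding, and a threshold lookup with per-scope overrides. The regex rules, the override table, and the precedence order (route, then team, then agent type) are assumptions for illustration. Spelling-variation handling is omitted.

```python
import re

PERSONA_PREFIX = re.compile(r"^\s*(you are|act as)\b[^.\n]*\.\s*", re.IGNORECASE)
MARKDOWN = re.compile(r"[*`#>]+")
FILLER = re.compile(r"\b(please|can you|could you|kindly)\b", re.IGNORECASE)


def normalize_prompt(text: str) -> str:
    text = PERSONA_PREFIX.sub("", text)    # drop "You are a helpful analyst."
    text = MARKDOWN.sub("", text)          # drop emphasis, headings, backticks
    text = FILLER.sub("", text)            # drop politeness filler
    text = re.sub(r"[^\w\s]", "", text)    # drop remaining punctuation
    return " ".join(text.lower().split())  # lowercase, collapse whitespace


DEFAULT_THRESHOLD = 0.85
OVERRIDES = {  # hypothetical example values
    ("route", "/creative"): 0.95,
    ("team", "finance"): 0.90,
    ("agent_type", "summarizer"): 0.80,
}


def resolve_threshold(route=None, team=None, agent_type=None) -> float:
    # First matching scope wins; fall back to the default.
    for scope, value in (("route", route), ("team", team), ("agent_type", agent_type)):
        if value is not None and (scope, value) in OVERRIDES:
            return OVERRIDES[(scope, value)]
    return DEFAULT_THRESHOLD


print(normalize_prompt("You are a helpful analyst. **Please** summarize Q2 revenue!"))
print(resolve_threshold(agent_type="summarizer"))
```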

Works with every tool your team uses

Any tool that calls Anthropic or OpenAI-compatible APIs routes through AgentMesh with two environment variables. No SDK changes, no code changes.

Claude Code: ANTHROPIC_BASE_URL
GitHub Copilot: OPENAI_BASE_URL
ChatGPT / GPT-4: OPENAI_BASE_URL
Google Gemini: OpenAI-compatible endpoint
CrewAI: OPENAI_BASE_URL
LangGraph / LangChain: base_url override
Custom Python agents: SDK base_url parameter
Web tools (Chrome extension): proxy for browser-based AI
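For a custom agent that talks HTTP directly, honoring the environment variable is a one-line change. The sketch below builds (but does not send) an Anthropic-style request using only the standard library. It assumes the proxy mirrors Anthropic's `/v1/messages` path, which the listing above implies but does not state, and the function name is hypothetical.

```python
import json
import os
import urllib.request


def build_messages_request(prompt: str, model: str, api_key: str) -> urllib.request.Request:
    # Fall back to the provider's public endpoint when no proxy is configured.
    base_url = os.environ.get("ANTHROPIC_BASE_URL", "https://api.anthropic.com").rstrip("/")
    body = json.dumps({
        "model": model,
        "max_tokens": 256,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/messages",
        data=body,
        method="POST",
        headers={
            "content-type": "application/json",
            "x-api-key": api_key,
            "anthropic-version": "2023-06-01",
        },
    )


os.environ["ANTHROPIC_BASE_URL"] = "http://localhost:8080"
request = build_messages_request("Summarize Q2 revenue trends", "claude-sonnet-4-6", "sk-example")
print(request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it through the proxy. The official SDKs read the same environment variable, so agents built on them need no change at all.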

What makes AgentMesh different

|  | AgentMesh | No proxy (direct API) | Vendor-specific caching |
|---|---|---|---|
| Semantic cache (cross-tool) | 85% hit rate across all tools | 0% (every call hits the model) | Exact-match only, ~10% |
| Shared budget across tools | Per-team, per-agent quotas | No shared visibility | Per-tool silos |
| Unified audit trail | Ed25519 signed, hash-chained | Per-vendor, fragmented | Basic logs, not cross-tool |
| Circuit breaker (runaway loops) | Configurable threshold | No protection | No protection |
| Zero code changes to adopt | Two env vars | (nothing to change) | SDK refactoring required |
| Vendor neutrality | Anthropic · OpenAI · Gemini | Depends on your code | Tied to one provider |

The launch writing

85% of Our LLM Calls Never Reach the Model

The architecture behind AgentMesh's semantic cache — why normalize-then-embed produces 85% hit rates while traditional exact-match tops out at 10%, and what "govern at the wire, not in the code" means in practice.

I Put One Proxy in Front of Every AI Tool My Team Uses

A walkthrough of the real-world benchmark: 20 requests across 5 topics, 15 semantic cache hits, 5 misses, with exact numbers on latency, cost, and how the similarity scoring works in practice.

75% Lower AI Costs — Zero Code Changes

The business case: why your team's AI tools each call the LLM API independently — no shared cache, no shared budget, no audit trail — and what a single governance proxy changes about that equation.

Govern at the wire, not in the code.

Open source. Apache 2.0. Two environment variables. One proxy for everything.


Custom deployments, enterprise quotas, and advisory: anil@ambharii.com