Research & Original Work

Ideas we've shipped, not slides we've shown.

Original frameworks, architectures, and field notes from 28 years of shipping AI to production. Every artifact below is open-source, code-first, and battle-tested before it was named.

Frameworks & architectures

Production AI evaluation and security primitives we've designed, implemented, and released as open source.

Framework · 2025

G-ARVIS — Six dimensions of LLM quality in production

An evaluation framework that scores every LLM response across Groundedness, Accuracy, Reliability, Variance, Inference cost, and Safety. It produces a composite score, ships a heuristic implementation that runs in under 5 ms, and includes weight profiles for enterprise, healthcare, finance, consumer, and agentic use cases. Released as pip install argus-ai.

G · Groundedness: Anchored in provided context vs. fabricating
A · Accuracy: Match to ground truth, internal consistency
R · Reliability: Format consistency, latency SLA compliance
V · Variance: Output determinism, drift detection
I · Inference Cost: Token efficiency, cost-per-correct-answer
S · Safety: PII leakage, toxicity, injection detection
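
The composite score can be sketched as a weighted average over the six dimensions. The snippet below is a minimal illustration and not the argus-ai API: the function name, the profile weights, and the convention that every dimension is scored in [0, 1] with higher being better are all assumptions made for the example.

```python
# Illustrative sketch only; names and weights are hypothetical, not argus-ai's.
DIMENSIONS = (
    "groundedness",
    "accuracy",
    "reliability",
    "variance",
    "inference_cost",
    "safety",
)

# Hypothetical weight profiles. The published profiles may use other values.
WEIGHT_PROFILES = {
    "enterprise": {d: 1.0 for d in DIMENSIONS},
    "healthcare": {
        "groundedness": 2.0,
        "accuracy": 2.0,
        "reliability": 1.0,
        "variance": 1.0,
        "inference_cost": 0.5,
        "safety": 2.5,
    },
}


def composite_score(scores, profile="enterprise"):
    """Return the weighted average of per-dimension scores in [0, 1]."""
    weights = WEIGHT_PROFILES[profile]
    for dim in DIMENSIONS:
        if dim not in scores:
            raise ValueError(f"missing dimension: {dim}")
        if not 0.0 <= scores[dim] <= 1.0:
            raise ValueError(f"{dim} must be in [0, 1], got {scores[dim]}")
    total_weight = sum(weights[d] for d in DIMENSIONS)
    return sum(weights[d] * scores[d] for d in DIMENSIONS) / total_weight


# A response that is perfect everywhere except safety.
unsafe = {d: 1.0 for d in DIMENSIONS}
unsafe["safety"] = 0.0
print(composite_score(unsafe, "enterprise"))
print(composite_score(unsafe, "healthcare"))
```

Under these assumed weights, the same response scores lower with the healthcare profile than with the enterprise profile, because the safety failure carries more weight there.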

Metrics · 2025

Agentic evaluation metrics: ASF, ERR, CPCS

Three metrics for monitoring autonomous workflows, covering behavior that BLEU, ROUGE, and perplexity were never designed to measure: Agent Stability Factor, Error Recovery Rate, and Cost Per Completed Step, each with a production threshold.

ASF · Agent Stability Factor: Completion rate × failure resilience × retry consistency. Production threshold ≥ 0.85.
ERR · Error Recovery Rate: Recovered steps ÷ failed steps. Catches cascade failures vs. self-correction. Threshold ≥ 0.70.
CPCS · Cost Per Completed Step: Total spend ÷ successfully completed steps. Production threshold ≤ $0.10/step.
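
The three formulas above translate directly into code. The sketch below is one possible reading and not the argus-ai implementation: the function names are hypothetical, the three ASF factors are taken as precomputed values in [0, 1] because the text does not define how each is measured, and the handling of zero failed or zero completed steps is an assumption.

```python
import math

# Production thresholds as stated in the text.
ASF_MIN = 0.85
ERR_MIN = 0.70
CPCS_MAX_USD = 0.10


def agent_stability_factor(completion_rate, failure_resilience, retry_consistency):
    """Product of the three factors; each is assumed to lie in [0, 1]."""
    factors = (completion_rate, failure_resilience, retry_consistency)
    if any(not 0.0 <= f <= 1.0 for f in factors):
        raise ValueError("all ASF factors must be in [0, 1]")
    return completion_rate * failure_resilience * retry_consistency


def error_recovery_rate(recovered_steps, failed_steps):
    """Recovered steps divided by failed steps."""
    if recovered_steps < 0 or failed_steps < 0 or recovered_steps > failed_steps:
        raise ValueError("need 0 <= recovered_steps <= failed_steps")
    if failed_steps == 0:
        return 1.0  # Assumption: nothing failed, so nothing was left unrecovered.
    return recovered_steps / failed_steps


def cost_per_completed_step(total_spend_usd, completed_steps):
    """Total spend divided by successfully completed steps."""
    if completed_steps == 0:
        return math.inf  # Assumption: spend with no completed steps is unbounded cost.
    return total_spend_usd / completed_steps


def meets_thresholds(asf, err, cpcs):
    """Compare each metric against its production threshold."""
    return {
        "ASF": asf >= ASF_MIN,
        "ERR": err >= ERR_MIN,
        "CPCS": cpcs <= CPCS_MAX_USD,
    }


asf = agent_stability_factor(0.95, 0.95, 0.95)
err = error_recovery_rate(recovered_steps=7, failed_steps=10)
cpcs = cost_per_completed_step(total_spend_usd=4.50, completed_steps=50)
print(meets_thresholds(asf, err, cpcs))
```

Because ASF is a product, a weak value in any single factor pulls the whole score down: three factors of 0.95 each only just clear the 0.85 threshold.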

Architecture · 2026

Bulwark — Five-layer defense for production AI agents

An end-to-end agent security architecture: input sanitizer (zero-permission isolate) → injection detector (BERT + pattern catalog) → compartmentalized RBAC (zero trust for agents) → human confirmation gate (async, multi-channel) → encrypted audit trail (Fernet, 7-year retention). MCP-native, vendor-neutral. Released as pip install bulwark-agent-security.
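
The layer ordering described above can be sketched as a single gatekeeping function. This is a simplified, standard-library-only illustration and not the bulwark-agent-security API: every name, the pattern catalog, and the role table are hypothetical. The BERT classifier is omitted, the confirmation gate is a synchronous callback rather than async and multi-channel, and the audit trail is a plain in-memory list rather than a Fernet-encrypted store.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Hypothetical pattern catalog (layer 2). The BERT detector is not modeled here.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior) instructions", re.I),
    re.compile(r"reveal (your )?system prompt", re.I),
]

# Hypothetical role-to-tool permissions (layer 3).
ROLE_PERMISSIONS = {"reader": {"search"}, "operator": {"search", "send_email"}}

# Hypothetical set of tools that need human sign-off (layer 4).
CONFIRMATION_REQUIRED = {"send_email"}


@dataclass
class ToolRequest:
    role: str
    tool: str
    content: str


@dataclass
class Decision:
    allowed: bool
    reason: str


def sanitize(text: str) -> str:
    """Layer 1: drop non-printable characters, keeping newlines and tabs."""
    return "".join(ch for ch in text if ch.isprintable() or ch in "\n\t")


def looks_like_injection(text: str) -> bool:
    """Layer 2: match the sanitized text against the pattern catalog."""
    return any(p.search(text) for p in INJECTION_PATTERNS)


def evaluate(request: ToolRequest,
             confirm: Callable[[ToolRequest], bool],
             audit_log: list) -> Decision:
    """Run a request through the layers in order and record the outcome."""
    clean = sanitize(request.content)
    if looks_like_injection(clean):
        decision = Decision(False, "injection pattern matched")
    elif request.tool not in ROLE_PERMISSIONS.get(request.role, set()):
        decision = Decision(False, "role lacks permission for tool")
    elif request.tool in CONFIRMATION_REQUIRED and not confirm(request):
        decision = Decision(False, "human confirmation denied")
    else:
        decision = Decision(True, "allowed")
    # Layer 5: every decision is recorded, whether allowed or denied.
    audit_log.append({
        "role": request.role,
        "tool": request.tool,
        "allowed": decision.allowed,
        "reason": decision.reason,
    })
    return decision


log = []
request = ToolRequest("operator", "send_email", "Send the weekly report")
print(evaluate(request, confirm=lambda r: True, audit_log=log))
```

Placing the cheap checks first means a request that fails sanitization or detection never reaches the permission check or a human reviewer, and the audit entry is written on every path.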

Field notes & essays

Long-form writing on production AI engineering — what the failure modes actually look like, and how the architectural choices land.

2026 · The launch essay behind Bulwark
The Web Is Now Weaponized Against Your AI Agents
Why prompt injection is the SQL injection of the agent era — Google's April 2026 catalog of agent threat vectors, what it actually means for production teams, and the architectural principles behind a five-layer defense.
Ongoing · Field Notes newsletter
Field Notes: Production AI
A running series on the gap between research-prototype AI and production AI — the failure modes that don't show up in benchmarks, the metrics that matter at 3 AM, and the architectural choices that determine whether a system holds up under real load.

Get new essays as they drop.

Subscribe on Medium for new field notes on production agentic AI, evaluation, and security.

Open-source contributions

Every framework, every architecture, every paper — code-first, Apache 2.0, on GitHub.

OSS · Apache 2.0

Bulwark

Five-layer agent security framework. Sanitizer + detector + RBAC + human gate + audit. pip install bulwark-agent-security

OSS · Apache 2.0

argus-ai

Reference implementation of G-ARVIS. Six-dimension LLM scoring + agentic metrics. pip install argus-ai

OSS

PulseFlow

MLOps platform for the production model lifecycle: experiment tracking, model registry, automated evaluation pipelines, drift detection, governance dashboards.

OSS

TorchForge

PyTorch production framework — standardized training loops, model registry, one-command deployment. Built because most PyTorch models never reach production.

OSS

CLARIFY

Explainable AI framework for regulated industries — human-readable explanations with audit-grade traceability for healthcare, finance, and compliance environments.

OSS

ScalableGen

Vitess-based horizontal MySQL sharding for healthcare and genomics. Zero-downtime migrations from monolithic to sharded production.