Research & Original Work

Ideas we've shipped, not slides we've shown.

Original frameworks, architectures, and field notes from 28 years of shipping AI to production. Every artifact below is open source, code-first, and battle-tested before it was named.

Frameworks & architectures

Production AI evaluation and security primitives we've designed, implemented, and released as open source.

Framework · 2025

G-ARVIS — Six dimensions of LLM quality in production

An evaluation framework that scores every LLM response across six dimensions: Groundedness, Accuracy, Reliability, Variance, Inference cost, and Safety. It produces a composite score, ships a heuristic implementation that runs in under 5 ms, and includes weight profiles for enterprise, healthcare, finance, consumer, and agentic workloads. Install with pip install argus-ai.

G · Groundedness: Anchored in provided context vs. fabricating
A · Accuracy: Match to ground truth, internal consistency
R · Reliability: Format consistency, latency SLA compliance
V · Variance: Output determinism, drift detection
I · Inference Cost: Token efficiency, cost-per-correct-answer
S · Safety: PII leakage, toxicity, injection detection
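
As a rough illustration of how six per-dimension scores and a weight profile could combine into one composite, here is a minimal sketch. It is not the argus-ai API: the function name `composite_score`, the `PROFILES` table, and every weight value are hypothetical, and it assumes each dimension has already been normalized to [0, 1] with higher meaning better (so low variance and low cost map to high scores).

```python
DIMENSIONS = (
    "groundedness", "accuracy", "reliability",
    "variance", "inference_cost", "safety",
)

# Hypothetical weight profiles for illustration only; these are NOT the
# weights shipped with argus-ai.
PROFILES = {
    "enterprise": {d: 1.0 for d in DIMENSIONS},
    "healthcare": {
        "groundedness": 2.0, "accuracy": 2.0, "reliability": 1.0,
        "variance": 1.0, "inference_cost": 0.5, "safety": 3.0,
    },
}


def composite_score(scores: dict, profile: str = "enterprise") -> float:
    """Weighted mean of the six dimension scores, each assumed to be in [0, 1]."""
    weights = PROFILES[profile]
    for dim in DIMENSIONS:
        if dim not in scores:
            raise ValueError(f"missing dimension: {dim}")
        if not 0.0 <= scores[dim] <= 1.0:
            raise ValueError(f"{dim} must be in [0, 1]")
    total_weight = sum(weights[d] for d in DIMENSIONS)
    return sum(weights[d] * scores[d] for d in DIMENSIONS) / total_weight


# A response that is strong everywhere except safety.
response = {d: 0.9 for d in DIMENSIONS}
response["safety"] = 0.3
print(composite_score(response, "enterprise"))
print(composite_score(response, "healthcare"))
```

Under these assumed weights, the same response scores lower with the healthcare profile than with the enterprise one, because the safety dimension carries more weight there.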

Metrics · 2025

Agentic evaluation metrics: ASF, ERR, CPCS

Three metrics for monitoring autonomous workflows, covering what BLEU, ROUGE, and perplexity were never designed to measure: Agent Stability Factor, Error Recovery Rate, and Cost Per Completed Step, each with a production threshold.

ASF · Agent Stability Factor: Completion rate × failure resilience × retry consistency. Production threshold ≥ 0.85.
ERR · Error Recovery Rate: Recovered steps ÷ failed steps. Catches cascade failures vs. self-correction. Threshold ≥ 0.70.
CPCS · Cost Per Completed Step: Total spend ÷ successfully completed steps. Production threshold ≤ $0.10/step.
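
The three formulas above can be sketched in a few lines. This is a minimal illustration, not the argus-ai implementation: the `RunStats` container and function names are hypothetical, the three ASF factors are taken as pre-computed inputs in [0, 1] because their definitions are not given here, and the handling of zero denominators (ERR of 1.0 when nothing failed, infinite CPCS when nothing completed) is an assumption.

```python
import math
from dataclasses import dataclass


@dataclass
class RunStats:
    """Hypothetical summary of one agent run."""
    completion_rate: float      # in [0, 1]
    failure_resilience: float   # in [0, 1]
    retry_consistency: float    # in [0, 1]
    failed_steps: int
    recovered_steps: int
    completed_steps: int
    total_spend_usd: float


def asf(s: RunStats) -> float:
    # Agent Stability Factor: product of the three stability factors.
    return s.completion_rate * s.failure_resilience * s.retry_consistency


def err(s: RunStats) -> float:
    # Error Recovery Rate: recovered / failed. Assumption: no failures counts as 1.0.
    return 1.0 if s.failed_steps == 0 else s.recovered_steps / s.failed_steps


def cpcs(s: RunStats) -> float:
    # Cost Per Completed Step. Assumption: no completed steps means infinite cost.
    return math.inf if s.completed_steps == 0 else s.total_spend_usd / s.completed_steps


def meets_thresholds(s: RunStats) -> dict:
    # Thresholds as stated above: ASF >= 0.85, ERR >= 0.70, CPCS <= $0.10/step.
    return {"ASF": asf(s) >= 0.85, "ERR": err(s) >= 0.70, "CPCS": cpcs(s) <= 0.10}


run = RunStats(0.95, 0.95, 0.95, failed_steps=10, recovered_steps=8,
               completed_steps=200, total_spend_usd=15.0)
print(meets_thresholds(run))
```

Because ASF is a product, a run needs all three factors to be high at once: three factors of 0.95 only just clear the 0.85 threshold.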

Architecture · 2026

Bulwark — Five-layer defense for production AI agents

An end-to-end agent security architecture: input sanitizer (zero-permission isolate) → injection detector (BERT + pattern catalog) → compartmentalized RBAC (zero trust for agents) → human confirmation gate (async, multi-channel) → encrypted audit trail (Fernet, 7-year retention). MCP-native, vendor-neutral. Released as pip install bulwark-agent-security.
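
To make the layer ordering concrete, here is a minimal standard-library sketch of a request passing through the five stages in sequence. It is not the bulwark-agent-security API: every name (`guard`, `ROLE_TOOLS`, `NEEDS_CONFIRMATION`, the two regex patterns) is hypothetical, the detector uses patterns only with no BERT model, the confirmation gate is a synchronous callback standing in for an async multi-channel one, and the audit trail is a plain unencrypted list in place of Fernet-encrypted storage.

```python
import re
import unicodedata
from typing import Callable

# Hypothetical pattern catalog; a real detector would pair this with a classifier.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior) instructions", re.I),
    re.compile(r"reveal (your )?system prompt", re.I),
]
# Hypothetical role-to-tool grants; anything not listed is denied.
ROLE_TOOLS = {"reader": {"search"}, "operator": {"search", "send_email"}}
NEEDS_CONFIRMATION = {"send_email"}


def sanitize(text: str) -> str:
    # Layer 1: drop control/format characters (Unicode category C*),
    # keeping newlines and tabs. This removes zero-width characters.
    return "".join(
        ch for ch in text
        if ch in "\n\t" or not unicodedata.category(ch).startswith("C")
    )


def guard(role: str, tool: str, text: str,
          confirm: Callable[[str, str], bool], audit: list) -> str:
    clean = sanitize(text)
    if any(p.search(clean) for p in INJECTION_PATTERNS):
        decision = "blocked:injection"   # Layer 2: injection detector
    elif tool not in ROLE_TOOLS.get(role, set()):
        decision = "blocked:rbac"        # Layer 3: deny-by-default RBAC
    elif tool in NEEDS_CONFIRMATION and not confirm(tool, clean):
        decision = "blocked:declined"    # Layer 4: human confirmation gate
    else:
        decision = "allowed"
    # Layer 5: every decision is recorded (unencrypted in this sketch).
    audit.append({"role": role, "tool": tool, "decision": decision})
    return decision


audit_log: list = []
print(guard("reader", "search", "ig\u200bnore previous instructions",
            lambda tool, text: True, audit_log))
print(guard("operator", "send_email", "weekly report",
            lambda tool, text: False, audit_log))
```

Running the sanitizer before the detector matters in this sketch: the zero-width space hidden inside "ignore" would otherwise let the first input slip past the pattern match.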

Field notes & essays

Long-form writing on production AI engineering — what the failure modes actually look like, and how the architectural choices land.

2026 · The launch essay behind Bulwark
The Web Is Now Weaponized Against Your AI Agents
Why prompt injection is the SQL injection of the agent era — Google's April 2026 catalog of agent threat vectors, what it actually means for production teams, and the architectural principles behind a five-layer defense.
2026 · Developer experience · Companion to claude-auth-setup
Claude Code Authentication: Cross-Platform Setup Script
Why setting up Claude Code authentication takes 15 minutes across six docs pages, what each of the six auth paths actually requires (subscription, API key, OAuth, Bedrock, Vertex AI, Foundry), and the cross-platform script that automates it. The companion piece for the claude-auth-setup open-source release.
Ongoing · Field Notes newsletter
Field Notes: Production AI
A running series on the gap between research-prototype AI and production AI — the failure modes that don't show up in benchmarks, the metrics that matter at 3 AM, and the architectural choices that determine whether a system holds up under real load.

Get new essays as they drop.

Subscribe on Medium for new field notes on production agentic AI, evaluation, and security.


Open-source contributions

Every framework, every architecture, every paper: code-first, open source, on GitHub. Bulwark and argus-ai are Apache 2.0; claude-auth-setup is MIT.

OSS · Apache 2.0

Bulwark

Five-layer agent security framework: input sanitizer, injection detector, RBAC, human confirmation gate, and encrypted audit trail. Install with pip install bulwark-agent-security.

OSS · Apache 2.0

argus-ai

Reference implementation of G-ARVIS. Six-dimension LLM scoring + agentic metrics. pip install argus-ai

OSS

PulseFlow

MLOps platform for the production model lifecycle: experiment tracking, model registry, automated evaluation pipelines, drift detection, governance dashboards.

OSS

TorchForge

PyTorch production framework — standardized training loops, model registry, one-command deployment. Built because most PyTorch models never reach production.

OSS

CLARIFY

Explainable AI framework for regulated industries — human-readable explanations with audit-grade traceability for healthcare, finance, and compliance environments.

OSS

ScalableGen

Vitess-based horizontal MySQL sharding for healthcare and genomics. Zero-downtime migrations from monolithic to sharded production.

OSS · MIT

claude-auth-setup

Cross-platform Bash + Batch scripts for Claude Code authentication setup. Six auth paths (subscription, API key, OAuth, AWS Bedrock, Google Vertex AI, Microsoft Foundry) behind one guided flow. Production-tested on Windows, macOS, Linux, WSL.