Research & Original Work

Ideas we've shipped, not slides we've shown.

Original frameworks, architectures, and field notes from 28 years of shipping AI to production. Every artifact below is open source, code-first, and battle-tested before it was named.

Frameworks & architectures

Production AI evaluation and security primitives we've designed, implemented, and released as open source.

Framework · 2025

G-ARVIS — Six dimensions of LLM quality in production

An evaluation framework that scores every LLM response across six dimensions: Groundedness, Accuracy, Reliability, Variance, Inference cost, and Safety. It produces a composite score, ships a heuristic implementation that runs in under 5 ms, and includes weight profiles for enterprise, healthcare, finance, consumer, and agentic workloads. Released as pip install argus-ai.

Groundedness

Anchored in provided context vs. fabricating

Accuracy

Match to ground truth, internal consistency

Reliability

Format consistency, latency SLA compliance

Variance

Output determinism, drift detection

Inference Cost

Token efficiency, cost-per-correct-answer

Safety

PII leakage, toxicity, injection detection
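
One way to picture the composite score is as a weighted average of the six dimension scores. The sketch below is illustrative only and does not use the argus-ai API: the function name, the profile names, and every weight value are assumptions, and it further assumes each dimension has already been normalized to a 0-1 score where higher is better.

```python
from typing import Dict

DIMENSIONS = (
    "groundedness", "accuracy", "reliability",
    "variance", "inference_cost", "safety",
)

# Hypothetical weight profiles for illustration; NOT the values shipped in argus-ai.
WEIGHT_PROFILES: Dict[str, Dict[str, float]] = {
    "balanced": {d: 1.0 for d in DIMENSIONS},
    "healthcare_example": {
        "groundedness": 2.0, "accuracy": 2.0, "reliability": 1.0,
        "variance": 1.0, "inference_cost": 0.5, "safety": 2.5,
    },
}


def composite_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of six per-dimension scores, each assumed to lie in [0, 1]."""
    missing = [d for d in DIMENSIONS if d not in scores or d not in weights]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    for d in DIMENSIONS:
        if not 0.0 <= scores[d] <= 1.0:
            raise ValueError(f"{d} score must be in [0, 1], got {scores[d]}")
    total_weight = sum(weights[d] for d in DIMENSIONS)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(scores[d] * weights[d] for d in DIMENSIONS)
    return weighted_sum / total_weight


# A response that is strong everywhere except safety.
unsafe_response = {d: 1.0 for d in DIMENSIONS}
unsafe_response["safety"] = 0.0

balanced = composite_score(unsafe_response, WEIGHT_PROFILES["balanced"])
healthcare = composite_score(unsafe_response, WEIGHT_PROFILES["healthcare_example"])
```

Under these assumed weights, the same unsafe response scores lower with the healthcare-style profile than with the balanced one, which is the point of having per-domain profiles.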

Metrics · 2025

Agentic evaluation metrics: ASF, ERR, CPCS

Three metrics for monitoring autonomous workflows, covering behavior that BLEU, ROUGE, and perplexity were never designed to measure: Agent Stability Factor, Error Recovery Rate, and Cost Per Completed Step, each with a production threshold.

ASF

Agent Stability Factor

Completion rate × failure resilience × retry consistency. Production threshold ≥ 0.85.

ERR

Error Recovery Rate

Recovered steps ÷ failed steps. Distinguishes cascading failures from self-correction. Threshold ≥ 0.70.

CPCS

Cost Per Completed Step

Total spend ÷ successfully completed steps. Production threshold ≤ $0.10/step.
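
The three formulas above are simple enough to compute directly from run counters. The following is a minimal sketch, not the argus-ai implementation: the RunStats fields and function names are hypothetical, the three ASF factors are assumed to arrive as 0-1 ratios, and the handling of zero failed or zero completed steps is an assumption the source does not specify.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class RunStats:
    """Hypothetical counters collected from one batch of agent runs."""
    completion_rate: float     # 0-1
    failure_resilience: float  # 0-1
    retry_consistency: float   # 0-1
    failed_steps: int
    recovered_steps: int
    completed_steps: int
    total_spend_usd: float


def asf(s: RunStats) -> float:
    """Agent Stability Factor: product of the three stability ratios."""
    return s.completion_rate * s.failure_resilience * s.retry_consistency


def err(s: RunStats) -> float:
    """Error Recovery Rate: recovered steps divided by failed steps."""
    if s.failed_steps == 0:
        return 1.0  # assumption: nothing failed, so nothing was left unrecovered
    return s.recovered_steps / s.failed_steps


def cpcs(s: RunStats) -> float:
    """Cost Per Completed Step, in USD."""
    if s.completed_steps == 0:
        return float("inf")  # assumption: spend with no completed steps is unbounded cost
    return s.total_spend_usd / s.completed_steps


def meets_thresholds(s: RunStats) -> Dict[str, bool]:
    """Compare each metric with the production thresholds quoted above."""
    return {
        "ASF": asf(s) >= 0.85,
        "ERR": err(s) >= 0.70,
        "CPCS": cpcs(s) <= 0.10,
    }


stats = RunStats(
    completion_rate=0.95, failure_resilience=0.95, retry_consistency=0.96,
    failed_steps=10, recovered_steps=8,
    completed_steps=200, total_spend_usd=15.0,
)
report = meets_thresholds(stats)
```

Because ASF is a product, it is stricter than it looks: three factors of 0.90 each give 0.729, well under the 0.85 threshold.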

Architecture · 2026

Bulwark — Five-layer defense for production AI agents

An end-to-end agent security architecture: input sanitizer (zero-permission isolate) → injection detector (BERT + pattern catalog) → compartmentalized RBAC (zero trust for agents) → human confirmation gate (async, multi-channel) → encrypted audit trail (Fernet, 7-year retention). MCP-native, vendor-neutral. Released as pip install bulwark-agent-security.
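
To make the ordering of the five layers concrete, here is a toy sketch of a request passing through them in sequence. It is not the bulwark-agent-security API, and every name, pattern, and role in it is hypothetical. Several layers are deliberately simplified stand-ins: a regex catalog alone replaces the BERT classifier, a synchronous callback replaces the async multi-channel confirmation gate, and a plaintext hash-chained log replaces the Fernet-encrypted audit trail.

```python
import hashlib
import json
import re
from typing import Callable, Dict, List, Set

# Hypothetical pattern catalog; a production detector would pair this with a classifier.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.I),
    re.compile(r"reveal (your )?system prompt", re.I),
]

# Hypothetical role-to-tool permissions; anything not listed is denied.
ROLE_PERMISSIONS: Dict[str, Set[str]] = {
    "reader": {"search"},
    "operator": {"search", "send_email"},
}
HIGH_RISK_TOOLS = {"send_email"}


def sanitize(text: str) -> str:
    """Layer 1: drop non-printable characters and collapse whitespace."""
    kept = "".join(ch for ch in text if ch.isprintable() or ch.isspace())
    return " ".join(kept.split())


def looks_like_injection(text: str) -> bool:
    """Layer 2: pattern-catalog check only (classifier omitted in this sketch)."""
    return any(p.search(text) for p in INJECTION_PATTERNS)


def is_permitted(role: str, tool: str) -> bool:
    """Layer 3: deny-by-default permission lookup."""
    return tool in ROLE_PERMISSIONS.get(role, set())


def audit(log: List[dict], event: dict) -> None:
    """Layer 5: append a hash-chained record (plaintext here; encryption omitted)."""
    prev_hash = log[-1]["hash"] if log else ""
    payload = json.dumps(event, sort_keys=True)
    digest = hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()
    log.append({"event": event, "hash": digest})


def handle_request(role: str, tool: str, content: str,
                   confirm: Callable[[str, str], bool], log: List[dict]) -> str:
    """Run one tool request through the layers in order and record the outcome."""
    clean = sanitize(content)
    if looks_like_injection(clean):
        decision = "blocked:injection"
    elif not is_permitted(role, tool):
        decision = "blocked:rbac"
    elif tool in HIGH_RISK_TOOLS and not confirm(tool, clean):  # Layer 4
        decision = "blocked:not_confirmed"
    else:
        decision = "allowed"
    audit(log, {"role": role, "tool": tool, "decision": decision})
    return decision
```

Each layer can stop the request on its own, and the audit record is written whether the request is blocked or allowed, so the log covers refusals as well as actions.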

Field notes & essays

Long-form writing on production AI engineering — what the failure modes actually look like, and how the architectural choices land.

2026 · The launch essay behind Bulwark

The Web Is Now Weaponized Against Your AI Agents

Why prompt injection is the SQL injection of the agent era — Google's April 2026 catalog of agent threat vectors, what it actually means for production teams, and the architectural principles behind a five-layer defense.

2026 · Developer experience · Companion to claude-auth-setup

Claude Code Authentication: Cross-Platform Setup Script

Why setting up Claude Code authentication takes 15 minutes across six docs pages, what each of the six auth paths actually requires (subscription, API key, OAuth, Bedrock, Vertex AI, Foundry), and the cross-platform script that automates it. The companion piece for the claude-auth-setup open-source release.

Ongoing · Field Notes newsletter

Field Notes: Production AI

A running series on the gap between research-prototype AI and production AI — the failure modes that don't show up in benchmarks, the metrics that matter at 3 AM, and the architectural choices that determine whether a system holds up under real load.

Get new essays as they drop.

Subscribe on Medium for new field notes on production agentic AI, evaluation, and security.

Open-source contributions

Every framework, every architecture, every paper: code-first, open source, on GitHub. Bulwark and argus-ai are Apache 2.0; claude-auth-setup is MIT.

OSS · Apache 2.0

Bulwark

Five-layer agent security framework. Sanitizer + detector + RBAC + human gate + audit. pip install bulwark-agent-security

OSS · Apache 2.0

argus-ai

Reference implementation of G-ARVIS. Six-dimension LLM scoring + agentic metrics. pip install argus-ai

OSS

PulseFlow

MLOps platform for the production model lifecycle: experiment tracking, model registry, automated evaluation pipelines, drift detection, governance dashboards.

OSS

TorchForge

PyTorch production framework — standardized training loops, model registry, one-command deployment. Built because most PyTorch models never reach production.

OSS

CLARIFY

Explainable AI framework for regulated industries — human-readable explanations with audit-grade traceability for healthcare, finance, and compliance environments.

OSS

ScalableGen

Vitess-based horizontal MySQL sharding for healthcare and genomics. Zero-downtime migrations from monolithic to sharded production.

OSS · MIT

claude-auth-setup

Cross-platform Bash + Batch scripts for Claude Code authentication setup. Six auth paths (subscription, API key, OAuth, AWS Bedrock, Google Vertex AI, Microsoft Foundry) behind one guided flow. Production-tested on Windows, macOS, Linux, WSL.
