v0.1.0 — Open source release live

Production LLM observability from detection to correction

argus-ai scores every LLM response across six quality dimensions. Three lines of Python. Sub-5 ms evaluation. Zero external API calls. See the degradation your dashboards miss.

Download on GitHub Watch Demo
pip install argus-ai
quickstart.py

# 3-line integration
import argus_ai

argus = argus_ai.init(profile="enterprise")
result = argus.evaluate(
    prompt="What are the Q3 revenue trends?",
    response="Revenue increased 12% YoY to $4.2B.",
    context="Q3 financial report: 12% YoY growth.",
    model_name="claude-sonnet-4",
    latency_ms=1200.0,
)

2026-03-17 17:42:08 [info] garvis_scorer_initialized
profile=enterprise
G-ARVIS Composite Score: 0.847
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Groundedness 0.823
Accuracy 0.912
Reliability 0.858
Variance 0.770
Inference Cost 0.715
Safety 0.950
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Passing: ✓ TRUE Eval latency: 2.3ms Alerts: 0
agentic_eval.py

from argus_ai.types import AgenticEvalRequest

workflow = AgenticEvalRequest(
    prompt="Research competitors and generate report",
    response="Report generated with 5 competitor analyses.",
    steps_planned=8,
    steps_completed=7,
    steps_failed=2,
    steps_recovered=1,
    retries=3,
    total_cost_usd=0.45,
)
result, agentic = argus.evaluate_agentic(workflow)

Agentic Metrics
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Agent Stability Factor 0.612
Error Recovery Rate 0.600
Cost Per Completed Step 0.357
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Steps: 7/8 completed Failures: 2 Recovered: 1
Total cost: $0.45 Per step: $0.064
monitoring.py

from argus_ai.monitoring.thresholds import ThresholdConfig
from argus_ai.monitoring.alerts import AlertRule, AlertSeverity

argus = argus_ai.init(
    profile="healthcare",
    thresholds=ThresholdConfig(safety_min=0.90),
    on_alert=lambda msg, res: pagerduty.trigger(msg),
)

Simulating production traffic...
Request 1 Composite: 0.872 Safety: 0.950 ✓ PASS
Request 2 Composite: 0.714 Safety: 0.720 ✗ FAIL
[CRITICAL] safety score 0.720 below threshold 0.90
PII detected: emails=1, phones=1
Request 3 Composite: 0.698 Safety: 0.850 ✗ FAIL
[HIGH] accuracy score 0.580 below threshold 0.65
speculation_penalty: 0.15
Monitor Statistics
safety: 2/3 breaches (67%)
accuracy: 1/3 breaches (33%)
composite: 2/3 breaches (67%)
[SUSTAINED] safety degradation: 67% of last 3 evaluations below threshold

Six dimensions of LLM quality

Each response scored 0 to 1. The weighted composite tells you, in a single number, whether your LLM is production-grade.

G · Groundedness
Is the response anchored in provided context or fabricating? Your hallucination detector for RAG, document QA, and knowledge-grounded applications.
A · Accuracy
Does it match ground truth? Internal consistency checks, numeric validation, and token-level F1 against reference answers.
R · Reliability
Format consistency, completeness detection, length proportionality, and latency SLA compliance. Is the output structurally usable?
V · Variance
Output determinism, confidence signals, and vocabulary diversity. Catches drift between what your prompt was designed to produce and what it actually produces today.
I · Inference Cost
Token efficiency, cost-per-word, and latency-to-value ratio. A 2,000-token response to a yes/no question is a cost problem this dimension catches.
S · Safety
PII leakage (email, SSN, credit card, phone), toxicity markers, prompt injection detection, jailbreak compliance, and harmful content patterns.
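As a sketch of how a weighted composite over these dimensions works, here is a minimal dot product of scores and weights. The weights below are illustrative assumptions for demonstration, not argus-ai's actual enterprise profile, so the result will not reproduce the 0.847 shown above:

```python
# Illustrative weights only — NOT argus-ai's real profile values.
ENTERPRISE_WEIGHTS = {
    "groundedness": 0.20,
    "accuracy": 0.20,
    "reliability": 0.15,
    "variance": 0.10,
    "inference_cost": 0.10,
    "safety": 0.25,
}

def composite(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-dimension scores, each in [0, 1]."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(scores[dim] * w for dim, w in weights.items())

# The per-dimension scores from the quickstart output above.
scores = {
    "groundedness": 0.823, "accuracy": 0.912, "reliability": 0.858,
    "variance": 0.770, "inference_cost": 0.715, "safety": 0.950,
}
print(f"composite: {composite(scores, ENTERPRISE_WEIGHTS):.3f}")
```

Profiles like healthcare or finance would shift weight toward safety and accuracy; the mechanism stays the same.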

Agentic evaluation metrics

Three metrics for autonomous workflow monitoring that BLEU, ROUGE, and perplexity were never designed to measure.

ASF · Agent Stability Factor
Completion rate × failure resilience × retry consistency. Tells you whether your agent reliably finishes what it starts. Production threshold: ≥ 0.85.
ERR · Error Recovery Rate
Recovered steps divided by failed steps. Measures whether your agent self-corrects or cascades failures through the workflow. Production threshold: ≥ 0.70.
CPCS · Cost Per Completed Step
Total spend normalized against successfully completed steps. The real economic cost of autonomous execution. Production threshold: ≤ $0.10/step.
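ERR and CPCS follow directly from the definitions above; ASF's sub-factors are not spelled out here, so this sketch covers only the two ratio metrics. Note the library's reported ERR (0.600 in the output above) evidently applies normalization beyond the raw ratio, so treat these formulas as assumptions:

```python
def error_recovery_rate(recovered: int, failed: int) -> float:
    # Recovered steps divided by failed steps; perfect score when nothing failed.
    return min(recovered / failed, 1.0) if failed else 1.0

def cost_per_completed_step(total_cost_usd: float, completed: int) -> float:
    # Total spend normalized against successfully completed steps.
    return total_cost_usd / completed if completed else float("inf")

# Values from the agentic example above: 1 recovered, 2 failed, $0.45 over 7 steps.
err = error_recovery_rate(recovered=1, failed=2)
cpcs = cost_per_completed_step(0.45, 7)
print(f"ERR={err:.2f}, CPCS=${cpcs:.3f}/step")
```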

5 weight profiles

Enterprise, healthcare, finance, consumer, agentic. Pre-tuned for regulated and cost-sensitive workloads.

Threshold monitoring

Sliding window breach detection with configurable alert rules, severity levels, and callback hooks.
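Independent of argus-ai's internals, sliding-window breach detection can be sketched like this (class and parameter names are illustrative, not the library's API). Fed the three safety scores from the monitoring log above, it flags sustained degradation on the third request:

```python
from collections import deque

class BreachMonitor:
    """Track the fraction of recent scores below a threshold and flag
    sustained degradation once the window is full."""

    def __init__(self, threshold: float, window: int = 3, breach_rate: float = 0.5):
        self.threshold = threshold
        self.breach_rate = breach_rate
        self.history: deque[bool] = deque(maxlen=window)

    def record(self, score: float) -> bool:
        """Record one score; return True if degradation is sustained."""
        self.history.append(score < self.threshold)
        breaches = sum(self.history) / len(self.history)
        return len(self.history) == self.history.maxlen and breaches >= self.breach_rate

# Safety scores from the simulated traffic above: breach rate reaches 2/3 = 67%.
monitor = BreachMonitor(threshold=0.90, window=3)
flags = [monitor.record(s) for s in (0.950, 0.720, 0.850)]
```

A real implementation would attach severity levels and callbacks per rule; the window logic is the core.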

Prometheus & OTEL

Export G-ARVIS scores as Prometheus gauges/histograms or OpenTelemetry metrics for Grafana, Datadog, New Relic.
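As a minimal sketch of what the exported gauges look like on the wire, here is the Prometheus text exposition format rendered by hand (the metric name `argus_score` and its labels are assumptions, not argus-ai's actual metric names):

```python
def prometheus_lines(scores: dict[str, float], model: str) -> list[str]:
    """Render per-dimension scores as Prometheus gauge samples in the
    text exposition format that a /metrics endpoint would serve."""
    lines = ["# TYPE argus_score gauge"]
    for dim, value in scores.items():
        # One labeled sample per dimension, e.g.
        # argus_score{dimension="safety",model="claude-sonnet-4"} 0.95
        lines.append(f'argus_score{{dimension="{dim}",model="{model}"}} {value}')
    return lines

for line in prometheus_lines({"safety": 0.950, "accuracy": 0.912}, "claude-sonnet-4"):
    print(line)
```

In practice the library would publish these through a Prometheus client or OTEL exporter rather than hand-formatted strings; the shape of the data is the point.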

Provider wrappers

Drop-in InstrumentedAnthropic and InstrumentedOpenAI clients. Automatic scoring on every API call.
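The wrapper pattern behind such clients can be sketched generically: time the provider call, then hand prompt, response, and latency to the scorer. This shows the shape of the idea only, not the actual InstrumentedAnthropic/InstrumentedOpenAI implementation:

```python
import time
from typing import Any, Callable

def instrument(call: Callable[..., str], evaluate: Callable[..., Any]) -> Callable[..., str]:
    """Wrap a provider call so every response is timed and scored."""
    def wrapped(prompt: str, **kwargs: Any) -> str:
        start = time.perf_counter()
        response = call(prompt, **kwargs)          # the real LLM API call
        latency_ms = (time.perf_counter() - start) * 1000.0
        evaluate(prompt=prompt, response=response, latency_ms=latency_ms)
        return response
    return wrapped

# Stand-ins for the real provider client and scorer:
def fake_call(prompt: str) -> str:
    return f"echo: {prompt}"

evaluations: list[dict] = []
client = instrument(fake_call, lambda **kw: evaluations.append(kw))
client("hello")  # scored automatically on the way out
```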

84 tests, 93% coverage

Production-grade test suite across scoring, monitoring, alerting, and SDK layers. mypy strict mode.

Sub-5ms evaluation

Heuristic-based scoring with zero external API calls. Runs inline in your production hot path.

Open core architecture

Open source — Apache 2.0

argus-ai

  • G-ARVIS composite scorer
  • 6 individual dimension scorers
  • ASF, ERR, CPCS agentic metrics
  • 3-line SDK with init/evaluate/score
  • Threshold monitoring & alerting
  • Prometheus & OpenTelemetry export
  • Anthropic & OpenAI wrappers
  • @argus_evaluate decorator
Proprietary — ARGUS Platform

Autonomous correction

  • Orchestrator agent
  • Prompt optimizer (auto-tune)
  • Closed-loop self-healing pipeline
  • LLM-as-judge evaluation
  • Multi-run variance analysis
  • Async batch processing
  • Dashboard UI
  • SOC2/HIPAA compliance

Install argus-ai. Score your LLM calls.

Open source. Production-grade. Three lines of code.

Download on GitHub Contact Us

Enterprise inquiries & ARGUS Platform access: anil@ambharii.com