v0.1.0 — Open source release live

Production LLM observability from detection to correction

argus-ai scores every LLM response across six quality dimensions. Three lines of Python. Sub-5 ms evaluation. Zero external API calls. See the degradation your dashboards miss.

Download on GitHub Watch Demo
pip install argus-ai
quickstart.py

# 3-line integration
import argus_ai

argus = argus_ai.init(profile="enterprise")
result = argus.evaluate(
    prompt="What are the Q3 revenue trends?",
    response="Revenue increased 12% YoY to $4.2B.",
    context="Q3 financial report: 12% YoY growth.",
    model_name="claude-sonnet-4",
    latency_ms=1200.0,
)

2026-03-17 17:42:08 [info] garvis_scorer_initialized
profile=enterprise
G-ARVIS Composite Score: 0.847
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Groundedness 0.823
Accuracy 0.912
Reliability 0.858
Variance 0.770
Inference Cost 0.715
Safety 0.950
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Passing: ✓ TRUE Eval latency: 2.3ms Alerts: 0
agentic_eval.py

from argus_ai.types import AgenticEvalRequest

workflow = AgenticEvalRequest(
    prompt="Research competitors and generate report",
    response="Report generated with 5 competitor analyses.",
    steps_planned=8,
    steps_completed=7,
    steps_failed=2,
    steps_recovered=1,
    retries=3,
    total_cost_usd=0.45,
)
result, agentic = argus.evaluate_agentic(workflow)

Agentic Metrics
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Agent Stability Factor 0.612
Error Recovery Rate 0.600
Cost Per Completed Step 0.357
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Steps: 7/8 completed Failures: 2 Recovered: 1
Total cost: $0.45 Per step: $0.064
monitoring.py

from argus_ai.monitoring.thresholds import ThresholdConfig
from argus_ai.monitoring.alerts import AlertRule, AlertSeverity

argus = argus_ai.init(
    profile="healthcare",
    thresholds=ThresholdConfig(safety_min=0.90),
    on_alert=lambda msg, res: pagerduty.trigger(msg),
)

Simulating production traffic...
Request 1 Composite: 0.872 Safety: 0.950 ✓ PASS
Request 2 Composite: 0.714 Safety: 0.720 ✗ FAIL
[CRITICAL] safety score 0.720 below threshold 0.90
PII detected: emails=1, phones=1
Request 3 Composite: 0.698 Safety: 0.850 ✗ FAIL
[HIGH] accuracy score 0.580 below threshold 0.65
speculation_penalty: 0.15
Monitor Statistics
safety: 2/3 breaches (67%)
accuracy: 1/3 breaches (33%)
composite: 2/3 breaches (67%)
[SUSTAINED] safety degradation: 67% of last 3 evaluations below threshold

Six dimensions of LLM quality

Each response scored 0 to 1. The weighted composite tells you, in a single number, whether your LLM is production-grade.

G · Groundedness
Is the response anchored in provided context or fabricating? Your hallucination detector for RAG, document QA, and knowledge-grounded applications.
A · Accuracy
Does it match ground truth? Internal consistency checks, numeric validation, and token-level F1 against reference answers.
R · Reliability
Format consistency, completeness detection, length proportionality, and latency SLA compliance. Is the output structurally usable?
V · Variance
Output determinism, confidence signals, and vocabulary diversity. Catches drift between what your prompt was designed to produce and what it actually produces today.
I · Inference Cost
Token efficiency, cost-per-word, and latency-to-value ratio. A 2,000-token response to a yes/no question is a cost problem this dimension catches.
S · Safety
PII leakage (email, SSN, credit card, phone), toxicity markers, prompt injection detection, jailbreak compliance, and harmful content patterns.
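As a sketch of how a weighted composite over these dimensions works, here is a minimal dot product of scores and weights. The weights below are illustrative assumptions for demonstration, not argus-ai's actual enterprise profile, so the result will not reproduce the 0.847 shown above:

```python
# Illustrative weights only — NOT argus-ai's real profile values.
ENTERPRISE_WEIGHTS = {
    "groundedness": 0.20,
    "accuracy": 0.20,
    "reliability": 0.15,
    "variance": 0.10,
    "inference_cost": 0.10,
    "safety": 0.25,
}

def composite(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-dimension scores, each in [0, 1]."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(scores[dim] * w for dim, w in weights.items())

# The per-dimension scores from the quickstart output above.
scores = {
    "groundedness": 0.823, "accuracy": 0.912, "reliability": 0.858,
    "variance": 0.770, "inference_cost": 0.715, "safety": 0.950,
}
print(f"composite: {composite(scores, ENTERPRISE_WEIGHTS):.3f}")
```

Profiles like healthcare or finance would shift weight toward safety and accuracy; the mechanism stays the same.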

Agentic evaluation metrics

Three metrics for autonomous workflow monitoring that BLEU, ROUGE, and perplexity were never designed to measure.

ASF · Agent Stability Factor
Completion rate × failure resilience × retry consistency. Tells you whether your agent reliably finishes what it starts. Production threshold: ≥ 0.85.
ERR · Error Recovery Rate
Recovered steps divided by failed steps. Measures whether your agent self-corrects or cascades failures through the workflow. Production threshold: ≥ 0.70.
CPCS · Cost Per Completed Step
Total spend normalized against successfully completed steps. The real economic cost of autonomous execution. Production threshold: ≤ $0.10/step.
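ERR and CPCS follow directly from the definitions above; ASF's sub-factors are not spelled out here, so this sketch covers only the two ratio metrics. Note the library's reported ERR (0.600 in the output above) evidently applies normalization beyond the raw ratio, so treat these formulas as assumptions:

```python
def error_recovery_rate(recovered: int, failed: int) -> float:
    # Recovered steps divided by failed steps; perfect score when nothing failed.
    return min(recovered / failed, 1.0) if failed else 1.0

def cost_per_completed_step(total_cost_usd: float, completed: int) -> float:
    # Total spend normalized against successfully completed steps.
    return total_cost_usd / completed if completed else float("inf")

# Values from the agentic example above: 1 recovered, 2 failed, $0.45 over 7 steps.
err = error_recovery_rate(recovered=1, failed=2)
cpcs = cost_per_completed_step(0.45, 7)
print(f"ERR={err:.2f}, CPCS=${cpcs:.3f}/step")
```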

5 weight profiles

Enterprise, healthcare, finance, consumer, agentic. Pre-tuned for regulated and cost-sensitive workloads.

Threshold monitoring

Sliding window breach detection with configurable alert rules, severity levels, and callback hooks.
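Independent of argus-ai's internals, sliding-window breach detection can be sketched like this (class and parameter names are illustrative, not the library's API). Fed the three safety scores from the monitoring log above, it flags sustained degradation on the third request:

```python
from collections import deque

class BreachMonitor:
    """Track the fraction of recent scores below a threshold and flag
    sustained degradation once the window is full."""

    def __init__(self, threshold: float, window: int = 3, breach_rate: float = 0.5):
        self.threshold = threshold
        self.breach_rate = breach_rate
        self.history: deque[bool] = deque(maxlen=window)

    def record(self, score: float) -> bool:
        """Record one score; return True if degradation is sustained."""
        self.history.append(score < self.threshold)
        breaches = sum(self.history) / len(self.history)
        return len(self.history) == self.history.maxlen and breaches >= self.breach_rate

# Safety scores from the simulated traffic above: breach rate reaches 2/3 = 67%.
monitor = BreachMonitor(threshold=0.90, window=3)
flags = [monitor.record(s) for s in (0.950, 0.720, 0.850)]
```

A real implementation would attach severity levels and callbacks per rule; the window logic is the core.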

Prometheus & OTEL

Export G-ARVIS scores as Prometheus gauges/histograms or OpenTelemetry metrics for Grafana, Datadog, New Relic.
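As a minimal sketch of what the exported gauges look like on the wire, here is the Prometheus text exposition format rendered by hand (the metric name `argus_score` and its labels are assumptions, not argus-ai's actual metric names):

```python
def prometheus_lines(scores: dict[str, float], model: str) -> list[str]:
    """Render per-dimension scores as Prometheus gauge samples in the
    text exposition format that a /metrics endpoint would serve."""
    lines = ["# TYPE argus_score gauge"]
    for dim, value in scores.items():
        # One labeled sample per dimension, e.g.
        # argus_score{dimension="safety",model="claude-sonnet-4"} 0.95
        lines.append(f'argus_score{{dimension="{dim}",model="{model}"}} {value}')
    return lines

for line in prometheus_lines({"safety": 0.950, "accuracy": 0.912}, "claude-sonnet-4"):
    print(line)
```

In practice the library would publish these through a Prometheus client or OTEL exporter rather than hand-formatted strings; the shape of the data is the point.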

Provider wrappers

Drop-in InstrumentedAnthropic and InstrumentedOpenAI clients. Automatic scoring on every API call.
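The wrapper pattern behind such clients can be sketched generically: time the provider call, then hand prompt, response, and latency to the scorer. This shows the shape of the idea only, not the actual InstrumentedAnthropic/InstrumentedOpenAI implementation:

```python
import time
from typing import Any, Callable

def instrument(call: Callable[..., str], evaluate: Callable[..., Any]) -> Callable[..., str]:
    """Wrap a provider call so every response is timed and scored."""
    def wrapped(prompt: str, **kwargs: Any) -> str:
        start = time.perf_counter()
        response = call(prompt, **kwargs)          # the real LLM API call
        latency_ms = (time.perf_counter() - start) * 1000.0
        evaluate(prompt=prompt, response=response, latency_ms=latency_ms)
        return response
    return wrapped

# Stand-ins for the real provider client and scorer:
def fake_call(prompt: str) -> str:
    return f"echo: {prompt}"

evaluations: list[dict] = []
client = instrument(fake_call, lambda **kw: evaluations.append(kw))
client("hello")  # scored automatically on the way out
```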

84 tests, 93% coverage

Production-grade test suite across scoring, monitoring, alerting, and SDK layers. mypy strict mode.

Sub-5ms evaluation

Heuristic-based scoring with zero external API calls. Runs inline in your production hot path.

Open core architecture

Open source — Apache 2.0

argus-ai

  • G-ARVIS composite scorer
  • 6 individual dimension scorers
  • ASF, ERR, CPCS agentic metrics
  • 3-line SDK with init/evaluate/score
  • Threshold monitoring & alerting
  • Prometheus & OpenTelemetry export
  • Anthropic & OpenAI wrappers
  • @argus_evaluate decorator
Proprietary — ARGUS Platform

Autonomous correction

  • Orchestrator agent
  • Prompt optimizer (auto-tune)
  • Closed-loop self-healing pipeline
  • LLM-as-judge evaluation
  • Multi-run variance analysis
  • Async batch processing
  • Dashboard UI
  • SOC2/HIPAA compliance

Install argus-ai. Score your LLM calls.

Open source. Production-grade. Three lines of code.

Download on GitHub Contact Us

Enterprise inquiries & ARGUS Platform access: anil@ambharii.com