AccuracyAgent
AccuracyAgent evaluates whether an LLM's answer is factually correct. It combines a relevancy sub-check (weighted 40%) and a factual accuracy check (weighted 60%), both powered by GEval, DeepEval's LLM-as-a-judge metric.
When a RAGProvider is supplied, the factual check is grounded against retrieved documents instead of the judge model’s own knowledge.
Constructor
```python
AccuracyAgent(
    config_path: str | None = None,
    provider: str = "anthropic",
    model: str = "claude-haiku-4-5-20251001",
    rag: RAGProvider | None = None,
    threshold: float = 0.5,
)
```

| Parameter | Type | Default | Description |
|---|---|---|---|
| config_path | str \| None | None | Path to config.ini for API key loading. Falls back to env var then cwd. |
| provider | str | "anthropic" | LLM provider for the judge model (any litellm provider). |
| model | str | "claude-haiku-4-5-20251001" | Judge model identifier. |
| rag | RAGProvider \| None | None | Optional RAG retriever. When provided, factual checks are grounded against retrieved context. |
| threshold | float | 0.5 | Minimum combined score required to pass. Applied to 0.4 × relevancy + 0.6 × factual. See note below before changing this. |
update_threshold(threshold)
```python
def update_threshold(self, threshold: float) -> None
```

Updates the pass/fail threshold after construction. Also propagates the new threshold to the internal RelevancyAgent sub-check. Useful when evaluating the same agent against datasets from different domains without re-instantiating it.

| Parameter | Type | Description |
|---|---|---|
| threshold | float | New minimum combined score to pass. Must be in [0.0, 1.0]. |
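The section does not say what happens when a value outside [0.0, 1.0] is passed, so a caller-side guard may be useful. The sketch below is only an illustration: `checked_threshold` and `DOMAIN_THRESHOLDS` are hypothetical names, and the per-domain numbers are placeholders, not recommendations.

```python
def checked_threshold(value: float) -> float:
    """Return value unchanged if it lies in [0.0, 1.0]; raise otherwise."""
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"threshold must be in [0.0, 1.0], got {value!r}")
    return value


# Hypothetical per-domain thresholds; the numbers are placeholders.
DOMAIN_THRESHOLDS = {"general": 0.5, "medical": 0.8}

for domain, raw in DOMAIN_THRESHOLDS.items():
    threshold = checked_threshold(raw)
    # With a real agent: agent.update_threshold(threshold), then evaluate
    # that domain's dataset before moving on to the next one.
    print(domain, threshold)
```

Validating before the call keeps a bad configuration value from silently changing pass/fail results halfway through a run.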
evaluate(data)
```python
def evaluate(
    data: dict,  # {"question": str, "answer": str}
    on_progress=None,
) -> EvaluationResult
```

Runs the relevancy sub-check and the factual accuracy check, then returns a combined score.
data must be a dict with both "question" and "answer" keys.
Return value:
```
{
    "status": "PASS" | "FAIL",
    "score": float,  # 0.4 * relevancy_score + 0.6 * factual_score
    "reason": "Relevancy (0.85): <reason> | Factual (0.90): <reason>"
}
```

The combined score must be at least the configured threshold (0.5 by default) to pass.
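If the two sub-scores are needed separately, they can be recovered from the `reason` string. The sketch below assumes the string follows the documented `Relevancy (…): … | Factual (…): …` layout exactly; `split_reason` and `REASON_PATTERN` are hypothetical helpers, not part of the library, and the sample string is made up.

```python
import re
from typing import Optional

# Pattern derived from the documented reason format; the real string may differ.
REASON_PATTERN = re.compile(
    r"Relevancy \((?P<relevancy>\d*\.?\d+)\): (?P<relevancy_reason>.*?)"
    r" \| Factual \((?P<factual>\d*\.?\d+)\): (?P<factual_reason>.*)",
    re.DOTALL,
)


def split_reason(reason: str) -> Optional[dict]:
    """Return sub-scores and reasons, or None if the format does not match."""
    match = REASON_PATTERN.fullmatch(reason)
    if match is None:
        return None
    return {
        "relevancy_score": float(match["relevancy"]),
        "relevancy_reason": match["relevancy_reason"],
        "factual_score": float(match["factual"]),
        "factual_reason": match["factual_reason"],
    }


# Made-up reason string in the documented shape.
sample = "Relevancy (0.85): Addresses the question. | Factual (0.90): Matches known facts."
parts = split_reason(sample)
print(parts["relevancy_score"], parts["factual_score"])
```

Returning `None` on a mismatch, rather than raising, lets calling code fall back to the combined `score` if the reason text ever takes a different form.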
How scoring works
```
combined = 0.4 × relevancy_score + 0.6 × factual_score
status = "PASS" if combined ≥ 0.5 else "FAIL"
```

Both sub-scores come from GEval and are in the range [0.0, 1.0]. The factual check uses different evaluation criteria depending on whether a RAGProvider is attached:
- Without RAG: The judge uses its own knowledge to assess factual correctness. It is intentionally lenient on brief or single-word correct answers.
- With RAG: The judge uses retrieved context as the source of truth. Answers that contradict the context fail even if they’re plausible from general knowledge.
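The weighting can be checked by hand. The sketch below reproduces the arithmetic in plain Python; `combined_score` and `status_for` are illustrative names rather than library functions, and the sub-scores are made-up inputs, not real GEval output.

```python
RELEVANCY_WEIGHT = 0.4
FACTUAL_WEIGHT = 0.6


def combined_score(relevancy_score: float, factual_score: float) -> float:
    """Weighted sum described in this section (hypothetical helper)."""
    return RELEVANCY_WEIGHT * relevancy_score + FACTUAL_WEIGHT * factual_score


def status_for(score: float, threshold: float = 0.5) -> str:
    """Map a combined score to PASS/FAIL using the documented >= rule."""
    return "PASS" if score >= threshold else "FAIL"


# Made-up sub-scores: a relevant answer that the factual check rejects.
score = combined_score(relevancy_score=0.9, factual_score=0.2)
print(f"{score:.2f}", status_for(score))  # 0.48 FAIL
```

One consequence of these weights: relevancy alone contributes up to 0.4, so a fully relevant answer reaches 0.5 with a factual score of roughly 0.17. A higher threshold may be worth considering when factual correctness matters most.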
Examples
Without RAG
```python
from llm_validation_framework import AccuracyAgent

agent = AccuracyAgent()

result = agent.evaluate({
    "question": "Where is the Eiffel Tower?",
    "answer": "The Eiffel Tower is located in Paris, France.",
})

print(result["status"])  # "PASS"
print(f"{result['score']:.2f}")  # e.g. "0.88"
print(result["reason"])
```

With RAG
```python
from llm_validation_framework import AccuracyAgent, RAGProvider

# Any LangChain-compatible retriever
retriever = your_vectorstore.as_retriever()
rag = RAGProvider(retriever)

agent = AccuracyAgent(rag=rag)

result = agent.evaluate({
    "question": "Who founded the company according to our internal docs?",
    "answer": "Jane Smith founded the company in 2010.",
})
# Grounded against retrieved documents; will FAIL if docs say otherwise
```

Using a different judge model
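```python
import os

from llm_validation_framework import AccuracyAgent

# Use an OpenAI model as the judge instead of the default Anthropic model
agent = AccuracyAgent(
    provider="openai",
    model="gpt-4o-mini",
)
```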