AccuracyAgent

Requires API key

AccuracyAgent evaluates whether an LLM's answer is factually correct. It combines a relevancy sub-check (40%) with a factual accuracy check (60%), both powered by GEval, DeepEval's LLM-as-a-judge metric.

When a RAGProvider is supplied, the factual check is grounded against retrieved documents instead of the judge model’s own knowledge.

Constructor

AccuracyAgent(
    config_path: str | None = None,
    provider: str = "anthropic",
    model: str = "claude-haiku-4-5-20251001",
    rag: RAGProvider | None = None,
)
Parameters:

config_path (str | None, default None)
    Path to a config.ini used for API key loading; falls back to the environment variable, then to the current working directory.
provider (str, default "anthropic")
    LLM provider for the judge model (any litellm provider).
model (str, default "claude-haiku-4-5-20251001")
    Judge model identifier.
rag (RAGProvider | None, default None)
    Optional RAG retriever. When provided, factual checks are grounded against retrieved context.
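
For example, keys can be loaded from an explicit config file rather than the environment (the path below is illustrative):

from llm_validation_framework import AccuracyAgent

# Read API keys from a specific config.ini instead of env vars or the cwd
agent = AccuracyAgent(config_path="/etc/myapp/config.ini")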

evaluate(data)

def evaluate(
    data: dict,  # {"question": str, "answer": str}
    on_progress=None,
) -> EvaluationResult

Runs the relevancy sub-check and the factual accuracy check, then returns a combined score.

data must be a dict with both "question" and "answer" keys.

Return value:

{
    "status": "PASS" | "FAIL",
    "score": float,  # 0.4 * relevancy_score + 0.6 * factual_score
    "reason": "Relevancy (0.85): <reason> | Factual (0.90): <reason>"
}

The combined score must be ≥ 0.5 to pass.
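
Callers can branch on the documented keys; a minimal sketch (the question and answer values are placeholders):

result = agent.evaluate({"question": "...", "answer": "..."})
if result["status"] == "PASS":
    print(f"accepted with score {result['score']:.2f}")
else:
    # reason concatenates both sub-check explanations
    print(f"rejected: {result['reason']}")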

How scoring works

combined = 0.4 × relevancy_score + 0.6 × factual_score
status = "PASS" if combined ≥ 0.5 else "FAIL"

Both sub-scores come from GEval and are in the range [0.0, 1.0]. The factual check uses different evaluation criteria depending on whether a RAGProvider is attached:

  • Without RAG: The judge uses its own knowledge to assess factual correctness. It is intentionally lenient on brief or single-word correct answers.
  • With RAG: The judge uses retrieved context as the source of truth. Answers that contradict the context fail even if they’re plausible from general knowledge.
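
As a worked example, using the sub-scores shown in the return value above: combined = 0.4 × 0.85 + 0.6 × 0.90 = 0.34 + 0.54 = 0.88, which clears the 0.5 threshold, so the status is "PASS".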

Examples

Without RAG

from llm_validation_framework import AccuracyAgent

agent = AccuracyAgent()
result = agent.evaluate({
    "question": "Where is the Eiffel Tower?",
    "answer": "The Eiffel Tower is located in Paris, France.",
})
print(result["status"])           # "PASS"
print(f"{result['score']:.2f}")   # e.g. "0.88"
print(result["reason"])

With RAG

from llm_validation_framework import AccuracyAgent, RAGProvider

# Any LangChain-compatible retriever works
retriever = your_vectorstore.as_retriever()
rag = RAGProvider(retriever)

agent = AccuracyAgent(rag=rag)
result = agent.evaluate({
    "question": "Who founded the company according to our internal docs?",
    "answer": "Jane Smith founded the company in 2010.",
})
# Grounded against retrieved documents; FAILs if the docs say otherwise
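
If you don't already have a vector store, one possible setup uses LangChain's FAISS integration (the packages, embedding model, and document text below are assumptions for illustration, not requirements of the framework):

from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# Index a few internal documents (contents are illustrative)
vectorstore = FAISS.from_texts(
    ["Acme Corp was founded by Jane Smith in 2010."],
    embedding=OpenAIEmbeddings(),
)
retriever = vectorstore.as_retriever()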

Using a different judge model

import os

from llm_validation_framework import AccuracyAgent

# The judge's key is resolved via config.ini or environment variables;
# for the "openai" provider, litellm conventionally reads OPENAI_API_KEY.
os.environ["OPENAI_API_KEY"] = "sk-..."

agent = AccuracyAgent(
    provider="openai",
    model="gpt-4o-mini",
)
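
Any provider/model pair that litellm can route to should work the same way; only the judge changes, while the 0.4/0.6 scoring weights and the 0.5 pass threshold stay fixed.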