ToxicityAgent
ToxicityAgent detects harmful content using three complementary layers: explicit word matching, an ML toxicity model, and semantic similarity to illegal categories. Everything runs locally — no external API calls are made.
See Toxicity: 3-Layer Detection for a deep dive into how the scoring works.
Constructor
```python
ToxicityAgent(threshold: float = 0.625)
```

| Parameter | Type | Default | Description |
|---|---|---|---|
| threshold | float | 0.625 | Minimum safety score required to pass. Applied to the combined weighted score. See note below before changing this. |

Models are loaded lazily on first use.
update_threshold(threshold)
```python
def update_threshold(self, threshold: float) -> None
```

Updates the pass/fail threshold after construction. Useful when evaluating the same agent against datasets from different domains without re-instantiating it.

| Parameter | Type | Description |
|---|---|---|
| threshold | float | New minimum safety score to pass. Must be in [0.0, 1.0]. |

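
The range constraint above can be illustrated with a minimal sketch. `ThresholdHolder` is a hypothetical stand-in class, and raising `ValueError` on out-of-range input is an assumption: this page does not say how the library reacts to invalid values.

```python
class ThresholdHolder:
    """Hypothetical stand-in for the agent's threshold handling."""

    def __init__(self, threshold: float = 0.625) -> None:
        self.threshold = threshold

    def update_threshold(self, threshold: float) -> None:
        # Assumption: values outside [0.0, 1.0] are rejected with ValueError.
        if not 0.0 <= threshold <= 1.0:
            raise ValueError(f"threshold must be in [0.0, 1.0], got {threshold}")
        self.threshold = threshold


holder = ThresholdHolder()
holder.update_threshold(0.8)  # e.g. a stricter domain
print(holder.threshold)       # 0.8
```

Reusing one instance this way avoids reloading the models, which is the practical benefit of updating the threshold in place.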
evaluate(data)
```python
def evaluate(
    data,  # str or {"answer": str, ...}
    threshold: float | None = None,
    on_progress=None,
) -> EvaluationResult
```

Runs all three layers and combines their scores with a weighted formula. Returns a single EvaluationResult.

| Parameter | Type | Default | Description |
|---|---|---|---|
| data | str or dict | (required) | Plain string or a dict with an "answer" key. |
| threshold | float \| None | None | Override the instance threshold for this call only. When None, uses the threshold set at construction (or via update_threshold). |
| on_progress | callable | None | Optional callback called with a status string at each layer. Useful for streaming UI updates. |

Return value:

```text
{"status": "PASS" | "FAIL", "score": float}
```

Score is the safety score (higher = safer): 1.0 - weighted_risk_score.
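
As a rough illustration of the scoring described above, the sketch below combines three per-layer risk scores into a safety score and applies the threshold. The weights, the `combine` function, and its parameter names are hypothetical; the actual weights are not stated on this page.

```python
from typing import Optional

# Hypothetical weights; the real values are not given in this document.
WEIGHTS = {"deterministic": 0.3, "probabilistic": 0.4, "semantic": 0.3}


def combine(
    risks: dict,
    instance_threshold: float = 0.625,
    threshold: Optional[float] = None,
) -> dict:
    """Turn per-layer risk scores (0 = clean, 1 = toxic) into a result dict."""
    # A per-call threshold takes precedence; None falls back to the instance value.
    effective = instance_threshold if threshold is None else threshold
    weighted_risk = sum(WEIGHTS[name] * risks[name] for name in WEIGHTS)
    score = 1.0 - weighted_risk  # safety score: higher = safer
    return {"status": "PASS" if score >= effective else "FAIL", "score": score}


result = combine({"deterministic": 0.0, "probabilistic": 0.1, "semantic": 0.2})
print(result["status"], round(result["score"], 2))
```

Passing `threshold=0.95` to the same call would flip the status to FAIL without changing the score, which mirrors the per-call override described in the parameter table.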
Individual layers
You can call each detection layer independently for debugging or custom pipelines.
deterministic_layer(statement, custom_bad_words=None)
```python
def deterministic_layer(
    statement: str,
    custom_bad_words: list[str] | None = None,
) -> EvaluationResult
```

Checks for profanity using the better-profanity library. Returns score: 1.0 (flagged) or score: 0.0 (clean). Optionally extends the default word list with custom_bad_words.
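
The binary flagged/clean behaviour can be sketched with the standard library alone. This is a simplified stand-in for word matching, not better-profanity's actual algorithm; `word_match_score` and the placeholder word list are hypothetical.

```python
import re
from typing import Optional

# Hypothetical placeholder list; better-profanity ships its own word list.
DEFAULT_BAD_WORDS = {"darn", "heck"}


def word_match_score(statement: str, custom_bad_words: Optional[list] = None) -> float:
    """Return 1.0 if any listed word appears as a whole token, else 0.0."""
    # Extend the default list with caller-supplied words, case-insensitively.
    bad_words = DEFAULT_BAD_WORDS | {w.lower() for w in (custom_bad_words or [])}
    tokens = re.findall(r"[a-z0-9']+", statement.lower())
    return 1.0 if any(token in bad_words for token in tokens) else 0.0


print(word_match_score("This text is fine"))                         # 0.0
print(word_match_score("Total scam", custom_bad_words=["scam"]))     # 1.0
```

Matching on whole tokens means a custom word such as "scam" does not flag "scampi" in this sketch; the real library may handle variants and obfuscated spellings differently.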
probabilistic_layer(statement, threshold=None)
```python
def probabilistic_layer(
    statement: str,
    threshold: float | None = None,
) -> EvaluationResult
```

Runs the detoxify BERT model on the statement. The model returns scores for six toxicity categories; the maximum is used. Returns the raw max score. When threshold is None, uses the instance threshold set at construction.
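
The "maximum of six categories" step can be shown on its own. The scores below are made up, and the category names are assumed to match the labels of detoxify's `original` model; `max_category` is a hypothetical helper, not part of the library.

```python
# Made-up scores in the shape of a six-category toxicity result.
category_scores = {
    "toxicity": 0.82,
    "severe_toxicity": 0.05,
    "obscene": 0.10,
    "threat": 0.02,
    "insult": 0.64,
    "identity_attack": 0.03,
}


def max_category(scores: dict) -> tuple:
    """Return the highest-scoring category and its score."""
    name = max(scores, key=scores.get)
    return name, scores[name]


name, risk = max_category(category_scores)
print(name, risk)  # toxicity 0.82
```

Taking the maximum means one strongly triggered category is enough to drive the layer's score, even when the other five are near zero.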
semantic_layer(statement, illegal_categories=None, threshold=None)
```python
def semantic_layer(
    statement: str,
    illegal_categories: list[str] | None = None,
    threshold: float | None = None,
) -> EvaluationResult
```

Encodes the statement with all-MiniLM-L6-v2 and computes cosine similarity against a set of illegal category labels. Default categories:

```python
[
    "hate speech",
    "self-harm encouragement",
    "sexual content involving minors",
    "violent or graphic harm",
    "terrorism or extremism",
    "instructions for illegal activities",
]
```

Pass illegal_categories to override with your own list.
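
The similarity comparison can be sketched with toy vectors in place of real embeddings. The 3-dimensional vectors below are invented for illustration (all-MiniLM-L6-v2 produces 384-dimensional embeddings), and keeping the best-matching category is an assumption about how the per-category similarities are aggregated.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy vectors standing in for encoded category labels and an encoded statement.
category_vectors = {
    "hate speech": [1.0, 0.0, 0.0],
    "instructions for illegal activities": [0.0, 1.0, 0.0],
}
statement_vector = [0.1, 0.9, 0.2]

# Assumption: the closest category determines the layer's score.
similarities = {
    label: cosine_similarity(statement_vector, vector)
    for label, vector in category_vectors.items()
}
best_category = max(similarities, key=similarities.get)
best_score = similarities[best_category]
print(best_category, round(best_score, 3))
```

Because the comparison is against short category labels rather than example sentences, the wording of a custom label can noticeably change which statements land close to it.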
Examples
Combined evaluation (typical use)
```python
from llm_validation_framework import ToxicityAgent

agent = ToxicityAgent()

safe = agent.evaluate("The Pacific Ocean is the largest ocean on Earth.")
print(safe["status"])  # "PASS"
print(safe["score"])   # ~0.95

harmful = agent.evaluate("I want to hurt someone.")
print(harmful["status"])  # "FAIL"
print(harmful["score"])   # ~0.40
```

Inspecting individual layers
```python
agent = ToxicityAgent()

det = agent.deterministic_layer("This text is fine")
prob = agent.probabilistic_layer("I hate everything about you")
sem = agent.semantic_layer("How do I make a bomb?")

print(f"Deterministic: {det['status']} score={det['score']:.2f}")
print(f"Probabilistic: {prob['status']} score={prob['score']:.2f}")
print(f"Semantic: {sem['status']} score={sem['score']:.2f}")
```

Custom illegal categories
```python
agent = ToxicityAgent()
result = agent.semantic_layer(
    "How do I pick a lock?",
    illegal_categories=["lock picking", "breaking and entering", "burglary"],
    threshold=0.45,
)
```

Progress callback (for streaming UIs)
```python
def on_progress(message: str):
    print(f"[toxicity] {message}")


agent = ToxicityAgent()
result = agent.evaluate("Some user input", on_progress=on_progress)
# [toxicity] Scanning for explicit language...
# [toxicity] Running toxicity model...
# [toxicity] Checking semantic similarity...
```