Judges

A Judge evaluates whether text satisfies a semantic intent. semantix ships with six judge implementations, each offering a different speed/accuracy/cost tradeoff.

Overview

| Judge | Speed | Model size | API key | Recommended threshold |
| --- | --- | --- | --- | --- |
| QuantizedNLIJudge | ~10ms | ~25 MB | No | 0.3 |
| NLIJudge | ~15ms | ~85 MB | No | 0.3 |
| EmbeddingJudge | ~5ms | ~80 MB | No | 0.8 |
| LLMJudge | ~500ms | Remote | Yes | 0.7 |
| ForensicJudge | Varies | Wraps any judge | Depends | Same as wrapped judge |
| CachingJudge | Instant (cached) | Wraps any judge | Depends | Same as wrapped judge |

NLIJudge

The default judge. Uses a cross-encoder NLI (Natural Language Inference) model to check whether the output entails the intent description.

from semantix import NLIJudge

judge = NLIJudge()
# or with a custom model:
judge = NLIJudge(model_name="cross-encoder/nli-MiniLM2-L6-H768")

How it works: Converts the intent description into an NLI hypothesis (e.g., "The text must politely decline..." becomes "Someone is politely declining..."), then scores the entailment probability using softmax over the cross-encoder logits. The progressive framing scores higher on NLI models because they treat the premise as evidence of an ongoing action.

Requires: pip install semantix-ai (installs sentence-transformers)
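The final scoring step -- softmax over the cross-encoder logits -- can be sketched in isolation. The logits below are made-up stand-ins for real model outputs, and the label order shown is model-specific, so treat this as illustrative rather than the judge's actual internals:

```python
import math

def entailment_probability(logits):
    """Softmax over (contradiction, entailment, neutral) logits,
    returning the entailment probability."""
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return exps[1] / sum(exps)  # index 1 = entailment (label order varies by model)

# Hypothetical logits for premise = model output, hypothesis = intent.
logits = [-2.1, 3.4, 0.2]  # (contradiction, entailment, neutral)
score = entailment_probability(logits)
print(round(score, 4))  # a score near 1.0 means strong entailment
```

A score above the judge's threshold (0.3 by default) would pass the intent check.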

QuantizedNLIJudge

INT8 quantized version of NLIJudge using ONNX Runtime. Same accuracy, ~50% faster, and no PyTorch dependency.

from semantix import QuantizedNLIJudge

judge = QuantizedNLIJudge()

Auto-detects the best ONNX variant for your CPU architecture:

  • AVX-512 VNNI
  • AVX-512
  • AVX2 (default x86)
  • ARM64

Requires: pip install "semantix-ai[turbo]" (installs onnxruntime, tokenizers, huggingface-hub)
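For intuition, INT8 quantization maps float weights onto 8-bit integers with a shared scale factor, trading a little precision for smaller tensors and faster integer math. A minimal sketch of the general technique (these function names are illustrative, not semantix or ONNX Runtime internals):

```python
def quantize_int8(weights):
    """Symmetric INT8 quantization: map floats to [-128, 127] with one scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 codes."""
    return [x * scale for x in q]

weights = [0.5, -1.27, 0.03]
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)
print(q)  # → [50, -127, 3]
print([round(a, 3) for a in approx])  # → [0.5, -1.27, 0.03]
```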

Tip

When the turbo extra is installed, @validate_intent and assert_semantic automatically prefer QuantizedNLIJudge over NLIJudge.

EmbeddingJudge

Uses cosine similarity between sentence embeddings. The fastest option but less accurate for nuanced intents.

from semantix import EmbeddingJudge

judge = EmbeddingJudge()
# or with a custom model:
judge = EmbeddingJudge(model_name="all-MiniLM-L6-v2")

How it works: Encodes both the output and intent description using a sentence transformer, then computes cosine similarity. Good for simple semantic similarity checks, less reliable for nuanced requirements like negation or multi-part intents.

Requires: pip install semantix-ai (installs sentence-transformers)
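The scoring step reduces to a cosine similarity between two embedding vectors. The 3-dimensional vectors below are toy stand-ins for real sentence-transformer embeddings (typically 384+ dimensions), so this is a sketch of the math, not the judge's code:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot product over norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

output_vec = [0.2, 0.8, 0.1]   # hypothetical embedding of the output
intent_vec = [0.25, 0.7, 0.05] # hypothetical embedding of the intent
print(round(cosine_similarity(output_vec, intent_vec), 4))
```

This geometric view also explains the weakness with negation: "refunds are allowed" and "refunds are not allowed" embed to nearby vectors, so their cosine similarity stays high even though the meanings oppose each other.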

LLMJudge

Asks a lightweight LLM whether the output satisfies the intent. Most accurate, but requires an API key and is slower.

from semantix import LLMJudge

judge = LLMJudge(model="gpt-4o-mini")
# or with explicit config:
judge = LLMJudge(
    model="gpt-4o-mini",
    api_key="sk-...",          # or set OPENAI_API_KEY env var
    base_url="https://...",    # optional, for compatible endpoints
)

How it works: Sends a structured prompt asking the LLM to score the output 0.0-1.0 with a one-sentence reason. Uses temperature=0 and max_tokens=100 for deterministic, fast responses.

Compatible endpoints: Any OpenAI-compatible API -- Azure OpenAI, local vLLM, Ollama, etc. Just set base_url.

Requires: pip install "semantix-ai[openai]" (installs openai)
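The request/parse cycle can be sketched as follows. The prompt wording and reply format here are hypothetical -- the real judge's prompt is internal -- but the shape of the exchange is the same: ask for a score plus a one-sentence reason, then parse both out of the reply:

```python
def build_prompt(output: str, intent: str) -> str:
    """Assemble a hypothetical scoring prompt for the LLM."""
    return (
        "Score 0.0-1.0 how well the TEXT satisfies the INTENT, "
        "then give a one-sentence reason.\n"
        f"INTENT: {intent}\nTEXT: {output}\n"
        "Reply as: SCORE: <number> REASON: <sentence>"
    )

def parse_reply(reply: str) -> tuple[float, str]:
    """Split a 'SCORE: ... REASON: ...' reply into its parts."""
    score_part, _, reason = reply.partition("REASON:")
    score = float(score_part.replace("SCORE:", "").strip())
    return score, reason.strip()

# A hypothetical LLM reply:
reply = "SCORE: 0.92 REASON: The text declines politely without rudeness."
score, reason = parse_reply(reply)
print(score)  # → 0.92
```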

ForensicJudge

Wraps any judge and adds token-level attribution on failure. On success, returns the base verdict unchanged (zero overhead).

from semantix import ForensicJudge, QuantizedNLIJudge

judge = ForensicJudge(QuantizedNLIJudge())
judge = ForensicJudge(QuantizedNLIJudge(), top_k=5)

How it works: On failure, runs mask-perturbation saliency analysis -- removes each token one at a time and measures how much the contradiction score drops. The tokens whose removal most reduces contradiction are flagged as "suspect tokens". The result is a structured Breach Report in the verdict's reason field:

## Breach Report

**Score:** 0.0512
**Base judge reason:** ...

### Token Attribution
**indemnify** (0.15), **forfeit** (0.12), **waive** (0.08)

### Summary
Intent failed. High contradiction detected. Suspect Tokens: [indemnify, forfeit, waive]

| Parameter | Description |
| --- | --- |
| base_judge | Any Judge instance to wrap |
| top_k | Number of suspect tokens to identify (default 3) |
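The attribution loop itself is simple to sketch. Here `contradiction_score` is a toy stand-in for the wrapped judge's scoring (the real judge re-runs the base NLI model per masked variant); only the leave-one-out structure is faithful:

```python
# Toy model: each "legal trap" token contributes a fixed amount of contradiction.
SUSPECT = {"indemnify": 0.15, "forfeit": 0.12, "waive": 0.08}

def contradiction_score(tokens):
    return sum(SUSPECT.get(t, 0.0) for t in tokens)

def top_suspect_tokens(tokens, top_k=3):
    """Mask-perturbation saliency: drop each token, measure the score drop."""
    base = contradiction_score(tokens)
    drops = []
    for i, tok in enumerate(tokens):
        masked = tokens[:i] + tokens[i + 1:]  # remove one token at a time
        drops.append((base - contradiction_score(masked), tok))
    drops.sort(reverse=True)  # largest drop = most responsible for the failure
    return [(tok, round(drop, 2)) for drop, tok in drops[:top_k]]

tokens = "you must indemnify us and forfeit or waive all claims".split()
print(top_suspect_tokens(tokens))
# → [('indemnify', 0.15), ('forfeit', 0.12), ('waive', 0.08)]
```

Note the cost: one extra judge call per token, which is why this analysis only runs on failure.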

CachingJudge

LRU-cache wrapper around any judge. Caches verdicts keyed on (output_hash, description_hash, threshold).

from semantix import CachingJudge, NLIJudge

judge = CachingJudge(NLIJudge(), maxsize=256)

Useful when you're validating the same outputs repeatedly (e.g., in test suites or batch processing).

# Check cache stats
print(judge.hits, judge.misses)

# Clear the cache
judge.clear()

| Parameter | Description |
| --- | --- |
| judge | The underlying judge to delegate to on cache misses |
| maxsize | Maximum number of cached verdicts (default 256, LRU eviction) |
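The caching strategy can be sketched with an ordered dict keyed on content hashes. This `TinyCachingJudge` is an illustrative stand-in, not the library class, and `base_judge` here is just a callable:

```python
from collections import OrderedDict
import hashlib

class TinyCachingJudge:
    """Sketch of an LRU verdict cache keyed on (output_hash, description_hash, threshold)."""

    def __init__(self, base_judge, maxsize=256):
        self.base_judge = base_judge
        self.maxsize = maxsize
        self.cache = OrderedDict()
        self.hits = self.misses = 0

    def _key(self, output, description, threshold):
        h = lambda s: hashlib.sha256(s.encode()).hexdigest()
        return (h(output), h(description), threshold)

    def evaluate(self, output, description, threshold=0.8):
        key = self._key(output, description, threshold)
        if key in self.cache:
            self.hits += 1
            self.cache.move_to_end(key)  # refresh LRU position
            return self.cache[key]
        self.misses += 1
        verdict = self.base_judge(output, description, threshold)
        self.cache[key] = verdict
        if len(self.cache) > self.maxsize:
            self.cache.popitem(last=False)  # evict least recently used entry
        return verdict

judge = TinyCachingJudge(lambda o, d, t: len(o) > 3, maxsize=2)
judge.evaluate("hello", "non-empty greeting")
judge.evaluate("hello", "non-empty greeting")  # served from cache
print(judge.hits, judge.misses)  # → 1 1
```

Hashing the output and description keeps keys small even for long texts, and including the threshold ensures a verdict cached at one threshold is never reused at another.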

The Judge interface

All judges implement the same interface:

from semantix.judges import Judge, Verdict

class MyCustomJudge(Judge):
    recommended_threshold = 0.7  # optional

    def evaluate(
        self,
        output: str,
        intent_description: str,
        threshold: float = 0.8,
    ) -> Verdict:
        # Your logic here
        return Verdict(passed=True, score=0.95, reason="Looks good")

The Verdict dataclass:

| Field | Type | Description |
| --- | --- | --- |
| passed | bool | Whether the output satisfies the intent |
| score | float \| None | 0-1 confidence score (when available) |
| reason | str \| None | Optional explanation |
  • Intents -- what judges evaluate against
  • Testing -- using judges in assertions
  • Training -- calibrating thresholds from data