DSPy

semantix integrates with DSPy (v2.6+) by providing reward and metric functions that are compatible with dspy.Refine, dspy.BestOfN, dspy.Evaluate, and optimizers like dspy.MIPROv2 — running locally on a quantized NLI model, with no API calls and ~15 ms per evaluation.

Install

pip install "semantix-ai[dspy]"

semantic_reward

Use semantic_reward with dspy.Refine or dspy.BestOfN:

import dspy
from semantix import Intent
from semantix.integrations.dspy import semantic_reward

class Polite(Intent):
    """The text must be polite and professional."""

qa = dspy.ChainOfThought("question -> answer")
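# threshold is the reward at which dspy.Refine stops retrying; 0.8 is an illustrative value
refined = dspy.Refine(module=qa, N=3, reward_fn=semantic_reward(Polite), threshold=0.8)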
result = refined(question="Handle an angry customer")

With dspy.BestOfN:

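# threshold is the reward at which dspy.BestOfN stops sampling; 0.8 is an illustrative value
best = dspy.BestOfN(module=qa, N=5, reward_fn=semantic_reward(Polite), threshold=0.8)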
result = best(question="Handle an angry customer")

Parameters

semantic_reward(
    intent: type[Intent] | str,
    *,
    field: str | None = None,
    judge: Judge | None = None,
    threshold: float | None = None,
) -> Callable[[dict, Any], float]

Parameter   Description
intent      An Intent subclass or a plain-English description string.
field       The prediction field to validate. If None, uses the last field.
judge       Judge backend override. Defaults to QuantizedNLIJudge.
threshold   Score threshold. Outputs scoring at or above it get reward 1.0; outputs below it get their raw score.
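
The threshold rule in the table can be read as the mapping below. This is a minimal sketch of the documented behavior, not semantix's actual code: apply_threshold is a hypothetical helper name, and returning the raw score when threshold is None is an assumption.

```python
from typing import Optional


def apply_threshold(score: float, threshold: Optional[float]) -> float:
    """Hypothetical helper mirroring the documented threshold rule."""
    if threshold is not None and score >= threshold:
        return 1.0  # at or above the threshold: full reward
    # Below the threshold, or no threshold set (assumed): raw score.
    return score


print(apply_threshold(0.92, 0.8))  # at or above the threshold
print(apply_threshold(0.55, 0.8))  # below the threshold
print(apply_threshold(0.55, None))  # no threshold configured
```

One consequence of this shape is that outputs that clear the bar all receive the same reward, while outputs below it remain ordered by their raw score.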

Using a plain string

reward = semantic_reward("must be polite and professional")
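# threshold is the reward at which dspy.Refine stops retrying; 0.8 is an illustrative value
refined = dspy.Refine(module=qa, N=3, reward_fn=reward, threshold=0.8)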

Specifying a field

reward = semantic_reward(Polite, field="answer")

semantic_metric

Use semantic_metric with dspy.Evaluate and DSPy optimizers like dspy.MIPROv2:

from semantix.integrations.dspy import semantic_metric

metric = semantic_metric(Polite)

# With dspy.Evaluate
evaluator = dspy.Evaluate(devset=dev_data, metric=metric)
score = evaluator(my_module)

# With optimizers
optimizer = dspy.MIPROv2(metric=metric, ...)
optimized = optimizer.compile(my_module, trainset=train_data)

Parameters

semantic_metric(
    intent: type[Intent] | str,
    *,
    field: str | None = None,
    judge: Judge | None = None,
    threshold: float | None = None,
) -> Callable[..., float]

Same parameters as semantic_reward. The returned function has the signature metric(example, pred) -> float expected by DSPy evaluators and optimizers.
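
To make the metric shape concrete, here is a small stand-alone sketch that averages a metric over a dev set by hand. It does not use DSPy or semantix: toy_metric, toy_program, and average_metric are hypothetical stand-ins, examples and predictions are plain dicts, and the optional trace argument follows the usual DSPy metric convention rather than anything confirmed about semantic_metric.

```python
from typing import Any, Callable, List, Optional


def toy_metric(example: dict, pred: dict, trace: Optional[Any] = None) -> float:
    """Stub metric (not NLI): 1.0 if the answer contains a courtesy word."""
    answer = pred["answer"].lower()
    return 1.0 if ("please" in answer or "thank" in answer) else 0.0


def toy_program(example: dict) -> dict:
    """Stand-in for a DSPy module: returns a dict with an 'answer' field."""
    return {"answer": f"Thank you for asking about {example['question']}."}


def average_metric(
    program: Callable[[dict], dict],
    devset: List[dict],
    metric: Callable[..., float],
) -> float:
    """Run the program on each example and average the metric scores."""
    scores = [metric(example, program(example)) for example in devset]
    return sum(scores) / len(scores)


devset = [{"question": "refunds"}, {"question": "shipping"}]
print(average_metric(toy_program, devset, toy_metric))
```

The point of the sketch is the calling convention: a metric receives the dataset example and the prediction, whereas a reward function receives the call arguments and the prediction.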

How it works

The reward/metric function extracts text from the prediction's specified field (or the last field by default), evaluates it against the intent using a local NLI judge, and returns the entailment score as a float between 0.0 and 1.0.
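
The steps above can be sketched as follows. This is an illustration under stated assumptions, not the semantix implementation: make_reward and toy_judge are hypothetical names, the prediction is modeled as a plain dict whose insertion order stands in for field order, and the judge is a keyword-overlap stub rather than an NLI model.

```python
from typing import Any, Callable, Optional


def make_reward(
    judge: Callable[[str, str], float],
    intent: str,
    field: Optional[str] = None,
) -> Callable[[dict, Any], float]:
    """Build a reward function with the (args, pred) -> float shape."""

    def reward(args: dict, pred: Any) -> float:
        fields = dict(pred)  # stand-in: treat the prediction as an ordered mapping
        # Use the requested field, or fall back to the last one.
        name = field if field is not None else list(fields)[-1]
        score = judge(str(fields[name]), intent)
        return min(1.0, max(0.0, score))  # keep the result in [0.0, 1.0]

    return reward


def toy_judge(text: str, intent: str) -> float:
    """Stub, NOT an NLI model: fraction of intent words found in the text."""
    words = intent.lower().split()
    return sum(word in text.lower() for word in words) / len(words)


pred = {"reasoning": "Customer is upset.", "answer": "A polite, professional apology."}
reward = make_reward(toy_judge, "polite professional")
print(reward({"question": "Handle an angry customer"}, pred))
```

Passing field="reasoning" to make_reward would score the first field instead of the last, which is the same choice the field parameter exposes.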

Benchmarks

semantic_reward has been benchmarked against a 70B LLM-judge baseline (Groq Llama 3.3 70B) across two DSPy optimization tasks. Summary:

Dimension                                       semantix (local NLI)   Groq Llama 3.3 70B
Avg latency per evaluation                      ~15 ms                 ~500-900 ms
Cost per 1,000 evaluations                      $0                     ~$0.13
API key required                                No                     Yes
Reward-agreement with Gemini 2.5 Flash proxy
BestOfN(N=5) paired win-rate
Full methodology, raw CSVs, run metadata, and notebooks (which render on GitHub): benchmarks/dspy/.
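
As a rough back-of-the-envelope reading of the latency and cost rows, the sketch below scales the per-evaluation figures to a whole run. The workload (200 examples with up to 5 reward calls each) is a hypothetical example, and the calculation assumes evaluations run sequentially with no batching or caching.

```python
def sequential_seconds(n_evals: int, ms_per_eval: float) -> float:
    """Total wall-clock seconds if evaluations run one after another."""
    return n_evals * ms_per_eval / 1000.0


def total_cost_usd(n_evals: int, usd_per_1000: float) -> float:
    """Total cost given a price per 1,000 evaluations."""
    return n_evals / 1000.0 * usd_per_1000


# Hypothetical workload: 200 examples, up to 5 reward calls each.
n_evals = 200 * 5

print("local NLI seconds:", sequential_seconds(n_evals, 15))
print("LLM judge seconds:", sequential_seconds(n_evals, 500), "to", sequential_seconds(n_evals, 900))
print("LLM judge cost (USD):", total_cost_usd(n_evals, 0.13))
```

Under these assumptions the local judge spends seconds where the remote judge spends minutes, which is the gap the table summarizes.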

Two experiments per task

  • Reward-agreement: every judge scores every output; agreement is measured as Pearson r against Gemini 2.5 Flash (the operational proxy for ground truth).
  • Optimization-impact: dspy.BestOfN(N=5) with semantix as reward_fn, versus the same loop with Groq as reward_fn. The final output is scored by Flash, and paired win/loss/tie counts are reported.
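
The two statistics named above can be computed as in the sketch below. It is not the benchmark code: pearson_r and paired_outcomes are hypothetical helper names, the tie tolerance is an assumption, and the score lists are placeholder values for illustration, not benchmark results.

```python
import math
from typing import Sequence, Tuple


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def paired_outcomes(
    a: Sequence[float], b: Sequence[float], tol: float = 1e-9
) -> Tuple[int, int, int]:
    """Count (wins, losses, ties) for a versus b on the same inputs."""
    wins = sum(x > y + tol for x, y in zip(a, b))
    losses = sum(y > x + tol for x, y in zip(a, b))
    return wins, losses, len(a) - wins - losses


# Placeholder scores, NOT benchmark data.
judge_scores = [0.1, 0.4, 0.6, 0.9]
proxy_scores = [0.2, 0.5, 0.7, 1.0]
print("Pearson r:", pearson_r(judge_scores, proxy_scores))

final_a = [0.9, 0.5, 0.7]  # proxy score of outputs selected with reward A
final_b = [0.8, 0.5, 0.9]  # proxy score of outputs selected with reward B
print("win/loss/tie:", paired_outcomes(final_a, final_b))
```

Pairing matters here because both loops see the same inputs, so each input contributes exactly one win, loss, or tie.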

When to prefer semantix over an LLM-judge reward

  • You're iterating on a DSPy program and don't want reward evaluation to slow down your feedback loop.
  • You need deterministic, seedable rewards (NLI scores are stable; LLM judges are not).
  • You want to run optimization on CI without API costs or rate limits.

When to prefer an LLM-judge

  • The intent requires world knowledge or multi-step reasoning beyond entailment.
  • You need freeform rationales, not a single score.
  • Evaluation volume is low (<100 calls) and the latency/cost trade-off doesn't matter.