kschappell 2519641fa3

- Refactor CLI metric computation to eliminate code duplication
- Update version format to PEP 440 compliance (0.1.0.dev0)
- Cache Settings instance via @lru_cache for performance
- Document composite validators' protocol deviation
- Consolidate redundant empty checks in ROUGE-L computation
- Add Phase 10 (Portfolio Demos) to implementation plan

2025-05-25 13:06:51 +00:00


Veritext

Semantic text validation framework for Python.

Veritext validates text outputs against quality criteria using metrics such as BLEU, ROUGE, and semantic similarity. It is designed for developers who build systems that produce text (chatbots, content generators, summarisation tools) and who need automated quality assurance beyond simple string matching.

Features

Multiple metrics — BLEU, ROUGE, lexical similarity, readability, semantic embeddings
Composable validators — Build complex checks from simple primitives
Native pytest integration — validate_text() assertion for test suites
Quality benchmarking — Track metrics over time with regression detection
CLI tools — Command-line validation and benchmark management

Installation

pip install veritext

# With semantic similarity support (sentence-transformers)
# Quoted so that shells such as zsh do not treat the brackets as a glob
pip install "veritext[semantic]"

Quick Start

from veritext.core.types import ValidationContext
from veritext.validators import all_of, bleu, length, rouge

# Create a validator
validator = all_of([
    bleu(min_score=0.5),
    rouge(min_score=0.6),
    length(max_chars=500),
])

# Validate text
context = ValidationContext(reference="The quick brown fox jumps over the lazy dog.")
result = validator.check("A fast brown fox leaps over a sleepy dog.", context)

if result.passed:
    print("Validation passed!")
else:
    print(result.failure_summary)

Metrics

Veritext provides several metrics for text evaluation.

BLEU

Measures n-gram precision against reference text. Useful for translation and generation quality.

from veritext.metrics import Bleu

bleu = Bleu()
result = bleu.score(
    candidate="The cat sat on the mat.",
    reference="A cat is sitting on the mat.",
)
print(f"BLEU-4: {result.bleu4:.3f}")  # Uses 1-4 gram precision
print(f"BLEU-1: {result.bleu1:.3f}")  # Unigram precision only
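
For intuition about "n-gram precision", here is a minimal standalone sketch of clipped n-gram precision, the building block of BLEU. It is not Veritext's implementation: the helper names (`ngrams`, `clipped_precision`) are hypothetical, tokenisation is plain whitespace splitting, and the brevity penalty and smoothing that a full BLEU score uses are left out.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams that also occur in the reference.

    Each candidate n-gram is credited at most as many times as it
    appears in the reference ("clipping"), so repeating a matching
    word does not inflate the score.
    """
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0  # candidate too short to contain any n-gram
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total


candidate = "the cat sat on the mat"
reference = "a cat is sitting on the mat"
print(f"unigram precision: {clipped_precision(candidate, reference, n=1):.3f}")
print(f"bigram precision:  {clipped_precision(candidate, reference, n=2):.3f}")
```

Here "the" appears twice in the candidate but only once in the reference, so only one occurrence is counted as a match.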

ROUGE

Measures recall-oriented overlap with reference text. Useful for summarisation.

from veritext.metrics import Rouge

rouge = Rouge()
result = rouge.score(
    candidate="Scientists found a new planet.",
    reference="Researchers discovered a new planet in the solar system.",
)
print(f"ROUGE-1 F1: {result.rouge1.fmeasure:.3f}")  # Unigram overlap
print(f"ROUGE-L F1: {result.rouge_l.fmeasure:.3f}")  # Longest common subsequence
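
The "longest common subsequence" behind ROUGE-L can be sketched in a few lines. This is an illustrative standalone version, not Veritext's code: `lcs_length` and `rouge_l_f1` are hypothetical names, and tokenisation is simple lowercased whitespace splitting.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """F1 of LCS-based precision (vs. candidate) and recall (vs. reference)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0  # also covers empty candidate or reference
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1(
    "scientists found a new planet",
    "researchers discovered a new planet in the solar system",
)
print(f"ROUGE-L F1 (sketch): {score:.3f}")
```

The shared subsequence here is "a new planet": precision is high because the candidate is short, while recall is lower because most of the reference is not covered.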

Lexical Similarity

Measures token overlap using Jaccard similarity.

from veritext.metrics import Lexical

lexical = Lexical()
result = lexical.score(
    candidate="The quick brown fox",
    reference="The fast brown fox",
)
print(f"Jaccard: {result.jaccard:.3f}")
print(f"Token overlap: {result.token_overlap:.3f}")
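
As a reference point, Jaccard similarity is the size of the intersection of two token sets divided by the size of their union. The sketch below is standalone and makes its own assumptions (lowercased whitespace tokens, 0.0 for two empty inputs); `jaccard_sketch` is a hypothetical helper, not part of Veritext.

```python
def jaccard_sketch(candidate, reference):
    """Jaccard similarity |A ∩ B| / |A ∪ B| over sets of lowercased tokens."""
    a = set(candidate.lower().split())
    b = set(reference.lower().split())
    union = a | b
    if not union:
        return 0.0  # convention chosen for this sketch when both texts are empty
    return len(a & b) / len(union)


# Shared tokens: the, brown, fox (3). Union adds quick and fast (5).
print(jaccard_sketch("The quick brown fox", "The fast brown fox"))
```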

Readability

Computes the Flesch-Kincaid grade level and the Flesch reading ease score to estimate text complexity.

from veritext.metrics import Readability

readability = Readability()
result = readability.score("This is a simple sentence.")
print(f"Grade level: {result.flesch_kincaid_grade:.1f}")
print(f"Reading ease: {result.flesch_reading_ease:.1f}")
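
Both scores come from the standard published formulas, which depend only on word, sentence, and syllable counts. The sketch below takes those counts as inputs so that it stays independent of any particular syllable counter; how Veritext counts syllables is not covered here, and the function names are hypothetical.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Standard Flesch reading ease formula; higher means easier to read."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Standard Flesch-Kincaid formula; approximates a US school grade level."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# "This is a simple sentence." counted by hand:
# 5 words, 1 sentence, 7 syllables (this, is, a, sim-ple, sen-tence).
ease = flesch_reading_ease(words=5, sentences=1, syllables=7)
grade = flesch_kincaid_grade(words=5, sentences=1, syllables=7)
print(f"Reading ease: {ease:.2f}")
print(f"Grade level: {grade:.2f}")
```

Longer sentences and more syllables per word push the reading ease down and the grade level up, which is why `max_grade` and `min_ease` constraints pull in the same direction.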

Semantic Similarity (Optional)

Requires the optional semantic extra, installed with pip install "veritext[semantic]".

from veritext.semantic import SemanticSimilarity

semantic = SemanticSimilarity()
result = semantic.score(
    candidate="The dog is running in the park.",
    reference="A canine is jogging through the garden.",
)
print(f"Similarity: {result.score:.3f}")
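
Embedding-based similarity is commonly computed as the cosine of the angle between two sentence vectors. Whether Veritext does exactly this internally is an assumption; the sketch below only shows the comparison step on plain lists of floats, with made-up vectors standing in for real embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for this sketch: a zero vector matches nothing
    return dot / (norm_u * norm_v)


# Toy 3-dimensional "embeddings" (illustrative values, not model output).
dog_running = [0.9, 0.1, 0.3]
canine_jogging = [0.8, 0.2, 0.4]
stock_report = [0.0, 0.9, 0.1]

print(f"related:   {cosine_similarity(dog_running, canine_jogging):.3f}")
print(f"unrelated: {cosine_similarity(dog_running, stock_report):.3f}")
```

Unlike the lexical metrics, this comparison can score paraphrases highly even when they share few or no tokens, provided the embedding model places them close together.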

Validators

Validators wrap metrics with thresholds to make pass/fail decisions.

Metric-Based Validators

from veritext.core.types import ValidationContext
from veritext.validators import bleu, lexical, rouge

context = ValidationContext(reference="Reference text here.")

# BLEU validation
validator = bleu(min_score=0.5, variant=4)  # BLEU-4
result = validator.check("Candidate text here.", context)

# ROUGE validation
validator = rouge(min_score=0.6, variant="l")  # ROUGE-L
result = validator.check("Candidate text here.", context)

# Lexical validation
validator = lexical(min_jaccard=0.3, min_overlap=0.5)
result = validator.check("Candidate text here.", context)

Constraint Validators

These don't require reference text.

from veritext.core.types import ValidationContext
from veritext.validators import contains, excludes, length, readability

context = ValidationContext()  # No reference needed

# Length constraints
validator = length(min_chars=50, max_chars=500, min_words=10)
result = validator.check("Your text here...", context)

# Readability constraints
validator = readability(max_grade=8.0, min_ease=60.0)
result = validator.check("Your text here...", context)

# Content requirements
validator = contains(patterns=["important", "keyword"])
result = validator.check("This important text has a keyword.", context)

# Content exclusions
validator = excludes(patterns=["forbidden", "banned"])
result = validator.check("This text is clean.", context)

Composite Validators

Combine multiple checks with logical operators.

from veritext.validators import all_of, any_of, bleu, length, rouge

# All checks must pass
validator = all_of([
    bleu(min_score=0.5),
    rouge(min_score=0.6),
    length(max_chars=500),
])

# At least one check must pass
validator = any_of([
    bleu(min_score=0.7),
    rouge(min_score=0.7),
])
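
The logic of the two combinators can be illustrated with plain predicates. This is a simplified sketch, not the library's code: Veritext's validators take a context and return result objects, whereas the hypothetical `all_of_sketch` and `any_of_sketch` below work on functions from text to bool.

```python
from typing import Callable, Iterable

# In this sketch a "check" is just a function from text to True/False.
Check = Callable[[str], bool]


def all_of_sketch(checks: Iterable[Check]) -> Check:
    """Combined check that passes only if every individual check passes."""
    checks = list(checks)
    return lambda text: all(check(text) for check in checks)


def any_of_sketch(checks: Iterable[Check]) -> Check:
    """Combined check that passes if at least one individual check passes."""
    checks = list(checks)
    return lambda text: any(check(text) for check in checks)


def short_enough(text: str) -> bool:
    return len(text) <= 20


def mentions_fox(text: str) -> bool:
    return "fox" in text.lower()


strict = all_of_sketch([short_enough, mentions_fox])
lenient = any_of_sketch([short_enough, mentions_fox])

print(strict("The quick brown fox"))          # short and mentions a fox
print(strict("A fox wandered far from home"))  # mentions a fox but too long
print(lenient("A fox wandered far from home"))
```

Because the combined object is itself a check, combinators can be nested, for example an `any_of` inside an `all_of`.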

Pytest Plugin

Veritext provides native pytest integration for testing text quality.

Basic Usage

from veritext.pytest_plugin import validate_text


def test_response_quality():
    response = "This is a helpful response to your question."

    validate_text(
        response,
        min_length=20,
        max_length=200,
        max_reading_grade=10.0,
        must_contain=["helpful"],
        must_exclude=["error", "sorry"],
    )


def test_summary_similarity():
    summary = "Scientists discovered a new planet."
    reference = "Researchers found a new planet in our solar system."

    validate_text(
        summary,
        reference=reference,
        min_rouge=0.5,
        min_length=10,
    )

Available Parameters

Parameter	Description
`reference`	Reference text for comparison metrics
`min_bleu`	Minimum BLEU-4 score (0.0-1.0)
`min_rouge`	Minimum ROUGE-L F1 score (0.0-1.0)
`min_semantic`	Minimum semantic similarity (0.0-1.0)
`min_length`	Minimum character count
`max_length`	Maximum character count
`max_reading_grade`	Maximum Flesch-Kincaid grade level
`must_contain`	List of required patterns
`must_exclude`	List of forbidden patterns

Benchmarking

Track text quality over time and detect regressions.

Running Benchmarks

from veritext.benchmark import Benchmark

# Create a benchmark suite
bench = Benchmark("summariser_quality", storage_path="benchmarks/")

# Evaluate a batch of outputs
candidates = ["Summary 1...", "Summary 2...", "Summary 3..."]
references = ["Reference 1...", "Reference 2...", "Reference 3..."]

run = bench.evaluate(
    candidates=candidates,
    references=references,
    metrics=["rouge_l", "bleu4"],
    metadata={"model": "v1.2", "git_sha": "abc123"},
)

print(f"ROUGE-L: {run.metrics['rouge_l']:.4f}")
print(f"BLEU-4: {run.metrics['bleu4']:.4f}")

Regression Detection

from veritext.benchmark import Benchmark
from veritext.core.exceptions import RegressionDetectedError

bench = Benchmark("summariser_quality")

# Check for regression against historical baseline
report = bench.check_regression(tolerance=0.05, window=10)
if report.detected:
    print("Quality regression detected!")
    for metric, delta in report.deltas.items():
        print(f"  {metric}: {delta:+.4f}")

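# Or raise an exception for CI integration
try:
    bench.assert_no_regression(tolerance=0.05)
except RegressionDetectedError as e:
    print(f"CI failure: {e}")
    # SystemExit works in scripts; the bare exit() helper is only
    # guaranteed in interactive sessions
    raise SystemExit(1)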
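
One plausible reading of `tolerance` and `window` is sketched below: the latest score is compared with the mean of up to `window` earlier runs, and a drop larger than `tolerance` counts as a regression. This is an assumption made for illustration (the README does not say whether the tolerance is absolute or relative, or how the baseline is aggregated), and `detect_regression` is a hypothetical helper, not a Veritext function.

```python
from statistics import mean


def detect_regression(history, tolerance=0.05, window=10):
    """Compare the newest score with the mean of the preceding runs.

    history: scores for one metric, ordered oldest to newest.
    Returns (detected, delta), where delta = latest - baseline.
    """
    if len(history) < 2:
        return False, 0.0  # nothing to compare against yet
    *previous, latest = history
    baseline = mean(previous[-window:])
    delta = latest - baseline
    return delta < -tolerance, delta


rouge_l_history = [0.60, 0.62, 0.61, 0.50]
detected, delta = detect_regression(rouge_l_history, tolerance=0.05, window=10)
print(f"detected={detected}, delta={delta:+.4f}")
```

Averaging over a window rather than comparing with the single previous run makes the check less sensitive to one unusually good or bad run.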

Viewing History

from veritext.benchmark import Benchmark

bench = Benchmark("summariser_quality")

for run in bench.get_history(limit=10):
    print(f"{run.timestamp}: rouge_l={run.metrics.get('rouge_l', 0):.4f}")

CLI

Veritext provides a command-line interface for validation and benchmarking.

Validate Text

# Inline validation
veritext validate "Candidate text" -r "Reference text" -m bleu,rouge

# File-based batch validation (JSONL with "candidate" and "reference" fields)
veritext validate -f outputs.jsonl -m bleu,rouge,lexical

# With threshold for pass/fail
veritext validate "Text" -r "Reference" -m bleu -t 0.5 -o simple

# Output formats: table (default), json, simple
veritext validate "Text" -r "Reference" -m bleu -o json

Benchmark Commands

# Run a benchmark evaluation
veritext benchmark run my_bench -f data.jsonl -m rouge_l,bleu4

# View benchmark history
veritext benchmark show my_bench --last 10

# Check for regression (exits with code 1 if detected)
veritext benchmark check my_bench --tolerance 0.05 --window 10

JSONL Format

File-based commands read JSONL: one JSON object per line, each with candidate and reference fields:

{"candidate": "Model output 1", "reference": "Expected output 1"}
{"candidate": "Model output 2", "reference": "Expected output 2"}
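
If the outputs come from a Python pipeline, a file in this shape can be written and read back with the standard library alone. The helpers below are hypothetical names for illustration, not part of Veritext; they assume UTF-8 files and skip blank lines.

```python
import json
from pathlib import Path


def write_pairs(path, pairs):
    """Write (candidate, reference) pairs as one JSON object per line."""
    with Path(path).open("w", encoding="utf-8") as fh:
        for candidate, reference in pairs:
            record = {"candidate": candidate, "reference": reference}
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_pairs(path):
    """Read (candidate, reference) pairs, reporting the line of any bad record."""
    pairs = []
    with Path(path).open(encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            try:
                pairs.append((record["candidate"], record["reference"]))
            except KeyError as exc:
                raise ValueError(f"line {line_no}: missing field {exc}") from exc
    return pairs


if __name__ == "__main__":
    write_pairs("outputs.jsonl", [("Model output 1", "Expected output 1")])
    print(read_pairs("outputs.jsonl"))
```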

Configuration

Veritext uses environment variables for configuration:

Variable	Default	Description
`VERITEXT_LOG_LEVEL`	`INFO`	Logging level
`VERITEXT_LOG_FORMAT`	`console`	Log format (`console` or `json`)
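
To show how these variables and their defaults interact, here is a minimal sketch of reading them with the standard library. It does not reproduce Veritext's own settings code; `load_log_config` is a hypothetical helper, and rejecting unknown formats is a choice made for this example.

```python
import os


def load_log_config(env=None):
    """Read the two logging variables, falling back to the documented defaults.

    env: optional mapping used instead of os.environ (handy for tests).
    """
    env = os.environ if env is None else env
    level = env.get("VERITEXT_LOG_LEVEL", "INFO")
    log_format = env.get("VERITEXT_LOG_FORMAT", "console")
    if log_format not in ("console", "json"):
        raise ValueError(f"unsupported log format: {log_format!r}")
    return {"level": level, "format": log_format}


print(load_log_config({}))  # no variables set: defaults apply
print(load_log_config({"VERITEXT_LOG_LEVEL": "DEBUG", "VERITEXT_LOG_FORMAT": "json"}))
```

In a shell, the equivalent is to export the variables before running the CLI or the test suite.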

Development

Setup

git clone https://gitea.kschappell.com/kschappell/veritext.git
cd veritext
uv sync --all-extras

Quality Checks

# Linting
uv run ruff check .

# Formatting
uv run ruff format --check .

# Type checking
uv run mypy src/

# Tests
uv run pytest

Running Examples

uv run python examples/basic_validation.py
uv run pytest examples/chatbot_testing.py -v
uv run python examples/benchmark_regression.py

Licence

MIT