Veritext
Semantic text validation framework for Python.
Veritext validates text outputs against quality criteria using metrics such as BLEU, ROUGE, and semantic similarity. It is designed for developers who build systems that produce text (chatbots, content generators, summarisation tools) and need automated quality assurance beyond simple string matching.
Features
- Multiple metrics — BLEU, ROUGE, lexical similarity, readability, semantic embeddings
- Composable validators — Build complex checks from simple primitives
- Native pytest integration — validate_text() assertion for test suites
- Quality benchmarking — Track metrics over time with regression detection
- CLI tools — Command-line validation and benchmark management
Installation
pip install veritext
# With semantic similarity support (sentence-transformers)
pip install veritext[semantic]
Quick Start
from veritext.core.types import ValidationContext
from veritext.validators import all_of, bleu, length, rouge
# Create a validator
validator = all_of([
    bleu(min_score=0.5),
    rouge(min_score=0.6),
    length(max_chars=500),
])
# Validate text
context = ValidationContext(reference="The quick brown fox jumps over the lazy dog.")
result = validator.check("A fast brown fox leaps over a sleepy dog.", context)
if result.passed:
    print("Validation passed!")
else:
    print(result.failure_summary)
Metrics
Veritext provides several metrics for text evaluation.
BLEU
Measures n-gram precision against reference text. Useful for translation and generation quality.
from veritext.metrics import Bleu
bleu = Bleu()
result = bleu.score(
    candidate="The cat sat on the mat.",
    reference="A cat is sitting on the mat.",
)
print(f"BLEU-4: {result.bleu4:.3f}") # Uses 1-4 gram precision
print(f"BLEU-1: {result.bleu1:.3f}") # Unigram precision only
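To make "n-gram precision" concrete, here is a minimal standalone sketch of the two ingredients BLEU is built from: clipped n-gram precision and the brevity penalty. It does not use Veritext and is not its implementation; the function names and the whitespace tokenisation are assumptions chosen for illustration.

```python
import math
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate: list[str], reference: list[str], n: int) -> float:
    """Fraction of candidate n-grams that also occur in the reference.

    Each candidate n-gram is credited at most as many times as it
    appears in the reference (the "clipping" step).
    """
    cand_counts = ngrams(candidate, n)
    ref_counts = ngrams(reference, n)
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total


def brevity_penalty(candidate_len: int, reference_len: int) -> float:
    """Penalise candidates that are shorter than the reference."""
    if candidate_len == 0:
        return 0.0
    if candidate_len >= reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)


# Hypothetical pre-tokenised input: lowercase, no punctuation.
candidate = "the cat sat on the mat".split()
reference = "a cat is sitting on the mat".split()

p1 = clipped_precision(candidate, reference, 1)
p2 = clipped_precision(candidate, reference, 2)
bp = brevity_penalty(len(candidate), len(reference))
print(f"unigram precision: {p1:.3f}")
print(f"bigram precision:  {p2:.3f}")
print(f"brevity penalty:   {bp:.3f}")
```

BLEU-N then combines the precisions for n = 1..N with a geometric mean and multiplies by the brevity penalty. Real implementations usually add smoothing so that a single zero precision does not force the whole score to zero.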
ROUGE
Measures recall-oriented overlap with reference text. Useful for summarisation.
from veritext.metrics import Rouge
rouge = Rouge()
result = rouge.score(
    candidate="Scientists found a new planet.",
    reference="Researchers discovered a new planet in the solar system.",
)
print(f"ROUGE-1 F1: {result.rouge1.fmeasure:.3f}") # Unigram overlap
print(f"ROUGE-L F1: {result.rouge_l.fmeasure:.3f}") # Longest common subsequence
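For readers unfamiliar with ROUGE-L, the sketch below shows how a longest-common-subsequence (LCS) score can be turned into precision, recall, and F1. It is a standalone illustration rather than Veritext's code; the helper names and the lowercase whitespace tokenisation are assumptions.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l(candidate: str, reference: str) -> tuple[float, float, float]:
    """Return (precision, recall, F1) based on the LCS of the two texts."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0, 0.0, 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


precision, recall, f1 = rouge_l(
    "scientists found a new planet",
    "researchers discovered a new planet in the solar system",
)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

The shared subsequence here is "a new planet", so precision is 3/5 and recall is 3/9. The low recall reflects how much of the reference the candidate leaves out, which is why ROUGE suits summarisation checks.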
Lexical Similarity
Measures token overlap using Jaccard similarity.
from veritext.metrics import Lexical
lexical = Lexical()
result = lexical.score(
    candidate="The quick brown fox",
    reference="The fast brown fox",
)
print(f"Jaccard: {result.jaccard:.3f}")
print(f"Token overlap: {result.token_overlap:.3f}")
Readability
Computes the Flesch-Kincaid grade level and the Flesch reading ease score to estimate text complexity.
from veritext.metrics import Readability
readability = Readability()
result = readability.score("This is a simple sentence.")
print(f"Grade level: {result.flesch_kincaid_grade:.1f}")
print(f"Reading ease: {result.flesch_reading_ease:.1f}")
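Both readability scores come from the same three counts: words, sentences, and syllables. The sketch below applies the standard published formulas to those counts. The syllable counter is a crude vowel-run heuristic and the regex splitting is an assumption, so the numbers will not necessarily match what Veritext reports.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): one syllable per run of vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def text_counts(text: str) -> tuple[int, int, int]:
    """Return (words, sentences, syllables) using simple regex splitting."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    syllables = sum(count_syllables(word) for word in words)
    return len(words), len(sentences), syllables


def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Higher is easier; plain English typically scores 60 or above."""
    if words == 0 or sentences == 0:
        raise ValueError("text must contain at least one word and one sentence")
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Approximate US school grade needed to read the text."""
    if words == 0 or sentences == 0:
        raise ValueError("text must contain at least one word and one sentence")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


counts = text_counts("This is a simple sentence.")
print(f"Reading ease: {flesch_reading_ease(*counts):.1f}")
print(f"Grade level:  {flesch_kincaid_grade(*counts):.1f}")
```

Longer sentences and longer words push the grade level up and the reading ease down, which is why the two thresholds in a readability check usually point in opposite directions (a maximum grade, a minimum ease).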
Semantic Similarity (Optional)
Requires the optional semantic extra: pip install veritext[semantic].
from veritext.semantic import SemanticSimilarity
semantic = SemanticSimilarity()
result = semantic.score(
    candidate="The dog is running in the park.",
    reference="A canine is jogging through the garden.",
)
print(f"Similarity: {result.score:.3f}")
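Embedding-based similarity is commonly computed as the cosine of the angle between two sentence vectors. Whether Veritext uses exactly this measure is an assumption; the sketch below only illustrates the idea with hand-written toy vectors in place of real sentence-transformers embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Assumed convention for a zero vector.
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings" (hypothetical values, not model output).
dog_running = [0.9, 0.1, 0.3]
canine_jogging = [0.8, 0.2, 0.4]
stock_report = [0.0, 1.0, 0.0]

print(f"related:   {cosine_similarity(dog_running, canine_jogging):.3f}")
print(f"unrelated: {cosine_similarity(dog_running, stock_report):.3f}")
```

Unlike BLEU, ROUGE, or Jaccard, this comparison does not depend on shared tokens, which is what lets paraphrases with no words in common still score highly.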
Validators
Validators wrap metrics with thresholds to make pass/fail decisions.
Metric-Based Validators
from veritext.core.types import ValidationContext
from veritext.validators import bleu, lexical, rouge
context = ValidationContext(reference="Reference text here.")
# BLEU validation
validator = bleu(min_score=0.5, variant=4) # BLEU-4
result = validator.check("Candidate text here.", context)
# ROUGE validation
validator = rouge(min_score=0.6, variant="l") # ROUGE-L
result = validator.check("Candidate text here.", context)
# Lexical validation
validator = lexical(min_jaccard=0.3, min_overlap=0.5)
result = validator.check("Candidate text here.", context)
Constraint Validators
These don't require reference text.
from veritext.core.types import ValidationContext
from veritext.validators import contains, excludes, length, readability
context = ValidationContext() # No reference needed
# Length constraints
validator = length(min_chars=50, max_chars=500, min_words=10)
result = validator.check("Your text here...", context)
# Readability constraints
validator = readability(max_grade=8.0, min_ease=60.0)
result = validator.check("Your text here...", context)
# Content requirements
validator = contains(patterns=["important", "keyword"])
result = validator.check("This important text has a keyword.", context)
# Content exclusions
validator = excludes(patterns=["forbidden", "banned"])
result = validator.check("This text is clean.", context)
Composite Validators
Combine multiple checks with logical operators.
from veritext.validators import all_of, any_of, bleu, length, rouge
# All checks must pass
validator = all_of([
    bleu(min_score=0.5),
    rouge(min_score=0.6),
    length(max_chars=500),
])
# At least one check must pass
validator = any_of([
    bleu(min_score=0.7),
    rouge(min_score=0.7),
])
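The composition pattern itself is small enough to sketch without the library. In the standalone example below, a check is any function from text to a result, and two combinators build larger checks from smaller ones. Every name here (`Result`, `max_chars`, `must_contain`, `combine_all`, `combine_any`) is hypothetical and chosen for illustration; Veritext's own validator protocol may differ.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Result:
    passed: bool
    failures: list[str] = field(default_factory=list)


Check = Callable[[str], Result]


def max_chars(limit: int) -> Check:
    """Primitive check: text must not exceed `limit` characters."""
    def check(text: str) -> Result:
        if len(text) <= limit:
            return Result(True)
        return Result(False, [f"length {len(text)} exceeds {limit} characters"])
    return check


def must_contain(word: str) -> Check:
    """Primitive check: text must contain `word`."""
    def check(text: str) -> Result:
        if word in text:
            return Result(True)
        return Result(False, [f"missing required word {word!r}"])
    return check


def combine_all(checks: list[Check]) -> Check:
    """Pass only if every check passes; collect all failure messages."""
    def check(text: str) -> Result:
        results = [c(text) for c in checks]
        failures = [msg for r in results for msg in r.failures]
        return Result(all(r.passed for r in results), failures)
    return check


def combine_any(checks: list[Check]) -> Check:
    """Pass if at least one check passes; report failures only if none do."""
    def check(text: str) -> Result:
        results = [c(text) for c in checks]
        if any(r.passed for r in results):
            return Result(True)
        return Result(False, [msg for r in results for msg in r.failures])
    return check


strict = combine_all([max_chars(10), must_contain("fox")])
lenient = combine_any([max_chars(10), must_contain("fox")])
print(strict("a very long brown dog").failures)
print(lenient("a very long brown fox").passed)
```

Because a combined check has the same shape as a primitive one, combinators can be nested freely, for example an "all of" that contains an "any of".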
Pytest Plugin
Veritext provides native pytest integration for testing text quality.
Basic Usage
from veritext.pytest_plugin import validate_text
def test_response_quality():
    response = "This is a helpful response to your question."
    validate_text(
        response,
        min_length=20,
        max_length=200,
        max_reading_grade=10.0,
        must_contain=["helpful"],
        must_exclude=["error", "sorry"],
    )


def test_summary_similarity():
    summary = "Scientists discovered a new planet."
    reference = "Researchers found a new planet in our solar system."
    validate_text(
        summary,
        reference=reference,
        min_rouge=0.5,
        min_length=10,
    )
Available Parameters
| Parameter | Description |
|---|---|
| reference | Reference text for comparison metrics |
| min_bleu | Minimum BLEU-4 score (0.0-1.0) |
| min_rouge | Minimum ROUGE-L F1 score (0.0-1.0) |
| min_semantic | Minimum semantic similarity (0.0-1.0) |
| min_length | Minimum character count |
| max_length | Maximum character count |
| max_reading_grade | Maximum Flesch-Kincaid grade level |
| must_contain | List of required patterns |
| must_exclude | List of forbidden patterns |
Benchmarking
Track text quality over time and detect regressions.
Running Benchmarks
from veritext.benchmark import Benchmark
# Create a benchmark suite
bench = Benchmark("summariser_quality", storage_path="benchmarks/")
# Evaluate a batch of outputs
candidates = ["Summary 1...", "Summary 2...", "Summary 3..."]
references = ["Reference 1...", "Reference 2...", "Reference 3..."]
run = bench.evaluate(
    candidates=candidates,
    references=references,
    metrics=["rouge_l", "bleu4"],
    metadata={"model": "v1.2", "git_sha": "abc123"},
)
print(f"ROUGE-L: {run.metrics['rouge_l']:.4f}")
print(f"BLEU-4: {run.metrics['bleu4']:.4f}")
Regression Detection
from veritext.benchmark import Benchmark
from veritext.core.exceptions import RegressionDetectedError
bench = Benchmark("summariser_quality")
# Check for regression against historical baseline
report = bench.check_regression(tolerance=0.05, window=10)
if report.detected:
    print("Quality regression detected!")
    for metric, delta in report.deltas.items():
        print(f"  {metric}: {delta:+.4f}")
# Or raise an exception for CI integration
try:
    bench.assert_no_regression(tolerance=0.05)
except RegressionDetectedError as e:
    print(f"CI failure: {e}")
    raise SystemExit(1)  # Non-zero exit code fails the CI job
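The README does not spell out how `tolerance` and `window` are interpreted. One plausible reading, sketched below as a standalone function, is that the latest score is compared against the mean of the last `window` historical runs, and a regression is flagged when it falls more than `tolerance` below that baseline. Both the mean baseline and the absolute (rather than relative) tolerance are assumptions, and `detect_regression` is a hypothetical name.

```python
from statistics import mean


def detect_regression(
    history: list[float],
    latest: float,
    tolerance: float = 0.05,
    window: int = 10,
) -> tuple[bool, float]:
    """Return (detected, delta) for `latest` against a rolling baseline.

    Assumed rule: baseline is the mean of the last `window` scores in
    `history`; a regression is a drop larger than `tolerance`.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    recent = history[-window:]
    if not recent:
        return False, 0.0  # Nothing to compare against yet.
    delta = latest - mean(recent)
    return delta < -tolerance, delta


rouge_history = [0.60, 0.62, 0.61]  # Hypothetical past ROUGE-L scores.
for latest in (0.59, 0.50):
    detected, delta = detect_regression(rouge_history, latest)
    print(f"latest={latest:.2f} delta={delta:+.4f} regression={detected}")
```

Averaging over a window keeps a single unusually good or bad historical run from dominating the baseline, while the tolerance absorbs ordinary run-to-run noise.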
Viewing History
from veritext.benchmark import Benchmark
bench = Benchmark("summariser_quality")
for run in bench.get_history(limit=10):
    print(f"{run.timestamp}: rouge_l={run.metrics.get('rouge_l', 0):.4f}")
CLI
Veritext provides a command-line interface for validation and benchmarking.
Validate Text
# Inline validation
veritext validate "Candidate text" -r "Reference text" -m bleu,rouge
# File-based batch validation (JSONL with "candidate" and "reference" fields)
veritext validate -f outputs.jsonl -m bleu,rouge,lexical
# With threshold for pass/fail
veritext validate "Text" -r "Reference" -m bleu -t 0.5 -o simple
# Output formats: table (default), json, simple
veritext validate "Text" -r "Reference" -m bleu -o json
Benchmark Commands
# Run a benchmark evaluation
veritext benchmark run my_bench -f data.jsonl -m rouge_l,bleu4
# View benchmark history
veritext benchmark show my_bench --last 10
# Check for regression (exits with code 1 if detected)
veritext benchmark check my_bench --tolerance 0.05 --window 10
JSONL Format
For file-based operations, use JSONL with candidate and reference fields:
{"candidate": "Model output 1", "reference": "Expected output 1"}
{"candidate": "Model output 2", "reference": "Expected output 2"}
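If you generate these files yourself, it can help to check them before handing them to the CLI. The sketch below parses JSONL with the standard library and verifies that each record has both fields. The `parse_jsonl` helper is hypothetical and is not part of Veritext.

```python
import json
from typing import Iterable


def parse_jsonl(lines: Iterable[str]) -> list[tuple[str, str]]:
    """Parse JSONL lines into (candidate, reference) pairs.

    Blank lines are skipped; a record missing either field raises
    ValueError with the offending line number.
    """
    pairs = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        missing = {"candidate", "reference"} - set(record)
        if missing:
            raise ValueError(f"line {line_number}: missing field(s) {sorted(missing)}")
        pairs.append((record["candidate"], record["reference"]))
    return pairs


sample = [
    '{"candidate": "Model output 1", "reference": "Expected output 1"}',
    '{"candidate": "Model output 2", "reference": "Expected output 2"}',
]
for candidate, reference in parse_jsonl(sample):
    print(f"{candidate!r} vs {reference!r}")
```

An open file object is also an iterable of lines, so the same function works with `parse_jsonl(open(path, encoding="utf-8"))`.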
Configuration
Veritext uses environment variables for configuration:
| Variable | Default | Description |
|---|---|---|
| VERITEXT_LOG_LEVEL | INFO | Logging level |
| VERITEXT_LOG_FORMAT | console | Log format (console or json) |
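As an illustration of how such variables are typically consumed, the sketch below reads both settings with the defaults from the table and rejects an unknown log format. `EnvSettings` and `load_settings` are hypothetical names for this example only; they do not describe Veritext's internal settings class, and the case normalisation is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class EnvSettings:
    log_level: str = "INFO"
    log_format: str = "console"


def load_settings(environ: Optional[Mapping[str, str]] = None) -> EnvSettings:
    """Build settings from environment variables, falling back to defaults."""
    env = os.environ if environ is None else environ
    log_level = env.get("VERITEXT_LOG_LEVEL", "INFO").upper()
    log_format = env.get("VERITEXT_LOG_FORMAT", "console").lower()
    if log_format not in {"console", "json"}:
        raise ValueError(f"unsupported log format: {log_format!r}")
    return EnvSettings(log_level=log_level, log_format=log_format)


# Passing a plain dict keeps the example independent of the real environment.
print(load_settings({}))
print(load_settings({"VERITEXT_LOG_LEVEL": "debug", "VERITEXT_LOG_FORMAT": "json"}))
```

In a shell, the equivalent is to export the variables before running the CLI or your test suite.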
Development
Setup
git clone https://gitea.kschappell.com/kschappell/veritext.git
cd veritext
uv sync --all-extras
Quality Checks
# Linting
uv run ruff check .
# Formatting
uv run ruff format --check .
# Type checking
uv run mypy src/
# Tests
uv run pytest
Running Examples
uv run python examples/basic_validation.py
uv run pytest examples/chatbot_testing.py -v
uv run python examples/benchmark_regression.py
Licence
MIT