# Veritext

Semantic text validation framework for Python.

Veritext validates text outputs against quality criteria using metrics like BLEU,
ROUGE, and semantic similarity. It is designed for developers who build systems that
produce text (chatbots, content generators, summarisation tools) and need automated
quality assurance beyond simple string matching.

## Features

- **Multiple metrics**: BLEU, ROUGE, lexical similarity, readability, semantic
  embeddings
- **Composable validators**: Build complex checks from simple primitives
- **Native pytest integration**: `validate_text()` assertion for test suites
- **Quality benchmarking**: Track metrics over time with regression detection
- **CLI tools**: Command-line validation and benchmark management

## Installation

```bash
pip install veritext

# With semantic similarity support (sentence-transformers)
# Quoting stops shells such as zsh from treating the brackets as a glob pattern
pip install "veritext[semantic]"
```

## Quick Start

```python
from veritext.core.types import ValidationContext
from veritext.validators import all_of, bleu, length, rouge

# Create a validator
validator = all_of([
    bleu(min_score=0.5),
    rouge(min_score=0.6),
    length(max_chars=500),
])

# Validate text
context = ValidationContext(reference="The quick brown fox jumps over the lazy dog.")
result = validator.check("A fast brown fox leaps over a sleepy dog.", context)

if result.passed:
    print("Validation passed!")
else:
    print(result.failure_summary)
```

## Metrics

Veritext provides several metrics for text evaluation.

### BLEU

Measures n-gram precision against reference text. Useful for translation and
generation quality.

```python
from veritext.metrics import Bleu

bleu = Bleu()
result = bleu.score(
    candidate="The cat sat on the mat.",
    reference="A cat is sitting on the mat.",
)
print(f"BLEU-4: {result.bleu4:.3f}")  # Uses 1-4 gram precision
print(f"BLEU-1: {result.bleu1:.3f}")  # Unigram precision only
```
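
To make the BLEU-1 and BLEU-4 figures easier to interpret, here is a minimal, self-contained sketch of sentence-level BLEU with clipped n-gram precision and a brevity penalty. It is not Veritext's implementation: the function names (`ngram_counts`, `modified_precision`, `sentence_bleu`), the whitespace tokenisation, and the absence of smoothing are assumptions made for illustration.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped precision: a candidate n-gram is credited at most as many
    times as it occurs in the reference."""
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def sentence_bleu(candidate, reference, max_n=4):
    """Unsmoothed BLEU against a single reference (hypothetical helper)."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # one zero precision zeroes the geometric mean
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # The brevity penalty lowers the score of candidates shorter than the reference.
    if len(candidate) > len(reference):
        penalty = 1.0
    else:
        penalty = math.exp(1 - len(reference) / len(candidate))
    return penalty * math.exp(log_mean)


cand = "the cat sat on the mat".split()
ref = "a cat is sitting on the mat".split()
print(f"BLEU-1: {sentence_bleu(cand, ref, max_n=1):.3f}")
print(f"BLEU-4: {sentence_bleu(cand, ref, max_n=4):.3f}")
```

Without smoothing, a candidate that shares no 4-gram with the reference scores zero on BLEU-4, so short sentences are often better judged with a lower-order variant such as BLEU-1.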

### ROUGE

Measures recall-oriented overlap with reference text. Useful for summarisation.

```python
from veritext.metrics import Rouge

rouge = Rouge()
result = rouge.score(
    candidate="Scientists found a new planet.",
    reference="Researchers discovered a new planet in the solar system.",
)
print(f"ROUGE-1 F1: {result.rouge1.fmeasure:.3f}")  # Unigram overlap
print(f"ROUGE-L F1: {result.rouge_l.fmeasure:.3f}")  # Longest common subsequence
```
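
The ROUGE-L score is built on the longest common subsequence (LCS) between candidate and reference. The sketch below shows that calculation in plain Python. The helper names (`lcs_length`, `rouge_l_scores`) and the lowercase whitespace tokenisation are assumptions for illustration; Veritext's own tokenisation and scoring may differ.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_scores(candidate, reference):
    """Return (precision, recall, F1) derived from the LCS length."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


cand = "scientists found a new planet".split()
ref = "researchers discovered a new planet in the solar system".split()
precision, recall, f1 = rouge_l_scores(cand, ref)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Here the LCS is "a new planet" (three tokens), so precision is 3/5 and recall is 3/9. Recall is low because the reference contains detail that the summary omits.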

### Lexical Similarity

Measures token overlap using Jaccard similarity.

```python
from veritext.metrics import Lexical

lexical = Lexical()
result = lexical.score(
    candidate="The quick brown fox",
    reference="The fast brown fox",
)
print(f"Jaccard: {result.jaccard:.3f}")
print(f"Token overlap: {result.token_overlap:.3f}")
```
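
Jaccard similarity is the size of the intersection of two token sets divided by the size of their union. A minimal sketch follows, assuming lowercase whitespace tokenisation. The function name `jaccard_similarity` is hypothetical, and the definition of `token_overlap` is not covered because the source does not specify it.

```python
def jaccard_similarity(candidate, reference):
    """Jaccard similarity between the token sets of two strings."""
    cand = set(candidate.lower().split())
    ref = set(reference.lower().split())
    if not cand and not ref:
        return 1.0  # two empty texts are treated as identical
    return len(cand & ref) / len(cand | ref)


score = jaccard_similarity("The quick brown fox", "The fast brown fox")
print(f"Jaccard: {score:.3f}")
```

The two sentences share three tokens out of five distinct tokens overall, giving 3/5.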

### Readability

Computes Flesch-Kincaid scores for text complexity.

```python
from veritext.metrics import Readability

readability = Readability()
result = readability.score("This is a simple sentence.")
print(f"Grade level: {result.grade_level:.1f}")
print(f"Reading ease: {result.reading_ease:.1f}")
```
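
For reference, the standard Flesch Reading Ease and Flesch-Kincaid Grade Level formulas can be written as below. This sketch takes word, sentence, and syllable counts as inputs, because syllable counting is the error-prone step and the source does not say how Veritext performs it. The function names are hypothetical, and mapping them to `reading_ease` and `grade_level` is an assumption.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Higher values indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Approximate US school grade needed to read the text."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# "This is a simple sentence." has 5 words, 1 sentence, and 7 syllables
# (this, is, a, sim-ple, sen-tence), counted by hand.
ease = flesch_reading_ease(words=5, sentences=1, syllables=7)
grade = flesch_kincaid_grade(words=5, sentences=1, syllables=7)
print(f"Reading ease: {ease:.2f}")
print(f"Grade level: {grade:.2f}")
```

Longer sentences and more syllables per word both push the grade level up and the reading ease down.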

### Semantic Similarity (Optional)

Requires the `semantic` extra: `pip install "veritext[semantic]"`.

```python
from veritext.semantic import SemanticSimilarity

semantic = SemanticSimilarity()
result = semantic.score(
    candidate="The dog is running in the park.",
    reference="A canine is jogging through the garden.",
)
print(f"Similarity: {result.score:.3f}")
```
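
Embedding-based similarity is commonly computed as the cosine of the angle between two sentence vectors. Whether Veritext uses cosine similarity is an assumption; the sketch below uses made-up three-dimensional vectors rather than real sentence-transformers output, and `cosine_similarity` is a hypothetical helper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # similarity is undefined for a zero vector; report 0
    return dot / (norm_u * norm_v)


# Toy embeddings: the first two point in similar directions, the third does not.
dog_running = [0.9, 0.1, 0.3]
canine_jogging = [0.8, 0.2, 0.4]
stock_report = [-0.1, 0.9, -0.2]

print(f"related:   {cosine_similarity(dog_running, canine_jogging):.3f}")
print(f"unrelated: {cosine_similarity(dog_running, stock_report):.3f}")
```

Unlike n-gram metrics, this comparison can rate paraphrases highly even when they share few surface tokens, provided the embedding model places them close together.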

## Validators

Validators wrap metrics with thresholds to make pass/fail decisions.

### Metric-Based Validators

```python
from veritext.core.types import ValidationContext
from veritext.validators import bleu, lexical, rouge

context = ValidationContext(reference="Reference text here.")

# BLEU validation
validator = bleu(min_score=0.5, variant=4)  # BLEU-4
result = validator.check("Candidate text here.", context)

# ROUGE validation
validator = rouge(min_score=0.6, variant="l")  # ROUGE-L
result = validator.check("Candidate text here.", context)

# Lexical validation
validator = lexical(min_jaccard=0.3, min_overlap=0.5)
result = validator.check("Candidate text here.", context)
```

### Constraint Validators

These don't require reference text.

```python
from veritext.core.types import ValidationContext
from veritext.validators import contains, excludes, length, readability

context = ValidationContext()  # No reference needed

# Length constraints
validator = length(min_chars=50, max_chars=500, min_words=10)
result = validator.check("Your text here...", context)

# Readability constraints
validator = readability(max_grade=8.0, min_ease=60.0)
result = validator.check("Your text here...", context)

# Content requirements
validator = contains(patterns=["important", "keyword"])
result = validator.check("This important text has a keyword.", context)

# Content exclusions
validator = excludes(patterns=["forbidden", "banned"])
result = validator.check("This text is clean.", context)
```

### Composite Validators

Combine multiple checks with logical operators.

```python
from veritext.validators import all_of, any_of, bleu, length, rouge

# All checks must pass
validator = all_of([
    bleu(min_score=0.5),
    rouge(min_score=0.6),
    length(max_chars=500),
])

# At least one check must pass
validator = any_of([
    bleu(min_score=0.7),
    rouge(min_score=0.7),
])
```
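
The composition pattern itself is small enough to sketch with plain functions. The example below is not Veritext's code: `Check`, `max_chars`, `contains_word`, `all_checks`, and `any_check` are hypothetical stand-ins that return a bare boolean instead of a result object, and they only illustrate how combinators can nest.

```python
from typing import Callable, Sequence

# A check is any function that maps text to pass/fail.
Check = Callable[[str], bool]


def max_chars(limit: int) -> Check:
    return lambda text: len(text) <= limit


def contains_word(word: str) -> Check:
    return lambda text: word.lower() in text.lower()


def all_checks(checks: Sequence[Check]) -> Check:
    """Pass only if every check passes."""
    return lambda text: all(check(text) for check in checks)


def any_check(checks: Sequence[Check]) -> Check:
    """Pass if at least one check passes."""
    return lambda text: any(check(text) for check in checks)


# Combinators return checks, so they can be nested freely.
validator = all_checks([
    max_chars(40),
    any_check([contains_word("helpful"), contains_word("useful")]),
])

print(validator("A helpful answer."))
print(validator("An answer."))
```

Because every combinator returns the same kind of object it accepts, arbitrarily deep rules can be built from a handful of primitives.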

## Pytest Plugin

Veritext provides native pytest integration for testing text quality.

### Basic Usage

```python
from veritext.pytest_plugin import validate_text


def test_response_quality():
    response = "This is a helpful response to your question."

    validate_text(
        response,
        min_length=20,
        max_length=200,
        max_reading_grade=10.0,
        must_contain=["helpful"],
        must_exclude=["error", "sorry"],
    )


def test_summary_similarity():
    summary = "Scientists discovered a new planet."
    reference = "Researchers found a new planet in our solar system."

    validate_text(
        summary,
        reference=reference,
        min_rouge=0.5,
        min_length=10,
    )
```

### Available Parameters

| Parameter | Description |
|-----------|-------------|
| `reference` | Reference text for comparison metrics |
| `min_bleu` | Minimum BLEU-4 score (0.0-1.0) |
| `min_rouge` | Minimum ROUGE-L F1 score (0.0-1.0) |
| `min_semantic` | Minimum semantic similarity (0.0-1.0) |
| `min_length` | Minimum character count |
| `max_length` | Maximum character count |
| `max_reading_grade` | Maximum Flesch-Kincaid grade level |
| `must_contain` | List of required patterns |
| `must_exclude` | List of forbidden patterns |

## Benchmarking

Track text quality over time and detect regressions.

### Running Benchmarks

```python
from veritext.benchmark import Benchmark

# Create a benchmark suite
bench = Benchmark("summariser_quality", storage_path="benchmarks/")

# Evaluate a batch of outputs
candidates = ["Summary 1...", "Summary 2...", "Summary 3..."]
references = ["Reference 1...", "Reference 2...", "Reference 3..."]

run = bench.evaluate(
    candidates=candidates,
    references=references,
    metrics=["rouge_l", "bleu4"],
    metadata={"model": "v1.2", "git_sha": "abc123"},
)

print(f"ROUGE-L: {run.metrics['rouge_l']:.4f}")
print(f"BLEU-4: {run.metrics['bleu4']:.4f}")
```

### Regression Detection

```python
from veritext.benchmark import Benchmark
from veritext.core.exceptions import RegressionDetectedError

bench = Benchmark("summariser_quality")

# Check for regression against historical baseline
report = bench.check_regression(tolerance=0.05, window=10)
if report.detected:
    print("Quality regression detected!")
    for metric, delta in report.deltas.items():
        print(f"  {metric}: {delta:+.4f}")

# Or raise an exception for CI integration
try:
    bench.assert_no_regression(tolerance=0.05)
except RegressionDetectedError as e:
    print(f"CI failure: {e}")
    raise SystemExit(1)  # Non-zero exit code fails the CI job
```
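
One plausible reading of `tolerance` and `window` is sketched below: the latest run is compared with the mean of the preceding `window` runs, and a metric is flagged when it drops by more than `tolerance`. This is an assumption about the semantics, not Veritext's documented behaviour (the baseline could equally be a median or a best run, and the tolerance could be relative). `find_regressions` is a hypothetical helper.

```python
from statistics import mean


def find_regressions(history, tolerance=0.05, window=10):
    """Compare the newest run with the mean of the previous `window` runs.

    `history` is a list of {metric_name: value} dicts, oldest first.
    Returns {metric_name: delta} for metrics that fell by more than `tolerance`.
    """
    if len(history) < 2:
        return {}  # nothing to compare against
    *previous, latest = history
    baseline_runs = previous[-window:]
    regressions = {}
    for metric, value in latest.items():
        past = [run[metric] for run in baseline_runs if metric in run]
        if not past:
            continue  # metric has no history yet
        delta = value - mean(past)
        if delta < -tolerance:
            regressions[metric] = delta
    return regressions


history = [
    {"rouge_l": 0.60, "bleu4": 0.30},
    {"rouge_l": 0.62, "bleu4": 0.31},
    {"rouge_l": 0.50, "bleu4": 0.30},  # latest run
]
for metric, delta in find_regressions(history).items():
    print(f"{metric}: {delta:+.4f}")
```

In this data only `rouge_l` is flagged: it falls about 0.11 below its baseline of 0.61, while `bleu4` stays within the 0.05 tolerance.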

### Viewing History

```python
from veritext.benchmark import Benchmark

bench = Benchmark("summariser_quality")

for run in bench.get_history(limit=10):
    print(f"{run.timestamp}: rouge_l={run.metrics.get('rouge_l', 0):.4f}")
```

## CLI

Veritext provides a command-line interface for validation and benchmarking.

### Validate Text

```bash
# Inline validation
veritext validate "Candidate text" -r "Reference text" -m bleu,rouge

# File-based batch validation (JSONL with "candidate" and "reference" fields)
veritext validate -f outputs.jsonl -m bleu,rouge,lexical

# With threshold for pass/fail
veritext validate "Text" -r "Reference" -m bleu -t 0.5 -o simple

# Output formats: table (default), json, simple
veritext validate "Text" -r "Reference" -m bleu -o json
```

### Benchmark Commands

```bash
# Run a benchmark evaluation
veritext benchmark run my_bench -f data.jsonl -m rouge_l,bleu4

# View benchmark history
veritext benchmark show my_bench --last 10

# Check for regression (exits with code 1 if detected)
veritext benchmark check my_bench --tolerance 0.05 --window 10
```

### JSONL Format

For file-based operations, use JSONL with `candidate` and `reference` fields:

```json
{"candidate": "Model output 1", "reference": "Expected output 1"}
{"candidate": "Model output 2", "reference": "Expected output 2"}
```
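
A file in this format can be produced or checked with the standard library alone. The sketch below parses JSONL lines into `(candidate, reference)` pairs and reports the line number of any record that lacks a required field. `parse_jsonl_pairs` is a hypothetical helper, not part of Veritext.

```python
import json


def parse_jsonl_pairs(lines):
    """Turn JSONL lines into (candidate, reference) tuples.

    `lines` can be any iterable of strings, including an open file.
    Blank lines are skipped; a missing field raises ValueError.
    """
    pairs = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        try:
            pairs.append((record["candidate"], record["reference"]))
        except KeyError as exc:
            raise ValueError(f"line {number}: missing field {exc}") from None
    return pairs


sample = [
    '{"candidate": "Model output 1", "reference": "Expected output 1"}',
    "",
    '{"candidate": "Model output 2", "reference": "Expected output 2"}',
]
for candidate, reference in parse_jsonl_pairs(sample):
    print(f"{candidate!r} vs {reference!r}")
```

To read from disk, pass an open file object, for example `parse_jsonl_pairs(open("outputs.jsonl", encoding="utf-8"))`.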

## Configuration

Veritext uses environment variables for configuration:

| Variable | Default | Description |
|----------|---------|-------------|
| `VERITEXT_LOG_LEVEL` | `INFO` | Logging level |
| `VERITEXT_LOG_FORMAT` | `console` | Log format (`console` or `json`) |
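
The defaults in the table can be read as a simple lookup with fallbacks. The following sketch shows how such settings might be resolved; `resolve_log_settings` is a hypothetical function, and rejecting unknown formats is an assumption rather than documented Veritext behaviour.

```python
import os


def resolve_log_settings(environ=None):
    """Return (level, format), falling back to the documented defaults."""
    env = os.environ if environ is None else environ
    level = env.get("VERITEXT_LOG_LEVEL", "INFO")
    log_format = env.get("VERITEXT_LOG_FORMAT", "console")
    if log_format not in {"console", "json"}:
        raise ValueError(f"unsupported log format: {log_format!r}")
    return level, log_format


# Explicit mappings stand in for the real environment in this demo.
print(resolve_log_settings({}))
print(resolve_log_settings({"VERITEXT_LOG_LEVEL": "DEBUG",
                            "VERITEXT_LOG_FORMAT": "json"}))
```

Passing a mapping instead of reading `os.environ` directly keeps the function easy to test.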

## Development

### Setup

```bash
git clone https://gitea.kschappell.com/kschappell/veritext.git
cd veritext
uv sync --all-extras
```

### Quality Checks

```bash
# Linting
uv run ruff check .

# Formatting
uv run ruff format --check .

# Type checking
uv run mypy src/

# Tests
uv run pytest
```

### Running Examples

```bash
uv run python examples/basic_validation.py
uv run pytest examples/chatbot_testing.py -v
uv run python examples/benchmark_regression.py
```

## Licence

MIT