diff --git a/readme.md b/readme.md
index 64a8b49..ea7970c 100644
--- a/readme.md
+++ b/readme.md
@@ -2,48 +2,398 @@
 
 Semantic text validation framework for Python.
 
-Validates text outputs against quality criteria using metrics like BLEU, ROUGE,
-and semantic similarity. Designed for developers building systems that produce
-text (chatbots, content generators, summarisation tools) who need automated
-quality assurance beyond simple string matching.
+Veritext validates text outputs against quality criteria using metrics like BLEU,
+ROUGE, and semantic similarity. Designed for developers building systems that produce
+text (chatbots, content generators, summarisation tools) who need automated quality
+assurance beyond simple string matching.
 
-## Status
+## Features
 
-Under active development. See [changelog.md](changelog.md) for progress.
+- **Multiple metrics** — BLEU, ROUGE, lexical similarity, readability, semantic
+  embeddings
+- **Composable validators** — Build complex checks from simple primitives
+- **Native pytest integration** — `validate_text()` assertion for test suites
+- **Quality benchmarking** — Track metrics over time with regression detection
+- **CLI tools** — Command-line validation and benchmark management
 
 ## Installation
 
 ```bash
 pip install veritext
 
-# With semantic similarity support
+# With semantic similarity support (sentence-transformers)
 pip install veritext[semantic]
 ```
 
 ## Quick Start
 
 ```python
-from veritext import validators as v
 from veritext.core.types import ValidationContext
+from veritext.validators import all_of, bleu, length, rouge
 
-# Create validators
-validator = v.all_of([
-    v.bleu(min_score=0.7),
-    v.length(max_chars=500),
+# Create a validator
+validator = all_of([
+    bleu(min_score=0.5),
+    rouge(min_score=0.6),
+    length(max_chars=500),
 ])
 
 # Validate text
-context = ValidationContext(reference="The cat sat on the mat.")
-result = validator.check("A cat is sitting on the mat.", context)
+context = ValidationContext(reference="The quick brown fox jumps over the lazy dog.")
+result = validator.check("A fast brown fox leaps over a sleepy dog.", context)
 
-if not result.passed:
+if result.passed:
+    print("Validation passed!")
+else:
     print(result.failure_summary)
 ```
 
-## Documentation
+## Metrics
 
-- [Project Plan](docs/project-plan.md)
-- [Implementation Plan](docs/implementation-plan.md)
+Veritext provides several metrics for text evaluation.
+
+### BLEU
+
+Measures n-gram precision against reference text. Useful for translation and
+generation quality.
+
+```python
+from veritext.metrics import Bleu
+
+bleu = Bleu()
+result = bleu.score(
+    candidate="The cat sat on the mat.",
+    reference="A cat is sitting on the mat.",
+)
+print(f"BLEU-4: {result.bleu4:.3f}")  # Uses 1-4 gram precision
+print(f"BLEU-1: {result.bleu1:.3f}")  # Unigram precision only
+```
+
+### ROUGE
+
+Measures recall-oriented overlap with reference text. Useful for summarisation.
+
+```python
+from veritext.metrics import Rouge
+
+rouge = Rouge()
+result = rouge.score(
+    candidate="Scientists found a new planet.",
+    reference="Researchers discovered a new planet in the solar system.",
+)
+print(f"ROUGE-1 F1: {result.rouge1.fmeasure:.3f}")  # Unigram overlap
+print(f"ROUGE-L F1: {result.rouge_l.fmeasure:.3f}")  # Longest common subsequence
+```
+
+### Lexical Similarity
+
+Measures token overlap using Jaccard similarity.
+
+```python
+from veritext.metrics import Lexical
+
+lexical = Lexical()
+result = lexical.score(
+    candidate="The quick brown fox",
+    reference="The fast brown fox",
+)
+print(f"Jaccard: {result.jaccard:.3f}")
+print(f"Token overlap: {result.token_overlap:.3f}")
+```
+
+### Readability
+
+Computes Flesch-Kincaid scores for text complexity.
+
+```python
+from veritext.metrics import Readability
+
+readability = Readability()
+result = readability.score("This is a simple sentence.")
+print(f"Grade level: {result.grade_level:.1f}")
+print(f"Reading ease: {result.reading_ease:.1f}")
+```
+
+### Semantic Similarity (Optional)
+
+Requires `pip install veritext[semantic]`.
+
+```python
+from veritext.semantic import SemanticSimilarity
+
+semantic = SemanticSimilarity()
+result = semantic.score(
+    candidate="The dog is running in the park.",
+    reference="A canine is jogging through the garden.",
+)
+print(f"Similarity: {result.score:.3f}")
+```
+
+## Validators
+
+Validators wrap metrics with thresholds to make pass/fail decisions.
+
+### Metric-Based Validators
+
+```python
+from veritext.core.types import ValidationContext
+from veritext.validators import bleu, lexical, rouge
+
+context = ValidationContext(reference="Reference text here.")
+
+# BLEU validation
+validator = bleu(min_score=0.5, variant=4)  # BLEU-4
+result = validator.check("Candidate text here.", context)
+
+# ROUGE validation
+validator = rouge(min_score=0.6, variant="l")  # ROUGE-L
+result = validator.check("Candidate text here.", context)
+
+# Lexical validation
+validator = lexical(min_jaccard=0.3, min_overlap=0.5)
+result = validator.check("Candidate text here.", context)
+```
+
+### Constraint Validators
+
+These don't require reference text.
+
+```python
+from veritext.core.types import ValidationContext
+from veritext.validators import contains, excludes, length, readability
+
+context = ValidationContext()  # No reference needed
+
+# Length constraints
+validator = length(min_chars=50, max_chars=500, min_words=10)
+result = validator.check("Your text here...", context)
+
+# Readability constraints
+validator = readability(max_grade=8.0, min_ease=60.0)
+result = validator.check("Your text here...", context)
+
+# Content requirements
+validator = contains(patterns=["important", "keyword"])
+result = validator.check("This important text has a keyword.", context)
+
+# Content exclusions
+validator = excludes(patterns=["forbidden", "banned"])
+result = validator.check("This text is clean.", context)
+```
+
+### Composite Validators
+
+Combine multiple checks with logical operators.
+
+```python
+from veritext.validators import all_of, any_of, bleu, length, rouge
+
+# All checks must pass
+validator = all_of([
+    bleu(min_score=0.5),
+    rouge(min_score=0.6),
+    length(max_chars=500),
+])
+
+# At least one check must pass
+validator = any_of([
+    bleu(min_score=0.7),
+    rouge(min_score=0.7),
+])
+```
+
+## Pytest Plugin
+
+Veritext provides native pytest integration for testing text quality.
+
+### Basic Usage
+
+```python
+from veritext.pytest_plugin import validate_text
+
+
+def test_response_quality():
+    response = "This is a helpful response to your question."
+
+    validate_text(
+        response,
+        min_length=20,
+        max_length=200,
+        max_reading_grade=10.0,
+        must_contain=["helpful"],
+        must_exclude=["error", "sorry"],
+    )
+
+
+def test_summary_similarity():
+    summary = "Scientists discovered a new planet."
+    reference = "Researchers found a new planet in our solar system."
+
+    validate_text(
+        summary,
+        reference=reference,
+        min_rouge=0.5,
+        min_length=10,
+    )
+```
+
+### Available Parameters
+
+| Parameter | Description |
+|-----------|-------------|
+| `reference` | Reference text for comparison metrics |
+| `min_bleu` | Minimum BLEU-4 score (0.0-1.0) |
+| `min_rouge` | Minimum ROUGE-L F1 score (0.0-1.0) |
+| `min_semantic` | Minimum semantic similarity (0.0-1.0) |
+| `min_length` | Minimum character count |
+| `max_length` | Maximum character count |
+| `max_reading_grade` | Maximum Flesch-Kincaid grade level |
+| `must_contain` | List of required patterns |
+| `must_exclude` | List of forbidden patterns |
+
+## Benchmarking
+
+Track text quality over time and detect regressions.
+
+### Running Benchmarks
+
+```python
+from veritext.benchmark import Benchmark
+
+# Create a benchmark suite
+bench = Benchmark("summariser_quality", storage_path="benchmarks/")
+
+# Evaluate a batch of outputs
+candidates = ["Summary 1...", "Summary 2...", "Summary 3..."]
+references = ["Reference 1...", "Reference 2...", "Reference 3..."]
+
+run = bench.evaluate(
+    candidates=candidates,
+    references=references,
+    metrics=["rouge_l", "bleu4"],
+    metadata={"model": "v1.2", "git_sha": "abc123"},
+)
+
+print(f"ROUGE-L: {run.metrics['rouge_l']:.4f}")
+print(f"BLEU-4: {run.metrics['bleu4']:.4f}")
+```
+
+### Regression Detection
+
+```python
+from veritext.benchmark import Benchmark
+from veritext.core.exceptions import RegressionDetectedError
+
+bench = Benchmark("summariser_quality")
+
+# Check for regression against historical baseline
+report = bench.check_regression(tolerance=0.05, window=10)
+if report.detected:
+    print("Quality regression detected!")
+    for metric, delta in report.deltas.items():
+        print(f" {metric}: {delta:+.4f}")
+
+# Or raise an exception for CI integration
+try:
+    bench.assert_no_regression(tolerance=0.05)
+except RegressionDetectedError as e:
+    print(f"CI failure: {e}")
+    exit(1)
+```
+
+### Viewing History
+
+```python
+bench = Benchmark("summariser_quality")
+
+for run in bench.get_history(limit=10):
+    print(f"{run.timestamp}: rouge_l={run.metrics.get('rouge_l', 0):.4f}")
+```
+
+## CLI
+
+Veritext provides a command-line interface for validation and benchmarking.
+
+### Validate Text
+
+```bash
+# Inline validation
+veritext validate "Candidate text" -r "Reference text" -m bleu,rouge
+
+# File-based batch validation (JSONL with "candidate" and "reference" fields)
+veritext validate -f outputs.jsonl -m bleu,rouge,lexical
+
+# With threshold for pass/fail
+veritext validate "Text" -r "Reference" -m bleu -t 0.5 -o simple
+
+# Output formats: table (default), json, simple
+veritext validate "Text" -r "Reference" -m bleu -o json
+```
+
+### Benchmark Commands
+
+```bash
+# Run a benchmark evaluation
+veritext benchmark run my_bench -f data.jsonl -m rouge_l,bleu4
+
+# View benchmark history
+veritext benchmark show my_bench --last 10
+
+# Check for regression (exits with code 1 if detected)
+veritext benchmark check my_bench --tolerance 0.05 --window 10
+```
+
+### JSONL Format
+
+For file-based operations, use JSONL with `candidate` and `reference` fields:
+
+```json
+{"candidate": "Model output 1", "reference": "Expected output 1"}
+{"candidate": "Model output 2", "reference": "Expected output 2"}
+```
+
+## Configuration
+
+Veritext uses environment variables for configuration:
+
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `VERITEXT_LOG_LEVEL` | `INFO` | Logging level |
+| `VERITEXT_LOG_FORMAT` | `console` | Log format (`console` or `json`) |
+
+## Development
+
+### Setup
+
+```bash
+git clone https://gitea.kschappell.com/kschappell/veritext.git
+cd veritext
+uv sync --all-extras
+```
+
+### Quality Checks
+
+```bash
+# Linting
+uv run ruff check .
+
+# Formatting
+uv run ruff format --check .
+
+# Type checking
+uv run mypy src/
+
+# Tests
+uv run pytest
+```
+
+### Running Examples
+
+```bash
+uv run python examples/basic_validation.py
+uv run pytest examples/chatbot_testing.py -v
+uv run python examples/benchmark_regression.py
+```
 
 ## Licence