docs(readme): comprehensive documentation

Expands readme with detailed coverage of metrics, validators, pytest plugin, benchmark module, CLI commands, and development setup.
2026-02-03 19:16:14 +00:00
parent 93515707cc
commit 13c869f5d6
1 changed file with 368 additions and 18 deletions
@@ -2,48 +2,398 @@

 Semantic text validation framework for Python.

-Validates text outputs against quality criteria using metrics like BLEU, ROUGE,
-and semantic similarity. Designed for developers building systems that produce
-text (chatbots, content generators, summarisation tools) who need automated
-quality assurance beyond simple string matching.
+Veritext validates text outputs against quality criteria using metrics like BLEU,
+ROUGE, and semantic similarity. It is designed for developers building systems that
+produce text (chatbots, content generators, summarisation tools) and who need automated
+quality assurance beyond simple string matching.

-## Status
+## Features

-Under active development. See [changelog.md](changelog.md) for progress.
+- **Multiple metrics** — BLEU, ROUGE, lexical similarity, readability, semantic
+  embeddings
+- **Composable validators** — Build complex checks from simple primitives
+- **Native pytest integration** — `validate_text()` assertion for test suites
+- **Quality benchmarking** — Track metrics over time with regression detection
+- **CLI tools** — Command-line validation and benchmark management

 ## Installation

 ```bash
 pip install veritext

-# With semantic similarity support
+# With semantic similarity support (sentence-transformers)
 pip install veritext[semantic]
 ```

 ## Quick Start

 ```python
-from veritext import validators as v
 from veritext.core.types import ValidationContext
+from veritext.validators import all_of, bleu, length, rouge

-# Create validators
-validator = v.all_of([
-    v.bleu(min_score=0.7),
-    v.length(max_chars=500),
+# Create a validator
+validator = all_of([
+    bleu(min_score=0.5),
+    rouge(min_score=0.6),
+    length(max_chars=500),
 ])

 # Validate text
-context = ValidationContext(reference="The cat sat on the mat.")
-result = validator.check("A cat is sitting on the mat.", context)
+context = ValidationContext(reference="The quick brown fox jumps over the lazy dog.")
+result = validator.check("A fast brown fox leaps over a sleepy dog.", context)

-if not result.passed:
+if result.passed:
+    print("Validation passed!")
+else:
    print(result.failure_summary)
 ```

-## Documentation
+## Metrics

- [Project Plan](docs/project-plan.md)
- [Implementation Plan](docs/implementation-plan.md)
+Veritext provides several metrics for text evaluation.
+
+### BLEU
+
+Measures n-gram precision against reference text. Useful for translation and
+generation quality.
+
+```python
+from veritext.metrics import Bleu
+
+bleu = Bleu()
+result = bleu.score(
+    candidate="The cat sat on the mat.",
+    reference="A cat is sitting on the mat.",
+)
+print(f"BLEU-4: {result.bleu4:.3f}")  # Uses 1-4 gram precision
+print(f"BLEU-1: {result.bleu1:.3f}")  # Unigram precision only
+```
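To make "n-gram precision" concrete, here is a minimal standalone sketch of the clipped (modified) n-gram precision that BLEU is built on. It is not Veritext's implementation: the helper names `ngrams` and `modified_precision` are hypothetical, tokenisation is plain whitespace splitting, and the brevity penalty and geometric mean that complete BLEU are left out.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams in a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against one reference.

    Hypothetical helper for illustration; not part of the Veritext API.
    """
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Clip each candidate n-gram count at its count in the reference,
    # so repeating a matching word cannot inflate the score.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


p1 = modified_precision("the cat sat on the mat", "a cat is sitting on the mat", 1)
p2 = modified_precision("the cat sat on the mat", "a cat is sitting on the mat", 2)
print(f"unigram precision: {p1:.3f}, bigram precision: {p2:.3f}")
```

The candidate repeats "the" twice but the reference contains it once, so only one occurrence is credited; that clipping is what distinguishes BLEU's precision from a naive token match rate.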
+
+### ROUGE
+
+Measures recall-oriented overlap with reference text. Useful for summarisation.
+
+```python
+from veritext.metrics import Rouge
+
+rouge = Rouge()
+result = rouge.score(
+    candidate="Scientists found a new planet.",
+    reference="Researchers discovered a new planet in the solar system.",
+)
+print(f"ROUGE-1 F1: {result.rouge1.fmeasure:.3f}")  # Unigram overlap
+print(f"ROUGE-L F1: {result.rouge_l.fmeasure:.3f}")  # Longest common subsequence
+```
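As an illustration of the longest-common-subsequence idea behind ROUGE-L, the following sketch computes an LCS-based F1 from scratch. It assumes whitespace tokenisation and a single reference; `lcs_length` and `rouge_l_f1` are hypothetical names, and Veritext's own scoring may differ in tokenisation and weighting.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the diagonal on a match, otherwise carry the best neighbour.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """LCS-based F1 (illustrative only; not the Veritext API)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of the candidate that is in the LCS
    recall = lcs / len(ref)      # share of the reference that is in the LCS
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1(
    "scientists found a new planet",
    "researchers discovered a new planet in the solar system",
)
print(f"ROUGE-L F1 (sketch): {score:.3f}")
```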
+
+### Lexical Similarity
+
+Measures token overlap using Jaccard similarity.
+
+```python
+from veritext.metrics import Lexical
+
+lexical = Lexical()
+result = lexical.score(
+    candidate="The quick brown fox",
+    reference="The fast brown fox",
+)
+print(f"Jaccard: {result.jaccard:.3f}")
+print(f"Token overlap: {result.token_overlap:.3f}")
+```
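For reference, Jaccard similarity is the size of the intersection of two token sets divided by the size of their union. A minimal sketch, assuming lowercased whitespace tokens (the function name `jaccard` is illustrative, and how Veritext tokenises or defines `token_overlap` is not shown here):

```python
def jaccard(candidate, reference):
    """Jaccard similarity between the token sets of two strings."""
    a = set(candidate.lower().split())
    b = set(reference.lower().split())
    if not a and not b:
        return 1.0  # two empty texts are treated as identical in this sketch
    return len(a & b) / len(a | b)


value = jaccard("The quick brown fox", "The fast brown fox")
print(f"Jaccard (sketch): {value:.3f}")
```

Three of the five distinct tokens are shared here, so the sketch yields 0.6.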
+
+### Readability
+
+Computes Flesch Reading Ease and Flesch-Kincaid Grade Level scores to estimate text complexity.
+
+```python
+from veritext.metrics import Readability
+
+readability = Readability()
+result = readability.score("This is a simple sentence.")
+print(f"Grade level: {result.grade_level:.1f}")
+print(f"Reading ease: {result.reading_ease:.1f}")
+```
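The two scores come from the standard Flesch formulas, which depend only on words per sentence and syllables per word. The sketch below applies those formulas with a deliberately naive syllable counter (runs of vowels); the helper names are hypothetical, and Veritext's syllable counting is likely more careful, so exact values may differ.

```python
import re


def count_syllables(word):
    """Rough estimate: count runs of vowels, with a minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_scores(text):
    """Return (reading_ease, grade_level) using the standard Flesch formulas."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = sum(count_syllables(w) for w in words) / len(words)
    reading_ease = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    grade_level = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return reading_ease, grade_level


ease, grade = flesch_scores("The cat sat. The dog ran.")
print(f"Reading ease: {ease:.1f}, grade level: {grade:.1f}")
```

Higher reading ease means easier text, while the grade level approximates the US school grade needed to follow it, which is why the validators later in this README cap the grade and set a floor on ease.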
+
+### Semantic Similarity (Optional)
+
+Measures similarity of meaning using sentence embeddings (sentence-transformers). Requires `pip install veritext[semantic]`.
+
+```python
+from veritext.semantic import SemanticSimilarity
+
+semantic = SemanticSimilarity()
+result = semantic.score(
+    candidate="The dog is running in the park.",
+    reference="A canine is jogging through the garden.",
+)
+print(f"Similarity: {result.score:.3f}")
+```
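Embedding-based similarity is commonly computed as the cosine of the angle between two sentence vectors. Whether Veritext uses exactly this measure is an assumption; the sketch below only illustrates the idea with toy vectors standing in for real sentence embeddings, and `cosine_similarity` is a hypothetical helper.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a zero vector carries no direction to compare
    return dot / (norm_u * norm_v)


# Toy 3-dimensional "embeddings"; real sentence embeddings have hundreds of dimensions.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(f"{same_direction:.3f} {orthogonal:.3f}")
```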
+
+## Validators
+
+Validators wrap metrics with thresholds to make pass/fail decisions.
+
+### Metric-Based Validators
+
+```python
+from veritext.core.types import ValidationContext
+from veritext.validators import bleu, lexical, rouge
+
+context = ValidationContext(reference="Reference text here.")
+
+# BLEU validation
+validator = bleu(min_score=0.5, variant=4)  # BLEU-4
+result = validator.check("Candidate text here.", context)
+
+# ROUGE validation
+validator = rouge(min_score=0.6, variant="l")  # ROUGE-L
+result = validator.check("Candidate text here.", context)
+
+# Lexical validation
+validator = lexical(min_jaccard=0.3, min_overlap=0.5)
+result = validator.check("Candidate text here.", context)
+```
+
+### Constraint Validators
+
+These don't require reference text.
+
+```python
+from veritext.core.types import ValidationContext
+from veritext.validators import contains, excludes, length, readability
+
+context = ValidationContext()  # No reference needed
+
+# Length constraints
+validator = length(min_chars=50, max_chars=500, min_words=10)
+result = validator.check("Your text here...", context)
+
+# Readability constraints
+validator = readability(max_grade=8.0, min_ease=60.0)
+result = validator.check("Your text here...", context)
+
+# Content requirements
+validator = contains(patterns=["important", "keyword"])
+result = validator.check("This important text has a keyword.", context)
+
+# Content exclusions
+validator = excludes(patterns=["forbidden", "banned"])
+result = validator.check("This text is clean.", context)
+```
+
+### Composite Validators
+
+Combine multiple checks with logical operators.
+
+```python
+from veritext.validators import all_of, any_of, bleu, length, rouge
+
+# All checks must pass
+validator = all_of([
+    bleu(min_score=0.5),
+    rouge(min_score=0.6),
+    length(max_chars=500),
+])
+
+# At least one check must pass
+validator = any_of([
+    bleu(min_score=0.7),
+    rouge(min_score=0.7),
+])
+```
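The composition pattern itself is small. As a rough sketch of the idea (not Veritext's implementation, which returns result objects rather than booleans), checks can be modelled as functions from text to `bool` and combined with `all` and `any`. Every name below is a hypothetical stand-in.

```python
from typing import Callable, Sequence

Check = Callable[[str], bool]


def combine_all(checks: Sequence[Check]) -> Check:
    """Pass only if every check passes (logical AND)."""
    return lambda text: all(check(text) for check in checks)


def combine_any(checks: Sequence[Check]) -> Check:
    """Pass if at least one check passes (logical OR)."""
    return lambda text: any(check(text) for check in checks)


def max_chars(limit: int) -> Check:
    """Check that the text is at most `limit` characters long."""
    return lambda text: len(text) <= limit


def has_word(word: str) -> Check:
    """Check that `word` appears as a whitespace-delimited token."""
    return lambda text: word in text.split()


strict = combine_all([max_chars(20), has_word("fox")])
lenient = combine_any([max_chars(5), has_word("fox")])
print(strict("a quick brown fox"), lenient("a quick brown fox"))
```

Because a combined check has the same shape as a primitive one, combinations can be nested, which is what makes building complex checks from simple primitives possible.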
+
+## Pytest Plugin
+
+Veritext provides native pytest integration for testing text quality.
+
+### Basic Usage
+
+```python
+from veritext.pytest_plugin import validate_text
+
+
+def test_response_quality():
+    response = "This is a helpful response to your question."
+
+    validate_text(
+        response,
+        min_length=20,
+        max_length=200,
+        max_reading_grade=10.0,
+        must_contain=["helpful"],
+        must_exclude=["error", "sorry"],
+    )
+
+
+def test_summary_similarity():
+    summary = "Scientists discovered a new planet."
+    reference = "Researchers found a new planet in our solar system."
+
+    validate_text(
+        summary,
+        reference=reference,
+        min_rouge=0.5,
+        min_length=10,
+    )
+```
+
+### Available Parameters
+
+| Parameter | Description |
+|-----------|-------------|
+| `reference` | Reference text for comparison metrics |
+| `min_bleu` | Minimum BLEU-4 score (0.0-1.0) |
+| `min_rouge` | Minimum ROUGE-L F1 score (0.0-1.0) |
+| `min_semantic` | Minimum semantic similarity (0.0-1.0) |
+| `min_length` | Minimum character count |
+| `max_length` | Maximum character count |
+| `max_reading_grade` | Maximum Flesch-Kincaid grade level |
+| `must_contain` | List of required patterns |
+| `must_exclude` | List of forbidden patterns |
+
+## Benchmarking
+
+Track text quality over time and detect regressions.
+
+### Running Benchmarks
+
+```python
+from veritext.benchmark import Benchmark
+
+# Create a benchmark suite
+bench = Benchmark("summariser_quality", storage_path="benchmarks/")
+
+# Evaluate a batch of outputs
+candidates = ["Summary 1...", "Summary 2...", "Summary 3..."]
+references = ["Reference 1...", "Reference 2...", "Reference 3..."]
+
+run = bench.evaluate(
+    candidates=candidates,
+    references=references,
+    metrics=["rouge_l", "bleu4"],
+    metadata={"model": "v1.2", "git_sha": "abc123"},
+)
+
+print(f"ROUGE-L: {run.metrics['rouge_l']:.4f}")
+print(f"BLEU-4: {run.metrics['bleu4']:.4f}")
+```
+
+### Regression Detection
+
+```python
+from veritext.benchmark import Benchmark
+from veritext.core.exceptions import RegressionDetectedError
+
+bench = Benchmark("summariser_quality")
+
+# Check for regression against historical baseline
+report = bench.check_regression(tolerance=0.05, window=10)
+if report.detected:
+    print("Quality regression detected!")
+    for metric, delta in report.deltas.items():
+        print(f"  {metric}: {delta:+.4f}")
+
+# Or raise an exception for CI integration
+try:
+    bench.assert_no_regression(tolerance=0.05)
+except RegressionDetectedError as e:
+    print(f"CI failure: {e}")
+    raise SystemExit(1)
+```
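One plausible reading of `tolerance` and `window` is sketched below: the baseline is the mean of the most recent `window` runs, and a regression is flagged when the current score falls more than `tolerance` below it. This interpretation is an assumption made for illustration, and `detect_regression` is a hypothetical helper, not the Veritext API.

```python
from statistics import mean


def detect_regression(history, current, tolerance=0.05, window=10):
    """Return (detected, delta) for one metric.

    Assumes an absolute tolerance against the mean of the last `window`
    historical scores; the library's actual rule may differ.
    """
    if not history:
        return False, 0.0  # nothing to compare against yet
    baseline = mean(history[-window:])
    delta = current - baseline
    return delta < -tolerance, delta


past_rouge_l = [0.60, 0.62, 0.61]
detected, delta = detect_regression(past_rouge_l, current=0.50)
print(f"detected={detected}, delta={delta:+.4f}")
```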
+
+### Viewing History
+
+```python
+from veritext.benchmark import Benchmark
+
+bench = Benchmark("summariser_quality")
+
+for run in bench.get_history(limit=10):
+    print(f"{run.timestamp}: rouge_l={run.metrics.get('rouge_l', 0):.4f}")
+```
+
+## CLI
+
+Veritext provides a command-line interface for validation and benchmarking.
+
+### Validate Text
+
+```bash
+# Inline validation
+veritext validate "Candidate text" -r "Reference text" -m bleu,rouge
+
+# File-based batch validation (JSONL with "candidate" and "reference" fields)
+veritext validate -f outputs.jsonl -m bleu,rouge,lexical
+
+# With threshold for pass/fail
+veritext validate "Text" -r "Reference" -m bleu -t 0.5 -o simple
+
+# Output formats: table (default), json, simple
+veritext validate "Text" -r "Reference" -m bleu -o json
+```
+
+### Benchmark Commands
+
+```bash
+# Run a benchmark evaluation
+veritext benchmark run my_bench -f data.jsonl -m rouge_l,bleu4
+
+# View benchmark history
+veritext benchmark show my_bench --last 10
+
+# Check for regression (exits with code 1 if detected)
+veritext benchmark check my_bench --tolerance 0.05 --window 10
+```
+
+### JSONL Format
+
+For file-based operations, use JSONL with `candidate` and `reference` fields:
+
+```json
+{"candidate": "Model output 1", "reference": "Expected output 1"}
+{"candidate": "Model output 2", "reference": "Expected output 2"}
+```
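A file in this shape can be checked before it is passed to the CLI. The following sketch uses only the standard library to parse JSONL lines and confirm that both fields are present; `parse_jsonl` is a hypothetical helper and says nothing about how Veritext itself reads the file.

```python
import json


def parse_jsonl(lines):
    """Parse JSONL lines into (candidate, reference) pairs.

    Blank lines are skipped; a record missing either field raises ValueError.
    """
    pairs = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        missing = {"candidate", "reference"} - set(record)
        if missing:
            raise ValueError(f"line {number}: missing fields {sorted(missing)}")
        pairs.append((record["candidate"], record["reference"]))
    return pairs


sample = [
    '{"candidate": "Model output 1", "reference": "Expected output 1"}',
    '{"candidate": "Model output 2", "reference": "Expected output 2"}',
]
pairs = parse_jsonl(sample)
print(pairs)
```

To validate a real file, pass an open file handle instead of the list, since iterating a text file yields its lines.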
+
+## Configuration
+
+Veritext uses environment variables for configuration:
+
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `VERITEXT_LOG_LEVEL` | `INFO` | Logging level |
+| `VERITEXT_LOG_FORMAT` | `console` | Log format (`console` or `json`) |
+
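For illustration, environment variables like these are typically read with a fallback to the documented default. The sketch below shows that pattern with the standard library only; `load_log_settings` is a hypothetical helper, and the case normalisation and validation it performs are assumptions rather than documented Veritext behaviour.

```python
import os


def load_log_settings(environ=None):
    """Return (level, format) from the environment, using the documented defaults."""
    env = os.environ if environ is None else environ
    level = env.get("VERITEXT_LOG_LEVEL", "INFO").upper()
    log_format = env.get("VERITEXT_LOG_FORMAT", "console").lower()
    if log_format not in {"console", "json"}:
        raise ValueError(f"unsupported log format: {log_format!r}")
    return level, log_format


# Passing a plain dict keeps the example independent of the real environment.
defaults = load_log_settings({})
custom = load_log_settings({"VERITEXT_LOG_LEVEL": "debug", "VERITEXT_LOG_FORMAT": "json"})
print(defaults, custom)
```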
+## Development
+
+### Setup
+
+```bash
+git clone https://gitea.kschappell.com/kschappell/veritext.git
+cd veritext
+uv sync --all-extras
+```
+
+### Quality Checks
+
+```bash
+# Linting
+uv run ruff check .
+
+# Formatting
+uv run ruff format --check .
+
+# Type checking
+uv run mypy src/
+
+# Tests
+uv run pytest
+```
+
+### Running Examples
+
+```bash
+uv run python examples/basic_validation.py
+uv run pytest examples/chatbot_testing.py -v
+uv run python examples/benchmark_regression.py
+```

 ## Licence