docs(readme): comprehensive documentation
Expands readme with detailed coverage of metrics, validators, pytest plugin, benchmark module, CLI commands, and development setup.
386
readme.md
@@ -2,48 +2,398 @@
Semantic text validation framework for Python.

Veritext validates text outputs against quality criteria using metrics like BLEU,
ROUGE, and semantic similarity. It is designed for developers building systems that
produce text (chatbots, content generators, summarisation tools) who need automated
quality assurance beyond simple string matching.

## Status

Under active development. See [changelog.md](changelog.md) for progress.

## Features

- **Multiple metrics**: BLEU, ROUGE, lexical similarity, readability, semantic
  embeddings
- **Composable validators**: Build complex checks from simple primitives
- **Native pytest integration**: `validate_text()` assertion for test suites
- **Quality benchmarking**: Track metrics over time with regression detection
- **CLI tools**: Command-line validation and benchmark management

## Installation

```bash
pip install veritext

# With semantic similarity support (sentence-transformers)
# The quotes stop shells such as zsh from expanding the square brackets.
pip install "veritext[semantic]"
```

## Quick Start

```python
from veritext.core.types import ValidationContext
from veritext.validators import all_of, bleu, length, rouge

# Create a validator
validator = all_of([
    bleu(min_score=0.5),
    rouge(min_score=0.6),
    length(max_chars=500),
])

# Validate text
context = ValidationContext(reference="The quick brown fox jumps over the lazy dog.")
result = validator.check("A fast brown fox leaps over a sleepy dog.", context)

if result.passed:
    print("Validation passed!")
else:
    print(result.failure_summary)
```

## Documentation

- [Project Plan](docs/project-plan.md)
- [Implementation Plan](docs/implementation-plan.md)

## Metrics

Veritext provides several metrics for text evaluation.

### BLEU

Measures n-gram precision against a reference text. Useful for assessing translation
and generation quality.

```python
from veritext.metrics import Bleu

bleu = Bleu()
result = bleu.score(
    candidate="The cat sat on the mat.",
    reference="A cat is sitting on the mat.",
)
print(f"BLEU-4: {result.bleu4:.3f}")  # Uses 1-4 gram precision
print(f"BLEU-1: {result.bleu1:.3f}")  # Unigram precision only
```

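To make "n-gram precision" concrete, here is a minimal standalone sketch of the clipped (modified) n-gram precision that BLEU builds on. It is not Veritext's implementation: the helper names `ngrams` and `modified_precision` are hypothetical, tokenisation is a plain whitespace split (an assumption), and a full BLEU score additionally combines several n-gram orders with a brevity penalty.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of candidate against a single reference."""
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs
    # in the reference, so repeating a matching word does not inflate the score.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


p1 = modified_precision("the cat sat on the mat", "a cat is sitting on the mat", 1)
p2 = modified_precision("the cat sat on the mat", "a cat is sitting on the mat", 2)
print(f"unigram precision: {p1:.3f}")  # 0.667
print(f"bigram precision: {p2:.3f}")   # 0.400
```

The candidate uses "the" twice but the reference only once, so clipping credits it a single time; that is why the unigram precision is 4/6 and not 5/6.
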
### ROUGE

Measures recall-oriented overlap with a reference text. Useful for summarisation.

```python
from veritext.metrics import Rouge

rouge = Rouge()
result = rouge.score(
    candidate="Scientists found a new planet.",
    reference="Researchers discovered a new planet in the solar system.",
)
print(f"ROUGE-1 F1: {result.rouge1.fmeasure:.3f}")  # Unigram overlap
print(f"ROUGE-L F1: {result.rouge_l.fmeasure:.3f}")  # Longest common subsequence
```

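The "longest common subsequence" behind ROUGE-L can be sketched in a few lines. This is an illustrative standalone version, not the library's code: `lcs_length` and `rouge_l_f1` are hypothetical names, and lower-casing plus whitespace tokenisation are assumptions.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """F1 of LCS-based precision (vs. candidate) and recall (vs. reference)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1(
    "scientists found a new planet",
    "researchers discovered a new planet in the solar system",
)
print(f"ROUGE-L F1 (sketch): {score:.3f}")  # 0.429
```

Here the common subsequence is "a new planet" (3 tokens), giving precision 3/5, recall 3/9, and an F1 of 3/7.
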
### Lexical Similarity

Measures token overlap using Jaccard similarity.

```python
from veritext.metrics import Lexical

lexical = Lexical()
result = lexical.score(
    candidate="The quick brown fox",
    reference="The fast brown fox",
)
print(f"Jaccard: {result.jaccard:.3f}")
print(f"Token overlap: {result.token_overlap:.3f}")
```

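For reference, Jaccard similarity is the size of the intersection of the two token sets divided by the size of their union. The sketch below assumes lower-cased whitespace tokens; the function name `jaccard` and the choice to score two empty texts as 1.0 are illustrative assumptions, not documented Veritext behaviour.

```python
def jaccard(candidate, reference):
    """Jaccard similarity between the token sets of two strings."""
    a = set(candidate.lower().split())
    b = set(reference.lower().split())
    if not a and not b:
        # Assumption: two empty texts are treated as identical.
        return 1.0
    return len(a & b) / len(a | b)


similarity = jaccard("The quick brown fox", "The fast brown fox")
print(f"Jaccard (sketch): {similarity:.3f}")  # 0.600
```

The two sentences share three of five distinct tokens, hence 3/5.
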
### Readability

Computes Flesch-Kincaid scores for text complexity.

```python
from veritext.metrics import Readability

readability = Readability()
result = readability.score("This is a simple sentence.")
print(f"Grade level: {result.grade_level:.1f}")
print(f"Reading ease: {result.reading_ease:.1f}")
```

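The two scores come from the standard Flesch formulas, which depend only on words per sentence and syllables per word. The sketch below applies those formulas with a deliberately naive syllable counter (one syllable per run of vowels); that heuristic, the regex tokenisation, and the names `count_syllables` and `flesch_scores` are assumptions for illustration, so real tools may report slightly different numbers.

```python
import re


def count_syllables(word):
    """Naive heuristic: each run of vowels counts as one syllable."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_scores(text):
    """Return (reading_ease, grade_level) using the standard Flesch formulas."""
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0, 0.0  # Assumption: empty text scores zero on both scales.
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    syllables = sum(count_syllables(word) for word in words)
    words_per_sentence = len(words) / sentences
    syllables_per_word = syllables / len(words)
    reading_ease = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    grade_level = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return reading_ease, grade_level


ease, grade = flesch_scores("The cat sat on the mat.")
print(f"Reading ease: {ease:.1f}, grade level: {grade:.1f}")
```

Higher reading ease means easier text, while a higher grade level means harder text, which is why the validator shown later pairs `max_grade` with `min_ease`.
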
### Semantic Similarity (Optional)

Requires `pip install "veritext[semantic]"`.

```python
from veritext.semantic import SemanticSimilarity

semantic = SemanticSimilarity()
result = semantic.score(
    candidate="The dog is running in the park.",
    reference="A canine is jogging through the garden.",
)
print(f"Similarity: {result.score:.3f}")
```

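Embedding-based similarity is commonly computed as the cosine of the angle between two sentence vectors. The README does not state which measure Veritext uses, so the following is only a sketch of that common approach, with toy vectors standing in for real embeddings and `cosine_similarity` as a hypothetical helper name.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # Assumption: a zero vector is similar to nothing.
    return dot / (norm_u * norm_v)


# Toy 3-dimensional "embeddings"; real sentence embeddings have hundreds of dimensions.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(f"{same_direction:.3f} {orthogonal:.3f}")  # 1.000 0.000
```

Because cosine similarity ignores vector length, two texts whose embeddings point in the same direction score 1.0 even if their magnitudes differ.
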
## Validators

Validators wrap metrics with thresholds to make pass/fail decisions.

### Metric-Based Validators

```python
from veritext.core.types import ValidationContext
from veritext.validators import bleu, lexical, rouge

context = ValidationContext(reference="Reference text here.")

# BLEU validation
validator = bleu(min_score=0.5, variant=4)  # BLEU-4
result = validator.check("Candidate text here.", context)

# ROUGE validation
validator = rouge(min_score=0.6, variant="l")  # ROUGE-L
result = validator.check("Candidate text here.", context)

# Lexical validation
validator = lexical(min_jaccard=0.3, min_overlap=0.5)
result = validator.check("Candidate text here.", context)
```

### Constraint Validators

These don't require reference text.

```python
from veritext.core.types import ValidationContext
from veritext.validators import contains, excludes, length, readability

context = ValidationContext()  # No reference needed

# Length constraints
validator = length(min_chars=50, max_chars=500, min_words=10)
result = validator.check("Your text here...", context)

# Readability constraints
validator = readability(max_grade=8.0, min_ease=60.0)
result = validator.check("Your text here...", context)

# Content requirements
validator = contains(patterns=["important", "keyword"])
result = validator.check("This important text has a keyword.", context)

# Content exclusions
validator = excludes(patterns=["forbidden", "banned"])
result = validator.check("This text is clean.", context)
```

### Composite Validators

Combine multiple checks with logical operators.

```python
from veritext.validators import all_of, any_of, bleu, length, rouge

# All checks must pass
validator = all_of([
    bleu(min_score=0.5),
    rouge(min_score=0.6),
    length(max_chars=500),
])

# At least one check must pass
validator = any_of([
    bleu(min_score=0.7),
    rouge(min_score=0.7),
])
```

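The idea of building complex checks from simple primitives can be illustrated without the library. In this sketch a check is just a function from text to `bool`; the names `combine_all`, `combine_any`, `max_chars`, and `has_word` are hypothetical and deliberately differ from Veritext's own `all_of` and `any_of`, whose internals are not shown here.

```python
from typing import Callable, Sequence

Check = Callable[[str], bool]


def combine_all(checks: Sequence[Check]) -> Check:
    """Build a check that passes only if every inner check passes."""
    return lambda text: all(check(text) for check in checks)


def combine_any(checks: Sequence[Check]) -> Check:
    """Build a check that passes if at least one inner check passes."""
    return lambda text: any(check(text) for check in checks)


def max_chars(limit: int) -> Check:
    """Primitive check: text is no longer than `limit` characters."""
    return lambda text: len(text) <= limit


def has_word(word: str) -> Check:
    """Primitive check: text contains `word` as a substring."""
    return lambda text: word in text


# Combinators return ordinary checks, so they nest freely.
polite_and_short = combine_all([
    max_chars(40),
    combine_any([has_word("please"), has_word("thanks")]),
])

print(polite_and_short("thanks for your help"))  # True
print(polite_and_short("no"))                    # False
```

Because every combinator returns the same kind of object it accepts, arbitrarily deep trees of checks can be assembled from a handful of primitives.
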
## Pytest Plugin

Veritext provides native pytest integration for testing text quality.

### Basic Usage

```python
from veritext.pytest_plugin import validate_text


def test_response_quality():
    response = "This is a helpful response to your question."

    validate_text(
        response,
        min_length=20,
        max_length=200,
        max_reading_grade=10.0,
        must_contain=["helpful"],
        must_exclude=["error", "sorry"],
    )


def test_summary_similarity():
    summary = "Scientists discovered a new planet."
    reference = "Researchers found a new planet in our solar system."

    validate_text(
        summary,
        reference=reference,
        min_rouge=0.5,
        min_length=10,
    )
```

### Available Parameters

| Parameter | Description |
|-----------|-------------|
| `reference` | Reference text for comparison metrics |
| `min_bleu` | Minimum BLEU-4 score (0.0-1.0) |
| `min_rouge` | Minimum ROUGE-L F1 score (0.0-1.0) |
| `min_semantic` | Minimum semantic similarity (0.0-1.0) |
| `min_length` | Minimum character count |
| `max_length` | Maximum character count |
| `max_reading_grade` | Maximum Flesch-Kincaid grade level |
| `must_contain` | List of required patterns |
| `must_exclude` | List of forbidden patterns |

## Benchmarking

Track text quality over time and detect regressions.

### Running Benchmarks

```python
from veritext.benchmark import Benchmark

# Create a benchmark suite
bench = Benchmark("summariser_quality", storage_path="benchmarks/")

# Evaluate a batch of outputs
candidates = ["Summary 1...", "Summary 2...", "Summary 3..."]
references = ["Reference 1...", "Reference 2...", "Reference 3..."]

run = bench.evaluate(
    candidates=candidates,
    references=references,
    metrics=["rouge_l", "bleu4"],
    metadata={"model": "v1.2", "git_sha": "abc123"},
)

print(f"ROUGE-L: {run.metrics['rouge_l']:.4f}")
print(f"BLEU-4: {run.metrics['bleu4']:.4f}")
```

### Regression Detection

```python
from veritext.benchmark import Benchmark
from veritext.core.exceptions import RegressionDetectedError

bench = Benchmark("summariser_quality")

# Check for regression against historical baseline
report = bench.check_regression(tolerance=0.05, window=10)
if report.detected:
    print("Quality regression detected!")
    for metric, delta in report.deltas.items():
        print(f"  {metric}: {delta:+.4f}")

# Or raise an exception for CI integration
try:
    bench.assert_no_regression(tolerance=0.05)
except RegressionDetectedError as e:
    print(f"CI failure: {e}")
    # Exit non-zero so the CI job fails; unlike the interactive exit()
    # helper, SystemExit works in any script.
    raise SystemExit(1)
```

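The README does not spell out how `tolerance` and `window` are interpreted. One plausible reading, sketched below purely as an assumption, is that the baseline is the mean of the last `window` historical scores and a regression is flagged when the latest score falls more than `tolerance` (as an absolute amount) below it. The function `detect_regression` is hypothetical and not part of Veritext.

```python
from statistics import mean


def detect_regression(history, latest, tolerance=0.05, window=10):
    """Compare `latest` against the mean of the last `window` scores.

    Returns (detected, delta), where delta = latest - baseline.
    """
    recent = history[-window:]
    if not recent:
        return False, 0.0  # No baseline yet, so nothing to regress from.
    delta = latest - mean(recent)
    return delta < -tolerance, delta


history = [0.80, 0.82, 0.81]
detected, delta = detect_regression(history, latest=0.70)
print(f"detected={detected}, delta={delta:+.4f}")  # detected=True, delta=-0.1100

detected_small, _ = detect_regression(history, latest=0.79)
print(detected_small)  # False: a 0.02 drop is within the 0.05 tolerance
```

Averaging over a window makes the baseline less sensitive to a single unusually good or bad run than comparing against the previous run alone.
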
### Viewing History

```python
from veritext.benchmark import Benchmark

bench = Benchmark("summariser_quality")

for run in bench.get_history(limit=10):
    print(f"{run.timestamp}: rouge_l={run.metrics.get('rouge_l', 0):.4f}")
```

## CLI

Veritext provides a command-line interface for validation and benchmarking.

### Validate Text

```bash
# Inline validation
veritext validate "Candidate text" -r "Reference text" -m bleu,rouge

# File-based batch validation (JSONL with "candidate" and "reference" fields)
veritext validate -f outputs.jsonl -m bleu,rouge,lexical

# With threshold for pass/fail
veritext validate "Text" -r "Reference" -m bleu -t 0.5 -o simple

# Output formats: table (default), json, simple
veritext validate "Text" -r "Reference" -m bleu -o json
```

### Benchmark Commands

```bash
# Run a benchmark evaluation
veritext benchmark run my_bench -f data.jsonl -m rouge_l,bleu4

# View benchmark history
veritext benchmark show my_bench --last 10

# Check for regression (exits with code 1 if detected)
veritext benchmark check my_bench --tolerance 0.05 --window 10
```

### JSONL Format

For file-based operations, use JSONL (one JSON object per line) with `candidate` and
`reference` fields:

```json
{"candidate": "Model output 1", "reference": "Expected output 1"}
{"candidate": "Model output 2", "reference": "Expected output 2"}
```

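If the input files are produced by a script, it can help to validate them before handing them to the CLI. The sketch below parses JSONL text with the standard `json` module and checks for the two required fields; `parse_jsonl` is a hypothetical helper, and skipping blank lines is an assumption about what is acceptable.

```python
import json


def parse_jsonl(text):
    """Parse JSONL text into a list of (candidate, reference) pairs."""
    pairs = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # Assumption: blank lines are ignored.
        record = json.loads(line)
        missing = {"candidate", "reference"} - record.keys()
        if missing:
            raise ValueError(f"line {line_number}: missing fields {sorted(missing)}")
        pairs.append((record["candidate"], record["reference"]))
    return pairs


sample = (
    '{"candidate": "Model output 1", "reference": "Expected output 1"}\n'
    '{"candidate": "Model output 2", "reference": "Expected output 2"}\n'
)
pairs = parse_jsonl(sample)
print(len(pairs))  # 2
```

To check a file on disk, read it first, for example with `pathlib.Path("outputs.jsonl").read_text(encoding="utf-8")`, and pass the result to the function.
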
## Configuration

Veritext uses environment variables for configuration:

| Variable | Default | Description |
|----------|---------|-------------|
| `VERITEXT_LOG_LEVEL` | `INFO` | Logging level |
| `VERITEXT_LOG_FORMAT` | `console` | Log format (`console` or `json`) |

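How Veritext reads these variables internally is not shown, but the defaults in the table can be mirrored in a few lines when a wrapper script needs the same settings. The function `load_settings` and the returned dictionary shape are hypothetical; only the variable names and defaults come from the table above.

```python
import os


def load_settings(env=None):
    """Read the documented variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    log_format = env.get("VERITEXT_LOG_FORMAT", "console")
    if log_format not in {"console", "json"}:
        raise ValueError(f"unsupported VERITEXT_LOG_FORMAT: {log_format!r}")
    return {
        "log_level": env.get("VERITEXT_LOG_LEVEL", "INFO"),
        "log_format": log_format,
    }


# Passing a plain dict instead of os.environ keeps the function easy to test.
defaults = load_settings({})
overridden = load_settings({"VERITEXT_LOG_LEVEL": "DEBUG", "VERITEXT_LOG_FORMAT": "json"})
print(defaults)    # {'log_level': 'INFO', 'log_format': 'console'}
print(overridden)  # {'log_level': 'DEBUG', 'log_format': 'json'}
```
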
## Development

### Setup

```bash
git clone https://gitea.kschappell.com/kschappell/veritext.git
cd veritext
uv sync --all-extras
```

### Quality Checks

```bash
# Linting
uv run ruff check .

# Formatting
uv run ruff format --check .

# Type checking
uv run mypy src/

# Tests
uv run pytest
```

### Running Examples

```bash
uv run python examples/basic_validation.py
uv run pytest examples/chatbot_testing.py -v
uv run python examples/benchmark_regression.py
```

## Licence