# Veritext

Semantic text validation framework for Python.

Veritext validates text outputs against quality criteria using metrics such as BLEU,
ROUGE, and semantic similarity. It is designed for developers building systems that
produce text (chatbots, content generators, summarisation tools) and need automated
quality assurance beyond simple string matching.

## Features

- **Multiple metrics** — BLEU, ROUGE, lexical similarity, readability, semantic
  embeddings
- **Composable validators** — Build complex checks from simple primitives
- **Native pytest integration** — `validate_text()` assertion for test suites
- **Quality benchmarking** — Track metrics over time with regression detection
- **CLI tools** — Command-line validation and benchmark management

## Installation

```bash
pip install veritext

# With semantic similarity support (sentence-transformers)
# (quoted so that shells such as zsh do not expand the brackets)
pip install "veritext[semantic]"
```

## Quick Start

```python
from veritext.core.types import ValidationContext
from veritext.validators import all_of, bleu, length, rouge

# Create a validator
validator = all_of([
    bleu(min_score=0.5),
    rouge(min_score=0.6),
    length(max_chars=500),
])

# Validate text
context = ValidationContext(reference="The quick brown fox jumps over the lazy dog.")
result = validator.check("A fast brown fox leaps over a sleepy dog.", context)

if result.passed:
    print("Validation passed!")
else:
    print(result.failure_summary)
```

## Metrics

Veritext provides several metrics for text evaluation.

### BLEU

Measures n-gram precision against reference text. Useful for translation and
generation quality.

```python
from veritext.metrics import Bleu

bleu = Bleu()
result = bleu.score(
    candidate="The cat sat on the mat.",
    reference="A cat is sitting on the mat.",
)
print(f"BLEU-4: {result.bleu4:.3f}")  # Uses 1-4 gram precision
print(f"BLEU-1: {result.bleu1:.3f}")  # Unigram precision only
```
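
To make "n-gram precision" concrete, here is a minimal standalone sketch of the clipped (modified) n-gram precision that BLEU is built on. It does not use Veritext: the helper names `tokenise`, `ngrams`, and `modified_precision` and the regex tokenisation are assumptions made for illustration, so scores from the library may differ. Full BLEU additionally combines the precisions for n = 1 to 4 with a geometric mean and applies a brevity penalty.

```python
import re
from collections import Counter


def tokenise(text: str) -> list[str]:
    # Assumed tokenisation: lowercase word characters only.
    return re.findall(r"\w+", text.lower())


def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate: str, reference: str, n: int) -> float:
    """Share of candidate n-grams found in the reference, with clipped counts."""
    cand = Counter(ngrams(tokenise(candidate), n))
    ref = Counter(ngrams(tokenise(reference), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


candidate = "The cat sat on the mat."
reference = "A cat is sitting on the mat."
print(f"1-gram precision: {modified_precision(candidate, reference, 1):.3f}")
print(f"2-gram precision: {modified_precision(candidate, reference, 2):.3f}")
```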

### ROUGE

Measures recall-oriented overlap with reference text. Useful for summarisation.

```python
from veritext.metrics import Rouge

rouge = Rouge()
result = rouge.score(
    candidate="Scientists found a new planet.",
    reference="Researchers discovered a new planet in the solar system.",
)
print(f"ROUGE-1 F1: {result.rouge1.fmeasure:.3f}")  # Unigram overlap
print(f"ROUGE-L F1: {result.rouge_l.fmeasure:.3f}")  # Longest common subsequence
```
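
As an illustration of the longest-common-subsequence idea behind ROUGE-L, the sketch below computes an LCS-based F1 score in plain Python. It is independent of Veritext: `tokenise`, `lcs_length`, and `rouge_l_f1` are hypothetical helpers, and the tokenisation is an assumption, so the library's values may differ.

```python
import re


def tokenise(text: str) -> list[str]:
    # Assumed tokenisation: lowercase word characters only.
    return re.findall(r"\w+", text.lower())


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    cand, ref = tokenise(candidate), tokenise(reference)
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of the candidate covered by the LCS
    recall = lcs / len(ref)      # share of the reference covered by the LCS
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1(
    "Scientists found a new planet.",
    "Researchers discovered a new planet in the solar system.",
)
print(f"ROUGE-L F1 (sketch): {score:.3f}")
```

Here the common subsequence is "a new planet", so precision is 3/5 and recall is 3/9.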

### Lexical Similarity

Measures token overlap using Jaccard similarity.

```python
from veritext.metrics import Lexical

lexical = Lexical()
result = lexical.score(
    candidate="The quick brown fox",
    reference="The fast brown fox",
)
print(f"Jaccard: {result.jaccard:.3f}")
print(f"Token overlap: {result.token_overlap:.3f}")
```
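
For reference, Jaccard similarity is the size of the intersection of two token sets divided by the size of their union. The following standalone sketch shows the calculation; the function name `jaccard` and the regex tokenisation are assumptions for illustration, not Veritext's implementation.

```python
import re


def jaccard(candidate: str, reference: str) -> float:
    """Jaccard similarity over sets of lowercase word tokens (assumed tokenisation)."""
    a = set(re.findall(r"\w+", candidate.lower()))
    b = set(re.findall(r"\w+", reference.lower()))
    if not a and not b:
        return 0.0  # convention chosen here to avoid dividing by zero
    return len(a & b) / len(a | b)


# Three shared tokens (the, brown, fox) out of five distinct tokens overall.
print(f"{jaccard('The quick brown fox', 'The fast brown fox'):.3f}")  # prints 0.600
```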

### Readability

Computes the Flesch-Kincaid grade level and the Flesch reading ease score as measures
of text complexity.

```python
from veritext.metrics import Readability

readability = Readability()
result = readability.score("This is a simple sentence.")
print(f"Grade level: {result.grade_level:.1f}")
print(f"Reading ease: {result.reading_ease:.1f}")
```
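
The two scores come from the standard Flesch formulas, which depend only on word, sentence, and syllable counts. The sketch below applies those formulas to counts supplied by hand; the function names are hypothetical, and how Veritext counts syllables is not covered here, so the library's output may differ slightly.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch reading ease: higher values indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid grade level: approximate US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# "This is a simple sentence." counted by hand: 5 words, 1 sentence, 7 syllables.
print(f"Reading ease: {flesch_reading_ease(5, 1, 7):.1f}")
print(f"Grade level: {flesch_kincaid_grade(5, 1, 7):.1f}")
```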

### Semantic Similarity (Optional)

Requires the `semantic` extra: `pip install "veritext[semantic]"`.

```python
from veritext.semantic import SemanticSimilarity

semantic = SemanticSimilarity()
result = semantic.score(
    candidate="The dog is running in the park.",
    reference="A canine is jogging through the garden.",
)
print(f"Similarity: {result.score:.3f}")
```
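
Embedding-based similarity is commonly computed as the cosine of the angle between two sentence vectors. Whether Veritext uses exactly this measure is an assumption; the sketch below only illustrates the calculation on small hand-written vectors, and `cosine_similarity` is a hypothetical helper.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # a zero vector has no direction
    return dot / (norm_u * norm_v)


# Toy 3-dimensional "embeddings"; real sentence embeddings have hundreds of dimensions.
print(f"{cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]):.3f}")  # same direction
print(f"{cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]):.3f}")  # orthogonal
```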

## Validators

Validators wrap metrics with thresholds to make pass/fail decisions.

### Metric-Based Validators

```python
from veritext.core.types import ValidationContext
from veritext.validators import bleu, lexical, rouge

context = ValidationContext(reference="Reference text here.")

# BLEU validation
validator = bleu(min_score=0.5, variant=4)  # BLEU-4
result = validator.check("Candidate text here.", context)

# ROUGE validation
validator = rouge(min_score=0.6, variant="l")  # ROUGE-L
result = validator.check("Candidate text here.", context)

# Lexical validation
validator = lexical(min_jaccard=0.3, min_overlap=0.5)
result = validator.check("Candidate text here.", context)
```

### Constraint Validators

These don't require reference text.

```python
from veritext.core.types import ValidationContext
from veritext.validators import contains, excludes, length, readability

context = ValidationContext()  # No reference needed

# Length constraints
validator = length(min_chars=50, max_chars=500, min_words=10)
result = validator.check("Your text here...", context)

# Readability constraints
validator = readability(max_grade=8.0, min_ease=60.0)
result = validator.check("Your text here...", context)

# Content requirements
validator = contains(patterns=["important", "keyword"])
result = validator.check("This important text has a keyword.", context)

# Content exclusions
validator = excludes(patterns=["forbidden", "banned"])
result = validator.check("This text is clean.", context)
```

### Composite Validators

Combine multiple checks with logical operators.

```python
from veritext.validators import all_of, any_of, bleu, length, rouge

# All checks must pass
validator = all_of([
    bleu(min_score=0.5),
    rouge(min_score=0.6),
    length(max_chars=500),
])

# At least one check must pass
validator = any_of([
    bleu(min_score=0.7),
    rouge(min_score=0.7),
])
```
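
Conceptually, composition of this kind reduces to `all()` and `any()` over the individual checks. The sketch below shows the idea with plain predicate functions; it does not reproduce Veritext's validator or result types, and every name in it (`all_of_sketch`, `any_of_sketch`, `short_enough`, `mentions_fox`) is hypothetical.

```python
from typing import Callable, Sequence

Check = Callable[[str], bool]


def all_of_sketch(checks: Sequence[Check]) -> Check:
    """Passes only if every check passes."""
    return lambda text: all(check(text) for check in checks)


def any_of_sketch(checks: Sequence[Check]) -> Check:
    """Passes if at least one check passes."""
    return lambda text: any(check(text) for check in checks)


def short_enough(text: str) -> bool:
    return len(text) <= 20


def mentions_fox(text: str) -> bool:
    return "fox" in text.lower()


strict = all_of_sketch([short_enough, mentions_fox])
lenient = any_of_sketch([short_enough, mentions_fox])

print(strict("A quick dog"))   # short, but no fox
print(lenient("A quick dog"))  # one passing check is enough
```

Because a composite is itself a check, composites can be nested, for example an `any_of` inside an `all_of`.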

## Pytest Plugin

Veritext provides native pytest integration for testing text quality.

### Basic Usage

```python
from veritext.pytest_plugin import validate_text


def test_response_quality():
    response = "This is a helpful response to your question."

    validate_text(
        response,
        min_length=20,
        max_length=200,
        max_reading_grade=10.0,
        must_contain=["helpful"],
        must_exclude=["error", "sorry"],
    )


def test_summary_similarity():
    summary = "Scientists discovered a new planet."
    reference = "Researchers found a new planet in our solar system."

    validate_text(
        summary,
        reference=reference,
        min_rouge=0.5,
        min_length=10,
    )
```

### Available Parameters

| Parameter | Description |
|-----------|-------------|
| `reference` | Reference text for comparison metrics |
| `min_bleu` | Minimum BLEU-4 score (0.0-1.0) |
| `min_rouge` | Minimum ROUGE-L F1 score (0.0-1.0) |
| `min_semantic` | Minimum semantic similarity (0.0-1.0) |
| `min_length` | Minimum character count |
| `max_length` | Maximum character count |
| `max_reading_grade` | Maximum Flesch-Kincaid grade level |
| `must_contain` | List of required patterns |
| `must_exclude` | List of forbidden patterns |

## Benchmarking

Track text quality over time and detect regressions.

### Running Benchmarks

```python
from veritext.benchmark import Benchmark

# Create a benchmark suite
bench = Benchmark("summariser_quality", storage_path="benchmarks/")

# Evaluate a batch of outputs
candidates = ["Summary 1...", "Summary 2...", "Summary 3..."]
references = ["Reference 1...", "Reference 2...", "Reference 3..."]

run = bench.evaluate(
    candidates=candidates,
    references=references,
    metrics=["rouge_l", "bleu4"],
    metadata={"model": "v1.2", "git_sha": "abc123"},
)

print(f"ROUGE-L: {run.metrics['rouge_l']:.4f}")
print(f"BLEU-4: {run.metrics['bleu4']:.4f}")
```

### Regression Detection

```python
from veritext.benchmark import Benchmark
from veritext.core.exceptions import RegressionDetectedError

bench = Benchmark("summariser_quality")

# Check for regression against historical baseline
report = bench.check_regression(tolerance=0.05, window=10)
if report.detected:
    print("Quality regression detected!")
    for metric, delta in report.deltas.items():
        print(f"  {metric}: {delta:+.4f}")

# Or raise an exception for CI integration
try:
    bench.assert_no_regression(tolerance=0.05)
except RegressionDetectedError as e:
    print(f"CI failure: {e}")
    raise SystemExit(1)  # reliable in scripts; the bare exit() helper is not always available
```
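
One plausible reading of `tolerance` and `window` is sketched below: the latest score is compared with the mean of the preceding `window` scores, and a drop larger than `tolerance` counts as a regression. This is an assumption about the semantics, not a description of Veritext's actual algorithm, and both function names are hypothetical.

```python
from statistics import mean
from typing import Sequence


def regression_delta(history: Sequence[float], window: int) -> float:
    """Latest score minus the mean of up to `window` scores before it."""
    if len(history) < 2:
        return 0.0  # nothing to compare against
    *previous, latest = history
    baseline = mean(previous[-window:])
    return latest - baseline


def is_regression(history: Sequence[float], tolerance: float, window: int) -> bool:
    # Assumed semantics: an absolute drop larger than `tolerance` is a regression.
    return regression_delta(history, window) < -tolerance


rouge_l_history = [0.60, 0.62, 0.61, 0.50]  # oldest first, made-up values
print(f"delta: {regression_delta(rouge_l_history, window=3):+.4f}")
print(f"regression: {is_regression(rouge_l_history, tolerance=0.05, window=3)}")
```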

### Viewing History

```python
from veritext.benchmark import Benchmark

bench = Benchmark("summariser_quality")

for run in bench.get_history(limit=10):
    print(f"{run.timestamp}: rouge_l={run.metrics.get('rouge_l', 0):.4f}")
```

## CLI

Veritext provides a command-line interface for validation and benchmarking.

### Validate Text

```bash
# Inline validation
veritext validate "Candidate text" -r "Reference text" -m bleu,rouge

# File-based batch validation (JSONL with "candidate" and "reference" fields)
veritext validate -f outputs.jsonl -m bleu,rouge,lexical

# With threshold for pass/fail
veritext validate "Text" -r "Reference" -m bleu -t 0.5 -o simple

# Output formats: table (default), json, simple
veritext validate "Text" -r "Reference" -m bleu -o json
```

### Benchmark Commands

```bash
# Run a benchmark evaluation
veritext benchmark run my_bench -f data.jsonl -m rouge_l,bleu4

# View benchmark history
veritext benchmark show my_bench --last 10

# Check for regression (exits with code 1 if detected)
veritext benchmark check my_bench --tolerance 0.05 --window 10
```

### JSONL Format

For file-based operations, use JSONL with `candidate` and `reference` fields:

```json
{"candidate": "Model output 1", "reference": "Expected output 1"}
{"candidate": "Model output 2", "reference": "Expected output 2"}
```
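
If you want to check a file before passing it to the CLI, the format can be parsed with the standard library alone. The sketch below is independent of Veritext: `parse_jsonl` is a hypothetical helper that reads one JSON object per line and reports the line number of any record missing a required field.

```python
import json


def parse_jsonl(text: str) -> list[tuple[str, str]]:
    """Return (candidate, reference) pairs from JSONL text, skipping blank lines."""
    pairs = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        record = json.loads(line)
        try:
            pairs.append((record["candidate"], record["reference"]))
        except KeyError as exc:
            raise ValueError(f"line {line_number}: missing field {exc}") from exc
    return pairs


sample = (
    '{"candidate": "Model output 1", "reference": "Expected output 1"}\n'
    '{"candidate": "Model output 2", "reference": "Expected output 2"}\n'
)
for candidate, reference in parse_jsonl(sample):
    print(f"{candidate!r} vs {reference!r}")
```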

## Configuration

Veritext uses environment variables for configuration:

| Variable | Default | Description |
|----------|---------|-------------|
| `VERITEXT_LOG_LEVEL` | `INFO` | Logging level |
| `VERITEXT_LOG_FORMAT` | `console` | Log format (`console` or `json`) |
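
The sketch below shows how settings like these are typically resolved, with the documented defaults applied when a variable is unset. It is illustrative only: `read_settings` is a hypothetical function, and the case normalisation and validation shown are assumptions rather than Veritext's actual behaviour.

```python
import os
from typing import Mapping, Optional


def read_settings(environ: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Resolve logging settings from environment variables, falling back to defaults."""
    env = os.environ if environ is None else environ
    level = env.get("VERITEXT_LOG_LEVEL", "INFO").upper()
    log_format = env.get("VERITEXT_LOG_FORMAT", "console").lower()
    if log_format not in {"console", "json"}:
        raise ValueError(f"unsupported log format: {log_format!r}")
    return {"level": level, "format": log_format}


# An explicit mapping stands in for the process environment here.
print(read_settings({}))
print(read_settings({"VERITEXT_LOG_LEVEL": "debug", "VERITEXT_LOG_FORMAT": "json"}))
```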

## Development

### Setup

```bash
git clone https://gitea.kschappell.com/kschappell/veritext.git
cd veritext
uv sync --all-extras
```

### Quality Checks

```bash
# Linting
uv run ruff check .

# Formatting
uv run ruff format --check .

# Type checking
uv run mypy src/

# Tests
uv run pytest
```

### Running Examples

```bash
uv run python examples/basic_validation.py
uv run pytest examples/chatbot_testing.py -v
uv run python examples/benchmark_regression.py
```

## Licence

MIT