refactor: CLI cleanup and documentation updates

- Refactor CLI metric computation to eliminate code duplication
- Update version format to PEP 440 compliance (0.1.0.dev0)
- Cache Settings instance via @lru_cache for performance
- Document composite validators' protocol deviation
- Consolidate redundant empty checks in ROUGE-L computation
- Add Phase 10 (Portfolio Demos) to implementation plan
fix(pytest-plugin): remove duplicate plugin registration in tests
2026-02-04 15:38:46 +00:00 · 2026-02-04 00:43:20 +00:00 · 2026-02-04 00:23:06 +00:00 · 2026-02-04 00:22:57 +00:00 · 2026-02-04 00:22:47 +00:00 · 2026-02-03 21:31:48 +00:00
49 changed files with 4955 additions and 39 deletions
@@ -83,6 +83,11 @@ Each layer depends only on layers below it.

 ## Git Workflow

+### Before Starting Work
+
+When starting work from a plan, create a new branch matching the plan's scope before
+making any changes. Do not reuse an existing branch from previous work, even if related.
+
 ### Commits

 - Format: `type(scope): description`
@@ -7,27 +7,108 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 ## [Unreleased]

+### Changed
+
+- Refactored CLI metric computation to eliminate code duplication
+- Version format updated from `0.1.0-dev` to `0.1.0.dev0` (PEP 440 compliance)
+- Settings instance is now cached via `@lru_cache` for better performance
+- Documented composite validators' intentional deviation from `Check` protocol return type
+
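The `@lru_cache` entry above can be illustrated with a minimal sketch. It assumes a zero-argument accessor; `Settings`, `get_settings()`, and the `VERITEXT_LOG_LEVEL` variable are hypothetical stand-ins, not the project's actual names.

```python
import os
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class Settings:
    """Hypothetical stand-in for the project's pydantic-settings class."""

    log_level: str = "INFO"


@lru_cache(maxsize=1)
def get_settings() -> Settings:
    """Build Settings once; later calls return the same cached instance."""
    # VERITEXT_LOG_LEVEL is an assumed variable name, used for illustration.
    return Settings(log_level=os.environ.get("VERITEXT_LOG_LEVEL", "INFO"))


first = get_settings()
second = get_settings()
print(first is second)  # True: the environment is read only on the first call

# Tests that change the environment need to reset the cache explicitly.
get_settings.cache_clear()
```

One consequence of this pattern is that configuration changes made after the first call are not picked up until `cache_clear()` is called.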
+### Fixed
+
+- Consolidated redundant empty checks in ROUGE-L computation
+- Fixed README example using incorrect property names (`grade_level` → `flesch_kincaid_grade`, `reading_ease` → `flesch_reading_ease`)
+- Fixed potential crash in ROUGE metric when all references are empty after tokenisation
+- Fixed potential division by zero in readability metric when text has no sentence endings
+- Fixed unbounded cache growth in `SemanticSimilarity` by implementing LRU eviction with configurable max size
+- Fixed mutable list aliasing in `AllOf` and `AnyOf` composite validators
+- Fixed regex pattern validation in `ContainsValidator` and `ExcludesValidator` to fail at init time rather than during `check()`
+- Fixed pytest plugin tests failing with duplicate plugin registration error
+
+### Documentation
+
+- Added Phase 10 (Portfolio Demos) to implementation plan: Streamlit demo and Jupyter notebooks
+- Updated project plan with portfolio demo section
+
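The `SemanticSimilarity` entry above describes LRU eviction with a configurable maximum size. The sketch below shows one common way to bound a cache like that, using `collections.OrderedDict`. `LruCache` and its methods are hypothetical and are not taken from the Veritext source.

```python
from collections import OrderedDict
from typing import Optional


class LruCache:
    """Hypothetical bounded cache; not Veritext's actual implementation."""

    def __init__(self, max_size: int = 1000) -> None:
        if max_size < 1:
            raise ValueError("max_size must be at least 1")
        self._max_size = max_size
        self._items: OrderedDict = OrderedDict()

    def get(self, key: str) -> Optional[list]:
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key: str, value: list) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self._max_size:
            self._items.popitem(last=False)  # evict least recently used

    def __len__(self) -> int:
        return len(self._items)


cache = LruCache(max_size=2)
cache.put("a", [0.1])
cache.put("b", [0.2])
cache.get("a")         # "a" becomes most recently used
cache.put("c", [0.3])  # evicts "b"
print(cache.get("b"))  # None
print(len(cache))      # 2
```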
 ### Added

+- Added `.score` property to `LexicalResult` for API consistency with other result types
+- Added `cache_max_size` parameter to `SemanticSimilarity` (default: 1000 embeddings)
+- Added test coverage for `core/config.py` and `core/logging.py` modules
+
+## [0.1.0] — 2026-02-03
+
+Initial release of Veritext, a semantic text validation framework for Python.
+
+### Added
+
+#### Core
+
 - Project scaffold with pyproject.toml and development tooling
 - Core exception hierarchy (`VeritextError` and subclasses)
 - Core types: `ValidationContext`, `CheckResult`, `ValidationResult`
 - Word tokeniser with Unicode normalisation support
 - Configuration module with pydantic-settings
 - Structured logging with structlog
+
+#### Metrics
+
 - Metrics module with `Metric` protocol, `AggregateStats`, and `BatchResult` types
 - BLEU metric implementation (BLEU-1 through BLEU-4 with brevity penalty)
-- Lexical similarity metric (Jaccard similarity and token overlap)
 - ROUGE metric (ROUGE-1, ROUGE-2, ROUGE-L with precision/recall/F-measure)
+- Lexical similarity metric (Jaccard similarity and token overlap)
 - Flesch-Kincaid readability metrics (grade level and reading ease)
 - Batch scoring with aggregate statistics for all metrics
+
+#### Validators
+
 - Validators module with `Check` protocol for validation checks
 - Metric-based validators: `BleuValidator`, `RougeValidator`, `LexicalValidator`
 - Constraint validators: `LengthValidator`, `ReadabilityValidator`, `ContainsValidator`, `ExcludesValidator`
 - Composite validators: `AllOf` (all checks must pass), `AnyOf` (any check must pass)
 - Factory functions for clean validator API (`bleu()`, `rouge()`, `lexical()`, `length()`, `readability()`, `contains()`, `excludes()`, `all_of()`, `any_of()`)
+
+#### Semantic Similarity
+
 - Semantic similarity module with embedding-based text comparison (requires `veritext[semantic]` extra)
 - `SemanticSimilarity` metric using sentence-transformers for semantic relatedness
 - `SemanticValidator` for threshold-based semantic similarity validation
 - `semantic()` factory function for creating semantic validators
 - Embedding caching for performance optimisation in repeated comparisons
+
+#### Pytest Plugin
+
+- Native pytest plugin for CI/CD integration (entry point: `pytest11`)
+- `validate_text()` assertion function for expressive test assertions
+- `text_validation` marker for filtering validation tests
+- Pytest fixtures: `text_validator` factory and `validation_context` helper
+- Detailed failure messages with text preview and check diagnostics
+
+#### Benchmarking
+
+- Benchmark module for quality tracking and regression detection
+- `Benchmark` class for evaluating text quality over time with metric storage
+- `BenchmarkRun` and `RegressionReport` data models for tracking runs
+- SQLite storage backend with WAL mode for concurrent access
+- Rolling window baseline computation for historical comparison
+- `check_regression()` for statistical comparison against baseline
+- `assert_no_regression()` raises `RegressionDetectedError` for CI integration
+- Customisable tolerance threshold and window size for regression detection
+- Metadata support for tracking git SHA, model versions, etc.
+
+#### CLI
+
+- Command-line interface (CLI) via `veritext` command
+- `veritext validate` command for inline and file-based text validation
+- JSONL input format support for batch validation (`--file` option)
+- Separate candidate/reference file support (`--reference-file` option)
+- Multiple output formats: table (default), JSON, and simple text
+- `veritext benchmark run` command for running evaluations and storing results
+- `veritext benchmark show` command for viewing benchmark history
+- `veritext benchmark check` command for regression detection with exit code 1 on failure
+- Rich-formatted terminal output with tables and coloured panels
+
+#### Documentation
+
+- Comprehensive readme with usage examples
+- Example scripts: basic validation, chatbot testing, benchmark regression
@@ -871,6 +871,59 @@ uv run pytest --cov=src/veritext --cov-report=term-missing

 ---

+### Phase 10: Portfolio Demos
+
+**Goal:** Interactive demos for showcasing Veritext without installation.
+
+**Step 1 — Streamlit Demo:**
+
+Build a quick interactive web UI for general visitors.
+
+- [ ] Create `demo/streamlit_app.py`
+- [ ] Text input boxes (candidate + reference)
+- [ ] Metric selector (BLEU, ROUGE, lexical, readability)
+- [ ] Threshold sliders for pass/fail validation
+- [ ] Results table with scores and status
+- [ ] Deploy to homeserver (e.g., `veritext.kschappell.com`)
+
+**Step 2 — Jupyter Notebook Collection:**
+
+Deep-dive notebooks targeting data science and ML recruiters.
+
+- [ ] Create `notebooks/` directory
+- [ ] `01-metrics-overview.ipynb` — Introduction to each metric with visualisations
+- [ ] `02-batch-evaluation.ipynb` — Evaluating model outputs at scale
+- [ ] `03-regression-detection.ipynb` — Tracking quality over time
+- [ ] `04-chatbot-validation.ipynb` — Real-world use case
+
+**Step 3 — JupyterLite Deployment:**
+
+Host notebooks as static files running in the browser.
+
+- [ ] Configure JupyterLite build with veritext pre-installed
+- [ ] Bundle notebooks into static site
+- [ ] Deploy alongside Streamlit demo
+
+**Files:**
+- `demo/streamlit_app.py`
+- `notebooks/01-metrics-overview.ipynb`
+- `notebooks/02-batch-evaluation.ipynb`
+- `notebooks/03-regression-detection.ipynb`
+- `notebooks/04-chatbot-validation.ipynb`
+- `notebooks/jupyterlite-config.json`
+
+**Verification:**
+```bash
+# Streamlit
+uv run streamlit run demo/streamlit_app.py
+
+# JupyterLite (local preview)
+jupyter lite build --contents notebooks/
+jupyter lite serve
+```
+
+---
+
 ## Dependencies

 ```toml
@@ -488,3 +488,47 @@ benchmark.assert_no_regression(tolerance=0.03)

 5. **Natural portfolio narrative** — "I was building X and needed a better way to test
   it, so I built this tool." Every interviewer has faced similar problems.
+
+---
+
+## Portfolio Demos (Future)
+
+Interactive demos to showcase Veritext without requiring installation.
+
+### Streamlit Demo
+
+A quick interactive web UI for general visitors and recruiters.
+
+**Features:**
+- Text input boxes (candidate + reference)
+- Metric selector (BLEU, ROUGE, lexical, readability)
+- Threshold sliders for pass/fail validation
+- Results table with scores and status
+
+**Deployment:** Self-hosted on homeserver (e.g., `veritext.kschappell.com`)
+
+**Effort:** ~30 minutes
+
+### Jupyter Notebook Collection
+
+Deep-dive notebooks targeting data science and ML recruiters.
+
+**Notebooks:**
+
+| Notebook | Purpose |
+|----------|---------|
+| `01-metrics-overview.ipynb` | Introduction to each metric with visualisations |
+| `02-batch-evaluation.ipynb` | Evaluating model outputs at scale, statistical analysis |
+| `03-regression-detection.ipynb` | Tracking quality over time, detecting degradation |
+| `04-chatbot-validation.ipynb` | Real-world use case: validating chatbot responses |
+
+**Hosting:** JupyterLite (static files, runs in browser via WebAssembly)
+
+**Deployment:** Self-hosted alongside Streamlit demo
+
+**Why both:**
+
+| Demo Type | Audience | Value |
+|-----------|----------|-------|
+| Streamlit | General visitors | Quick, interactive, no friction |
+| Notebooks | Data/ML recruiters | Shows analytical depth, speaks their language |
@@ -0,0 +1,135 @@
+"""Basic text validation examples.
+
+Demonstrates core Veritext functionality:
+- Single metric scoring (BLEU, ROUGE)
+- Validator usage with thresholds
+- Composite validators (all_of, any_of)
+- Constraint validators (length, readability)
+"""
+
+from veritext.core.types import ValidationContext
+from veritext.metrics import Bleu, Rouge
+from veritext.validators import (
+    all_of,
+    any_of,
+    bleu,
+    contains,
+    excludes,
+    length,
+    readability,
+    rouge,
+)
+
+
+def metric_scoring_example() -> None:
+    """Score text using individual metrics."""
+    candidate = "The quick brown fox jumps over the lazy dog."
+    reference = "A fast brown fox leaps over a sleepy dog."
+
+    # BLEU scoring (translation quality)
+    bleu_metric = Bleu()
+    bleu_result = bleu_metric.score(candidate, reference)
+    print("BLEU Scores:")
+    print(f"  BLEU-1: {bleu_result.bleu1:.3f}")
+    print(f"  BLEU-4: {bleu_result.bleu4:.3f}")
+    print(f"  Brevity penalty: {bleu_result.brevity_penalty:.3f}")
+
+    # ROUGE scoring (summary quality)
+    rouge_metric = Rouge()
+    rouge_result = rouge_metric.score(candidate, reference)
+    print("\nROUGE Scores:")
+    print(f"  ROUGE-1 F1: {rouge_result.rouge1.fmeasure:.3f}")
+    print(f"  ROUGE-L F1: {rouge_result.rouge_l.fmeasure:.3f}")
+
+
+def validator_example() -> None:
+    """Use validators to make pass/fail decisions."""
+    reference = "Machine learning models require training data."
+    candidate = "ML models need training data to learn patterns."
+
+    context = ValidationContext(reference=reference)
+
+    # BLEU validator with minimum threshold
+    bleu_validator = bleu(min_score=0.3)
+    result = bleu_validator.check(candidate, context)
+    print(f"\nBLEU validation (min 0.3): {'PASS' if result.passed else 'FAIL'}")
+
+    # ROUGE validator
+    rouge_validator = rouge(min_score=0.5)
+    result = rouge_validator.check(candidate, context)
+    print(f"ROUGE validation (min 0.5): {'PASS' if result.passed else 'FAIL'}")
+
+
+def composite_validator_example() -> None:
+    """Combine validators with all_of and any_of."""
+    reference = "The product launch exceeded all expectations."
+    candidate = "The product release performed beyond expectations."
+
+    context = ValidationContext(reference=reference)
+
+    # All checks must pass
+    strict_validator = all_of(
+        [
+            bleu(min_score=0.2),
+            rouge(min_score=0.4),
+            length(max_chars=100),
+        ]
+    )
+    result = strict_validator.check(candidate, context)
+    print(f"\nStrict (all_of): {'PASS' if result.passed else 'FAIL'}")
+    if not result.passed:
+        print(f"  Failures: {result.failure_summary}")
+
+    # At least one check must pass
+    flexible_validator = any_of(
+        [
+            bleu(min_score=0.8),  # Unlikely to pass
+            rouge(min_score=0.4),  # More likely
+        ]
+    )
+    result = flexible_validator.check(candidate, context)
+    print(f"Flexible (any_of): {'PASS' if result.passed else 'FAIL'}")
+
+
+def constraint_validator_example() -> None:
+    """Use constraint validators for text properties."""
+    text = "This short guide explains the basics clearly."
+    context = ValidationContext()  # No reference needed for constraints
+
+    # Length constraints
+    length_validator = length(min_chars=20, max_chars=100, min_words=5, max_words=20)
+    result = length_validator.check(text, context)
+    print(f"\nLength check: {'PASS' if result.passed else 'FAIL'}")
+
+    # Readability (Flesch-Kincaid)
+    readability_validator = readability(max_grade=10.0)
+    result = readability_validator.check(text, context)
+    print(f"Readability (grade <= 10): {'PASS' if result.passed else 'FAIL'}")
+
+    # Content patterns
+    contains_validator = contains(patterns=["guide", "basics"])
+    result = contains_validator.check(text, context)
+    print(f"Contains required terms: {'PASS' if result.passed else 'FAIL'}")
+
+    excludes_validator = excludes(patterns=["error", "warning"])
+    result = excludes_validator.check(text, context)
+    print(f"Excludes forbidden terms: {'PASS' if result.passed else 'FAIL'}")
+
+
+def main() -> None:
+    """Run all examples."""
+    print("=" * 60)
+    print("Veritext Basic Validation Examples")
+    print("=" * 60)
+
+    metric_scoring_example()
+    validator_example()
+    composite_validator_example()
+    constraint_validator_example()
+
+    print("\n" + "=" * 60)
+    print("All examples completed.")
+
+
+if __name__ == "__main__":
+    main()
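The `all_of` and `any_of` calls in this example each receive a list of checks, and the changelog mentions a fix for mutable list aliasing in `AllOf` and `AnyOf`. A minimal sketch of that kind of fix follows, assuming the composite simply copies its input. `AllOfSketch` and the simplified `Check` alias are hypothetical and ignore `ValidationContext`.

```python
from typing import Callable, Sequence

Check = Callable[[str], bool]  # simplified stand-in for the Check protocol


class AllOfSketch:
    """Hypothetical composite: passes only if every check passes."""

    def __init__(self, checks: Sequence[Check]) -> None:
        # Copy into a tuple so later mutation of the caller's list
        # cannot change this validator's behaviour.
        self._checks = tuple(checks)

    def check(self, text: str) -> bool:
        return all(c(text) for c in self._checks)


checks = [lambda t: len(t) <= 100]
validator = AllOfSketch(checks)
checks.append(lambda t: False)  # mutating the original list afterwards
print(validator.check("short text"))  # True: the validator kept its own copy
```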
@@ -0,0 +1,160 @@
+"""Benchmark quality tracking with regression detection.
+
+Demonstrates Veritext's benchmark module for CI integration:
+- Creating a benchmark suite
+- Running evaluations and storing results
+- Checking for quality regression
+- CI integration pattern with exit codes
+"""
+
+import tempfile
+from pathlib import Path
+
+from veritext.benchmark import Benchmark
+from veritext.core.exceptions import RegressionDetectedError
+
+
+def create_sample_data() -> tuple[list[str], list[str]]:
+    """Create sample candidate/reference pairs for benchmarking."""
+    # Simulated summarisation outputs and references
+    candidates = [
+        "The new policy aims to reduce carbon emissions by 50% by 2030.",
+        "Scientists discovered a new species of deep-sea fish.",
+        "The company reported record profits in the third quarter.",
+        "Researchers developed a breakthrough treatment for the disease.",
+        "The city plans to expand public transportation routes.",
+    ]
+    references = [
+        "The policy targets a 50% reduction in carbon emissions by 2030.",
+        "A new deep-sea fish species was discovered by marine biologists.",
+        "Record profits were announced by the company for Q3.",
+        "A breakthrough disease treatment was developed by researchers.",
+        "Public transport expansion is planned for the city.",
+    ]
+    return candidates, references
+
+
+def run_benchmark_example() -> None:
+    """Run a benchmark evaluation and view results."""
+    # Use a temp directory for this example
+    with tempfile.TemporaryDirectory() as tmpdir:
+        storage_path = Path(tmpdir) / "benchmarks"
+
+        # Create benchmark suite
+        bench = Benchmark("summariser_quality", storage_path=storage_path)
+
+        candidates, references = create_sample_data()
+
+        # Run evaluation
+        print("Running benchmark evaluation...")
+        run = bench.evaluate(
+            candidates=candidates,
+            references=references,
+            metrics=["rouge_l", "bleu4"],
+            metadata={"model": "v1.0", "dataset": "test"},
+        )
+
+        print("\nBenchmark run completed:")
+        print(f"  Run ID: {run.id[:8]}...")
+        print(f"  Samples: {run.sample_count}")
+        print("  Metrics:")
+        for name, value in run.metrics.items():
+            print(f"    {name}: {value:.4f}")
+
+
+def regression_detection_example() -> None:
+    """Demonstrate regression detection with historical comparison."""
+    with tempfile.TemporaryDirectory() as tmpdir:
+        storage_path = Path(tmpdir) / "benchmarks"
+        bench = Benchmark("summariser_quality", storage_path=storage_path)
+
+        candidates, references = create_sample_data()
+
+        # Simulate historical runs with stable quality
+        print("\nBuilding baseline with historical runs...")
+        for i in range(5):
+            bench.evaluate(
+                candidates=candidates,
+                references=references,
+                metrics=["rouge_l", "bleu4"],
+                metadata={"run": f"baseline_{i}"},
+            )
+            print(f"  Baseline run {i + 1} recorded")
+
+        # Check regression (no degradation expected)
+        report = bench.check_regression(tolerance=0.05, window=5)
+        print(f"\nRegression check: {'DETECTED' if report.detected else 'NONE'}")
+
+        # Simulate a degraded model
+        print("\nSimulating degraded model output...")
+        degraded_candidates = [
+            "Policy carbon emissions.",  # Much shorter/worse
+            "Fish discovered.",
+            "Company profits.",
+            "Treatment developed.",
+            "Transport expansion.",
+        ]
+        bench.evaluate(
+            candidates=degraded_candidates,
+            references=references,
+            metrics=["rouge_l", "bleu4"],
+            metadata={"model": "v1.1-broken"},
+        )
+
+        # Check regression (should detect)
+        report = bench.check_regression(tolerance=0.05, window=5)
+        print(f"Regression check: {'DETECTED' if report.detected else 'NONE'}")
+        if report.detected:
+            print("\nRegression details:")
+            for metric, delta in report.deltas.items():
+                baseline = report.baseline.get(metric, 0)
+                current = report.current.get(metric, 0)
+                print(f"  {metric}: {baseline:.4f} -> {current:.4f} ({delta:+.4f})")
+
+
+def ci_integration_example() -> None:
+    """CI integration pattern using assert_no_regression()."""
+    with tempfile.TemporaryDirectory() as tmpdir:
+        storage_path = Path(tmpdir) / "benchmarks"
+        bench = Benchmark("ci_check", storage_path=storage_path)
+
+        candidates, references = create_sample_data()
+
+        # Build baseline
+        for _ in range(3):
+            bench.evaluate(candidates, references, metrics=["rouge_l"])
+
+        # Simulate CI check
+        print("\n" + "=" * 50)
+        print("CI Integration Example")
+        print("=" * 50)
+
+        print("\nRunning evaluation...")
+        bench.evaluate(candidates, references, metrics=["rouge_l"])
+
+        print("Checking for regression...")
+        try:
+            bench.assert_no_regression(tolerance=0.05, window=3)
+            print("No regression detected.")
+            print("CI status: EXIT 0")
+        except RegressionDetectedError as e:
+            print(f"Regression detected: {e}")
+            print("CI status: EXIT 1")
+
+
+def main() -> None:
+    """Run all benchmark examples."""
+    print("=" * 60)
+    print("Veritext Benchmark & Regression Detection Examples")
+    print("=" * 60)
+
+    run_benchmark_example()
+    regression_detection_example()
+    ci_integration_example()
+
+    print("\n" + "=" * 60)
+    print("All examples completed.")
+
+
+if __name__ == "__main__":
+    main()
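The example above relies on `check_regression(tolerance=..., window=...)`. How the baseline is computed is not shown in this diff, so the sketch below is only one plausible reading: the latest score is compared with the mean of the preceding `window` scores, using an absolute tolerance and assuming higher is better. `detect_regression` is a hypothetical helper.

```python
from statistics import mean


def detect_regression(
    history: list, tolerance: float = 0.05, window: int = 5
) -> tuple:
    """Compare the latest score with the mean of the preceding window.

    Assumes higher scores are better and an absolute tolerance; both are
    illustrative choices, not confirmed Veritext behaviour.
    """
    if len(history) < 2:
        return False, 0.0  # nothing to compare against yet
    *previous, current = history
    baseline = mean(previous[-window:])
    delta = current - baseline
    return delta < -tolerance, delta


stable = [0.62, 0.61, 0.63, 0.62, 0.62]
print(detect_regression(stable + [0.61]))  # small dip, within tolerance
print(detect_regression(stable + [0.30]))  # large drop, flagged
```

Excluding the latest run from its own baseline keeps a sudden drop from diluting the comparison; whether Veritext does the same is an assumption here.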
@@ -0,0 +1,140 @@
+"""Pytest integration for chatbot testing.
+
+Demonstrates Veritext's pytest plugin for testing chatbot responses:
+- validate_text() assertion function
+- Custom test fixtures
+- Test organisation with markers
+"""
+
+import pytest
+
+from veritext.pytest_plugin import validate_text
+
+# Sample chatbot responses for testing
+CHATBOT_RESPONSES = {
+    "greeting": {
+        "input": "Hello!",
+        "response": "Hi there! How can I help you today?",
+        "expected_keywords": ["help", "hi"],
+    },
+    "weather": {
+        "input": "What's the weather like?",
+        "response": "I don't have access to real-time weather data, but you can "
+        "check a weather service like weather.com for current conditions.",
+        "expected_keywords": ["weather", "check"],
+    },
+    "farewell": {
+        "input": "Goodbye!",
+        "response": "Goodbye! Have a great day!",
+        "expected_keywords": ["goodbye", "day"],
+    },
+}
+
+
+# Fixtures for common test setup
+@pytest.fixture
+def greeting_response() -> str:
+    """Provide a sample greeting response."""
+    return CHATBOT_RESPONSES["greeting"]["response"]
+
+
+@pytest.fixture
+def weather_response() -> str:
+    """Provide a sample weather response."""
+    return CHATBOT_RESPONSES["weather"]["response"]
+
+
+# Basic validation tests
+class TestResponseQuality:
+    """Test chatbot response quality using Veritext."""
+
+    def test_greeting_length(self, greeting_response: str) -> None:
+        """Greeting responses should be concise."""
+        validate_text(
+            greeting_response,
+            min_length=10,
+            max_length=100,
+        )
+
+    def test_greeting_readability(self, greeting_response: str) -> None:
+        """Greeting responses should be easy to read."""
+        validate_text(
+            greeting_response,
+            max_reading_grade=8.0,
+        )
+
+    def test_greeting_contains_keywords(self, greeting_response: str) -> None:
+        """Greeting should contain expected terms."""
+        validate_text(
+            greeting_response,
+            must_contain=["help"],
+        )
+
+    def test_weather_response_quality(self, weather_response: str) -> None:
+        """Weather response should be informative and readable."""
+        validate_text(
+            weather_response,
+            min_length=50,
+            max_length=500,
+            max_reading_grade=10.0,
+            must_contain=["weather"],
+        )
+
+
+# Tests with reference comparison
+class TestResponseSimilarity:
+    """Test response similarity against reference texts."""
+
+    def test_greeting_similarity(self) -> None:
+        """Greeting should match expected style."""
+        reference = "Hello! How may I assist you today?"
+        response = CHATBOT_RESPONSES["greeting"]["response"]
+
+        validate_text(
+            response,
+            reference=reference,
+            min_rouge=0.3,  # Allow variation in wording
+            min_length=10,
+        )
+
+    def test_farewell_similarity(self) -> None:
+        """Farewell should match expected style."""
+        reference = "Goodbye! Have a wonderful day!"
+        response = CHATBOT_RESPONSES["farewell"]["response"]
+
+        validate_text(
+            response,
+            reference=reference,
+            min_rouge=0.5,
+            must_contain=["goodbye"],
+        )
+
+
+# Content safety tests
+class TestContentSafety:
+    """Test responses for inappropriate content."""
+
+    @pytest.mark.parametrize("response_key", ["greeting", "weather", "farewell"])
+    def test_no_profanity(self, response_key: str) -> None:
+        """Responses should not contain profanity."""
+        response = CHATBOT_RESPONSES[response_key]["response"]
+        validate_text(
+            response,
+            must_exclude=["damn", "hell", "crap"],
+            min_length=1,
+        )
+
+    @pytest.mark.parametrize("response_key", ["greeting", "weather", "farewell"])
+    def test_no_harmful_content(self, response_key: str) -> None:
+        """Responses should not contain harmful instructions."""
+        response = CHATBOT_RESPONSES[response_key]["response"]
+        validate_text(
+            response,
+            must_exclude=["hack", "exploit", "attack"],
+            min_length=1,
+        )
+
+
+# Run tests when executed directly
+if __name__ == "__main__":
+    pytest.main([__file__, "-v"])
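The tests above pass pattern lists through `must_contain` and `must_exclude`, and the changelog notes that invalid regex patterns now fail at init time rather than during `check()`. The sketch below shows the general idea of compiling patterns in the constructor. `ContainsSketch` is hypothetical, and case-insensitive matching is an assumption suggested by the `"goodbye"` test, not confirmed behaviour.

```python
import re


class ContainsSketch:
    """Hypothetical validator that requires every pattern to match."""

    def __init__(self, patterns: list) -> None:
        # re.compile raises re.error here, at construction time, so a bad
        # pattern surfaces before any text is checked.
        self._compiled = [re.compile(p, re.IGNORECASE) for p in patterns]

    def check(self, text: str) -> bool:
        return all(p.search(text) for p in self._compiled)


print(ContainsSketch(["help"]).check("Hi there! How can I help you today?"))  # True

try:
    ContainsSketch(["(unclosed"])
except re.error as exc:
    print(f"Rejected at init: {exc}")
```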
@@ -1,6 +1,6 @@
 [project]
 name = "veritext"
-version = "0.1.0-dev"
+version = "0.1.0.dev0"
 description = "Semantic text validation framework"
 readme = "readme.md"
 requires-python = ">=3.11"
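The version change above moves from `0.1.0-dev` to the normalised PEP 440 spelling `0.1.0.dev0`. As a rough check, the sketch below uses the canonical-form pattern that, to the best of my knowledge, matches the one given in PEP 440's appendix; it is worth verifying against the specification before relying on it. `is_canonical` is an illustrative helper name.

```python
import re

# Canonical-form pattern, believed to follow PEP 440's appendix.
_CANONICAL = re.compile(
    r"^([1-9][0-9]*!)?(0|[1-9][0-9]*)(\.(0|[1-9][0-9]*))*"
    r"((a|b|rc)(0|[1-9][0-9]*))?(\.post(0|[1-9][0-9]*))?"
    r"(\.dev(0|[1-9][0-9]*))?$"
)


def is_canonical(version: str) -> bool:
    """Return True if the version string is already in canonical form."""
    return _CANONICAL.match(version) is not None


print(is_canonical("0.1.0-dev"))   # False
print(is_canonical("0.1.0.dev0"))  # True
```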
@@ -2,48 +2,398 @@

 Semantic text validation framework for Python.

-Validates text outputs against quality criteria using metrics like BLEU, ROUGE,
-and semantic similarity. Designed for developers building systems that produce
-text (chatbots, content generators, summarisation tools) who need automated
-quality assurance beyond simple string matching.
+Veritext validates text outputs against quality criteria using metrics like BLEU,
+ROUGE, and semantic similarity. It is designed for developers building systems that
+produce text (chatbots, content generators, summarisation tools) who need automated
+quality assurance beyond simple string matching.

-## Status
+## Features

-Under active development. See [changelog.md](changelog.md) for progress.
+- **Multiple metrics** — BLEU, ROUGE, lexical similarity, readability, semantic
+  embeddings
+- **Composable validators** — Build complex checks from simple primitives
+- **Native pytest integration** — `validate_text()` assertion for test suites
+- **Quality benchmarking** — Track metrics over time with regression detection
+- **CLI tools** — Command-line validation and benchmark management

 ## Installation

 ```bash
 pip install veritext

-# With semantic similarity support
+# With semantic similarity support (sentence-transformers)
 pip install veritext[semantic]
 ```

 ## Quick Start

 ```python
-from veritext import validators as v
 from veritext.core.types import ValidationContext
+from veritext.validators import all_of, bleu, length, rouge

-# Create validators
-validator = v.all_of([
-    v.bleu(min_score=0.7),
-    v.length(max_chars=500),
+# Create a validator
+validator = all_of([
+    bleu(min_score=0.5),
+    rouge(min_score=0.6),
+    length(max_chars=500),
 ])

 # Validate text
-context = ValidationContext(reference="The cat sat on the mat.")
-result = validator.check("A cat is sitting on the mat.", context)
+context = ValidationContext(reference="The quick brown fox jumps over the lazy dog.")
+result = validator.check("A fast brown fox leaps over a sleepy dog.", context)

-if not result.passed:
+if result.passed:
+    print("Validation passed!")
+else:
     print(result.failure_summary)
 ```

-## Documentation
+## Metrics

-- [Project Plan](docs/project-plan.md)
-- [Implementation Plan](docs/implementation-plan.md)
+Veritext provides several metrics for text evaluation.
+
+### BLEU
+
+Measures n-gram precision against reference text. Useful for translation and
+generation quality.
+
+```python
+from veritext.metrics import Bleu
+
+bleu = Bleu()
+result = bleu.score(
+    candidate="The cat sat on the mat.",
+    reference="A cat is sitting on the mat.",
+)
+print(f"BLEU-4: {result.bleu4:.3f}")  # Combines 1-gram to 4-gram precision
+print(f"BLEU-1: {result.bleu1:.3f}")  # Unigram precision only
+```
+
+### ROUGE
+
+Measures recall-oriented overlap with reference text. Useful for summarisation.
+
+```python
+from veritext.metrics import Rouge
+
+rouge = Rouge()
+result = rouge.score(
+    candidate="Scientists found a new planet.",
+    reference="Researchers discovered a new planet in the solar system.",
+)
+print(f"ROUGE-1 F1: {result.rouge1.fmeasure:.3f}")  # Unigram overlap
+print(f"ROUGE-L F1: {result.rouge_l.fmeasure:.3f}")  # Longest common subsequence
+```
+
+### Lexical Similarity
+
+Measures token overlap using Jaccard similarity.
+
+```python
+from veritext.metrics import Lexical
+
+lexical = Lexical()
+result = lexical.score(
+    candidate="The quick brown fox",
+    reference="The fast brown fox",
+)
+print(f"Jaccard: {result.jaccard:.3f}")
+print(f"Token overlap: {result.token_overlap:.3f}")
+```
+
+### Readability
+
+Computes Flesch-Kincaid scores for text complexity.
+
+```python
+from veritext.metrics import Readability
+
+readability = Readability()
+result = readability.score("This is a simple sentence.")
+print(f"Grade level: {result.flesch_kincaid_grade:.1f}")
+print(f"Reading ease: {result.flesch_reading_ease:.1f}")
+```
+
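For reference, the two scores follow the standard Flesch formulas. The sketch below computes them from raw counts and treats text with no sentence-ending punctuation as a single sentence, which is one way to avoid the division by zero mentioned in the changelog. `flesch_scores` is a hypothetical helper, and the zero-word fallback is an assumption.

```python
def flesch_scores(words: int, sentences: int, syllables: int) -> tuple:
    """Return (reading_ease, grade) from raw counts using the standard formulas."""
    if words == 0:
        return 0.0, 0.0  # assumption: empty text scores zero
    # Treat text with no sentence-ending punctuation as one sentence,
    # which avoids dividing by zero.
    sentences = max(sentences, 1)
    words_per_sentence = words / sentences
    syllables_per_word = syllables / words
    ease = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    grade = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return ease, grade


ease, grade = flesch_scores(words=5, sentences=0, syllables=8)
print(f"ease={ease:.1f} grade={grade:.1f}")
```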
+### Semantic Similarity (Optional)
+
+Requires `pip install veritext[semantic]`.
+
+```python
+from veritext.semantic import SemanticSimilarity
+
+semantic = SemanticSimilarity()
+result = semantic.score(
+    candidate="The dog is running in the park.",
+    reference="A canine is jogging through the garden.",
+)
+print(f"Similarity: {result.score:.3f}")
+```
+
+## Validators
+
+Validators wrap metrics with thresholds to make pass/fail decisions.
+
+### Metric-Based Validators
+
+```python
+from veritext.core.types import ValidationContext
+from veritext.validators import bleu, lexical, rouge
+
+context = ValidationContext(reference="Reference text here.")
+
+# BLEU validation
+validator = bleu(min_score=0.5, variant=4)  # BLEU-4
+result = validator.check("Candidate text here.", context)
+
+# ROUGE validation
+validator = rouge(min_score=0.6, variant="l")  # ROUGE-L
+result = validator.check("Candidate text here.", context)
+
+# Lexical validation
+validator = lexical(min_jaccard=0.3, min_overlap=0.5)
+result = validator.check("Candidate text here.", context)
+```
+
+### Constraint Validators
+
+These don't require reference text.
+
+```python
+from veritext.core.types import ValidationContext
+from veritext.validators import contains, excludes, length, readability
+
+context = ValidationContext()  # No reference needed
+
+# Length constraints
+validator = length(min_chars=50, max_chars=500, min_words=10)
+result = validator.check("Your text here...", context)
+
+# Readability constraints
+validator = readability(max_grade=8.0, min_ease=60.0)
+result = validator.check("Your text here...", context)
+
+# Content requirements
+validator = contains(patterns=["important", "keyword"])
+result = validator.check("This important text has a keyword.", context)
+
+# Content exclusions
+validator = excludes(patterns=["forbidden", "banned"])
+result = validator.check("This text is clean.", context)
+```
+
+### Composite Validators
+
+Combine multiple checks with logical operators.
+
+```python
+from veritext.validators import all_of, any_of, bleu, length, rouge
+
+# All checks must pass
+validator = all_of([
+    bleu(min_score=0.5),
+    rouge(min_score=0.6),
+    length(max_chars=500),
+])
+
+# At least one check must pass
+validator = any_of([
+    bleu(min_score=0.7),
+    rouge(min_score=0.7),
+])
+```
+
+## Pytest Plugin
+
+Veritext provides native pytest integration for testing text quality.
+
+### Basic Usage
+
+```python
+from veritext.pytest_plugin import validate_text
+
+
+def test_response_quality():
+    response = "This is a helpful response to your question."
+
+    validate_text(
+        response,
+        min_length=20,
+        max_length=200,
+        max_reading_grade=10.0,
+        must_contain=["helpful"],
+        must_exclude=["error", "sorry"],
+    )
+
+
+def test_summary_similarity():
+    summary = "Scientists discovered a new planet."
+    reference = "Researchers found a new planet in our solar system."
+
+    validate_text(
+        summary,
+        reference=reference,
+        min_rouge=0.5,
+        min_length=10,
+    )
+```
+
+### Available Parameters
+
+| Parameter | Description |
+|-----------|-------------|
+| `reference` | Reference text for comparison metrics |
+| `min_bleu` | Minimum BLEU-4 score (0.0-1.0) |
+| `min_rouge` | Minimum ROUGE-L F1 score (0.0-1.0) |
+| `min_semantic` | Minimum semantic similarity (0.0-1.0) |
+| `min_length` | Minimum character count |
+| `max_length` | Maximum character count |
+| `max_reading_grade` | Maximum Flesch-Kincaid grade level |
+| `must_contain` | List of required patterns |
+| `must_exclude` | List of forbidden patterns |
+
+## Benchmarking
+
+Track text quality over time and detect regressions.
+
+### Running Benchmarks
+
+```python
+from veritext.benchmark import Benchmark
+
+# Create a benchmark suite
+bench = Benchmark("summariser_quality", storage_path="benchmarks/")
+
+# Evaluate a batch of outputs
+candidates = ["Summary 1...", "Summary 2...", "Summary 3..."]
+references = ["Reference 1...", "Reference 2...", "Reference 3..."]
+
+run = bench.evaluate(
+    candidates=candidates,
+    references=references,
+    metrics=["rouge_l", "bleu4"],
+    metadata={"model": "v1.2", "git_sha": "abc123"},
+)
+
+print(f"ROUGE-L: {run.metrics['rouge_l']:.4f}")
+print(f"BLEU-4: {run.metrics['bleu4']:.4f}")
+```
+
+### Regression Detection
+
+```python
+from veritext.benchmark import Benchmark
+from veritext.core.exceptions import RegressionDetectedError
+
+bench = Benchmark("summariser_quality")
+
+# Check for regression against historical baseline
+report = bench.check_regression(tolerance=0.05, window=10)
+if report.detected:
+    print("Quality regression detected!")
+    for metric, delta in report.deltas.items():
+        print(f"  {metric}: {delta:+.4f}")
+
+# Or raise an exception for CI integration
+try:
+    bench.assert_no_regression(tolerance=0.05)
+except RegressionDetectedError as e:
+    print(f"CI failure: {e}")
+    raise SystemExit(1) from e
+```
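The comparison behind `check_regression` can be sketched in plain Python. This is a simplified stand-in that assumes the behaviour of `compute_baseline` and `detect_regression` in the regression module of this change: a mean over the most recent `window` runs, followed by an absolute comparison against `tolerance`. The names `rolling_baseline` and `find_regressions` and the scores are illustrative only and are not part of the Veritext API.

```python
from collections import defaultdict


def rolling_baseline(
    history: list[dict[str, float]],
    window: int = 10,
) -> dict[str, float]:
    """Average each metric over the most recent `window` runs (newest first)."""
    values: dict[str, list[float]] = defaultdict(list)
    for metrics in history[:window]:
        for name, value in metrics.items():
            values[name].append(value)
    return {name: sum(vals) / len(vals) for name, vals in values.items()}


def find_regressions(
    current: dict[str, float],
    baseline: dict[str, float],
    tolerance: float = 0.05,
) -> dict[str, float]:
    """Return the deltas of metrics that dropped by more than `tolerance`."""
    deltas = {name: current.get(name, 0.0) - base for name, base in baseline.items()}
    return {name: delta for name, delta in deltas.items() if delta < -tolerance}


# Hypothetical scores for illustration only (newest run first).
history = [
    {"rouge_l": 0.80, "bleu4": 0.70},
    {"rouge_l": 0.82, "bleu4": 0.72},
]
current = {"rouge_l": 0.74, "bleu4": 0.70}

baseline = rolling_baseline(history)
regressed = find_regressions(current, baseline)
print(sorted(regressed))
```

With a tolerance of 0.05, the drop of roughly 0.07 in `rouge_l` is flagged, while the drop of roughly 0.01 in `bleu4` stays within tolerance.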
+
+### Viewing History
+
+```python
+from veritext.benchmark import Benchmark
+
+bench = Benchmark("summariser_quality")
+
+for run in bench.get_history(limit=10):
+    print(f"{run.timestamp}: rouge_l={run.metrics.get('rouge_l', 0):.4f}")
+```
+
+## CLI
+
+Veritext provides a command-line interface for validation and benchmarking.
+
+### Validate Text
+
+```bash
+# Inline validation
+veritext validate "Candidate text" -r "Reference text" -m bleu,rouge
+
+# File-based batch validation (JSONL with "candidate" and "reference" fields)
+veritext validate -f outputs.jsonl -m bleu,rouge,lexical
+
+# With threshold for pass/fail
+veritext validate "Text" -r "Reference" -m bleu -t 0.5 -o simple
+
+# Output formats: table (default), json, simple
+veritext validate "Text" -r "Reference" -m bleu -o json
+```
+
+### Benchmark Commands
+
+```bash
+# Run a benchmark evaluation
+veritext benchmark run my_bench -f data.jsonl -m rouge_l,bleu4
+
+# View benchmark history
+veritext benchmark show my_bench --last 10
+
+# Check for regression (exits with code 1 if detected)
+veritext benchmark check my_bench --tolerance 0.05 --window 10
+```
+
+### JSONL Format
+
+For file-based operations, use JSONL with `candidate` and `reference` fields:
+
+```json
+{"candidate": "Model output 1", "reference": "Expected output 1"}
+{"candidate": "Model output 2", "reference": "Expected output 2"}
+```
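A file in this shape can be produced with the standard library alone. The sketch below is a minimal illustration, assuming made-up sample pairs and a temporary output path; it writes one JSON object per line and reads the file back to confirm the round trip.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical sample data; replace with real model outputs and references.
pairs = [
    {"candidate": "Model output 1", "reference": "Expected output 1"},
    {"candidate": "Model output 2", "reference": "Expected output 2"},
]

path = Path(tempfile.mkdtemp()) / "outputs.jsonl"

# Write one JSON object per line.
with path.open("w", encoding="utf-8") as f:
    for pair in pairs:
        f.write(json.dumps(pair, ensure_ascii=False) + "\n")

# Read the file back, skipping blank lines.
with path.open(encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(len(loaded))
```

Writing and reading with an explicit `encoding="utf-8"` keeps non-ASCII text intact regardless of the platform's default encoding.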
+
+## Configuration
+
+Veritext uses environment variables for configuration:
+
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `VERITEXT_LOG_LEVEL` | `INFO` | Logging level |
+| `VERITEXT_LOG_FORMAT` | `console` | Log format (`console` or `json`) |
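The changelog for this change notes that the Settings instance is cached via `@lru_cache`. How environment variables and a cached accessor might fit together can be sketched as follows. The `Settings` dataclass and `get_settings` function here are hypothetical stand-ins, not Veritext's actual classes; only the variable names and defaults come from the table above.

```python
import os
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings container; not Veritext's actual class."""

    log_level: str
    log_format: str


@lru_cache(maxsize=1)
def get_settings() -> Settings:
    """Read the environment once and reuse the result on later calls."""
    return Settings(
        log_level=os.environ.get("VERITEXT_LOG_LEVEL", "INFO"),
        log_format=os.environ.get("VERITEXT_LOG_FORMAT", "console"),
    )


os.environ["VERITEXT_LOG_FORMAT"] = "json"
settings = get_settings()
print(settings.log_format)
```

One consequence of caching in this style is that environment changes made after the first call are not picked up until the cache is cleared, for example with `get_settings.cache_clear()`.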
+
+## Development
+
+### Setup
+
+```bash
+git clone https://gitea.kschappell.com/kschappell/veritext.git
+cd veritext
+uv sync --all-extras
+```
+
+### Quality Checks
+
+```bash
+# Linting
+uv run ruff check .
+
+# Formatting
+uv run ruff format --check .
+
+# Type checking
+uv run mypy src/
+
+# Tests
+uv run pytest
+```
+
+### Running Examples
+
+```bash
+uv run python examples/basic_validation.py
+uv run pytest examples/chatbot_testing.py -v
+uv run python examples/benchmark_regression.py
+```

 ## Licence

@@ -0,0 +1,12 @@
+"""Benchmark module for quality tracking and regression detection."""
+
+from veritext.benchmark.models import BenchmarkRun, RegressionReport
+from veritext.benchmark.runner import Benchmark
+from veritext.benchmark.storage import BenchmarkStorage
+
+__all__ = [
+    "Benchmark",
+    "BenchmarkRun",
+    "BenchmarkStorage",
+    "RegressionReport",
+]
@@ -0,0 +1,72 @@
+"""Benchmark data models."""
+
+from datetime import datetime
+from typing import Any
+
+from pydantic import BaseModel, ConfigDict, Field
+
+
+class BenchmarkRun(BaseModel):
+    """Record of a single benchmark execution."""
+
+    model_config = ConfigDict(frozen=True)
+
+    id: str
+    """UUID for this run."""
+
+    benchmark_name: str
+    """Name identifying this benchmark suite."""
+
+    timestamp: datetime
+    """When the benchmark was executed."""
+
+    veritext_version: str
+    """Version of veritext used."""
+
+    metrics: dict[str, float]
+    """Metric results, e.g. {"rouge_l": 0.82, "bleu4": 0.71}."""
+
+    sample_count: int
+    """Number of samples evaluated."""
+
+    metadata: dict[str, Any] = Field(default_factory=dict)
+    """Optional metadata (git_sha, model version, etc.)."""
+
+
+class RegressionReport(BaseModel):
+    """Report comparing current run against baseline."""
+
+    model_config = ConfigDict(frozen=True)
+
+    detected: bool
+    """Whether a regression was detected."""
+
+    baseline: dict[str, float]
+    """Baseline metric values (rolling average)."""
+
+    current: dict[str, float]
+    """Current run metric values."""
+
+    deltas: dict[str, float]
+    """Difference from baseline (negative = regression)."""
+
+    tolerance: float
+    """Tolerance threshold used for detection."""
+
+    @property
+    def summary(self) -> str:
+        """Human-readable summary of the report."""
+        if not self.detected:
+            return "No regression detected. All metrics within tolerance."
+
+        regressions = [
+            f"  {metric}: {self.current.get(metric, 0.0):.4f} "
+            f"(baseline: {self.baseline.get(metric, 0.0):.4f}, "
+            f"delta: {delta:+.4f})"
+            for metric, delta in self.deltas.items()
+            if delta < -self.tolerance
+        ]
+
+        return f"Regression detected (tolerance: {self.tolerance:.2%}):\n" + "\n".join(
+            regressions
+        )
@@ -0,0 +1,87 @@
+"""Regression detection using rolling window comparison."""
+
+from veritext.benchmark.models import BenchmarkRun, RegressionReport
+
+
+def compute_baseline(
+    runs: list[BenchmarkRun],
+    window: int = 10,
+) -> dict[str, float]:
+    """
+    Compute rolling average baseline from recent runs.
+
+    Args:
+        runs: List of benchmark runs (most recent first).
+        window: Number of runs to include in the baseline.
+
+    Returns:
+        Dictionary of metric names to their average values.
+    """
+    if not runs:
+        return {}
+
+    # Take up to `window` runs
+    recent_runs = runs[:window]
+
+    # Collect all metric values
+    metric_values: dict[str, list[float]] = {}
+    for run in recent_runs:
+        for metric_name, value in run.metrics.items():
+            if metric_name not in metric_values:
+                metric_values[metric_name] = []
+            metric_values[metric_name].append(value)
+
+    # Compute averages
+    return {
+        metric: sum(values) / len(values) for metric, values in metric_values.items()
+    }
+
+
+def detect_regression(
+    current: dict[str, float],
+    baseline: dict[str, float],
+    tolerance: float = 0.05,
+) -> RegressionReport:
+    """
+    Compare current metrics against baseline.
+
+    A regression is detected if any metric falls below its baseline by more
+    than the tolerance. The comparison is absolute (current - baseline).
+
+    Args:
+        current: Current metric values.
+        baseline: Baseline metric values.
+        tolerance: Maximum allowed absolute drop in a metric (e.g., 0.05).
+
+    Returns:
+        RegressionReport with comparison results.
+    """
+    if not baseline:
+        # No baseline means no regression possible
+        return RegressionReport(
+            detected=False,
+            baseline=baseline,
+            current=current,
+            deltas={},
+            tolerance=tolerance,
+        )
+
+    deltas: dict[str, float] = {}
+    detected = False
+
+    for metric, baseline_value in baseline.items():
+        current_value = current.get(metric, 0.0)
+        delta = current_value - baseline_value
+        deltas[metric] = delta
+
+        # Check if this metric regressed beyond tolerance
+        if delta < -tolerance:
+            detected = True
+
+    return RegressionReport(
+        detected=detected,
+        baseline=baseline,
+        current=current,
+        deltas=deltas,
+        tolerance=tolerance,
+    )
@@ -0,0 +1,186 @@
+"""Benchmark execution and tracking."""
+
+import uuid
+from datetime import UTC, datetime
+from pathlib import Path
+from typing import Any
+
+import veritext
+from veritext.benchmark.models import BenchmarkRun, RegressionReport
+from veritext.benchmark.regression import compute_baseline, detect_regression
+from veritext.benchmark.storage import BenchmarkStorage
+from veritext.core.exceptions import RegressionDetectedError
+from veritext.metrics.bleu import Bleu
+from veritext.metrics.rouge import Rouge
+
+# Default metrics to use for evaluation
+DEFAULT_METRICS = ["rouge_l", "bleu4"]
+
+
+class Benchmark:
+    """Track text quality over time."""
+
+    def __init__(
+        self,
+        name: str,
+        storage_path: str | Path = "benchmarks/",
+    ) -> None:
+        """
+        Initialise a benchmark tracker.
+
+        Args:
+            name: Name identifying this benchmark suite.
+            storage_path: Directory for storing benchmark data.
+        """
+        self._name = name
+        self._storage_path = Path(storage_path)
+        self._storage = BenchmarkStorage(self._storage_path / f"{name}.db")
+
+        # Initialise metrics
+        self._bleu = Bleu()
+        self._rouge = Rouge()
+
+    @property
+    def name(self) -> str:
+        """Return the benchmark name."""
+        return self._name
+
+    def _compute_metrics(
+        self,
+        candidates: list[str],
+        references: list[str] | list[list[str]],
+        metric_names: list[str],
+    ) -> dict[str, float]:
+        """Compute requested metrics for the given samples."""
+        results: dict[str, float] = {}
+
+        for metric_name in metric_names:
+            if metric_name in ("bleu1", "bleu2", "bleu3", "bleu4"):
+                batch_result = self._bleu.batch_score(candidates, references)
+                stats = batch_result.stats.get(metric_name)
+                if stats:
+                    results[metric_name] = stats.mean
+
+            elif metric_name in (
+                "rouge1",
+                "rouge2",
+                "rouge_l",
+                "rouge1_fmeasure",
+                "rouge2_fmeasure",
+                "rouge_l_fmeasure",
+            ):
+                rouge_result = self._rouge.batch_score(candidates, references)
+                # Map short names to stat names
+                stat_name = metric_name
+                if metric_name == "rouge1":
+                    stat_name = "rouge1_fmeasure"
+                elif metric_name == "rouge2":
+                    stat_name = "rouge2_fmeasure"
+                elif metric_name == "rouge_l":
+                    stat_name = "rouge_l_fmeasure"
+
+                stats = rouge_result.stats.get(stat_name)
+                if stats:
+                    results[metric_name] = stats.mean
+
+        return results
+
+    def evaluate(
+        self,
+        candidates: list[str],
+        references: list[str] | list[list[str]],
+        metrics: list[str] | None = None,
+        metadata: dict[str, Any] | None = None,
+    ) -> BenchmarkRun:
+        """
+        Evaluate candidates against references, store results, and return the run.
+
+        Args:
+            candidates: List of candidate texts to evaluate.
+            references: Reference text(s) for each candidate.
+            metrics: List of metrics to compute. Defaults to ["rouge_l", "bleu4"].
+            metadata: Optional metadata (git_sha, model version, etc.).
+
+        Returns:
+            The BenchmarkRun record that was created and stored.
+        """
+        metric_names = metrics or DEFAULT_METRICS
+        metric_results = self._compute_metrics(candidates, references, metric_names)
+
+        run = BenchmarkRun(
+            id=str(uuid.uuid4()),
+            benchmark_name=self._name,
+            timestamp=datetime.now(UTC),
+            veritext_version=veritext.__version__,
+            metrics=metric_results,
+            sample_count=len(candidates),
+            metadata=metadata or {},
+        )
+
+        self._storage.save_run(run)
+        return run
+
+    def check_regression(
+        self,
+        tolerance: float = 0.05,
+        window: int = 10,
+    ) -> RegressionReport:
+        """
+        Compare latest run against historical baseline.
+
+        Args:
+            tolerance: Maximum allowed metric drop before regression is flagged.
+            window: Number of historical runs to include in baseline.
+
+        Returns:
+            RegressionReport with comparison results.
+        """
+        runs = self._storage.get_runs(self._name)
+
+        if not runs:
+            # No runs at all
+            return RegressionReport(
+                detected=False,
+                baseline={},
+                current={},
+                deltas={},
+                tolerance=tolerance,
+            )
+
+        current_run = runs[0]
+        # Baseline excludes the current run
+        historical_runs = runs[1:]
+        baseline = compute_baseline(historical_runs, window=window)
+
+        return detect_regression(current_run.metrics, baseline, tolerance)
+
+    def assert_no_regression(
+        self,
+        tolerance: float = 0.05,
+        window: int = 10,
+    ) -> None:
+        """
+        Raise RegressionDetectedError if quality dropped.
+
+        Args:
+            tolerance: Maximum allowed metric drop before regression is flagged.
+            window: Number of historical runs to include in baseline.
+
+        Raises:
+            RegressionDetectedError: If a regression is detected.
+        """
+        report = self.check_regression(tolerance=tolerance, window=window)
+        if report.detected:
+            raise RegressionDetectedError(report.summary)
+
+    def get_history(self, limit: int = 20) -> list[BenchmarkRun]:
+        """
+        Get recent benchmark runs.
+
+        Args:
+            limit: Maximum number of runs to return.
+
+        Returns:
+            List of BenchmarkRun objects, most recent first.
+        """
+        return self._storage.get_runs(self._name, limit=limit)
@@ -0,0 +1,179 @@
+"""SQLite storage for benchmark history."""
+
+import json
+import sqlite3
+from datetime import datetime
+from pathlib import Path
+
+from veritext.benchmark.models import BenchmarkRun
+from veritext.core.exceptions import StorageError
+
+
+class BenchmarkStorage:
+    """SQLite-backed storage for benchmark runs."""
+
+    def __init__(self, db_path: Path) -> None:
+        """
+        Initialise storage, creating tables if needed.
+
+        Args:
+            db_path: Path to the SQLite database file.
+        """
+        self._db_path = db_path
+        self._ensure_parent_exists()
+        self._init_database()
+
+    def _ensure_parent_exists(self) -> None:
+        """Ensure the parent directory exists."""
+        self._db_path.parent.mkdir(parents=True, exist_ok=True)
+
+    def _get_connection(self) -> sqlite3.Connection:
+        """Get a database connection with WAL mode enabled."""
+        conn = sqlite3.connect(str(self._db_path), timeout=30.0)
+        conn.execute("PRAGMA journal_mode=WAL")
+        conn.execute("PRAGMA foreign_keys=ON")
+        conn.row_factory = sqlite3.Row
+        return conn
+
+    def _init_database(self) -> None:
+        """Create tables if they don't exist."""
+        try:
+            with self._get_connection() as conn:
+                conn.executescript("""
+                    CREATE TABLE IF NOT EXISTS benchmark_runs (
+                        id TEXT PRIMARY KEY,
+                        benchmark_name TEXT NOT NULL,
+                        timestamp TEXT NOT NULL,
+                        veritext_version TEXT NOT NULL,
+                        sample_count INTEGER NOT NULL,
+                        metadata TEXT
+                    );
+
+                    CREATE TABLE IF NOT EXISTS benchmark_metrics (
+                        run_id TEXT REFERENCES benchmark_runs(id) ON DELETE CASCADE,
+                        metric_name TEXT NOT NULL,
+                        value REAL NOT NULL,
+                        PRIMARY KEY (run_id, metric_name)
+                    );
+
+                    CREATE INDEX IF NOT EXISTS idx_benchmark_name
+                    ON benchmark_runs(benchmark_name, timestamp DESC);
+                """)
+        except sqlite3.Error as e:
+            raise StorageError(f"Failed to initialise database: {e}") from e
+
+    def save_run(self, run: BenchmarkRun) -> None:
+        """
+        Persist a benchmark run.
+
+        Args:
+            run: The benchmark run to save.
+
+        Raises:
+            StorageError: If the save operation fails.
+        """
+        try:
+            with self._get_connection() as conn:
+                # Insert the run
+                conn.execute(
+                    """
+                    INSERT INTO benchmark_runs
+                    (id, benchmark_name, timestamp, veritext_version, sample_count, metadata)
+                    VALUES (?, ?, ?, ?, ?, ?)
+                    """,
+                    (
+                        run.id,
+                        run.benchmark_name,
+                        run.timestamp.isoformat(),
+                        run.veritext_version,
+                        run.sample_count,
+                        json.dumps(run.metadata) if run.metadata else None,
+                    ),
+                )
+
+                # Insert metrics
+                for metric_name, value in run.metrics.items():
+                    conn.execute(
+                        """
+                        INSERT INTO benchmark_metrics (run_id, metric_name, value)
+                        VALUES (?, ?, ?)
+                        """,
+                        (run.id, metric_name, value),
+                    )
+        except sqlite3.IntegrityError as e:
+            raise StorageError(f"Run with id '{run.id}' already exists") from e
+        except sqlite3.Error as e:
+            raise StorageError(f"Failed to save benchmark run: {e}") from e
+
+    def get_runs(
+        self,
+        benchmark_name: str,
+        limit: int | None = None,
+    ) -> list[BenchmarkRun]:
+        """
+        Retrieve runs for a benchmark, most recent first.
+
+        Args:
+            benchmark_name: Name of the benchmark to retrieve runs for.
+            limit: Maximum number of runs to return.
+
+        Returns:
+            List of BenchmarkRun objects, most recent first.
+
+        Raises:
+            StorageError: If the retrieval fails.
+        """
+        try:
+            with self._get_connection() as conn:
+                query = """
+                    SELECT id, benchmark_name, timestamp, veritext_version,
+                           sample_count, metadata
+                    FROM benchmark_runs
+                    WHERE benchmark_name = ?
+                    ORDER BY timestamp DESC
+                """
+                if limit is not None:
+                    query += " LIMIT ?"
+                    rows = conn.execute(query, (benchmark_name, limit)).fetchall()
+                else:
+                    rows = conn.execute(query, (benchmark_name,)).fetchall()
+
+                runs = []
+                for row in rows:
+                    # Get metrics for this run
+                    metrics_rows = conn.execute(
+                        "SELECT metric_name, value FROM benchmark_metrics WHERE run_id = ?",
+                        (row["id"],),
+                    ).fetchall()
+                    metrics = {m["metric_name"]: m["value"] for m in metrics_rows}
+
+                    metadata = json.loads(row["metadata"]) if row["metadata"] else {}
+
+                    runs.append(
+                        BenchmarkRun(
+                            id=row["id"],
+                            benchmark_name=row["benchmark_name"],
+                            timestamp=datetime.fromisoformat(row["timestamp"]),
+                            veritext_version=row["veritext_version"],
+                            sample_count=row["sample_count"],
+                            metrics=metrics,
+                            metadata=metadata,
+                        )
+                    )
+
+                return runs
+        except sqlite3.Error as e:
+            raise StorageError(f"Failed to retrieve benchmark runs: {e}") from e
+
+    def get_latest_run(self, benchmark_name: str) -> BenchmarkRun | None:
+        """
+        Get the most recent run for a benchmark.
+
+        Args:
+            benchmark_name: Name of the benchmark.
+
+        Returns:
+            The most recent BenchmarkRun, or None if no runs exist.
+        """
+        runs = self.get_runs(benchmark_name, limit=1)
+        return runs[0] if runs else None
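The two-table layout used by `BenchmarkStorage` can be tried out on its own with the standard-library `sqlite3` module. The sketch below assumes a trimmed-down version of the schema above in an in-memory database, with made-up run IDs and scores. It wraps the connection in `contextlib.closing`, because using a `sqlite3.Connection` as a context manager only commits or rolls back the transaction and does not close the connection.

```python
import sqlite3
from contextlib import closing

SCHEMA = """
CREATE TABLE benchmark_runs (
    id TEXT PRIMARY KEY,
    benchmark_name TEXT NOT NULL,
    timestamp TEXT NOT NULL
);
CREATE TABLE benchmark_metrics (
    run_id TEXT REFERENCES benchmark_runs(id) ON DELETE CASCADE,
    metric_name TEXT NOT NULL,
    value REAL NOT NULL,
    PRIMARY KEY (run_id, metric_name)
);
"""

# Hypothetical runs: (id, ISO 8601 timestamp, metrics).
RUNS = [
    ("run-1", "2026-02-01T10:00:00+00:00", {"rouge_l": 0.80}),
    ("run-2", "2026-02-02T10:00:00+00:00", {"rouge_l": 0.83}),
]

with closing(sqlite3.connect(":memory:")) as conn:
    conn.row_factory = sqlite3.Row
    conn.executescript(SCHEMA)

    # `with conn` commits on success and rolls back on error.
    with conn:
        for run_id, timestamp, metrics in RUNS:
            conn.execute(
                "INSERT INTO benchmark_runs VALUES (?, ?, ?)",
                (run_id, "demo", timestamp),
            )
            conn.executemany(
                "INSERT INTO benchmark_metrics VALUES (?, ?, ?)",
                [(run_id, name, value) for name, value in metrics.items()],
            )

    # ISO 8601 timestamps in the same timezone sort chronologically as text.
    rows = conn.execute(
        """
        SELECT r.id, m.value
        FROM benchmark_runs AS r
        JOIN benchmark_metrics AS m ON m.run_id = r.id
        WHERE r.benchmark_name = ? AND m.metric_name = ?
        ORDER BY r.timestamp DESC
        """,
        ("demo", "rouge_l"),
    ).fetchall()
    latest_first = [(row["id"], row["value"]) for row in rows]

print(latest_first)
```

A single join like this is one alternative to issuing a separate metrics query per run; whether it is worth adopting depends on how many runs a benchmark typically accumulates.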
@@ -0,0 +1,5 @@
+"""CLI module: Command-line interface for Veritext."""
+
+from veritext.cli.main import app
+
+__all__ = ["app"]
@@ -0,0 +1,166 @@
+"""Benchmark commands for quality tracking."""
+
+from pathlib import Path
+from typing import Annotated
+
+import typer
+
+from veritext.benchmark import Benchmark
+from veritext.cli.formatters import (
+    console,
+    format_benchmark_history,
+    format_regression_report,
+)
+from veritext.cli.readers import read_jsonl
+
+benchmark_app = typer.Typer(
+    name="benchmark",
+    help="Track and compare text quality over time.",
+    no_args_is_help=True,
+)
+
+
+@benchmark_app.command("run")
+def benchmark_run(
+    name: Annotated[
+        str,
+        typer.Argument(help="Name for this benchmark suite."),
+    ],
+    file: Annotated[
+        Path,
+        typer.Option("--file", "-f", help="JSONL file with candidate/reference pairs."),
+    ],
+    metrics: Annotated[
+        str,
+        typer.Option(
+            "--metrics",
+            "-m",
+            help="Comma-separated metrics to track (e.g., rouge_l,bleu4).",
+        ),
+    ] = "rouge_l,bleu4",
+    storage_path: Annotated[
+        Path,
+        typer.Option(
+            "--storage",
+            "-s",
+            help="Directory for benchmark data storage.",
+        ),
+    ] = Path("benchmarks"),
+) -> None:
+    """
+    Run a benchmark evaluation and store the results.
+
+    Example:
+        veritext benchmark run my_bench -f data.jsonl -m rouge_l,bleu4
+    """
+    # Read text pairs
+    try:
+        pairs = read_jsonl(file)
+    except (FileNotFoundError, ValueError) as e:
+        console.print(f"[red]Error:[/red] {e}")
+        raise typer.Exit(code=1) from e
+
+    if not pairs:
+        console.print("[yellow]Warning:[/yellow] No text pairs found in file.")
+        raise typer.Exit(code=0)
+
+    # Parse metrics
+    metric_names = [m.strip() for m in metrics.split(",") if m.strip()]
+
+    candidates = [p.candidate for p in pairs]
+    references = [p.reference for p in pairs]
+
+    # Run benchmark
+    bench = Benchmark(name, storage_path=storage_path)
+    run = bench.evaluate(candidates, references, metrics=metric_names)
+
+    console.print(f"[green]Benchmark '{name}' completed.[/green]")
+    console.print(f"Samples: {run.sample_count}")
+    console.print("\nMetrics:")
+    for metric_name, value in sorted(run.metrics.items()):
+        console.print(f"  {metric_name}: {value:.4f}")
+
+
+@benchmark_app.command("show")
+def benchmark_show(
+    name: Annotated[
+        str,
+        typer.Argument(help="Name of the benchmark suite."),
+    ],
+    last: Annotated[
+        int,
+        typer.Option("--last", "-n", help="Number of recent runs to show."),
+    ] = 20,
+    storage_path: Annotated[
+        Path,
+        typer.Option(
+            "--storage",
+            "-s",
+            help="Directory for benchmark data storage.",
+        ),
+    ] = Path("benchmarks"),
+) -> None:
+    """
+    Show benchmark history for a suite.
+
+    Example:
+        veritext benchmark show my_bench --last 10
+    """
+    bench = Benchmark(name, storage_path=storage_path)
+    runs = bench.get_history(limit=last)
+
+    if not runs:
+        console.print(f"[yellow]No benchmark runs found for '{name}'.[/yellow]")
+        raise typer.Exit(code=0)
+
+    table = format_benchmark_history(runs)
+    console.print(table)
+
+
+@benchmark_app.command("check")
+def benchmark_check(
+    name: Annotated[
+        str,
+        typer.Argument(help="Name of the benchmark suite."),
+    ],
+    tolerance: Annotated[
+        float,
+        typer.Option(
+            "--tolerance",
+            "-t",
+            help="Maximum allowed absolute metric drop (e.g., 0.05).",
+        ),
+    ] = 0.05,
+    window: Annotated[
+        int,
+        typer.Option(
+            "--window",
+            "-w",
+            help="Number of historical runs for baseline.",
+        ),
+    ] = 10,
+    storage_path: Annotated[
+        Path,
+        typer.Option(
+            "--storage",
+            "-s",
+            help="Directory for benchmark data storage.",
+        ),
+    ] = Path("benchmarks"),
+) -> None:
+    """
+    Check for quality regression against historical baseline.
+
+    Exits with code 1 if regression detected (for CI integration).
+
+    Example:
+        veritext benchmark check my_bench --tolerance 0.05
+    """
+    bench = Benchmark(name, storage_path=storage_path)
+    report = bench.check_regression(tolerance=tolerance, window=window)
+
+    panel = format_regression_report(report)
+    console.print(panel)
+
+    if report.detected:
+        raise typer.Exit(code=1)
@@ -0,0 +1,170 @@
+"""Rich output formatters for CLI display."""
+
+import json
+
+from rich.console import Console
+from rich.panel import Panel
+from rich.table import Table
+
+from veritext.benchmark.models import BenchmarkRun, RegressionReport
+
+console = Console()
+
+
+def format_validation_table(
+    results: dict[str, float],
+    threshold: float | None = None,
+) -> Table:
+    """
+    Format validation results as a Rich table.
+
+    Args:
+        results: Dictionary of metric names to scores.
+        threshold: Optional threshold for pass/fail colouring.
+
+    Returns:
+        Rich Table object.
+    """
+    table = Table(title="Validation Results", show_header=True, header_style="bold")
+    table.add_column("Metric", style="cyan")
+    table.add_column("Score", justify="right")
+
+    if threshold is not None:
+        table.add_column("Status", justify="center")
+
+    for metric, score in sorted(results.items()):
+        score_str = f"{score:.4f}"
+
+        if threshold is not None:
+            status = "[green]PASS[/green]" if score >= threshold else "[red]FAIL[/red]"
+            table.add_row(metric, score_str, status)
+        else:
+            table.add_row(metric, score_str)
+
+    return table
+
+
+def format_validation_json(results: dict[str, float]) -> str:
+    """
+    Format validation results as JSON.
+
+    Args:
+        results: Dictionary of metric names to scores.
+
+    Returns:
+        JSON string.
+    """
+    return json.dumps(results, indent=2)
+
+
+def format_validation_simple(results: dict[str, float]) -> str:
+    """
+    Format validation results as simple text output.
+
+    Args:
+        results: Dictionary of metric names to scores.
+
+    Returns:
+        Simple text string with one metric per line.
+    """
+    lines = [f"{metric}: {score:.4f}" for metric, score in sorted(results.items())]
+    return "\n".join(lines)
+
+
+def format_benchmark_history(runs: list[BenchmarkRun]) -> Table:
+    """
+    Format benchmark run history as a Rich table.
+
+    Args:
+        runs: List of BenchmarkRun objects (most recent first).
+
+    Returns:
+        Rich Table object.
+    """
+    if not runs:
+        table = Table(title="Benchmark History")
+        table.add_column("No runs found")
+        return table
+
+    # Get all metric names from the runs
+    metric_names: set[str] = set()
+    for run in runs:
+        metric_names.update(run.metrics.keys())
+    sorted_metrics = sorted(metric_names)
+
+    table = Table(title="Benchmark History", show_header=True, header_style="bold")
+    table.add_column("Timestamp", style="cyan")
+    table.add_column("Samples", justify="right")
+    for metric in sorted_metrics:
+        table.add_column(metric, justify="right")
+
+    for run in runs:
+        timestamp = run.timestamp.strftime("%Y-%m-%d %H:%M")
+        samples = str(run.sample_count)
+        metric_values = [f"{run.metrics.get(m, 0.0):.4f}" for m in sorted_metrics]
+        table.add_row(timestamp, samples, *metric_values)
+
+    return table
+
+
+def format_regression_report(report: RegressionReport) -> Panel:
+    """
+    Format a regression report as a Rich panel.
+
+    Args:
+        report: RegressionReport object.
+
+    Returns:
+        Rich Panel object with formatted report.
+    """
+    if not report.detected:
+        content = (
+            f"[green]No regression detected.[/green]\nTolerance: {report.tolerance:.2%}"
+        )
+        return Panel(content, title="Regression Check", border_style="green")
+
+    # Build regression details
+    lines = [
+        "[red]Regression detected![/red]",
+        f"Tolerance: {report.tolerance:.2%}",
+        "",
+        "Metric details:",
+    ]
+
+    for metric in sorted(report.deltas.keys()):
+        baseline = report.baseline.get(metric, 0.0)
+        current = report.current.get(metric, 0.0)
+        delta = report.deltas[metric]
+
+        if delta < -report.tolerance:
+            status = "[red]REGRESSED[/red]"
+        else:
+            status = "[green]OK[/green]"
+
+        lines.append(
+            f"  {metric}: {current:.4f} (baseline: {baseline:.4f}, "
+            f"delta: {delta:+.4f}) {status}"
+        )
+
+    return Panel("\n".join(lines), title="Regression Check", border_style="red")
+
+
+def print_validation_output(
+    results: dict[str, float],
+    output_format: str = "table",
+    threshold: float | None = None,
+) -> None:
+    """
+    Print validation results in the specified format.
+
+    Args:
+        results: Dictionary of metric names to scores.
+        output_format: Output format ('table', 'json', or 'simple').
+        threshold: Optional threshold for pass/fail colouring (table only).
+    """
+    if output_format == "json":
+        console.print(format_validation_json(results))
+    elif output_format == "simple":
+        console.print(format_validation_simple(results))
+    else:
+        console.print(format_validation_table(results, threshold))
@@ -0,0 +1,37 @@
+"""Veritext CLI entry point."""
+
+import typer
+
+import veritext
+from veritext.cli.benchmark import benchmark_app
+from veritext.cli.validate import validate
+
+app = typer.Typer(
+    name="veritext",
+    help="Semantic text validation framework.",
+    no_args_is_help=True,
+)
+
+# Register commands
+app.command()(validate)
+app.add_typer(benchmark_app)
+
+
+@app.callback(invoke_without_command=True)
+def main(
+    version: bool | None = typer.Option(
+        None,
+        "--version",
+        "-V",
+        help="Show version and exit.",
+        is_eager=True,
+    ),
+) -> None:
+    """Veritext: Semantic text validation framework for Python."""
+    if version:
+        typer.echo(f"veritext {veritext.__version__}")
+        raise typer.Exit()
+
+
+if __name__ == "__main__":
+    app()
@@ -0,0 +1,120 @@
+"""Input readers for CLI operations."""
+
+import json
+from dataclasses import dataclass
+from pathlib import Path
+
+
+@dataclass
+class TextPair:
+    """A candidate-reference text pair for validation."""
+
+    candidate: str
+    reference: str
+
+
+def read_jsonl(path: Path) -> list[TextPair]:
+    """
+    Read text pairs from a JSONL file.
+
+    Each line must be a JSON object with 'candidate' and 'reference' keys.
+
+    Args:
+        path: Path to the JSONL file.
+
+    Returns:
+        List of TextPair objects.
+
+    Raises:
+        FileNotFoundError: If the file does not exist.
+        ValueError: If any line is malformed or missing required keys.
+    """
+    if not path.exists():
+        raise FileNotFoundError(f"File not found: {path}")
+
+    pairs: list[TextPair] = []
+    with path.open() as f:
+        for line_num, line in enumerate(f, start=1):
+            line = line.strip()
+            if not line:
+                continue
+
+            try:
+                data = json.loads(line)
+            except json.JSONDecodeError as e:
+                raise ValueError(f"Invalid JSON on line {line_num}: {e}") from e
+
+            if not isinstance(data, dict) or "candidate" not in data:
+                raise ValueError(f"Missing 'candidate' key on line {line_num}")
+            if "reference" not in data:
+                raise ValueError(f"Missing 'reference' key on line {line_num}")
+
+            pairs.append(
+                TextPair(
+                    candidate=str(data["candidate"]),
+                    reference=str(data["reference"]),
+                )
+            )
+
+    return pairs
+
+
+def read_paired_jsonl(candidates_path: Path, references_path: Path) -> list[TextPair]:
+    """
+    Read text pairs from separate candidate and reference JSONL files.
+
+    Each file should contain one JSON object per line with a 'text' key.
+
+    Args:
+        candidates_path: Path to the candidates JSONL file.
+        references_path: Path to the references JSONL file.
+
+    Returns:
+        List of TextPair objects.
+
+    Raises:
+        FileNotFoundError: If either file does not exist.
+        ValueError: If files have different lengths or are malformed.
+    """
+    candidates = _read_text_jsonl(candidates_path, "candidates")
+    references = _read_text_jsonl(references_path, "references")
+
+    if len(candidates) != len(references):
+        raise ValueError(
+            f"Number of candidates ({len(candidates)}) does not match "
+            f"number of references ({len(references)})"
+        )
+
+    return [
+        TextPair(candidate=c, reference=r)
+        for c, r in zip(candidates, references, strict=True)
+    ]
+
+
+def _read_text_jsonl(path: Path, label: str) -> list[str]:
+    """Read text values from a JSONL file with 'text' key per line."""
+    if not path.exists():
+        raise FileNotFoundError(f"{label.capitalize()} file not found: {path}")
+
+    texts: list[str] = []
+    with path.open() as f:
+        for line_num, line in enumerate(f, start=1):
+            line = line.strip()
+            if not line:
+                continue
+
+            try:
+                data = json.loads(line)
+            except json.JSONDecodeError as e:
+                raise ValueError(
+                    f"Invalid JSON in {label} file on line {line_num}: {e}"
+                ) from e
+
+            if not isinstance(data, dict) or "text" not in data:
+                raise ValueError(
+                    f"Missing 'text' key in {label} file on line {line_num}"
+                )
+
+            texts.append(str(data["text"]))
+
+    return texts
@@ -0,0 +1,250 @@
+"""Validate command for computing text metrics."""
+
+from pathlib import Path
+from typing import Annotated
+
+import typer
+
+from veritext.cli.formatters import console, print_validation_output
+from veritext.cli.readers import read_jsonl, read_paired_jsonl
+from veritext.metrics.bleu import Bleu
+from veritext.metrics.lexical import Lexical
+from veritext.metrics.rouge import Rouge
+
+# Available metrics
+AVAILABLE_METRICS = frozenset(
+    {"bleu", "bleu1", "bleu2", "bleu3", "bleu4", "rouge", "rouge_l", "lexical"}
+)
+
+# Lazily-initialised metric instances
+_bleu: Bleu | None = None
+_rouge: Rouge | None = None
+_lexical: Lexical | None = None
+
+
+def _get_bleu() -> Bleu:
+    """Get or create the BLEU metric instance."""
+    global _bleu
+    if _bleu is None:
+        _bleu = Bleu()
+    return _bleu
+
+
+def _get_rouge() -> Rouge:
+    """Get or create the ROUGE metric instance."""
+    global _rouge
+    if _rouge is None:
+        _rouge = Rouge()
+    return _rouge
+
+
+def _get_lexical() -> Lexical:
+    """Get or create the lexical metric instance."""
+    global _lexical
+    if _lexical is None:
+        _lexical = Lexical()
+    return _lexical
+
+
+# Per-metric extractors used by _compute_metrics and _compute_batch_metrics.
+# - *_single: function(candidate, reference) -> dict of scores for one pair
+# - *_batch: function(candidates, references) -> dict of mean scores
+# Batch extractors omit a key when the metric reports no stats for it.
+def _bleu_single(candidate: str, reference: str, key: str) -> dict[str, float]:
+    """Extract a BLEU score for single mode."""
+    result = _get_bleu().score(candidate, reference)
+    return {key: getattr(result, key)}
+
+
+def _bleu_batch(
+    candidates: list[str], references: list[str], key: str
+) -> dict[str, float]:
+    """Extract a BLEU score for batch mode."""
+    batch = _get_bleu().batch_score(candidates, references)
+    stats = batch.stats.get(key)
+    return {key: stats.mean} if stats else {}
+
+
+def _rouge_single(candidate: str, reference: str) -> dict[str, float]:
+    """Extract ROUGE-L F-measure for single mode."""
+    result = _get_rouge().score(candidate, reference)
+    return {"rouge_l": result.rouge_l.fmeasure}
+
+
+def _rouge_batch(candidates: list[str], references: list[str]) -> dict[str, float]:
+    """Extract ROUGE-L F-measure for batch mode."""
+    batch = _get_rouge().batch_score(candidates, references)
+    stats = batch.stats.get("rouge_l_fmeasure")
+    return {"rouge_l": stats.mean} if stats else {}
+
+
+def _lexical_single(candidate: str, reference: str) -> dict[str, float]:
+    """Extract lexical scores for single mode."""
+    result = _get_lexical().score(candidate, reference)
+    return {"jaccard": result.jaccard, "token_overlap": result.token_overlap}
+
+
+def _lexical_batch(candidates: list[str], references: list[str]) -> dict[str, float]:
+    """Extract lexical scores for batch mode."""
+    batch = _get_lexical().batch_score(candidates, references)
+    results: dict[str, float] = {}
+    jaccard_stats = batch.stats.get("jaccard")
+    overlap_stats = batch.stats.get("token_overlap")
+    if jaccard_stats:
+        results["jaccard"] = jaccard_stats.mean
+    if overlap_stats:
+        results["token_overlap"] = overlap_stats.mean
+    return results
+
+
+def _compute_metrics(
+    candidate: str,
+    reference: str,
+    metric_names: list[str],
+) -> dict[str, float]:
+    """Compute requested metrics for a single text pair."""
+    results: dict[str, float] = {}
+
+    for metric in metric_names:
+        if metric in ("bleu", "bleu4"):
+            results.update(_bleu_single(candidate, reference, "bleu4"))
+        elif metric in ("bleu1", "bleu2", "bleu3"):
+            results.update(_bleu_single(candidate, reference, metric))
+        elif metric in ("rouge", "rouge_l"):
+            results.update(_rouge_single(candidate, reference))
+        elif metric == "lexical":
+            results.update(_lexical_single(candidate, reference))
+
+    return results
+
+
+def _compute_batch_metrics(
+    candidates: list[str],
+    references: list[str],
+    metric_names: list[str],
+) -> dict[str, float]:
+    """Compute average metrics for a batch of text pairs."""
+    results: dict[str, float] = {}
+
+    for metric in metric_names:
+        if metric in ("bleu", "bleu4"):
+            results.update(_bleu_batch(candidates, references, "bleu4"))
+        elif metric in ("bleu1", "bleu2", "bleu3"):
+            results.update(_bleu_batch(candidates, references, metric))
+        elif metric in ("rouge", "rouge_l"):
+            results.update(_rouge_batch(candidates, references))
+        elif metric == "lexical":
+            results.update(_lexical_batch(candidates, references))
+
+    return results
+
+
+def _parse_metrics(metrics_str: str) -> list[str]:
+    """Parse comma-separated metric names."""
+    metrics = [m.strip().lower() for m in metrics_str.split(",")]
+
+    # Validate metric names
+    invalid = [m for m in metrics if m not in AVAILABLE_METRICS]
+    if invalid:
+        raise typer.BadParameter(
+            f"Unknown metrics: {', '.join(invalid)}. "
+            f"Available: {', '.join(sorted(AVAILABLE_METRICS))}"
+        )
+
+    return metrics
+
+
+def validate(
+    text: Annotated[
+        str | None,
+        typer.Argument(help="Candidate text to validate (inline mode)."),
+    ] = None,
+    reference: Annotated[
+        str | None,
+        typer.Option("--reference", "-r", help="Reference text for comparison."),
+    ] = None,
+    file: Annotated[
+        Path | None,
+        typer.Option("--file", "-f", help="JSONL file with candidate/reference pairs."),
+    ] = None,
+    reference_file: Annotated[
+        Path | None,
+        typer.Option(
+            "--reference-file",
+            "-R",
+            help="Separate JSONL file with references (requires --file).",
+        ),
+    ] = None,
+    metrics: Annotated[
+        str,
+        typer.Option(
+            "--metrics",
+            "-m",
+            help="Comma-separated metrics: bleu, bleu1-4, rouge, rouge_l, lexical.",
+        ),
+    ] = "bleu,rouge",
+    output: Annotated[
+        str,
+        typer.Option("--output", "-o", help="Output format: table, json, or simple."),
+    ] = "table",
+    threshold: Annotated[
+        float | None,
+        typer.Option("--threshold", "-t", help="Score threshold for pass/fail status."),
+    ] = None,
+) -> None:
+    """
+    Validate text quality using various metrics.
+
+    Use inline mode for single texts:
+        veritext validate "text" -r "reference" -m bleu,rouge
+
+    Use file mode for batches:
+        veritext validate -f outputs.jsonl -m bleu,rouge
+    """
+    # Parse and validate metric names
+    try:
+        metric_names = _parse_metrics(metrics)
+    except typer.BadParameter as e:
+        console.print(f"[red]Error:[/red] {e}")
+        raise typer.Exit(code=1) from e
+
+    # Validate output format
+    if output not in ("table", "json", "simple"):
+        console.print(f"[red]Error:[/red] Invalid output format: {output}")
+        raise typer.Exit(code=1)
+
+    # Determine mode: inline vs file
+    if file is not None:
+        # File mode
+        try:
+            if reference_file is not None:
+                pairs = read_paired_jsonl(file, reference_file)
+            else:
+                pairs = read_jsonl(file)
+        except (FileNotFoundError, ValueError) as e:
+            console.print(f"[red]Error:[/red] {e}")
+            raise typer.Exit(code=1) from e
+
+        if not pairs:
+            console.print("[yellow]Warning:[/yellow] No text pairs found in file.")
+            raise typer.Exit(code=0)
+
+        candidates = [p.candidate for p in pairs]
+        references = [p.reference for p in pairs]
+
+        results = _compute_batch_metrics(candidates, references, metric_names)
+        console.print(f"[dim]Evaluated {len(pairs)} text pairs.[/dim]\n")
+
+    elif text is not None and reference is not None:
+        # Inline mode
+        results = _compute_metrics(text, reference, metric_names)
+
+    else:
+        # Invalid usage
+        console.print(
+            "[red]Error:[/red] Provide either text and --reference, "
+            "or --file for batch mode."
+        )
+        raise typer.Exit(code=1)
+
+    print_validation_output(results, output, threshold)
@@ -1,5 +1,6 @@
 """Configuration management using pydantic-settings."""

+from functools import lru_cache
 from pathlib import Path
 from typing import Literal

@@ -54,6 +55,7 @@ class VeritextSettings(BaseSettings):
    )


+@lru_cache
 def get_settings() -> VeritextSettings:
-    """Get the current settings instance."""
+    """Get the cached settings instance."""
    return VeritextSettings()
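Review note: caching a zero-argument factory with `@lru_cache` means every caller shares one instance until the cache is cleared. The sketch below shows that behaviour in isolation; `DemoSettings` and `get_demo_settings` are hypothetical stand-ins, not the project's `VeritextSettings` or `get_settings`.

```python
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class DemoSettings:
    """Hypothetical stand-in for a settings class; not part of Veritext."""

    log_level: str = "INFO"


@lru_cache
def get_demo_settings() -> DemoSettings:
    """Build the settings once; later calls reuse the cached instance."""
    return DemoSettings()


first = get_demo_settings()
second = get_demo_settings()
same_before_clear = first is second  # cached: identical object

get_demo_settings.cache_clear()  # drop the cached instance
third = get_demo_settings()
same_after_clear = first is third  # a new object was constructed
```

Tests that change environment variables between cases would probably need to call `cache_clear()` on the real function, otherwise they keep seeing the first instance.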
@@ -137,8 +137,8 @@ class Readability:
                flesch_reading_ease=0.0,
            )

-        # Count sentences
-        sentence_count = _count_sentences(candidate)
+        # Count sentences (ensure at least 1 to avoid division by zero)
+        sentence_count = max(_count_sentences(candidate), 1)

        # Count syllables
        syllable_count = sum(_count_syllables(word) for word in words)
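Review note: the `max(..., 1)` guard matters because the sentence count is a divisor in the grade formula. The sketch below uses the standard Flesch-Kincaid grade coefficients; whether `Readability` uses exactly this expression is not visible in this hunk, and `flesch_kincaid_grade` is a hypothetical function name.

```python
def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch-Kincaid grade level from pre-computed counts."""
    if words <= 0:
        return 0.0  # assumption: empty text scores 0.0, as in the early return above
    sentences = max(sentences, 1)  # text without sentence endings counts as one
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Nine words, eleven syllables, no terminal punctuation detected.
grade_without_endings = flesch_kincaid_grade(words=9, sentences=0, syllables=11)
grade_one_sentence = flesch_kincaid_grade(words=9, sentences=1, syllables=11)
```

With the guard, text that has no detected sentence ending is scored as a single sentence instead of raising `ZeroDivisionError`.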
@@ -40,6 +40,11 @@ class LexicalResult(BaseModel):
    token_overlap: float
    """Proportion of candidate tokens found in reference."""

+    @property
+    def score(self) -> float:
+        """Return Jaccard similarity as the primary score."""
+        return self.jaccard
+

 class RougeScore(BaseModel):
    """Individual ROUGE variant score with precision, recall, F-measure."""
@@ -107,9 +107,6 @@ def _compute_rouge_l(
    Returns:
        RougeScore with precision, recall, and F-measure.
    """
-    if not candidate_tokens and not reference_tokens:
-        return RougeScore(precision=0.0, recall=0.0, fmeasure=0.0)
-
    if not candidate_tokens or not reference_tokens:
        return RougeScore(precision=0.0, recall=0.0, fmeasure=0.0)

@@ -209,6 +206,10 @@ class Rouge:
            rouge2_scores.append(_compute_rouge_score(candidate_tokens, ref_tokens, 2))
            rouge_l_scores.append(_compute_rouge_l(candidate_tokens, ref_tokens))

+        # All references were empty after tokenisation
+        if not rouge1_scores:
+            raise ValueError("Reference text cannot be empty")
+
        return RougeResult(
            rouge1=_max_rouge_scores(rouge1_scores),
            rouge2=_max_rouge_scores(rouge2_scores),
@@ -0,0 +1,22 @@
+"""Pytest plugin for text validation.
+
+This plugin provides native pytest integration for Veritext, enabling
+text validation assertions in test suites.
+
+Example:
+    >>> from veritext.pytest_plugin import validate_text
+    >>>
+    >>> def test_summary_quality():
+    ...     text = "The quick brown fox jumps over the lazy dog."
+    ...     validate_text(
+    ...         text,
+    ...         min_length=10,
+    ...         max_length=100,
+    ...         max_reading_grade=8.0,
+    ...     )
+"""
+
+from veritext.pytest_plugin.assertions import validate_text
+from veritext.pytest_plugin.plugin import pytest_configure
+
+__all__ = ["pytest_configure", "validate_text"]
@@ -0,0 +1,141 @@
+"""Assertion functions for text validation in pytest."""
+
+from typing import TYPE_CHECKING
+
+from veritext.core.types import ValidationContext, ValidationResult
+from veritext.validators import all_of
+
+if TYPE_CHECKING:
+    from veritext.validators.base import Check
+
+
+def validate_text(
+    text: str,
+    *,
+    reference: str | list[str] | None = None,
+    min_bleu: float | None = None,
+    min_rouge: float | None = None,
+    min_semantic: float | None = None,
+    max_length: int | None = None,
+    min_length: int | None = None,
+    max_reading_grade: float | None = None,
+    must_contain: list[str] | None = None,
+    must_exclude: list[str] | None = None,
+) -> None:
+    """Assert text passes all specified validation criteria.
+
+    This is the primary assertion function for text validation in pytest.
+    It builds validators from keyword arguments and raises AssertionError
+    with detailed failure information if validation fails.
+
+    Args:
+        text: The text to validate.
+        reference: Reference text for comparison metrics (BLEU, ROUGE, semantic).
+        min_bleu: Minimum BLEU-4 score required (0.0 to 1.0).
+        min_rouge: Minimum ROUGE-L F-measure required (0.0 to 1.0).
+        min_semantic: Minimum semantic similarity required (0.0 to 1.0).
+        max_length: Maximum character count allowed.
+        min_length: Minimum character count required.
+        max_reading_grade: Maximum Flesch-Kincaid grade level.
+        must_contain: Patterns that must be present in the text.
+        must_exclude: Patterns that must not be present in the text.
+
+    Raises:
+        AssertionError: With detailed failure information if validation fails.
+        ValueError: If comparison metrics requested but reference not provided,
+            or if no validation criteria are specified.
+
+    Example:
+        >>> validate_text(
+        ...     "The quick brown fox jumps over the lazy dog.",
+        ...     min_length=10,
+        ...     max_length=100,
+        ...     max_reading_grade=8.0,
+        ... )
+    """
+    # Require a reference for comparison metrics (a 0.0 threshold still counts)
+    if reference is None and (min_bleu, min_rouge, min_semantic) != (None,) * 3:
+        raise ValueError(
+            "Reference text required for comparison metrics "
+            "(min_bleu, min_rouge, min_semantic)"
+        )
+
+    # Build list of validators from kwargs
+    checks: list[Check] = []
+
+    if min_bleu is not None:
+        from veritext.validators import bleu
+
+        checks.append(bleu(min_score=min_bleu))
+
+    if min_rouge is not None:
+        from veritext.validators import rouge
+
+        checks.append(rouge(min_score=min_rouge))
+
+    if min_semantic is not None:
+        # Lazy import to avoid loading sentence-transformers unless needed
+        from veritext.validators import semantic
+
+        checks.append(semantic(min_score=min_semantic))
+
+    if max_length is not None or min_length is not None:
+        from veritext.validators import length
+
+        checks.append(length(min_chars=min_length, max_chars=max_length))
+
+    if max_reading_grade is not None:
+        from veritext.validators import readability
+
+        checks.append(readability(max_grade=max_reading_grade))
+
+    if must_contain is not None:
+        from veritext.validators import contains
+
+        checks.append(contains(patterns=must_contain))
+
+    if must_exclude is not None:
+        from veritext.validators import excludes
+
+        checks.append(excludes(patterns=must_exclude))
+
+    if not checks:
+        raise ValueError("At least one validation criterion must be specified")
+
+    # Run validation
+    context = ValidationContext(reference=reference)
+    validator = all_of(checks)
+    result = validator.check(text, context)
+
+    if not result.passed:
+        raise AssertionError(_format_failure(text, result))
+
+
+def _format_failure(text: str, result: ValidationResult) -> str:
+    """Format a detailed failure message for pytest output.
+
+    Args:
+        text: The text that was validated.
+        result: The validation result containing check failures.
+
+    Returns:
+        Formatted failure message with check details.
+    """
+    lines = ["Text validation failed:"]
+    lines.append("")
+
+    # Show a preview of the text (truncated if long)
+    preview = text[:100] + "..." if len(text) > 100 else text
+    lines.append(f"  Text: {preview!r}")
+    lines.append("")
+
+    # List all failed checks with details
+    lines.append("  Failed checks:")
+    for check in result.failed_checks:
+        lines.append(f"    - {check.name}:")
+        lines.append(f"        {check.message}")
+        if check.threshold is not None:
+            lines.append(f"        Threshold: {check.threshold}")
+            lines.append(f"        Actual:    {check.actual}")
+
+    return "\n".join(lines)
@@ -0,0 +1,80 @@
+"""Pytest fixtures for text validation."""
+
+from typing import TYPE_CHECKING, Any
+
+import pytest
+
+from veritext.core.types import ValidationContext, ValidationResult
+from veritext.validators import all_of
+from veritext.validators.base import Check
+
+if TYPE_CHECKING:
+    from collections.abc import Callable
+
+
+class ValidatorFactory:
+    """Factory for building validator callables from a list of checks."""
+
+    def __call__(
+        self,
+        checks: list[Check],
+        reference: str | list[str] | None = None,
+    ) -> "Callable[[str], ValidationResult]":
+        """Create a validator function from a list of checks.
+
+        Args:
+            checks: List of validation checks to apply.
+            reference: Optional reference text for comparison metrics.
+
+        Returns:
+            A callable that takes text and returns a ValidationResult.
+        """
+        validator = all_of(checks)
+        context = ValidationContext(reference=reference)
+
+        def validate(text: str) -> ValidationResult:
+            return validator.check(text, context)
+
+        return validate
+
+
+@pytest.fixture
+def text_validator() -> ValidatorFactory:
+    """Provide a factory for building validators.
+
+    Example:
+        >>> def test_with_factory(text_validator):
+        ...     from veritext.validators import bleu, length
+        ...     validate = text_validator(
+        ...         checks=[bleu(min_score=0.5), length(min_words=10)],
+        ...         reference="The reference text.",
+        ...     )
+        ...     result = validate("Some candidate text.")
+        ...     assert result.passed
+
+    Returns:
+        ValidatorFactory instance.
+    """
+    return ValidatorFactory()
+
+
+@pytest.fixture
+def validation_context() -> "Callable[..., ValidationContext]":
+    """Provide a factory for creating ValidationContext objects.
+
+    Example:
+        >>> def test_with_context(validation_context):
+        ...     ctx = validation_context(reference="The reference text.")
+        ...     assert ctx.reference == "The reference text."
+
+    Returns:
+        A callable that creates ValidationContext objects.
+    """
+
+    def _create(
+        reference: str | list[str] | None = None,
+        **metadata: Any,
+    ) -> ValidationContext:
+        return ValidationContext(reference=reference, metadata=metadata)
+
+    return _create
@@ -0,0 +1,18 @@
+"""Pytest hooks for Veritext plugin."""
+
+from typing import TYPE_CHECKING
+
+if TYPE_CHECKING:
+    import pytest
+
+
+def pytest_configure(config: "pytest.Config") -> None:
+    """Register Veritext markers.
+
+    Args:
+        config: Pytest configuration object.
+    """
+    config.addinivalue_line(
+        "markers",
+        "text_validation: mark test as a text validation test",
+    )
@@ -1,11 +1,15 @@
 """Embedding-based semantic similarity using sentence-transformers."""

+from collections import OrderedDict
 from typing import Any

 from veritext.core.exceptions import DependencyError
 from veritext.metrics.base import AggregateStats, BatchResult
 from veritext.metrics.results import SemanticResult

+# Default maximum cache size (number of embeddings to store)
+DEFAULT_CACHE_MAX_SIZE = 1000
+

 class SemanticSimilarity:
    """
@@ -21,6 +25,7 @@ class SemanticSimilarity:
        self,
        model: str = "all-MiniLM-L6-v2",
        cache_embeddings: bool = True,
+        cache_max_size: int = DEFAULT_CACHE_MAX_SIZE,
    ) -> None:
        """
        Initialise the semantic similarity metric.
@@ -30,6 +35,8 @@ class SemanticSimilarity:
                   Defaults to "all-MiniLM-L6-v2" (22MB, good quality/size tradeoff).
            cache_embeddings: Whether to cache embeddings for repeated texts.
                              Defaults to True.
+            cache_max_size: Maximum number of embeddings to cache. Least recently
+                            used entries are evicted at the limit. Defaults to 1000.

        Raises:
            DependencyError: If sentence-transformers is not installed.
@@ -44,7 +51,10 @@ class SemanticSimilarity:

        self._model_name = model
        self._model: Any = SentenceTransformer(model)
-        self._cache: dict[str, Any] | None = {} if cache_embeddings else None
+        self._cache: OrderedDict[str, Any] | None = (
+            OrderedDict() if cache_embeddings else None
+        )
+        self._cache_max_size = cache_max_size

    @property
    def name(self) -> str:
@@ -58,7 +68,7 @@ class SemanticSimilarity:

    def _get_embedding(self, text: str) -> Any:
        """
-        Get embedding for text, using cache if available.
+        Get embedding for text, using LRU cache if available.

        Args:
            text: The text to embed.
@@ -67,11 +77,16 @@ class SemanticSimilarity:
            The embedding tensor.
        """
        if self._cache is not None and text in self._cache:
+            # Move to end to mark as recently used
+            self._cache.move_to_end(text)
            return self._cache[text]

        embedding = self._model.encode(text, convert_to_tensor=True)

        if self._cache is not None:
+            # Evict least recently used entries if the cache is full
+            while self._cache and len(self._cache) >= self._cache_max_size:
+                self._cache.popitem(last=False)
            self._cache[text] = embedding

        return embedding
@@ -1,11 +1,20 @@
-"""Composite validators for combining multiple checks."""
+"""Composite validators for combining multiple checks.
+
+Note: CompositeCheck classes (AllOf, AnyOf) intentionally return ValidationResult
+rather than CheckResult. This allows callers to inspect individual check results
+for detailed error reporting. They implement a compatible interface but are not
+substitutable where Check is expected as a type constraint.
+"""

 from veritext.core.types import CheckResult, ValidationContext, ValidationResult
 from veritext.validators.base import Check


 class AllOf:
-    """Passes only if all checks pass."""
+    """Passes only if all checks pass.
+
+    Note: Returns ValidationResult (not CheckResult) to expose child results.
+    """

    def __init__(self, checks: list[Check]) -> None:
        """
@@ -20,7 +29,7 @@ class AllOf:
        if not checks:
            raise ValueError("checks list cannot be empty")

-        self._checks = checks
+        self._checks = list(checks)

    @property
    def name(self) -> str:
@@ -48,7 +57,10 @@ class AllOf:


 class AnyOf:
-    """Passes if any check passes."""
+    """Passes if any check passes.
+
+    Note: Returns ValidationResult (not CheckResult) to expose child results.
+    """

    def __init__(self, checks: list[Check]) -> None:
        """
@@ -63,7 +75,7 @@ class AnyOf:
        if not checks:
            raise ValueError("checks list cannot be empty")

-        self._checks = checks
+        self._checks = list(checks)

    @property
    def name(self) -> str:
@@ -229,7 +229,7 @@ class ContainsValidator:
            case_sensitive: Whether matching is case-sensitive. Defaults to False.

        Raises:
-            InvalidThresholdError: If patterns list is empty.
+            InvalidThresholdError: If patterns list is empty or contains invalid regex.
        """
        if not patterns:
            raise InvalidThresholdError("patterns list cannot be empty")
@@ -238,6 +238,15 @@ class ContainsValidator:
        self._case_sensitive = case_sensitive
        self._flags = 0 if case_sensitive else re.IGNORECASE

+        self._compiled_patterns: list[re.Pattern[str]] = []
+        for pattern in patterns:
+            try:
+                self._compiled_patterns.append(re.compile(pattern, self._flags))
+            except re.error as e:
+                raise InvalidThresholdError(
+                    f"Invalid regex pattern '{pattern}': {e}"
+                ) from e
+
    @property
    def name(self) -> str:
        """Return the name of this check."""
@@ -255,8 +264,10 @@ class ContainsValidator:
            CheckResult with pass/fail status.
        """
        missing = []
-        for pattern in self._patterns:
-            if not re.search(pattern, text, self._flags):
+        for pattern, compiled in zip(
+            self._patterns, self._compiled_patterns, strict=True
+        ):
+            if not compiled.search(text):
                missing.append(pattern)

        passed = len(missing) == 0
@@ -291,7 +302,7 @@ class ExcludesValidator:
            case_sensitive: Whether matching is case-sensitive. Defaults to False.

        Raises:
-            InvalidThresholdError: If patterns list is empty.
+            InvalidThresholdError: If patterns list is empty or contains invalid regex.
        """
        if not patterns:
            raise InvalidThresholdError("patterns list cannot be empty")
@@ -300,6 +311,15 @@ class ExcludesValidator:
        self._case_sensitive = case_sensitive
        self._flags = 0 if case_sensitive else re.IGNORECASE

+        self._compiled_patterns: list[re.Pattern[str]] = []
+        for pattern in patterns:
+            try:
+                self._compiled_patterns.append(re.compile(pattern, self._flags))
+            except re.error as e:
+                raise InvalidThresholdError(
+                    f"Invalid regex pattern '{pattern}': {e}"
+                ) from e
+
    @property
    def name(self) -> str:
        """Return the name of this check."""
@@ -317,8 +337,10 @@ class ExcludesValidator:
            CheckResult with pass/fail status.
        """
        found = []
-        for pattern in self._patterns:
-            if re.search(pattern, text, self._flags):
+        for pattern, compiled in zip(
+            self._patterns, self._compiled_patterns, strict=True
+        ):
+            if compiled.search(text):
                found.append(pattern)

        passed = len(found) == 0
@@ -0,0 +1 @@
+"""Tests for the benchmark module."""
@@ -0,0 +1,145 @@
+"""Tests for benchmark data models."""
+
+from datetime import UTC, datetime
+
+import pytest
+from pydantic import ValidationError
+
+from veritext.benchmark.models import BenchmarkRun, RegressionReport
+
+
+class TestBenchmarkRun:
+    """Tests for BenchmarkRun model."""
+
+    def test_create_benchmark_run(self) -> None:
+        """BenchmarkRun can be created with required fields."""
+        run = BenchmarkRun(
+            id="test-id-123",
+            benchmark_name="test-benchmark",
+            timestamp=datetime(2025, 1, 15, 12, 0, 0, tzinfo=UTC),
+            veritext_version="0.1.0-dev",
+            metrics={"bleu4": 0.75, "rouge_l": 0.82},
+            sample_count=100,
+        )
+
+        assert run.id == "test-id-123"
+        assert run.benchmark_name == "test-benchmark"
+        assert run.veritext_version == "0.1.0-dev"
+        assert run.metrics == {"bleu4": 0.75, "rouge_l": 0.82}
+        assert run.sample_count == 100
+        assert run.metadata == {}
+
+    def test_create_with_metadata(self) -> None:
+        """BenchmarkRun can include optional metadata."""
+        run = BenchmarkRun(
+            id="test-id-456",
+            benchmark_name="test-benchmark",
+            timestamp=datetime(2025, 1, 15, 12, 0, 0, tzinfo=UTC),
+            veritext_version="0.1.0-dev",
+            metrics={"bleu4": 0.75},
+            sample_count=50,
+            metadata={"git_sha": "abc123", "model_version": "gpt-4"},
+        )
+
+        assert run.metadata == {"git_sha": "abc123", "model_version": "gpt-4"}
+
+    def test_frozen_model(self) -> None:
+        """BenchmarkRun is immutable."""
+        run = BenchmarkRun(
+            id="test-id",
+            benchmark_name="test",
+            timestamp=datetime(2025, 1, 15, 12, 0, 0, tzinfo=UTC),
+            veritext_version="0.1.0",
+            metrics={"bleu4": 0.5},
+            sample_count=10,
+        )
+
+        with pytest.raises(ValidationError):
+            run.id = "new-id"  # type: ignore[misc]
+
+    def test_serialisation(self) -> None:
+        """BenchmarkRun can be serialised to dict."""
+        run = BenchmarkRun(
+            id="test-id",
+            benchmark_name="test",
+            timestamp=datetime(2025, 1, 15, 12, 0, 0, tzinfo=UTC),
+            veritext_version="0.1.0",
+            metrics={"bleu4": 0.5},
+            sample_count=10,
+        )
+
+        data = run.model_dump()
+        assert data["id"] == "test-id"
+        assert data["benchmark_name"] == "test"
+        assert data["metrics"] == {"bleu4": 0.5}
+
+
+class TestRegressionReport:
+    """Tests for RegressionReport model."""
+
+    def test_no_regression_summary(self) -> None:
+        """Summary indicates no regression when detected is False."""
+        report = RegressionReport(
+            detected=False,
+            baseline={"bleu4": 0.75, "rouge_l": 0.80},
+            current={"bleu4": 0.76, "rouge_l": 0.81},
+            deltas={"bleu4": 0.01, "rouge_l": 0.01},
+            tolerance=0.05,
+        )
+
+        assert "No regression detected" in report.summary
+
+    def test_regression_summary(self) -> None:
+        """Summary lists regressed metrics when detected is True."""
+        report = RegressionReport(
+            detected=True,
+            baseline={"bleu4": 0.75, "rouge_l": 0.80},
+            current={"bleu4": 0.65, "rouge_l": 0.78},
+            deltas={"bleu4": -0.10, "rouge_l": -0.02},
+            tolerance=0.05,
+        )
+
+        assert "Regression detected" in report.summary
+        assert "bleu4" in report.summary
+        assert "0.6500" in report.summary
+        assert "baseline: 0.7500" in report.summary
+
+    def test_regression_excludes_within_tolerance(self) -> None:
+        """Summary only shows metrics that exceed tolerance."""
+        report = RegressionReport(
+            detected=True,
+            baseline={"bleu4": 0.75, "rouge_l": 0.80},
+            current={"bleu4": 0.65, "rouge_l": 0.78},
+            deltas={"bleu4": -0.10, "rouge_l": -0.02},
+            tolerance=0.05,
+        )
+
+        # rouge_l is -0.02, within tolerance of 0.05, so shouldn't appear
+        assert "rouge_l" not in report.summary
+        # bleu4 is -0.10, exceeds tolerance, so should appear
+        assert "bleu4" in report.summary
+
+    def test_frozen_model(self) -> None:
+        """RegressionReport is immutable."""
+        report = RegressionReport(
+            detected=False,
+            baseline={},
+            current={},
+            deltas={},
+            tolerance=0.05,
+        )
+
+        with pytest.raises(ValidationError):
+            report.detected = True  # type: ignore[misc]
+
+    def test_tolerance_in_summary(self) -> None:
+        """Summary includes tolerance threshold."""
+        report = RegressionReport(
+            detected=True,
+            baseline={"metric": 0.80},
+            current={"metric": 0.50},
+            deltas={"metric": -0.30},
+            tolerance=0.10,
+        )
+
+        assert "10.00%" in report.summary
@@ -0,0 +1,229 @@
+"""Tests for regression detection."""
+
+from datetime import UTC, datetime
+
+import pytest
+
+from veritext.benchmark.models import BenchmarkRun
+from veritext.benchmark.regression import compute_baseline, detect_regression
+
+
+def make_run(
+    run_id: str,
+    metrics: dict[str, float],
+    day: int = 1,
+) -> BenchmarkRun:
+    """Helper to create a BenchmarkRun."""
+    return BenchmarkRun(
+        id=run_id,
+        benchmark_name="test",
+        timestamp=datetime(2025, 1, day, 12, 0, 0, tzinfo=UTC),
+        veritext_version="0.1.0",
+        metrics=metrics,
+        sample_count=10,
+    )
+
+
+class TestComputeBaseline:
+    """Tests for baseline computation."""
+
+    def test_empty_runs(self) -> None:
+        """Returns empty baseline for empty runs list."""
+        baseline = compute_baseline([])
+        assert baseline == {}
+
+    def test_single_run(self) -> None:
+        """Single run produces baseline equal to that run's metrics."""
+        runs = [make_run("r1", {"bleu4": 0.75, "rouge_l": 0.80})]
+
+        baseline = compute_baseline(runs)
+
+        assert baseline["bleu4"] == 0.75
+        assert baseline["rouge_l"] == 0.80
+
+    def test_multiple_runs_average(self) -> None:
+        """Baseline is the average of all runs in window."""
+        runs = [
+            make_run("r1", {"bleu4": 0.70}, day=3),
+            make_run("r2", {"bleu4": 0.80}, day=2),
+            make_run("r3", {"bleu4": 0.90}, day=1),
+        ]
+
+        baseline = compute_baseline(runs, window=3)
+
+        assert baseline["bleu4"] == pytest.approx(0.80)  # (0.70+0.80+0.90)/3
+
+    def test_window_limits_runs(self) -> None:
+        """Only includes runs within the window size."""
+        runs = [
+            make_run("r1", {"bleu4": 0.70}, day=5),  # most recent
+            make_run("r2", {"bleu4": 0.80}, day=4),
+            make_run("r3", {"bleu4": 0.90}, day=3),
+            make_run("r4", {"bleu4": 0.60}, day=2),  # excluded
+            make_run("r5", {"bleu4": 0.50}, day=1),  # excluded
+        ]
+
+        baseline = compute_baseline(runs, window=3)
+
+        # Only first 3 runs: (0.70 + 0.80 + 0.90) / 3 = 0.80
+        assert baseline["bleu4"] == pytest.approx(0.80)
+
+    def test_partial_history(self) -> None:
+        """Works when fewer runs than window size exist."""
+        runs = [
+            make_run("r1", {"bleu4": 0.70}),
+            make_run("r2", {"bleu4": 0.80}),
+        ]
+
+        baseline = compute_baseline(runs, window=10)
+
+        # Only 2 runs available: (0.70 + 0.80) / 2 = 0.75
+        assert baseline["bleu4"] == pytest.approx(0.75)
+
+    def test_multiple_metrics(self) -> None:
+        """Computes baseline for all metrics present."""
+        runs = [
+            make_run("r1", {"bleu4": 0.70, "rouge_l": 0.75}),
+            make_run("r2", {"bleu4": 0.80, "rouge_l": 0.85}),
+        ]
+
+        baseline = compute_baseline(runs)
+
+        assert baseline["bleu4"] == pytest.approx(0.75)
+        assert baseline["rouge_l"] == pytest.approx(0.80)
+
+    def test_varying_metrics(self) -> None:
+        """Handles runs with different metric sets."""
+        runs = [
+            make_run("r1", {"bleu4": 0.70, "rouge_l": 0.75}),
+            make_run("r2", {"bleu4": 0.80}),  # No rouge_l
+        ]
+
+        baseline = compute_baseline(runs)
+
+        # bleu4 appears in both runs
+        assert baseline["bleu4"] == pytest.approx(0.75)
+        # rouge_l only appears in one run
+        assert baseline["rouge_l"] == pytest.approx(0.75)
+
+
+class TestDetectRegression:
+    """Tests for regression detection."""
+
+    def test_no_baseline(self) -> None:
+        """No regression when baseline is empty."""
+        report = detect_regression(
+            current={"bleu4": 0.70},
+            baseline={},
+            tolerance=0.05,
+        )
+
+        assert not report.detected
+        assert report.deltas == {}
+
+    def test_no_regression_stable(self) -> None:
+        """No regression when metrics are stable."""
+        report = detect_regression(
+            current={"bleu4": 0.75},
+            baseline={"bleu4": 0.75},
+            tolerance=0.05,
+        )
+
+        assert not report.detected
+        assert report.deltas["bleu4"] == pytest.approx(0.0)
+
+    def test_no_regression_improved(self) -> None:
+        """No regression when metrics improved."""
+        report = detect_regression(
+            current={"bleu4": 0.85},
+            baseline={"bleu4": 0.75},
+            tolerance=0.05,
+        )
+
+        assert not report.detected
+        assert report.deltas["bleu4"] == pytest.approx(0.10)
+
+    def test_no_regression_within_tolerance(self) -> None:
+        """No regression when drop is within tolerance."""
+        report = detect_regression(
+            current={"bleu4": 0.73},
+            baseline={"bleu4": 0.75},
+            tolerance=0.05,
+        )
+
+        assert not report.detected
+        assert report.deltas["bleu4"] == pytest.approx(-0.02)
+
+    def test_regression_detected(self) -> None:
+        """Regression detected when metric drops beyond tolerance."""
+        report = detect_regression(
+            current={"bleu4": 0.65},
+            baseline={"bleu4": 0.75},
+            tolerance=0.05,
+        )
+
+        assert report.detected
+        assert report.deltas["bleu4"] == pytest.approx(-0.10)
+
+    def test_regression_at_tolerance_boundary(self) -> None:
+        """Drop at tolerance boundary is not a regression."""
+        # 0.25 and 0.50 are exact in binary floating point, so the delta is exactly
+        # -tolerance. The implementation checks delta < -tolerance (strictly less).
+        report = detect_regression(
+            current={"bleu4": 0.25},
+            baseline={"bleu4": 0.50},
+            tolerance=0.25,
+        )
+
+        # Delta is exactly -0.25, which is not < -0.25
+        assert not report.detected
+        assert report.deltas["bleu4"] == -0.25
+
+    def test_regression_just_beyond_tolerance(self) -> None:
+        """Just beyond tolerance is a regression."""
+        report = detect_regression(
+            current={"bleu4": 0.6999},
+            baseline={"bleu4": 0.75},
+            tolerance=0.05,
+        )
+
+        # Delta is -0.0501, which is < -tolerance
+        assert report.detected
+
+    def test_multiple_metrics_any_regresses(self) -> None:
+        """Regression detected if any metric exceeds tolerance."""
+        report = detect_regression(
+            current={"bleu4": 0.65, "rouge_l": 0.80},
+            baseline={"bleu4": 0.75, "rouge_l": 0.80},
+            tolerance=0.05,
+        )
+
+        assert report.detected
+        # Only bleu4 regressed
+        assert report.deltas["bleu4"] == pytest.approx(-0.10)
+        assert report.deltas["rouge_l"] == pytest.approx(0.0)
+
+    def test_report_contains_all_values(self) -> None:
+        """Report includes baseline, current, and deltas."""
+        baseline = {"bleu4": 0.75, "rouge_l": 0.80}
+        current = {"bleu4": 0.65, "rouge_l": 0.82}
+
+        report = detect_regression(current, baseline, tolerance=0.05)
+
+        assert report.baseline == baseline
+        assert report.current == current
+        assert report.tolerance == 0.05
+        assert "bleu4" in report.deltas
+        assert "rouge_l" in report.deltas
+
+    def test_missing_metric_in_current(self) -> None:
+        """A metric missing from current is treated as zero."""
+        report = detect_regression(
+            current={},
+            baseline={"bleu4": 0.75},
+            tolerance=0.05,
+        )
+
+        # 0.0 - 0.75 = -0.75, which is a regression
+        assert report.detected
+        assert report.deltas["bleu4"] == pytest.approx(-0.75)
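The tests above pin down the behaviour expected of `compute_baseline` and `detect_regression`. The following is a minimal sketch consistent with those assertions, not the veritext implementation: the names `sketch_baseline` and `sketch_detect`, the plain-dict inputs, the tuple return value, and the default `window=5` are all assumptions made for illustration.

```python
from statistics import fmean


def sketch_baseline(
    runs: list[dict[str, float]], window: int = 5
) -> dict[str, float]:
    """Average each metric over the first `window` runs (assumed newest first)."""
    values: dict[str, list[float]] = {}
    for metrics in runs[:window]:
        for name, value in metrics.items():
            values.setdefault(name, []).append(value)
    # A metric missing from some runs is averaged over the runs that report it.
    return {name: fmean(vals) for name, vals in values.items()}


def sketch_detect(
    current: dict[str, float],
    baseline: dict[str, float],
    tolerance: float = 0.05,
) -> tuple[bool, dict[str, float]]:
    """Return (detected, deltas); metrics absent from `current` count as 0.0."""
    deltas = {
        name: current.get(name, 0.0) - base for name, base in baseline.items()
    }
    # Strict comparison: a drop of exactly `tolerance` is not a regression.
    detected = any(delta < -tolerance for delta in deltas.values())
    return detected, deltas
```

Iterating over the baseline keys rather than the current keys is what makes an empty baseline yield empty deltas and no regression, as `test_no_baseline` expects.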
@@ -0,0 +1,247 @@
+"""Tests for benchmark runner."""
+
+from pathlib import Path
+
+import pytest
+
+from veritext.benchmark.models import BenchmarkRun
+from veritext.benchmark.runner import Benchmark
+from veritext.core.exceptions import RegressionDetectedError
+
+
+@pytest.fixture
+def benchmark(tmp_path: Path) -> Benchmark:
+    """Create a Benchmark instance with temporary storage."""
+    return Benchmark("test-suite", storage_path=tmp_path / "benchmarks")
+
+
+@pytest.fixture
+def sample_data() -> tuple[list[str], list[str]]:
+    """Sample candidates and references for testing."""
+    candidates = [
+        "The quick brown fox jumps over the lazy dog.",
+        "A fast auburn fox leaps above the sleepy hound.",
+    ]
+    references = [
+        "The quick brown fox jumps over the lazy dog.",
+        "The swift brown fox jumps over the lazy dog.",
+    ]
+    return candidates, references
+
+
+class TestBenchmarkInit:
+    """Tests for Benchmark initialisation."""
+
+    def test_creates_storage_directory(self, tmp_path: Path) -> None:
+        """Benchmark creates storage directory on init."""
+        storage_path = tmp_path / "benchmarks"
+        Benchmark("my-suite", storage_path=storage_path)
+
+        assert storage_path.exists()
+
+    def test_name_property(self, benchmark: Benchmark) -> None:
+        """Benchmark exposes its name."""
+        assert benchmark.name == "test-suite"
+
+
+class TestEvaluate:
+    """Tests for the evaluate method."""
+
+    def test_evaluate_stores_run(
+        self, benchmark: Benchmark, sample_data: tuple[list[str], list[str]]
+    ) -> None:
+        """Evaluate creates and stores a benchmark run."""
+        candidates, references = sample_data
+
+        run = benchmark.evaluate(candidates, references)
+
+        assert isinstance(run, BenchmarkRun)
+        assert run.benchmark_name == "test-suite"
+        assert run.sample_count == 2
+
+    def test_evaluate_returns_metrics(
+        self, benchmark: Benchmark, sample_data: tuple[list[str], list[str]]
+    ) -> None:
+        """Evaluate computes default metrics."""
+        candidates, references = sample_data
+
+        run = benchmark.evaluate(candidates, references)
+
+        # Default metrics are rouge_l and bleu4
+        assert "rouge_l" in run.metrics
+        assert "bleu4" in run.metrics
+        assert 0.0 <= run.metrics["rouge_l"] <= 1.0
+        assert 0.0 <= run.metrics["bleu4"] <= 1.0
+
+    def test_evaluate_custom_metrics(
+        self, benchmark: Benchmark, sample_data: tuple[list[str], list[str]]
+    ) -> None:
+        """Evaluate can compute custom metrics."""
+        candidates, references = sample_data
+
+        run = benchmark.evaluate(
+            candidates, references, metrics=["bleu1", "bleu2", "rouge1"]
+        )
+
+        assert "bleu1" in run.metrics
+        assert "bleu2" in run.metrics
+        assert "rouge1" in run.metrics
+        assert "bleu4" not in run.metrics  # Not requested
+
+    def test_evaluate_with_metadata(
+        self, benchmark: Benchmark, sample_data: tuple[list[str], list[str]]
+    ) -> None:
+        """Evaluate can include metadata."""
+        candidates, references = sample_data
+
+        run = benchmark.evaluate(
+            candidates, references, metadata={"git_sha": "abc123", "model": "gpt-4"}
+        )
+
+        assert run.metadata == {"git_sha": "abc123", "model": "gpt-4"}
+
+    def test_evaluate_stores_retrievable(
+        self, benchmark: Benchmark, sample_data: tuple[list[str], list[str]]
+    ) -> None:
+        """Stored run can be retrieved."""
+        candidates, references = sample_data
+        run = benchmark.evaluate(candidates, references)
+
+        history = benchmark.get_history()
+
+        assert len(history) == 1
+        assert history[0].id == run.id
+
+
+class TestCheckRegression:
+    """Tests for regression checking."""
+
+    def test_check_no_runs(self, benchmark: Benchmark) -> None:
+        """No regression when no runs exist."""
+        report = benchmark.check_regression()
+
+        assert not report.detected
+        assert report.baseline == {}
+        assert report.current == {}
+
+    def test_check_single_run(
+        self, benchmark: Benchmark, sample_data: tuple[list[str], list[str]]
+    ) -> None:
+        """No regression with single run (no baseline)."""
+        candidates, references = sample_data
+        benchmark.evaluate(candidates, references)
+
+        report = benchmark.check_regression()
+
+        # First run has no baseline to compare against
+        assert not report.detected
+
+    def test_check_stable_metrics(
+        self, benchmark: Benchmark, sample_data: tuple[list[str], list[str]]
+    ) -> None:
+        """No regression when metrics are stable."""
+        candidates, references = sample_data
+
+        # Run multiple times with same data
+        for _ in range(3):
+            benchmark.evaluate(candidates, references)
+
+        report = benchmark.check_regression()
+        assert not report.detected
+
+    def test_check_reports_regression(self, tmp_path: Path) -> None:
+        """Reports regression when metrics drop significantly."""
+        benchmark = Benchmark("regress-test", storage_path=tmp_path / "benchmarks")
+
+        # First run with good metrics
+        good_candidates = ["The quick brown fox jumps."]
+        good_references = ["The quick brown fox jumps."]
+        benchmark.evaluate(good_candidates, good_references)
+
+        # Second run with worse metrics (different text)
+        bad_candidates = ["Something completely different here."]
+        benchmark.evaluate(bad_candidates, good_references)
+
+        report = benchmark.check_regression(tolerance=0.05)
+
+        # The second run shares no tokens with the reference, so scores drop sharply
+        assert report.detected
+
+
+class TestAssertNoRegression:
+    """Tests for assert_no_regression method."""
+
+    def test_passes_when_stable(
+        self, benchmark: Benchmark, sample_data: tuple[list[str], list[str]]
+    ) -> None:
+        """Does not raise when metrics are stable."""
+        candidates, references = sample_data
+
+        for _ in range(3):
+            benchmark.evaluate(candidates, references)
+
+        # Should not raise
+        benchmark.assert_no_regression()
+
+    def test_raises_on_regression(self, tmp_path: Path) -> None:
+        """Raises RegressionDetectedError when quality drops."""
+        benchmark = Benchmark("regress-test", storage_path=tmp_path / "benchmarks")
+
+        # Establish baseline with perfect match
+        perfect = ["The quick brown fox."]
+        benchmark.evaluate(perfect, perfect)
+
+        # Second run with terrible match
+        terrible = ["Completely unrelated text."]
+        benchmark.evaluate(terrible, perfect)
+
+        with pytest.raises(RegressionDetectedError):
+            benchmark.assert_no_regression(tolerance=0.05)
+
+
+class TestGetHistory:
+    """Tests for get_history method."""
+
+    def test_empty_history(self, benchmark: Benchmark) -> None:
+        """Returns empty list when no runs."""
+        history = benchmark.get_history()
+        assert history == []
+
+    def test_returns_runs(
+        self, benchmark: Benchmark, sample_data: tuple[list[str], list[str]]
+    ) -> None:
+        """Returns benchmark runs."""
+        candidates, references = sample_data
+
+        run1 = benchmark.evaluate(candidates, references)
+        run2 = benchmark.evaluate(candidates, references)
+
+        history = benchmark.get_history()
+
+        assert len(history) == 2
+        assert history[0].id == run2.id  # Most recent first
+        assert history[1].id == run1.id
+
+    def test_respects_limit(
+        self, benchmark: Benchmark, sample_data: tuple[list[str], list[str]]
+    ) -> None:
+        """Respects limit parameter."""
+        candidates, references = sample_data
+
+        for _ in range(5):
+            benchmark.evaluate(candidates, references)
+
+        history = benchmark.get_history(limit=3)
+        assert len(history) == 3
+
+    def test_default_limit(
+        self, benchmark: Benchmark, sample_data: tuple[list[str], list[str]]
+    ) -> None:
+        """Default limit is 20."""
+        candidates, references = sample_data
+
+        for _ in range(25):
+            benchmark.evaluate(candidates, references)
+
+        history = benchmark.get_history()
+        assert len(history) == 20
@@ -0,0 +1,297 @@
+"""Tests for benchmark SQLite storage."""
+
+import sqlite3
+import threading
+from datetime import UTC, datetime
+from pathlib import Path
+
+import pytest
+
+from veritext.benchmark.models import BenchmarkRun
+from veritext.benchmark.storage import BenchmarkStorage
+from veritext.core.exceptions import StorageError
+
+
+@pytest.fixture
+def db_path(tmp_path: Path) -> Path:
+    """Return a temporary database path."""
+    return tmp_path / "benchmarks" / "test.db"
+
+
+@pytest.fixture
+def storage(db_path: Path) -> BenchmarkStorage:
+    """Create a BenchmarkStorage instance."""
+    return BenchmarkStorage(db_path)
+
+
+@pytest.fixture
+def sample_run() -> BenchmarkRun:
+    """Create a sample benchmark run."""
+    return BenchmarkRun(
+        id="run-001",
+        benchmark_name="test-suite",
+        timestamp=datetime(2025, 1, 15, 12, 0, 0, tzinfo=UTC),
+        veritext_version="0.1.0.dev0",
+        metrics={"bleu4": 0.75, "rouge_l": 0.82},
+        sample_count=100,
+        metadata={"git_sha": "abc123"},
+    )
+
+
+class TestDatabaseCreation:
+    """Tests for database initialisation."""
+
+    def test_creates_database_file(self, db_path: Path) -> None:
+        """Storage creates the database file on init."""
+        assert not db_path.exists()
+        BenchmarkStorage(db_path)
+        assert db_path.exists()
+
+    def test_creates_parent_directories(self, tmp_path: Path) -> None:
+        """Storage creates parent directories if needed."""
+        nested_path = tmp_path / "deep" / "nested" / "path" / "test.db"
+        BenchmarkStorage(nested_path)
+        assert nested_path.exists()
+
+    def test_creates_tables(self, db_path: Path) -> None:
+        """Storage creates required tables."""
+        BenchmarkStorage(db_path)
+
+        conn = sqlite3.connect(str(db_path))
+        cursor = conn.execute("SELECT name FROM sqlite_master WHERE type='table'")
+        tables = {row[0] for row in cursor.fetchall()}
+        conn.close()
+
+        assert "benchmark_runs" in tables
+        assert "benchmark_metrics" in tables
+
+    def test_creates_index(self, db_path: Path) -> None:
+        """Storage creates index on benchmark_name and timestamp."""
+        BenchmarkStorage(db_path)
+
+        conn = sqlite3.connect(str(db_path))
+        cursor = conn.execute("SELECT name FROM sqlite_master WHERE type='index'")
+        indices = {row[0] for row in cursor.fetchall()}
+        conn.close()
+
+        assert "idx_benchmark_name" in indices
+
+
+class TestSaveRun:
+    """Tests for saving benchmark runs."""
+
+    def test_save_run(
+        self, storage: BenchmarkStorage, sample_run: BenchmarkRun
+    ) -> None:
+        """Storage can save a benchmark run."""
+        storage.save_run(sample_run)
+
+        runs = storage.get_runs("test-suite")
+        assert len(runs) == 1
+        assert runs[0].id == "run-001"
+
+    def test_save_preserves_all_fields(
+        self, storage: BenchmarkStorage, sample_run: BenchmarkRun
+    ) -> None:
+        """Saved run preserves all fields correctly."""
+        storage.save_run(sample_run)
+
+        runs = storage.get_runs("test-suite")
+        run = runs[0]
+
+        assert run.id == sample_run.id
+        assert run.benchmark_name == sample_run.benchmark_name
+        assert run.timestamp == sample_run.timestamp
+        assert run.veritext_version == sample_run.veritext_version
+        assert run.metrics == sample_run.metrics
+        assert run.sample_count == sample_run.sample_count
+        assert run.metadata == sample_run.metadata
+
+    def test_save_duplicate_id_raises(
+        self, storage: BenchmarkStorage, sample_run: BenchmarkRun
+    ) -> None:
+        """Saving a run with duplicate ID raises StorageError."""
+        storage.save_run(sample_run)
+
+        with pytest.raises(StorageError, match="already exists"):
+            storage.save_run(sample_run)
+
+    def test_save_run_empty_metadata(self, storage: BenchmarkStorage) -> None:
+        """Run with empty metadata saves correctly."""
+        run = BenchmarkRun(
+            id="run-no-meta",
+            benchmark_name="test-suite",
+            timestamp=datetime(2025, 1, 15, 12, 0, 0, tzinfo=UTC),
+            veritext_version="0.1.0.dev0",
+            metrics={"bleu4": 0.5},
+            sample_count=10,
+        )
+
+        storage.save_run(run)
+        retrieved = storage.get_latest_run("test-suite")
+
+        assert retrieved is not None
+        assert retrieved.metadata == {}
+
+
+class TestGetRuns:
+    """Tests for retrieving benchmark runs."""
+
+    def test_get_runs_empty_database(self, storage: BenchmarkStorage) -> None:
+        """Returns empty list for empty database."""
+        runs = storage.get_runs("nonexistent")
+        assert runs == []
+
+    def test_get_runs_filters_by_name(self, storage: BenchmarkStorage) -> None:
+        """Returns only runs matching the benchmark name."""
+        run1 = BenchmarkRun(
+            id="run-1",
+            benchmark_name="suite-a",
+            timestamp=datetime(2025, 1, 15, 12, 0, 0, tzinfo=UTC),
+            veritext_version="0.1.0",
+            metrics={"bleu4": 0.5},
+            sample_count=10,
+        )
+        run2 = BenchmarkRun(
+            id="run-2",
+            benchmark_name="suite-b",
+            timestamp=datetime(2025, 1, 15, 12, 0, 0, tzinfo=UTC),
+            veritext_version="0.1.0",
+            metrics={"bleu4": 0.6},
+            sample_count=10,
+        )
+
+        storage.save_run(run1)
+        storage.save_run(run2)
+
+        runs_a = storage.get_runs("suite-a")
+        runs_b = storage.get_runs("suite-b")
+
+        assert len(runs_a) == 1
+        assert runs_a[0].id == "run-1"
+        assert len(runs_b) == 1
+        assert runs_b[0].id == "run-2"
+
+    def test_get_runs_ordered_by_timestamp(self, storage: BenchmarkStorage) -> None:
+        """Returns runs ordered by timestamp, most recent first."""
+        run_old = BenchmarkRun(
+            id="run-old",
+            benchmark_name="test",
+            timestamp=datetime(2025, 1, 10, 12, 0, 0, tzinfo=UTC),
+            veritext_version="0.1.0",
+            metrics={"bleu4": 0.5},
+            sample_count=10,
+        )
+        run_new = BenchmarkRun(
+            id="run-new",
+            benchmark_name="test",
+            timestamp=datetime(2025, 1, 20, 12, 0, 0, tzinfo=UTC),
+            veritext_version="0.1.0",
+            metrics={"bleu4": 0.6},
+            sample_count=10,
+        )
+
+        # Save in reverse order
+        storage.save_run(run_new)
+        storage.save_run(run_old)
+
+        runs = storage.get_runs("test")
+        assert runs[0].id == "run-new"
+        assert runs[1].id == "run-old"
+
+    def test_get_runs_with_limit(self, storage: BenchmarkStorage) -> None:
+        """Respects limit parameter."""
+        for i in range(5):
+            run = BenchmarkRun(
+                id=f"run-{i}",
+                benchmark_name="test",
+                timestamp=datetime(2025, 1, i + 1, 12, 0, 0, tzinfo=UTC),
+                veritext_version="0.1.0",
+                metrics={"bleu4": 0.5 + i * 0.1},
+                sample_count=10,
+            )
+            storage.save_run(run)
+
+        runs = storage.get_runs("test", limit=3)
+        assert len(runs) == 3
+
+
+class TestGetLatestRun:
+    """Tests for getting the latest run."""
+
+    def test_get_latest_run_empty(self, storage: BenchmarkStorage) -> None:
+        """Returns None for empty database."""
+        result = storage.get_latest_run("nonexistent")
+        assert result is None
+
+    def test_get_latest_run(self, storage: BenchmarkStorage) -> None:
+        """Returns the most recent run."""
+        run_old = BenchmarkRun(
+            id="run-old",
+            benchmark_name="test",
+            timestamp=datetime(2025, 1, 10, 12, 0, 0, tzinfo=UTC),
+            veritext_version="0.1.0",
+            metrics={"bleu4": 0.5},
+            sample_count=10,
+        )
+        run_new = BenchmarkRun(
+            id="run-new",
+            benchmark_name="test",
+            timestamp=datetime(2025, 1, 20, 12, 0, 0, tzinfo=UTC),
+            veritext_version="0.1.0",
+            metrics={"bleu4": 0.6},
+            sample_count=10,
+        )
+
+        storage.save_run(run_old)
+        storage.save_run(run_new)
+
+        latest = storage.get_latest_run("test")
+        assert latest is not None
+        assert latest.id == "run-new"
+
+
+class TestConcurrentAccess:
+    """Tests for concurrent database access."""
+
+    def test_concurrent_writes(self, db_path: Path) -> None:
+        """Multiple threads can write concurrently with WAL mode."""
+        errors: list[Exception] = []
+
+        def write_run(run_id: int) -> None:
+            try:
+                storage = BenchmarkStorage(db_path)
+                run = BenchmarkRun(
+                    id=f"run-{run_id}",
+                    benchmark_name="test",
+                    timestamp=datetime(2025, 1, 15, 12, 0, run_id, tzinfo=UTC),
+                    veritext_version="0.1.0",
+                    metrics={"bleu4": 0.5},
+                    sample_count=10,
+                )
+                storage.save_run(run)
+            except Exception as e:
+                errors.append(e)
+
+        threads = [threading.Thread(target=write_run, args=(i,)) for i in range(10)]
+        for t in threads:
+            t.start()
+        for t in threads:
+            t.join()
+
+        assert not errors, f"Concurrent writes failed: {errors}"
+
+        storage = BenchmarkStorage(db_path)
+        runs = storage.get_runs("test")
+        assert len(runs) == 10
+
+    def test_wal_mode_enabled(self, db_path: Path) -> None:
+        """Database uses WAL journal mode."""
+        BenchmarkStorage(db_path)
+
+        conn = sqlite3.connect(str(db_path))
+        cursor = conn.execute("PRAGMA journal_mode")
+        mode = cursor.fetchone()[0]
+        conn.close()
+
+        assert mode.lower() == "wal"
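The `PRAGMA journal_mode` check above can be reproduced with the standard library alone. The sketch below shows one way a store might enable WAL when it opens a database. The helper names `open_wal` and `journal_mode` and the `runs` table are hypothetical, and nothing here is taken from `BenchmarkStorage` itself.

```python
import sqlite3
import tempfile
from contextlib import closing
from pathlib import Path


def open_wal(path: Path) -> sqlite3.Connection:
    """Open (creating if needed) a SQLite database and switch it to WAL mode."""
    path.parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(str(path))
    # WAL is a persistent setting: later connections to the same file see it too.
    conn.execute("PRAGMA journal_mode=WAL")
    # Hypothetical table, created so the file has real content on disk.
    conn.execute("CREATE TABLE IF NOT EXISTS runs (id TEXT PRIMARY KEY)")
    conn.commit()
    return conn


def journal_mode(path: Path) -> str:
    """Report the journal mode as seen through a fresh connection."""
    with closing(sqlite3.connect(str(path))) as conn:
        return conn.execute("PRAGMA journal_mode").fetchone()[0].lower()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        db = Path(tmp) / "nested" / "bench.db"
        open_wal(db).close()
        print(journal_mode(db))
```

Reading the mode back through a second connection mirrors what `test_wal_mode_enabled` does, and confirms the setting is stored in the file and not only on the connection that set it.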
@@ -0,0 +1 @@
+"""CLI test suite."""
@@ -0,0 +1,337 @@
+"""Tests for CLI benchmark commands."""
+
+from pathlib import Path
+
+from typer.testing import CliRunner
+
+from veritext.cli.main import app
+
+runner = CliRunner()
+
+
+class TestBenchmarkRun:
+    """Tests for benchmark run command."""
+
+    def test_benchmark_run_basic(self, tmp_path: Path) -> None:
+        """Test basic benchmark run."""
+        data_file = tmp_path / "data.jsonl"
+        data_file.write_text(
+            '{"candidate": "hello world today", "reference": "hello world today"}\n'
+            '{"candidate": "foo bar baz qux", "reference": "foo bar baz qux"}'
+        )
+        storage_path = tmp_path / "benchmarks"
+
+        result = runner.invoke(
+            app,
+            [
+                "benchmark",
+                "run",
+                "test_bench",
+                "-f",
+                str(data_file),
+                "-m",
+                "rouge_l,bleu4",
+                "-s",
+                str(storage_path),
+            ],
+        )
+        assert result.exit_code == 0
+        assert "Benchmark 'test_bench' completed" in result.stdout
+        assert "Samples: 2" in result.stdout
+        assert "rouge_l:" in result.stdout
+        assert "bleu4:" in result.stdout
+
+    def test_benchmark_run_file_not_found(self, tmp_path: Path) -> None:
+        """Test benchmark run with non-existent file."""
+        result = runner.invoke(
+            app,
+            [
+                "benchmark",
+                "run",
+                "test_bench",
+                "-f",
+                "/nonexistent/file.jsonl",
+                "-s",
+                str(tmp_path / "benchmarks"),
+            ],
+        )
+        assert result.exit_code == 1
+        assert "Error" in result.stdout
+
+    def test_benchmark_run_creates_storage(self, tmp_path: Path) -> None:
+        """Test that benchmark run creates storage directory."""
+        data_file = tmp_path / "data.jsonl"
+        data_file.write_text('{"candidate": "hello", "reference": "hello"}')
+        storage_path = tmp_path / "new_benchmarks"
+
+        result = runner.invoke(
+            app,
+            [
+                "benchmark",
+                "run",
+                "test_bench",
+                "-f",
+                str(data_file),
+                "-s",
+                str(storage_path),
+            ],
+        )
+        assert result.exit_code == 0
+        assert storage_path.exists()
+
+
+class TestBenchmarkShow:
+    """Tests for benchmark show command."""
+
+    def test_benchmark_show_no_runs(self, tmp_path: Path) -> None:
+        """Test showing benchmark with no runs."""
+        storage_path = tmp_path / "benchmarks"
+        storage_path.mkdir()
+
+        result = runner.invoke(
+            app,
+            [
+                "benchmark",
+                "show",
+                "nonexistent_bench",
+                "-s",
+                str(storage_path),
+            ],
+        )
+        assert result.exit_code == 0
+        assert "No benchmark runs found" in result.stdout
+
+    def test_benchmark_show_with_runs(self, tmp_path: Path) -> None:
+        """Test showing benchmark history with runs."""
+        # First create some runs
+        data_file = tmp_path / "data.jsonl"
+        data_file.write_text('{"candidate": "hello world", "reference": "hello world"}')
+        storage_path = tmp_path / "benchmarks"
+
+        # Run benchmark twice
+        for _ in range(2):
+            runner.invoke(
+                app,
+                [
+                    "benchmark",
+                    "run",
+                    "test_bench",
+                    "-f",
+                    str(data_file),
+                    "-s",
+                    str(storage_path),
+                ],
+            )
+
+        # Show history
+        result = runner.invoke(
+            app,
+            [
+                "benchmark",
+                "show",
+                "test_bench",
+                "-s",
+                str(storage_path),
+            ],
+        )
+        assert result.exit_code == 0
+        assert "Benchmark History" in result.stdout
+
+    def test_benchmark_show_limit(self, tmp_path: Path) -> None:
+        """Test showing limited benchmark history."""
+        data_file = tmp_path / "data.jsonl"
+        data_file.write_text('{"candidate": "hello", "reference": "hello"}')
+        storage_path = tmp_path / "benchmarks"
+
+        # Run benchmark 3 times
+        for _ in range(3):
+            runner.invoke(
+                app,
+                [
+                    "benchmark",
+                    "run",
+                    "test_bench",
+                    "-f",
+                    str(data_file),
+                    "-s",
+                    str(storage_path),
+                ],
+            )
+
+        # Show only last 2
+        result = runner.invoke(
+            app,
+            [
+                "benchmark",
+                "show",
+                "test_bench",
+                "--last",
+                "2",
+                "-s",
+                str(storage_path),
+            ],
+        )
+        assert result.exit_code == 0
+
+
+class TestBenchmarkCheck:
+    """Tests for benchmark check command."""
+
+    def test_benchmark_check_no_regression(self, tmp_path: Path) -> None:
+        """Test checking for regression with no regression."""
+        data_file = tmp_path / "data.jsonl"
+        data_file.write_text(
+            '{"candidate": "hello world today", "reference": "hello world today"}'
+        )
+        storage_path = tmp_path / "benchmarks"
+
+        # Run benchmark twice with same data (no regression)
+        for _ in range(2):
+            runner.invoke(
+                app,
+                [
+                    "benchmark",
+                    "run",
+                    "test_bench",
+                    "-f",
+                    str(data_file),
+                    "-s",
+                    str(storage_path),
+                ],
+            )
+
+        # Check for regression
+        result = runner.invoke(
+            app,
+            [
+                "benchmark",
+                "check",
+                "test_bench",
+                "-s",
+                str(storage_path),
+            ],
+        )
+        assert result.exit_code == 0
+        assert "No regression detected" in result.stdout
+
+    def test_benchmark_check_with_regression(self, tmp_path: Path) -> None:
+        """Test checking for regression when regression occurs."""
+        storage_path = tmp_path / "benchmarks"
+
+        # First run with good data
+        good_file = tmp_path / "good.jsonl"
+        good_file.write_text(
+            '{"candidate": "hello world today", "reference": "hello world today"}'
+        )
+        runner.invoke(
+            app,
+            [
+                "benchmark",
+                "run",
+                "test_bench",
+                "-f",
+                str(good_file),
+                "-s",
+                str(storage_path),
+            ],
+        )
+
+        # Second run with bad data (regression)
+        bad_file = tmp_path / "bad.jsonl"
+        bad_file.write_text(
+            '{"candidate": "completely different", "reference": "hello world today"}'
+        )
+        runner.invoke(
+            app,
+            [
+                "benchmark",
+                "run",
+                "test_bench",
+                "-f",
+                str(bad_file),
+                "-s",
+                str(storage_path),
+            ],
+        )
+
+        # Check for regression
+        result = runner.invoke(
+            app,
+            [
+                "benchmark",
+                "check",
+                "test_bench",
+                "-t",
+                "0.05",
+                "-s",
+                str(storage_path),
+            ],
+        )
+        assert result.exit_code == 1
+        assert "Regression detected" in result.stdout
+
+    def test_benchmark_check_custom_tolerance(self, tmp_path: Path) -> None:
+        """Test checking regression with custom tolerance."""
+        data_file = tmp_path / "data.jsonl"
+        data_file.write_text('{"candidate": "hello", "reference": "hello"}')
+        storage_path = tmp_path / "benchmarks"
+
+        runner.invoke(
+            app,
+            [
+                "benchmark",
+                "run",
+                "test_bench",
+                "-f",
+                str(data_file),
+                "-s",
+                str(storage_path),
+            ],
+        )
+
+        result = runner.invoke(
+            app,
+            [
+                "benchmark",
+                "check",
+                "test_bench",
+                "--tolerance",
+                "0.10",
+                "-s",
+                str(storage_path),
+            ],
+        )
+        assert result.exit_code == 0
+        assert "10.00%" in result.stdout
+
+
+class TestBenchmarkHelp:
+    """Tests for benchmark help output."""
+
+    def test_benchmark_help(self) -> None:
+        """Test benchmark help output."""
+        result = runner.invoke(app, ["benchmark", "--help"])
+        assert result.exit_code == 0
+        assert "run" in result.stdout
+        assert "show" in result.stdout
+        assert "check" in result.stdout
+
+    def test_benchmark_run_help(self) -> None:
+        """Test benchmark run help output."""
+        result = runner.invoke(app, ["benchmark", "run", "--help"])
+        assert result.exit_code == 0
+        assert "--file" in result.stdout
+        assert "--metrics" in result.stdout
+
+    def test_benchmark_show_help(self) -> None:
+        """Test benchmark show help output."""
+        result = runner.invoke(app, ["benchmark", "show", "--help"])
+        assert result.exit_code == 0
+        assert "--last" in result.stdout
+
+    def test_benchmark_check_help(self) -> None:
+        """Test benchmark check help output."""
+        result = runner.invoke(app, ["benchmark", "check", "--help"])
+        assert result.exit_code == 0
+        assert "--tolerance" in result.stdout
+        assert "--window" in result.stdout
@@ -0,0 +1,141 @@
+"""Tests for CLI output formatters."""
+
+from datetime import UTC, datetime
+
+from veritext.benchmark.models import BenchmarkRun, RegressionReport
+from veritext.cli.formatters import (
+    format_benchmark_history,
+    format_regression_report,
+    format_validation_json,
+    format_validation_simple,
+    format_validation_table,
+)
+
+
+class TestFormatValidationTable:
+    """Tests for format_validation_table function."""
+
+    def test_format_empty_results(self) -> None:
+        """Test formatting empty results."""
+        table = format_validation_table({})
+        assert table.title == "Validation Results"
+        assert table.row_count == 0
+
+    def test_format_single_metric(self) -> None:
+        """Test formatting a single metric."""
+        results = {"bleu4": 0.8523}
+        table = format_validation_table(results)
+        assert table.row_count == 1
+
+    def test_format_multiple_metrics(self) -> None:
+        """Test formatting multiple metrics."""
+        results = {"bleu4": 0.85, "rouge_l": 0.92, "jaccard": 0.75}
+        table = format_validation_table(results)
+        assert table.row_count == 3
+
+    def test_format_with_threshold(self) -> None:
+        """Test formatting with threshold for pass/fail."""
+        results = {"bleu4": 0.85, "rouge_l": 0.45}
+        table = format_validation_table(results, threshold=0.5)
+        assert len(table.columns) == 3  # Metric, Score, Status
+        assert table.row_count == 2
+
+
+class TestFormatValidationJson:
+    """Tests for format_validation_json function."""
+
+    def test_format_empty_results(self) -> None:
+        """Test formatting empty results as JSON."""
+        result = format_validation_json({})
+        assert result == "{}"
+
+    def test_format_results(self) -> None:
+        """Test formatting results as JSON."""
+        results = {"bleu4": 0.85, "rouge_l": 0.92}
+        result = format_validation_json(results)
+        assert '"bleu4": 0.85' in result
+        assert '"rouge_l": 0.92' in result
+
+
+class TestFormatValidationSimple:
+    """Tests for format_validation_simple function."""
+
+    def test_format_empty_results(self) -> None:
+        """Test formatting empty results as simple text."""
+        result = format_validation_simple({})
+        assert result == ""
+
+    def test_format_results(self) -> None:
+        """Test formatting results as simple text."""
+        results = {"bleu4": 0.8523, "rouge_l": 0.9234}
+        result = format_validation_simple(results)
+        assert "bleu4: 0.8523" in result
+        assert "rouge_l: 0.9234" in result
+
+
+class TestFormatBenchmarkHistory:
+    """Tests for format_benchmark_history function."""
+
+    def test_format_empty_history(self) -> None:
+        """Test formatting empty benchmark history."""
+        table = format_benchmark_history([])
+        assert table.title == "Benchmark History"
+
+    def test_format_single_run(self) -> None:
+        """Test formatting a single benchmark run."""
+        run = BenchmarkRun(
+            id="test-id",
+            benchmark_name="test",
+            timestamp=datetime(2024, 1, 15, 10, 30, tzinfo=UTC),
+            veritext_version="0.1.0",
+            metrics={"rouge_l": 0.85, "bleu4": 0.72},
+            sample_count=100,
+        )
+        table = format_benchmark_history([run])
+        assert table.row_count == 1
+
+    def test_format_multiple_runs(self) -> None:
+        """Test formatting multiple benchmark runs."""
+        runs = [
+            BenchmarkRun(
+                id=f"test-id-{i}",
+                benchmark_name="test",
+                timestamp=datetime(2024, 1, i + 1, 10, 30, tzinfo=UTC),
+                veritext_version="0.1.0",
+                metrics={"rouge_l": 0.8 + i * 0.01},
+                sample_count=100,
+            )
+            for i in range(3)
+        ]
+        table = format_benchmark_history(runs)
+        assert table.row_count == 3
+
+
+class TestFormatRegressionReport:
+    """Tests for format_regression_report function."""
+
+    def test_format_no_regression(self) -> None:
+        """Test formatting report with no regression."""
+        report = RegressionReport(
+            detected=False,
+            baseline={"rouge_l": 0.85},
+            current={"rouge_l": 0.86},
+            deltas={"rouge_l": 0.01},
+            tolerance=0.05,
+        )
+        panel = format_regression_report(report)
+        assert panel.title == "Regression Check"
+        assert panel.border_style == "green"
+
+    def test_format_with_regression(self) -> None:
+        """Test formatting report with regression detected."""
+        report = RegressionReport(
+            detected=True,
+            baseline={"rouge_l": 0.85, "bleu4": 0.72},
+            current={"rouge_l": 0.70, "bleu4": 0.70},
+            deltas={"rouge_l": -0.15, "bleu4": -0.02},
+            tolerance=0.05,
+        )
+        panel = format_regression_report(report)
+        assert panel.title == "Regression Check"
+        assert panel.border_style == "red"
@@ -0,0 +1,145 @@
+"""Tests for CLI input readers."""
+
+import json
+from pathlib import Path
+
+import pytest
+
+from veritext.cli.readers import TextPair, read_jsonl, read_paired_jsonl
+
+
+class TestTextPair:
+    """Tests for TextPair dataclass."""
+
+    def test_create_text_pair(self) -> None:
+        """Test creating a TextPair."""
+        pair = TextPair(candidate="hello", reference="world")
+        assert pair.candidate == "hello"
+        assert pair.reference == "world"
+
+
+class TestReadJsonl:
+    """Tests for read_jsonl function."""
+
+    def test_read_valid_jsonl(self, tmp_path: Path) -> None:
+        """Test reading a valid JSONL file."""
+        data = [
+            {"candidate": "foo", "reference": "bar"},
+            {"candidate": "baz", "reference": "qux"},
+        ]
+        jsonl_file = tmp_path / "data.jsonl"
+        jsonl_file.write_text("\n".join(json.dumps(d) for d in data))
+
+        pairs = read_jsonl(jsonl_file)
+
+        assert len(pairs) == 2
+        assert pairs[0].candidate == "foo"
+        assert pairs[0].reference == "bar"
+        assert pairs[1].candidate == "baz"
+        assert pairs[1].reference == "qux"
+
+    def test_read_empty_file(self, tmp_path: Path) -> None:
+        """Test reading an empty JSONL file."""
+        jsonl_file = tmp_path / "empty.jsonl"
+        jsonl_file.write_text("")
+
+        pairs = read_jsonl(jsonl_file)
+
+        assert pairs == []
+
+    def test_read_file_with_blank_lines(self, tmp_path: Path) -> None:
+        """Test reading a JSONL file with blank lines."""
+        jsonl_file = tmp_path / "data.jsonl"
+        content = '{"candidate": "a", "reference": "b"}\n\n{"candidate": "c", "reference": "d"}\n'
+        jsonl_file.write_text(content)
+
+        pairs = read_jsonl(jsonl_file)
+
+        assert len(pairs) == 2
+
+    def test_read_file_not_found(self, tmp_path: Path) -> None:
+        """Test reading a non-existent file."""
+        with pytest.raises(FileNotFoundError):
+            read_jsonl(tmp_path / "nonexistent.jsonl")
+
+    def test_read_invalid_json(self, tmp_path: Path) -> None:
+        """Test reading a file with invalid JSON."""
+        jsonl_file = tmp_path / "invalid.jsonl"
+        jsonl_file.write_text("not valid json")
+
+        with pytest.raises(ValueError, match="Invalid JSON on line 1"):
+            read_jsonl(jsonl_file)
+
+    def test_read_missing_candidate_key(self, tmp_path: Path) -> None:
+        """Test reading a file missing the candidate key."""
+        jsonl_file = tmp_path / "data.jsonl"
+        jsonl_file.write_text('{"reference": "bar"}')
+
+        with pytest.raises(ValueError, match="Missing 'candidate' key on line 1"):
+            read_jsonl(jsonl_file)
+
+    def test_read_missing_reference_key(self, tmp_path: Path) -> None:
+        """Test reading a file missing the reference key."""
+        jsonl_file = tmp_path / "data.jsonl"
+        jsonl_file.write_text('{"candidate": "foo"}')
+
+        with pytest.raises(ValueError, match="Missing 'reference' key on line 1"):
+            read_jsonl(jsonl_file)
+
+
+class TestReadPairedJsonl:
+    """Tests for read_paired_jsonl function."""
+
+    def test_read_paired_valid(self, tmp_path: Path) -> None:
+        """Test reading valid paired JSONL files."""
+        candidates_file = tmp_path / "candidates.jsonl"
+        references_file = tmp_path / "references.jsonl"
+
+        candidates_file.write_text('{"text": "foo"}\n{"text": "bar"}')
+        references_file.write_text('{"text": "baz"}\n{"text": "qux"}')
+
+        pairs = read_paired_jsonl(candidates_file, references_file)
+
+        assert len(pairs) == 2
+        assert pairs[0].candidate == "foo"
+        assert pairs[0].reference == "baz"
+        assert pairs[1].candidate == "bar"
+        assert pairs[1].reference == "qux"
+
+    def test_read_paired_length_mismatch(self, tmp_path: Path) -> None:
+        """Test reading paired files with different lengths."""
+        candidates_file = tmp_path / "candidates.jsonl"
+        references_file = tmp_path / "references.jsonl"
+
+        candidates_file.write_text('{"text": "foo"}\n{"text": "bar"}')
+        references_file.write_text('{"text": "baz"}')
+
+        with pytest.raises(ValueError, match="does not match"):
+            read_paired_jsonl(candidates_file, references_file)
+
+    def test_read_paired_candidates_not_found(self, tmp_path: Path) -> None:
+        """Test reading when candidates file doesn't exist."""
+        references_file = tmp_path / "references.jsonl"
+        references_file.write_text('{"text": "baz"}')
+
+        with pytest.raises(FileNotFoundError, match="Candidates file not found"):
+            read_paired_jsonl(tmp_path / "nonexistent.jsonl", references_file)
+
+    def test_read_paired_references_not_found(self, tmp_path: Path) -> None:
+        """Test reading when references file doesn't exist."""
+        candidates_file = tmp_path / "candidates.jsonl"
+        candidates_file.write_text('{"text": "foo"}')
+
+        with pytest.raises(FileNotFoundError, match="References file not found"):
+            read_paired_jsonl(candidates_file, tmp_path / "nonexistent.jsonl")
+
+    def test_read_paired_missing_text_key(self, tmp_path: Path) -> None:
+        """Test reading paired files with missing text key."""
+        candidates_file = tmp_path / "candidates.jsonl"
+        references_file = tmp_path / "references.jsonl"
+
+        candidates_file.write_text('{"value": "foo"}')
+        references_file.write_text('{"text": "baz"}')
+
+        with pytest.raises(ValueError, match="Missing 'text' key in candidates file"):
+            read_paired_jsonl(candidates_file, references_file)
@@ -0,0 +1,233 @@
+"""Tests for CLI validate command."""
+
+import json
+from pathlib import Path
+
+from typer.testing import CliRunner
+
+from veritext.cli.main import app
+
+runner = CliRunner()
+
+
+class TestValidateInline:
+    """Tests for inline validation mode."""
+
+    def test_validate_inline_basic(self) -> None:
+        """Test basic inline validation."""
+        result = runner.invoke(
+            app,
+            [
+                "validate",
+                "The quick brown fox jumps",
+                "-r",
+                "The quick brown fox jumps",
+                "-m",
+                "bleu",
+            ],
+        )
+        assert result.exit_code == 0
+        assert "bleu4" in result.stdout
+
+    def test_validate_inline_with_rouge(self) -> None:
+        """Test inline validation with ROUGE metric."""
+        result = runner.invoke(
+            app,
+            [
+                "validate",
+                "hello world today",
+                "-r",
+                "hello world here",
+                "-m",
+                "rouge",
+            ],
+        )
+        assert result.exit_code == 0
+        assert "rouge_l" in result.stdout
+
+    def test_validate_inline_with_lexical(self) -> None:
+        """Test inline validation with lexical metric."""
+        result = runner.invoke(
+            app,
+            [
+                "validate",
+                "hello world",
+                "-r",
+                "hello everyone",
+                "-m",
+                "lexical",
+            ],
+        )
+        assert result.exit_code == 0
+        assert "jaccard" in result.stdout
+        assert "token_overlap" in result.stdout
+
+    def test_validate_inline_json_output(self) -> None:
+        """Test inline validation with JSON output."""
+        result = runner.invoke(
+            app,
+            [
+                "validate",
+                "hello world today",
+                "-r",
+                "hello world today",
+                "-m",
+                "bleu",
+                "-o",
+                "json",
+            ],
+        )
+        assert result.exit_code == 0
+        data = json.loads(result.stdout)
+        assert "bleu4" in data
+
+    def test_validate_inline_simple_output(self) -> None:
+        """Test inline validation with simple output."""
+        result = runner.invoke(
+            app,
+            [
+                "validate",
+                "hello world today",
+                "-r",
+                "hello world today",
+                "-m",
+                "rouge",
+                "-o",
+                "simple",
+            ],
+        )
+        assert result.exit_code == 0
+        assert "rouge_l:" in result.stdout
+
+    def test_validate_inline_missing_reference(self) -> None:
+        """Test inline validation without reference."""
+        result = runner.invoke(
+            app,
+            ["validate", "hello world", "-m", "bleu"],
+        )
+        assert result.exit_code == 1
+        assert "Error" in result.stdout
+
+    def test_validate_inline_invalid_metric(self) -> None:
+        """Test inline validation with invalid metric."""
+        result = runner.invoke(
+            app,
+            ["validate", "hello", "-r", "world", "-m", "invalid_metric"],
+        )
+        assert result.exit_code == 1
+        assert "Unknown metrics" in result.stdout
+
+
+class TestValidateFile:
+    """Tests for file-based validation mode."""
+
+    def test_validate_file_basic(self, tmp_path: Path) -> None:
+        """Test basic file-based validation."""
+        data_file = tmp_path / "data.jsonl"
+        data_file.write_text(
+            '{"candidate": "hello world today", "reference": "hello world today"}\n'
+            '{"candidate": "foo bar baz", "reference": "foo bar baz"}'
+        )
+
+        result = runner.invoke(
+            app,
+            ["validate", "-f", str(data_file), "-m", "bleu"],
+        )
+        assert result.exit_code == 0
+        assert "bleu4" in result.stdout
+        assert "Evaluated 2 text pairs" in result.stdout
+
+    def test_validate_file_not_found(self) -> None:
+        """Test file-based validation with non-existent file."""
+        result = runner.invoke(
+            app,
+            ["validate", "-f", "/nonexistent/file.jsonl", "-m", "bleu"],
+        )
+        assert result.exit_code == 1
+        assert "Error" in result.stdout
+
+    def test_validate_paired_files(self, tmp_path: Path) -> None:
+        """Test validation with separate candidate and reference files."""
+        candidates_file = tmp_path / "candidates.jsonl"
+        references_file = tmp_path / "references.jsonl"
+
+        candidates_file.write_text(
+            '{"text": "hello world today"}\n{"text": "foo bar baz"}'
+        )
+        references_file.write_text(
+            '{"text": "hello world today"}\n{"text": "foo bar baz"}'
+        )
+
+        result = runner.invoke(
+            app,
+            [
+                "validate",
+                "-f",
+                str(candidates_file),
+                "-R",
+                str(references_file),
+                "-m",
+                "bleu",
+            ],
+        )
+        assert result.exit_code == 0
+        assert "Evaluated 2 text pairs" in result.stdout
+
+
+class TestValidateOptions:
+    """Tests for validate command options."""
+
+    def test_validate_with_threshold(self) -> None:
+        """Test validation with threshold option."""
+        result = runner.invoke(
+            app,
+            [
+                "validate",
+                "hello world today",
+                "-r",
+                "hello world today",
+                "-m",
+                "bleu",
+                "-t",
+                "0.5",
+            ],
+        )
+        assert result.exit_code == 0
+        # Table output should include Status column
+        assert "Status" in result.stdout or "PASS" in result.stdout
+
+    def test_validate_invalid_output_format(self) -> None:
+        """Test validation with invalid output format."""
+        result = runner.invoke(
+            app,
+            [
+                "validate",
+                "hello",
+                "-r",
+                "world",
+                "-m",
+                "bleu",
+                "-o",
+                "invalid",
+            ],
+        )
+        assert result.exit_code == 1
+        assert "Invalid output format" in result.stdout
+
+    def test_validate_multiple_metrics(self) -> None:
+        """Test validation with multiple metrics."""
+        result = runner.invoke(
+            app,
+            [
+                "validate",
+                "The quick brown fox",
+                "-r",
+                "The quick brown fox",
+                "-m",
+                "bleu,rouge,lexical",
+            ],
+        )
+        assert result.exit_code == 0
+        assert "bleu4" in result.stdout
+        assert "rouge_l" in result.stdout
+        assert "jaccard" in result.stdout
@@ -0,0 +1,73 @@
+"""Tests for configuration module."""
+
+from pathlib import Path
+
+import pytest
+
+from veritext.core.config import VeritextSettings, get_settings
+
+
+class TestVeritextSettings:
+    """Tests for VeritextSettings."""
+
+    def test_default_log_level(self) -> None:
+        """Test default log level is INFO."""
+        settings = VeritextSettings()
+        assert settings.log_level == "INFO"
+
+    def test_default_log_format(self) -> None:
+        """Test default log format is console."""
+        settings = VeritextSettings()
+        assert settings.log_format == "console"
+
+    def test_default_benchmark_path(self) -> None:
+        """Test default benchmark storage path."""
+        settings = VeritextSettings()
+        assert settings.benchmark_storage_path == Path("benchmarks")
+
+    def test_default_tokeniser_lowercase(self) -> None:
+        """Test default tokeniser lowercase setting."""
+        settings = VeritextSettings()
+        assert settings.tokeniser_lowercase is True
+
+    def test_default_tokeniser_remove_punctuation(self) -> None:
+        """Test default tokeniser remove punctuation setting."""
+        settings = VeritextSettings()
+        assert settings.tokeniser_remove_punctuation is True
+
+    def test_default_semantic_model(self) -> None:
+        """Test default semantic model name."""
+        settings = VeritextSettings()
+        assert settings.semantic_model == "all-MiniLM-L6-v2"
+
+    def test_default_semantic_cache_enabled(self) -> None:
+        """Test semantic cache is enabled by default."""
+        settings = VeritextSettings()
+        assert settings.semantic_cache_embeddings is True
+
+    def test_env_var_override(self, monkeypatch: pytest.MonkeyPatch) -> None:
+        """Test environment variable overrides default settings."""
+        monkeypatch.setenv("VERITEXT_LOG_LEVEL", "DEBUG")
+        settings = VeritextSettings()
+        assert settings.log_level == "DEBUG"
+
+    def test_env_var_override_log_format(self, monkeypatch: pytest.MonkeyPatch) -> None:
+        """Test environment variable overrides log format."""
+        monkeypatch.setenv("VERITEXT_LOG_FORMAT", "json")
+        settings = VeritextSettings()
+        assert settings.log_format == "json"
+
+
+class TestGetSettings:
+    """Tests for get_settings function."""
+
+    def test_get_settings_returns_instance(self) -> None:
+        """Test get_settings returns a VeritextSettings instance."""
+        settings = get_settings()
+        assert isinstance(settings, VeritextSettings)
+
+    def test_get_settings_returns_valid_defaults(self) -> None:
+        """Test get_settings returns instance with valid defaults."""
+        settings = get_settings()
+        assert settings.log_level in ("DEBUG", "INFO", "WARNING", "ERROR")
+        assert settings.log_format in ("console", "json")
@@ -0,0 +1,56 @@
+"""Tests for logging module."""
+
+from veritext.core.logging import configure_logging, get_logger
+
+
+class TestGetLogger:
+    """Tests for get_logger function."""
+
+    def test_get_logger_returns_logger(self) -> None:
+        """Test get_logger returns a logger instance."""
+        logger = get_logger()
+        assert logger is not None
+
+    def test_get_logger_default_name(self) -> None:
+        """Test the default logger exposes the standard logging methods."""
+        logger = get_logger()
+        # The logger should be a bound logger from structlog
+        assert hasattr(logger, "info")
+        assert hasattr(logger, "debug")
+        assert hasattr(logger, "warning")
+        assert hasattr(logger, "error")
+
+    def test_get_logger_custom_name(self) -> None:
+        """Test get_logger accepts a custom name parameter."""
+        logger = get_logger("custom.module")
+        assert logger is not None
+        assert hasattr(logger, "info")
+
+
+class TestConfigureLogging:
+    """Tests for configure_logging function."""
+
+    def test_configure_logging_console_format(self) -> None:
+        """Test configure_logging with console format does not raise."""
+        configure_logging(level="INFO", log_format="console")
+        logger = get_logger()
+        assert logger is not None
+
+    def test_configure_logging_json_format(self) -> None:
+        """Test configure_logging with json format does not raise."""
+        configure_logging(level="DEBUG", log_format="json")
+        logger = get_logger()
+        assert logger is not None
+
+    def test_configure_logging_uses_defaults(self) -> None:
+        """Test configure_logging uses settings defaults when not provided."""
+        configure_logging()
+        logger = get_logger()
+        assert logger is not None
+
+    def test_configure_logging_different_levels(self) -> None:
+        """Test configure_logging accepts different log levels."""
+        for level in ("DEBUG", "INFO", "WARNING", "ERROR"):
+            configure_logging(level=level)
+            logger = get_logger()
+            assert logger is not None
@@ -0,0 +1 @@
+"""Tests for the Veritext pytest plugin."""
@@ -0,0 +1,32 @@
+"""Pytest configuration for pytest_plugin tests."""
+
+from collections.abc import Callable
+from typing import Any
+
+import pytest
+
+from veritext.core.types import ValidationContext
+from veritext.pytest_plugin.fixtures import ValidatorFactory
+
+# Enable the pytester fixture for plugin testing
+pytest_plugins = ["pytester"]
+
+# Re-export fixtures from the plugin module for testing
+
+
+@pytest.fixture
+def text_validator() -> ValidatorFactory:
+    """Provide a factory for building validators."""
+    return ValidatorFactory()
+
+
+@pytest.fixture
+def validation_context() -> Callable[..., ValidationContext]:
+    """Provide a factory for creating ValidationContext objects."""
+    def _create(
+        reference: str | list[str] | None = None,
+        **metadata: Any,
+    ) -> ValidationContext:
+        return ValidationContext(reference=reference, metadata=metadata)
+
+    return _create
@@ -0,0 +1,211 @@
+"""Tests for the validate_text assertion function."""
+
+import pytest
+
+from veritext.pytest_plugin import validate_text
+
+
+class TestValidateTextBasicValidation:
+    """Test basic validation scenarios."""
+
+    def test_passes_with_valid_length(self) -> None:
+        """Test validation passes when length constraints are met."""
+        text = "The quick brown fox jumps over the lazy dog."
+        validate_text(text, min_length=10, max_length=100)
+
+    def test_fails_when_too_short(self) -> None:
+        """Test validation fails when text is below minimum length."""
+        text = "Short."
+        with pytest.raises(AssertionError) as exc_info:
+            validate_text(text, min_length=50)
+        assert "length" in str(exc_info.value).lower()
+
+    def test_fails_when_too_long(self) -> None:
+        """Test validation fails when text exceeds maximum length."""
+        text = "A" * 100
+        with pytest.raises(AssertionError) as exc_info:
+            validate_text(text, max_length=50)
+        assert "length" in str(exc_info.value).lower()
+
+
+class TestValidateTextReadability:
+    """Test readability validation."""
+
+    def test_passes_with_simple_text(self) -> None:
+        """Test validation passes for simple, readable text."""
+        text = "The cat sat on the mat. It was a nice day."
+        validate_text(text, max_reading_grade=10.0)
+
+    def test_fails_with_complex_text(self) -> None:
+        """Test validation fails for overly complex text."""
+        text = (
+            "The implementation of sophisticated metacognitive strategies "
+            "necessitates the comprehensive understanding of epistemological "
+            "frameworks and their corresponding methodological implications."
+        )
+        with pytest.raises(AssertionError) as exc_info:
+            validate_text(text, max_reading_grade=3.0)
+        assert "readability" in str(exc_info.value).lower()
+
+
+class TestValidateTextPatterns:
+    """Test pattern matching validation."""
+
+    def test_passes_when_contains_pattern(self) -> None:
+        """Test validation passes when required pattern is present."""
+        text = "Please contact support@example.com for assistance."
+        validate_text(text, must_contain=["support@example.com"])
+
+    def test_fails_when_missing_required_pattern(self) -> None:
+        """Test validation fails when required pattern is missing."""
+        text = "Please contact us for assistance."
+        with pytest.raises(AssertionError) as exc_info:
+            validate_text(text, must_contain=["@example.com"])
+        assert "contains" in str(exc_info.value).lower()
+
+    def test_passes_when_excludes_pattern(self) -> None:
+        """Test validation passes when forbidden pattern is absent."""
+        text = "The report is complete and reviewed."
+        validate_text(text, must_exclude=["TODO", "FIXME"])
+
+    def test_fails_when_contains_forbidden_pattern(self) -> None:
+        """Test validation fails when forbidden pattern is present."""
+        text = "The report is almost done. TODO: add conclusion."
+        with pytest.raises(AssertionError) as exc_info:
+            validate_text(text, must_exclude=["TODO"])
+        assert "excludes" in str(exc_info.value).lower()
+
+
+class TestValidateTextComparisonMetrics:
+    """Test comparison-based validation (BLEU, ROUGE)."""
+
+    def test_passes_with_high_bleu_score(self) -> None:
+        """Test validation passes when BLEU score meets threshold."""
+        reference = "The quick brown fox jumps over the lazy dog."
+        text = "The quick brown fox jumps over the lazy dog."
+        validate_text(text, reference=reference, min_bleu=0.9)
+
+    def test_fails_with_low_bleu_score(self) -> None:
+        """Test validation fails when BLEU score is below threshold."""
+        reference = "The quick brown fox jumps over the lazy dog."
+        text = "A slow red cat sleeps under the active mouse."
+        with pytest.raises(AssertionError) as exc_info:
+            validate_text(text, reference=reference, min_bleu=0.5)
+        assert "bleu" in str(exc_info.value).lower()
+
+    def test_passes_with_high_rouge_score(self) -> None:
+        """Test validation passes when ROUGE score meets threshold."""
+        reference = "Machine learning models require extensive training data."
+        text = "Machine learning models need extensive training data."
+        validate_text(text, reference=reference, min_rouge=0.5)
+
+    def test_fails_with_low_rouge_score(self) -> None:
+        """Test validation fails when ROUGE score is below threshold."""
+        reference = "The algorithm processes input data efficiently."
+        text = "Cats enjoy sleeping in sunny spots."
+        with pytest.raises(AssertionError) as exc_info:
+            validate_text(text, reference=reference, min_rouge=0.5)
+        assert "rouge" in str(exc_info.value).lower()
+
+
+class TestValidateTextErrorHandling:
+    """Test error handling and edge cases."""
+
+    def test_raises_value_error_when_no_criteria(self) -> None:
+        """Test that ValueError is raised when no validation criteria provided."""
+        with pytest.raises(ValueError, match="At least one validation criterion"):
+            validate_text("Some text")
+
+    def test_raises_value_error_when_bleu_without_reference(self) -> None:
+        """Test that ValueError is raised when BLEU requested without reference."""
+        with pytest.raises(ValueError, match="Reference text required"):
+            validate_text("Some text", min_bleu=0.5)
+
+    def test_raises_value_error_when_rouge_without_reference(self) -> None:
+        """Test that ValueError is raised when ROUGE requested without reference."""
+        with pytest.raises(ValueError, match="Reference text required"):
+            validate_text("Some text", min_rouge=0.5)
+
+    def test_raises_value_error_when_semantic_without_reference(self) -> None:
+        """Test that ValueError is raised for semantic without reference."""
+        with pytest.raises(ValueError, match="Reference text required"):
+            validate_text("Some text", min_semantic=0.5)
+
+
+class TestValidateTextMultipleCriteria:
+    """Test validation with multiple criteria combined."""
+
+    def test_passes_all_criteria(self) -> None:
+        """Test validation passes when all criteria are met."""
+        reference = "The quick brown fox jumps over the lazy dog."
+        text = "The quick brown fox jumps over the lazy dog."
+        validate_text(
+            text,
+            reference=reference,
+            min_bleu=0.9,
+            min_length=10,
+            max_length=100,
+        )
+
+    def test_fails_when_one_criterion_fails(self) -> None:
+        """Test validation fails when any criterion fails."""
+        reference = "The quick brown fox jumps over the lazy dog."
+        text = "The quick brown fox jumps over the lazy dog."
+        with pytest.raises(AssertionError):
+            validate_text(
+                text,
+                reference=reference,
+                min_bleu=0.9,
+                max_length=10,  # Fails: the text is 44 characters long
+            )
+
+
+class TestValidateTextFailureMessage:
+    """Test failure message formatting."""
+
+    def test_failure_message_includes_text_preview(self) -> None:
+        """Test that failure message includes preview of the text."""
+        text = "Short text"
+        with pytest.raises(AssertionError) as exc_info:
+            validate_text(text, min_length=100)
+        assert "Short text" in str(exc_info.value)
+
+    def test_failure_message_truncates_long_text(self) -> None:
+        """Test that long text is truncated in failure message."""
+        text = "A" * 200
+        with pytest.raises(AssertionError) as exc_info:
+            validate_text(text, max_length=50)
+        message = str(exc_info.value)
+        assert "..." in message
+        assert "A" * 200 not in message
+
+    def test_failure_message_includes_check_details(self) -> None:
+        """Test that failure message includes check name and details."""
+        text = "Short"
+        with pytest.raises(AssertionError) as exc_info:
+            validate_text(text, min_length=100)
+        message = str(exc_info.value)
+        assert "Failed checks:" in message
+        assert "length" in message.lower()
+
+
+class TestValidateTextListReference:
+    """Test validation with list of reference texts."""
+
+    def test_bleu_with_multiple_references(self) -> None:
+        """Test BLEU validation accepts multiple reference texts."""
+        references = [
+            "The quick brown fox jumps over the lazy dog.",
+            "A fast brown fox leaps over a sleepy dog.",
+        ]
+        text = "The quick brown fox jumps over the lazy dog."
+        validate_text(text, reference=references, min_bleu=0.9)
+
+    def test_rouge_with_multiple_references(self) -> None:
+        """Test ROUGE validation accepts multiple reference texts."""
+        references = [
+            "Machine learning requires data.",
+            "ML models need training data.",
+        ]
+        text = "Machine learning models require training data."
+        validate_text(text, reference=references, min_rouge=0.3)
@@ -0,0 +1,88 @@
+"""Tests for the pytest plugin fixtures."""
+
+from veritext.core.types import ValidationContext
+from veritext.pytest_plugin.fixtures import ValidatorFactory
+from veritext.validators import bleu, length
+
+
+class TestValidatorFactory:
+    """Test the ValidatorFactory class."""
+
+    def test_creates_validator_from_checks(self) -> None:
+        """Test that factory creates a callable validator."""
+        factory = ValidatorFactory()
+        validate = factory(checks=[length(min_chars=5)])
+
+        result = validate("Hello, World!")
+        assert result.passed
+
+    def test_validator_uses_provided_reference(self) -> None:
+        """Test that factory passes reference to context."""
+        factory = ValidatorFactory()
+        reference = "The quick brown fox."
+        validate = factory(
+            checks=[bleu(min_score=0.5)],
+            reference=reference,
+        )
+
+        # Exact match should pass
+        result = validate("The quick brown fox.")
+        assert result.passed
+
+    def test_validator_returns_validation_result(self) -> None:
+        """Test that validator returns a ValidationResult."""
+        factory = ValidatorFactory()
+        validate = factory(checks=[length(min_chars=100)])
+
+        result = validate("Short")
+        assert not result.passed
+        assert len(result.checks) == 1
+        assert result.checks[0].name == "length"
+
+
+class TestTextValidatorFixture:
+    """Test the text_validator fixture."""
+
+    def test_fixture_returns_factory(self, text_validator: ValidatorFactory) -> None:
+        """Test that fixture provides a ValidatorFactory."""
+        assert isinstance(text_validator, ValidatorFactory)
+
+    def test_fixture_can_create_validators(
+        self,
+        text_validator: ValidatorFactory,
+    ) -> None:
+        """Test that fixture can be used to create validators."""
+        validate = text_validator(checks=[length(min_chars=5, max_chars=50)])
+
+        assert validate("Hello, World!").passed
+        assert not validate("Hi").passed
+
+
+class TestValidationContextFixture:
+    """Test the validation_context fixture."""
+
+    def test_fixture_creates_context(
+        self,
+        validation_context: type,
+    ) -> None:
+        """Test that fixture creates ValidationContext."""
+        ctx = validation_context(reference="Test reference")
+        assert isinstance(ctx, ValidationContext)
+        assert ctx.reference == "Test reference"
+
+    def test_fixture_accepts_metadata(
+        self,
+        validation_context: type,
+    ) -> None:
+        """Test that fixture passes metadata to context."""
+        ctx = validation_context(reference="Test", source="unit_test", version=1)
+        assert ctx.metadata["source"] == "unit_test"
+        assert ctx.metadata["version"] == 1
+
+    def test_fixture_allows_no_reference(
+        self,
+        validation_context: type,
+    ) -> None:
+        """Test that fixture allows creating context without reference."""
+        ctx = validation_context()
+        assert ctx.reference is None
@@ -0,0 +1,99 @@
+"""Tests for the pytest plugin hooks."""
+
+import pytest
+
+
+@pytest.fixture
+def plugin_pytester(pytester: pytest.Pytester) -> pytest.Pytester:
+    """Return pytester as-is for tests that exercise the veritext plugin.
+
+    Note: The plugin is already loaded via the entry point in pyproject.toml,
+    so no explicit pytest_plugins declaration is needed.
+    """
+    return pytester
+
+
+def test_plugin_registers_marker(plugin_pytester: pytest.Pytester) -> None:
+    """Test that the text_validation marker is registered."""
+    plugin_pytester.makepyfile(
+        """
+        import pytest
+
+        @pytest.mark.text_validation
+        def test_example():
+            pass
+        """
+    )
+    # Run with strict markers - this will fail if marker isn't registered
+    result = plugin_pytester.runpytest("--strict-markers")
+    result.assert_outcomes(passed=1)
+
+
+def test_marker_can_be_used(plugin_pytester: pytest.Pytester) -> None:
+    """Test that the text_validation marker can filter tests."""
+    plugin_pytester.makepyfile(
+        """
+        import pytest
+
+        @pytest.mark.text_validation
+        def test_marked():
+            pass
+
+        def test_unmarked():
+            pass
+        """
+    )
+    # Run only marked tests
+    result = plugin_pytester.runpytest("-m", "text_validation")
+    result.assert_outcomes(passed=1)
+
+
+def test_validate_text_is_importable(plugin_pytester: pytest.Pytester) -> None:
+    """Test that validate_text can be imported from the plugin."""
+    plugin_pytester.makepyfile(
+        """
+        from veritext.pytest_plugin import validate_text
+
+        def test_import():
+            assert callable(validate_text)
+        """
+    )
+    result = plugin_pytester.runpytest()
+    result.assert_outcomes(passed=1)
+
+
+def test_validate_text_works_in_tests(plugin_pytester: pytest.Pytester) -> None:
+    """Test that validate_text can be used in test functions."""
+    plugin_pytester.makepyfile(
+        """
+        from veritext.pytest_plugin import validate_text
+
+        def test_validation_passes():
+            validate_text(
+                "The quick brown fox jumps over the lazy dog.",
+                min_length=10,
+                max_length=100,
+            )
+        """
+    )
+    result = plugin_pytester.runpytest()
+    result.assert_outcomes(passed=1)
+
+
+def test_validate_text_failure_in_tests(plugin_pytester: pytest.Pytester) -> None:
+    """Test that validate_text failures are reported properly."""
+    plugin_pytester.makepyfile(
+        """
+        from veritext.pytest_plugin import validate_text
+
+        def test_validation_fails():
+            validate_text(
+                "Short",
+                min_length=100,
+            )
+        """
+    )
+    result = plugin_pytester.runpytest()
+    result.assert_outcomes(failed=1)
+    # Check that failure message contains useful information
+    result.stdout.fnmatch_lines(["*Text validation failed*"])
@@ -263,6 +263,11 @@ class TestContainsValidator:
        with pytest.raises(InvalidThresholdError, match="cannot be empty"):
            ContainsValidator(patterns=[])

+    def test_contains_validator_raises_on_invalid_regex(self) -> None:
+        """Test that invalid regex pattern raises error at init time."""
+        with pytest.raises(InvalidThresholdError, match="Invalid regex"):
+            ContainsValidator(patterns=[r"[invalid"])
+
    def test_contains_factory_function(self) -> None:
        """Test the contains() factory function."""
        validator = contains(patterns=["test"], case_sensitive=True)
@@ -327,6 +332,11 @@ class TestExcludesValidator:
        with pytest.raises(InvalidThresholdError, match="cannot be empty"):
            ExcludesValidator(patterns=[])

+    def test_excludes_validator_raises_on_invalid_regex(self) -> None:
+        """Test that invalid regex pattern raises error at init time."""
+        with pytest.raises(InvalidThresholdError, match="Invalid regex"):
+            ExcludesValidator(patterns=[r"[invalid"])
+
    def test_excludes_factory_function(self) -> None:
        """Test the excludes() factory function."""
        validator = excludes(patterns=["test"], case_sensitive=True)
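
The two tests added above expect an invalid pattern to be rejected when the validator is constructed, not when it first runs. The following is a minimal sketch of that idea and not the project's actual implementation: `compile_patterns` is a hypothetical helper, and the `InvalidThresholdError` class defined here is only a local stand-in for the library's exception of the same name.

```python
import re


class InvalidThresholdError(ValueError):
    """Local stand-in for the library's exception (assumption for this sketch)."""


def compile_patterns(
    patterns: list[str],
    case_sensitive: bool = False,
) -> list[re.Pattern[str]]:
    """Compile every pattern up front so bad input fails at construction time."""
    if not patterns:
        raise InvalidThresholdError("patterns cannot be empty")

    flags = 0 if case_sensitive else re.IGNORECASE
    compiled: list[re.Pattern[str]] = []
    for pattern in patterns:
        try:
            compiled.append(re.compile(pattern, flags))
        except re.error as exc:
            # Surface the offending pattern together with the regex engine's reason.
            raise InvalidThresholdError(
                f"Invalid regex pattern {pattern!r}: {exc}"
            ) from exc
    return compiled
```

A validator that calls such a helper from its `__init__` can reuse the compiled patterns on every check, and a typo such as `r"[invalid"` surfaces where the validator is defined instead of deep inside a later test run.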
Author	SHA1	Message	Date
kschappell	0699e97e1d	refactor: CLI cleanup and documentation updates - Refactor CLI metric computation to eliminate code duplication - Update version format to PEP 440 compliance (0.1.0.dev0) - Cache Settings instance via @lru_cache for performance - Document composite validators' protocol deviation - Consolidate redundant empty checks in ROUGE-L computation - Add Phase 10 (Portfolio Demos) to implementation plan	2026-02-04 15:38:46 +00:00
kschappell	7de4505e31	fix(pytest-plugin): remove duplicate plugin registration in tests The pytest plugin is already loaded via the entry point, so explicitly declaring it in conftest causes a duplicate registration error.	2026-02-04 00:43:20 +00:00
kschappell	564d663c78	docs(changelog): update for QA fixes	2026-02-04 00:23:06 +00:00
kschappell	0b2bc6c688	test(core): add coverage for config and logging modules Adds tests for VeritextSettings defaults, env var overrides, and the get_logger/configure_logging functions.	2026-02-04 00:22:57 +00:00
kschappell	aa687f43cd	fix(validators): validate regex patterns at init time ContainsValidator and ExcludesValidator now pre-compile regex patterns during initialisation and raise InvalidThresholdError if invalid.	2026-02-04 00:22:47 +00:00
kschappell	f18427e123	fix: QA review fixes for 0.1.0 release - Fix README readability example property names - Add validation for empty references after tokenisation in ROUGE - Guard against zero sentence count in readability metric - Implement LRU cache with max size for semantic embeddings - Add .score property to LexicalResult for API consistency - Use defensive list copy in composite validators	2026-02-03 21:31:48 +00:00
kschappell	1754556c99	docs(changelog): release 0.1.0 Initial release with metrics, validators, pytest plugin, benchmark module, CLI, and comprehensive documentation.	2026-02-03 19:16:37 +00:00
kschappell	13c869f5d6	docs(readme): comprehensive documentation Expands readme with detailed coverage of metrics, validators, pytest plugin, benchmark module, CLI commands, and development setup.	2026-02-03 19:16:14 +00:00
kschappell	93515707cc	docs(examples): add benchmark regression example Demonstrates benchmark quality tracking with historical comparison and CI integration using assert_no_regression() for exit code control.	2026-02-03 19:15:12 +00:00
kschappell	3cde5aba77	docs(examples): add chatbot testing example Demonstrates pytest integration for chatbot QA with validate_text() assertions, fixtures, and parametrised content safety tests.	2026-02-03 19:14:25 +00:00
kschappell	69966d171c	docs(examples): add basic validation example Demonstrates core Veritext functionality: metrics, validators, composites, and constraint validators with runnable code.	2026-02-03 19:13:47 +00:00
kschappell	d5df8b52e6	docs: add branch creation instruction to git workflow Explicitly documents the requirement to create a new branch before starting work from a plan, consistent with the parent workspace CLAUDE.md instruction.	2026-02-03 19:06:45 +00:00
kschappell	8b7c087de7	docs(changelog): add CLI entries Document command-line interface including validate command, benchmark subcommands, and output formatting options.	2026-02-03 18:22:50 +00:00
kschappell	c54f8c3f6f	test(cli): add CLI tests Add comprehensive test suite for validate command, benchmark commands, input readers, and output formatters using Typer CliRunner.	2026-02-03 18:22:31 +00:00
kschappell	0cadfd4d23	feat(cli): add benchmark subcommands Add benchmark run, show, and check commands for quality tracking with regression detection supporting CI integration.	2026-02-03 18:20:28 +00:00
kschappell	e128720917	feat(cli): add validate command Implement validate command with inline and file-based modes supporting BLEU, ROUGE, and lexical metrics with multiple output formats.	2026-02-03 18:19:20 +00:00
kschappell	f713d5e8a6	feat(cli): add Rich output formatters Add formatters for validation results (table/json/simple) and benchmark history display with regression report panels.	2026-02-03 18:17:33 +00:00
kschappell	9853b57843	feat(cli): add JSONL and directory input readers Add TextPair dataclass and read_jsonl/read_paired_jsonl functions for parsing candidate-reference pairs from JSONL files.	2026-02-03 18:16:34 +00:00
kschappell	55faae3e1b	feat(cli): add CLI entry point with version command Initialise Typer app with --version flag and help text.	2026-02-03 18:16:07 +00:00
kschappell	07ac70e835	docs(changelog): add benchmark entries Document benchmark module features in changelog.	2026-02-03 18:10:19 +00:00
kschappell	6d1bece815	test(benchmark): add benchmark module tests Comprehensive tests for models, storage, regression detection, and runner.	2026-02-03 18:10:13 +00:00
kschappell	40fa39485e	feat(benchmark): add module exports Public API exports for the benchmark module.	2026-02-03 18:10:07 +00:00
kschappell	9115f0c25b	feat(benchmark): add Benchmark runner class Main Benchmark class for evaluating text quality and tracking regressions.	2026-02-03 18:10:01 +00:00
kschappell	83c4b4bee5	feat(benchmark): add regression detection Rolling window baseline computation and statistical regression detection.	2026-02-03 18:09:55 +00:00
kschappell	44e3e8f4ea	feat(benchmark): add SQLite storage backend Persistent storage for benchmark history with WAL mode for concurrent access.	2026-02-03 18:09:49 +00:00
kschappell	45dfe07772	feat(benchmark): add BenchmarkRun and RegressionReport models Data models for benchmark runs and regression reports using Pydantic.	2026-02-03 18:09:43 +00:00
kschappell	6bafc43754	docs(changelog): add pytest plugin entries	2026-02-03 17:40:52 +00:00
kschappell	012b306749	test(pytest-plugin): add plugin tests Cover validate_text assertions, fixture factories, marker registration, and pytest integration using pytester for subprocess testing.	2026-02-03 17:40:46 +00:00
kschappell	ac7c5c69cf	feat(pytest-plugin): add validate_text assertion Primary API for text validation in pytest with keyword arguments for BLEU, ROUGE, semantic similarity, length, readability, and pattern matching. Includes detailed failure formatting.	2026-02-03 17:40:40 +00:00
kschappell	cd36c54e22	feat(pytest-plugin): add plugin hooks and markers Register text_validation marker via pytest_configure hook.	2026-02-03 17:40:33 +00:00
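
The most recent commit in the log mentions caching the Settings instance via `@lru_cache`. As a rough illustration of that pattern only: `DemoSettings`, `get_settings`, and the `DEMO_*` environment variables are invented for this sketch and are not Veritext names.

```python
import os
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class DemoSettings:
    """Hypothetical settings object, read from the environment once."""

    log_level: str
    log_format: str


@lru_cache(maxsize=1)
def get_settings() -> DemoSettings:
    """Build the settings on first call and return the cached instance afterwards."""
    return DemoSettings(
        log_level=os.environ.get("DEMO_LOG_LEVEL", "INFO"),
        log_format=os.environ.get("DEMO_LOG_FORMAT", "text"),
    )
```

One consequence of this design is that later changes to the environment are not picked up automatically. A test that overrides an environment variable would need to call `get_settings.cache_clear()` before the new value becomes visible.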