docs(changelog): add benchmark entries

Document benchmark module features in changelog.
test(benchmark): add benchmark module tests
2026-02-03 18:10:19 +00:00 · 2026-02-03 18:10:13 +00:00 · 2026-02-03 18:10:07 +00:00 · 2026-02-03 18:10:01 +00:00 · 2026-02-03 18:09:55 +00:00 · 2026-02-03 18:09:49 +00:00
39 changed files with 5633 additions and 2 deletions
@@ -18,4 +18,30 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - Metrics module with `Metric` protocol, `AggregateStats`, and `BatchResult` types
 - BLEU metric implementation (BLEU-1 through BLEU-4 with brevity penalty)
 - Lexical similarity metric (Jaccard similarity and token overlap)
+- ROUGE metric (ROUGE-1, ROUGE-2, ROUGE-L with precision/recall/F-measure)
+- Flesch-Kincaid readability metrics (grade level and reading ease)
 - Batch scoring with aggregate statistics for all metrics
+- Validators module with `Check` protocol for validation checks
+- Metric-based validators: `BleuValidator`, `RougeValidator`, `LexicalValidator`
+- Constraint validators: `LengthValidator`, `ReadabilityValidator`, `ContainsValidator`, `ExcludesValidator`
+- Composite validators: `AllOf` (all checks must pass), `AnyOf` (at least one check must pass)
+- Factory functions for a concise validator API (`bleu()`, `rouge()`, `lexical()`, `length()`, `readability()`, `contains()`, `excludes()`, `all_of()`, `any_of()`)
+- Semantic similarity module with embedding-based text comparison (requires `veritext[semantic]` extra)
+- `SemanticSimilarity` metric using sentence-transformers for semantic relatedness
+- `SemanticValidator` for threshold-based semantic similarity validation
+- `semantic()` factory function for creating semantic validators
+- Embedding caching for performance optimisation in repeated comparisons
+- Native pytest plugin for CI/CD integration (registered under the `pytest11` entry-point group)
+- `validate_text()` assertion function for expressive test assertions
+- `text_validation` marker for filtering validation tests
+- Pytest fixtures: `text_validator` factory and `validation_context` helper
+- Detailed failure messages with text preview and check diagnostics
+- Benchmark module for quality tracking and regression detection
+- `Benchmark` class for evaluating text quality over time with metric storage
+- `BenchmarkRun` and `RegressionReport` data models for tracking runs
+- SQLite storage backend with WAL mode for concurrent access
+- Rolling window baseline computation for historical comparison
+- `check_regression()` for comparing the latest run against the rolling-average baseline
+- `assert_no_regression()` raises `RegressionDetectedError` for CI integration
+- Customisable tolerance threshold and window size for regression detection
+- Metadata support for tracking git SHA, model versions, etc.
@@ -0,0 +1,12 @@
+"""Benchmark module for quality tracking and regression detection."""
+
+from veritext.benchmark.models import BenchmarkRun, RegressionReport
+from veritext.benchmark.runner import Benchmark
+from veritext.benchmark.storage import BenchmarkStorage
+
+__all__ = [
+    "Benchmark",
+    "BenchmarkRun",
+    "BenchmarkStorage",
+    "RegressionReport",
+]
@@ -0,0 +1,72 @@
+"""Benchmark data models."""
+
+from datetime import datetime
+from typing import Any
+
+from pydantic import BaseModel, ConfigDict, Field
+
+
+class BenchmarkRun(BaseModel):
+    """Record of a single benchmark execution."""
+
+    model_config = ConfigDict(frozen=True)
+
+    id: str
+    """UUID for this run."""
+
+    benchmark_name: str
+    """Name identifying this benchmark suite."""
+
+    timestamp: datetime
+    """When the benchmark was executed."""
+
+    veritext_version: str
+    """Version of veritext used."""
+
+    metrics: dict[str, float]
+    """Metric results, e.g. {"rouge_l": 0.82, "bleu4": 0.71}."""
+
+    sample_count: int
+    """Number of samples evaluated."""
+
+    metadata: dict[str, Any] = Field(default_factory=dict)
+    """Optional metadata (git_sha, model version, etc.)."""
+
+
+class RegressionReport(BaseModel):
+    """Report comparing current run against baseline."""
+
+    model_config = ConfigDict(frozen=True)
+
+    detected: bool
+    """Whether a regression was detected."""
+
+    baseline: dict[str, float]
+    """Baseline metric values (rolling average)."""
+
+    current: dict[str, float]
+    """Current run metric values."""
+
+    deltas: dict[str, float]
+    """Current value minus baseline for each metric (negative = drop)."""
+
+    tolerance: float
+    """Tolerance threshold used for detection."""
+
+    @property
+    def summary(self) -> str:
+        """Human-readable summary of the report."""
+        if not self.detected:
+            return "No regression detected. All metrics within tolerance."
+
+        regressions = [
+            f"  {metric}: {self.current.get(metric, 0.0):.4f} "
+            f"(baseline: {self.baseline.get(metric, 0.0):.4f}, "
+            f"delta: {delta:+.4f})"
+            for metric, delta in self.deltas.items()
+            if delta < -self.tolerance
+        ]
+
+        return f"Regression detected (tolerance: {self.tolerance:.2%}):\n" + "\n".join(
+            regressions
+        )
@@ -0,0 +1,87 @@
+"""Regression detection using rolling window comparison."""
+
+from veritext.benchmark.models import BenchmarkRun, RegressionReport
+
+
+def compute_baseline(
+    runs: list[BenchmarkRun],
+    window: int = 10,
+) -> dict[str, float]:
+    """
+    Compute rolling average baseline from recent runs.
+
+    Args:
+        runs: List of benchmark runs (most recent first).
+        window: Number of runs to include in the baseline.
+
+    Returns:
+        Dictionary of metric names to their average values.
+    """
+    if not runs:
+        return {}
+
+    # Take up to `window` runs
+    recent_runs = runs[:window]
+
+    # Collect all metric values
+    metric_values: dict[str, list[float]] = {}
+    for run in recent_runs:
+        for metric_name, value in run.metrics.items():
+            if metric_name not in metric_values:
+                metric_values[metric_name] = []
+            metric_values[metric_name].append(value)
+
+    # Compute averages
+    return {
+        metric: sum(values) / len(values) for metric, values in metric_values.items()
+    }
+
+
+def detect_regression(
+    current: dict[str, float],
+    baseline: dict[str, float],
+    tolerance: float = 0.05,
+) -> RegressionReport:
+    """
+    Compare current metrics against baseline.
+
+    A regression is detected if any metric falls below its baseline value by
+    more than `tolerance`, measured as an absolute difference in score.
+
+    Args:
+        current: Current metric values. A metric missing here is treated as 0.0.
+        baseline: Baseline metric values.
+        tolerance: Maximum allowed absolute drop before a regression is flagged (e.g., 0.05).
+
+    Returns:
+        RegressionReport with comparison results.
+    """
+    if not baseline:
+        # No baseline means no regression possible
+        return RegressionReport(
+            detected=False,
+            baseline=baseline,
+            current=current,
+            deltas={},
+            tolerance=tolerance,
+        )
+
+    deltas: dict[str, float] = {}
+    detected = False
+
+    for metric, baseline_value in baseline.items():
+        current_value = current.get(metric, 0.0)
+        delta = current_value - baseline_value
+        deltas[metric] = delta
+
+        # Check if this metric regressed beyond tolerance
+        if delta < -tolerance:
+            detected = True
+
+    return RegressionReport(
+        detected=detected,
+        baseline=baseline,
+        current=current,
+        deltas=deltas,
+        tolerance=tolerance,
+    )
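The rolling-window comparison in this file can be tried without the package. The following is a minimal sketch, not the module's code: it works on plain dicts, and `rolling_baseline`, `regressed_metrics`, and the `relative` flag are hypothetical names introduced here. It contrasts the absolute drop that `detect_regression` checks (`delta < -tolerance`) with a relative drop scaled by the baseline value. Unlike the module, the sketch skips metrics that the current run did not compute; that is an assumption made for the illustration.

```python
from collections import defaultdict


def rolling_baseline(runs: list[dict[str, float]], window: int = 10) -> dict[str, float]:
    """Average each metric over the `window` most recent runs (newest first)."""
    values: dict[str, list[float]] = defaultdict(list)
    for metrics in runs[:window]:
        for name, value in metrics.items():
            values[name].append(value)
    return {name: sum(vals) / len(vals) for name, vals in values.items()}


def regressed_metrics(
    current: dict[str, float],
    baseline: dict[str, float],
    tolerance: float = 0.05,
    relative: bool = False,
) -> list[str]:
    """Return the metrics whose drop from baseline exceeds `tolerance`.

    relative=False compares the absolute score difference, like the check
    in detect_regression. relative=True is a hypothetical variant that
    divides the drop by the baseline value first.
    """
    flagged = []
    for name, base in baseline.items():
        if name not in current:
            continue  # assumption: ignore metrics missing from the current run
        drop = base - current[name]
        if relative:
            if base <= 0:
                continue  # a relative drop is undefined for a non-positive baseline
            drop /= base
        if drop > tolerance:
            flagged.append(name)
    return flagged


# Two historical runs, newest first, followed by the run under test.
history = [{"rouge_l": 0.80, "bleu4": 0.30}, {"rouge_l": 0.84, "bleu4": 0.34}]
baseline = rolling_baseline(history)
current = {"rouge_l": 0.79, "bleu4": 0.28}

absolute_flags = regressed_metrics(current, baseline)
relative_flags = regressed_metrics(current, baseline, relative=True)
print(absolute_flags, relative_flags)
```

With these numbers the BLEU-4 drop of about 0.04 stays inside an absolute tolerance of 0.05, but it is roughly 12.5% of the 0.32 baseline, so only the relative variant flags it. Low-valued metrics are where the two interpretations diverge most.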
@@ -0,0 +1,186 @@
+"""Benchmark execution and tracking."""
+
+import uuid
+from datetime import UTC, datetime
+from pathlib import Path
+from typing import Any
+
+import veritext
+from veritext.benchmark.models import BenchmarkRun, RegressionReport
+from veritext.benchmark.regression import compute_baseline, detect_regression
+from veritext.benchmark.storage import BenchmarkStorage
+from veritext.core.exceptions import RegressionDetectedError
+from veritext.metrics.bleu import Bleu
+from veritext.metrics.rouge import Rouge
+
+# Default metrics to use for evaluation
+DEFAULT_METRICS = ["rouge_l", "bleu4"]
+
+
+class Benchmark:
+    """Track text quality over time."""
+
+    def __init__(
+        self,
+        name: str,
+        storage_path: str | Path = "benchmarks/",
+    ) -> None:
+        """
+        Initialise a benchmark tracker.
+
+        Args:
+            name: Name identifying this benchmark suite.
+            storage_path: Directory for storing benchmark data.
+        """
+        self._name = name
+        self._storage_path = Path(storage_path)
+        self._storage = BenchmarkStorage(self._storage_path / f"{name}.db")
+
+        # Initialise metrics
+        self._bleu = Bleu()
+        self._rouge = Rouge()
+
+    @property
+    def name(self) -> str:
+        """Return the benchmark name."""
+        return self._name
+
+    def _compute_metrics(
+        self,
+        candidates: list[str],
+        references: list[str] | list[list[str]],
+        metric_names: list[str],
+    ) -> dict[str, float]:
+        """Compute requested metrics for the given samples."""
+        results: dict[str, float] = {}
+
+        for metric_name in metric_names:
+            if metric_name in ("bleu1", "bleu2", "bleu3", "bleu4"):
+                batch_result = self._bleu.batch_score(candidates, references)
+                stats = batch_result.stats.get(metric_name)
+                if stats is not None:
+                    results[metric_name] = stats.mean
+
+            elif metric_name in (
+                "rouge1",
+                "rouge2",
+                "rouge_l",
+                "rouge1_fmeasure",
+                "rouge2_fmeasure",
+                "rouge_l_fmeasure",
+            ):
+                rouge_result = self._rouge.batch_score(candidates, references)
+                # Map short names to stat names
+                stat_name = metric_name
+                if metric_name == "rouge1":
+                    stat_name = "rouge1_fmeasure"
+                elif metric_name == "rouge2":
+                    stat_name = "rouge2_fmeasure"
+                elif metric_name == "rouge_l":
+                    stat_name = "rouge_l_fmeasure"
+
+                stats = rouge_result.stats.get(stat_name)
+                if stats is not None:
+                    results[metric_name] = stats.mean
+
+        return results
+
+    def evaluate(
+        self,
+        candidates: list[str],
+        references: list[str] | list[list[str]],
+        metrics: list[str] | None = None,
+        metadata: dict[str, Any] | None = None,
+    ) -> BenchmarkRun:
+        """
+        Evaluate candidates against references, store results, and return the run.
+
+        Args:
+            candidates: List of candidate texts to evaluate.
+            references: Reference text(s) for each candidate.
+            metrics: List of metrics to compute. Defaults to ["rouge_l", "bleu4"].
+            metadata: Optional metadata (git_sha, model version, etc.).
+
+        Returns:
+            The BenchmarkRun record that was created and stored.
+        """
+        metric_names = metrics or DEFAULT_METRICS
+        metric_results = self._compute_metrics(candidates, references, metric_names)
+
+        run = BenchmarkRun(
+            id=str(uuid.uuid4()),
+            benchmark_name=self._name,
+            timestamp=datetime.now(UTC),
+            veritext_version=veritext.__version__,
+            metrics=metric_results,
+            sample_count=len(candidates),
+            metadata=metadata or {},
+        )
+
+        self._storage.save_run(run)
+        return run
+
+    def check_regression(
+        self,
+        tolerance: float = 0.05,
+        window: int = 10,
+    ) -> RegressionReport:
+        """
+        Compare latest run against historical baseline.
+
+        Args:
+            tolerance: Maximum allowed metric drop before regression is flagged.
+            window: Number of historical runs to include in baseline.
+
+        Returns:
+            RegressionReport with comparison results.
+        """
+        runs = self._storage.get_runs(self._name, limit=window + 1)
+
+        if not runs:
+            # No runs at all
+            return RegressionReport(
+                detected=False,
+                baseline={},
+                current={},
+                deltas={},
+                tolerance=tolerance,
+            )
+
+        current_run = runs[0]
+        # Baseline excludes the current run
+        historical_runs = runs[1:]
+        baseline = compute_baseline(historical_runs, window=window)
+
+        return detect_regression(current_run.metrics, baseline, tolerance)
+
+    def assert_no_regression(
+        self,
+        tolerance: float = 0.05,
+        window: int = 10,
+    ) -> None:
+        """
+        Raise RegressionDetectedError if quality dropped.
+
+        Args:
+            tolerance: Maximum allowed metric drop before regression is flagged.
+            window: Number of historical runs to include in baseline.
+
+        Raises:
+            RegressionDetectedError: If a regression is detected.
+        """
+        report = self.check_regression(tolerance=tolerance, window=window)
+        if report.detected:
+            raise RegressionDetectedError(report.summary)
+
+    def get_history(self, limit: int = 20) -> list[BenchmarkRun]:
+        """
+        Get recent benchmark runs.
+
+        Args:
+            limit: Maximum number of runs to return.
+
+        Returns:
+            List of BenchmarkRun objects, most recent first.
+        """
+        return self._storage.get_runs(self._name, limit=limit)
@@ -0,0 +1,179 @@
+"""SQLite storage for benchmark history."""
+
+import json
+import sqlite3
+from datetime import datetime
+from pathlib import Path
+
+from veritext.benchmark.models import BenchmarkRun
+from veritext.core.exceptions import StorageError
+
+
+class BenchmarkStorage:
+    """SQLite-backed storage for benchmark runs."""
+
+    def __init__(self, db_path: Path) -> None:
+        """
+        Initialise storage, creating tables if needed.
+
+        Args:
+            db_path: Path to the SQLite database file.
+        """
+        self._db_path = db_path
+        self._ensure_parent_exists()
+        self._init_database()
+
+    def _ensure_parent_exists(self) -> None:
+        """Ensure the parent directory exists."""
+        self._db_path.parent.mkdir(parents=True, exist_ok=True)
+
+    def _get_connection(self) -> sqlite3.Connection:
+        """Get a database connection with WAL mode enabled."""
+        conn = sqlite3.connect(str(self._db_path), timeout=30.0)
+        conn.execute("PRAGMA journal_mode=WAL")
+        conn.execute("PRAGMA foreign_keys=ON")
+        conn.row_factory = sqlite3.Row
+        return conn
+
+    def _init_database(self) -> None:
+        """Create tables if they don't exist."""
+        try:
+            with self._get_connection() as conn:
+                conn.executescript("""
+                    CREATE TABLE IF NOT EXISTS benchmark_runs (
+                        id TEXT PRIMARY KEY,
+                        benchmark_name TEXT NOT NULL,
+                        timestamp TEXT NOT NULL,
+                        veritext_version TEXT NOT NULL,
+                        sample_count INTEGER NOT NULL,
+                        metadata TEXT
+                    );
+
+                    CREATE TABLE IF NOT EXISTS benchmark_metrics (
+                        run_id TEXT REFERENCES benchmark_runs(id) ON DELETE CASCADE,
+                        metric_name TEXT NOT NULL,
+                        value REAL NOT NULL,
+                        PRIMARY KEY (run_id, metric_name)
+                    );
+
+                    CREATE INDEX IF NOT EXISTS idx_benchmark_name
+                    ON benchmark_runs(benchmark_name, timestamp DESC);
+                """)
+        except sqlite3.Error as e:
+            raise StorageError(f"Failed to initialise database: {e}") from e
+
+    def save_run(self, run: BenchmarkRun) -> None:
+        """
+        Persist a benchmark run.
+
+        Args:
+            run: The benchmark run to save.
+
+        Raises:
+            StorageError: If the save operation fails.
+        """
+        try:
+            with self._get_connection() as conn:
+                # Insert the run
+                conn.execute(
+                    """
+                    INSERT INTO benchmark_runs
+                    (id, benchmark_name, timestamp, veritext_version, sample_count, metadata)
+                    VALUES (?, ?, ?, ?, ?, ?)
+                    """,
+                    (
+                        run.id,
+                        run.benchmark_name,
+                        run.timestamp.isoformat(),
+                        run.veritext_version,
+                        run.sample_count,
+                        json.dumps(run.metadata) if run.metadata else None,
+                    ),
+                )
+
+                # Insert metrics
+                for metric_name, value in run.metrics.items():
+                    conn.execute(
+                        """
+                        INSERT INTO benchmark_metrics (run_id, metric_name, value)
+                        VALUES (?, ?, ?)
+                        """,
+                        (run.id, metric_name, value),
+                    )
+        except sqlite3.IntegrityError as e:
+            raise StorageError(f"Run with id '{run.id}' already exists") from e
+        except sqlite3.Error as e:
+            raise StorageError(f"Failed to save benchmark run: {e}") from e
+
+    def get_runs(
+        self,
+        benchmark_name: str,
+        limit: int | None = None,
+    ) -> list[BenchmarkRun]:
+        """
+        Retrieve runs for a benchmark, most recent first.
+
+        Args:
+            benchmark_name: Name of the benchmark to retrieve runs for.
+            limit: Maximum number of runs to return.
+
+        Returns:
+            List of BenchmarkRun objects, most recent first.
+
+        Raises:
+            StorageError: If the retrieval fails.
+        """
+        try:
+            with self._get_connection() as conn:
+                query = """
+                    SELECT id, benchmark_name, timestamp, veritext_version,
+                           sample_count, metadata
+                    FROM benchmark_runs
+                    WHERE benchmark_name = ?
+                    ORDER BY timestamp DESC
+                """
+                if limit is not None:
+                    query += " LIMIT ?"
+                    rows = conn.execute(query, (benchmark_name, limit)).fetchall()
+                else:
+                    rows = conn.execute(query, (benchmark_name,)).fetchall()
+
+                runs = []
+                for row in rows:
+                    # Get metrics for this run
+                    metrics_rows = conn.execute(
+                        "SELECT metric_name, value FROM benchmark_metrics WHERE run_id = ?",
+                        (row["id"],),
+                    ).fetchall()
+                    metrics = {m["metric_name"]: m["value"] for m in metrics_rows}
+
+                    metadata = json.loads(row["metadata"]) if row["metadata"] else {}
+
+                    runs.append(
+                        BenchmarkRun(
+                            id=row["id"],
+                            benchmark_name=row["benchmark_name"],
+                            timestamp=datetime.fromisoformat(row["timestamp"]),
+                            veritext_version=row["veritext_version"],
+                            sample_count=row["sample_count"],
+                            metrics=metrics,
+                            metadata=metadata,
+                        )
+                    )
+
+                return runs
+        except sqlite3.Error as e:
+            raise StorageError(f"Failed to retrieve benchmark runs: {e}") from e
+
+    def get_latest_run(self, benchmark_name: str) -> BenchmarkRun | None:
+        """
+        Get the most recent run for a benchmark.
+
+        Args:
+            benchmark_name: Name of the benchmark.
+
+        Returns:
+            The most recent BenchmarkRun, or None if no runs exist.
+        """
+        runs = self.get_runs(benchmark_name, limit=1)
+        return runs[0] if runs else None
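The connection settings used in this file can be exercised with the standard library alone. This is a minimal sketch under stated assumptions: the `runs` and `metrics` tables and the temporary path are placeholders, not the module's schema. It also shows a caveat of the `sqlite3` module: using a `Connection` as a context manager commits or rolls back the transaction but does not close the connection, so the sketch wraps the connection in `contextlib.closing`.

```python
import sqlite3
import tempfile
from contextlib import closing
from pathlib import Path

# Placeholder database location for the demonstration.
db_path = Path(tempfile.mkdtemp()) / "example.db"

# closing(...) closes the connection on exit. The inner `with conn:` blocks
# only delimit transactions (commit on success, rollback on error).
with closing(sqlite3.connect(str(db_path), timeout=30.0)) as conn:
    # The PRAGMA reports the journal mode that is now in effect.
    journal_mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    # Foreign keys are off by default and must be enabled per connection.
    conn.execute("PRAGMA foreign_keys=ON")

    with conn:
        conn.execute("CREATE TABLE runs (id TEXT PRIMARY KEY)")
        conn.execute(
            "CREATE TABLE metrics ("
            " run_id TEXT REFERENCES runs(id) ON DELETE CASCADE,"
            " name TEXT NOT NULL,"
            " value REAL NOT NULL,"
            " PRIMARY KEY (run_id, name))"
        )
        conn.execute("INSERT INTO runs VALUES ('r1')")
        conn.execute("INSERT INTO metrics VALUES ('r1', 'bleu4', 0.71)")

    with conn:
        # Deleting the parent row cascades to its metric rows.
        conn.execute("DELETE FROM runs WHERE id = 'r1'")

    remaining = conn.execute("SELECT COUNT(*) FROM metrics").fetchone()[0]

print(journal_mode, remaining)
```

The cascade only fires because `foreign_keys` is enabled on the same connection that performs the delete, which is why the pragma is set each time a connection is opened.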
@@ -1,9 +1,18 @@
-"""Metrics module: BLEU, lexical similarity, and batch processing."""
+"""Metrics module: BLEU, ROUGE, lexical similarity, readability, and batch processing."""

 from veritext.metrics.base import AggregateStats, BatchResult, Metric
 from veritext.metrics.bleu import Bleu
 from veritext.metrics.lexical import Lexical
-from veritext.metrics.results import BleuResult, LexicalResult
+from veritext.metrics.readability import Readability
+from veritext.metrics.results import (
+    BleuResult,
+    LexicalResult,
+    ReadabilityResult,
+    RougeResult,
+    RougeScore,
+    SemanticResult,
+)
+from veritext.metrics.rouge import Rouge

 __all__ = [
    "AggregateStats",
@@ -13,4 +22,10 @@ __all__ = [
    "Lexical",
    "LexicalResult",
    "Metric",
+    "Readability",
+    "ReadabilityResult",
+    "Rouge",
+    "RougeResult",
+    "RougeScore",
+    "SemanticResult",
 ]
@@ -0,0 +1,195 @@
+"""Readability metrics implementation (Flesch-Kincaid)."""
+
+import re
+
+from veritext.metrics.base import AggregateStats, BatchResult
+from veritext.metrics.results import ReadabilityResult
+
+# Sentence-ending punctuation pattern
+_SENTENCE_ENDINGS = re.compile(r"[.!?]+")
+
+# Vowel pattern for syllable counting
+_VOWELS = re.compile(r"[aeiouy]+", re.IGNORECASE)
+
+
+def _count_syllables(word: str) -> int:
+    """
+    Count syllables in a word using a heuristic approach.
+
+    Uses vowel group counting with adjustments for common patterns.
+
+    Args:
+        word: The word to count syllables for.
+
+    Returns:
+        Estimated syllable count (minimum 1 for non-empty words).
+    """
+    if not word:
+        return 0
+
+    word = word.lower().strip()
+    if not word:
+        return 0
+
+    # Count vowel groups
+    vowel_groups = _VOWELS.findall(word)
+    count = len(vowel_groups)
+
+    # Adjust for silent 'e' at end
+    if word.endswith("e") and count > 1:
+        count -= 1
+
+    # Adjust for 'le' ending (e.g., "table", "able")
+    if word.endswith("le") and len(word) > 2 and word[-3] not in "aeiouy":
+        count += 1
+
+    # Drop the syllable for a silent 'ed' ending ("jumped"), but not after d/t ("wanted")
+    if word.endswith("ed") and len(word) > 2 and word[-3] not in "dt":
+        count = max(count - 1, 1)
+
+    # Ensure at least 1 syllable for any word
+    return max(count, 1)
+
+
+def _count_sentences(text: str) -> int:
+    """
+    Count sentences in text.
+
+    Splits on sentence-ending punctuation (.!?).
+
+    Args:
+        text: The text to count sentences in.
+
+    Returns:
+        Number of sentences (minimum 1 for non-empty text).
+    """
+    if not text or not text.strip():
+        return 0
+
+    # Split on runs of sentence-ending punctuation
+    sentences = _SENTENCE_ENDINGS.split(text)
+    # Drop empty or whitespace-only segments (e.g. after the final full stop)
+    sentences = [s for s in sentences if s.strip()]
+
+    return max(len(sentences), 1)
+
+
+def _count_words(text: str) -> tuple[list[str], int]:
+    """
+    Extract words from text and count them.
+
+    Args:
+        text: The text to process.
+
+    Returns:
+        Tuple of (word list, word count).
+    """
+    # Extract words (sequences of letters and apostrophes)
+    words = re.findall(r"[a-zA-Z']+", text)
+    # Filter out standalone apostrophes
+    words = [w for w in words if w.replace("'", "")]
+    return words, len(words)
+
+
+class Readability:
+    """
+    Readability metric using Flesch-Kincaid formulas.
+
+    Computes:
+    - Flesch-Kincaid Grade Level: US grade level required to understand text
+    - Flesch Reading Ease: Typically 0-100, not clamped (higher = easier to read)
+
+    This metric does NOT require reference text.
+    """
+
+    @property
+    def name(self) -> str:
+        """Return the name of this metric."""
+        return "readability"
+
+    @property
+    def requires_reference(self) -> bool:
+        """Return whether this metric requires reference text."""
+        return False
+
+    def score(
+        self,
+        candidate: str,
+        reference: str | list[str] | None = None,  # noqa: ARG002
+    ) -> ReadabilityResult:
+        """
+        Compute readability scores for a text.
+
+        Args:
+            candidate: The text to score.
+            reference: Ignored (readability doesn't use reference text).
+
+        Returns:
+            ReadabilityResult with Flesch-Kincaid scores.
+        """
+        # Extract words and count
+        words, word_count = _count_words(candidate)
+
+        # Handle empty or trivial text
+        if word_count == 0:
+            return ReadabilityResult(
+                flesch_kincaid_grade=0.0,
+                flesch_reading_ease=0.0,
+            )
+
+        # Count sentences
+        sentence_count = _count_sentences(candidate)
+
+        # Count syllables
+        syllable_count = sum(_count_syllables(word) for word in words)
+
+        # Compute ratios
+        words_per_sentence = word_count / sentence_count
+        syllables_per_word = syllable_count / word_count
+
+        # Flesch-Kincaid Grade Level
+        # Formula: 0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59
+        grade_level = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
+
+        # Flesch Reading Ease
+        # Formula: 206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words)
+        reading_ease = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
+
+        return ReadabilityResult(
+            flesch_kincaid_grade=grade_level,
+            flesch_reading_ease=reading_ease,
+        )
+
+    def batch_score(
+        self,
+        candidates: list[str],
+        references: list[str] | list[list[str]] | None = None,  # noqa: ARG002
+    ) -> BatchResult[ReadabilityResult]:
+        """
+        Compute readability scores for a batch of texts.
+
+        Args:
+            candidates: List of texts to score.
+            references: Ignored (readability doesn't use reference text).
+
+        Returns:
+            BatchResult containing individual results and aggregate statistics.
+        """
+        if not candidates:
+            raise ValueError("Cannot compute batch statistics from empty list")
+
+        results: list[ReadabilityResult] = []
+        for cand in candidates:
+            results.append(self.score(cand))
+
+        # Compute aggregate statistics
+        stats = {
+            "flesch_kincaid_grade": AggregateStats.from_values(
+                [r.flesch_kincaid_grade for r in results]
+            ),
+            "flesch_reading_ease": AggregateStats.from_values(
+                [r.flesch_reading_ease for r in results]
+            ),
+        }
+
+        return BatchResult(results=results, count=len(results), stats=stats)
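A short worked example of the two formulas may help when reading the scores. This is a sketch, not the module's code: `flesch_scores` is a hypothetical helper, and the word, sentence, and syllable counts are entered by hand rather than produced by the heuristics in this file.

```python
def flesch_scores(words: int, sentences: int, syllables: int) -> tuple[float, float]:
    """Return (Flesch-Kincaid grade level, Flesch reading ease) from raw counts."""
    words_per_sentence = words / sentences
    syllables_per_word = syllables / words
    grade = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    ease = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    return grade, ease


# "The cat sat on the mat." counted by hand: 6 words, 1 sentence, 6 syllables.
grade, ease = flesch_scores(words=6, sentences=1, syllables=6)
print(f"grade={grade:.2f} ease={ease:.3f}")  # grade=-1.45 ease=116.145
```

Very simple text gives a negative grade level and a reading ease above 100, so callers that expect a 0-100 scale would need to clamp the value themselves.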
@@ -39,3 +39,72 @@ class LexicalResult(BaseModel):

    token_overlap: float
    """Proportion of candidate tokens found in reference."""
+
+
+class RougeScore(BaseModel):
+    """Individual ROUGE variant score with precision, recall, F-measure."""
+
+    model_config = ConfigDict(frozen=True)
+
+    precision: float
+    """Precision: overlap / candidate length."""
+
+    recall: float
+    """Recall: overlap / reference length."""
+
+    fmeasure: float
+    """F1-measure: harmonic mean of precision and recall."""
+
+
+class RougeResult(BaseModel):
+    """Result of ROUGE score computation."""
+
+    model_config = ConfigDict(frozen=True)
+
+    rouge1: RougeScore
+    """ROUGE-1 (unigram) score."""
+
+    rouge2: RougeScore
+    """ROUGE-2 (bigram) score."""
+
+    rouge_l: RougeScore
+    """ROUGE-L (longest common subsequence) score."""
+
+    @property
+    def score(self) -> float:
+        """Return ROUGE-L F-measure as the primary score."""
+        return self.rouge_l.fmeasure
+
+
+class ReadabilityResult(BaseModel):
+    """Result of readability computation."""
+
+    model_config = ConfigDict(frozen=True)
+
+    flesch_kincaid_grade: float
+    """US grade level (e.g., 8.0 = 8th grade reading level)."""
+
+    flesch_reading_ease: float
+    """Typically 0-100 but not clamped; higher = easier to read."""
+
+    @property
+    def score(self) -> float:
+        """Return Flesch reading ease as the primary score."""
+        return self.flesch_reading_ease
+
+
+class SemanticResult(BaseModel):
+    """Result of semantic similarity computation."""
+
+    model_config = ConfigDict(frozen=True)
+
+    similarity: float
+    """Cosine similarity score (0.0 to 1.0)."""
+
+    model: str
+    """Name of the embedding model used."""
+
+    @property
+    def score(self) -> float:
+        """Return the primary score for this result."""
+        return self.similarity
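The three `RougeScore` fields can be illustrated with a small ROUGE-N computation. This is a minimal sketch assuming whitespace tokenisation; the package's `WordTokeniser` may split text differently, and `ngram_counts` and `rouge_n` are hypothetical names introduced here.

```python
from collections import Counter


def ngram_counts(tokens: list[str], n: int) -> Counter:
    """Count the n-grams in a token list (empty if the list is shorter than n)."""
    return Counter(tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate: list[str], reference: list[str], n: int) -> tuple[float, float, float]:
    """Return (precision, recall, F-measure) for n-gram overlap."""
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    # Counter intersection keeps the minimum count per n-gram (clipped overlap).
    overlap = sum((cand & ref).values())
    cand_total = sum(cand.values())
    ref_total = sum(ref.values())
    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    denom = precision + recall
    fmeasure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, fmeasure


candidate = "the cat sat".split()
reference = "the cat sat on the mat".split()

p1, r1, f1 = rouge_n(candidate, reference, 1)
p2, r2, f2 = rouge_n(candidate, reference, 2)
print(f"ROUGE-1: p={p1:.2f} r={r1:.2f} f={f1:.2f}")  # p=1.00 r=0.50 f=0.67
print(f"ROUGE-2: p={p2:.2f} r={r2:.2f} f={f2:.2f}")  # p=1.00 r=0.40 f=0.57
```

A short candidate that copies part of the reference scores perfect precision but low recall, which is why the F-measure is the more useful single number to track.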
@@ -0,0 +1,281 @@
+"""ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metric implementation."""
+
+from collections import Counter
+
+from veritext.core.tokenisation import WordTokeniser
+from veritext.metrics.base import AggregateStats, BatchResult
+from veritext.metrics.results import RougeResult, RougeScore
+
+
+def _get_ngrams(tokens: list[str], n: int) -> Counter[tuple[str, ...]]:
+    """Extract n-grams from a list of tokens."""
+    if n > len(tokens):
+        return Counter()
+    return Counter(tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1))
+
+
+def _ngram_overlap(
+    candidate_ngrams: Counter[tuple[str, ...]],
+    reference_ngrams: Counter[tuple[str, ...]],
+) -> int:
+    """Compute the overlap count between candidate and reference n-grams."""
+    overlap = 0
+    for ngram, count in candidate_ngrams.items():
+        overlap += min(count, reference_ngrams.get(ngram, 0))
+    return overlap
+
+
+def _compute_rouge_score(
+    candidate_tokens: list[str],
+    reference_tokens: list[str],
+    n: int,
+) -> RougeScore:
+    """
+    Compute ROUGE-n score for given n-gram size.
+
+    Args:
+        candidate_tokens: Tokenised candidate text.
+        reference_tokens: Tokenised reference text.
+        n: N-gram size.
+
+    Returns:
+        RougeScore with precision, recall, and F-measure.
+    """
+    candidate_ngrams = _get_ngrams(candidate_tokens, n)
+    reference_ngrams = _get_ngrams(reference_tokens, n)
+
+    candidate_count = sum(candidate_ngrams.values())
+    reference_count = sum(reference_ngrams.values())
+
+    if candidate_count == 0 and reference_count == 0:
+        return RougeScore(precision=0.0, recall=0.0, fmeasure=0.0)
+
+    overlap = _ngram_overlap(candidate_ngrams, reference_ngrams)
+
+    precision = overlap / candidate_count if candidate_count > 0 else 0.0
+    recall = overlap / reference_count if reference_count > 0 else 0.0
+
+    if precision + recall > 0:
+        fmeasure = 2 * precision * recall / (precision + recall)
+    else:
+        fmeasure = 0.0
+
+    return RougeScore(precision=precision, recall=recall, fmeasure=fmeasure)
+
+
+def _lcs_length(seq1: list[str], seq2: list[str]) -> int:
+    """
+    Compute the length of the longest common subsequence.
+
+    Uses dynamic programming with O(m*n) time and O(min(m,n)) space.
+    """
+    if not seq1 or not seq2:
+        return 0
+
+    # Keep the shorter sequence as seq2 so each DP row has min(m, n) + 1 entries
+    if len(seq1) < len(seq2):
+        seq1, seq2 = seq2, seq1
+
+    m, n = len(seq1), len(seq2)
+
+    # Only need two rows at a time
+    prev = [0] * (n + 1)
+    curr = [0] * (n + 1)
+
+    for i in range(1, m + 1):
+        for j in range(1, n + 1):
+            if seq1[i - 1] == seq2[j - 1]:
+                curr[j] = prev[j - 1] + 1
+            else:
+                curr[j] = max(prev[j], curr[j - 1])
+        prev, curr = curr, prev
+
+    return prev[n]
+
+
+def _compute_rouge_l(
+    candidate_tokens: list[str],
+    reference_tokens: list[str],
+) -> RougeScore:
+    """
+    Compute ROUGE-L score using longest common subsequence.
+
+    Args:
+        candidate_tokens: Tokenised candidate text.
+        reference_tokens: Tokenised reference text.
+
+    Returns:
+        RougeScore with precision, recall, and F-measure.
+    """
+    # Either side being empty (including both) leaves nothing to match.
+    if not candidate_tokens or not reference_tokens:
+        return RougeScore(precision=0.0, recall=0.0, fmeasure=0.0)
+
+    lcs = _lcs_length(candidate_tokens, reference_tokens)
+
+    precision = lcs / len(candidate_tokens)
+    recall = lcs / len(reference_tokens)
+
+    if precision + recall > 0:
+        fmeasure = 2 * precision * recall / (precision + recall)
+    else:
+        fmeasure = 0.0
+
+    return RougeScore(precision=precision, recall=recall, fmeasure=fmeasure)
+
+
+def _max_rouge_scores(scores: list[RougeScore]) -> RougeScore:
+    """Select the RougeScore with the highest F-measure from a list."""
+    return max(scores, key=lambda s: s.fmeasure)
+
+
+class Rouge:
+    """
+    ROUGE metric for measuring summary/generation quality.
+
+    Computes ROUGE-1 (unigram), ROUGE-2 (bigram), and ROUGE-L (LCS) scores.
+    ROUGE is recall-oriented, measuring how much of the reference is captured.
+    """
+
+    def __init__(self, tokeniser: WordTokeniser | None = None) -> None:
+        """
+        Initialise the ROUGE metric.
+
+        Args:
+            tokeniser: Tokeniser to use. Defaults to WordTokeniser().
+        """
+        self._tokeniser = tokeniser or WordTokeniser()
+
+    @property
+    def name(self) -> str:
+        """Return the name of this metric."""
+        return "rouge"
+
+    @property
+    def requires_reference(self) -> bool:
+        """Return whether this metric requires reference text."""
+        return True
+
+    def score(
+        self, candidate: str, reference: str | list[str] | None = None
+    ) -> RougeResult:
+        """
+        Compute ROUGE scores for a candidate text.
+
+        Args:
+            candidate: The text to score.
+            reference: Reference text(s) for comparison. If multiple references
+                are provided, each variant reports the score from the reference
+                that yields the highest F-measure.
+
+        Returns:
+            RougeResult with ROUGE-1, ROUGE-2, and ROUGE-L scores.
+
+        Raises:
+            ValueError: If reference is None or empty.
+        """
+        if reference is None:
+            raise ValueError("ROUGE requires reference text")
+
+        # Normalise reference to list
+        references = [reference] if isinstance(reference, str) else reference
+
+        # Tokenise
+        candidate_tokens = self._tokeniser.tokenise(candidate)
+        reference_token_lists = [self._tokeniser.tokenise(r) for r in references]
+
+        # Handle empty references
+        if all(not ref for ref in reference_token_lists):
+            raise ValueError("Reference text cannot be empty")
+
+        # Handle empty candidate
+        if not candidate_tokens:
+            return RougeResult(
+                rouge1=RougeScore(precision=0.0, recall=0.0, fmeasure=0.0),
+                rouge2=RougeScore(precision=0.0, recall=0.0, fmeasure=0.0),
+                rouge_l=RougeScore(precision=0.0, recall=0.0, fmeasure=0.0),
+            )
+
+        # Compute scores for each reference and take max
+        rouge1_scores = []
+        rouge2_scores = []
+        rouge_l_scores = []
+
+        for ref_tokens in reference_token_lists:
+            if not ref_tokens:
+                continue
+            rouge1_scores.append(_compute_rouge_score(candidate_tokens, ref_tokens, 1))
+            rouge2_scores.append(_compute_rouge_score(candidate_tokens, ref_tokens, 2))
+            rouge_l_scores.append(_compute_rouge_l(candidate_tokens, ref_tokens))
+
+        return RougeResult(
+            rouge1=_max_rouge_scores(rouge1_scores),
+            rouge2=_max_rouge_scores(rouge2_scores),
+            rouge_l=_max_rouge_scores(rouge_l_scores),
+        )
+
+    def batch_score(
+        self,
+        candidates: list[str],
+        references: list[str] | list[list[str]] | None = None,
+    ) -> BatchResult[RougeResult]:
+        """
+        Compute ROUGE scores for a batch of candidates.
+
+        Args:
+            candidates: List of texts to score.
+            references: Reference text(s) for each candidate.
+
+        Returns:
+            BatchResult containing individual results and aggregate statistics.
+
+        Raises:
+            ValueError: If references is None, or if the number of references
+                differs from the number of candidates.
+        """
+        if references is None:
+            raise ValueError("ROUGE requires reference texts")
+
+        if len(candidates) != len(references):
+            raise ValueError(
+                f"Number of candidates ({len(candidates)}) must match "
+                f"number of references ({len(references)})"
+            )
+
+        results: list[RougeResult] = []
+        for i, cand in enumerate(candidates):
+            ref: str | list[str] = references[i]
+            results.append(self.score(cand, ref))
+
+        # Compute aggregate statistics for each score type
+        stats = {
+            "rouge1_precision": AggregateStats.from_values(
+                [r.rouge1.precision for r in results]
+            ),
+            "rouge1_recall": AggregateStats.from_values(
+                [r.rouge1.recall for r in results]
+            ),
+            "rouge1_fmeasure": AggregateStats.from_values(
+                [r.rouge1.fmeasure for r in results]
+            ),
+            "rouge2_precision": AggregateStats.from_values(
+                [r.rouge2.precision for r in results]
+            ),
+            "rouge2_recall": AggregateStats.from_values(
+                [r.rouge2.recall for r in results]
+            ),
+            "rouge2_fmeasure": AggregateStats.from_values(
+                [r.rouge2.fmeasure for r in results]
+            ),
+            "rouge_l_precision": AggregateStats.from_values(
+                [r.rouge_l.precision for r in results]
+            ),
+            "rouge_l_recall": AggregateStats.from_values(
+                [r.rouge_l.recall for r in results]
+            ),
+            "rouge_l_fmeasure": AggregateStats.from_values(
+                [r.rouge_l.fmeasure for r in results]
+            ),
+        }
+
+        return BatchResult(results=results, count=len(results), stats=stats)
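The ROUGE-N and ROUGE-L arithmetic in this file can be checked by hand on a small pair of sentences. The following is a minimal, standalone sketch rather than the module's own code: the helper names (`lcs_length`, `rouge_n`, `rouge_l`, `f_measure`) are hypothetical, and tokenisation is assumed to be a plain whitespace split instead of `WordTokeniser`.

```python
from collections import Counter


def lcs_length(a: list[str], b: list[str]) -> int:
    """Longest common subsequence length, keeping only two DP rows."""
    if not a or not b:
        return 0
    if len(a) < len(b):
        a, b = b, a  # rows iterate over the longer sequence, columns over the shorter
    prev = [0] * (len(b) + 1)
    for token in a:
        curr = [0]
        for j, other in enumerate(b, start=1):
            if token == other:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, 0.0 when both are zero."""
    total = precision + recall
    return 2 * precision * recall / total if total > 0 else 0.0


def rouge_n(candidate: list[str], reference: list[str], n: int) -> tuple[float, float, float]:
    """Return (precision, recall, F-measure) from clipped n-gram overlap."""

    def ngrams(tokens: list[str]) -> Counter:
        return Counter(tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    # Clip each candidate n-gram count by its count in the reference.
    overlap = sum(min(count, ref.get(gram, 0)) for gram, count in cand.items())
    cand_total, ref_total = sum(cand.values()), sum(ref.values())
    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    return precision, recall, f_measure(precision, recall)


def rouge_l(candidate: list[str], reference: list[str]) -> tuple[float, float, float]:
    """Return (precision, recall, F-measure) based on LCS length."""
    if not candidate or not reference:
        return 0.0, 0.0, 0.0
    lcs = lcs_length(candidate, reference)
    precision, recall = lcs / len(candidate), lcs / len(reference)
    return precision, recall, f_measure(precision, recall)


candidate = "the cat sat on the mat".split()
reference = "the cat lay on the mat".split()

print(rouge_n(candidate, reference, 1))  # 5 of 6 unigrams overlap
print(rouge_n(candidate, reference, 2))  # 3 of 5 bigrams overlap
print(rouge_l(candidate, reference))     # LCS is "the cat on the mat"
```

With a single substituted word, the unigram and LCS variants agree (five of six tokens), while the bigram variant drops further because one substitution breaks two bigrams.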
@@ -0,0 +1,22 @@
+"""Pytest plugin for text validation.
+
+This plugin provides native pytest integration for Veritext, enabling
+text validation assertions in test suites.
+
+Example:
+    >>> from veritext.pytest_plugin import validate_text
+    >>>
+    >>> def test_summary_quality():
+    ...     text = "The quick brown fox jumps over the lazy dog."
+    ...     validate_text(
+    ...         text,
+    ...         min_length=10,
+    ...         max_length=100,
+    ...         max_reading_grade=8.0,
+    ...     )
+"""
+
+from veritext.pytest_plugin.assertions import validate_text
+from veritext.pytest_plugin.plugin import pytest_configure
+
+__all__ = ["pytest_configure", "validate_text"]
@@ -0,0 +1,141 @@
+"""Assertion functions for text validation in pytest."""
+
+from typing import TYPE_CHECKING
+
+from veritext.core.types import ValidationContext, ValidationResult
+from veritext.validators import all_of
+
+if TYPE_CHECKING:
+    from veritext.validators.base import Check
+
+
+def validate_text(
+    text: str,
+    *,
+    reference: str | list[str] | None = None,
+    min_bleu: float | None = None,
+    min_rouge: float | None = None,
+    min_semantic: float | None = None,
+    max_length: int | None = None,
+    min_length: int | None = None,
+    max_reading_grade: float | None = None,
+    must_contain: list[str] | None = None,
+    must_exclude: list[str] | None = None,
+) -> None:
+    """Assert text passes all specified validation criteria.
+
+    This is the primary assertion function for text validation in pytest.
+    It builds validators from keyword arguments and raises AssertionError
+    with detailed failure information if validation fails.
+
+    Args:
+        text: The text to validate.
+        reference: Reference text for comparison metrics (BLEU, ROUGE, semantic).
+        min_bleu: Minimum BLEU-4 score required (0.0 to 1.0).
+        min_rouge: Minimum ROUGE-L F-measure required (0.0 to 1.0).
+        min_semantic: Minimum semantic similarity required (0.0 to 1.0).
+        max_length: Maximum character count allowed.
+        min_length: Minimum character count required.
+        max_reading_grade: Maximum Flesch-Kincaid grade level.
+        must_contain: Patterns that must be present in the text.
+        must_exclude: Patterns that must not be present in the text.
+
+    Raises:
+        AssertionError: With detailed failure information if validation fails.
+        ValueError: If comparison metrics are requested but no reference is
+            provided, or if no validation criteria are specified.
+
+    Example:
+        >>> validate_text(
+        ...     "The quick brown fox jumps over the lazy dog.",
+        ...     min_length=10,
+        ...     max_length=100,
+        ...     max_reading_grade=8.0,
+        ... )
+    """
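+    # Validate that reference is provided for comparison metrics.
+    # Compare against None explicitly so a threshold of 0.0 still counts as set.
+    comparison_thresholds = (min_bleu, min_rouge, min_semantic)
+    if any(t is not None for t in comparison_thresholds) and reference is None:
+        raise ValueError(
+            "Reference text required for comparison metrics "
+            "(min_bleu, min_rouge, min_semantic)"
+        )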
+
+    # Build list of validators from kwargs
+    checks: list[Check] = []
+
+    if min_bleu is not None:
+        from veritext.validators import bleu
+
+        checks.append(bleu(min_score=min_bleu))
+
+    if min_rouge is not None:
+        from veritext.validators import rouge
+
+        checks.append(rouge(min_score=min_rouge))
+
+    if min_semantic is not None:
+        # Lazy import to avoid loading sentence-transformers unless needed
+        from veritext.validators import semantic
+
+        checks.append(semantic(min_score=min_semantic))
+
+    if max_length is not None or min_length is not None:
+        from veritext.validators import length
+
+        checks.append(length(min_chars=min_length, max_chars=max_length))
+
+    if max_reading_grade is not None:
+        from veritext.validators import readability
+
+        checks.append(readability(max_grade=max_reading_grade))
+
+    if must_contain is not None:
+        from veritext.validators import contains
+
+        checks.append(contains(patterns=must_contain))
+
+    if must_exclude is not None:
+        from veritext.validators import excludes
+
+        checks.append(excludes(patterns=must_exclude))
+
+    if not checks:
+        raise ValueError("At least one validation criterion must be specified")
+
+    # Run validation
+    context = ValidationContext(reference=reference)
+    validator = all_of(checks)
+    result = validator.check(text, context)
+
+    if not result.passed:
+        raise AssertionError(_format_failure(text, result))
+
+
+def _format_failure(text: str, result: ValidationResult) -> str:
+    """Format a detailed failure message for pytest output.
+
+    Args:
+        text: The text that was validated.
+        result: The validation result containing check failures.
+
+    Returns:
+        Formatted failure message with check details.
+    """
+    lines = ["Text validation failed:"]
+    lines.append("")
+
+    # Show a preview of the text (truncated if long)
+    preview = text[:100] + "..." if len(text) > 100 else text
+    lines.append(f"  Text: {preview!r}")
+    lines.append("")
+
+    # List all failed checks with details
+    lines.append("  Failed checks:")
+    for check in result.failed_checks:
+        lines.append(f"    - {check.name}:")
+        lines.append(f"        {check.message}")
+        if check.threshold is not None:
+            # Thresholds may be minimums or maximums, so do not imply a direction
+            lines.append(f"        Threshold: {check.threshold}")
+            lines.append(f"        Actual:    {check.actual}")
+
+    return "\n".join(lines)
@@ -0,0 +1,80 @@
+"""Pytest fixtures for text validation."""
+
+from typing import TYPE_CHECKING, Any
+
+import pytest
+
+from veritext.core.types import ValidationContext, ValidationResult
+from veritext.validators import all_of
+from veritext.validators.base import Check
+
+if TYPE_CHECKING:
+    from collections.abc import Callable
+
+
+class ValidatorFactory:
+    """Factory for building validator callables from a list of checks."""
+
+    def __call__(
+        self,
+        checks: list[Check],
+        reference: str | list[str] | None = None,
+    ) -> "Callable[[str], ValidationResult]":
+        """Create a validator function from a list of checks.
+
+        Args:
+            checks: List of validation checks to apply.
+            reference: Optional reference text for comparison metrics.
+
+        Returns:
+            A callable that takes text and returns a ValidationResult.
+        """
+        validator = all_of(checks)
+        context = ValidationContext(reference=reference)
+
+        def validate(text: str) -> ValidationResult:
+            return validator.check(text, context)
+
+        return validate
+
+
+@pytest.fixture
+def text_validator() -> ValidatorFactory:
+    """Provide a factory for building validators.
+
+    Example:
+        >>> def test_with_factory(text_validator):
+        ...     from veritext.validators import bleu, length
+        ...     reference = "The quick brown fox jumps over the lazy dog."
+        ...     validate = text_validator(
+        ...         checks=[bleu(min_score=0.5), length(min_words=5)],
+        ...         reference=reference,
+        ...     )
+        ...     result = validate(reference)
+        ...     assert result.passed
+
+    Returns:
+        ValidatorFactory instance.
+    """
+    return ValidatorFactory()
+
+
+@pytest.fixture
+def validation_context() -> "Callable[..., ValidationContext]":
+    """Provide a factory for creating ValidationContext objects.
+
+    Example:
+        >>> def test_with_context(validation_context):
+        ...     ctx = validation_context(reference="The reference text.")
+        ...     assert ctx.reference == "The reference text."
+
+    Returns:
+        A callable that creates ValidationContext objects.
+    """
+
+    def _create(
+        reference: str | list[str] | None = None,
+        **metadata: Any,
+    ) -> ValidationContext:
+        return ValidationContext(reference=reference, metadata=metadata)
+
+    return _create
@@ -0,0 +1,18 @@
+"""Pytest hooks for Veritext plugin."""
+
+from typing import TYPE_CHECKING
+
+if TYPE_CHECKING:
+    import pytest
+
+
+def pytest_configure(config: "pytest.Config") -> None:
+    """Register Veritext markers.
+
+    Args:
+        config: Pytest configuration object.
+    """
+    config.addinivalue_line(
+        "markers",
+        "text_validation: mark test as a text validation test",
+    )
@@ -0,0 +1,16 @@
+"""Semantic similarity module: embedding-based text comparison.
+
+This module provides semantic similarity using sentence-transformers.
+It requires the `veritext[semantic]` extra to be installed.
+
+Example:
+    >>> from veritext.semantic import SemanticSimilarity
+    >>>
+    >>> metric = SemanticSimilarity()
+    >>> result = metric.score("The cat sat on the mat", "A feline rested on the rug")
+    >>> print(f"Similarity: {result.similarity:.2f}")
+"""
+
+from veritext.semantic.similarity import SemanticSimilarity
+
+__all__ = ["SemanticSimilarity"]
@@ -0,0 +1,188 @@
+"""Embedding-based semantic similarity using sentence-transformers."""
+
+from typing import Any
+
+from veritext.core.exceptions import DependencyError
+from veritext.metrics.base import AggregateStats, BatchResult
+from veritext.metrics.results import SemanticResult
+
+
+class SemanticSimilarity:
+    """
+    Embedding-based semantic similarity using sentence-transformers.
+
+    Computes cosine similarity between text embeddings to measure semantic
+    relatedness. This metric captures meaning beyond lexical overlap.
+
+    Requires the `veritext[semantic]` extra to be installed.
+    """
+
+    def __init__(
+        self,
+        model: str = "all-MiniLM-L6-v2",
+        cache_embeddings: bool = True,
+    ) -> None:
+        """
+        Initialise the semantic similarity metric.
+
+        Args:
+            model: Name of the sentence-transformers model to use.
+                   Defaults to "all-MiniLM-L6-v2" (about 22M parameters,
+                   good quality/size tradeoff).
+            cache_embeddings: Whether to cache embeddings for repeated texts.
+                              Defaults to True.
+
+        Raises:
+            DependencyError: If sentence-transformers is not installed.
+        """
+        try:
+            from sentence_transformers import SentenceTransformer
+        except ImportError as err:
+            raise DependencyError(
+                "Install veritext[semantic] for semantic similarity: "
+                "pip install veritext[semantic]"
+            ) from err
+
+        self._model_name = model
+        self._model: Any = SentenceTransformer(model)
+        self._cache: dict[str, Any] | None = {} if cache_embeddings else None
+
+    @property
+    def name(self) -> str:
+        """Return the name of this metric."""
+        return "semantic"
+
+    @property
+    def requires_reference(self) -> bool:
+        """Return whether this metric requires reference text."""
+        return True
+
+    def _get_embedding(self, text: str) -> Any:
+        """
+        Get embedding for text, using cache if available.
+
+        Args:
+            text: The text to embed.
+
+        Returns:
+            The embedding tensor.
+        """
+        if self._cache is not None and text in self._cache:
+            return self._cache[text]
+
+        embedding = self._model.encode(text, convert_to_tensor=True)
+
+        if self._cache is not None:
+            self._cache[text] = embedding
+
+        return embedding
+
+    def _cosine_similarity(self, embedding1: Any, embedding2: Any) -> float:
+        """
+        Compute cosine similarity between two embeddings.
+
+        Args:
+            embedding1: First embedding tensor.
+            embedding2: Second embedding tensor.
+
+        Returns:
+            Cosine similarity score (0.0 to 1.0).
+        """
+        from sentence_transformers import util
+
+        similarity: float = util.cos_sim(embedding1, embedding2).item()
+        # Clamp to [0, 1]: cosine similarity can be negative, and rounding can
+        # push it marginally above 1; negative values are treated as no similarity
+        return max(0.0, min(1.0, similarity))
+
+    def score(
+        self, candidate: str, reference: str | list[str] | None = None
+    ) -> SemanticResult:
+        """
+        Compute semantic similarity between candidate and reference.
+
+        When multiple references are provided, returns the maximum similarity
+        across all references.
+
+        Args:
+            candidate: The text to score.
+            reference: Reference text(s) for comparison.
+
+        Returns:
+            SemanticResult with similarity score and model name.
+
+        Raises:
+            ValueError: If reference is None or empty.
+        """
+        if reference is None:
+            raise ValueError("Semantic similarity requires reference text")
+
+        # Normalise reference to list
+        references = [reference] if isinstance(reference, str) else reference
+
+        if not references:
+            raise ValueError("Reference text cannot be empty")
+
+        # Handle empty candidate
+        candidate_stripped = candidate.strip()
+        if not candidate_stripped:
+            return SemanticResult(similarity=0.0, model=self._model_name)
+
+        # Handle empty references
+        valid_references = [r for r in references if r.strip()]
+        if not valid_references:
+            raise ValueError("Reference text cannot be empty")
+
+        # Get candidate embedding
+        candidate_embedding = self._get_embedding(candidate_stripped)
+
+        # Compute similarity against each reference, take maximum
+        max_similarity = 0.0
+        for ref in valid_references:
+            ref_embedding = self._get_embedding(ref.strip())
+            similarity = self._cosine_similarity(candidate_embedding, ref_embedding)
+            max_similarity = max(max_similarity, similarity)
+
+        return SemanticResult(similarity=max_similarity, model=self._model_name)
+
+    def batch_score(
+        self,
+        candidates: list[str],
+        references: list[str] | list[list[str]] | None = None,
+    ) -> BatchResult[SemanticResult]:
+        """
+        Compute semantic similarity for a batch of candidates.
+
+        Args:
+            candidates: List of texts to score.
+            references: Reference text(s) for each candidate.
+
+        Returns:
+            BatchResult containing individual results and aggregate statistics.
+
+        Raises:
+            ValueError: If references is None, or if the number of references
+                differs from the number of candidates.
+        """
+        if references is None:
+            raise ValueError("Semantic similarity requires reference texts")
+
+        if len(candidates) != len(references):
+            raise ValueError(
+                f"Number of candidates ({len(candidates)}) must match "
+                f"number of references ({len(references)})"
+            )
+
+        results: list[SemanticResult] = []
+        for i, cand in enumerate(candidates):
+            ref: str | list[str] = references[i]
+            results.append(self.score(cand, ref))
+
+        # Compute aggregate statistics
+        stats = {
+            "similarity": AggregateStats.from_values([r.similarity for r in results]),
+        }
+
+        return BatchResult(results=results, count=len(results), stats=stats)
+
+    def clear_cache(self) -> None:
+        """Clear the embedding cache."""
+        if self._cache is not None:
+            self._cache.clear()
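The clamping and caching behaviour described in this file can be illustrated without loading a sentence-transformers model. The sketch below is an assumption-laden stand-in: `cosine_similarity`, `clamp_unit`, `CachedEmbedder`, and the toy `letter_counts` embedding are all hypothetical names, and plain Python lists replace the tensors the real module works with.

```python
import math
from collections.abc import Callable


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def clamp_unit(value: float) -> float:
    """Clamp a similarity into [0, 1], mapping negative values to 0."""
    return max(0.0, min(1.0, value))


class CachedEmbedder:
    """Wrap an embedding function with a dict cache keyed by the input text."""

    def __init__(self, embed_fn: Callable[[str], list[float]]) -> None:
        self._embed_fn = embed_fn
        self._cache: dict[str, list[float]] = {}
        self.calls = 0  # number of times the underlying function actually ran

    def __call__(self, text: str) -> list[float]:
        if text not in self._cache:
            self.calls += 1
            self._cache[text] = self._embed_fn(text)
        return self._cache[text]


def letter_counts(text: str) -> list[float]:
    """Toy embedding (illustration only): counts of the letters a, b, c."""
    return [float(text.count(ch)) for ch in "abc"]


embed = CachedEmbedder(letter_counts)
first = embed("abc abc")
second = embed("abc abc")  # served from the cache, no second computation

print(embed.calls)
print(clamp_unit(cosine_similarity([1.0, 0.0], [-1.0, 0.0])))
```

Opposite vectors have a raw cosine of -1, which the clamp reports as 0.0. Because the cache is an unbounded dict, long-running processes that embed many distinct texts would need to clear it periodically.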
@@ -0,0 +1,239 @@
+"""Validators module: composable validation checks for text quality.
+
+This module provides validators that apply thresholds to metrics and return
+pass/fail decisions with diagnostics.
+
+Example:
+    >>> from veritext.validators import bleu, length, all_of
+    >>> from veritext.core.types import ValidationContext
+    >>>
+    >>> validator = all_of([
+    ...     bleu(min_score=0.5),
+    ...     length(min_words=10),
+    ... ])
+    >>> context = ValidationContext(reference="The quick brown fox.")
+    >>> result = validator.check("The quick brown fox jumps.", context)
+    >>> print(result.passed)
+"""
+
+from typing import Literal
+
+from veritext.core.tokenisation import WordTokeniser
+from veritext.validators.base import Check
+from veritext.validators.composite import AllOf, AnyOf
+from veritext.validators.constraint import (
+    ContainsValidator,
+    ExcludesValidator,
+    LengthValidator,
+    ReadabilityValidator,
+)
+from veritext.validators.metric import (
+    BleuValidator,
+    LexicalValidator,
+    RougeValidator,
+    SemanticValidator,
+)
+
+
+# Factory functions for clean API
+def bleu(
+    min_score: float,
+    variant: Literal[1, 2, 3, 4] = 4,
+    tokeniser: WordTokeniser | None = None,
+) -> BleuValidator:
+    """Create a BLEU validator.
+
+    Args:
+        min_score: Minimum BLEU score required (0.0 to 1.0).
+        variant: BLEU variant to use (1, 2, 3, or 4). Defaults to 4.
+        tokeniser: Tokeniser to use. Defaults to WordTokeniser().
+
+    Returns:
+        BleuValidator instance.
+    """
+    return BleuValidator(min_score=min_score, variant=variant, tokeniser=tokeniser)
+
+
+def rouge(
+    min_score: float,
+    variant: Literal["1", "2", "l"] = "l",
+    tokeniser: WordTokeniser | None = None,
+) -> RougeValidator:
+    """Create a ROUGE validator.
+
+    Args:
+        min_score: Minimum ROUGE F-measure required (0.0 to 1.0).
+        variant: ROUGE variant ("1", "2", or "l"). Defaults to "l".
+        tokeniser: Tokeniser to use. Defaults to WordTokeniser().
+
+    Returns:
+        RougeValidator instance.
+    """
+    return RougeValidator(min_score=min_score, variant=variant, tokeniser=tokeniser)
+
+
+def lexical(
+    min_jaccard: float | None = None,
+    min_overlap: float | None = None,
+    tokeniser: WordTokeniser | None = None,
+) -> LexicalValidator:
+    """Create a lexical similarity validator.
+
+    Args:
+        min_jaccard: Minimum Jaccard similarity required (0.0 to 1.0).
+        min_overlap: Minimum token overlap required (0.0 to 1.0).
+        tokeniser: Tokeniser to use. Defaults to WordTokeniser().
+
+    Returns:
+        LexicalValidator instance.
+    """
+    return LexicalValidator(
+        min_jaccard=min_jaccard, min_overlap=min_overlap, tokeniser=tokeniser
+    )
+
+
+def length(
+    min_chars: int | None = None,
+    max_chars: int | None = None,
+    min_words: int | None = None,
+    max_words: int | None = None,
+    tokeniser: WordTokeniser | None = None,
+) -> LengthValidator:
+    """Create a length validator.
+
+    Args:
+        min_chars: Minimum character count (inclusive).
+        max_chars: Maximum character count (inclusive).
+        min_words: Minimum word count (inclusive).
+        max_words: Maximum word count (inclusive).
+        tokeniser: Tokeniser to use for word counting. Defaults to WordTokeniser().
+
+    Returns:
+        LengthValidator instance.
+    """
+    return LengthValidator(
+        min_chars=min_chars,
+        max_chars=max_chars,
+        min_words=min_words,
+        max_words=max_words,
+        tokeniser=tokeniser,
+    )
+
+
+def readability(
+    max_grade: float | None = None,
+    min_ease: float | None = None,
+) -> ReadabilityValidator:
+    """Create a readability validator.
+
+    Args:
+        max_grade: Maximum Flesch-Kincaid grade level allowed.
+        min_ease: Minimum Flesch Reading Ease score required.
+
+    Returns:
+        ReadabilityValidator instance.
+    """
+    return ReadabilityValidator(max_grade=max_grade, min_ease=min_ease)
+
+
+def contains(
+    patterns: list[str],
+    case_sensitive: bool = False,
+) -> ContainsValidator:
+    """Create a contains validator.
+
+    Args:
+        patterns: List of substrings or regex patterns that must be present.
+        case_sensitive: Whether matching is case-sensitive. Defaults to False.
+
+    Returns:
+        ContainsValidator instance.
+    """
+    return ContainsValidator(patterns=patterns, case_sensitive=case_sensitive)
+
+
+def excludes(
+    patterns: list[str],
+    case_sensitive: bool = False,
+) -> ExcludesValidator:
+    """Create an excludes validator.
+
+    Args:
+        patterns: List of substrings or regex patterns that must not be present.
+        case_sensitive: Whether matching is case-sensitive. Defaults to False.
+
+    Returns:
+        ExcludesValidator instance.
+    """
+    return ExcludesValidator(patterns=patterns, case_sensitive=case_sensitive)
+
+
+def all_of(checks: list[Check]) -> AllOf:
+    """Create an AllOf composite validator.
+
+    Args:
+        checks: List of checks that must all pass.
+
+    Returns:
+        AllOf instance.
+    """
+    return AllOf(checks=checks)
+
+
+def any_of(checks: list[Check]) -> AnyOf:
+    """Create an AnyOf composite validator.
+
+    Args:
+        checks: List of checks where at least one must pass.
+
+    Returns:
+        AnyOf instance.
+    """
+    return AnyOf(checks=checks)
+
+
+def semantic(
+    min_score: float,
+    model: str = "all-MiniLM-L6-v2",
+    cache_embeddings: bool = True,
+) -> SemanticValidator:
+    """Create a semantic similarity validator.
+
+    Requires the `veritext[semantic]` extra to be installed.
+
+    Args:
+        min_score: Minimum semantic similarity required (0.0 to 1.0).
+        model: Name of the sentence-transformers model to use.
+        cache_embeddings: Whether to cache embeddings for repeated texts.
+
+    Returns:
+        SemanticValidator instance.
+    """
+    return SemanticValidator(
+        min_score=min_score, model=model, cache_embeddings=cache_embeddings
+    )
+
+
+__all__ = [
+    "AllOf",
+    "AnyOf",
+    "BleuValidator",
+    "Check",
+    "ContainsValidator",
+    "ExcludesValidator",
+    "LengthValidator",
+    "LexicalValidator",
+    "ReadabilityValidator",
+    "RougeValidator",
+    "SemanticValidator",
+    "all_of",
+    "any_of",
+    "bleu",
+    "contains",
+    "excludes",
+    "length",
+    "lexical",
+    "readability",
+    "rouge",
+    "semantic",
+]
@@ -0,0 +1,31 @@
+"""Base types and protocols for validation checks."""
+
+from typing import Protocol, runtime_checkable
+
+from veritext.core.types import CheckResult, ValidationContext
+
+
+@runtime_checkable
+class Check(Protocol):
+    """Protocol for validation checks.
+
+    A Check computes a score or property of text and compares it against
+    a threshold to produce a pass/fail result.
+    """
+
+    @property
+    def name(self) -> str:
+        """Return the name of this check."""
+        ...
+
+    def check(self, text: str, context: ValidationContext) -> CheckResult:
+        """Run the check and return a result.
+
+        Args:
+            text: The text to validate.
+            context: Validation context containing reference text and metadata.
+
+        Returns:
+            CheckResult with pass/fail status and diagnostics.
+        """
+        ...
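Because `Check` is a `runtime_checkable` protocol, any object with the right members satisfies it without inheriting from it. The sketch below shows what `isinstance` does and does not verify; it uses a simplified protocol whose `check` takes only the text and returns a bool, which is an assumption made to keep the example standalone, and all class names are hypothetical.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class SimpleCheck(Protocol):
    """Simplified stand-in for a validation check protocol."""

    @property
    def name(self) -> str: ...

    def check(self, text: str) -> bool: ...


class NonEmpty:
    """Satisfies the protocol structurally, with no inheritance."""

    name = "non_empty"

    def check(self, text: str) -> bool:
        return bool(text.strip())


class WrongSignature:
    """Has both members, but check() takes no text argument."""

    name = "wrong_signature"

    def check(self) -> bool:
        return True


class Unrelated:
    """Has neither member."""


print(isinstance(NonEmpty(), SimpleCheck))        # True
print(isinstance(WrongSignature(), SimpleCheck))  # True: only member presence is checked
print(isinstance(Unrelated(), SimpleCheck))       # False
```

The runtime check confirms only that the members exist, so signature and return-type mismatches are left for a static type checker to catch.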
@@ -0,0 +1,90 @@
+"""Composite validators for combining multiple checks."""
+
+from veritext.core.types import CheckResult, ValidationContext, ValidationResult
+from veritext.validators.base import Check
+
+
+class AllOf:
+    """Passes only if all checks pass."""
+
+    def __init__(self, checks: list[Check]) -> None:
+        """
+        Initialise the AllOf composite validator.
+
+        Args:
+            checks: List of checks that must all pass.
+
+        Raises:
+            ValueError: If checks list is empty.
+        """
+        if not checks:
+            raise ValueError("checks list cannot be empty")
+
+        self._checks = checks
+
+    @property
+    def name(self) -> str:
+        """Return the name of this composite check."""
+        return "all_of"
+
+    def check(self, text: str, context: ValidationContext) -> ValidationResult:
+        """
+        Run all checks and return aggregate result.
+
+        Args:
+            text: The text to validate.
+            context: Validation context containing reference text and metadata.
+
+        Returns:
+            ValidationResult that passes only if all checks pass.
+        """
+        results: list[CheckResult] = []
+        for check in self._checks:
+            results.append(check.check(text, context))
+
+        all_passed = all(r.passed for r in results)
+
+        return ValidationResult(passed=all_passed, checks=results)
+
+
+class AnyOf:
+    """Passes if any check passes."""
+
+    def __init__(self, checks: list[Check]) -> None:
+        """
+        Initialise the AnyOf composite validator.
+
+        Args:
+            checks: List of checks where at least one must pass.
+
+        Raises:
+            ValueError: If checks list is empty.
+        """
+        if not checks:
+            raise ValueError("checks list cannot be empty")
+
+        self._checks = checks
+
+    @property
+    def name(self) -> str:
+        """Return the name of this composite check."""
+        return "any_of"
+
+    def check(self, text: str, context: ValidationContext) -> ValidationResult:
+        """
+        Run all checks and return aggregate result.
+
+        Args:
+            text: The text to validate.
+            context: Validation context containing reference text and metadata.
+
+        Returns:
+            ValidationResult that passes if any check passes.
+        """
+        results: list[CheckResult] = []
+        for check in self._checks:
+            results.append(check.check(text, context))
+
+        any_passed = any(r.passed for r in results)
+
+        return ValidationResult(passed=any_passed, checks=results)
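Review note: the sketch below illustrates the aggregation behaviour of `AllOf` and `AnyOf` without importing veritext. The `CheckResult` and `ValidationResult` dataclasses are simplified stand-ins with assumed fields, and `MinLength`, `run_all_of`, and `run_any_of` are hypothetical names used only for illustration. The point it shows is that both composites run every check and keep every result, differing only in `all` versus `any`.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CheckResult:
    """Simplified stand-in for the real CheckResult (fields assumed)."""

    name: str
    passed: bool


@dataclass(frozen=True)
class ValidationResult:
    """Simplified stand-in for the real ValidationResult (fields assumed)."""

    passed: bool
    checks: list[CheckResult] = field(default_factory=list)


class MinLength:
    """Hypothetical check: passes when the text has at least n characters."""

    def __init__(self, n: int) -> None:
        self._n = n

    def check(self, text: str) -> CheckResult:
        return CheckResult(name=f"min_length_{self._n}", passed=len(text) >= self._n)


def run_all_of(checks: list[MinLength], text: str) -> ValidationResult:
    # Every check runs; the aggregate passes only if all of them pass.
    results = [c.check(text) for c in checks]
    return ValidationResult(passed=all(r.passed for r in results), checks=results)


def run_any_of(checks: list[MinLength], text: str) -> ValidationResult:
    # Every check still runs; the aggregate passes if at least one passes.
    results = [c.check(text) for c in checks]
    return ValidationResult(passed=any(r.passed for r in results), checks=results)


checks = [MinLength(3), MinLength(10)]
all_result = run_all_of(checks, "hello")
any_result = run_any_of(checks, "hello")

print(all_result.passed, any_result.passed)  # False True
print(len(any_result.checks))  # 2
```

Because no check is skipped, a failing `AnyOf` still reports why each alternative failed, at the cost of running checks whose outcome can no longer change the verdict.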
@@ -0,0 +1,337 @@
+"""Constraint validators that do not require reference text."""
+
+import re
+
+from veritext.core.exceptions import InvalidThresholdError
+from veritext.core.tokenisation import WordTokeniser
+from veritext.core.types import CheckResult, ValidationContext
+from veritext.metrics.readability import Readability
+
+
+class LengthValidator:
+    """Validates text length constraints."""
+
+    def __init__(
+        self,
+        min_chars: int | None = None,
+        max_chars: int | None = None,
+        min_words: int | None = None,
+        max_words: int | None = None,
+        tokeniser: WordTokeniser | None = None,
+    ) -> None:
+        """
+        Initialise the length validator.
+
+        Args:
+            min_chars: Minimum character count (inclusive).
+            max_chars: Maximum character count (inclusive).
+            min_words: Minimum word count (inclusive).
+            max_words: Maximum word count (inclusive).
+            tokeniser: Tokeniser to use for word counting. Defaults to WordTokeniser().
+
+        Raises:
+            InvalidThresholdError: If no constraints provided or invalid values.
+        """
+        if all(v is None for v in (min_chars, max_chars, min_words, max_words)):
+            raise InvalidThresholdError("At least one length constraint must be set")
+
+        if min_chars is not None and min_chars < 0:
+            raise InvalidThresholdError(f"min_chars must be >= 0, got {min_chars}")
+        if max_chars is not None and max_chars < 0:
+            raise InvalidThresholdError(f"max_chars must be >= 0, got {max_chars}")
+        if min_words is not None and min_words < 0:
+            raise InvalidThresholdError(f"min_words must be >= 0, got {min_words}")
+        if max_words is not None and max_words < 0:
+            raise InvalidThresholdError(f"max_words must be >= 0, got {max_words}")
+
+        if min_chars is not None and max_chars is not None and min_chars > max_chars:
+            raise InvalidThresholdError(
+                f"min_chars ({min_chars}) cannot exceed max_chars ({max_chars})"
+            )
+        if min_words is not None and max_words is not None and min_words > max_words:
+            raise InvalidThresholdError(
+                f"min_words ({min_words}) cannot exceed max_words ({max_words})"
+            )
+
+        self._min_chars = min_chars
+        self._max_chars = max_chars
+        self._min_words = min_words
+        self._max_words = max_words
+        self._tokeniser = tokeniser if tokeniser is not None else WordTokeniser()
+
+    @property
+    def name(self) -> str:
+        """Return the name of this check."""
+        return "length"
+
+    def check(self, text: str, context: ValidationContext) -> CheckResult:  # noqa: ARG002
+        """
+        Run the length check.
+
+        Args:
+            text: The text to validate.
+            context: Validation context (not used for length checks).
+
+        Returns:
+            CheckResult with pass/fail status.
+        """
+        char_count = len(text)
+        words = self._tokeniser.tokenise(text)
+        word_count = len(words)
+
+        failures = []
+
+        if self._min_chars is not None and char_count < self._min_chars:
+            failures.append(f"{char_count} chars < min {self._min_chars}")
+        if self._max_chars is not None and char_count > self._max_chars:
+            failures.append(f"{char_count} chars > max {self._max_chars}")
+        if self._min_words is not None and word_count < self._min_words:
+            failures.append(f"{word_count} words < min {self._min_words}")
+        if self._max_words is not None and word_count > self._max_words:
+            failures.append(f"{word_count} words > max {self._max_words}")
+
+        passed = len(failures) == 0
+
+        if passed:
+            message = f"Length check passed: {char_count} chars, {word_count} words"
+        else:
+            message = "Length check failed: " + "; ".join(failures)
+
+        actual = {"chars": char_count, "words": word_count}
+        threshold = {}
+        if self._min_chars is not None:
+            threshold["min_chars"] = self._min_chars
+        if self._max_chars is not None:
+            threshold["max_chars"] = self._max_chars
+        if self._min_words is not None:
+            threshold["min_words"] = self._min_words
+        if self._max_words is not None:
+            threshold["max_words"] = self._max_words
+
+        return CheckResult(
+            name=self.name,
+            passed=passed,
+            actual=actual,
+            threshold=threshold,
+            message=message,
+        )
+
+
+class ReadabilityValidator:
+    """Validates Flesch-Kincaid readability."""
+
+    def __init__(
+        self,
+        max_grade: float | None = None,
+        min_ease: float | None = None,
+    ) -> None:
+        """
+        Initialise the readability validator.
+
+        Args:
+            max_grade: Maximum Flesch-Kincaid grade level allowed.
+            min_ease: Minimum Flesch Reading Ease score required.
+
+        Raises:
+            InvalidThresholdError: If no constraints provided.
+        """
+        if max_grade is None and min_ease is None:
+            raise InvalidThresholdError(
+                "At least one of max_grade or min_ease must be provided"
+            )
+
+        self._max_grade = max_grade
+        self._min_ease = min_ease
+        self._metric = Readability()
+
+    @property
+    def name(self) -> str:
+        """Return the name of this check."""
+        return "readability"
+
+    def check(self, text: str, context: ValidationContext) -> CheckResult:  # noqa: ARG002
+        """
+        Run the readability check.
+
+        Args:
+            text: The text to validate.
+            context: Validation context (not used for readability checks).
+
+        Returns:
+            CheckResult with pass/fail status.
+        """
+        result = self._metric.score(text)
+
+        failures = []
+        if (
+            self._max_grade is not None
+            and result.flesch_kincaid_grade > self._max_grade
+        ):
+            failures.append(
+                f"grade level {result.flesch_kincaid_grade:.1f} "
+                f"> max {self._max_grade:.1f}"
+            )
+
+        if self._min_ease is not None and result.flesch_reading_ease < self._min_ease:
+            failures.append(
+                f"reading ease {result.flesch_reading_ease:.1f} "
+                f"< min {self._min_ease:.1f}"
+            )
+
+        passed = len(failures) == 0
+
+        if passed:
+            parts = []
+            if self._max_grade is not None:
+                parts.append(
+                    f"grade {result.flesch_kincaid_grade:.1f} <= {self._max_grade:.1f}"
+                )
+            if self._min_ease is not None:
+                parts.append(
+                    f"ease {result.flesch_reading_ease:.1f} >= {self._min_ease:.1f}"
+                )
+            message = "Readability: " + ", ".join(parts)
+        else:
+            message = "Readability: " + "; ".join(failures)
+
+        actual = {
+            "grade": result.flesch_kincaid_grade,
+            "ease": result.flesch_reading_ease,
+        }
+        threshold = {}
+        if self._max_grade is not None:
+            threshold["max_grade"] = self._max_grade
+        if self._min_ease is not None:
+            threshold["min_ease"] = self._min_ease
+
+        return CheckResult(
+            name=self.name,
+            passed=passed,
+            actual=actual,
+            threshold=threshold,
+            message=message,
+        )
+
+
+class ContainsValidator:
+    """Validates text contains required patterns."""
+
+    def __init__(
+        self,
+        patterns: list[str],
+        case_sensitive: bool = False,
+    ) -> None:
+        """
+        Initialise the contains validator.
+
+        Args:
+            patterns: Regexes that must all match; re.escape() literal substrings.
+            case_sensitive: Whether matching is case-sensitive. Defaults to False.
+
+        Raises:
+            InvalidThresholdError: If patterns list is empty.
+        """
+        if not patterns:
+            raise InvalidThresholdError("patterns list cannot be empty")
+
+        self._patterns = patterns
+        self._case_sensitive = case_sensitive
+        self._flags = 0 if case_sensitive else re.IGNORECASE
+
+    @property
+    def name(self) -> str:
+        """Return the name of this check."""
+        return "contains"
+
+    def check(self, text: str, context: ValidationContext) -> CheckResult:  # noqa: ARG002
+        """
+        Run the contains check.
+
+        Args:
+            text: The text to validate.
+            context: Validation context (not used for contains checks).
+
+        Returns:
+            CheckResult with pass/fail status.
+        """
+        missing = []
+        for pattern in self._patterns:
+            if not re.search(pattern, text, self._flags):
+                missing.append(pattern)
+
+        passed = len(missing) == 0
+
+        if passed:
+            message = f"Text contains all {len(self._patterns)} required pattern(s)"
+        else:
+            message = f"Text missing {len(missing)} pattern(s): {missing}"
+
+        return CheckResult(
+            name=self.name,
+            passed=passed,
+            actual={"found": len(self._patterns) - len(missing), "missing": missing},
+            threshold={"patterns": self._patterns},
+            message=message,
+        )
+
+
+class ExcludesValidator:
+    """Validates text excludes forbidden patterns."""
+
+    def __init__(
+        self,
+        patterns: list[str],
+        case_sensitive: bool = False,
+    ) -> None:
+        """
+        Initialise the excludes validator.
+
+        Args:
+            patterns: Regexes that must not match; re.escape() literal substrings.
+            case_sensitive: Whether matching is case-sensitive. Defaults to False.
+
+        Raises:
+            InvalidThresholdError: If patterns list is empty.
+        """
+        if not patterns:
+            raise InvalidThresholdError("patterns list cannot be empty")
+
+        self._patterns = patterns
+        self._case_sensitive = case_sensitive
+        self._flags = 0 if case_sensitive else re.IGNORECASE
+
+    @property
+    def name(self) -> str:
+        """Return the name of this check."""
+        return "excludes"
+
+    def check(self, text: str, context: ValidationContext) -> CheckResult:  # noqa: ARG002
+        """
+        Run the excludes check.
+
+        Args:
+            text: The text to validate.
+            context: Validation context (not used for excludes checks).
+
+        Returns:
+            CheckResult with pass/fail status.
+        """
+        found = []
+        for pattern in self._patterns:
+            if re.search(pattern, text, self._flags):
+                found.append(pattern)
+
+        passed = len(found) == 0
+
+        if passed:
+            message = f"Text excludes all {len(self._patterns)} forbidden pattern(s)"
+        else:
+            message = f"Text contains {len(found)} forbidden pattern(s): {found}"
+
+        return CheckResult(
+            name=self.name,
+            passed=passed,
+            actual={"excluded": len(self._patterns) - len(found), "found": found},
+            threshold={"patterns": self._patterns},
+            message=message,
+        )
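Review note: `ContainsValidator` and `ExcludesValidator` pass each pattern straight to `re.search`, so a plain substring that contains regex metacharacters is not matched literally. The standard-library sketch below shows the difference; the sample strings are made up for illustration.

```python
import re

text = "Total cost: $5.00 (incl. tax)"

# Treated as a regex, "$" anchors the end of the string, so "$5.00" cannot match.
raw_match = re.search("$5.00", text)

# re.escape() turns the metacharacters into literals, giving a substring match.
literal_match = re.search(re.escape("$5.00"), text)

# Without escaping, "." matches any character, so "5.00" also matches "5x00".
loose_match = re.search("5.00", "Order 5x00 shipped")

print(raw_match is None)       # True
print(literal_match.group(0))  # $5.00
print(loose_match.group(0))    # 5x00
```

Callers who want literal matching can pass `re.escape(s)` for each pattern. An invalid regex would only surface as `re.error` when `check()` runs, since the constructors do not compile the patterns.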
@@ -0,0 +1,370 @@
+"""Metric-based validators that require reference text."""
+
+from typing import Literal
+
+from veritext.core.exceptions import InvalidThresholdError, ValidationError
+from veritext.core.tokenisation import WordTokeniser
+from veritext.core.types import CheckResult, ValidationContext
+from veritext.metrics.bleu import Bleu
+from veritext.metrics.lexical import Lexical
+from veritext.metrics.rouge import Rouge
+
+
+class BleuValidator:
+    """Validates that BLEU score meets minimum threshold."""
+
+    def __init__(
+        self,
+        min_score: float,
+        variant: Literal[1, 2, 3, 4] = 4,
+        tokeniser: WordTokeniser | None = None,
+    ) -> None:
+        """
+        Initialise the BLEU validator.
+
+        Args:
+            min_score: Minimum BLEU score required (0.0 to 1.0).
+            variant: BLEU variant to use (1, 2, 3, or 4). Defaults to 4.
+            tokeniser: Tokeniser to use. Defaults to WordTokeniser().
+
+        Raises:
+            InvalidThresholdError: If min_score or variant is invalid.
+        """
+        if not 0.0 <= min_score <= 1.0:
+            raise InvalidThresholdError(
+                f"min_score must be between 0.0 and 1.0, got {min_score}"
+            )
+        if variant not in (1, 2, 3, 4):
+            raise InvalidThresholdError(f"variant must be 1, 2, 3, or 4, got {variant}")
+
+        self._min_score = min_score
+        self._variant = variant
+        self._metric = Bleu(tokeniser=tokeniser)
+
+    @property
+    def name(self) -> str:
+        """Return the name of this check."""
+        return f"bleu-{self._variant}"
+
+    def check(self, text: str, context: ValidationContext) -> CheckResult:
+        """
+        Run the BLEU check.
+
+        Args:
+            text: The text to validate.
+            context: Validation context containing reference text.
+
+        Returns:
+            CheckResult with pass/fail status.
+
+        Raises:
+            ValidationError: If reference text is missing from context.
+        """
+        if context.reference is None:
+            raise ValidationError(f"{self.name} requires reference text in context")
+
+        result = self._metric.score(text, context.reference)
+
+        # Select the appropriate BLEU variant
+        score_map = {
+            1: result.bleu1,
+            2: result.bleu2,
+            3: result.bleu3,
+            4: result.bleu4,
+        }
+        actual_score = score_map[self._variant]
+        passed = actual_score >= self._min_score
+
+        if passed:
+            message = (
+                f"BLEU-{self._variant} score {actual_score:.2f} "
+                f"meets minimum {self._min_score:.2f}"
+            )
+        else:
+            message = (
+                f"BLEU-{self._variant} score {actual_score:.2f} "
+                f"below minimum {self._min_score:.2f}"
+            )
+
+        return CheckResult(
+            name=self.name,
+            passed=passed,
+            actual=actual_score,
+            threshold=self._min_score,
+            message=message,
+        )
+
+
+class RougeValidator:
+    """Validates that ROUGE score meets minimum threshold."""
+
+    def __init__(
+        self,
+        min_score: float,
+        variant: Literal["1", "2", "l"] = "l",
+        tokeniser: WordTokeniser | None = None,
+    ) -> None:
+        """
+        Initialise the ROUGE validator.
+
+        Args:
+            min_score: Minimum ROUGE F-measure required (0.0 to 1.0).
+            variant: ROUGE variant ("1", "2", or "l"). Defaults to "l".
+            tokeniser: Tokeniser to use. Defaults to WordTokeniser().
+
+        Raises:
+            InvalidThresholdError: If min_score or variant is invalid.
+        """
+        if not 0.0 <= min_score <= 1.0:
+            raise InvalidThresholdError(
+                f"min_score must be between 0.0 and 1.0, got {min_score}"
+            )
+        if variant not in ("1", "2", "l"):
+            raise InvalidThresholdError(
+                f"variant must be '1', '2', or 'l', got '{variant}'"
+            )
+
+        self._min_score = min_score
+        self._variant = variant
+        self._metric = Rouge(tokeniser=tokeniser)
+
+    @property
+    def name(self) -> str:
+        """Return the name of this check."""
+        return f"rouge-{self._variant}"
+
+    def check(self, text: str, context: ValidationContext) -> CheckResult:
+        """
+        Run the ROUGE check.
+
+        Args:
+            text: The text to validate.
+            context: Validation context containing reference text.
+
+        Returns:
+            CheckResult with pass/fail status.
+
+        Raises:
+            ValidationError: If reference text is missing from context.
+        """
+        if context.reference is None:
+            raise ValidationError(f"{self.name} requires reference text in context")
+
+        result = self._metric.score(text, context.reference)
+
+        # Select the appropriate ROUGE variant (use F-measure)
+        score_map = {
+            "1": result.rouge1.fmeasure,
+            "2": result.rouge2.fmeasure,
+            "l": result.rouge_l.fmeasure,
+        }
+        actual_score = score_map[self._variant]
+        passed = actual_score >= self._min_score
+
+        if passed:
+            message = (
+                f"ROUGE-{self._variant.upper()} score {actual_score:.2f} "
+                f"meets minimum {self._min_score:.2f}"
+            )
+        else:
+            message = (
+                f"ROUGE-{self._variant.upper()} score {actual_score:.2f} "
+                f"below minimum {self._min_score:.2f}"
+            )
+
+        return CheckResult(
+            name=self.name,
+            passed=passed,
+            actual=actual_score,
+            threshold=self._min_score,
+            message=message,
+        )
+
+
+class LexicalValidator:
+    """Validates lexical similarity meets threshold."""
+
+    def __init__(
+        self,
+        min_jaccard: float | None = None,
+        min_overlap: float | None = None,
+        tokeniser: WordTokeniser | None = None,
+    ) -> None:
+        """
+        Initialise the lexical validator.
+
+        Args:
+            min_jaccard: Minimum Jaccard similarity required (0.0 to 1.0).
+            min_overlap: Minimum token overlap required (0.0 to 1.0).
+            tokeniser: Tokeniser to use. Defaults to WordTokeniser().
+
+        Raises:
+            InvalidThresholdError: If thresholds are invalid or none provided.
+        """
+        if min_jaccard is None and min_overlap is None:
+            raise InvalidThresholdError(
+                "At least one of min_jaccard or min_overlap must be provided"
+            )
+
+        if min_jaccard is not None and not 0.0 <= min_jaccard <= 1.0:
+            raise InvalidThresholdError(
+                f"min_jaccard must be between 0.0 and 1.0, got {min_jaccard}"
+            )
+
+        if min_overlap is not None and not 0.0 <= min_overlap <= 1.0:
+            raise InvalidThresholdError(
+                f"min_overlap must be between 0.0 and 1.0, got {min_overlap}"
+            )
+
+        self._min_jaccard = min_jaccard
+        self._min_overlap = min_overlap
+        self._metric = Lexical(tokeniser=tokeniser)
+
+    @property
+    def name(self) -> str:
+        """Return the name of this check."""
+        return "lexical"
+
+    def check(self, text: str, context: ValidationContext) -> CheckResult:
+        """
+        Run the lexical similarity check.
+
+        Args:
+            text: The text to validate.
+            context: Validation context containing reference text.
+
+        Returns:
+            CheckResult with pass/fail status.
+
+        Raises:
+            ValidationError: If reference text is missing from context.
+        """
+        if context.reference is None:
+            raise ValidationError(f"{self.name} requires reference text in context")
+
+        result = self._metric.score(text, context.reference)
+
+        # Check each threshold that was specified
+        failures = []
+        if self._min_jaccard is not None and result.jaccard < self._min_jaccard:
+            failures.append(
+                f"Jaccard {result.jaccard:.2f} below minimum {self._min_jaccard:.2f}"
+            )
+
+        if self._min_overlap is not None and result.token_overlap < self._min_overlap:
+            failures.append(
+                f"token overlap {result.token_overlap:.2f} "
+                f"below minimum {self._min_overlap:.2f}"
+            )
+
+        passed = len(failures) == 0
+
+        if passed:
+            parts = []
+            if self._min_jaccard is not None:
+                parts.append(f"Jaccard {result.jaccard:.2f} >= {self._min_jaccard:.2f}")
+            if self._min_overlap is not None:
+                parts.append(
+                    f"overlap {result.token_overlap:.2f} >= {self._min_overlap:.2f}"
+                )
+            message = "Lexical similarity: " + ", ".join(parts)
+        else:
+            message = "Lexical similarity: " + "; ".join(failures)
+
+        # Build actual value dict
+        actual = {"jaccard": result.jaccard, "token_overlap": result.token_overlap}
+        threshold = {}
+        if self._min_jaccard is not None:
+            threshold["min_jaccard"] = self._min_jaccard
+        if self._min_overlap is not None:
+            threshold["min_overlap"] = self._min_overlap
+
+        return CheckResult(
+            name=self.name,
+            passed=passed,
+            actual=actual,
+            threshold=threshold,
+            message=message,
+        )
+
+
+class SemanticValidator:
+    """Validates that semantic similarity meets minimum threshold.
+
+    Requires the `veritext[semantic]` extra to be installed.
+    """
+
+    def __init__(
+        self,
+        min_score: float,
+        model: str = "all-MiniLM-L6-v2",
+        cache_embeddings: bool = True,
+    ) -> None:
+        """
+        Initialise the semantic validator.
+
+        Args:
+            min_score: Minimum semantic similarity required (0.0 to 1.0).
+            model: Name of the sentence-transformers model to use.
+            cache_embeddings: Whether to cache embeddings for repeated texts.
+
+        Raises:
+            InvalidThresholdError: If min_score is not in range [0.0, 1.0].
+            DependencyError: If sentence-transformers is not installed.
+        """
+        if not 0.0 <= min_score <= 1.0:
+            raise InvalidThresholdError(
+                f"min_score must be between 0.0 and 1.0, got {min_score}"
+            )
+
+        self._min_score = min_score
+        # Lazy import to avoid loading PyTorch unless needed
+        from veritext.semantic.similarity import SemanticSimilarity
+
+        self._metric: SemanticSimilarity = SemanticSimilarity(
+            model=model, cache_embeddings=cache_embeddings
+        )
+
+    @property
+    def name(self) -> str:
+        """Return the name of this check."""
+        return "semantic"
+
+    def check(self, text: str, context: ValidationContext) -> CheckResult:
+        """
+        Run the semantic similarity check.
+
+        Args:
+            text: The text to validate.
+            context: Validation context containing reference text.
+
+        Returns:
+            CheckResult with pass/fail status.
+
+        Raises:
+            ValidationError: If reference text is missing from context.
+        """
+        if context.reference is None:
+            raise ValidationError(f"{self.name} requires reference text in context")
+
+        result = self._metric.score(text, context.reference)
+        passed = result.similarity >= self._min_score
+
+        if passed:
+            message = (
+                f"Semantic similarity {result.similarity:.2f} "
+                f"meets minimum {self._min_score:.2f}"
+            )
+        else:
+            message = (
+                f"Semantic similarity {result.similarity:.2f} "
+                f"below minimum {self._min_score:.2f}"
+            )
+
+        return CheckResult(
+            name=self.name,
+            passed=passed,
+            actual=result.similarity,
+            threshold=self._min_score,
+            message=message,
+        )
@@ -0,0 +1 @@
+"""Tests for the benchmark module."""
@@ -0,0 +1,145 @@
+"""Tests for benchmark data models."""
+
+from datetime import UTC, datetime
+
+import pytest
+from pydantic import ValidationError
+
+from veritext.benchmark.models import BenchmarkRun, RegressionReport
+
+
+class TestBenchmarkRun:
+    """Tests for BenchmarkRun model."""
+
+    def test_create_benchmark_run(self) -> None:
+        """BenchmarkRun can be created with required fields."""
+        run = BenchmarkRun(
+            id="test-id-123",
+            benchmark_name="test-benchmark",
+            timestamp=datetime(2025, 1, 15, 12, 0, 0, tzinfo=UTC),
+            veritext_version="0.1.0-dev",
+            metrics={"bleu4": 0.75, "rouge_l": 0.82},
+            sample_count=100,
+        )
+
+        assert run.id == "test-id-123"
+        assert run.benchmark_name == "test-benchmark"
+        assert run.veritext_version == "0.1.0-dev"
+        assert run.metrics == {"bleu4": 0.75, "rouge_l": 0.82}
+        assert run.sample_count == 100
+        assert run.metadata == {}
+
+    def test_create_with_metadata(self) -> None:
+        """BenchmarkRun can include optional metadata."""
+        run = BenchmarkRun(
+            id="test-id-456",
+            benchmark_name="test-benchmark",
+            timestamp=datetime(2025, 1, 15, 12, 0, 0, tzinfo=UTC),
+            veritext_version="0.1.0-dev",
+            metrics={"bleu4": 0.75},
+            sample_count=50,
+            metadata={"git_sha": "abc123", "model_version": "gpt-4"},
+        )
+
+        assert run.metadata == {"git_sha": "abc123", "model_version": "gpt-4"}
+
+    def test_frozen_model(self) -> None:
+        """BenchmarkRun is immutable."""
+        run = BenchmarkRun(
+            id="test-id",
+            benchmark_name="test",
+            timestamp=datetime(2025, 1, 15, 12, 0, 0, tzinfo=UTC),
+            veritext_version="0.1.0",
+            metrics={"bleu4": 0.5},
+            sample_count=10,
+        )
+
+        with pytest.raises(ValidationError):
+            run.id = "new-id"  # type: ignore[misc]
+
+    def test_serialisation(self) -> None:
+        """BenchmarkRun can be serialised to dict."""
+        run = BenchmarkRun(
+            id="test-id",
+            benchmark_name="test",
+            timestamp=datetime(2025, 1, 15, 12, 0, 0, tzinfo=UTC),
+            veritext_version="0.1.0",
+            metrics={"bleu4": 0.5},
+            sample_count=10,
+        )
+
+        data = run.model_dump()
+        assert data["id"] == "test-id"
+        assert data["benchmark_name"] == "test"
+        assert data["metrics"] == {"bleu4": 0.5}
+
+
+class TestRegressionReport:
+    """Tests for RegressionReport model."""
+
+    def test_no_regression_summary(self) -> None:
+        """Summary indicates no regression when detected is False."""
+        report = RegressionReport(
+            detected=False,
+            baseline={"bleu4": 0.75, "rouge_l": 0.80},
+            current={"bleu4": 0.76, "rouge_l": 0.81},
+            deltas={"bleu4": 0.01, "rouge_l": 0.01},
+            tolerance=0.05,
+        )
+
+        assert "No regression detected" in report.summary
+
+    def test_regression_summary(self) -> None:
+        """Summary lists regressed metrics when detected is True."""
+        report = RegressionReport(
+            detected=True,
+            baseline={"bleu4": 0.75, "rouge_l": 0.80},
+            current={"bleu4": 0.65, "rouge_l": 0.78},
+            deltas={"bleu4": -0.10, "rouge_l": -0.02},
+            tolerance=0.05,
+        )
+
+        assert "Regression detected" in report.summary
+        assert "bleu4" in report.summary
+        assert "0.6500" in report.summary
+        assert "baseline: 0.7500" in report.summary
+
+    def test_regression_excludes_within_tolerance(self) -> None:
+        """Summary only shows metrics that exceed tolerance."""
+        report = RegressionReport(
+            detected=True,
+            baseline={"bleu4": 0.75, "rouge_l": 0.80},
+            current={"bleu4": 0.65, "rouge_l": 0.78},
+            deltas={"bleu4": -0.10, "rouge_l": -0.02},
+            tolerance=0.05,
+        )
+
+        # rouge_l is -0.02, within tolerance of 0.05, so shouldn't appear
+        assert "rouge_l" not in report.summary
+        # bleu4 is -0.10, exceeds tolerance, so should appear
+        assert "bleu4" in report.summary
+
+    def test_frozen_model(self) -> None:
+        """RegressionReport is immutable."""
+        report = RegressionReport(
+            detected=False,
+            baseline={},
+            current={},
+            deltas={},
+            tolerance=0.05,
+        )
+
+        with pytest.raises(ValidationError):
+            report.detected = True  # type: ignore[misc]
+
+    def test_tolerance_in_summary(self) -> None:
+        """Summary includes tolerance threshold."""
+        report = RegressionReport(
+            detected=True,
+            baseline={"metric": 0.80},
+            current={"metric": 0.50},
+            deltas={"metric": -0.30},
+            tolerance=0.10,
+        )
+
+        assert "10.00%" in report.summary
@@ -0,0 +1,229 @@
+"""Tests for regression detection."""
+
+from datetime import UTC, datetime
+
+import pytest
+
+from veritext.benchmark.models import BenchmarkRun
+from veritext.benchmark.regression import compute_baseline, detect_regression
+
+
+def make_run(
+    run_id: str,
+    metrics: dict[str, float],
+    day: int = 1,
+) -> BenchmarkRun:
+    """Helper to create a BenchmarkRun."""
+    return BenchmarkRun(
+        id=run_id,
+        benchmark_name="test",
+        timestamp=datetime(2025, 1, day, 12, 0, 0, tzinfo=UTC),
+        veritext_version="0.1.0",
+        metrics=metrics,
+        sample_count=10,
+    )
+
+
+class TestComputeBaseline:
+    """Tests for baseline computation."""
+
+    def test_empty_runs(self) -> None:
+        """Returns empty baseline for empty runs list."""
+        baseline = compute_baseline([])
+        assert baseline == {}
+
+    def test_single_run(self) -> None:
+        """Single run produces baseline equal to that run's metrics."""
+        runs = [make_run("r1", {"bleu4": 0.75, "rouge_l": 0.80})]
+
+        baseline = compute_baseline(runs)
+
+        assert baseline["bleu4"] == 0.75
+        assert baseline["rouge_l"] == 0.80
+
+    def test_multiple_runs_average(self) -> None:
+        """Baseline is the average of all runs in window."""
+        runs = [
+            make_run("r1", {"bleu4": 0.70}, day=3),
+            make_run("r2", {"bleu4": 0.80}, day=2),
+            make_run("r3", {"bleu4": 0.90}, day=1),
+        ]
+
+        baseline = compute_baseline(runs, window=3)
+
+        assert baseline["bleu4"] == pytest.approx(0.80)  # (0.70+0.80+0.90)/3
+
+    def test_window_limits_runs(self) -> None:
+        """Only includes runs within the window size."""
+        runs = [
+            make_run("r1", {"bleu4": 0.70}, day=5),  # most recent
+            make_run("r2", {"bleu4": 0.80}, day=4),
+            make_run("r3", {"bleu4": 0.90}, day=3),
+            make_run("r4", {"bleu4": 0.60}, day=2),  # excluded
+            make_run("r5", {"bleu4": 0.50}, day=1),  # excluded
+        ]
+
+        baseline = compute_baseline(runs, window=3)
+
+        # Only first 3 runs: (0.70 + 0.80 + 0.90) / 3 = 0.80
+        assert baseline["bleu4"] == pytest.approx(0.80)
+
+    def test_partial_history(self) -> None:
+        """Works when fewer runs than window size exist."""
+        runs = [
+            make_run("r1", {"bleu4": 0.70}),
+            make_run("r2", {"bleu4": 0.80}),
+        ]
+
+        baseline = compute_baseline(runs, window=10)
+
+        # Only 2 runs available: (0.70 + 0.80) / 2 = 0.75
+        assert baseline["bleu4"] == pytest.approx(0.75)
+
+    def test_multiple_metrics(self) -> None:
+        """Computes baseline for all metrics present."""
+        runs = [
+            make_run("r1", {"bleu4": 0.70, "rouge_l": 0.75}),
+            make_run("r2", {"bleu4": 0.80, "rouge_l": 0.85}),
+        ]
+
+        baseline = compute_baseline(runs)
+
+        assert baseline["bleu4"] == pytest.approx(0.75)
+        assert baseline["rouge_l"] == pytest.approx(0.80)
+
+    def test_varying_metrics(self) -> None:
+        """Handles runs with different metric sets."""
+        runs = [
+            make_run("r1", {"bleu4": 0.70, "rouge_l": 0.75}),
+            make_run("r2", {"bleu4": 0.80}),  # No rouge_l
+        ]
+
+        baseline = compute_baseline(runs)
+
+        # bleu4 appears in both runs
+        assert baseline["bleu4"] == pytest.approx(0.75)
+        # rouge_l appears in only one run, so its baseline is that run's value
+        assert baseline["rouge_l"] == pytest.approx(0.75)
+
+
+class TestDetectRegression:
+    """Tests for regression detection."""
+
+    def test_no_baseline(self) -> None:
+        """No regression when baseline is empty."""
+        report = detect_regression(
+            current={"bleu4": 0.70},
+            baseline={},
+            tolerance=0.05,
+        )
+
+        assert not report.detected
+        assert report.deltas == {}
+
+    def test_no_regression_stable(self) -> None:
+        """No regression when metrics are stable."""
+        report = detect_regression(
+            current={"bleu4": 0.75},
+            baseline={"bleu4": 0.75},
+            tolerance=0.05,
+        )
+
+        assert not report.detected
+        assert report.deltas["bleu4"] == pytest.approx(0.0)
+
+    def test_no_regression_improved(self) -> None:
+        """No regression when metrics improved."""
+        report = detect_regression(
+            current={"bleu4": 0.85},
+            baseline={"bleu4": 0.75},
+            tolerance=0.05,
+        )
+
+        assert not report.detected
+        assert report.deltas["bleu4"] == pytest.approx(0.10)
+
+    def test_no_regression_within_tolerance(self) -> None:
+        """No regression when drop is within tolerance."""
+        report = detect_regression(
+            current={"bleu4": 0.73},
+            baseline={"bleu4": 0.75},
+            tolerance=0.05,
+        )
+
+        assert not report.detected
+        assert report.deltas["bleu4"] == pytest.approx(-0.02)
+
+    def test_regression_detected(self) -> None:
+        """Regression detected when metric drops beyond tolerance."""
+        report = detect_regression(
+            current={"bleu4": 0.65},
+            baseline={"bleu4": 0.75},
+            tolerance=0.05,
+        )
+
+        assert report.detected
+        assert report.deltas["bleu4"] == pytest.approx(-0.10)
+
+    def test_regression_at_tolerance_boundary(self) -> None:
+        """Drop at tolerance boundary is not a regression."""
+        # Binary-exact values: 0.25 - 0.50 == -0.25 with no float rounding.
+        # The implementation checks delta < -tolerance (strictly less than)
+        report = detect_regression(
+            current={"bleu4": 0.25},
+            baseline={"bleu4": 0.50},
+            tolerance=0.25,
+        )
+
+        # Delta is exactly -tolerance, which is not strictly beyond it
+        assert not report.detected
+        assert report.deltas["bleu4"] == -0.25
+
+    def test_regression_just_beyond_tolerance(self) -> None:
+        """Just beyond tolerance is a regression."""
+        report = detect_regression(
+            current={"bleu4": 0.6999},
+            baseline={"bleu4": 0.75},
+            tolerance=0.05,
+        )
+
+        # Delta is -0.0501, which is < -tolerance
+        assert report.detected
+
+    def test_multiple_metrics_any_regresses(self) -> None:
+        """Regression detected if any metric exceeds tolerance."""
+        report = detect_regression(
+            current={"bleu4": 0.65, "rouge_l": 0.80},
+            baseline={"bleu4": 0.75, "rouge_l": 0.80},
+            tolerance=0.05,
+        )
+
+        assert report.detected
+        # Only bleu4 regressed
+        assert report.deltas["bleu4"] == pytest.approx(-0.10)
+        assert report.deltas["rouge_l"] == pytest.approx(0.0)
+
+    def test_report_contains_all_values(self) -> None:
+        """Report includes baseline, current, and deltas."""
+        baseline = {"bleu4": 0.75, "rouge_l": 0.80}
+        current = {"bleu4": 0.65, "rouge_l": 0.82}
+
+        report = detect_regression(current, baseline, tolerance=0.05)
+
+        assert report.baseline == baseline
+        assert report.current == current
+        assert report.tolerance == 0.05
+        assert "bleu4" in report.deltas
+        assert "rouge_l" in report.deltas
+
+    def test_missing_metric_in_current(self) -> None:
+        """Missing metric in current treated as zero."""
+        report = detect_regression(
+            current={},
+            baseline={"bleu4": 0.75},
+            tolerance=0.05,
+        )
+
+        # 0.0 - 0.75 = -0.75, which is a regression
+        assert report.detected
+        assert report.deltas["bleu4"] == pytest.approx(-0.75)
@@ -0,0 +1,247 @@
+"""Tests for benchmark runner."""
+
+from pathlib import Path
+
+import pytest
+
+from veritext.benchmark.models import BenchmarkRun
+from veritext.benchmark.runner import Benchmark
+from veritext.core.exceptions import RegressionDetectedError
+
+
+@pytest.fixture
+def benchmark(tmp_path: Path) -> Benchmark:
+    """Create a Benchmark instance with temporary storage."""
+    return Benchmark("test-suite", storage_path=tmp_path / "benchmarks")
+
+
+@pytest.fixture
+def sample_data() -> tuple[list[str], list[str]]:
+    """Sample candidates and references for testing."""
+    candidates = [
+        "The quick brown fox jumps over the lazy dog.",
+        "A fast auburn fox leaps above the sleepy hound.",
+    ]
+    references = [
+        "The quick brown fox jumps over the lazy dog.",
+        "The swift brown fox jumps over the lazy dog.",
+    ]
+    return candidates, references
+
+
+class TestBenchmarkInit:
+    """Tests for Benchmark initialisation."""
+
+    def test_creates_storage_directory(self, tmp_path: Path) -> None:
+        """Benchmark creates storage directory on init."""
+        storage_path = tmp_path / "benchmarks"
+        Benchmark("my-suite", storage_path=storage_path)
+
+        assert storage_path.exists()
+
+    def test_name_property(self, benchmark: Benchmark) -> None:
+        """Benchmark exposes its name."""
+        assert benchmark.name == "test-suite"
+
+
+class TestEvaluate:
+    """Tests for the evaluate method."""
+
+    def test_evaluate_stores_run(
+        self, benchmark: Benchmark, sample_data: tuple[list[str], list[str]]
+    ) -> None:
+        """Evaluate creates and stores a benchmark run."""
+        candidates, references = sample_data
+
+        run = benchmark.evaluate(candidates, references)
+
+        assert isinstance(run, BenchmarkRun)
+        assert run.benchmark_name == "test-suite"
+        assert run.sample_count == 2
+
+    def test_evaluate_returns_metrics(
+        self, benchmark: Benchmark, sample_data: tuple[list[str], list[str]]
+    ) -> None:
+        """Evaluate computes default metrics."""
+        candidates, references = sample_data
+
+        run = benchmark.evaluate(candidates, references)
+
+        # Default metrics are rouge_l and bleu4
+        assert "rouge_l" in run.metrics
+        assert "bleu4" in run.metrics
+        assert 0.0 <= run.metrics["rouge_l"] <= 1.0
+        assert 0.0 <= run.metrics["bleu4"] <= 1.0
+
+    def test_evaluate_custom_metrics(
+        self, benchmark: Benchmark, sample_data: tuple[list[str], list[str]]
+    ) -> None:
+        """Evaluate can compute custom metrics."""
+        candidates, references = sample_data
+
+        run = benchmark.evaluate(
+            candidates, references, metrics=["bleu1", "bleu2", "rouge1"]
+        )
+
+        assert "bleu1" in run.metrics
+        assert "bleu2" in run.metrics
+        assert "rouge1" in run.metrics
+        assert "bleu4" not in run.metrics  # Not requested
+
+    def test_evaluate_with_metadata(
+        self, benchmark: Benchmark, sample_data: tuple[list[str], list[str]]
+    ) -> None:
+        """Evaluate can include metadata."""
+        candidates, references = sample_data
+
+        run = benchmark.evaluate(
+            candidates, references, metadata={"git_sha": "abc123", "model": "gpt-4"}
+        )
+
+        assert run.metadata == {"git_sha": "abc123", "model": "gpt-4"}
+
+    def test_evaluate_stores_retrievable(
+        self, benchmark: Benchmark, sample_data: tuple[list[str], list[str]]
+    ) -> None:
+        """Stored run can be retrieved."""
+        candidates, references = sample_data
+        run = benchmark.evaluate(candidates, references)
+
+        history = benchmark.get_history()
+
+        assert len(history) == 1
+        assert history[0].id == run.id
+
+
+class TestCheckRegression:
+    """Tests for regression checking."""
+
+    def test_check_no_runs(self, benchmark: Benchmark) -> None:
+        """No regression when no runs exist."""
+        report = benchmark.check_regression()
+
+        assert not report.detected
+        assert report.baseline == {}
+        assert report.current == {}
+
+    def test_check_single_run(
+        self, benchmark: Benchmark, sample_data: tuple[list[str], list[str]]
+    ) -> None:
+        """No regression with single run (no baseline)."""
+        candidates, references = sample_data
+        benchmark.evaluate(candidates, references)
+
+        report = benchmark.check_regression()
+
+        # First run has no baseline to compare against
+        assert not report.detected
+
+    def test_check_stable_metrics(
+        self, benchmark: Benchmark, sample_data: tuple[list[str], list[str]]
+    ) -> None:
+        """No regression when metrics are stable."""
+        candidates, references = sample_data
+
+        # Run multiple times with same data
+        for _ in range(3):
+            benchmark.evaluate(candidates, references)
+
+        report = benchmark.check_regression()
+        assert not report.detected
+
+    def test_check_reports_regression(self, tmp_path: Path) -> None:
+        """Reports regression when metrics drop significantly."""
+        benchmark = Benchmark("regress-test", storage_path=tmp_path / "benchmarks")
+
+        # First run with good metrics
+        good_candidates = ["The quick brown fox jumps."]
+        good_references = ["The quick brown fox jumps."]
+        benchmark.evaluate(good_candidates, good_references)
+
+        # Second run with worse metrics (different text)
+        bad_candidates = ["Something completely different here."]
+        benchmark.evaluate(bad_candidates, good_references)
+
+        report = benchmark.check_regression(tolerance=0.05)
+
+        # Should detect regression since second run is very different
+        assert report.detected
+
+
+class TestAssertNoRegression:
+    """Tests for assert_no_regression method."""
+
+    def test_passes_when_stable(
+        self, benchmark: Benchmark, sample_data: tuple[list[str], list[str]]
+    ) -> None:
+        """Does not raise when metrics are stable."""
+        candidates, references = sample_data
+
+        for _ in range(3):
+            benchmark.evaluate(candidates, references)
+
+        # Should not raise
+        benchmark.assert_no_regression()
+
+    def test_raises_on_regression(self, tmp_path: Path) -> None:
+        """Raises RegressionDetectedError when quality drops."""
+        benchmark = Benchmark("regress-test", storage_path=tmp_path / "benchmarks")
+
+        # Establish baseline with perfect match
+        perfect = ["The quick brown fox."]
+        benchmark.evaluate(perfect, perfect)
+
+        # Second run with terrible match
+        terrible = ["Completely unrelated text."]
+        benchmark.evaluate(terrible, perfect)
+
+        with pytest.raises(RegressionDetectedError):
+            benchmark.assert_no_regression(tolerance=0.05)
+
+
+class TestGetHistory:
+    """Tests for get_history method."""
+
+    def test_empty_history(self, benchmark: Benchmark) -> None:
+        """Returns empty list when no runs."""
+        history = benchmark.get_history()
+        assert history == []
+
+    def test_returns_runs(
+        self, benchmark: Benchmark, sample_data: tuple[list[str], list[str]]
+    ) -> None:
+        """Returns benchmark runs."""
+        candidates, references = sample_data
+
+        run1 = benchmark.evaluate(candidates, references)
+        run2 = benchmark.evaluate(candidates, references)
+
+        history = benchmark.get_history()
+
+        assert len(history) == 2
+        assert history[0].id == run2.id  # Most recent first
+        assert history[1].id == run1.id
+
+    def test_respects_limit(
+        self, benchmark: Benchmark, sample_data: tuple[list[str], list[str]]
+    ) -> None:
+        """Respects limit parameter."""
+        candidates, references = sample_data
+
+        for _ in range(5):
+            benchmark.evaluate(candidates, references)
+
+        history = benchmark.get_history(limit=3)
+        assert len(history) == 3
+
+    def test_default_limit(
+        self, benchmark: Benchmark, sample_data: tuple[list[str], list[str]]
+    ) -> None:
+        """Default limit is 20."""
+        candidates, references = sample_data
+
+        for _ in range(25):
+            benchmark.evaluate(candidates, references)
+
+        history = benchmark.get_history()
+        assert len(history) == 20
@@ -0,0 +1,297 @@
+"""Tests for benchmark SQLite storage."""
+
+import sqlite3
+import threading
+from datetime import UTC, datetime
+from pathlib import Path
+
+import pytest
+
+from veritext.benchmark.models import BenchmarkRun
+from veritext.benchmark.storage import BenchmarkStorage
+from veritext.core.exceptions import StorageError
+
+
+@pytest.fixture
+def db_path(tmp_path: Path) -> Path:
+    """Return a temporary database path."""
+    return tmp_path / "benchmarks" / "test.db"
+
+
+@pytest.fixture
+def storage(db_path: Path) -> BenchmarkStorage:
+    """Create a BenchmarkStorage instance."""
+    return BenchmarkStorage(db_path)
+
+
+@pytest.fixture
+def sample_run() -> BenchmarkRun:
+    """Create a sample benchmark run."""
+    return BenchmarkRun(
+        id="run-001",
+        benchmark_name="test-suite",
+        timestamp=datetime(2025, 1, 15, 12, 0, 0, tzinfo=UTC),
+        veritext_version="0.1.0-dev",
+        metrics={"bleu4": 0.75, "rouge_l": 0.82},
+        sample_count=100,
+        metadata={"git_sha": "abc123"},
+    )
+
+
+class TestDatabaseCreation:
+    """Tests for database initialisation."""
+
+    def test_creates_database_file(self, db_path: Path) -> None:
+        """Storage creates the database file on init."""
+        assert not db_path.exists()
+        BenchmarkStorage(db_path)
+        assert db_path.exists()
+
+    def test_creates_parent_directories(self, tmp_path: Path) -> None:
+        """Storage creates parent directories if needed."""
+        nested_path = tmp_path / "deep" / "nested" / "path" / "test.db"
+        BenchmarkStorage(nested_path)
+        assert nested_path.exists()
+
+    def test_creates_tables(self, db_path: Path) -> None:
+        """Storage creates required tables."""
+        BenchmarkStorage(db_path)
+
+        conn = sqlite3.connect(str(db_path))
+        cursor = conn.execute("SELECT name FROM sqlite_master WHERE type='table'")
+        tables = {row[0] for row in cursor.fetchall()}
+        conn.close()
+
+        assert "benchmark_runs" in tables
+        assert "benchmark_metrics" in tables
+
+    def test_creates_index(self, db_path: Path) -> None:
+        """Storage creates index on benchmark_name and timestamp."""
+        BenchmarkStorage(db_path)
+
+        conn = sqlite3.connect(str(db_path))
+        cursor = conn.execute("SELECT name FROM sqlite_master WHERE type='index'")
+        indices = {row[0] for row in cursor.fetchall()}
+        conn.close()
+
+        assert "idx_benchmark_name" in indices
+
+
+class TestSaveRun:
+    """Tests for saving benchmark runs."""
+
+    def test_save_run(
+        self, storage: BenchmarkStorage, sample_run: BenchmarkRun
+    ) -> None:
+        """Storage can save a benchmark run."""
+        storage.save_run(sample_run)
+
+        runs = storage.get_runs("test-suite")
+        assert len(runs) == 1
+        assert runs[0].id == "run-001"
+
+    def test_save_preserves_all_fields(
+        self, storage: BenchmarkStorage, sample_run: BenchmarkRun
+    ) -> None:
+        """Saved run preserves all fields correctly."""
+        storage.save_run(sample_run)
+
+        runs = storage.get_runs("test-suite")
+        run = runs[0]
+
+        assert run.id == sample_run.id
+        assert run.benchmark_name == sample_run.benchmark_name
+        assert run.timestamp == sample_run.timestamp
+        assert run.veritext_version == sample_run.veritext_version
+        assert run.metrics == sample_run.metrics
+        assert run.sample_count == sample_run.sample_count
+        assert run.metadata == sample_run.metadata
+
+    def test_save_duplicate_id_raises(
+        self, storage: BenchmarkStorage, sample_run: BenchmarkRun
+    ) -> None:
+        """Saving a run with duplicate ID raises StorageError."""
+        storage.save_run(sample_run)
+
+        with pytest.raises(StorageError, match="already exists"):
+            storage.save_run(sample_run)
+
+    def test_save_run_empty_metadata(self, storage: BenchmarkStorage) -> None:
+        """Run with empty metadata saves correctly."""
+        run = BenchmarkRun(
+            id="run-no-meta",
+            benchmark_name="test-suite",
+            timestamp=datetime(2025, 1, 15, 12, 0, 0, tzinfo=UTC),
+            veritext_version="0.1.0-dev",
+            metrics={"bleu4": 0.5},
+            sample_count=10,
+        )
+
+        storage.save_run(run)
+        retrieved = storage.get_latest_run("test-suite")
+
+        assert retrieved is not None
+        assert retrieved.metadata == {}
+
+
+class TestGetRuns:
+    """Tests for retrieving benchmark runs."""
+
+    def test_get_runs_empty_database(self, storage: BenchmarkStorage) -> None:
+        """Returns empty list for empty database."""
+        runs = storage.get_runs("nonexistent")
+        assert runs == []
+
+    def test_get_runs_filters_by_name(self, storage: BenchmarkStorage) -> None:
+        """Returns only runs matching the benchmark name."""
+        run1 = BenchmarkRun(
+            id="run-1",
+            benchmark_name="suite-a",
+            timestamp=datetime(2025, 1, 15, 12, 0, 0, tzinfo=UTC),
+            veritext_version="0.1.0",
+            metrics={"bleu4": 0.5},
+            sample_count=10,
+        )
+        run2 = BenchmarkRun(
+            id="run-2",
+            benchmark_name="suite-b",
+            timestamp=datetime(2025, 1, 15, 12, 0, 0, tzinfo=UTC),
+            veritext_version="0.1.0",
+            metrics={"bleu4": 0.6},
+            sample_count=10,
+        )
+
+        storage.save_run(run1)
+        storage.save_run(run2)
+
+        runs_a = storage.get_runs("suite-a")
+        runs_b = storage.get_runs("suite-b")
+
+        assert len(runs_a) == 1
+        assert runs_a[0].id == "run-1"
+        assert len(runs_b) == 1
+        assert runs_b[0].id == "run-2"
+
+    def test_get_runs_ordered_by_timestamp(self, storage: BenchmarkStorage) -> None:
+        """Returns runs ordered by timestamp, most recent first."""
+        run_old = BenchmarkRun(
+            id="run-old",
+            benchmark_name="test",
+            timestamp=datetime(2025, 1, 10, 12, 0, 0, tzinfo=UTC),
+            veritext_version="0.1.0",
+            metrics={"bleu4": 0.5},
+            sample_count=10,
+        )
+        run_new = BenchmarkRun(
+            id="run-new",
+            benchmark_name="test",
+            timestamp=datetime(2025, 1, 20, 12, 0, 0, tzinfo=UTC),
+            veritext_version="0.1.0",
+            metrics={"bleu4": 0.6},
+            sample_count=10,
+        )
+
+        # Save in reverse order
+        storage.save_run(run_new)
+        storage.save_run(run_old)
+
+        runs = storage.get_runs("test")
+        assert runs[0].id == "run-new"
+        assert runs[1].id == "run-old"
+
+    def test_get_runs_with_limit(self, storage: BenchmarkStorage) -> None:
+        """Respects limit parameter."""
+        for i in range(5):
+            run = BenchmarkRun(
+                id=f"run-{i}",
+                benchmark_name="test",
+                timestamp=datetime(2025, 1, i + 1, 12, 0, 0, tzinfo=UTC),
+                veritext_version="0.1.0",
+                metrics={"bleu4": 0.5 + i * 0.1},
+                sample_count=10,
+            )
+            storage.save_run(run)
+
+        runs = storage.get_runs("test", limit=3)
+        assert len(runs) == 3
+
+
+class TestGetLatestRun:
+    """Tests for getting the latest run."""
+
+    def test_get_latest_run_empty(self, storage: BenchmarkStorage) -> None:
+        """Returns None for empty database."""
+        result = storage.get_latest_run("nonexistent")
+        assert result is None
+
+    def test_get_latest_run(self, storage: BenchmarkStorage) -> None:
+        """Returns the most recent run."""
+        run_old = BenchmarkRun(
+            id="run-old",
+            benchmark_name="test",
+            timestamp=datetime(2025, 1, 10, 12, 0, 0, tzinfo=UTC),
+            veritext_version="0.1.0",
+            metrics={"bleu4": 0.5},
+            sample_count=10,
+        )
+        run_new = BenchmarkRun(
+            id="run-new",
+            benchmark_name="test",
+            timestamp=datetime(2025, 1, 20, 12, 0, 0, tzinfo=UTC),
+            veritext_version="0.1.0",
+            metrics={"bleu4": 0.6},
+            sample_count=10,
+        )
+
+        storage.save_run(run_old)
+        storage.save_run(run_new)
+
+        latest = storage.get_latest_run("test")
+        assert latest is not None
+        assert latest.id == "run-new"
+
+
+class TestConcurrentAccess:
+    """Tests for concurrent database access."""
+
+    def test_concurrent_writes(self, db_path: Path) -> None:
+        """Writes from multiple threads all succeed in WAL mode (writers are still serialised)."""
+        errors: list[Exception] = []
+
+        def write_run(run_id: int) -> None:
+            try:
+                storage = BenchmarkStorage(db_path)
+                run = BenchmarkRun(
+                    id=f"run-{run_id}",
+                    benchmark_name="test",
+                    timestamp=datetime(2025, 1, 15, 12, 0, run_id, tzinfo=UTC),
+                    veritext_version="0.1.0",
+                    metrics={"bleu4": 0.5},
+                    sample_count=10,
+                )
+                storage.save_run(run)
+            except Exception as e:
+                errors.append(e)
+
+        threads = [threading.Thread(target=write_run, args=(i,)) for i in range(10)]
+        for t in threads:
+            t.start()
+        for t in threads:
+            t.join()
+
+        assert not errors, f"Concurrent writes failed: {errors}"
+
+        storage = BenchmarkStorage(db_path)
+        runs = storage.get_runs("test")
+        assert len(runs) == 10
+
+    def test_wal_mode_enabled(self, db_path: Path) -> None:
+        """Database uses WAL journal mode."""
+        BenchmarkStorage(db_path)
+
+        conn = sqlite3.connect(str(db_path))
+        cursor = conn.execute("PRAGMA journal_mode")
+        mode = cursor.fetchone()[0]
+        conn.close()
+
+        assert mode.lower() == "wal"
@@ -0,0 +1,274 @@
+"""Tests for the readability metric."""
+
+import pytest
+
+from veritext.metrics import Readability, ReadabilityResult
+
+
+class TestReadability:
+    """Tests for the Readability metric class."""
+
+    @pytest.fixture
+    def readability(self) -> Readability:
+        """Provide a readability metric instance."""
+        return Readability()
+
+    def test_name(self, readability: Readability) -> None:
+        """Test that name returns 'readability'."""
+        assert readability.name == "readability"
+
+    def test_requires_reference(self, readability: Readability) -> None:
+        """Test that readability does NOT require reference text."""
+        assert readability.requires_reference is False
+
+    def test_simple_text(self, readability: Readability) -> None:
+        """Test readability of simple, easy text."""
+        # Simple children's text - short sentences, simple words
+        text = "The cat sat. The dog ran. I see a bird."
+        result = readability.score(text)
+
+        # Should have low grade level and high reading ease
+        assert result.flesch_kincaid_grade < 5.0
+        assert result.flesch_reading_ease > 80.0
+
+    def test_complex_text(self, readability: Readability) -> None:
+        """Test readability of complex, academic text."""
+        # Complex academic text - long sentences, polysyllabic words
+        text = (
+            "The implementation of sophisticated computational methodologies "
+            "necessitates comprehensive understanding of algorithmic complexity "
+            "and architectural considerations."
+        )
+        result = readability.score(text)
+
+        # Should have high grade level and low reading ease
+        assert result.flesch_kincaid_grade > 12.0
+        assert result.flesch_reading_ease < 30.0
+
+    def test_medium_text(self, readability: Readability) -> None:
+        """Test readability of medium-difficulty text."""
+        text = (
+            "The weather today is quite pleasant. "
+            "Many people are enjoying the sunshine in the park. "
+            "Children play while parents watch nearby."
+        )
+        result = readability.score(text)
+
+        # Should be middle of the road
+        assert 3.0 < result.flesch_kincaid_grade < 10.0
+        assert 50.0 < result.flesch_reading_ease < 90.0
+
+    def test_single_sentence(self, readability: Readability) -> None:
+        """Test readability with a single sentence."""
+        text = "The cat sat on the mat."
+        result = readability.score(text)
+
+        # Should compute without error
+        assert result.flesch_kincaid_grade is not None
+        assert result.flesch_reading_ease is not None
+
+    def test_single_word(self, readability: Readability) -> None:
+        """Test readability with a single word."""
+        text = "Cat"
+        result = readability.score(text)
+
+        # Should handle single word (1 word, 1 sentence, 1 syllable)
+        assert result.flesch_kincaid_grade is not None
+        assert result.flesch_reading_ease is not None
+
+    def test_empty_text(self, readability: Readability) -> None:
+        """Test that empty text returns zero scores."""
+        result = readability.score("")
+
+        assert result.flesch_kincaid_grade == 0.0
+        assert result.flesch_reading_ease == 0.0
+
+    def test_whitespace_only(self, readability: Readability) -> None:
+        """Test that whitespace-only text returns zero scores."""
+        result = readability.score("   \t\n  ")
+
+        assert result.flesch_kincaid_grade == 0.0
+        assert result.flesch_reading_ease == 0.0
+
+    def test_reference_ignored(self, readability: Readability) -> None:
+        """Test that reference parameter is ignored."""
+        text = "The cat sat on the mat."
+
+        # Score with no reference
+        result1 = readability.score(text)
+        # Score with reference (should be ignored)
+        result2 = readability.score(text, "Completely different text")
+        # Score with list of references
+        result3 = readability.score(text, ["ref1", "ref2"])
+
+        # All should produce identical results
+        assert result1.flesch_kincaid_grade == result2.flesch_kincaid_grade
+        assert result1.flesch_reading_ease == result2.flesch_reading_ease
+        assert result1.flesch_kincaid_grade == result3.flesch_kincaid_grade
+
+    def test_punctuation_handling(self, readability: Readability) -> None:
+        """Test that punctuation affects sentence counting."""
+        # Same words, different sentence structure
+        text1 = "The cat sat on the mat"  # 1 sentence
+        text2 = "The cat sat. On the mat."  # 2 sentences
+
+        result1 = readability.score(text1)
+        result2 = readability.score(text2)
+
+        # Different sentence counts should affect scores
+        assert result1.flesch_kincaid_grade != result2.flesch_kincaid_grade
+
+    def test_question_marks_count_sentences(self, readability: Readability) -> None:
+        """Test that question marks end sentences."""
+        text = "What is this? It is a test."
+        result = readability.score(text)
+
+        # Should count as 2 sentences
+        # With 7 words total, words_per_sentence = 3.5
+        assert result.flesch_kincaid_grade is not None
+
+    def test_exclamation_marks_count_sentences(self, readability: Readability) -> None:
+        """Test that exclamation marks end sentences."""
+        text = "Wow! That is amazing!"
+        result = readability.score(text)
+
+        # Should count as 2 sentences
+        assert result.flesch_kincaid_grade is not None
+
+    def test_multiple_punctuation(self, readability: Readability) -> None:
+        """Test handling of multiple punctuation marks."""
+        text = "What?! That's crazy... Well then."
+        result = readability.score(text)
+
+        # Should handle gracefully
+        assert result.flesch_kincaid_grade is not None
+
+    def test_result_score_property(self, readability: Readability) -> None:
+        """Test that result.score returns flesch_reading_ease."""
+        result = readability.score("The cat sat on the mat.")
+        assert result.score == result.flesch_reading_ease
+
+    def test_contractions(self, readability: Readability) -> None:
+        """Test handling of contractions."""
+        text = "I'm going to the store. It's not far away."
+        result = readability.score(text)
+
+        # Should handle contractions as words
+        assert result.flesch_kincaid_grade is not None
+        assert result.flesch_reading_ease is not None
+
+
+class TestReadabilityBatch:
+    """Tests for readability batch scoring."""
+
+    @pytest.fixture
+    def readability(self) -> Readability:
+        """Provide a readability metric instance."""
+        return Readability()
+
+    def test_batch_score_basic(self, readability: Readability) -> None:
+        """Test basic batch scoring."""
+        candidates = [
+            "The cat sat on the mat.",
+            "A dog ran through the park.",
+        ]
+        result = readability.batch_score(candidates)
+
+        assert result.count == 2
+        assert len(result.results) == 2
+
+    def test_batch_score_statistics(self, readability: Readability) -> None:
+        """Test that batch scoring computes statistics."""
+        candidates = [
+            "Cat sat.",  # Very simple
+            "The implementation of sophisticated methodologies requires expertise.",
+        ]
+        result = readability.batch_score(candidates)
+
+        # Check statistics are computed
+        assert "flesch_kincaid_grade" in result.stats
+        assert "flesch_reading_ease" in result.stats
+
+        # First should be easier than second
+        assert (
+            result.results[0].flesch_reading_ease
+            > result.results[1].flesch_reading_ease
+        )
+
+    def test_batch_score_percentiles(self, readability: Readability) -> None:
+        """Test that batch scoring computes percentiles."""
+        candidates = ["a", "b", "c", "d", "e"]
+        result = readability.batch_score(candidates)
+
+        stats = result.stats["flesch_reading_ease"]
+        assert 25 in stats.percentiles
+        assert 50 in stats.percentiles
+        assert 75 in stats.percentiles
+        assert 95 in stats.percentiles
+
+    def test_batch_score_references_ignored(self, readability: Readability) -> None:
+        """Test that batch scoring ignores references."""
+        candidates = ["The cat sat.", "A dog ran."]
+
+        result1 = readability.batch_score(candidates)
+        result2 = readability.batch_score(candidates, ["ref1", "ref2"])
+
+        # Results should be identical
+        assert result1.results[0].flesch_kincaid_grade == (
+            result2.results[0].flesch_kincaid_grade
+        )
+
+    def test_batch_score_empty_list_raises(self, readability: Readability) -> None:
+        """Test that empty candidate list raises ValueError."""
+        with pytest.raises(ValueError, match="empty"):
+            readability.batch_score([])
+
+
+class TestReadabilityResult:
+    """Tests for ReadabilityResult type."""
+
+    def test_frozen(self) -> None:
+        """Test that ReadabilityResult is frozen."""
+        from pydantic import ValidationError
+
+        result = ReadabilityResult(flesch_kincaid_grade=5.0, flesch_reading_ease=70.0)
+        with pytest.raises(ValidationError):
+            result.flesch_kincaid_grade = 6.0  # type: ignore[misc]
+
+    def test_values(self) -> None:
+        """Test that values are stored correctly."""
+        result = ReadabilityResult(flesch_kincaid_grade=8.5, flesch_reading_ease=65.0)
+        assert result.flesch_kincaid_grade == 8.5
+        assert result.flesch_reading_ease == 65.0
+
+    def test_score_property(self) -> None:
+        """Test that score property returns flesch_reading_ease."""
+        result = ReadabilityResult(flesch_kincaid_grade=8.5, flesch_reading_ease=65.0)
+        assert result.score == 65.0
+
+
+class TestSyllableCounting:
+    """Tests for syllable counting heuristics."""
+
+    @pytest.fixture
+    def readability(self) -> Readability:
+        """Provide a readability metric instance."""
+        return Readability()
+
+    def test_monosyllabic_words(self, readability: Readability) -> None:
+        """Test that monosyllabic words don't inflate scores."""
+        # All one-syllable words
+        text = "The cat sat on the mat."
+        result = readability.score(text)
+
+        # Should be very easy to read
+        assert result.flesch_reading_ease > 90.0
+
+    def test_polysyllabic_words(self, readability: Readability) -> None:
+        """Test that polysyllabic words affect scores."""
+        # Words with multiple syllables
+        text = "International communication facilitates understanding."
+        result = readability.score(text)
+
+        # Should be harder to read
+        assert result.flesch_reading_ease < 50.0
@@ -0,0 +1,295 @@
+"""Tests for the ROUGE metric."""
+
+import pytest
+
+from veritext.metrics import Rouge, RougeResult, RougeScore
+
+
+class TestRouge:
+    """Tests for the Rouge metric class."""
+
+    @pytest.fixture
+    def rouge(self) -> Rouge:
+        """Provide a ROUGE metric instance."""
+        return Rouge()
+
+    def test_name(self, rouge: Rouge) -> None:
+        """Test that name returns 'rouge'."""
+        assert rouge.name == "rouge"
+
+    def test_requires_reference(self, rouge: Rouge) -> None:
+        """Test that ROUGE requires reference text."""
+        assert rouge.requires_reference is True
+
+    def test_identical_texts(self, rouge: Rouge) -> None:
+        """Test that identical texts produce perfect scores."""
+        text = "The cat sat on the mat"
+        result = rouge.score(text, text)
+
+        assert result.rouge1.precision == 1.0
+        assert result.rouge1.recall == 1.0
+        assert result.rouge1.fmeasure == 1.0
+        assert result.rouge2.fmeasure == 1.0
+        assert result.rouge_l.fmeasure == 1.0
+
+    def test_no_overlap(self, rouge: Rouge) -> None:
+        """Test that texts with no overlap produce zero scores."""
+        candidate = "apple banana cherry"
+        reference = "dog elephant fox"
+        result = rouge.score(candidate, reference)
+
+        assert result.rouge1.precision == 0.0
+        assert result.rouge1.recall == 0.0
+        assert result.rouge1.fmeasure == 0.0
+        assert result.rouge2.fmeasure == 0.0
+        assert result.rouge_l.fmeasure == 0.0
+
+    def test_partial_overlap_rouge1(self, rouge: Rouge) -> None:
+        """Test ROUGE-1 with partial overlap."""
+        candidate = "the cat sat"
+        reference = "the dog sat"
+        result = rouge.score(candidate, reference)
+
+        # Candidate: {the, cat, sat}, Reference: {the, dog, sat}
+        # Overlap: {the, sat} = 2
+        # Precision = 2/3, Recall = 2/3
+        assert abs(result.rouge1.precision - 2 / 3) < 1e-10
+        assert abs(result.rouge1.recall - 2 / 3) < 1e-10
+
+    def test_partial_overlap_rouge2(self, rouge: Rouge) -> None:
+        """Test ROUGE-2 (bigram) with partial overlap."""
+        candidate = "the cat sat on the mat"
+        reference = "the cat lay on the mat"
+        result = rouge.score(candidate, reference)
+
+        # Bigrams in candidate: (the, cat), (cat, sat), (sat, on), (on, the), (the, mat)
+        # Bigrams in reference: (the, cat), (cat, lay), (lay, on), (on, the), (the, mat)
+        # Overlap: (the, cat), (on, the), (the, mat) = 3
+        # Precision = 3/5, Recall = 3/5
+        assert abs(result.rouge2.precision - 3 / 5) < 1e-10
+        assert abs(result.rouge2.recall - 3 / 5) < 1e-10
+
+    def test_rouge_l_basic(self, rouge: Rouge) -> None:
+        """Test ROUGE-L (LCS) computation."""
+        candidate = "the cat sat on the mat"
+        reference = "the cat sat"
+        result = rouge.score(candidate, reference)
+
+        # LCS = "the cat sat" = 3 tokens
+        # Precision = 3/6 = 0.5, Recall = 3/3 = 1.0
+        assert result.rouge_l.precision == 0.5
+        assert result.rouge_l.recall == 1.0
+
+    def test_rouge_l_non_contiguous(self, rouge: Rouge) -> None:
+        """Test ROUGE-L with non-contiguous LCS."""
+        candidate = "the big cat sat"
+        reference = "the cat sat"
+        result = rouge.score(candidate, reference)
+
+        # LCS = "the cat sat" = 3 (skipping "big")
+        # Precision = 3/4, Recall = 3/3 = 1.0
+        assert result.rouge_l.precision == 0.75
+        assert result.rouge_l.recall == 1.0
+
+    def test_precision_vs_recall(self, rouge: Rouge) -> None:
+        """Test that precision and recall differ appropriately."""
+        # Short candidate, long reference
+        candidate = "the cat"
+        reference = "the cat sat on the mat"
+        result = rouge.score(candidate, reference)
+
+        # Precision should be high (all candidate tokens in reference)
+        assert result.rouge1.precision == 1.0
+        # Recall should be lower (not all reference tokens in candidate)
+        assert result.rouge1.recall < 1.0
+
+    def test_empty_candidate(self, rouge: Rouge) -> None:
+        """Test that empty candidate returns zero scores."""
+        result = rouge.score("", "The cat sat")
+
+        assert result.rouge1.fmeasure == 0.0
+        assert result.rouge2.fmeasure == 0.0
+        assert result.rouge_l.fmeasure == 0.0
+
+    def test_whitespace_only_candidate(self, rouge: Rouge) -> None:
+        """Test that whitespace-only candidate returns zero scores."""
+        result = rouge.score("   \t\n  ", "The cat sat")
+
+        assert result.rouge1.fmeasure == 0.0
+        assert result.rouge_l.fmeasure == 0.0
+
+    def test_empty_reference_raises(self, rouge: Rouge) -> None:
+        """Test that empty reference raises ValueError."""
+        with pytest.raises(ValueError, match="cannot be empty"):
+            rouge.score("The cat sat", "")
+
+    def test_none_reference_raises(self, rouge: Rouge) -> None:
+        """Test that None reference raises ValueError."""
+        with pytest.raises(ValueError, match="requires reference"):
+            rouge.score("The cat sat", None)
+
+    def test_multiple_references_uses_max(self, rouge: Rouge) -> None:
+        """Test that multiple references use max scores."""
+        candidate = "the cat sat on the mat"
+        references = [
+            "a dog ran across the room",  # Low overlap
+            "the cat sat on the mat",  # Exact match
+        ]
+        result = rouge.score(candidate, references)
+
+        # Should get perfect scores due to exact match
+        assert result.rouge1.fmeasure == 1.0
+        assert result.rouge_l.fmeasure == 1.0
+
+    def test_multiple_references_partial(self, rouge: Rouge) -> None:
+        """Test multiple references with partial matches."""
+        candidate = "the quick brown fox"
+        references = [
+            "the fast brown fox",  # 3/4 match
+            "a quick brown dog",  # 2/4 match (quick, brown)
+        ]
+        result = rouge.score(candidate, references)
+
+        # Should pick best from either reference
+        assert result.rouge1.fmeasure > 0.0
+
+    def test_result_score_property(self, rouge: Rouge) -> None:
+        """Test that result.score returns rouge_l.fmeasure."""
+        result = rouge.score("The cat sat", "The cat sat")
+        assert result.score == result.rouge_l.fmeasure
+
+    def test_case_insensitivity(self, rouge: Rouge) -> None:
+        """Test that ROUGE is case-insensitive by default."""
+        result = rouge.score("THE CAT SAT", "the cat sat")
+        assert result.rouge1.fmeasure == 1.0
+        assert result.rouge_l.fmeasure == 1.0
+
+    def test_punctuation_ignored(self, rouge: Rouge) -> None:
+        """Test that punctuation is ignored by default."""
+        result = rouge.score("The cat sat.", "The cat sat!")
+        assert result.rouge1.fmeasure == 1.0
+
+    def test_single_word(self, rouge: Rouge) -> None:
+        """Test ROUGE with single word texts."""
+        result = rouge.score("cat", "cat")
+
+        assert result.rouge1.fmeasure == 1.0
+        # ROUGE-2 should be 0 for single words (no bigrams)
+        assert result.rouge2.fmeasure == 0.0
+        assert result.rouge_l.fmeasure == 1.0
+
+    def test_fmeasure_calculation(self, rouge: Rouge) -> None:
+        """Test that F-measure is calculated correctly."""
+        # Create a case where P != R
+        candidate = "the cat sat on"
+        reference = "the cat"
+        result = rouge.score(candidate, reference)
+
+        # P = 2/4 = 0.5, R = 2/2 = 1.0
+        # F = 2 * 0.5 * 1.0 / (0.5 + 1.0) = 1.0 / 1.5 = 2/3
+        expected_f = 2 * 0.5 * 1.0 / (0.5 + 1.0)
+        assert abs(result.rouge1.fmeasure - expected_f) < 1e-10
+
+
+class TestRougeBatch:
+    """Tests for ROUGE batch scoring."""
+
+    @pytest.fixture
+    def rouge(self) -> Rouge:
+        """Provide a ROUGE metric instance."""
+        return Rouge()
+
+    def test_batch_score_basic(self, rouge: Rouge) -> None:
+        """Test basic batch scoring."""
+        candidates = ["The cat sat", "A dog runs"]
+        references = ["The cat sat", "A dog runs"]
+        result = rouge.batch_score(candidates, references)
+
+        assert result.count == 2
+        assert len(result.results) == 2
+        assert all(r.rouge_l.fmeasure == 1.0 for r in result.results)
+
+    def test_batch_score_statistics(self, rouge: Rouge) -> None:
+        """Test that batch scoring computes statistics."""
+        candidates = ["The cat sat", "Completely different words"]
+        references = ["The cat sat", "The cat sat"]
+        result = rouge.batch_score(candidates, references)
+
+        # Check statistics are computed
+        assert "rouge1_fmeasure" in result.stats
+        assert "rouge2_fmeasure" in result.stats
+        assert "rouge_l_fmeasure" in result.stats
+        assert "rouge1_precision" in result.stats
+        assert "rouge1_recall" in result.stats
+
+        # First result should be 1.0, second should be 0.0
+        assert result.results[0].rouge1.fmeasure == 1.0
+        assert result.results[1].rouge1.fmeasure == 0.0
+
+    def test_batch_score_percentiles(self, rouge: Rouge) -> None:
+        """Test that batch scoring computes percentiles."""
+        candidates = ["a", "b", "c", "d", "e"]
+        references = ["a", "b", "c", "d", "e"]
+        result = rouge.batch_score(candidates, references)
+
+        stats = result.stats["rouge1_fmeasure"]
+        assert 25 in stats.percentiles
+        assert 50 in stats.percentiles
+        assert 75 in stats.percentiles
+        assert 95 in stats.percentiles
+
+    def test_batch_score_none_references_raises(self, rouge: Rouge) -> None:
+        """Test that batch scoring raises for None references."""
+        with pytest.raises(ValueError, match="requires reference"):
+            rouge.batch_score(["text"], None)
+
+    def test_batch_score_length_mismatch_raises(self, rouge: Rouge) -> None:
+        """Test that batch scoring raises for mismatched lengths."""
+        with pytest.raises(ValueError, match="must match"):
+            rouge.batch_score(["a", "b"], ["a"])
+
+    def test_batch_score_with_multiple_references(self, rouge: Rouge) -> None:
+        """Test batch scoring with multiple references per candidate."""
+        candidates = [
+            "The cat sat on the mat",
+            "A quick brown fox",
+        ]
+        references = [
+            ["The cat sat on the mat", "A cat rests on floor"],
+            ["A quick brown fox", "The fast brown fox"],
+        ]
+        result = rouge.batch_score(candidates, references)
+
+        assert result.count == 2
+        # Both should get perfect scores due to exact matches
+        assert result.results[0].rouge_l.fmeasure == 1.0
+        assert result.results[1].rouge_l.fmeasure == 1.0
+
+
+class TestRougeResult:
+    """Tests for RougeResult and RougeScore types."""
+
+    def test_rouge_score_frozen(self) -> None:
+        """Test that RougeScore is frozen."""
+        from pydantic import ValidationError
+
+        score = RougeScore(precision=0.5, recall=0.6, fmeasure=0.55)
+        with pytest.raises(ValidationError):
+            score.precision = 0.7  # type: ignore[misc]
+
+    def test_rouge_result_frozen(self) -> None:
+        """Test that RougeResult is frozen."""
+        from pydantic import ValidationError
+
+        score = RougeScore(precision=0.5, recall=0.6, fmeasure=0.55)
+        result = RougeResult(rouge1=score, rouge2=score, rouge_l=score)
+        with pytest.raises(ValidationError):
+            result.rouge1 = score  # type: ignore[misc]
+
+    def test_score_property(self) -> None:
+        """Test that score property returns rouge_l.fmeasure."""
+        r1 = RougeScore(precision=0.9, recall=0.9, fmeasure=0.9)
+        r2 = RougeScore(precision=0.8, recall=0.8, fmeasure=0.8)
+        rl = RougeScore(precision=0.7, recall=0.7, fmeasure=0.7)
+        result = RougeResult(rouge1=r1, rouge2=r2, rouge_l=rl)
+        assert result.score == 0.7
@@ -0,0 +1 @@
+"""Tests for the Veritext pytest plugin."""
@@ -0,0 +1,32 @@
+"""Pytest configuration for pytest_plugin tests."""
+
+import pytest
+
+from veritext.pytest_plugin.fixtures import ValidatorFactory
+
+# Enable the pytester fixture for plugin testing
+pytest_plugins = ["pytester"]
+
+# Define the plugin's fixtures locally for testing
+
+
+@pytest.fixture
+def text_validator() -> ValidatorFactory:
+    """Provide a factory for building validators."""
+    return ValidatorFactory()
+
+
+@pytest.fixture
+def validation_context() -> type:
+    """Provide a factory for creating ValidationContext objects."""
+    from typing import Any
+
+    from veritext.core.types import ValidationContext
+
+    def _create(
+        reference: str | list[str] | None = None,
+        **metadata: Any,
+    ) -> ValidationContext:
+        return ValidationContext(reference=reference, metadata=metadata)
+
+    return _create
@@ -0,0 +1,211 @@
+"""Tests for the validate_text assertion function."""
+
+import pytest
+
+from veritext.pytest_plugin import validate_text
+
+
+class TestValidateTextBasicValidation:
+    """Test basic validation scenarios."""
+
+    def test_passes_with_valid_length(self) -> None:
+        """Test validation passes when length constraints are met."""
+        text = "The quick brown fox jumps over the lazy dog."
+        validate_text(text, min_length=10, max_length=100)
+
+    def test_fails_when_too_short(self) -> None:
+        """Test validation fails when text is below minimum length."""
+        text = "Short."
+        with pytest.raises(AssertionError) as exc_info:
+            validate_text(text, min_length=50)
+        assert "length" in str(exc_info.value).lower()
+
+    def test_fails_when_too_long(self) -> None:
+        """Test validation fails when text exceeds maximum length."""
+        text = "A" * 100
+        with pytest.raises(AssertionError) as exc_info:
+            validate_text(text, max_length=50)
+        assert "length" in str(exc_info.value).lower()
+
+
+class TestValidateTextReadability:
+    """Test readability validation."""
+
+    def test_passes_with_simple_text(self) -> None:
+        """Test validation passes for simple, readable text."""
+        text = "The cat sat on the mat. It was a nice day."
+        validate_text(text, max_reading_grade=10.0)
+
+    def test_fails_with_complex_text(self) -> None:
+        """Test validation fails for overly complex text."""
+        text = (
+            "The implementation of sophisticated metacognitive strategies "
+            "necessitates the comprehensive understanding of epistemological "
+            "frameworks and their corresponding methodological implications."
+        )
+        with pytest.raises(AssertionError) as exc_info:
+            validate_text(text, max_reading_grade=3.0)
+        assert "readability" in str(exc_info.value).lower()
+
+
+class TestValidateTextPatterns:
+    """Test pattern matching validation."""
+
+    def test_passes_when_contains_pattern(self) -> None:
+        """Test validation passes when required pattern is present."""
+        text = "Please contact support@example.com for assistance."
+        validate_text(text, must_contain=["support@example.com"])
+
+    def test_fails_when_missing_required_pattern(self) -> None:
+        """Test validation fails when required pattern is missing."""
+        text = "Please contact us for assistance."
+        with pytest.raises(AssertionError) as exc_info:
+            validate_text(text, must_contain=["@example.com"])
+        assert "contains" in str(exc_info.value).lower()
+
+    def test_passes_when_excludes_pattern(self) -> None:
+        """Test validation passes when forbidden pattern is absent."""
+        text = "The report is complete and reviewed."
+        validate_text(text, must_exclude=["TODO", "FIXME"])
+
+    def test_fails_when_contains_forbidden_pattern(self) -> None:
+        """Test validation fails when forbidden pattern is present."""
+        text = "The report is almost done. TODO: add conclusion."
+        with pytest.raises(AssertionError) as exc_info:
+            validate_text(text, must_exclude=["TODO"])
+        assert "excludes" in str(exc_info.value).lower()
+
+
+class TestValidateTextComparisonMetrics:
+    """Test comparison-based validation (BLEU, ROUGE)."""
+
+    def test_passes_with_high_bleu_score(self) -> None:
+        """Test validation passes when BLEU score meets threshold."""
+        reference = "The quick brown fox jumps over the lazy dog."
+        text = "The quick brown fox jumps over the lazy dog."
+        validate_text(text, reference=reference, min_bleu=0.9)
+
+    def test_fails_with_low_bleu_score(self) -> None:
+        """Test validation fails when BLEU score is below threshold."""
+        reference = "The quick brown fox jumps over the lazy dog."
+        text = "A slow red cat sleeps under the active mouse."
+        with pytest.raises(AssertionError) as exc_info:
+            validate_text(text, reference=reference, min_bleu=0.5)
+        assert "bleu" in str(exc_info.value).lower()
+
+    def test_passes_with_high_rouge_score(self) -> None:
+        """Test validation passes when ROUGE score meets threshold."""
+        reference = "Machine learning models require extensive training data."
+        text = "Machine learning models need extensive training data."
+        validate_text(text, reference=reference, min_rouge=0.5)
+
+    def test_fails_with_low_rouge_score(self) -> None:
+        """Test validation fails when ROUGE score is below threshold."""
+        reference = "The algorithm processes input data efficiently."
+        text = "Cats enjoy sleeping in sunny spots."
+        with pytest.raises(AssertionError) as exc_info:
+            validate_text(text, reference=reference, min_rouge=0.5)
+        assert "rouge" in str(exc_info.value).lower()
+
+
+class TestValidateTextErrorHandling:
+    """Test error handling and edge cases."""
+
+    def test_raises_value_error_when_no_criteria(self) -> None:
+        """Test that ValueError is raised when no validation criteria provided."""
+        with pytest.raises(ValueError, match="At least one validation criterion"):
+            validate_text("Some text")
+
+    def test_raises_value_error_when_bleu_without_reference(self) -> None:
+        """Test that ValueError is raised when BLEU requested without reference."""
+        with pytest.raises(ValueError, match="Reference text required"):
+            validate_text("Some text", min_bleu=0.5)
+
+    def test_raises_value_error_when_rouge_without_reference(self) -> None:
+        """Test that ValueError is raised when ROUGE requested without reference."""
+        with pytest.raises(ValueError, match="Reference text required"):
+            validate_text("Some text", min_rouge=0.5)
+
+    def test_raises_value_error_when_semantic_without_reference(self) -> None:
+        """Test that ValueError is raised for semantic without reference."""
+        with pytest.raises(ValueError, match="Reference text required"):
+            validate_text("Some text", min_semantic=0.5)
+
+
+class TestValidateTextMultipleCriteria:
+    """Test validation with multiple criteria combined."""
+
+    def test_passes_all_criteria(self) -> None:
+        """Test validation passes when all criteria are met."""
+        reference = "The quick brown fox jumps over the lazy dog."
+        text = "The quick brown fox jumps over the lazy dog."
+        validate_text(
+            text,
+            reference=reference,
+            min_bleu=0.9,
+            min_length=10,
+            max_length=100,
+        )
+
+    def test_fails_when_one_criterion_fails(self) -> None:
+        """Test validation fails when any criterion fails."""
+        reference = "The quick brown fox jumps over the lazy dog."
+        text = "The quick brown fox jumps over the lazy dog."
+        with pytest.raises(AssertionError):
+            validate_text(
+                text,
+                reference=reference,
+                min_bleu=0.9,
+                max_length=10,  # This will fail
+            )
+
+
+class TestValidateTextFailureMessage:
+    """Test failure message formatting."""
+
+    def test_failure_message_includes_text_preview(self) -> None:
+        """Test that failure message includes preview of the text."""
+        text = "Short text"
+        with pytest.raises(AssertionError) as exc_info:
+            validate_text(text, min_length=100)
+        assert "Short text" in str(exc_info.value)
+
+    def test_failure_message_truncates_long_text(self) -> None:
+        """Test that long text is truncated in failure message."""
+        text = "A" * 200
+        with pytest.raises(AssertionError) as exc_info:
+            validate_text(text, max_length=50)
+        message = str(exc_info.value)
+        assert "..." in message
+        assert "A" * 200 not in message
+
+    def test_failure_message_includes_check_details(self) -> None:
+        """Test that failure message includes check name and details."""
+        text = "Short"
+        with pytest.raises(AssertionError) as exc_info:
+            validate_text(text, min_length=100)
+        message = str(exc_info.value)
+        assert "Failed checks:" in message
+        assert "length" in message.lower()
+
+
+class TestValidateTextListReference:
+    """Test validation with list of reference texts."""
+
+    def test_bleu_with_multiple_references(self) -> None:
+        """Test BLEU validation accepts multiple reference texts."""
+        references = [
+            "The quick brown fox jumps over the lazy dog.",
+            "A fast brown fox leaps over a sleepy dog.",
+        ]
+        text = "The quick brown fox jumps over the lazy dog."
+        validate_text(text, reference=references, min_bleu=0.9)
+
+    def test_rouge_with_multiple_references(self) -> None:
+        """Test ROUGE validation accepts multiple reference texts."""
+        references = [
+            "Machine learning requires data.",
+            "ML models need training data.",
+        ]
+        text = "Machine learning models require training data."
+        validate_text(text, reference=references, min_rouge=0.3)
@@ -0,0 +1,88 @@
+"""Tests for the pytest plugin fixtures."""
+
+from veritext.core.types import ValidationContext
+from veritext.pytest_plugin.fixtures import ValidatorFactory
+from veritext.validators import bleu, length
+
+
+class TestValidatorFactory:
+    """Test the ValidatorFactory class."""
+
+    def test_creates_validator_from_checks(self) -> None:
+        """Test that factory creates a callable validator."""
+        factory = ValidatorFactory()
+        validate = factory(checks=[length(min_chars=5)])
+
+        result = validate("Hello, World!")
+        assert result.passed
+
+    def test_validator_uses_provided_reference(self) -> None:
+        """Test that factory passes reference to context."""
+        factory = ValidatorFactory()
+        reference = "The quick brown fox."
+        validate = factory(
+            checks=[bleu(min_score=0.5)],
+            reference=reference,
+        )
+
+        # Exact match should pass
+        result = validate("The quick brown fox.")
+        assert result.passed
+
+    def test_validator_returns_validation_result(self) -> None:
+        """Test that validator returns a ValidationResult."""
+        factory = ValidatorFactory()
+        validate = factory(checks=[length(min_chars=100)])
+
+        result = validate("Short")
+        assert not result.passed
+        assert len(result.checks) == 1
+        assert result.checks[0].name == "length"
+
+
+class TestTextValidatorFixture:
+    """Test the text_validator fixture."""
+
+    def test_fixture_returns_factory(self, text_validator: ValidatorFactory) -> None:
+        """Test that fixture provides a ValidatorFactory."""
+        assert isinstance(text_validator, ValidatorFactory)
+
+    def test_fixture_can_create_validators(
+        self,
+        text_validator: ValidatorFactory,
+    ) -> None:
+        """Test that fixture can be used to create validators."""
+        validate = text_validator(checks=[length(min_chars=5, max_chars=50)])
+
+        assert validate("Hello, World!").passed
+        assert not validate("Hi").passed
+
+
+class TestValidationContextFixture:
+    """Test the validation_context fixture."""
+
+    def test_fixture_creates_context(
+        self,
+        validation_context: type,
+    ) -> None:
+        """Test that fixture creates ValidationContext."""
+        ctx = validation_context(reference="Test reference")
+        assert isinstance(ctx, ValidationContext)
+        assert ctx.reference == "Test reference"
+
+    def test_fixture_accepts_metadata(
+        self,
+        validation_context: type,
+    ) -> None:
+        """Test that fixture passes metadata to context."""
+        ctx = validation_context(reference="Test", source="unit_test", version=1)
+        assert ctx.metadata["source"] == "unit_test"
+        assert ctx.metadata["version"] == 1
+
+    def test_fixture_allows_no_reference(
+        self,
+        validation_context: type,
+    ) -> None:
+        """Test that fixture allows creating context without reference."""
+        ctx = validation_context()
+        assert ctx.reference is None
@@ -0,0 +1,100 @@
+"""Tests for the pytest plugin hooks."""
+
+import pytest
+
+
+@pytest.fixture
+def plugin_pytester(pytester: pytest.Pytester) -> pytest.Pytester:
+    """Configure pytester to use the veritext plugin."""
+    pytester.makeconftest(
+        """
+        pytest_plugins = ['veritext.pytest_plugin']
+        """
+    )
+    return pytester
+
+
+def test_plugin_registers_marker(plugin_pytester: pytest.Pytester) -> None:
+    """Test that the text_validation marker is registered."""
+    plugin_pytester.makepyfile(
+        """
+        import pytest
+
+        @pytest.mark.text_validation
+        def test_example():
+            pass
+        """
+    )
+    # Run with strict markers - this will fail if marker isn't registered
+    result = plugin_pytester.runpytest("--strict-markers")
+    result.assert_outcomes(passed=1)
+
+
+def test_marker_can_be_used(plugin_pytester: pytest.Pytester) -> None:
+    """Test that the text_validation marker can filter tests."""
+    plugin_pytester.makepyfile(
+        """
+        import pytest
+
+        @pytest.mark.text_validation
+        def test_marked():
+            pass
+
+        def test_unmarked():
+            pass
+        """
+    )
+    # Run only marked tests
+    result = plugin_pytester.runpytest("-m", "text_validation")
+    result.assert_outcomes(passed=1)
+
+
+def test_validate_text_is_importable(plugin_pytester: pytest.Pytester) -> None:
+    """Test that validate_text can be imported from the plugin."""
+    plugin_pytester.makepyfile(
+        """
+        from veritext.pytest_plugin import validate_text
+
+        def test_import():
+            assert callable(validate_text)
+        """
+    )
+    result = plugin_pytester.runpytest()
+    result.assert_outcomes(passed=1)
+
+
+def test_validate_text_works_in_tests(plugin_pytester: pytest.Pytester) -> None:
+    """Test that validate_text can be used in test functions."""
+    plugin_pytester.makepyfile(
+        """
+        from veritext.pytest_plugin import validate_text
+
+        def test_validation_passes():
+            validate_text(
+                "The quick brown fox jumps over the lazy dog.",
+                min_length=10,
+                max_length=100,
+            )
+        """
+    )
+    result = plugin_pytester.runpytest()
+    result.assert_outcomes(passed=1)
+
+
+def test_validate_text_failure_in_tests(plugin_pytester: pytest.Pytester) -> None:
+    """Test that validate_text failures are reported properly."""
+    plugin_pytester.makepyfile(
+        """
+        from veritext.pytest_plugin import validate_text
+
+        def test_validation_fails():
+            validate_text(
+                "Short",
+                min_length=100,
+            )
+        """
+    )
+    result = plugin_pytester.runpytest()
+    result.assert_outcomes(failed=1)
+    # Check that failure message contains useful information
+    result.stdout.fnmatch_lines(["*Text validation failed*"])
@@ -0,0 +1 @@
+"""Tests for semantic similarity module."""
@@ -0,0 +1,240 @@
+"""Tests for the semantic similarity metric."""
+
+import pytest
+
+# Skip all tests if sentence-transformers is not installed
+pytest.importorskip("sentence_transformers")
+
+from veritext.metrics.results import SemanticResult
+from veritext.semantic import SemanticSimilarity
+
+
+class TestSemanticSimilarity:
+    """Tests for the SemanticSimilarity metric class."""
+
+    @pytest.fixture
+    def semantic(self) -> SemanticSimilarity:
+        """Provide a SemanticSimilarity metric instance."""
+        return SemanticSimilarity()
+
+    def test_name(self, semantic: SemanticSimilarity) -> None:
+        """Test that name returns 'semantic'."""
+        assert semantic.name == "semantic"
+
+    def test_requires_reference(self, semantic: SemanticSimilarity) -> None:
+        """Test that semantic similarity requires reference text."""
+        assert semantic.requires_reference is True
+
+    def test_identical_texts(self, semantic: SemanticSimilarity) -> None:
+        """Test that identical texts produce high similarity."""
+        text = "The cat sat on the mat"
+        result = semantic.score(text, text)
+
+        # Identical texts should have very high similarity (close to 1.0)
+        assert result.similarity >= 0.99
+        assert result.model == "all-MiniLM-L6-v2"
+
+    def test_semantically_similar_texts(self, semantic: SemanticSimilarity) -> None:
+        """Test that semantically similar texts have high similarity."""
+        candidate = "The cat sat on the mat"
+        reference = "A feline rested on the rug"
+        result = semantic.score(candidate, reference)
+
+        # Similar meanings should have reasonable similarity
+        assert result.similarity > 0.3
+
+    def test_unrelated_texts(self, semantic: SemanticSimilarity) -> None:
+        """Test that unrelated texts have low similarity."""
+        candidate = "The quick brown fox"
+        reference = "Quantum physics describes particle behaviour"
+        result = semantic.score(candidate, reference)
+
+        # Unrelated texts should have low similarity
+        assert result.similarity < 0.5
+
+    def test_empty_candidate(self, semantic: SemanticSimilarity) -> None:
+        """Test that empty candidate returns zero similarity."""
+        result = semantic.score("", "The cat sat on the mat")
+        assert result.similarity == 0.0
+
+    def test_whitespace_only_candidate(self, semantic: SemanticSimilarity) -> None:
+        """Test that whitespace-only candidate returns zero similarity."""
+        result = semantic.score("   \t\n  ", "The cat sat on the mat")
+        assert result.similarity == 0.0
+
+    def test_none_reference_raises(self, semantic: SemanticSimilarity) -> None:
+        """Test that None reference raises ValueError."""
+        with pytest.raises(ValueError, match="requires reference"):
+            semantic.score("The cat sat", None)
+
+    def test_empty_reference_raises(self, semantic: SemanticSimilarity) -> None:
+        """Test that empty reference raises ValueError."""
+        with pytest.raises(ValueError, match="cannot be empty"):
+            semantic.score("The cat sat", "")
+
+    def test_whitespace_reference_raises(self, semantic: SemanticSimilarity) -> None:
+        """Test that whitespace-only reference raises ValueError."""
+        with pytest.raises(ValueError, match="cannot be empty"):
+            semantic.score("The cat sat", "   \t\n  ")
+
+    def test_multiple_references(self, semantic: SemanticSimilarity) -> None:
+        """Test that an exact match among multiple references scores high."""
+        candidate = "The cat sat on the mat"
+        references = [
+            "A dog ran through the park",
+            "The cat sat on the mat",  # Exact match
+        ]
+        result = semantic.score(candidate, references)
+
+        # Should get high similarity due to exact match reference
+        assert result.similarity >= 0.99
+
+    def test_multiple_references_takes_max(self, semantic: SemanticSimilarity) -> None:
+        """Test that scoring against multiple references returns the maximum."""
+        candidate = "The cat sat on the mat"
+        references = [
+            "Quantum physics is complex",  # Low similarity
+            "A feline rested on the rug",  # Higher similarity
+        ]
+        result = semantic.score(candidate, references)
+
+        # Should use the higher similarity
+        assert result.similarity > 0.3
+
+    def test_result_score_property(self, semantic: SemanticSimilarity) -> None:
+        """Test that result.score returns similarity."""
+        result = semantic.score("The cat sat", "The cat sat")
+        assert result.score == result.similarity
+
+    def test_caching_behaviour(self) -> None:
+        """Test that caching works for repeated texts."""
+        semantic = SemanticSimilarity(cache_embeddings=True)
+
+        # Score same texts multiple times
+        text = "The cat sat on the mat"
+        result1 = semantic.score(text, text)
+        result2 = semantic.score(text, text)
+
+        # Results should be identical
+        assert result1.similarity == result2.similarity
+
+        # Clear cache and check again
+        semantic.clear_cache()
+        result3 = semantic.score(text, text)
+        assert result3.similarity == result1.similarity
+
+    def test_caching_disabled(self) -> None:
+        """Test that caching can be disabled."""
+        semantic = SemanticSimilarity(cache_embeddings=False)
+
+        text = "The cat sat on the mat"
+        result1 = semantic.score(text, text)
+        result2 = semantic.score(text, text)
+
+        # Results should still be identical (just not cached)
+        assert result1.similarity == result2.similarity
+
+        # clear_cache() should not raise even when caching is disabled
+        semantic.clear_cache()
+
+    def test_custom_model(self) -> None:
+        """Test that custom model name is recorded in result."""
+        # Use the same model but verify it's recorded correctly
+        semantic = SemanticSimilarity(model="all-MiniLM-L6-v2")
+        result = semantic.score("Test text", "Test text")
+        assert result.model == "all-MiniLM-L6-v2"
+
+
+class TestSemanticSimilarityBatch:
+    """Tests for semantic similarity batch scoring."""
+
+    @pytest.fixture
+    def semantic(self) -> SemanticSimilarity:
+        """Provide a SemanticSimilarity metric instance."""
+        return SemanticSimilarity()
+
+    def test_batch_score_basic(self, semantic: SemanticSimilarity) -> None:
+        """Test basic batch scoring."""
+        candidates = ["The cat sat on the mat", "A quick brown dog runs fast"]
+        references = ["The cat sat on the mat", "A quick brown dog runs fast"]
+        result = semantic.batch_score(candidates, references)
+
+        assert result.count == 2
+        assert len(result.results) == 2
+        # Identical texts should have very high similarity
+        assert all(r.similarity >= 0.99 for r in result.results)
+
+    def test_batch_score_statistics(self, semantic: SemanticSimilarity) -> None:
+        """Test that batch scoring computes statistics."""
+        candidates = ["The cat sat", "Quantum physics is complex"]
+        references = ["The cat sat", "The cat sat"]
+        result = semantic.batch_score(candidates, references)
+
+        # Check statistics are computed
+        assert "similarity" in result.stats
+
+        # Mean should be between min and max
+        stats = result.stats["similarity"]
+        assert stats.min <= stats.mean <= stats.max
+
+    def test_batch_score_percentiles(self, semantic: SemanticSimilarity) -> None:
+        """Test that batch scoring computes percentiles."""
+        candidates = ["a", "b", "c", "d", "e"]
+        references = ["a", "b", "c", "d", "e"]
+        result = semantic.batch_score(candidates, references)
+
+        stats = result.stats["similarity"]
+        assert 25 in stats.percentiles
+        assert 50 in stats.percentiles
+        assert 75 in stats.percentiles
+        assert 95 in stats.percentiles
+
+    def test_batch_score_none_references_raises(
+        self, semantic: SemanticSimilarity
+    ) -> None:
+        """Test that batch scoring raises for None references."""
+        with pytest.raises(ValueError, match="requires reference"):
+            semantic.batch_score(["text"], None)
+
+    def test_batch_score_length_mismatch_raises(
+        self, semantic: SemanticSimilarity
+    ) -> None:
+        """Test that batch scoring raises for mismatched lengths."""
+        with pytest.raises(ValueError, match="must match"):
+            semantic.batch_score(["a", "b"], ["a"])
+
+    def test_batch_score_with_multiple_references(
+        self, semantic: SemanticSimilarity
+    ) -> None:
+        """Test batch scoring with multiple references per candidate."""
+        candidates = [
+            "The cat sat on the mat",
+            "A quick brown dog runs fast",
+        ]
+        references = [
+            ["The cat sat on the mat", "A cat rests on floor"],
+            ["A quick brown dog runs fast", "Dogs run very quickly"],
+        ]
+        result = semantic.batch_score(candidates, references)
+
+        assert result.count == 2
+        # Both candidates have an exact-match reference
+        assert result.results[0].similarity >= 0.99
+        assert result.results[1].similarity >= 0.99
+
+
+class TestSemanticResult:
+    """Tests for SemanticResult type."""
+
+    def test_frozen(self) -> None:
+        """Test that SemanticResult is frozen."""
+        from pydantic import ValidationError
+
+        result = SemanticResult(similarity=0.85, model="test-model")
+        with pytest.raises(ValidationError):
+            result.similarity = 0.9  # type: ignore[misc]
+
+    def test_score_property(self) -> None:
+        """Test that score property returns similarity."""
+        result = SemanticResult(similarity=0.75, model="test-model")
+        assert result.score == 0.75
@@ -0,0 +1 @@
+"""Tests for the validators module."""
@@ -0,0 +1,198 @@
+"""Tests for composite validators."""
+
+import pytest
+
+from veritext.core.types import ValidationContext
+from veritext.validators import all_of, any_of, bleu, contains, excludes, length
+from veritext.validators.composite import AllOf, AnyOf
+
+
+class TestAllOf:
+    """Tests for AllOf composite validator."""
+
+    def test_all_of_passes_when_all_checks_pass(self) -> None:
+        """Test that AllOf passes when all checks pass."""
+        validator = AllOf(
+            checks=[
+                length(min_words=2),
+                contains(patterns=["hello"]),
+            ]
+        )
+        context = ValidationContext()
+        result = validator.check("hello world", context)
+
+        assert result.passed is True
+        assert len(result.checks) == 2
+        assert all(c.passed for c in result.checks)
+
+    def test_all_of_fails_when_one_check_fails(self) -> None:
+        """Test that AllOf fails when any check fails."""
+        validator = AllOf(
+            checks=[
+                length(min_words=2),
+                contains(patterns=["goodbye"]),
+            ]
+        )
+        context = ValidationContext()
+        result = validator.check("hello world", context)
+
+        assert result.passed is False
+        assert len(result.checks) == 2
+        assert len(result.failed_checks) == 1
+
+    def test_all_of_fails_when_all_checks_fail(self) -> None:
+        """Test that AllOf fails when all checks fail."""
+        validator = AllOf(
+            checks=[
+                length(min_words=10),
+                contains(patterns=["goodbye"]),
+            ]
+        )
+        context = ValidationContext()
+        result = validator.check("hello", context)
+
+        assert result.passed is False
+        assert len(result.failed_checks) == 2
+
+    def test_all_of_with_metric_validators(self) -> None:
+        """Test AllOf with metric-based validators."""
+        validator = AllOf(
+            checks=[
+                bleu(min_score=0.5),
+                length(min_words=3),
+            ]
+        )
+        context = ValidationContext(reference="the quick brown fox")
+        result = validator.check("the quick brown fox jumps", context)
+
+        assert result.passed is True
+        assert len(result.checks) == 2
+
+    def test_all_of_failure_summary(self) -> None:
+        """Test the failure summary property."""
+        validator = AllOf(
+            checks=[
+                length(min_words=10),
+                contains(patterns=["goodbye"]),
+            ]
+        )
+        context = ValidationContext()
+        result = validator.check("hello", context)
+
+        summary = result.failure_summary
+        assert "failed" in summary.lower()
+        assert "length" in summary
+        assert "contains" in summary
+
+    def test_all_of_raises_on_empty_checks(self) -> None:
+        """Test that empty checks list raises error."""
+        with pytest.raises(ValueError, match="cannot be empty"):
+            AllOf(checks=[])
+
+    def test_all_of_name_property(self) -> None:
+        """Test the name property."""
+        validator = AllOf(checks=[length(min_chars=1)])
+        assert validator.name == "all_of"
+
+    def test_all_of_factory_function(self) -> None:
+        """Test the all_of() factory function."""
+        validator = all_of(checks=[length(min_chars=1)])
+        assert isinstance(validator, AllOf)
+
+
+class TestAnyOf:
+    """Tests for AnyOf composite validator."""
+
+    def test_any_of_passes_when_any_check_passes(self) -> None:
+        """Test that AnyOf passes when any check passes."""
+        validator = AnyOf(
+            checks=[
+                length(min_words=10),  # Will fail
+                contains(patterns=["hello"]),  # Will pass
+            ]
+        )
+        context = ValidationContext()
+        result = validator.check("hello world", context)
+
+        assert result.passed is True
+        assert len(result.checks) == 2
+        # At least one check passed
+        assert any(c.passed for c in result.checks)
+
+    def test_any_of_passes_when_all_checks_pass(self) -> None:
+        """Test that AnyOf passes when all checks pass."""
+        validator = AnyOf(
+            checks=[
+                length(min_words=2),
+                contains(patterns=["hello"]),
+            ]
+        )
+        context = ValidationContext()
+        result = validator.check("hello world", context)
+
+        assert result.passed is True
+        assert all(c.passed for c in result.checks)
+
+    def test_any_of_fails_when_all_checks_fail(self) -> None:
+        """Test that AnyOf fails when all checks fail."""
+        validator = AnyOf(
+            checks=[
+                length(min_words=10),
+                contains(patterns=["goodbye"]),
+            ]
+        )
+        context = ValidationContext()
+        result = validator.check("hello", context)
+
+        assert result.passed is False
+        assert not any(c.passed for c in result.checks)
+
+    def test_any_of_with_metric_validators(self) -> None:
+        """Test AnyOf with metric-based validators."""
+        validator = AnyOf(
+            checks=[
+                bleu(min_score=0.9),  # Fails: no words shared with reference
+                length(min_words=3),  # Passes: text has five words
+            ]
+        )
+        context = ValidationContext(reference="different text entirely")
+        result = validator.check("the quick brown fox jumps", context)
+
+        assert result.passed is True  # Length check passes
+
+    def test_any_of_with_excludes(self) -> None:
+        """Test AnyOf with excludes validator."""
+        validator = AnyOf(
+            checks=[
+                excludes(patterns=["error"]),
+                excludes(patterns=["warning"]),
+            ]
+        )
+        context = ValidationContext()
+
+        # Should pass - neither pattern found
+        result = validator.check("All is well", context)
+        assert result.passed is True
+
+        # Should pass - "error" is found, but "warning" is absent
+        result = validator.check("This is an error", context)
+        assert result.passed is True
+
+        # Should fail - both patterns found
+        result = validator.check("error and warning", context)
+        assert result.passed is False
+
+    def test_any_of_raises_on_empty_checks(self) -> None:
+        """Test that empty checks list raises error."""
+        with pytest.raises(ValueError, match="cannot be empty"):
+            AnyOf(checks=[])
+
+    def test_any_of_name_property(self) -> None:
+        """Test the name property."""
+        validator = AnyOf(checks=[length(min_chars=1)])
+        assert validator.name == "any_of"
+
+    def test_any_of_factory_function(self) -> None:
+        """Test the any_of() factory function."""
+        validator = any_of(checks=[length(min_chars=1)])
+        assert isinstance(validator, AnyOf)
@@ -0,0 +1,334 @@
+"""Tests for constraint validators."""
+
+import pytest
+
+from veritext.core.exceptions import InvalidThresholdError
+from veritext.core.types import ValidationContext
+from veritext.validators import contains, excludes, length, readability
+from veritext.validators.constraint import (
+    ContainsValidator,
+    ExcludesValidator,
+    LengthValidator,
+    ReadabilityValidator,
+)
+
+
+class TestLengthValidator:
+    """Tests for LengthValidator."""
+
+    def test_length_validator_min_chars_passes(self) -> None:
+        """Test that validator passes when char count meets minimum."""
+        validator = LengthValidator(min_chars=10)
+        context = ValidationContext()
+        result = validator.check("hello world!", context)
+
+        assert result.passed is True
+        assert result.name == "length"
+        assert result.actual["chars"] == 12
+
+    def test_length_validator_min_chars_fails(self) -> None:
+        """Test that validator fails when char count is below minimum."""
+        validator = LengthValidator(min_chars=20)
+        context = ValidationContext()
+        result = validator.check("hello", context)
+
+        assert result.passed is False
+        assert "< min" in result.message
+
+    def test_length_validator_max_chars_passes(self) -> None:
+        """Test that validator passes when char count within maximum."""
+        validator = LengthValidator(max_chars=20)
+        context = ValidationContext()
+        result = validator.check("hello world", context)
+
+        assert result.passed is True
+        assert result.actual["chars"] == 11
+
+    def test_length_validator_max_chars_fails(self) -> None:
+        """Test that validator fails when char count exceeds maximum."""
+        validator = LengthValidator(max_chars=5)
+        context = ValidationContext()
+        result = validator.check("hello world", context)
+
+        assert result.passed is False
+        assert "> max" in result.message
+
+    def test_length_validator_min_words_passes(self) -> None:
+        """Test that validator passes when word count meets minimum."""
+        validator = LengthValidator(min_words=3)
+        context = ValidationContext()
+        result = validator.check("the quick brown fox", context)
+
+        assert result.passed is True
+        assert result.actual["words"] == 4
+
+    def test_length_validator_min_words_fails(self) -> None:
+        """Test that validator fails when word count is below minimum."""
+        validator = LengthValidator(min_words=10)
+        context = ValidationContext()
+        result = validator.check("hello world", context)
+
+        assert result.passed is False
+        assert "words < min" in result.message
+
+    def test_length_validator_max_words_passes(self) -> None:
+        """Test that validator passes when word count within maximum."""
+        validator = LengthValidator(max_words=5)
+        context = ValidationContext()
+        result = validator.check("hello world", context)
+
+        assert result.passed is True
+
+    def test_length_validator_max_words_fails(self) -> None:
+        """Test that validator fails when word count exceeds maximum."""
+        validator = LengthValidator(max_words=2)
+        context = ValidationContext()
+        result = validator.check("the quick brown fox", context)
+
+        assert result.passed is False
+        assert "words > max" in result.message
+
+    def test_length_validator_combined_constraints(self) -> None:
+        """Test validator with multiple constraints."""
+        validator = LengthValidator(
+            min_chars=5, max_chars=50, min_words=2, max_words=10
+        )
+        context = ValidationContext()
+        result = validator.check("the quick brown fox", context)
+
+        assert result.passed is True
+        assert "min_chars" in result.threshold
+        assert "max_chars" in result.threshold
+        assert "min_words" in result.threshold
+        assert "max_words" in result.threshold
+
+    def test_length_validator_raises_when_no_constraints(self) -> None:
+        """Test that validator raises when no constraints provided."""
+        with pytest.raises(InvalidThresholdError, match="At least one"):
+            LengthValidator()
+
+    def test_length_validator_raises_on_negative_values(self) -> None:
+        """Test that negative constraint values raise error."""
+        with pytest.raises(InvalidThresholdError, match="min_chars must be >= 0"):
+            LengthValidator(min_chars=-1)
+
+        with pytest.raises(InvalidThresholdError, match="max_chars must be >= 0"):
+            LengthValidator(max_chars=-1)
+
+        with pytest.raises(InvalidThresholdError, match="min_words must be >= 0"):
+            LengthValidator(min_words=-1)
+
+        with pytest.raises(InvalidThresholdError, match="max_words must be >= 0"):
+            LengthValidator(max_words=-1)
+
+    def test_length_validator_raises_on_invalid_range(self) -> None:
+        """Test that min > max raises error."""
+        with pytest.raises(InvalidThresholdError, match="cannot exceed max_chars"):
+            LengthValidator(min_chars=100, max_chars=50)
+
+        with pytest.raises(InvalidThresholdError, match="cannot exceed max_words"):
+            LengthValidator(min_words=20, max_words=5)
+
+    def test_length_factory_function(self) -> None:
+        """Test the length() factory function."""
+        validator = length(min_chars=10, max_words=100)
+        assert isinstance(validator, LengthValidator)
+        assert validator.name == "length"
+
+
+class TestReadabilityValidator:
+    """Tests for ReadabilityValidator."""
+
+    def test_readability_validator_max_grade_passes(self) -> None:
+        """Test that validator passes when grade level within maximum."""
+        validator = ReadabilityValidator(max_grade=12.0)
+        context = ValidationContext()
+        # Simple text should have low grade level
+        result = validator.check("The cat sat on the mat. It was a nice day.", context)
+
+        assert result.passed is True
+        assert result.name == "readability"
+        assert "grade" in result.actual
+
+    def test_readability_validator_max_grade_fails(self) -> None:
+        """Test that validator fails when grade level exceeds maximum."""
+        validator = ReadabilityValidator(max_grade=1.0)
+        context = ValidationContext()
+        # Complex text
+        result = validator.check(
+            "The implementation of sophisticated methodologies necessitates "
+            "comprehensive analytical frameworks for systematic evaluation.",
+            context,
+        )
+
+        assert result.passed is False
+        assert "grade level" in result.message
+        assert "> max" in result.message
+
+    def test_readability_validator_min_ease_passes(self) -> None:
+        """Test that validator passes when reading ease meets minimum."""
+        validator = ReadabilityValidator(min_ease=30.0)
+        context = ValidationContext()
+        # Simple text should have high reading ease
+        result = validator.check("The cat sat. The dog ran. It was fun.", context)
+
+        assert result.passed is True
+        assert "ease" in result.actual
+
+    def test_readability_validator_min_ease_fails(self) -> None:
+        """Test that validator fails when reading ease is below minimum."""
+        validator = ReadabilityValidator(min_ease=100.0)
+        context = ValidationContext()
+        result = validator.check(
+            "The implementation of sophisticated methodologies necessitates "
+            "comprehensive analytical frameworks.",
+            context,
+        )
+
+        assert result.passed is False
+        assert "reading ease" in result.message
+        assert "< min" in result.message
+
+    def test_readability_validator_combined_constraints(self) -> None:
+        """Test validator with both grade and ease constraints."""
+        validator = ReadabilityValidator(max_grade=12.0, min_ease=30.0)
+        context = ValidationContext()
+        result = validator.check("The cat sat on the mat.", context)
+
+        assert "max_grade" in result.threshold
+        assert "min_ease" in result.threshold
+
+    def test_readability_validator_raises_when_no_constraints(self) -> None:
+        """Test that validator raises when no constraints provided."""
+        with pytest.raises(InvalidThresholdError, match="At least one"):
+            ReadabilityValidator()
+
+    def test_readability_factory_function(self) -> None:
+        """Test the readability() factory function."""
+        validator = readability(max_grade=8.0, min_ease=60.0)
+        assert isinstance(validator, ReadabilityValidator)
+        assert validator.name == "readability"
+
+
+class TestContainsValidator:
+    """Tests for ContainsValidator."""
+
+    def test_contains_validator_passes_when_pattern_found(self) -> None:
+        """Test that validator passes when all patterns are found."""
+        validator = ContainsValidator(patterns=["hello", "world"])
+        context = ValidationContext()
+        result = validator.check("Hello World!", context)
+
+        assert result.passed is True
+        assert result.name == "contains"
+        assert result.actual["found"] == 2
+        assert result.actual["missing"] == []
+
+    def test_contains_validator_fails_when_pattern_missing(self) -> None:
+        """Test that validator fails when a pattern is missing."""
+        validator = ContainsValidator(patterns=["hello", "goodbye"])
+        context = ValidationContext()
+        result = validator.check("Hello World!", context)
+
+        assert result.passed is False
+        assert "goodbye" in result.actual["missing"]
+        assert "missing" in result.message
+
+    def test_contains_validator_case_insensitive_by_default(self) -> None:
+        """Test that matching is case-insensitive by default."""
+        validator = ContainsValidator(patterns=["HELLO"])
+        context = ValidationContext()
+        result = validator.check("hello world", context)
+
+        assert result.passed is True
+
+    def test_contains_validator_case_sensitive(self) -> None:
+        """Test case-sensitive matching."""
+        validator = ContainsValidator(patterns=["HELLO"], case_sensitive=True)
+        context = ValidationContext()
+        result = validator.check("hello world", context)
+
+        assert result.passed is False
+
+    def test_contains_validator_regex_patterns(self) -> None:
+        """Test regex pattern matching."""
+        validator = ContainsValidator(patterns=[r"\d{3}-\d{4}"])
+        context = ValidationContext()
+        result = validator.check("Call 555-1234 for info", context)
+
+        assert result.passed is True
+
+    def test_contains_validator_raises_on_empty_patterns(self) -> None:
+        """Test that empty patterns list raises error."""
+        with pytest.raises(InvalidThresholdError, match="cannot be empty"):
+            ContainsValidator(patterns=[])
+
+    def test_contains_factory_function(self) -> None:
+        """Test the contains() factory function."""
+        validator = contains(patterns=["test"], case_sensitive=True)
+        assert isinstance(validator, ContainsValidator)
+        assert validator.name == "contains"
+
+
+class TestExcludesValidator:
+    """Tests for ExcludesValidator."""
+
+    def test_excludes_validator_passes_when_pattern_absent(self) -> None:
+        """Test that validator passes when all patterns are absent."""
+        validator = ExcludesValidator(patterns=["bad", "forbidden"])
+        context = ValidationContext()
+        result = validator.check("This is good text.", context)
+
+        assert result.passed is True
+        assert result.name == "excludes"
+        assert result.actual["found"] == []
+
+    def test_excludes_validator_fails_when_pattern_found(self) -> None:
+        """Test that validator fails when a forbidden pattern is found."""
+        validator = ExcludesValidator(patterns=["bad", "forbidden"])
+        context = ValidationContext()
+        result = validator.check("This is bad text.", context)
+
+        assert result.passed is False
+        assert "bad" in result.actual["found"]
+        assert "forbidden" in result.message
+
+    def test_excludes_validator_case_insensitive_by_default(self) -> None:
+        """Test that matching is case-insensitive by default."""
+        validator = ExcludesValidator(patterns=["BAD"])
+        context = ValidationContext()
+        result = validator.check("This is bad text.", context)
+
+        assert result.passed is False
+
+    def test_excludes_validator_case_sensitive(self) -> None:
+        """Test case-sensitive matching."""
+        validator = ExcludesValidator(patterns=["BAD"], case_sensitive=True)
+        context = ValidationContext()
+        result = validator.check("This is bad text.", context)
+
+        assert result.passed is True
+
+    def test_excludes_validator_regex_patterns(self) -> None:
+        """Test regex pattern matching."""
+        validator = ExcludesValidator(patterns=[r"\b\d{4}\b"])  # 4-digit numbers
+        context = ValidationContext()
+
+        # Should fail when pattern found
+        result = validator.check("PIN is 1234", context)
+        assert result.passed is False
+
+        # Should pass when pattern absent
+        result = validator.check("No numbers here", context)
+        assert result.passed is True
+
+    def test_excludes_validator_raises_on_empty_patterns(self) -> None:
+        """Test that empty patterns list raises error."""
+        with pytest.raises(InvalidThresholdError, match="cannot be empty"):
+            ExcludesValidator(patterns=[])
+
+    def test_excludes_factory_function(self) -> None:
+        """Test the excludes() factory function."""
+        validator = excludes(patterns=["test"], case_sensitive=True)
+        assert isinstance(validator, ExcludesValidator)
+        assert validator.name == "excludes"
@@ -0,0 +1,283 @@
+"""Tests for metric-based validators."""
+
+import pytest
+
+from veritext.core.exceptions import InvalidThresholdError, ValidationError
+from veritext.core.types import ValidationContext
+from veritext.validators import bleu, lexical, rouge
+from veritext.validators.metric import BleuValidator, LexicalValidator, RougeValidator
+
+
+class TestBleuValidator:
+    """Tests for BleuValidator."""
+
+    def test_bleu_validator_passes_when_score_meets_threshold(self) -> None:
+        """Test that validator passes when BLEU score meets threshold."""
+        validator = BleuValidator(min_score=0.5, variant=4)
+        context = ValidationContext(reference="the cat sat on the mat")
+        result = validator.check("the cat sat on the mat", context)
+
+        assert result.passed is True
+        assert result.name == "bleu-4"
+        assert result.actual == 1.0  # Identical text
+        assert result.threshold == 0.5
+
+    def test_bleu_validator_fails_when_score_below_threshold(self) -> None:
+        """Test that validator fails when BLEU score is below threshold."""
+        validator = BleuValidator(min_score=0.9, variant=4)
+        context = ValidationContext(reference="the cat sat on the mat")
+        result = validator.check("a dog ran through the park", context)
+
+        assert result.passed is False
+        assert result.name == "bleu-4"
+        assert result.actual < 0.9
+        assert "below minimum" in result.message
+
+    def test_bleu_validator_variant_selection(self) -> None:
+        """Test different BLEU variants."""
+        context = ValidationContext(reference="the quick brown fox jumps")
+
+        for variant in (1, 2, 3, 4):
+            validator = BleuValidator(min_score=0.0, variant=variant)  # type: ignore[arg-type]
+            result = validator.check("the quick brown fox", context)
+            assert result.name == f"bleu-{variant}"
+
+    def test_bleu_validator_raises_on_missing_reference(self) -> None:
+        """Test that validator raises when reference is missing."""
+        validator = BleuValidator(min_score=0.5)
+        context = ValidationContext()
+
+        with pytest.raises(ValidationError, match="requires reference text"):
+            validator.check("some text", context)
+
+    def test_bleu_validator_raises_on_invalid_min_score(self) -> None:
+        """Test that invalid min_score raises error."""
+        with pytest.raises(InvalidThresholdError, match=r"between 0\.0 and 1\.0"):
+            BleuValidator(min_score=1.5)
+
+        with pytest.raises(InvalidThresholdError, match=r"between 0\.0 and 1\.0"):
+            BleuValidator(min_score=-0.1)
+
+    def test_bleu_validator_raises_on_invalid_variant(self) -> None:
+        """Test that invalid variant raises error."""
+        with pytest.raises(InvalidThresholdError, match="variant must be"):
+            BleuValidator(min_score=0.5, variant=5)  # type: ignore[arg-type]
+
+    def test_bleu_factory_function(self) -> None:
+        """Test the bleu() factory function."""
+        validator = bleu(min_score=0.6, variant=2)
+        assert isinstance(validator, BleuValidator)
+        assert validator.name == "bleu-2"
+
+
+class TestRougeValidator:
+    """Tests for RougeValidator."""
+
+    def test_rouge_validator_passes_when_score_meets_threshold(self) -> None:
+        """Test that validator passes when ROUGE score meets threshold."""
+        validator = RougeValidator(min_score=0.5, variant="l")
+        context = ValidationContext(reference="the cat sat on the mat")
+        result = validator.check("the cat sat on the mat", context)
+
+        assert result.passed is True
+        assert result.name == "rouge-l"
+        assert result.actual == 1.0  # Identical text
+        assert result.threshold == 0.5
+
+    def test_rouge_validator_fails_when_score_below_threshold(self) -> None:
+        """Test that validator fails when ROUGE score is below threshold."""
+        validator = RougeValidator(min_score=0.9, variant="l")
+        context = ValidationContext(reference="the cat sat on the mat")
+        result = validator.check("a dog ran through the park", context)
+
+        assert result.passed is False
+        assert result.actual < 0.9
+        assert "below minimum" in result.message
+
+    def test_rouge_validator_variant_selection(self) -> None:
+        """Test different ROUGE variants."""
+        context = ValidationContext(reference="the quick brown fox jumps")
+
+        for variant in ("1", "2", "l"):
+            validator = RougeValidator(min_score=0.0, variant=variant)  # type: ignore[arg-type]
+            result = validator.check("the quick brown fox", context)
+            assert result.name == f"rouge-{variant}"
+
+    def test_rouge_validator_raises_on_missing_reference(self) -> None:
+        """Test that validator raises when reference is missing."""
+        validator = RougeValidator(min_score=0.5)
+        context = ValidationContext()
+
+        with pytest.raises(ValidationError, match="requires reference text"):
+            validator.check("some text", context)
+
+    def test_rouge_validator_raises_on_invalid_min_score(self) -> None:
+        """Test that invalid min_score raises error."""
+        with pytest.raises(InvalidThresholdError, match=r"between 0\.0 and 1\.0"):
+            RougeValidator(min_score=1.5)
+
+    def test_rouge_validator_raises_on_invalid_variant(self) -> None:
+        """Test that invalid variant raises error."""
+        with pytest.raises(InvalidThresholdError, match="variant must be"):
+            RougeValidator(min_score=0.5, variant="3")  # type: ignore[arg-type]
+
+    def test_rouge_factory_function(self) -> None:
+        """Test the rouge() factory function."""
+        validator = rouge(min_score=0.6, variant="2")
+        assert isinstance(validator, RougeValidator)
+        assert validator.name == "rouge-2"
+
+
+class TestLexicalValidator:
+    """Tests for LexicalValidator."""
+
+    def test_lexical_validator_passes_on_jaccard(self) -> None:
+        """Test that validator passes when Jaccard similarity meets threshold."""
+        validator = LexicalValidator(min_jaccard=0.5)
+        context = ValidationContext(reference="the cat sat on the mat")
+        result = validator.check("the cat sat on the mat", context)
+
+        assert result.passed is True
+        assert result.name == "lexical"
+        assert result.actual["jaccard"] == 1.0
+
+    def test_lexical_validator_fails_on_jaccard(self) -> None:
+        """Test that validator fails when Jaccard is below threshold."""
+        validator = LexicalValidator(min_jaccard=0.9)
+        context = ValidationContext(reference="the cat sat on the mat")
+        result = validator.check("a dog ran through the park", context)
+
+        assert result.passed is False
+        assert "Jaccard" in result.message
+        assert "below minimum" in result.message
+
+    def test_lexical_validator_passes_on_overlap(self) -> None:
+        """Test that validator passes when token overlap meets threshold."""
+        validator = LexicalValidator(min_overlap=0.5)
+        context = ValidationContext(reference="the cat sat on the mat")
+        result = validator.check("the cat sat on the mat", context)
+
+        assert result.passed is True
+        assert result.actual["token_overlap"] == 1.0
+
+    def test_lexical_validator_fails_on_overlap(self) -> None:
+        """Test that validator fails when overlap is below threshold."""
+        validator = LexicalValidator(min_overlap=0.9)
+        context = ValidationContext(reference="the cat sat on the mat")
+        result = validator.check("a dog ran through", context)
+
+        assert result.passed is False
+        assert "overlap" in result.message
+
+    def test_lexical_validator_with_both_thresholds(self) -> None:
+        """Test validator with both Jaccard and overlap thresholds."""
+        validator = LexicalValidator(min_jaccard=0.3, min_overlap=0.5)
+        context = ValidationContext(reference="the cat sat on the mat")
+        result = validator.check("the cat sat", context)
+
+        # Should check both thresholds
+        assert "min_jaccard" in result.threshold
+        assert "min_overlap" in result.threshold
+
+    def test_lexical_validator_raises_when_no_threshold(self) -> None:
+        """Test that validator raises when no threshold is provided."""
+        with pytest.raises(InvalidThresholdError, match="At least one"):
+            LexicalValidator()
+
+    def test_lexical_validator_raises_on_invalid_jaccard(self) -> None:
+        """Test that invalid Jaccard threshold raises error."""
+        with pytest.raises(InvalidThresholdError, match="min_jaccard"):
+            LexicalValidator(min_jaccard=1.5)
+
+    def test_lexical_validator_raises_on_invalid_overlap(self) -> None:
+        """Test that invalid overlap threshold raises error."""
+        with pytest.raises(InvalidThresholdError, match="min_overlap"):
+            LexicalValidator(min_overlap=-0.1)
+
+    def test_lexical_validator_raises_on_missing_reference(self) -> None:
+        """Test that validator raises when reference is missing."""
+        validator = LexicalValidator(min_jaccard=0.5)
+        context = ValidationContext()
+
+        with pytest.raises(ValidationError, match="requires reference text"):
+            validator.check("some text", context)
+
+    def test_lexical_factory_function(self) -> None:
+        """Test the lexical() factory function."""
+        validator = lexical(min_jaccard=0.5, min_overlap=0.6)
+        assert isinstance(validator, LexicalValidator)
+        assert validator.name == "lexical"
+
+
+# SemanticValidator tests - conditionally run if sentence-transformers is installed
+class TestSemanticValidator:
+    """Tests for SemanticValidator."""
+
+    @staticmethod
+    def _skip_if_no_transformers() -> None:
+        """Skip test if sentence-transformers is not installed."""
+        pytest.importorskip("sentence_transformers")
+
+    def test_semantic_validator_passes_when_score_meets_threshold(self) -> None:
+        """Test that validator passes when semantic similarity meets threshold."""
+        self._skip_if_no_transformers()
+        from veritext.validators.metric import SemanticValidator
+
+        validator = SemanticValidator(min_score=0.5)
+        context = ValidationContext(reference="the cat sat on the mat")
+        result = validator.check("the cat sat on the mat", context)
+
+        assert result.passed is True
+        assert result.name == "semantic"
+        assert result.actual >= 0.99  # Identical text
+        assert result.threshold == 0.5
+
+    def test_semantic_validator_fails_when_score_below_threshold(self) -> None:
+        """Test that validator fails when semantic similarity is below threshold."""
+        self._skip_if_no_transformers()
+        from veritext.validators.metric import SemanticValidator
+
+        validator = SemanticValidator(min_score=0.99)
+        context = ValidationContext(reference="the cat sat on the mat")
+        result = validator.check(
+            "quantum physics describes particle behaviour", context
+        )
+
+        assert result.passed is False
+        assert result.name == "semantic"
+        assert result.actual < 0.99
+        assert "below minimum" in result.message
+
+    def test_semantic_validator_raises_on_missing_reference(self) -> None:
+        """Test that validator raises when reference is missing."""
+        self._skip_if_no_transformers()
+        from veritext.validators.metric import SemanticValidator
+
+        validator = SemanticValidator(min_score=0.5)
+        context = ValidationContext()
+
+        with pytest.raises(ValidationError, match="requires reference text"):
+            validator.check("some text", context)
+
+    def test_semantic_validator_raises_on_invalid_min_score(self) -> None:
+        """Test that invalid min_score raises error without loading model."""
+        # No skip here: threshold validation happens first, so this test
+        # doesn't need sentence-transformers to be installed.
+        from veritext.validators.metric import SemanticValidator
+
+        with pytest.raises(InvalidThresholdError, match=r"between 0\.0 and 1\.0"):
+            SemanticValidator(min_score=1.5)
+
+        with pytest.raises(InvalidThresholdError, match=r"between 0\.0 and 1\.0"):
+            SemanticValidator(min_score=-0.1)
+
+    def test_semantic_factory_function(self) -> None:
+        """Test the semantic() factory function."""
+        self._skip_if_no_transformers()
+        from veritext.validators import semantic
+        from veritext.validators.metric import SemanticValidator
+
+        validator = semantic(min_score=0.6)
+        assert isinstance(validator, SemanticValidator)
+        assert validator.name == "semantic"
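
The tests in this file exercise one shared contract for metric-based validators: the threshold is range-checked at construction, a missing reference raises, and the result carries `name`, `passed`, `actual`, `threshold`, and a message containing "below minimum" on failure. A minimal sketch of that contract follows. Every class in it (`Context`, `Result`, `UnigramPrecisionValidator`, and the two exception types) is a hypothetical stand-in, not veritext's actual implementation, and the score is a toy unigram precision rather than BLEU or ROUGE.

```python
from dataclasses import dataclass
from typing import Optional


class InvalidThresholdError(ValueError):
    """Stand-in for the exception of the same name used in the tests."""


class ValidationError(Exception):
    """Stand-in raised when a check cannot run (e.g. no reference text)."""


@dataclass(frozen=True)
class Context:
    """Hypothetical stand-in for ValidationContext."""

    reference: Optional[str] = None


@dataclass(frozen=True)
class Result:
    """Hypothetical result shape, mirroring the fields the tests assert on."""

    name: str
    passed: bool
    actual: float
    threshold: float
    message: str


class UnigramPrecisionValidator:
    """Toy metric validator. Unigram precision is an assumption, NOT BLEU."""

    name = "unigram-precision"

    def __init__(self, min_score: float) -> None:
        # Reject bad thresholds up front, before any scoring work happens.
        if not 0.0 <= min_score <= 1.0:
            raise InvalidThresholdError("min_score must be between 0.0 and 1.0")
        self.min_score = min_score

    def check(self, text: str, context: Context) -> Result:
        if context.reference is None:
            raise ValidationError(f"{self.name} requires reference text")

        candidate = text.lower().split()
        reference = set(context.reference.lower().split())

        # Fraction of candidate tokens that also occur in the reference.
        hits = sum(1 for token in candidate if token in reference)
        actual = hits / len(candidate) if candidate else 0.0

        passed = actual >= self.min_score
        if passed:
            message = f"{self.name} {actual:.2f} meets minimum {self.min_score:.2f}"
        else:
            message = f"{self.name} {actual:.2f} is below minimum {self.min_score:.2f}"
        return Result(self.name, passed, actual, self.min_score, message)
```

Calling `UnigramPrecisionValidator(0.9).check("a dog ran through the park", Context(reference="the cat sat on the mat"))` yields a failing result, because only one of the six candidate tokens ("the") appears in the reference.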
Author	SHA1	Message	Date
kschappell	07ac70e835	docs(changelog): add benchmark entries Document benchmark module features in changelog.	2026-02-03 18:10:19 +00:00
kschappell	6d1bece815	test(benchmark): add benchmark module tests Comprehensive tests for models, storage, regression detection, and runner.	2026-02-03 18:10:13 +00:00
kschappell	40fa39485e	feat(benchmark): add module exports Public API exports for the benchmark module.	2026-02-03 18:10:07 +00:00
kschappell	9115f0c25b	feat(benchmark): add Benchmark runner class Main Benchmark class for evaluating text quality and tracking regressions.	2026-02-03 18:10:01 +00:00
kschappell	83c4b4bee5	feat(benchmark): add regression detection Rolling window baseline computation and statistical regression detection.	2026-02-03 18:09:55 +00:00
kschappell	44e3e8f4ea	feat(benchmark): add SQLite storage backend Persistent storage for benchmark history with WAL mode for concurrent access.	2026-02-03 18:09:49 +00:00
kschappell	45dfe07772	feat(benchmark): add BenchmarkRun and RegressionReport models Data models for benchmark runs and regression reports using Pydantic.	2026-02-03 18:09:43 +00:00
kschappell	6bafc43754	docs(changelog): add pytest plugin entries	2026-02-03 17:40:52 +00:00
kschappell	012b306749	test(pytest-plugin): add plugin tests Cover validate_text assertions, fixture factories, marker registration, and pytest integration using pytester for subprocess testing.	2026-02-03 17:40:46 +00:00
kschappell	ac7c5c69cf	feat(pytest-plugin): add validate_text assertion Primary API for text validation in pytest with keyword arguments for BLEU, ROUGE, semantic similarity, length, readability, and pattern matching. Includes detailed failure formatting.	2026-02-03 17:40:40 +00:00
kschappell	cd36c54e22	feat(pytest-plugin): add plugin hooks and markers Register text_validation marker via pytest_configure hook.	2026-02-03 17:40:33 +00:00
kschappell	107fc4e275	docs(changelog): add semantic similarity entries	2026-02-03 17:31:14 +00:00
kschappell	571b770281	test(semantic): add semantic similarity tests	2026-02-03 17:31:07 +00:00
kschappell	8b3536873e	feat(validators): add SemanticValidator	2026-02-03 17:31:01 +00:00
kschappell	9a4ac359a3	feat(semantic): add SemanticSimilarity metric	2026-02-03 17:30:56 +00:00
kschappell	de5ad93524	feat(metrics): add SemanticResult type	2026-02-03 17:30:50 +00:00
kschappell	cab8099d06	docs(changelog): add validator entries Document validators module with Check protocol, metric validators, constraint validators, composite validators, and factory functions.	2026-02-03 17:14:37 +00:00
kschappell	e2be3daffd	test(validators): add validator tests Add comprehensive tests for metric validators, constraint validators, and composite validators covering pass/fail cases and error handling.	2026-02-03 17:14:32 +00:00
kschappell	9239300fd9	feat(validators): add factory functions and exports Export all validators and provide factory functions for clean API: bleu(), rouge(), lexical(), length(), readability(), contains(), excludes(), all_of(), any_of().	2026-02-03 17:14:26 +00:00
kschappell	b9f805b2f4	feat(validators): add composite validators Implement AllOf and AnyOf for combining multiple checks into composite validation rules.	2026-02-03 17:14:20 +00:00
kschappell	75cd7b68de	feat(validators): add constraint validators Implement LengthValidator, ReadabilityValidator, ContainsValidator, and ExcludesValidator for text constraints without reference text.	2026-02-03 17:14:14 +00:00
kschappell	b2b5eb1518	feat(validators): add metric-based validators Implement BleuValidator, RougeValidator, and LexicalValidator for validating text against reference using metric thresholds.	2026-02-03 17:14:09 +00:00
kschappell	9e7b0131b3	feat(validators): add Check protocol and base types Define the Check protocol for validation checks that compute a score and return pass/fail results with diagnostics.	2026-02-03 17:14:03 +00:00
kschappell	b8ab5811dd	docs(changelog): add ROUGE and readability entries	2026-02-03 17:03:39 +00:00
kschappell	62fac688e4	test(metrics): add ROUGE and readability tests	2026-02-03 17:03:34 +00:00
kschappell	14ac7dbbb9	feat(metrics): export ROUGE and readability from module	2026-02-03 17:03:28 +00:00
kschappell	aad933f9c4	feat(metrics): add readability implementation	2026-02-03 17:03:24 +00:00
kschappell	2a7476046d	feat(metrics): add ROUGE implementation	2026-02-03 17:03:19 +00:00
kschappell	914c738013	feat(metrics): add ROUGE and readability result types	2026-02-03 17:03:14 +00:00
@@ -0,0 +1 @@
+"""Tests for the Veritext pytest plugin."""
@@ -0,0 +1 @@
+"""Tests for semantic similarity module."""