docs: add branch creation instruction to git workflow

Explicitly documents the requirement to create a new branch before starting work from a plan, consistent with the parent workspace CLAUDE.md instruction.
docs(changelog): add CLI entries
2026-02-03 19:06:45 +00:00 · 2026-02-03 18:22:50 +00:00 · 2026-02-03 18:22:31 +00:00 · 2026-02-03 18:20:28 +00:00 · 2026-02-03 18:19:20 +00:00 · 2026-02-03 18:17:33 +00:00
13 changed files with 1582 additions and 0 deletions
@@ -83,6 +83,11 @@ Each layer depends only on layers below it.
 ## Git Workflow
 ### Before Starting Work
 When starting work from a plan, create a new branch matching the plan's scope before
 making any changes. Do not reuse an existing branch from previous work, even if related.
 ### Commits
 - Format: `type(scope): description`
@@ -45,3 +45,12 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - `assert_no_regression()` raises `RegressionDetectedError` for CI integration
 - Customisable tolerance threshold and window size for regression detection
 - Metadata support for tracking git SHA, model versions, etc.
 - Command-line interface (CLI) via `veritext` command
 - `veritext validate` command for inline and file-based text validation
 - JSONL input format support for batch validation (`--file` option)
 - Separate candidate/reference file support (`--reference-file` option)
 - Multiple output formats: table (default), JSON, and simple text
 - `veritext benchmark run` command for running evaluations and storing results
 - `veritext benchmark show` command for viewing benchmark history
 - `veritext benchmark check` command for regression detection with exit code 1 on failure
 - Rich-formatted terminal output with tables and coloured panels
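
For reference, the JSONL batch format accepted by `--file` is one JSON object per line with `candidate` and `reference` keys, as implemented by the reader added in this change. The following is a minimal standard-library sketch of producing and re-reading such a file. `write_pairs`, `load_pairs`, and `outputs.jsonl` are illustrative names and not part of Veritext.

```python
import json
import tempfile
from pathlib import Path


def write_pairs(path: Path, pairs: list[tuple[str, str]]) -> None:
    """Write (candidate, reference) pairs as one JSON object per line."""
    with path.open("w", encoding="utf-8") as f:
        for candidate, reference in pairs:
            record = {"candidate": candidate, "reference": reference}
            f.write(json.dumps(record) + "\n")


def load_pairs(path: Path) -> list[dict[str, str]]:
    """Read the file back, skipping blank lines."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical sample data written to a temporary directory.
pairs = [("hello world today", "hello world today"), ("foo bar", "foo bar baz")]
out = Path(tempfile.mkdtemp()) / "outputs.jsonl"
write_pairs(out, pairs)
records = load_pairs(out)
```

A file written this way could then be passed to the batch mode shown in the command's docstring, for example `veritext validate -f outputs.jsonl -m bleu,rouge`.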
@@ -0,0 +1,5 @@
 """CLI module: Command-line interface for Veritext."""
 from veritext.cli.main import app
 __all__ = ["app"]
@@ -0,0 +1,166 @@
 """Benchmark commands for quality tracking."""
 from pathlib import Path
 from typing import Annotated
 import typer
 from veritext.benchmark import Benchmark
 from veritext.cli.formatters import (
    console,
    format_benchmark_history,
    format_regression_report,
 )
 from veritext.cli.readers import read_jsonl
 benchmark_app = typer.Typer(
    name="benchmark",
    help="Track and compare text quality over time.",
    no_args_is_help=True,
 )
@benchmark_app.command("run")
 def benchmark_run(
    name: Annotated[
        str,
        typer.Argument(help="Name for this benchmark suite."),
    ],
    file: Annotated[
        Path,
        typer.Option("--file", "-f", help="JSONL file with candidate/reference pairs."),
    ],
    metrics: Annotated[
        str,
        typer.Option(
            "--metrics",
            "-m",
            help="Comma-separated metrics to track (e.g., rouge_l,bleu4).",
        ),
    ] = "rouge_l,bleu4",
    storage_path: Annotated[
        Path,
        typer.Option(
            "--storage",
            "-s",
            help="Directory for benchmark data storage.",
        ),
    ] = Path("benchmarks"),
 ) -> None:
    """
    Run a benchmark evaluation and store the results.
    Example:
        veritext benchmark run my_bench -f data.jsonl -m rouge_l,bleu4
    """
    # Read text pairs
    try:
        pairs = read_jsonl(file)
    except (FileNotFoundError, ValueError) as e:
        console.print(f"[red]Error:[/red] {e}")
        raise typer.Exit(code=1) from e
    if not pairs:
        console.print("[yellow]Warning:[/yellow] No text pairs found in file.")
        raise typer.Exit(code=0)
    # Parse metrics
    # Skip empty names so a trailing comma (e.g. "rouge_l,") is not passed on as a metric
    metric_names = [m.strip() for m in metrics.split(",") if m.strip()]
    candidates = [p.candidate for p in pairs]
    references = [p.reference for p in pairs]
    # Run benchmark
    bench = Benchmark(name, storage_path=storage_path)
    run = bench.evaluate(candidates, references, metrics=metric_names)
    console.print(f"[green]Benchmark '{name}' completed.[/green]")
    console.print(f"Samples: {run.sample_count}")
    console.print("\nMetrics:")
    for metric_name, value in sorted(run.metrics.items()):
        console.print(f"  {metric_name}: {value:.4f}")
@benchmark_app.command("show")
 def benchmark_show(
    name: Annotated[
        str,
        typer.Argument(help="Name of the benchmark suite."),
    ],
    last: Annotated[
        int,
        typer.Option("--last", "-n", help="Number of recent runs to show."),
    ] = 20,
    storage_path: Annotated[
        Path,
        typer.Option(
            "--storage",
            "-s",
            help="Directory for benchmark data storage.",
        ),
    ] = Path("benchmarks"),
 ) -> None:
    """
    Show benchmark history for a suite.
    Example:
        veritext benchmark show my_bench --last 10
    """
    bench = Benchmark(name, storage_path=storage_path)
    runs = bench.get_history(limit=last)
    if not runs:
        console.print(f"[yellow]No benchmark runs found for '{name}'.[/yellow]")
        raise typer.Exit(code=0)
    table = format_benchmark_history(runs)
    console.print(table)
@benchmark_app.command("check")
 def benchmark_check(
    name: Annotated[
        str,
        typer.Argument(help="Name of the benchmark suite."),
    ],
    tolerance: Annotated[
        float,
        typer.Option(
            "--tolerance",
            "-t",
            help="Maximum allowed metric drop (e.g., 0.05 = 5%).",
        ),
    ] = 0.05,
    window: Annotated[
        int,
        typer.Option(
            "--window",
            "-w",
            help="Number of historical runs for baseline.",
        ),
    ] = 10,
    storage_path: Annotated[
        Path,
        typer.Option(
            "--storage",
            "-s",
            help="Directory for benchmark data storage.",
        ),
    ] = Path("benchmarks"),
 ) -> None:
    """
    Check for a quality regression against the historical baseline.
    Exits with code 1 if a regression is detected (for CI integration).
    Example:
        veritext benchmark check my_bench --tolerance 0.05
    """
    bench = Benchmark(name, storage_path=storage_path)
    report = bench.check_regression(tolerance=tolerance, window=window)
    panel = format_regression_report(report)
    console.print(panel)
    if report.detected:
        raise typer.Exit(code=1)
@@ -0,0 +1,170 @@
 """Rich output formatters for CLI display."""
 import json
 from rich.console import Console
 from rich.panel import Panel
 from rich.table import Table
 from veritext.benchmark.models import BenchmarkRun, RegressionReport
 console = Console()
 def format_validation_table(
    results: dict[str, float],
    threshold: float | None = None,
 ) -> Table:
    """
    Format validation results as a Rich table.
    Args:
        results: Dictionary of metric names to scores.
        threshold: Optional threshold for pass/fail colouring.
    Returns:
        Rich Table object.
    """
    table = Table(title="Validation Results", show_header=True, header_style="bold")
    table.add_column("Metric", style="cyan")
    table.add_column("Score", justify="right")
    if threshold is not None:
        table.add_column("Status", justify="center")
    for metric, score in sorted(results.items()):
        score_str = f"{score:.4f}"
        if threshold is not None:
            status = "[green]PASS[/green]" if score >= threshold else "[red]FAIL[/red]"
            table.add_row(metric, score_str, status)
        else:
            table.add_row(metric, score_str)
    return table
 def format_validation_json(results: dict[str, float]) -> str:
    """
    Format validation results as JSON.
    Args:
        results: Dictionary of metric names to scores.
    Returns:
        JSON string.
    """
    return json.dumps(results, indent=2)
 def format_validation_simple(results: dict[str, float]) -> str:
    """
    Format validation results as simple text output.
    Args:
        results: Dictionary of metric names to scores.
    Returns:
        Simple text string with one metric per line.
    """
    lines = [f"{metric}: {score:.4f}" for metric, score in sorted(results.items())]
    return "\n".join(lines)
 def format_benchmark_history(runs: list[BenchmarkRun]) -> Table:
    """
    Format benchmark run history as a Rich table.
    Args:
        runs: List of BenchmarkRun objects (most recent first).
    Returns:
        Rich Table object.
    """
    if not runs:
        table = Table(title="Benchmark History")
        table.add_column("No runs found")
        return table
    # Get all metric names from the runs
    metric_names: set[str] = set()
    for run in runs:
        metric_names.update(run.metrics.keys())
    sorted_metrics = sorted(metric_names)
    table = Table(title="Benchmark History", show_header=True, header_style="bold")
    table.add_column("Timestamp", style="cyan")
    table.add_column("Samples", justify="right")
    for metric in sorted_metrics:
        table.add_column(metric, justify="right")
    for run in runs:
        timestamp = run.timestamp.strftime("%Y-%m-%d %H:%M")
        samples = str(run.sample_count)
        metric_values = [f"{run.metrics.get(m, 0.0):.4f}" for m in sorted_metrics]
        table.add_row(timestamp, samples, *metric_values)
    return table
 def format_regression_report(report: RegressionReport) -> Panel:
    """
    Format a regression report as a Rich panel.
    Args:
        report: RegressionReport object.
    Returns:
        Rich Panel object with formatted report.
    """
    if not report.detected:
        content = (
            f"[green]No regression detected.[/green]\nTolerance: {report.tolerance:.2%}"
        )
        return Panel(content, title="Regression Check", border_style="green")
    # Build regression details
    lines = [
        "[red]Regression detected![/red]",
        f"Tolerance: {report.tolerance:.2%}",
        "",
        "Metric details:",
    ]
    for metric in sorted(report.deltas.keys()):
        baseline = report.baseline.get(metric, 0.0)
        current = report.current.get(metric, 0.0)
        delta = report.deltas[metric]
        if delta < -report.tolerance:
            status = "[red]REGRESSED[/red]"
        else:
            status = "[green]OK[/green]"
        lines.append(
            f"  {metric}: {current:.4f} (baseline: {baseline:.4f}, "
            f"delta: {delta:+.4f}) {status}"
        )
    return Panel("\n".join(lines), title="Regression Check", border_style="red")
 def print_validation_output(
    results: dict[str, float],
    output_format: str = "table",
    threshold: float | None = None,
 ) -> None:
    """
    Print validation results in the specified format.
    Args:
        results: Dictionary of metric names to scores.
        output_format: Output format ('table', 'json', or 'simple').
        threshold: Optional threshold for pass/fail colouring (table only).
    """
    if output_format == "json":
        console.print(format_validation_json(results))
    elif output_format == "simple":
        console.print(format_validation_simple(results))
    else:
        console.print(format_validation_table(results, threshold))
@@ -0,0 +1,37 @@
 """Veritext CLI entry point."""
 import typer
 import veritext
 from veritext.cli.benchmark import benchmark_app
 from veritext.cli.validate import validate
 app = typer.Typer(
    name="veritext",
    help="Semantic text validation framework.",
    no_args_is_help=True,
 )
 # Register commands
 app.command()(validate)
 app.add_typer(benchmark_app)
@app.callback(invoke_without_command=True)
 def main(
    version: bool | None = typer.Option(
        None,
        "--version",
        "-V",
        help="Show version and exit.",
        is_eager=True,
    ),
 ) -> None:
    """Veritext: Semantic text validation framework for Python."""
    if version:
        typer.echo(f"veritext {veritext.__version__}")
        raise typer.Exit()
 if __name__ == "__main__":
    app()
@@ -0,0 +1,120 @@
 """Input readers for CLI operations."""
 import json
 from dataclasses import dataclass
 from pathlib import Path
@dataclass
 class TextPair:
    """A candidate-reference text pair for validation."""
    candidate: str
    reference: str
 def read_jsonl(path: Path) -> list[TextPair]:
    """
    Read text pairs from a JSONL file.
    Each line must be a JSON object with 'candidate' and 'reference' keys.
    Args:
        path: Path to the JSONL file.
    Returns:
        List of TextPair objects.
    Raises:
        FileNotFoundError: If the file does not exist.
        ValueError: If any line is malformed or missing required keys.
    """
    if not path.exists():
        raise FileNotFoundError(f"File not found: {path}")
    pairs: list[TextPair] = []
    with path.open(encoding="utf-8") as f:
        for line_num, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                data = json.loads(line)
            except json.JSONDecodeError as e:
                raise ValueError(f"Invalid JSON on line {line_num}: {e}") from e
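            # Valid JSON that is not an object (e.g. a number or list) would
            # otherwise raise TypeError on the key checks below.
            if not isinstance(data, dict):
                raise ValueError(f"Expected a JSON object on line {line_num}")
            if "candidate" not in data: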
                raise ValueError(f"Missing 'candidate' key on line {line_num}")
            if "reference" not in data:
                raise ValueError(f"Missing 'reference' key on line {line_num}")
            pairs.append(
                TextPair(
                    candidate=str(data["candidate"]),
                    reference=str(data["reference"]),
                )
            )
    return pairs
 def read_paired_jsonl(candidates_path: Path, references_path: Path) -> list[TextPair]:
    """
    Read text pairs from separate candidate and reference JSONL files.
    Each file should contain one JSON object per line with a 'text' key.
    Args:
        candidates_path: Path to the candidates JSONL file.
        references_path: Path to the references JSONL file.
    Returns:
        List of TextPair objects.
    Raises:
        FileNotFoundError: If either file does not exist.
        ValueError: If files have different lengths or are malformed.
    """
    candidates = _read_text_jsonl(candidates_path, "candidates")
    references = _read_text_jsonl(references_path, "references")
    if len(candidates) != len(references):
        raise ValueError(
            f"Number of candidates ({len(candidates)}) does not match "
            f"number of references ({len(references)})"
        )
    return [
        TextPair(candidate=c, reference=r)
        for c, r in zip(candidates, references, strict=True)
    ]
 def _read_text_jsonl(path: Path, label: str) -> list[str]:
    """Read text values from a JSONL file with 'text' key per line."""
    if not path.exists():
        raise FileNotFoundError(f"{label.capitalize()} file not found: {path}")
    texts: list[str] = []
    with path.open(encoding="utf-8") as f:
        for line_num, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                data = json.loads(line)
            except json.JSONDecodeError as e:
                raise ValueError(
                    f"Invalid JSON in {label} file on line {line_num}: {e}"
                ) from e
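            # Reject valid JSON that is not an object before checking for keys.
            if not isinstance(data, dict):
                raise ValueError(
                    f"Expected a JSON object in {label} file on line {line_num}"
                )
            if "text" not in data: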
                raise ValueError(
                    f"Missing 'text' key in {label} file on line {line_num}"
                )
            texts.append(str(data["text"]))
    return texts
@@ -0,0 +1,213 @@
 """Validate command for computing text metrics."""
 from pathlib import Path
 from typing import Annotated
 import typer
 from veritext.cli.formatters import console, print_validation_output
 from veritext.cli.readers import read_jsonl, read_paired_jsonl
 from veritext.metrics.bleu import Bleu
 from veritext.metrics.lexical import Lexical
 from veritext.metrics.rouge import Rouge
 # Available metrics mapped to their computation functions
 AVAILABLE_METRICS = frozenset(
    {"bleu", "bleu1", "bleu2", "bleu3", "bleu4", "rouge", "rouge_l", "lexical"}
 )
 def _compute_metrics(
    candidate: str,
    reference: str,
    metric_names: list[str],
 ) -> dict[str, float]:
    """Compute requested metrics for a single text pair."""
    results: dict[str, float] = {}
    bleu = Bleu()
    rouge = Rouge()
    lexical = Lexical()
    for metric in metric_names:
        if metric == "bleu" or metric == "bleu4":
            bleu_result = bleu.score(candidate, reference)
            results["bleu4"] = bleu_result.bleu4
        elif metric == "bleu1":
            bleu_result = bleu.score(candidate, reference)
            results["bleu1"] = bleu_result.bleu1
        elif metric == "bleu2":
            bleu_result = bleu.score(candidate, reference)
            results["bleu2"] = bleu_result.bleu2
        elif metric == "bleu3":
            bleu_result = bleu.score(candidate, reference)
            results["bleu3"] = bleu_result.bleu3
        elif metric == "rouge" or metric == "rouge_l":
            rouge_result = rouge.score(candidate, reference)
            results["rouge_l"] = rouge_result.rouge_l.fmeasure
        elif metric == "lexical":
            lexical_result = lexical.score(candidate, reference)
            results["jaccard"] = lexical_result.jaccard
            results["token_overlap"] = lexical_result.token_overlap
    return results
 def _compute_batch_metrics(
    candidates: list[str],
    references: list[str],
    metric_names: list[str],
 ) -> dict[str, float]:
    """Compute average metrics for a batch of text pairs."""
    bleu = Bleu()
    rouge = Rouge()
    lexical = Lexical()
    results: dict[str, float] = {}
    for metric in metric_names:
        if metric == "bleu" or metric == "bleu4":
            bleu_batch = bleu.batch_score(candidates, references)
            stats = bleu_batch.stats.get("bleu4")
            if stats:
                results["bleu4"] = stats.mean
        elif metric == "bleu1":
            bleu_batch = bleu.batch_score(candidates, references)
            stats = bleu_batch.stats.get("bleu1")
            if stats:
                results["bleu1"] = stats.mean
        elif metric == "bleu2":
            bleu_batch = bleu.batch_score(candidates, references)
            stats = bleu_batch.stats.get("bleu2")
            if stats:
                results["bleu2"] = stats.mean
        elif metric == "bleu3":
            bleu_batch = bleu.batch_score(candidates, references)
            stats = bleu_batch.stats.get("bleu3")
            if stats:
                results["bleu3"] = stats.mean
        elif metric == "rouge" or metric == "rouge_l":
            rouge_batch = rouge.batch_score(candidates, references)
            stats = rouge_batch.stats.get("rouge_l_fmeasure")
            if stats:
                results["rouge_l"] = stats.mean
        elif metric == "lexical":
            lexical_batch = lexical.batch_score(candidates, references)
            jaccard_stats = lexical_batch.stats.get("jaccard")
            overlap_stats = lexical_batch.stats.get("token_overlap")
            if jaccard_stats:
                results["jaccard"] = jaccard_stats.mean
            if overlap_stats:
                results["token_overlap"] = overlap_stats.mean
    return results
 def _parse_metrics(metrics_str: str) -> list[str]:
    """Parse comma-separated metric names."""
    metrics = [m.strip().lower() for m in metrics_str.split(",")]
    # Validate metric names
    invalid = [m for m in metrics if m not in AVAILABLE_METRICS]
    if invalid:
        raise typer.BadParameter(
            f"Unknown metrics: {', '.join(invalid)}. "
            f"Available: {', '.join(sorted(AVAILABLE_METRICS))}"
        )
    return metrics
 def validate(
    text: Annotated[
        str | None,
        typer.Argument(help="Candidate text to validate (inline mode)."),
    ] = None,
    reference: Annotated[
        str | None,
        typer.Option("--reference", "-r", help="Reference text for comparison."),
    ] = None,
    file: Annotated[
        Path | None,
        typer.Option("--file", "-f", help="JSONL file with candidate/reference pairs."),
    ] = None,
    reference_file: Annotated[
        Path | None,
        typer.Option(
            "--reference-file",
            "-R",
            help="Separate JSONL file with references (requires --file).",
        ),
    ] = None,
    metrics: Annotated[
        str,
        typer.Option(
            "--metrics",
            "-m",
            help="Comma-separated metrics: bleu, bleu1-4, rouge, rouge_l, lexical.",
        ),
    ] = "bleu,rouge",
    output: Annotated[
        str,
        typer.Option("--output", "-o", help="Output format: table, json, or simple."),
    ] = "table",
    threshold: Annotated[
        float | None,
        typer.Option("--threshold", "-t", help="Score threshold for pass/fail status."),
    ] = None,
 ) -> None:
    """
    Validate text quality using various metrics.
    Use inline mode for single texts:
        veritext validate "text" -r "reference" -m bleu,rouge
    Use file mode for batches:
        veritext validate -f outputs.jsonl -m bleu,rouge
    """
    # Parse and validate metric names
    try:
        metric_names = _parse_metrics(metrics)
    except typer.BadParameter as e:
        console.print(f"[red]Error:[/red] {e}")
        raise typer.Exit(code=1) from e
    # Validate output format
    if output not in ("table", "json", "simple"):
        console.print(f"[red]Error:[/red] Invalid output format: {output}")
        raise typer.Exit(code=1)
    # Determine mode: inline vs file
    if file is not None:
        # File mode
        try:
            if reference_file is not None:
                pairs = read_paired_jsonl(file, reference_file)
            else:
                pairs = read_jsonl(file)
        except (FileNotFoundError, ValueError) as e:
            console.print(f"[red]Error:[/red] {e}")
            raise typer.Exit(code=1) from e
        if not pairs:
            console.print("[yellow]Warning:[/yellow] No text pairs found in file.")
            raise typer.Exit(code=0)
        candidates = [p.candidate for p in pairs]
        references = [p.reference for p in pairs]
        results = _compute_batch_metrics(candidates, references, metric_names)
        console.print(f"[dim]Evaluated {len(pairs)} text pairs.[/dim]\n")
    elif text is not None and reference is not None:
        # Inline mode
        results = _compute_metrics(text, reference, metric_names)
    else:
        # Invalid usage
        console.print(
            "[red]Error:[/red] Provide either text and --reference, "
            "or --file for batch mode."
        )
        raise typer.Exit(code=1)
    print_validation_output(results, output, threshold)
@@ -0,0 +1 @@
 """CLI test suite."""
@@ -0,0 +1,337 @@
 """Tests for CLI benchmark commands."""
 from pathlib import Path
 from typer.testing import CliRunner
 from veritext.cli.main import app
 runner = CliRunner()
 class TestBenchmarkRun:
    """Tests for benchmark run command."""
    def test_benchmark_run_basic(self, tmp_path: Path) -> None:
        """Test basic benchmark run."""
        data_file = tmp_path / "data.jsonl"
        data_file.write_text(
            '{"candidate": "hello world today", "reference": "hello world today"}\n'
            '{"candidate": "foo bar baz qux", "reference": "foo bar baz qux"}'
        )
        storage_path = tmp_path / "benchmarks"
        result = runner.invoke(
            app,
            [
                "benchmark",
                "run",
                "test_bench",
                "-f",
                str(data_file),
                "-m",
                "rouge_l,bleu4",
                "-s",
                str(storage_path),
            ],
        )
        assert result.exit_code == 0
        assert "Benchmark 'test_bench' completed" in result.stdout
        assert "Samples: 2" in result.stdout
        assert "rouge_l:" in result.stdout
        assert "bleu4:" in result.stdout
    def test_benchmark_run_file_not_found(self, tmp_path: Path) -> None:
        """Test benchmark run with non-existent file."""
        result = runner.invoke(
            app,
            [
                "benchmark",
                "run",
                "test_bench",
                "-f",
                "/nonexistent/file.jsonl",
                "-s",
                str(tmp_path / "benchmarks"),
            ],
        )
        assert result.exit_code == 1
        assert "Error" in result.stdout
    def test_benchmark_run_creates_storage(self, tmp_path: Path) -> None:
        """Test that benchmark run creates storage directory."""
        data_file = tmp_path / "data.jsonl"
        data_file.write_text('{"candidate": "hello", "reference": "hello"}')
        storage_path = tmp_path / "new_benchmarks"
        result = runner.invoke(
            app,
            [
                "benchmark",
                "run",
                "test_bench",
                "-f",
                str(data_file),
                "-s",
                str(storage_path),
            ],
        )
        assert result.exit_code == 0
        assert storage_path.exists()
 class TestBenchmarkShow:
    """Tests for benchmark show command."""
    def test_benchmark_show_no_runs(self, tmp_path: Path) -> None:
        """Test showing benchmark with no runs."""
        storage_path = tmp_path / "benchmarks"
        storage_path.mkdir()
        result = runner.invoke(
            app,
            [
                "benchmark",
                "show",
                "nonexistent_bench",
                "-s",
                str(storage_path),
            ],
        )
        assert result.exit_code == 0
        assert "No benchmark runs found" in result.stdout
    def test_benchmark_show_with_runs(self, tmp_path: Path) -> None:
        """Test showing benchmark history with runs."""
        # First create some runs
        data_file = tmp_path / "data.jsonl"
        data_file.write_text('{"candidate": "hello world", "reference": "hello world"}')
        storage_path = tmp_path / "benchmarks"
        # Run benchmark twice
        for _ in range(2):
            runner.invoke(
                app,
                [
                    "benchmark",
                    "run",
                    "test_bench",
                    "-f",
                    str(data_file),
                    "-s",
                    str(storage_path),
                ],
            )
        # Show history
        result = runner.invoke(
            app,
            [
                "benchmark",
                "show",
                "test_bench",
                "-s",
                str(storage_path),
            ],
        )
        assert result.exit_code == 0
        assert "Benchmark History" in result.stdout
    def test_benchmark_show_limit(self, tmp_path: Path) -> None:
        """Test showing limited benchmark history."""
        data_file = tmp_path / "data.jsonl"
        data_file.write_text('{"candidate": "hello", "reference": "hello"}')
        storage_path = tmp_path / "benchmarks"
        # Run benchmark 3 times
        for _ in range(3):
            runner.invoke(
                app,
                [
                    "benchmark",
                    "run",
                    "test_bench",
                    "-f",
                    str(data_file),
                    "-s",
                    str(storage_path),
                ],
            )
        # Show only last 2
        result = runner.invoke(
            app,
            [
                "benchmark",
                "show",
                "test_bench",
                "--last",
                "2",
                "-s",
                str(storage_path),
            ],
        )
        assert result.exit_code == 0
 class TestBenchmarkCheck:
    """Tests for benchmark check command."""
    def test_benchmark_check_no_regression(self, tmp_path: Path) -> None:
        """Test checking for regression with no regression."""
        data_file = tmp_path / "data.jsonl"
        data_file.write_text(
            '{"candidate": "hello world today", "reference": "hello world today"}'
        )
        storage_path = tmp_path / "benchmarks"
        # Run benchmark twice with same data (no regression)
        for _ in range(2):
            runner.invoke(
                app,
                [
                    "benchmark",
                    "run",
                    "test_bench",
                    "-f",
                    str(data_file),
                    "-s",
                    str(storage_path),
                ],
            )
        # Check for regression
        result = runner.invoke(
            app,
            [
                "benchmark",
                "check",
                "test_bench",
                "-s",
                str(storage_path),
            ],
        )
        assert result.exit_code == 0
        assert "No regression detected" in result.stdout
    def test_benchmark_check_with_regression(self, tmp_path: Path) -> None:
        """Test checking for regression when regression occurs."""
        storage_path = tmp_path / "benchmarks"
        # First run with good data
        good_file = tmp_path / "good.jsonl"
        good_file.write_text(
            '{"candidate": "hello world today", "reference": "hello world today"}'
        )
        runner.invoke(
            app,
            [
                "benchmark",
                "run",
                "test_bench",
                "-f",
                str(good_file),
                "-s",
                str(storage_path),
            ],
        )
        # Second run with bad data (regression)
        bad_file = tmp_path / "bad.jsonl"
        bad_file.write_text(
            '{"candidate": "completely different", "reference": "hello world today"}'
        )
        runner.invoke(
            app,
            [
                "benchmark",
                "run",
                "test_bench",
                "-f",
                str(bad_file),
                "-s",
                str(storage_path),
            ],
        )
        # Check for regression
        result = runner.invoke(
            app,
            [
                "benchmark",
                "check",
                "test_bench",
                "-t",
                "0.05",
                "-s",
                str(storage_path),
            ],
        )
        assert result.exit_code == 1
        assert "Regression detected" in result.stdout
    def test_benchmark_check_custom_tolerance(self, tmp_path: Path) -> None:
        """Test checking regression with custom tolerance."""
        data_file = tmp_path / "data.jsonl"
        data_file.write_text('{"candidate": "hello", "reference": "hello"}')
        storage_path = tmp_path / "benchmarks"
        runner.invoke(
            app,
            [
                "benchmark",
                "run",
                "test_bench",
                "-f",
                str(data_file),
                "-s",
                str(storage_path),
            ],
        )
        result = runner.invoke(
            app,
            [
                "benchmark",
                "check",
                "test_bench",
                "--tolerance",
                "0.10",
                "-s",
                str(storage_path),
            ],
        )
        assert result.exit_code == 0
        assert "10.00%" in result.stdout
 class TestBenchmarkHelp:
    """Tests for benchmark help output."""
    def test_benchmark_help(self) -> None:
        """Test benchmark help output."""
        result = runner.invoke(app, ["benchmark", "--help"])
        assert result.exit_code == 0
        assert "run" in result.stdout
        assert "show" in result.stdout
        assert "check" in result.stdout
    def test_benchmark_run_help(self) -> None:
        """Test benchmark run help output."""
        result = runner.invoke(app, ["benchmark", "run", "--help"])
        assert result.exit_code == 0
        assert "--file" in result.stdout
        assert "--metrics" in result.stdout
    def test_benchmark_show_help(self) -> None:
        """Test benchmark show help output."""
        result = runner.invoke(app, ["benchmark", "show", "--help"])
        assert result.exit_code == 0
        assert "--last" in result.stdout
    def test_benchmark_check_help(self) -> None:
        """Test benchmark check help output."""
        result = runner.invoke(app, ["benchmark", "check", "--help"])
        assert result.exit_code == 0
        assert "--tolerance" in result.stdout
        assert "--window" in result.stdout
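Review note: the `check` tests above rely on two behaviours, a non-zero exit code when a regression is found and a tolerance printed as a percentage (`10.00%`). A minimal sketch of that contract follows. It assumes a regression means an absolute drop larger than the tolerance, which this diff does not confirm (the real comparison may be relative or windowed). `check_regression` and its return shape are hypothetical, not Veritext's API.

```python
def check_regression(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.05,
) -> tuple[int, str]:
    """Return (exit_code, message) for a hypothetical regression check."""
    # Delta per metric present in both runs; a negative delta means the score dropped.
    deltas = {
        name: current[name] - baseline[name]
        for name in baseline
        if name in current
    }
    # Assumption: a regression is an absolute drop strictly larger than the tolerance.
    regressed = sorted(name for name, delta in deltas.items() if delta < -tolerance)
    # The ".2%" format spec renders 0.10 as "10.00%".
    label = f"tolerance {tolerance:.2%}"
    if regressed:
        return 1, f"Regression detected ({label}): {', '.join(regressed)}"
    return 0, f"No regression ({label})"
```

Returning the exit code rather than calling `sys.exit` keeps the decision easy to unit-test; a CLI wrapper could pass the code on to the shell so a CI job fails on regression.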
@@ -0,0 +1,141 @@
 """Tests for CLI output formatters."""
 from datetime import UTC, datetime
 from veritext.benchmark.models import BenchmarkRun, RegressionReport
 from veritext.cli.formatters import (
    format_benchmark_history,
    format_regression_report,
    format_validation_json,
    format_validation_simple,
    format_validation_table,
 )
 class TestFormatValidationTable:
    """Tests for format_validation_table function."""
    def test_format_empty_results(self) -> None:
        """Test formatting empty results."""
        table = format_validation_table({})
        assert table.title == "Validation Results"
        assert table.row_count == 0
    def test_format_single_metric(self) -> None:
        """Test formatting a single metric."""
        results = {"bleu4": 0.8523}
        table = format_validation_table(results)
        assert table.row_count == 1
    def test_format_multiple_metrics(self) -> None:
        """Test formatting multiple metrics."""
        results = {"bleu4": 0.85, "rouge_l": 0.92, "jaccard": 0.75}
        table = format_validation_table(results)
        assert table.row_count == 3
    def test_format_with_threshold(self) -> None:
        """Test formatting with threshold for pass/fail."""
        results = {"bleu4": 0.85, "rouge_l": 0.45}
        table = format_validation_table(results, threshold=0.5)
        # A threshold adds a Status column (Metric, Score, Status); rows stay one per metric.
        assert table.row_count == 2
 class TestFormatValidationJson:
    """Tests for format_validation_json function."""
    def test_format_empty_results(self) -> None:
        """Test formatting empty results as JSON."""
        result = format_validation_json({})
        assert result == "{}"
    def test_format_results(self) -> None:
        """Test formatting results as JSON."""
        results = {"bleu4": 0.85, "rouge_l": 0.92}
        result = format_validation_json(results)
        assert '"bleu4": 0.85' in result
        assert '"rouge_l": 0.92' in result
 class TestFormatValidationSimple:
    """Tests for format_validation_simple function."""
    def test_format_empty_results(self) -> None:
        """Test formatting empty results as simple text."""
        result = format_validation_simple({})
        assert result == ""
    def test_format_results(self) -> None:
        """Test formatting results as simple text."""
        results = {"bleu4": 0.8523, "rouge_l": 0.9234}
        result = format_validation_simple(results)
        assert "bleu4: 0.8523" in result
        assert "rouge_l: 0.9234" in result
 class TestFormatBenchmarkHistory:
    """Tests for format_benchmark_history function."""
    def test_format_empty_history(self) -> None:
        """Test formatting empty benchmark history."""
        table = format_benchmark_history([])
        assert table.title == "Benchmark History"
    def test_format_single_run(self) -> None:
        """Test formatting a single benchmark run."""
        run = BenchmarkRun(
            id="test-id",
            benchmark_name="test",
            timestamp=datetime(2024, 1, 15, 10, 30, tzinfo=UTC),
            veritext_version="0.1.0",
            metrics={"rouge_l": 0.85, "bleu4": 0.72},
            sample_count=100,
        )
        table = format_benchmark_history([run])
        assert table.row_count == 1
    def test_format_multiple_runs(self) -> None:
        """Test formatting multiple benchmark runs."""
        runs = [
            BenchmarkRun(
                id=f"test-id-{i}",
                benchmark_name="test",
                timestamp=datetime(2024, 1, i + 1, 10, 30, tzinfo=UTC),
                veritext_version="0.1.0",
                metrics={"rouge_l": 0.8 + i * 0.01},
                sample_count=100,
            )
            for i in range(3)
        ]
        table = format_benchmark_history(runs)
        assert table.row_count == 3
 class TestFormatRegressionReport:
    """Tests for format_regression_report function."""
    def test_format_no_regression(self) -> None:
        """Test formatting report with no regression."""
        report = RegressionReport(
            detected=False,
            baseline={"rouge_l": 0.85},
            current={"rouge_l": 0.86},
            deltas={"rouge_l": 0.01},
            tolerance=0.05,
        )
        panel = format_regression_report(report)
        assert panel.title == "Regression Check"
        assert panel.border_style == "green"
    def test_format_with_regression(self) -> None:
        """Test formatting report with regression detected."""
        report = RegressionReport(
            detected=True,
            baseline={"rouge_l": 0.85, "bleu4": 0.72},
            current={"rouge_l": 0.70, "bleu4": 0.70},
            deltas={"rouge_l": -0.15, "bleu4": -0.02},
            tolerance=0.05,
        )
        panel = format_regression_report(report)
        assert panel.title == "Regression Check"
        assert panel.border_style == "red"
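Review note: the JSON and simple-text formatter tests above pin down a small output contract: `{}` or an empty string for empty results, and `"name": value` or `name: value` pairs otherwise. A standard-library sketch that would satisfy those assertions is shown below. The function names, the `indent=2` setting, and the four-decimal rounding are assumptions, not the actual `veritext.cli.formatters` code.

```python
import json


def format_json_sketch(results: dict[str, float]) -> str:
    """Serialise metric scores as JSON (hypothetical stand-in)."""
    # json.dumps returns "{}" for an empty dict, with or without indent.
    return json.dumps(results, indent=2)


def format_simple_sketch(results: dict[str, float]) -> str:
    """Render one "name: score" line per metric (hypothetical stand-in)."""
    # Joining an empty sequence yields "", which matches the empty-results test.
    return "\n".join(f"{name}: {score:.4f}" for name, score in results.items())
```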
@@ -0,0 +1,145 @@
 """Tests for CLI input readers."""
 import json
 from pathlib import Path
 import pytest
 from veritext.cli.readers import TextPair, read_jsonl, read_paired_jsonl
 class TestTextPair:
    """Tests for TextPair dataclass."""
    def test_create_text_pair(self) -> None:
        """Test creating a TextPair."""
        pair = TextPair(candidate="hello", reference="world")
        assert pair.candidate == "hello"
        assert pair.reference == "world"
 class TestReadJsonl:
    """Tests for read_jsonl function."""
    def test_read_valid_jsonl(self, tmp_path: Path) -> None:
        """Test reading a valid JSONL file."""
        data = [
            {"candidate": "foo", "reference": "bar"},
            {"candidate": "baz", "reference": "qux"},
        ]
        jsonl_file = tmp_path / "data.jsonl"
        jsonl_file.write_text("\n".join(json.dumps(d) for d in data))
        pairs = read_jsonl(jsonl_file)
        assert len(pairs) == 2
        assert pairs[0].candidate == "foo"
        assert pairs[0].reference == "bar"
        assert pairs[1].candidate == "baz"
        assert pairs[1].reference == "qux"
    def test_read_empty_file(self, tmp_path: Path) -> None:
        """Test reading an empty JSONL file."""
        jsonl_file = tmp_path / "empty.jsonl"
        jsonl_file.write_text("")
        pairs = read_jsonl(jsonl_file)
        assert pairs == []
    def test_read_file_with_blank_lines(self, tmp_path: Path) -> None:
        """Test reading a JSONL file with blank lines."""
        jsonl_file = tmp_path / "data.jsonl"
        content = '{"candidate": "a", "reference": "b"}\n\n{"candidate": "c", "reference": "d"}\n'
        jsonl_file.write_text(content)
        pairs = read_jsonl(jsonl_file)
        assert len(pairs) == 2
    def test_read_file_not_found(self, tmp_path: Path) -> None:
        """Test reading a non-existent file."""
        with pytest.raises(FileNotFoundError):
            read_jsonl(tmp_path / "nonexistent.jsonl")
    def test_read_invalid_json(self, tmp_path: Path) -> None:
        """Test reading a file with invalid JSON."""
        jsonl_file = tmp_path / "invalid.jsonl"
        jsonl_file.write_text("not valid json")
        with pytest.raises(ValueError, match="Invalid JSON on line 1"):
            read_jsonl(jsonl_file)
    def test_read_missing_candidate_key(self, tmp_path: Path) -> None:
        """Test reading a file missing the candidate key."""
        jsonl_file = tmp_path / "data.jsonl"
        jsonl_file.write_text('{"reference": "bar"}')
        with pytest.raises(ValueError, match="Missing 'candidate' key on line 1"):
            read_jsonl(jsonl_file)
    def test_read_missing_reference_key(self, tmp_path: Path) -> None:
        """Test reading a file missing the reference key."""
        jsonl_file = tmp_path / "data.jsonl"
        jsonl_file.write_text('{"candidate": "foo"}')
        with pytest.raises(ValueError, match="Missing 'reference' key on line 1"):
            read_jsonl(jsonl_file)
 class TestReadPairedJsonl:
    """Tests for read_paired_jsonl function."""
    def test_read_paired_valid(self, tmp_path: Path) -> None:
        """Test reading valid paired JSONL files."""
        candidates_file = tmp_path / "candidates.jsonl"
        references_file = tmp_path / "references.jsonl"
        candidates_file.write_text('{"text": "foo"}\n{"text": "bar"}')
        references_file.write_text('{"text": "baz"}\n{"text": "qux"}')
        pairs = read_paired_jsonl(candidates_file, references_file)
        assert len(pairs) == 2
        assert pairs[0].candidate == "foo"
        assert pairs[0].reference == "baz"
        assert pairs[1].candidate == "bar"
        assert pairs[1].reference == "qux"
    def test_read_paired_length_mismatch(self, tmp_path: Path) -> None:
        """Test reading paired files with different lengths."""
        candidates_file = tmp_path / "candidates.jsonl"
        references_file = tmp_path / "references.jsonl"
        candidates_file.write_text('{"text": "foo"}\n{"text": "bar"}')
        references_file.write_text('{"text": "baz"}')
        with pytest.raises(ValueError, match="does not match"):
            read_paired_jsonl(candidates_file, references_file)
    def test_read_paired_candidates_not_found(self, tmp_path: Path) -> None:
        """Test reading when candidates file doesn't exist."""
        references_file = tmp_path / "references.jsonl"
        references_file.write_text('{"text": "baz"}')
        with pytest.raises(FileNotFoundError, match="Candidates file not found"):
            read_paired_jsonl(tmp_path / "nonexistent.jsonl", references_file)
    def test_read_paired_references_not_found(self, tmp_path: Path) -> None:
        """Test reading when references file doesn't exist."""
        candidates_file = tmp_path / "candidates.jsonl"
        candidates_file.write_text('{"text": "foo"}')
        with pytest.raises(FileNotFoundError, match="References file not found"):
            read_paired_jsonl(candidates_file, tmp_path / "nonexistent.jsonl")
    def test_read_paired_missing_text_key(self, tmp_path: Path) -> None:
        """Test reading paired files with missing text key."""
        candidates_file = tmp_path / "candidates.jsonl"
        references_file = tmp_path / "references.jsonl"
        candidates_file.write_text('{"value": "foo"}')
        references_file.write_text('{"text": "baz"}')
        with pytest.raises(ValueError, match="Missing 'text' key in candidates file"):
            read_paired_jsonl(candidates_file, references_file)
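Review note: the reader tests above describe the expected behaviour fairly precisely: blank lines are skipped, line numbers are 1-based, and errors surface as `ValueError` with specific messages. The sketch below shows one way a reader could meet that contract. It is not the code in `veritext.cli.readers`; `read_jsonl_sketch` is a hypothetical name, and the non-object check and its message are extra assumptions that the tests do not cover.

```python
import json
from dataclasses import dataclass
from pathlib import Path


@dataclass
class TextPair:
    """A candidate text and the reference it is compared against."""

    candidate: str
    reference: str


def read_jsonl_sketch(path: Path) -> list[TextPair]:
    """Read candidate/reference pairs from a JSONL file (hypothetical sketch)."""
    pairs: list[TextPair] = []
    # Opening a missing path raises FileNotFoundError, as the tests expect.
    with path.open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # Blank lines are skipped but still counted.
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"Invalid JSON on line {line_number}") from exc
            # Assumption: anything other than a JSON object is rejected.
            if not isinstance(record, dict):
                raise ValueError(f"Expected a JSON object on line {line_number}")
            for key in ("candidate", "reference"):
                if key not in record:
                    raise ValueError(f"Missing '{key}' key on line {line_number}")
            pairs.append(TextPair(record["candidate"], record["reference"]))
    return pairs
```

Counting lines with `enumerate(..., start=1)` before the blank-line check keeps reported line numbers aligned with what an editor shows.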
@@ -0,0 +1,233 @@
 """Tests for CLI validate command."""
 import json
 from pathlib import Path
 from typer.testing import CliRunner
 from veritext.cli.main import app
 runner = CliRunner()
 class TestValidateInline:
    """Tests for inline validation mode."""
    def test_validate_inline_basic(self) -> None:
        """Test basic inline validation."""
        result = runner.invoke(
            app,
            [
                "validate",
                "The quick brown fox jumps",
                "-r",
                "The quick brown fox jumps",
                "-m",
                "bleu",
            ],
        )
        assert result.exit_code == 0
        assert "bleu4" in result.stdout
    def test_validate_inline_with_rouge(self) -> None:
        """Test inline validation with ROUGE metric."""
        result = runner.invoke(
            app,
            [
                "validate",
                "hello world today",
                "-r",
                "hello world here",
                "-m",
                "rouge",
            ],
        )
        assert result.exit_code == 0
        assert "rouge_l" in result.stdout
    def test_validate_inline_with_lexical(self) -> None:
        """Test inline validation with lexical metric."""
        result = runner.invoke(
            app,
            [
                "validate",
                "hello world",
                "-r",
                "hello everyone",
                "-m",
                "lexical",
            ],
        )
        assert result.exit_code == 0
        assert "jaccard" in result.stdout
        assert "token_overlap" in result.stdout
    def test_validate_inline_json_output(self) -> None:
        """Test inline validation with JSON output."""
        result = runner.invoke(
            app,
            [
                "validate",
                "hello world today",
                "-r",
                "hello world today",
                "-m",
                "bleu",
                "-o",
                "json",
            ],
        )
        assert result.exit_code == 0
        data = json.loads(result.stdout)
        assert "bleu4" in data
    def test_validate_inline_simple_output(self) -> None:
        """Test inline validation with simple output."""
        result = runner.invoke(
            app,
            [
                "validate",
                "hello world today",
                "-r",
                "hello world today",
                "-m",
                "rouge",
                "-o",
                "simple",
            ],
        )
        assert result.exit_code == 0
        assert "rouge_l:" in result.stdout
    def test_validate_inline_missing_reference(self) -> None:
        """Test inline validation without reference."""
        result = runner.invoke(
            app,
            ["validate", "hello world", "-m", "bleu"],
        )
        assert result.exit_code == 1
        assert "Error" in result.stdout
    def test_validate_inline_invalid_metric(self) -> None:
        """Test inline validation with invalid metric."""
        result = runner.invoke(
            app,
            ["validate", "hello", "-r", "world", "-m", "invalid_metric"],
        )
        assert result.exit_code == 1
        assert "Unknown metrics" in result.stdout
 class TestValidateFile:
    """Tests for file-based validation mode."""
    def test_validate_file_basic(self, tmp_path: Path) -> None:
        """Test basic file-based validation."""
        data_file = tmp_path / "data.jsonl"
        data_file.write_text(
            '{"candidate": "hello world today", "reference": "hello world today"}\n'
            '{"candidate": "foo bar baz", "reference": "foo bar baz"}'
        )
        result = runner.invoke(
            app,
            ["validate", "-f", str(data_file), "-m", "bleu"],
        )
        assert result.exit_code == 0
        assert "bleu4" in result.stdout
        assert "Evaluated 2 text pairs" in result.stdout
    def test_validate_file_not_found(self) -> None:
        """Test file-based validation with non-existent file."""
        result = runner.invoke(
            app,
            ["validate", "-f", "/nonexistent/file.jsonl", "-m", "bleu"],
        )
        assert result.exit_code == 1
        assert "Error" in result.stdout
    def test_validate_paired_files(self, tmp_path: Path) -> None:
        """Test validation with separate candidate and reference files."""
        candidates_file = tmp_path / "candidates.jsonl"
        references_file = tmp_path / "references.jsonl"
        candidates_file.write_text(
            '{"text": "hello world today"}\n{"text": "foo bar baz"}'
        )
        references_file.write_text(
            '{"text": "hello world today"}\n{"text": "foo bar baz"}'
        )
        result = runner.invoke(
            app,
            [
                "validate",
                "-f",
                str(candidates_file),
                "-R",
                str(references_file),
                "-m",
                "bleu",
            ],
        )
        assert result.exit_code == 0
        assert "Evaluated 2 text pairs" in result.stdout
 class TestValidateOptions:
    """Tests for validate command options."""
    def test_validate_with_threshold(self) -> None:
        """Test validation with threshold option."""
        result = runner.invoke(
            app,
            [
                "validate",
                "hello world today",
                "-r",
                "hello world today",
                "-m",
                "bleu",
                "-t",
                "0.5",
            ],
        )
        assert result.exit_code == 0
        # Table output should include a Status column header or a PASS marker
        assert "Status" in result.stdout or "PASS" in result.stdout
    def test_validate_invalid_output_format(self) -> None:
        """Test validation with invalid output format."""
        result = runner.invoke(
            app,
            [
                "validate",
                "hello",
                "-r",
                "world",
                "-m",
                "bleu",
                "-o",
                "invalid",
            ],
        )
        assert result.exit_code == 1
        assert "Invalid output format" in result.stdout
    def test_validate_multiple_metrics(self) -> None:
        """Test validation with multiple metrics."""
        result = runner.invoke(
            app,
            [
                "validate",
                "The quick brown fox",
                "-r",
                "The quick brown fox",
                "-m",
                "bleu,rouge,lexical",
            ],
        )
        assert result.exit_code == 0
        assert "bleu4" in result.stdout
        assert "rouge_l" in result.stdout
        assert "jaccard" in result.stdout
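Review note: the validate tests pass metrics as a comma-separated `-m` value and expect an "Unknown metrics" error for unrecognised names. The sketch below illustrates that parsing step. The group-to-key mapping lists only the result keys visible in these tests (the real groups may emit more), and `parse_metrics` is a hypothetical helper rather than the function in `veritext.cli`.

```python
# Metric groups and the result keys the tests above observe for each one.
# Assumption: the real implementation may produce additional keys per group.
METRIC_GROUPS: dict[str, tuple[str, ...]] = {
    "bleu": ("bleu4",),
    "rouge": ("rouge_l",),
    "lexical": ("jaccard", "token_overlap"),
}


def parse_metrics(option: str) -> list[str]:
    """Split a comma-separated metric option and reject unknown names."""
    # Tolerate stray spaces and empty segments such as "bleu, rouge,".
    names = [part.strip() for part in option.split(",") if part.strip()]
    unknown = [name for name in names if name not in METRIC_GROUPS]
    if unknown:
        raise ValueError(f"Unknown metrics: {', '.join(unknown)}")
    return names
```

A CLI wrapper could catch the `ValueError`, print the message, and exit with code 1, which is the behaviour `test_validate_inline_invalid_metric` checks.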
Author	SHA1	Message	Date
kschappell	d5df8b52e6	docs: add branch creation instruction to git workflow Explicitly documents the requirement to create a new branch before starting work from a plan, consistent with the parent workspace CLAUDE.md instruction.	2026-02-03 19:06:45 +00:00
kschappell	8b7c087de7	docs(changelog): add CLI entries Document command-line interface including validate command, benchmark subcommands, and output formatting options.	2026-02-03 18:22:50 +00:00
kschappell	c54f8c3f6f	test(cli): add CLI tests Add comprehensive test suite for validate command, benchmark commands, input readers, and output formatters using Typer CliRunner.	2026-02-03 18:22:31 +00:00
kschappell	0cadfd4d23	feat(cli): add benchmark subcommands Add benchmark run, show, and check commands for quality tracking with regression detection supporting CI integration.	2026-02-03 18:20:28 +00:00
kschappell	e128720917	feat(cli): add validate command Implement validate command with inline and file-based modes supporting BLEU, ROUGE, and lexical metrics with multiple output formats.	2026-02-03 18:19:20 +00:00
kschappell	f713d5e8a6	feat(cli): add Rich output formatters Add formatters for validation results (table/json/simple) and benchmark history display with regression report panels.	2026-02-03 18:17:33 +00:00
kschappell	9853b57843	feat(cli): add JSONL and directory input readers Add TextPair dataclass and read_jsonl/read_paired_jsonl functions for parsing candidate-reference pairs from JSONL files.	2026-02-03 18:16:34 +00:00
kschappell	55faae3e1b	feat(cli): add CLI entry point with version command Initialise Typer app with --version flag and help text.	2026-02-03 18:16:07 +00:00