refactor: CLI cleanup and documentation updates

- Refactor CLI metric computation to eliminate code duplication
- Update version format to PEP 440 compliance (0.1.0.dev0)
- Cache Settings instance via @lru_cache for performance
- Document composite validators' protocol deviation
- Consolidate redundant empty checks in ROUGE-L computation
- Add Phase 10 (Portfolio Demos) to implementation plan
fix(pytest-plugin): remove duplicate plugin registration in tests
2026-02-04 15:38:46 +00:00 · 2026-02-04 00:43:20 +00:00 · 2026-02-04 00:23:06 +00:00 · 2026-02-04 00:22:57 +00:00 · 2026-02-04 00:22:47 +00:00 · 2026-02-03 21:31:48 +00:00
31 changed files with 2799 additions and 45 deletions
@@ -83,6 +83,11 @@ Each layer depends only on layers below it.
 ## Git Workflow
 ### Before Starting Work
 When starting work from a plan, create a new branch matching the plan's scope before
 making any changes. Do not reuse an existing branch from previous work, even if related.
 ### Commits
 - Format: `type(scope): description`
@@ -7,35 +7,85 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [Unreleased]
 ### Changed
 - Refactored CLI metric computation to eliminate code duplication
 - Version format updated from `0.1.0-dev` to `0.1.0.dev0` (PEP 440 compliance)
 - Settings instance is now cached via `@lru_cache` for better performance
 - Documented composite validators' intentional deviation from `Check` protocol return type
 ### Fixed
 - Consolidated redundant empty checks in ROUGE-L computation
 - Fixed README example using incorrect property names (`grade_level` → `flesch_kincaid_grade`, `reading_ease` → `flesch_reading_ease`)
 - Fixed potential crash in ROUGE metric when all references are empty after tokenisation
 - Fixed potential division by zero in readability metric when text has no sentence endings
 - Fixed unbounded cache growth in `SemanticSimilarity` by implementing LRU eviction with configurable max size
 - Fixed mutable list aliasing in `AllOf` and `AnyOf` composite validators
 - Fixed regex pattern validation in `ContainsValidator` and `ExcludesValidator` to fail at init time rather than during `check()`
 - Fixed pytest plugin tests failing with duplicate plugin registration error
 ### Documentation
 - Added Phase 10 (Portfolio Demos) to implementation plan: Streamlit demo and Jupyter notebooks
 - Updated project plan with portfolio demo section
 ### Added
 - Added `.score` property to `LexicalResult` for API consistency with other result types
 - Added `cache_max_size` parameter to `SemanticSimilarity` (default: 1000 embeddings)
 - Added test coverage for `core/config.py` and `core/logging.py` modules
 ## [0.1.0] — 2026-02-03
 Initial release of Veritext, a semantic text validation framework for Python.
 ### Added
 #### Core
 - Project scaffold with pyproject.toml and development tooling
 - Core exception hierarchy (`VeritextError` and subclasses)
 - Core types: `ValidationContext`, `CheckResult`, `ValidationResult`
 - Word tokeniser with Unicode normalisation support
 - Configuration module with pydantic-settings
 - Structured logging with structlog
 #### Metrics
 - Metrics module with `Metric` protocol, `AggregateStats`, and `BatchResult` types
 - BLEU metric implementation (BLEU-1 through BLEU-4 with brevity penalty)
 - Lexical similarity metric (Jaccard similarity and token overlap)
 - ROUGE metric (ROUGE-1, ROUGE-2, ROUGE-L with precision/recall/F-measure)
 - Flesch-Kincaid readability metrics (grade level and reading ease)
 - Batch scoring with aggregate statistics for all metrics
 #### Validators
 - Validators module with `Check` protocol for validation checks
 - Metric-based validators: `BleuValidator`, `RougeValidator`, `LexicalValidator`
 - Constraint validators: `LengthValidator`, `ReadabilityValidator`, `ContainsValidator`, `ExcludesValidator`
 - Composite validators: `AllOf` (all checks must pass), `AnyOf` (any check must pass)
 - Factory functions for clean validator API (`bleu()`, `rouge()`, `lexical()`, `length()`, `readability()`, `contains()`, `excludes()`, `all_of()`, `any_of()`)
 #### Semantic Similarity
 - Semantic similarity module with embedding-based text comparison (requires `veritext[semantic]` extra)
 - `SemanticSimilarity` metric using sentence-transformers for semantic relatedness
 - `SemanticValidator` for threshold-based semantic similarity validation
 - `semantic()` factory function for creating semantic validators
 - Embedding caching for performance optimisation in repeated comparisons
 #### Pytest Plugin
 - Native pytest plugin for CI/CD integration (entry point: `pytest11`)
 - `validate_text()` assertion function for expressive test assertions
 - `text_validation` marker for filtering validation tests
 - Pytest fixtures: `text_validator` factory and `validation_context` helper
 - Detailed failure messages with text preview and check diagnostics
 #### Benchmarking
 - Benchmark module for quality tracking and regression detection
 - `Benchmark` class for evaluating text quality over time with metric storage
 - `BenchmarkRun` and `RegressionReport` data models for tracking runs
@@ -45,3 +95,20 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - `assert_no_regression()` raises `RegressionDetectedError` for CI integration
 - Customisable tolerance threshold and window size for regression detection
 - Metadata support for tracking git SHA, model versions, etc.
 #### CLI
 - Command-line interface (CLI) via `veritext` command
 - `veritext validate` command for inline and file-based text validation
 - JSONL input format support for batch validation (`--file` option)
 - Separate candidate/reference file support (`--reference-file` option)
 - Multiple output formats: table (default), JSON, and simple text
 - `veritext benchmark run` command for running evaluations and storing results
 - `veritext benchmark show` command for viewing benchmark history
 - `veritext benchmark check` command for regression detection with exit code 1 on failure
 - Rich-formatted terminal output with tables and coloured panels
 #### Documentation
 - Comprehensive readme with usage examples
 - Example scripts: basic validation, chatbot testing, benchmark regression
@@ -871,6 +871,59 @@ uv run pytest --cov=src/veritext --cov-report=term-missing
 ---
 ### Phase 10: Portfolio Demos
 **Goal:** Interactive demos for showcasing Veritext without installation.
 **Step 1 — Streamlit Demo:**
 Build a quick interactive web UI for general visitors.
 - [ ] Create `demo/streamlit_app.py`
 - [ ] Text input boxes (candidate + reference)
 - [ ] Metric selector (BLEU, ROUGE, lexical, readability)
 - [ ] Threshold sliders for pass/fail validation
 - [ ] Results table with scores and status
 - [ ] Deploy to homeserver (e.g., `veritext.kschappell.com`)
 **Step 2 — Jupyter Notebook Collection:**
 Deep-dive notebooks targeting data science and ML recruiters.
 - [ ] Create `notebooks/` directory
 - [ ] `01-metrics-overview.ipynb` — Introduction to each metric with visualisations
 - [ ] `02-batch-evaluation.ipynb` — Evaluating model outputs at scale
 - [ ] `03-regression-detection.ipynb` — Tracking quality over time
 - [ ] `04-chatbot-validation.ipynb` — Real-world use case
 **Step 3 — JupyterLite Deployment:**
 Host notebooks as static files running in the browser.
 - [ ] Configure JupyterLite build with veritext pre-installed
 - [ ] Bundle notebooks into static site
 - [ ] Deploy alongside Streamlit demo
 **Files:**
 - `demo/streamlit_app.py`
 - `notebooks/01-metrics-overview.ipynb`
 - `notebooks/02-batch-evaluation.ipynb`
 - `notebooks/03-regression-detection.ipynb`
 - `notebooks/04-chatbot-validation.ipynb`
 - `notebooks/jupyterlite-config.json`
 **Verification:**
 ```bash
 # Streamlit
 uv run streamlit run demo/streamlit_app.py
 # JupyterLite (local preview)
 jupyter lite build --contents notebooks/
 jupyter lite serve
 ```
 ---
 ## Dependencies
 ```toml
@@ -488,3 +488,47 @@ benchmark.assert_no_regression(tolerance=0.03)
 5. **Natural portfolio narrative** — "I was building X and needed a better way to test
   it, so I built this tool." Every interviewer has faced similar problems.
 ---
 ## Portfolio Demos (Future)
 Interactive demos to showcase Veritext without requiring installation.
 ### Streamlit Demo
 A quick interactive web UI for general visitors and recruiters.
 **Features:**
 - Text input boxes (candidate + reference)
 - Metric selector (BLEU, ROUGE, lexical, readability)
 - Threshold sliders for pass/fail validation
 - Results table with scores and status
 **Deployment:** Self-hosted on homeserver (e.g., `veritext.kschappell.com`)
 **Effort:** ~30 minutes
 ### Jupyter Notebook Collection
 Deep-dive notebooks targeting data science and ML recruiters.
 **Notebooks:**
 | Notebook | Purpose |
 |----------|---------|
 | `01-metrics-overview.ipynb` | Introduction to each metric with visualisations |
 | `02-batch-evaluation.ipynb` | Evaluating model outputs at scale, statistical analysis |
 | `03-regression-detection.ipynb` | Tracking quality over time, detecting degradation |
 | `04-chatbot-validation.ipynb` | Real-world use case: validating chatbot responses |
 **Hosting:** JupyterLite (static files, runs in browser via WebAssembly)
 **Deployment:** Self-hosted alongside Streamlit demo
 **Why both:**
 | Demo Type | Audience | Value |
 |-----------|----------|-------|
 | Streamlit | General visitors | Quick, interactive, no friction |
 | Notebooks | Data/ML recruiters | Shows analytical depth, speaks their language |
@@ -0,0 +1,135 @@
 """Basic text validation examples.
 Demonstrates core Veritext functionality:
 - Single metric scoring (BLEU, ROUGE)
 - Validator usage with thresholds
 - Composite validators (all_of, any_of)
 - Constraint validators (length, readability)
 """
 from veritext.core.types import ValidationContext
 from veritext.metrics import Bleu, Rouge
 from veritext.validators import (
    all_of,
    any_of,
    bleu,
    contains,
    excludes,
    length,
    readability,
    rouge,
 )
 def metric_scoring_example() -> None:
    """Score text using individual metrics."""
    candidate = "The quick brown fox jumps over the lazy dog."
    reference = "A fast brown fox leaps over a sleepy dog."
    # BLEU scoring (translation quality)
    bleu_metric = Bleu()
    bleu_result = bleu_metric.score(candidate, reference)
    print("BLEU Scores:")
    print(f"  BLEU-1: {bleu_result.bleu1:.3f}")
    print(f"  BLEU-4: {bleu_result.bleu4:.3f}")
    print(f"  Brevity penalty: {bleu_result.brevity_penalty:.3f}")
    # ROUGE scoring (summary quality)
    rouge_metric = Rouge()
    rouge_result = rouge_metric.score(candidate, reference)
    print("\nROUGE Scores:")
    print(f"  ROUGE-1 F1: {rouge_result.rouge1.fmeasure:.3f}")
    print(f"  ROUGE-L F1: {rouge_result.rouge_l.fmeasure:.3f}")
 def validator_example() -> None:
    """Use validators to make pass/fail decisions."""
    reference = "Machine learning models require training data."
    candidate = "ML models need training data to learn patterns."
    context = ValidationContext(reference=reference)
    # BLEU validator with minimum threshold
    bleu_validator = bleu(min_score=0.3)
    result = bleu_validator.check(candidate, context)
    print(f"\nBLEU validation (min 0.3): {'PASS' if result.passed else 'FAIL'}")
    # ROUGE validator
    rouge_validator = rouge(min_score=0.5)
    result = rouge_validator.check(candidate, context)
    print(f"ROUGE validation (min 0.5): {'PASS' if result.passed else 'FAIL'}")
 def composite_validator_example() -> None:
    """Combine validators with all_of and any_of."""
    reference = "The product launch exceeded all expectations."
    candidate = "The product release performed beyond expectations."
    context = ValidationContext(reference=reference)
    # All checks must pass
    strict_validator = all_of(
        [
            bleu(min_score=0.2),
            rouge(min_score=0.4),
            length(max_chars=100),
        ]
    )
    result = strict_validator.check(candidate, context)
    print(f"\nStrict (all_of): {'PASS' if result.passed else 'FAIL'}")
    if not result.passed:
        print(f"  Failures: {result.failure_summary}")
    # At least one check must pass
    flexible_validator = any_of(
        [
            bleu(min_score=0.8),  # Unlikely to pass
            rouge(min_score=0.4),  # More likely
        ]
    )
    result = flexible_validator.check(candidate, context)
    print(f"Flexible (any_of): {'PASS' if result.passed else 'FAIL'}")
 def constraint_validator_example() -> None:
    """Use constraint validators for text properties."""
    text = "This short guide explains the basics clearly."
    context = ValidationContext()  # No reference needed for constraints
    # Length constraints
    length_validator = length(min_chars=20, max_chars=100, min_words=5, max_words=20)
    result = length_validator.check(text, context)
    print(f"\nLength check: {'PASS' if result.passed else 'FAIL'}")
    # Readability (Flesch-Kincaid)
    readability_validator = readability(max_grade=10.0)
    result = readability_validator.check(text, context)
    print(f"Readability (grade <= 10): {'PASS' if result.passed else 'FAIL'}")
    # Content patterns
    contains_validator = contains(patterns=["guide", "basics"])
    result = contains_validator.check(text, context)
    print(f"Contains required terms: {'PASS' if result.passed else 'FAIL'}")
    excludes_validator = excludes(patterns=["error", "warning"])
    result = excludes_validator.check(text, context)
    print(f"Excludes forbidden terms: {'PASS' if result.passed else 'FAIL'}")
 def main() -> None:
    """Run all examples."""
    print("=" * 60)
    print("Veritext Basic Validation Examples")
    print("=" * 60)
    metric_scoring_example()
    validator_example()
    composite_validator_example()
    constraint_validator_example()
    print("\n" + "=" * 60)
    print("All examples completed.")
 if __name__ == "__main__":
    main()
@@ -0,0 +1,160 @@
 """Benchmark quality tracking with regression detection.
 Demonstrates Veritext's benchmark module for CI integration:
 - Creating a benchmark suite
 - Running evaluations and storing results
 - Checking for quality regression
 - CI integration pattern with exit codes
 """
 import tempfile
 from pathlib import Path
 from veritext.benchmark import Benchmark
 from veritext.core.exceptions import RegressionDetectedError
 def create_sample_data() -> tuple[list[str], list[str]]:
    """Create sample candidate/reference pairs for benchmarking."""
    # Simulated summarisation outputs and references
    candidates = [
        "The new policy aims to reduce carbon emissions by 50% by 2030.",
        "Scientists discovered a new species of deep-sea fish.",
        "The company reported record profits in the third quarter.",
        "Researchers developed a breakthrough treatment for the disease.",
        "The city plans to expand public transportation routes.",
    ]
    references = [
        "The policy targets a 50% reduction in carbon emissions by 2030.",
        "A new deep-sea fish species was discovered by marine biologists.",
        "Record profits were announced by the company for Q3.",
        "A breakthrough disease treatment was developed by researchers.",
        "Public transport expansion is planned for the city.",
    ]
    return candidates, references
 def run_benchmark_example() -> None:
    """Run a benchmark evaluation and view results."""
    # Use a temp directory for this example
    with tempfile.TemporaryDirectory() as tmpdir:
        storage_path = Path(tmpdir) / "benchmarks"
        # Create benchmark suite
        bench = Benchmark("summariser_quality", storage_path=storage_path)
        candidates, references = create_sample_data()
        # Run evaluation
        print("Running benchmark evaluation...")
        run = bench.evaluate(
            candidates=candidates,
            references=references,
            metrics=["rouge_l", "bleu4"],
            metadata={"model": "v1.0", "dataset": "test"},
        )
        print("\nBenchmark run completed:")
        print(f"  Run ID: {run.id[:8]}...")
        print(f"  Samples: {run.sample_count}")
        print("  Metrics:")
        for name, value in run.metrics.items():
            print(f"    {name}: {value:.4f}")
 def regression_detection_example() -> None:
    """Demonstrate regression detection with historical comparison."""
    with tempfile.TemporaryDirectory() as tmpdir:
        storage_path = Path(tmpdir) / "benchmarks"
        bench = Benchmark("summariser_quality", storage_path=storage_path)
        candidates, references = create_sample_data()
        # Simulate historical runs with stable quality
        print("\nBuilding baseline with historical runs...")
        for i in range(5):
            bench.evaluate(
                candidates=candidates,
                references=references,
                metrics=["rouge_l", "bleu4"],
                metadata={"run": f"baseline_{i}"},
            )
            print(f"  Baseline run {i + 1} recorded")
        # Check regression (no degradation expected)
        report = bench.check_regression(tolerance=0.05, window=5)
        print(f"\nRegression check: {'DETECTED' if report.detected else 'NONE'}")
        # Simulate a degraded model
        print("\nSimulating degraded model output...")
        degraded_candidates = [
            "Policy carbon emissions.",  # Much shorter/worse
            "Fish discovered.",
            "Company profits.",
            "Treatment developed.",
            "Transport expansion.",
        ]
        bench.evaluate(
            candidates=degraded_candidates,
            references=references,
            metrics=["rouge_l", "bleu4"],
            metadata={"model": "v1.1-broken"},
        )
        # Check regression (should detect)
        report = bench.check_regression(tolerance=0.05, window=5)
        print(f"Regression check: {'DETECTED' if report.detected else 'NONE'}")
        if report.detected:
            print("\nRegression details:")
            for metric, delta in report.deltas.items():
                baseline = report.baseline.get(metric, 0)
                current = report.current.get(metric, 0)
                print(f"  {metric}: {baseline:.4f} -> {current:.4f} ({delta:+.4f})")
 def ci_integration_example() -> None:
    """CI integration pattern using assert_no_regression()."""
    with tempfile.TemporaryDirectory() as tmpdir:
        storage_path = Path(tmpdir) / "benchmarks"
        bench = Benchmark("ci_check", storage_path=storage_path)
        candidates, references = create_sample_data()
        # Build baseline
        for _ in range(3):
            bench.evaluate(candidates, references, metrics=["rouge_l"])
        # Simulate CI check
        print("\n" + "=" * 50)
        print("CI Integration Example")
        print("=" * 50)
        print("\nRunning evaluation...")
        bench.evaluate(candidates, references, metrics=["rouge_l"])
        print("Checking for regression...")
        try:
            bench.assert_no_regression(tolerance=0.05, window=3)
            print("No regression detected.")
            print("CI status: EXIT 0")
        except RegressionDetectedError as e:
            print(f"Regression detected: {e}")
            print("CI status: EXIT 1")
 def main() -> None:
    """Run all benchmark examples."""
    print("=" * 60)
    print("Veritext Benchmark & Regression Detection Examples")
    print("=" * 60)
    run_benchmark_example()
    regression_detection_example()
    ci_integration_example()
    print("\n" + "=" * 60)
    print("All examples completed.")
 if __name__ == "__main__":
    main()
@@ -0,0 +1,140 @@
 """Pytest integration for chatbot testing.
 Demonstrates Veritext's pytest plugin for testing chatbot responses:
 - validate_text() assertion function
 - Custom test fixtures
 - Test organisation with markers
 """
 import pytest
 from veritext.pytest_plugin import validate_text
 # Sample chatbot responses for testing
 CHATBOT_RESPONSES = {
    "greeting": {
        "input": "Hello!",
        "response": "Hi there! How can I help you today?",
        "expected_keywords": ["help", "hi"],
    },
    "weather": {
        "input": "What's the weather like?",
        "response": "I don't have access to real-time weather data, but you can "
        "check a weather service like weather.com for current conditions.",
        "expected_keywords": ["weather", "check"],
    },
    "farewell": {
        "input": "Goodbye!",
        "response": "Goodbye! Have a great day!",
        "expected_keywords": ["goodbye", "day"],
    },
 }
 # Fixtures for common test setup
 @pytest.fixture
 def greeting_response() -> str:
    """Provide a sample greeting response."""
    return CHATBOT_RESPONSES["greeting"]["response"]
 @pytest.fixture
 def weather_response() -> str:
    """Provide a sample weather response."""
    return CHATBOT_RESPONSES["weather"]["response"]
 # Basic validation tests
 class TestResponseQuality:
    """Test chatbot response quality using Veritext."""
    def test_greeting_length(self, greeting_response: str) -> None:
        """Greeting responses should be concise."""
        validate_text(
            greeting_response,
            min_length=10,
            max_length=100,
        )
    def test_greeting_readability(self, greeting_response: str) -> None:
        """Greeting responses should be easy to read."""
        validate_text(
            greeting_response,
            max_reading_grade=8.0,
        )
    def test_greeting_contains_keywords(self, greeting_response: str) -> None:
        """Greeting should contain expected terms."""
        validate_text(
            greeting_response,
            must_contain=["help"],
        )
    def test_weather_response_quality(self, weather_response: str) -> None:
        """Weather response should be informative and readable."""
        validate_text(
            weather_response,
            min_length=50,
            max_length=500,
            max_reading_grade=10.0,
            must_contain=["weather"],
        )
 # Tests with reference comparison
 class TestResponseSimilarity:
    """Test response similarity against reference texts."""
    def test_greeting_similarity(self) -> None:
        """Greeting should match expected style."""
        reference = "Hello! How may I assist you today?"
        response = CHATBOT_RESPONSES["greeting"]["response"]
        validate_text(
            response,
            reference=reference,
            min_rouge=0.3,  # Allow variation in wording
            min_length=10,
        )
    def test_farewell_similarity(self) -> None:
        """Farewell should match expected style."""
        reference = "Goodbye! Have a wonderful day!"
        response = CHATBOT_RESPONSES["farewell"]["response"]
        validate_text(
            response,
            reference=reference,
            min_rouge=0.5,
            must_contain=["goodbye"],
        )
 # Content safety tests
 class TestContentSafety:
    """Test responses for inappropriate content."""
    @pytest.mark.parametrize("response_key", ["greeting", "weather", "farewell"])
    def test_no_profanity(self, response_key: str) -> None:
        """Responses should not contain profanity."""
        response = CHATBOT_RESPONSES[response_key]["response"]
        validate_text(
            response,
            must_exclude=["damn", "hell", "crap"],
            min_length=1,
        )
    @pytest.mark.parametrize("response_key", ["greeting", "weather", "farewell"])
    def test_no_harmful_content(self, response_key: str) -> None:
        """Responses should not contain harmful instructions."""
        response = CHATBOT_RESPONSES[response_key]["response"]
        validate_text(
            response,
            must_exclude=["hack", "exploit", "attack"],
            min_length=1,
        )
 # Run tests when executed directly
 if __name__ == "__main__":
    pytest.main([__file__, "-v"])
@@ -1,6 +1,6 @@
 [project]
 name = "veritext"
-version = "0.1.0-dev"
+version = "0.1.0.dev0"
 description = "Semantic text validation framework"
 readme = "readme.md"
 requires-python = ">=3.11"
@@ -2,48 +2,398 @@
 Semantic text validation framework for Python.
-Validates text outputs against quality criteria using metrics like BLEU, ROUGE,
+Veritext validates text outputs against quality criteria using metrics like BLEU,
-and semantic similarity. Designed for developers building systems that produce
+ROUGE, and semantic similarity. Designed for developers building systems that produce
-text (chatbots, content generators, summarisation tools) who need automated
+text (chatbots, content generators, summarisation tools) who need automated quality
-quality assurance beyond simple string matching.
+assurance beyond simple string matching.
-## Status
+## Features
-Under active development. See [changelog.md](changelog.md) for progress.
+- **Multiple metrics** — BLEU, ROUGE, lexical similarity, readability, semantic
  embeddings
 - **Composable validators** — Build complex checks from simple primitives
 - **Native pytest integration** — `validate_text()` assertion for test suites
 - **Quality benchmarking** — Track metrics over time with regression detection
 - **CLI tools** — Command-line validation and benchmark management
 ## Installation
 ```bash
 pip install veritext
-# With semantic similarity support
+# With semantic similarity support (sentence-transformers)
 pip install veritext[semantic]
 ```
 ## Quick Start
 ```python
-from veritext import validators as v
 from veritext.core.types import ValidationContext
 from veritext.validators import all_of, bleu, length, rouge
-# Create validators
+# Create a validator
-validator = v.all_of([
+validator = all_of([
-    v.bleu(min_score=0.7),
+    bleu(min_score=0.5),
-    v.length(max_chars=500),
+    rouge(min_score=0.6),
    length(max_chars=500),
 ])
 # Validate text
-context = ValidationContext(reference="The cat sat on the mat.")
+context = ValidationContext(reference="The quick brown fox jumps over the lazy dog.")
-result = validator.check("A cat is sitting on the mat.", context)
+result = validator.check("A fast brown fox leaps over a sleepy dog.", context)
-if not result.passed:
+if result.passed:
    print("Validation passed!")
 else:
    print(result.failure_summary)
 ```
-## Documentation
+## Metrics
- [Project Plan](docs/project-plan.md)
+Veritext provides several metrics for text evaluation.
- [Implementation Plan](docs/implementation-plan.md)
+
 ### BLEU
 Measures n-gram precision against reference text. Useful for translation and
 generation quality.
 ```python
 from veritext.metrics import Bleu
 bleu = Bleu()
 result = bleu.score(
    candidate="The cat sat on the mat.",
    reference="A cat is sitting on the mat.",
 )
 print(f"BLEU-4: {result.bleu4:.3f}")  # Uses 1-4 gram precision
 print(f"BLEU-1: {result.bleu1:.3f}")  # Unigram precision only
 ```
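To make the two scores above more concrete, here is a minimal standard-library sketch of clipped n-gram precision and the brevity penalty, the two building blocks of BLEU. It is an illustration only and not Veritext's implementation: the helper names (`tokenise`, `ngrams`, `clipped_precision`, `brevity_penalty`) and the lower-casing, punctuation-stripping tokeniser are assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenise(text: str) -> list[str]:
    """Hypothetical tokeniser: lower-case and keep runs of word characters."""
    return re.findall(r"\w+", text.lower())


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count every contiguous n-gram in the token list."""
    return Counter(tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate: str, reference: str, n: int = 1) -> float:
    """Fraction of candidate n-grams that also occur in the reference.

    Each n-gram's count is clipped to its count in the reference, so
    repeating a matching word does not inflate the score.
    """
    cand = ngrams(tokenise(candidate), n)
    ref = ngrams(tokenise(reference), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


def brevity_penalty(candidate: str, reference: str) -> float:
    """Penalise candidates shorter than the reference; 1.0 otherwise."""
    c, r = len(tokenise(candidate)), len(tokenise(reference))
    if c == 0:
        return 0.0
    return 1.0 if c >= r else math.exp(1 - r / c)
```

With this tokeniser, the candidate from the example above has six tokens and "the" occurs twice in it but only once in the reference, so the clipped unigram precision is 4/6 rather than 5/6.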
 ### ROUGE
 Measures recall-oriented overlap with reference text. Useful for summarisation.
 ```python
 from veritext.metrics import Rouge
 rouge = Rouge()
 result = rouge.score(
    candidate="Scientists found a new planet.",
    reference="Researchers discovered a new planet in the solar system.",
 )
 print(f"ROUGE-1 F1: {result.rouge1.fmeasure:.3f}")  # Unigram overlap
 print(f"ROUGE-L F1: {result.rouge_l.fmeasure:.3f}")  # Longest common subsequence
 ```
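The ROUGE-L figure is based on the longest common subsequence (LCS) of the two token sequences. The sketch below shows one way to compute it with the standard library; it is not Veritext's code. The function names and the simple tokeniser are assumptions, and F1 is taken as the plain harmonic mean of precision and recall.

```python
import re


def tokenise(text: str) -> list[str]:
    """Hypothetical tokeniser: lower-case and keep runs of word characters."""
    return re.findall(r"\w+", text.lower())


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l(candidate: str, reference: str) -> tuple[float, float, float]:
    """Return (precision, recall, F1) derived from the LCS length."""
    cand, ref = tokenise(candidate), tokenise(reference)
    lcs = lcs_length(cand, ref) if cand and ref else 0
    if lcs == 0:
        # Single guard covering empty inputs and texts with nothing in common
        return 0.0, 0.0, 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return precision, recall, 2 * precision * recall / (precision + recall)
```

For the pair in the example above, the LCS under this tokeniser is "a new planet" (three tokens), giving a precision of 3/5 and a recall of 3/9.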
 ### Lexical Similarity
 Measures token overlap using Jaccard similarity.
 ```python
 from veritext.metrics import Lexical
 lexical = Lexical()
 result = lexical.score(
    candidate="The quick brown fox",
    reference="The fast brown fox",
 )
 print(f"Jaccard: {result.jaccard:.3f}")
 print(f"Token overlap: {result.token_overlap:.3f}")
 ```
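Jaccard similarity is the size of the intersection of the two token sets divided by the size of their union. A minimal sketch, assuming a lower-casing word tokeniser and assuming that two empty texts score 0.0 (Veritext's own conventions may differ):

```python
import re


def token_set(text: str) -> set[str]:
    """Hypothetical tokeniser: the set of lower-cased words in the text."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(candidate: str, reference: str) -> float:
    """Shared tokens divided by all distinct tokens across both texts."""
    a, b = token_set(candidate), token_set(reference)
    union = a | b
    if not union:
        return 0.0  # Assumption: two empty texts score 0 rather than 1
    return len(a & b) / len(union)
```

For the example above, the texts share three tokens ("the", "brown", "fox") out of five distinct tokens, so this sketch returns 0.6.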
 ### Readability
 Computes Flesch-Kincaid scores for text complexity.
 ```python
 from veritext.metrics import Readability
 readability = Readability()
 result = readability.score("This is a simple sentence.")
 print(f"Grade level: {result.flesch_kincaid_grade:.1f}")
 print(f"Reading ease: {result.flesch_reading_ease:.1f}")
 ```
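Both scores are linear functions of words per sentence and syllables per word. The sketch below applies the standard Flesch-Kincaid grade and Flesch reading ease formulas; the vowel-group syllable counter and the sentence splitter are rough assumptions for illustration, so the numbers will not necessarily match Veritext's output.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count groups of consecutive vowels, minimum one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def readability(text: str) -> tuple[float, float]:
    """Return (Flesch-Kincaid grade, Flesch reading ease)."""
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0, 0.0
    # Text without sentence-ending punctuation is treated as one sentence,
    # which also avoids a division by zero.
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    syllables = sum(count_syllables(word) for word in words)
    words_per_sentence = len(words) / sentences
    syllables_per_word = syllables / len(words)
    grade = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    ease = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    return grade, ease
```

A higher grade means harder text, while a higher reading ease means easier text, which is why the validators pair `max_grade` with `min_ease`.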
 ### Semantic Similarity (Optional)
 Requires `pip install veritext[semantic]`.
 ```python
 from veritext.semantic import SemanticSimilarity
 semantic = SemanticSimilarity()
 result = semantic.score(
    candidate="The dog is running in the park.",
    reference="A canine is jogging through the garden.",
 )
 print(f"Similarity: {result.score:.3f}")
 ```
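The changelog notes that `SemanticSimilarity` caches embeddings with LRU eviction and a configurable `cache_max_size` (default: 1000 embeddings). As a rough illustration of that eviction policy, and not the library's actual code, the sketch below wraps a caller-supplied embedding function. `LruEmbeddingCache` and `embed` are hypothetical names.

```python
from collections import OrderedDict
from collections.abc import Callable


class LruEmbeddingCache:
    """Hypothetical bounded cache that evicts the least recently used entry."""

    def __init__(
        self, embed: Callable[[str], list[float]], max_size: int = 1000
    ) -> None:
        if max_size < 1:
            raise ValueError("max_size must be at least 1")
        self._embed = embed
        self._max_size = max_size
        self._entries: OrderedDict[str, list[float]] = OrderedDict()

    def get(self, text: str) -> list[float]:
        """Return the cached embedding, computing and storing it on a miss."""
        if text in self._entries:
            self._entries.move_to_end(text)  # Mark as most recently used
            return self._entries[text]
        vector = self._embed(text)
        self._entries[text] = vector
        if len(self._entries) > self._max_size:
            self._entries.popitem(last=False)  # Drop the least recently used
        return vector

    def __len__(self) -> int:
        return len(self._entries)
```

Bounding the cache this way keeps memory use flat during long batch runs while still avoiding repeated encoding of texts that recur often.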
 ## Validators
 Validators wrap metrics with thresholds to make pass/fail decisions.
 ### Metric-Based Validators
 ```python
 from veritext.core.types import ValidationContext
 from veritext.validators import bleu, lexical, rouge
 context = ValidationContext(reference="Reference text here.")
 # BLEU validation
 validator = bleu(min_score=0.5, variant=4)  # BLEU-4
 result = validator.check("Candidate text here.", context)
 # ROUGE validation
 validator = rouge(min_score=0.6, variant="l")  # ROUGE-L
 result = validator.check("Candidate text here.", context)
 # Lexical validation
 validator = lexical(min_jaccard=0.3, min_overlap=0.5)
 result = validator.check("Candidate text here.", context)
 ```
 ### Constraint Validators
 These don't require reference text.
 ```python
 from veritext.core.types import ValidationContext
 from veritext.validators import contains, excludes, length, readability
 context = ValidationContext()  # No reference needed
 # Length constraints
 validator = length(min_chars=50, max_chars=500, min_words=10)
 result = validator.check("Your text here...", context)
 # Readability constraints
 validator = readability(max_grade=8.0, min_ease=60.0)
 result = validator.check("Your text here...", context)
 # Content requirements
 validator = contains(patterns=["important", "keyword"])
 result = validator.check("This important text has a keyword.", context)
 # Content exclusions
 validator = excludes(patterns=["forbidden", "banned"])
 result = validator.check("This text is clean.", context)
 ```
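Per the changelog, `ContainsValidator` and `ExcludesValidator` validate their regex patterns at construction time rather than during `check()`. A minimal sketch of that fail-fast pattern follows. The `ContainsPatterns` class is hypothetical, and it assumes case-sensitive matching where every pattern must be found; the real validators may behave differently.

```python
import re


class ContainsPatterns:
    """Hypothetical check: every pattern must match somewhere in the text."""

    def __init__(self, patterns: list[str]) -> None:
        # Compiling here means a malformed pattern raises re.error now,
        # not on the first call to check().
        self._compiled = [re.compile(pattern) for pattern in patterns]

    def check(self, text: str) -> bool:
        """Return True when all compiled patterns are found in the text."""
        return all(regex.search(text) for regex in self._compiled)
```

Failing at construction surfaces a bad pattern when the validator is defined, for example at test collection time, instead of midway through a validation run.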
 ### Composite Validators
 Combine multiple checks with logical operators.
 ```python
 from veritext.validators import all_of, any_of, bleu, length, rouge
 # All checks must pass
 validator = all_of([
    bleu(min_score=0.5),
    rouge(min_score=0.6),
    length(max_chars=500),
 ])
 # At least one check must pass
 validator = any_of([
    bleu(min_score=0.7),
    rouge(min_score=0.7),
 ])
 ```
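The logic behind the two combinators can be sketched with plain predicates. This is a simplified stand-in, not the library's API: here a check is just a function from text to `bool`, whereas Veritext's validators return result objects. The names below are hypothetical. The sketch also copies the incoming list, which is one way to avoid the list aliasing problem mentioned in the changelog.

```python
from collections.abc import Callable, Sequence

Check = Callable[[str], bool]  # Hypothetical stand-in for a validator


def all_of(checks: Sequence[Check]) -> Check:
    """Pass only when every check passes."""
    frozen = tuple(checks)  # Copy, so later edits to the caller's list have no effect
    return lambda text: all(check(text) for check in frozen)


def any_of(checks: Sequence[Check]) -> Check:
    """Pass when at least one check passes."""
    frozen = tuple(checks)
    return lambda text: any(check(text) for check in frozen)


def short_enough(text: str) -> bool:
    """Example check: at most 20 characters."""
    return len(text) <= 20


def mentions_fox(text: str) -> bool:
    """Example check: contains the word 'fox', ignoring case."""
    return "fox" in text.lower()


strict = all_of([short_enough, mentions_fox])
flexible = any_of([short_enough, mentions_fox])
```

In this sketch, `strict` rejects a long sentence about a fox because the length check fails, while `flexible` accepts it because one check is enough.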
 ## Pytest Plugin
 Veritext provides native pytest integration for testing text quality.
 ### Basic Usage
 ```python
 from veritext.pytest_plugin import validate_text
 def test_response_quality():
    response = "This is a helpful response to your question."
    validate_text(
        response,
        min_length=20,
        max_length=200,
        max_reading_grade=10.0,
        must_contain=["helpful"],
        must_exclude=["error", "sorry"],
    )
 def test_summary_similarity():
    summary = "Scientists discovered a new planet."
    reference = "Researchers found a new planet in our solar system."
    validate_text(
        summary,
        reference=reference,
        min_rouge=0.5,
        min_length=10,
    )
 ```
 ### Available Parameters
 | Parameter | Description |
 |-----------|-------------|
 | `reference` | Reference text for comparison metrics |
 | `min_bleu` | Minimum BLEU-4 score (0.0-1.0) |
 | `min_rouge` | Minimum ROUGE-L F1 score (0.0-1.0) |
 | `min_semantic` | Minimum semantic similarity (0.0-1.0) |
 | `min_length` | Minimum character count |
 | `max_length` | Maximum character count |
 | `max_reading_grade` | Maximum Flesch-Kincaid grade level |
 | `must_contain` | List of required patterns |
 | `must_exclude` | List of forbidden patterns |
 ## Benchmarking
 Track text quality over time and detect regressions.
 ### Running Benchmarks
 ```python
 from veritext.benchmark import Benchmark
 # Create a benchmark suite
 bench = Benchmark("summariser_quality", storage_path="benchmarks/")
 # Evaluate a batch of outputs
 candidates = ["Summary 1...", "Summary 2...", "Summary 3..."]
 references = ["Reference 1...", "Reference 2...", "Reference 3..."]
 run = bench.evaluate(
    candidates=candidates,
    references=references,
    metrics=["rouge_l", "bleu4"],
    metadata={"model": "v1.2", "git_sha": "abc123"},
 )
 print(f"ROUGE-L: {run.metrics['rouge_l']:.4f}")
 print(f"BLEU-4: {run.metrics['bleu4']:.4f}")
 ```
 ### Regression Detection
 ```python
 from veritext.benchmark import Benchmark
 from veritext.core.exceptions import RegressionDetectedError
 bench = Benchmark("summariser_quality")
 # Check for regression against historical baseline
 report = bench.check_regression(tolerance=0.05, window=10)
 if report.detected:
    print("Quality regression detected!")
    for metric, delta in report.deltas.items():
        print(f"  {metric}: {delta:+.4f}")
 # Or raise an exception for CI integration
 try:
    bench.assert_no_regression(tolerance=0.05)
 except RegressionDetectedError as e:
    print(f"CI failure: {e}")
    raise SystemExit(1)
 ```
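
The roles of `tolerance` and `window` can be illustrated with a small standalone sketch. It assumes the baseline is a plain mean over the most recent `window` runs and that a metric counts as regressed when it drops below the baseline by more than `tolerance`; the real `Benchmark` class may compute the baseline differently. `check_regression_sketch` is a hypothetical name.

```python
from statistics import fmean


def check_regression_sketch(
    history: list[dict[str, float]],
    current: dict[str, float],
    tolerance: float = 0.05,
    window: int = 10,
) -> tuple[bool, dict[str, float]]:
    """Compare current metrics with the mean of the last `window` runs.

    Assumption: the baseline is a simple mean; this is an illustration only.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    recent = history[-window:]
    deltas: dict[str, float] = {}
    for metric, value in current.items():
        past = [run[metric] for run in recent if metric in run]
        if not past:
            continue  # no baseline for this metric yet
        deltas[metric] = value - fmean(past)
    # A regression is a drop larger than the tolerance; gains never trigger it.
    detected = any(delta < -tolerance for delta in deltas.values())
    return detected, deltas


history = [{"rouge_l": 0.60, "bleu4": 0.40}, {"rouge_l": 0.64, "bleu4": 0.44}]
current = {"rouge_l": 0.50, "bleu4": 0.41}
detected, deltas = check_regression_sketch(history, current)
for metric, delta in sorted(deltas.items()):
    print(f"{metric}: {delta:+.4f}")
```

In this example ROUGE-L falls about 0.12 below its baseline, which exceeds the 0.05 tolerance, while the small BLEU-4 drop stays within it.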
 ### Viewing History
 ```python
 from veritext.benchmark import Benchmark
 bench = Benchmark("summariser_quality")
 for run in bench.get_history(limit=10):
    print(f"{run.timestamp}: rouge_l={run.metrics.get('rouge_l', 0):.4f}")
 ```
 ## CLI
 Veritext provides a command-line interface for validation and benchmarking.
 ### Validate Text
 ```bash
 # Inline validation
 veritext validate "Candidate text" -r "Reference text" -m bleu,rouge
 # File-based batch validation (JSONL with "candidate" and "reference" fields)
 veritext validate -f outputs.jsonl -m bleu,rouge,lexical
 # With a threshold: adds a PASS/FAIL status column (table output only)
 veritext validate "Text" -r "Reference" -m bleu -t 0.5
 # Output formats: table (default), json, simple
 veritext validate "Text" -r "Reference" -m bleu -o json
 ```
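
For scripted checks, the JSON output can be parsed and compared against a threshold. The sketch below assumes that `-o json` emits a flat JSON object mapping metric names to scores; the sample payload and the helper name `failing_metrics` are illustrative, not taken from Veritext.

```python
import json


def failing_metrics(json_output: str, threshold: float) -> dict[str, float]:
    """Return the metrics whose score falls below the threshold."""
    scores = json.loads(json_output)
    return {name: score for name, score in scores.items() if score < threshold}


# Example payload in the assumed shape of `-o json` output (values are made up).
sample = '{"bleu4": 0.4213, "rouge_l": 0.6127}'
failures = failing_metrics(sample, threshold=0.5)
print(failures)  # {'bleu4': 0.4213}
```

A CI step could capture the command's standard output, pass it to a helper like this, and exit with a non-zero status when the returned dictionary is not empty.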
 ### Benchmark Commands
 ```bash
 # Run a benchmark evaluation
 veritext benchmark run my_bench -f data.jsonl -m rouge_l,bleu4
 # View benchmark history
 veritext benchmark show my_bench --last 10
 # Check for regression (exits with code 1 if detected)
 veritext benchmark check my_bench --tolerance 0.05 --window 10
 ```
 ### JSONL Format
 For file-based operations, use JSONL with `candidate` and `reference` fields:
 ```json
 {"candidate": "Model output 1", "reference": "Expected output 1"}
 {"candidate": "Model output 2", "reference": "Expected output 2"}
 ```
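
A file in this format can be produced with the standard library alone. The following is a minimal sketch; `write_jsonl` is a hypothetical helper, and the temporary output location is chosen only to keep the example self-contained.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(path: Path, pairs: list[tuple[str, str]]) -> None:
    """Write (candidate, reference) pairs as one JSON object per line."""
    with path.open("w", encoding="utf-8") as f:
        for candidate, reference in pairs:
            record = {"candidate": candidate, "reference": reference}
            # ensure_ascii=False keeps non-ASCII text readable in the file.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


pairs = [
    ("Model output 1", "Expected output 1"),
    ("Model output 2", "Expected output 2"),
]
out_path = Path(tempfile.mkdtemp()) / "outputs.jsonl"
write_jsonl(out_path, pairs)
print(out_path.read_text(encoding="utf-8"))
```

Using `json.dumps` for each record, rather than formatting strings by hand, ensures quotes and newlines inside the text are escaped correctly.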
 ## Configuration
 Veritext uses environment variables for configuration:
 | Variable | Default | Description |
 |----------|---------|-------------|
 | `VERITEXT_LOG_LEVEL` | `INFO` | Logging level |
 | `VERITEXT_LOG_FORMAT` | `console` | Log format (`console` or `json`) |
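
Because the settings instance is cached with `@lru_cache`, the environment is read once and later changes to these variables are not picked up within the same process. The sketch below illustrates that behaviour with the standard library only; the real settings class is built on pydantic-settings, and `SettingsSketch` and `get_settings_sketch` are hypothetical names.

```python
import os
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class SettingsSketch:
    """Hypothetical stand-in for the real settings object."""

    log_level: str
    log_format: str


@lru_cache
def get_settings_sketch() -> SettingsSketch:
    """Read the environment once and reuse the result on later calls."""
    return SettingsSketch(
        log_level=os.environ.get("VERITEXT_LOG_LEVEL", "INFO"),
        log_format=os.environ.get("VERITEXT_LOG_FORMAT", "console"),
    )


os.environ["VERITEXT_LOG_LEVEL"] = "DEBUG"
first = get_settings_sketch()
os.environ["VERITEXT_LOG_LEVEL"] = "ERROR"  # too late: the result is cached
second = get_settings_sketch()
get_settings_sketch.cache_clear()  # force a re-read, for example in tests
third = get_settings_sketch()
print(first.log_level, second.log_level, third.log_level)  # DEBUG DEBUG ERROR
```

In practice this means the variables should be set before the first settings lookup, and tests that change them need to clear the cache.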
 ## Development
 ### Setup
 ```bash
 git clone https://gitea.kschappell.com/kschappell/veritext.git
 cd veritext
 uv sync --all-extras
 ```
 ### Quality Checks
 ```bash
 # Linting
 uv run ruff check .
 # Formatting
 uv run ruff format --check .
 # Type checking
 uv run mypy src/
 # Tests
 uv run pytest
 ```
 ### Running Examples
 ```bash
 uv run python examples/basic_validation.py
 uv run pytest examples/chatbot_testing.py -v
 uv run python examples/benchmark_regression.py
 ```
 ## Licence
@@ -0,0 +1,5 @@
 """CLI module: Command-line interface for Veritext."""
 from veritext.cli.main import app
 __all__ = ["app"]
@@ -0,0 +1,166 @@
 """Benchmark commands for quality tracking."""
 from pathlib import Path
 from typing import Annotated
 import typer
 from veritext.benchmark import Benchmark
 from veritext.cli.formatters import (
    console,
    format_benchmark_history,
    format_regression_report,
 )
 from veritext.cli.readers import read_jsonl
 benchmark_app = typer.Typer(
    name="benchmark",
    help="Track and compare text quality over time.",
    no_args_is_help=True,
 )
@benchmark_app.command("run")
 def benchmark_run(
    name: Annotated[
        str,
        typer.Argument(help="Name for this benchmark suite."),
    ],
    file: Annotated[
        Path,
        typer.Option("--file", "-f", help="JSONL file with candidate/reference pairs."),
    ],
    metrics: Annotated[
        str,
        typer.Option(
            "--metrics",
            "-m",
            help="Comma-separated metrics to track (e.g., rouge_l,bleu4).",
        ),
    ] = "rouge_l,bleu4",
    storage_path: Annotated[
        Path,
        typer.Option(
            "--storage",
            "-s",
            help="Directory for benchmark data storage.",
        ),
    ] = Path("benchmarks"),
 ) -> None:
    """
    Run a benchmark evaluation and store the results.
    Example:
        veritext benchmark run my_bench -f data.jsonl -m rouge_l,bleu4
    """
    # Read text pairs
    try:
        pairs = read_jsonl(file)
    except (FileNotFoundError, ValueError) as e:
        console.print(f"[red]Error:[/red] {e}")
        raise typer.Exit(code=1) from e
    if not pairs:
        console.print("[yellow]Warning:[/yellow] No text pairs found in file.")
        raise typer.Exit(code=0)
    # Parse metrics
    metric_names = [m.strip() for m in metrics.split(",")]
    candidates = [p.candidate for p in pairs]
    references = [p.reference for p in pairs]
    # Run benchmark
    bench = Benchmark(name, storage_path=storage_path)
    run = bench.evaluate(candidates, references, metrics=metric_names)
    console.print(f"[green]Benchmark '{name}' completed.[/green]")
    console.print(f"Samples: {run.sample_count}")
    console.print("\nMetrics:")
    for metric_name, value in sorted(run.metrics.items()):
        console.print(f"  {metric_name}: {value:.4f}")
@benchmark_app.command("show")
 def benchmark_show(
    name: Annotated[
        str,
        typer.Argument(help="Name of the benchmark suite."),
    ],
    last: Annotated[
        int,
        typer.Option("--last", "-n", help="Number of recent runs to show."),
    ] = 20,
    storage_path: Annotated[
        Path,
        typer.Option(
            "--storage",
            "-s",
            help="Directory for benchmark data storage.",
        ),
    ] = Path("benchmarks"),
 ) -> None:
    """
    Show benchmark history for a suite.
    Example:
        veritext benchmark show my_bench --last 10
    """
    bench = Benchmark(name, storage_path=storage_path)
    runs = bench.get_history(limit=last)
    if not runs:
        console.print(f"[yellow]No benchmark runs found for '{name}'.[/yellow]")
        raise typer.Exit(code=0)
    table = format_benchmark_history(runs)
    console.print(table)
@benchmark_app.command("check")
 def benchmark_check(
    name: Annotated[
        str,
        typer.Argument(help="Name of the benchmark suite."),
    ],
    tolerance: Annotated[
        float,
        typer.Option(
            "--tolerance",
            "-t",
            help="Maximum allowed metric drop (e.g., 0.05 = 5%).",
        ),
    ] = 0.05,
    window: Annotated[
        int,
        typer.Option(
            "--window",
            "-w",
            help="Number of historical runs for baseline.",
        ),
    ] = 10,
    storage_path: Annotated[
        Path,
        typer.Option(
            "--storage",
            "-s",
            help="Directory for benchmark data storage.",
        ),
    ] = Path("benchmarks"),
 ) -> None:
    """
    Check for quality regression against historical baseline.
    Exits with code 1 if regression detected (for CI integration).
    Example:
        veritext benchmark check my_bench --tolerance 0.05
    """
    bench = Benchmark(name, storage_path=storage_path)
    report = bench.check_regression(tolerance=tolerance, window=window)
    panel = format_regression_report(report)
    console.print(panel)
    if report.detected:
        raise typer.Exit(code=1)
@@ -0,0 +1,170 @@
 """Rich output formatters for CLI display."""
 import json
 from rich.console import Console
 from rich.panel import Panel
 from rich.table import Table
 from veritext.benchmark.models import BenchmarkRun, RegressionReport
 console = Console()
 def format_validation_table(
    results: dict[str, float],
    threshold: float | None = None,
 ) -> Table:
    """
    Format validation results as a Rich table.
    Args:
        results: Dictionary of metric names to scores.
        threshold: Optional threshold for pass/fail colouring.
    Returns:
        Rich Table object.
    """
    table = Table(title="Validation Results", show_header=True, header_style="bold")
    table.add_column("Metric", style="cyan")
    table.add_column("Score", justify="right")
    if threshold is not None:
        table.add_column("Status", justify="center")
    for metric, score in sorted(results.items()):
        score_str = f"{score:.4f}"
        if threshold is not None:
            status = "[green]PASS[/green]" if score >= threshold else "[red]FAIL[/red]"
            table.add_row(metric, score_str, status)
        else:
            table.add_row(metric, score_str)
    return table
 def format_validation_json(results: dict[str, float]) -> str:
    """
    Format validation results as JSON.
    Args:
        results: Dictionary of metric names to scores.
    Returns:
        JSON string.
    """
    return json.dumps(results, indent=2)
 def format_validation_simple(results: dict[str, float]) -> str:
    """
    Format validation results as simple text output.
    Args:
        results: Dictionary of metric names to scores.
    Returns:
        Simple text string with one metric per line.
    """
    lines = [f"{metric}: {score:.4f}" for metric, score in sorted(results.items())]
    return "\n".join(lines)
 def format_benchmark_history(runs: list[BenchmarkRun]) -> Table:
    """
    Format benchmark run history as a Rich table.
    Args:
        runs: List of BenchmarkRun objects (most recent first).
    Returns:
        Rich Table object.
    """
    if not runs:
        table = Table(title="Benchmark History")
        table.add_column("No runs found")
        return table
    # Get all metric names from the runs
    metric_names: set[str] = set()
    for run in runs:
        metric_names.update(run.metrics.keys())
    sorted_metrics = sorted(metric_names)
    table = Table(title="Benchmark History", show_header=True, header_style="bold")
    table.add_column("Timestamp", style="cyan")
    table.add_column("Samples", justify="right")
    for metric in sorted_metrics:
        table.add_column(metric, justify="right")
    for run in runs:
        timestamp = run.timestamp.strftime("%Y-%m-%d %H:%M")
        samples = str(run.sample_count)
        metric_values = [f"{run.metrics.get(m, 0.0):.4f}" for m in sorted_metrics]
        table.add_row(timestamp, samples, *metric_values)
    return table
 def format_regression_report(report: RegressionReport) -> Panel:
    """
    Format a regression report as a Rich panel.
    Args:
        report: RegressionReport object.
    Returns:
        Rich Panel object with formatted report.
    """
    if not report.detected:
        content = (
            f"[green]No regression detected.[/green]\nTolerance: {report.tolerance:.2%}"
        )
        return Panel(content, title="Regression Check", border_style="green")
    # Build regression details
    lines = [
        "[red]Regression detected![/red]",
        f"Tolerance: {report.tolerance:.2%}",
        "",
        "Metric details:",
    ]
    for metric in sorted(report.deltas.keys()):
        baseline = report.baseline.get(metric, 0.0)
        current = report.current.get(metric, 0.0)
        delta = report.deltas[metric]
        if delta < -report.tolerance:
            status = "[red]REGRESSED[/red]"
        else:
            status = "[green]OK[/green]"
        lines.append(
            f"  {metric}: {current:.4f} (baseline: {baseline:.4f}, "
            f"delta: {delta:+.4f}) {status}"
        )
    return Panel("\n".join(lines), title="Regression Check", border_style="red")
 def print_validation_output(
    results: dict[str, float],
    output_format: str = "table",
    threshold: float | None = None,
 ) -> None:
    """
    Print validation results in the specified format.
    Args:
        results: Dictionary of metric names to scores.
        output_format: Output format ('table', 'json', or 'simple').
        threshold: Optional threshold for pass/fail colouring (table only).
    """
    if output_format == "json":
        console.print(format_validation_json(results))
    elif output_format == "simple":
        console.print(format_validation_simple(results))
    else:
        console.print(format_validation_table(results, threshold))
@@ -0,0 +1,37 @@
 """Veritext CLI entry point."""
 import typer
 import veritext
 from veritext.cli.benchmark import benchmark_app
 from veritext.cli.validate import validate
 app = typer.Typer(
    name="veritext",
    help="Semantic text validation framework.",
    no_args_is_help=True,
 )
 # Register commands
 app.command()(validate)
 app.add_typer(benchmark_app)
@app.callback(invoke_without_command=True)
 def main(
    version: bool | None = typer.Option(
        None,
        "--version",
        "-V",
        help="Show version and exit.",
        is_eager=True,
    ),
 ) -> None:
    """Veritext: Semantic text validation framework for Python."""
    if version:
        typer.echo(f"veritext {veritext.__version__}")
        raise typer.Exit()
 if __name__ == "__main__":
    app()
@@ -0,0 +1,120 @@
 """Input readers for CLI operations."""
 import json
 from dataclasses import dataclass
 from pathlib import Path
@dataclass
 class TextPair:
    """A candidate-reference text pair for validation."""
    candidate: str
    reference: str
 def read_jsonl(path: Path) -> list[TextPair]:
    """
    Read text pairs from a JSONL file.
    Each line must be a JSON object with 'candidate' and 'reference' keys.
    Args:
        path: Path to the JSONL file.
    Returns:
        List of TextPair objects.
    Raises:
        FileNotFoundError: If the file does not exist.
        ValueError: If any line is malformed or missing required keys.
    """
    if not path.exists():
        raise FileNotFoundError(f"File not found: {path}")
    pairs: list[TextPair] = []
    with path.open(encoding="utf-8") as f:
        for line_num, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                data = json.loads(line)
            except json.JSONDecodeError as e:
                raise ValueError(f"Invalid JSON on line {line_num}: {e}") from e
            if not isinstance(data, dict):
                raise ValueError(f"Expected a JSON object on line {line_num}")
            if "candidate" not in data:
                raise ValueError(f"Missing 'candidate' key on line {line_num}")
            if "reference" not in data:
                raise ValueError(f"Missing 'reference' key on line {line_num}")
            pairs.append(
                TextPair(
                    candidate=str(data["candidate"]),
                    reference=str(data["reference"]),
                )
            )
    return pairs
 def read_paired_jsonl(candidates_path: Path, references_path: Path) -> list[TextPair]:
    """
    Read text pairs from separate candidate and reference JSONL files.
    Each file should contain one JSON object per line with a 'text' key.
    Args:
        candidates_path: Path to the candidates JSONL file.
        references_path: Path to the references JSONL file.
    Returns:
        List of TextPair objects.
    Raises:
        FileNotFoundError: If either file does not exist.
        ValueError: If files have different lengths or are malformed.
    """
    candidates = _read_text_jsonl(candidates_path, "candidates")
    references = _read_text_jsonl(references_path, "references")
    if len(candidates) != len(references):
        raise ValueError(
            f"Number of candidates ({len(candidates)}) does not match "
            f"number of references ({len(references)})"
        )
    return [
        TextPair(candidate=c, reference=r)
        for c, r in zip(candidates, references, strict=True)
    ]
 def _read_text_jsonl(path: Path, label: str) -> list[str]:
    """Read text values from a JSONL file with 'text' key per line."""
    if not path.exists():
        raise FileNotFoundError(f"{label.capitalize()} file not found: {path}")
    texts: list[str] = []
    with path.open(encoding="utf-8") as f:
        for line_num, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                data = json.loads(line)
            except json.JSONDecodeError as e:
                raise ValueError(
                    f"Invalid JSON in {label} file on line {line_num}: {e}"
                ) from e
            if not isinstance(data, dict):
                raise ValueError(
                    f"Expected a JSON object in {label} file on line {line_num}"
                )
            if "text" not in data:
                raise ValueError(
                    f"Missing 'text' key in {label} file on line {line_num}"
                )
            texts.append(str(data["text"]))
    return texts
@@ -0,0 +1,250 @@
 """Validate command for computing text metrics."""
 from pathlib import Path
 from typing import Annotated
 import typer
 from veritext.cli.formatters import console, print_validation_output
 from veritext.cli.readers import read_jsonl, read_paired_jsonl
 from veritext.metrics.bleu import Bleu
 from veritext.metrics.lexical import Lexical
 from veritext.metrics.rouge import Rouge
 # Available metrics
 AVAILABLE_METRICS = frozenset(
    {"bleu", "bleu1", "bleu2", "bleu3", "bleu4", "rouge", "rouge_l", "lexical"}
 )
 # Lazily-initialised metric instances
 _bleu: Bleu | None = None
 _rouge: Rouge | None = None
 _lexical: Lexical | None = None
 def _get_bleu() -> Bleu:
    """Get or create the BLEU metric instance."""
    global _bleu
    if _bleu is None:
        _bleu = Bleu()
    return _bleu
 def _get_rouge() -> Rouge:
    """Get or create the ROUGE metric instance."""
    global _rouge
    if _rouge is None:
        _rouge = Rouge()
    return _rouge
 def _get_lexical() -> Lexical:
    """Get or create the lexical metric instance."""
    global _lexical
    if _lexical is None:
        _lexical = Lexical()
    return _lexical
 # Metric helpers: each returns a dict mapping result keys to scores.
 # - *_single(candidate, reference, ...): scores for a single text pair
 # - *_batch(candidates, references, ...): mean scores across a batch
 def _bleu_single(candidate: str, reference: str, key: str) -> dict[str, float]:
    """Extract a BLEU score for single mode."""
    result = _get_bleu().score(candidate, reference)
    return {key: getattr(result, key)}
 def _bleu_batch(
    candidates: list[str], references: list[str], key: str
 ) -> dict[str, float]:
    """Extract a BLEU score for batch mode."""
    batch = _get_bleu().batch_score(candidates, references)
    stats = batch.stats.get(key)
    return {key: stats.mean} if stats else {}
 def _rouge_single(candidate: str, reference: str) -> dict[str, float]:
    """Extract ROUGE-L F-measure for single mode."""
    result = _get_rouge().score(candidate, reference)
    return {"rouge_l": result.rouge_l.fmeasure}
 def _rouge_batch(candidates: list[str], references: list[str]) -> dict[str, float]:
    """Extract ROUGE-L F-measure for batch mode."""
    batch = _get_rouge().batch_score(candidates, references)
    stats = batch.stats.get("rouge_l_fmeasure")
    return {"rouge_l": stats.mean} if stats else {}
 def _lexical_single(candidate: str, reference: str) -> dict[str, float]:
    """Extract lexical scores for single mode."""
    result = _get_lexical().score(candidate, reference)
    return {"jaccard": result.jaccard, "token_overlap": result.token_overlap}
 def _lexical_batch(candidates: list[str], references: list[str]) -> dict[str, float]:
    """Extract lexical scores for batch mode."""
    batch = _get_lexical().batch_score(candidates, references)
    results: dict[str, float] = {}
    jaccard_stats = batch.stats.get("jaccard")
    overlap_stats = batch.stats.get("token_overlap")
    if jaccard_stats:
        results["jaccard"] = jaccard_stats.mean
    if overlap_stats:
        results["token_overlap"] = overlap_stats.mean
    return results
 def _compute_metrics(
    candidate: str,
    reference: str,
    metric_names: list[str],
 ) -> dict[str, float]:
    """Compute requested metrics for a single text pair."""
    results: dict[str, float] = {}
    for metric in metric_names:
        if metric in ("bleu", "bleu4"):
            results.update(_bleu_single(candidate, reference, "bleu4"))
        elif metric in ("bleu1", "bleu2", "bleu3"):
            results.update(_bleu_single(candidate, reference, metric))
        elif metric in ("rouge", "rouge_l"):
            results.update(_rouge_single(candidate, reference))
        elif metric == "lexical":
            results.update(_lexical_single(candidate, reference))
    return results
 def _compute_batch_metrics(
    candidates: list[str],
    references: list[str],
    metric_names: list[str],
 ) -> dict[str, float]:
    """Compute average metrics for a batch of text pairs."""
    results: dict[str, float] = {}
    for metric in metric_names:
        if metric in ("bleu", "bleu4"):
            results.update(_bleu_batch(candidates, references, "bleu4"))
        elif metric in ("bleu1", "bleu2", "bleu3"):
            results.update(_bleu_batch(candidates, references, metric))
        elif metric in ("rouge", "rouge_l"):
            results.update(_rouge_batch(candidates, references))
        elif metric == "lexical":
            results.update(_lexical_batch(candidates, references))
    return results
 def _parse_metrics(metrics_str: str) -> list[str]:
    """Parse comma-separated metric names."""
    metrics = [m.strip().lower() for m in metrics_str.split(",")]
    # Validate metric names
    invalid = [m for m in metrics if m not in AVAILABLE_METRICS]
    if invalid:
        raise typer.BadParameter(
            f"Unknown metrics: {', '.join(invalid)}. "
            f"Available: {', '.join(sorted(AVAILABLE_METRICS))}"
        )
    return metrics
 def validate(
    text: Annotated[
        str | None,
        typer.Argument(help="Candidate text to validate (inline mode)."),
    ] = None,
    reference: Annotated[
        str | None,
        typer.Option("--reference", "-r", help="Reference text for comparison."),
    ] = None,
    file: Annotated[
        Path | None,
        typer.Option("--file", "-f", help="JSONL file with candidate/reference pairs."),
    ] = None,
    reference_file: Annotated[
        Path | None,
        typer.Option(
            "--reference-file",
            "-R",
            help=(
                "Separate JSONL file with references (requires --file). "
                "Both files must then use a 'text' key per line."
            ),
        ),
    ] = None,
    metrics: Annotated[
        str,
        typer.Option(
            "--metrics",
            "-m",
            help="Comma-separated metrics: bleu, bleu1-4, rouge, rouge_l, lexical.",
        ),
    ] = "bleu,rouge",
    output: Annotated[
        str,
        typer.Option("--output", "-o", help="Output format: table, json, or simple."),
    ] = "table",
    threshold: Annotated[
        float | None,
        typer.Option("--threshold", "-t", help="Score threshold for pass/fail status."),
    ] = None,
 ) -> None:
    """
    Validate text quality using various metrics.
    Use inline mode for single texts:
        veritext validate "text" -r "reference" -m bleu,rouge
    Use file mode for batches:
        veritext validate -f outputs.jsonl -m bleu,rouge
    """
    # Parse and validate metric names
    try:
        metric_names = _parse_metrics(metrics)
    except typer.BadParameter as e:
        console.print(f"[red]Error:[/red] {e}")
        raise typer.Exit(code=1) from e
    # Validate output format
    if output not in ("table", "json", "simple"):
        console.print(f"[red]Error:[/red] Invalid output format: {output}")
        raise typer.Exit(code=1)
    # Determine mode: inline vs file
    if file is not None:
        # File mode
        try:
            if reference_file is not None:
                pairs = read_paired_jsonl(file, reference_file)
            else:
                pairs = read_jsonl(file)
        except (FileNotFoundError, ValueError) as e:
            console.print(f"[red]Error:[/red] {e}")
            raise typer.Exit(code=1) from e
        if not pairs:
            console.print("[yellow]Warning:[/yellow] No text pairs found in file.")
            raise typer.Exit(code=0)
        candidates = [p.candidate for p in pairs]
        references = [p.reference for p in pairs]
        results = _compute_batch_metrics(candidates, references, metric_names)
        # Skip the summary line for JSON so the output stays machine-readable
        if output != "json":
            console.print(f"[dim]Evaluated {len(pairs)} text pairs.[/dim]\n")
    elif text is not None and reference is not None:
        # Inline mode
        results = _compute_metrics(text, reference, metric_names)
    else:
        # Invalid usage
        console.print(
            "[red]Error:[/red] Provide either text and --reference, "
            "or --file for batch mode."
        )
        raise typer.Exit(code=1)
    print_validation_output(results, output, threshold)
@@ -1,5 +1,6 @@
 """Configuration management using pydantic-settings."""
 from functools import lru_cache
 from pathlib import Path
 from typing import Literal
@@ -54,6 +55,7 @@ class VeritextSettings(BaseSettings):
    )
@lru_cache
 def get_settings() -> VeritextSettings:
-    """Get the current settings instance."""
+    """Get the cached settings instance."""
    return VeritextSettings()
@@ -137,8 +137,8 @@ class Readability:
                flesch_reading_ease=0.0,
            )
-        # Count sentences
+        # Count sentences (ensure at least 1 to avoid division by zero)
-        sentence_count = _count_sentences(candidate)
+        sentence_count = max(_count_sentences(candidate), 1)
        # Count syllables
        syllable_count = sum(_count_syllables(word) for word in words)
@@ -40,6 +40,11 @@ class LexicalResult(BaseModel):
    token_overlap: float
    """Proportion of candidate tokens found in reference."""
    @property
    def score(self) -> float:
        """Return Jaccard similarity as the primary score."""
        return self.jaccard
 class RougeScore(BaseModel):
    """Individual ROUGE variant score with precision, recall, F-measure."""
@@ -107,9 +107,6 @@ def _compute_rouge_l(
    Returns:
        RougeScore with precision, recall, and F-measure.
    """
-    if not candidate_tokens and not reference_tokens:
-        return RougeScore(precision=0.0, recall=0.0, fmeasure=0.0)
    if not candidate_tokens or not reference_tokens:
        return RougeScore(precision=0.0, recall=0.0, fmeasure=0.0)
@@ -209,6 +206,10 @@ class Rouge:
            rouge2_scores.append(_compute_rouge_score(candidate_tokens, ref_tokens, 2))
            rouge_l_scores.append(_compute_rouge_l(candidate_tokens, ref_tokens))
        # All references were empty after tokenisation
        if not rouge1_scores:
            raise ValueError("Reference text cannot be empty")
        return RougeResult(
            rouge1=_max_rouge_scores(rouge1_scores),
            rouge2=_max_rouge_scores(rouge2_scores),
@@ -1,11 +1,15 @@
 """Embedding-based semantic similarity using sentence-transformers."""
 from collections import OrderedDict
 from typing import Any
 from veritext.core.exceptions import DependencyError
 from veritext.metrics.base import AggregateStats, BatchResult
 from veritext.metrics.results import SemanticResult
 # Default maximum cache size (number of embeddings to store)
 DEFAULT_CACHE_MAX_SIZE = 1000
 class SemanticSimilarity:
    """
@@ -21,6 +25,7 @@ class SemanticSimilarity:
        self,
        model: str = "all-MiniLM-L6-v2",
        cache_embeddings: bool = True,
        cache_max_size: int = DEFAULT_CACHE_MAX_SIZE,
    ) -> None:
        """
        Initialise the semantic similarity metric.
@@ -30,6 +35,8 @@ class SemanticSimilarity:
                   Defaults to "all-MiniLM-L6-v2" (22MB, good quality/size tradeoff).
            cache_embeddings: Whether to cache embeddings for repeated texts.
                              Defaults to True.
            cache_max_size: Maximum number of embeddings to cache. The least
                            recently used entries are evicted when the limit is
                            reached. Defaults to 1000.
        Raises:
            DependencyError: If sentence-transformers is not installed.
@@ -44,7 +51,10 @@ class SemanticSimilarity:
        self._model_name = model
        self._model: Any = SentenceTransformer(model)
-        self._cache: dict[str, Any] | None = {} if cache_embeddings else None
+        self._cache: OrderedDict[str, Any] | None = (
+            OrderedDict() if cache_embeddings else None
+        )
+        self._cache_max_size = cache_max_size
    @property
    def name(self) -> str:
@@ -58,7 +68,7 @@ class SemanticSimilarity:
    def _get_embedding(self, text: str) -> Any:
        """
-        Get embedding for text, using cache if available.
+        Get embedding for text, using LRU cache if available.
        Args:
            text: The text to embed.
@@ -67,11 +77,16 @@ class SemanticSimilarity:
            The embedding tensor.
        """
        if self._cache is not None and text in self._cache:
            # Move to end to mark as recently used
            self._cache.move_to_end(text)
            return self._cache[text]
        embedding = self._model.encode(text, convert_to_tensor=True)
        if self._cache is not None:
            # Evict oldest entries if cache is full
            while len(self._cache) >= self._cache_max_size:
                self._cache.popitem(last=False)
            self._cache[text] = embedding
        return embedding
@@ -1,11 +1,20 @@
-"""Composite validators for combining multiple checks."""
+"""Composite validators for combining multiple checks.
 Note: CompositeCheck classes (AllOf, AnyOf) intentionally return ValidationResult
 rather than CheckResult. This allows callers to inspect individual check results
 for detailed error reporting. They implement a compatible interface but are not
 substitutable where Check is expected as a type constraint.
 """
 from veritext.core.types import CheckResult, ValidationContext, ValidationResult
 from veritext.validators.base import Check
 class AllOf:
-    """Passes only if all checks pass."""
+    """Passes only if all checks pass.
    Note: Returns ValidationResult (not CheckResult) to expose child results.
    """
    def __init__(self, checks: list[Check]) -> None:
        """
@@ -20,7 +29,7 @@ class AllOf:
        if not checks:
            raise ValueError("checks list cannot be empty")
-        self._checks = checks
+        self._checks = list(checks)
    @property
    def name(self) -> str:
@@ -48,7 +57,10 @@ class AllOf:
 class AnyOf:
-    """Passes if any check passes."""
+    """Passes if any check passes.
    Note: Returns ValidationResult (not CheckResult) to expose child results.
    """
    def __init__(self, checks: list[Check]) -> None:
        """
@@ -63,7 +75,7 @@ class AnyOf:
        if not checks:
            raise ValueError("checks list cannot be empty")
-        self._checks = checks
+        self._checks = list(checks)
    @property
    def name(self) -> str:
@@ -229,7 +229,7 @@ class ContainsValidator:
            case_sensitive: Whether matching is case-sensitive. Defaults to False.
        Raises:
-            InvalidThresholdError: If patterns list is empty.
+            InvalidThresholdError: If patterns list is empty or contains invalid regex.
        """
        if not patterns:
            raise InvalidThresholdError("patterns list cannot be empty")
@@ -238,6 +238,15 @@ class ContainsValidator:
        self._case_sensitive = case_sensitive
        self._flags = 0 if case_sensitive else re.IGNORECASE
        self._compiled_patterns: list[re.Pattern[str]] = []
        for pattern in patterns:
            try:
                self._compiled_patterns.append(re.compile(pattern, self._flags))
            except re.error as e:
                raise InvalidThresholdError(
                    f"Invalid regex pattern '{pattern}': {e}"
                ) from e
    @property
    def name(self) -> str:
        """Return the name of this check."""
@@ -255,8 +264,10 @@ class ContainsValidator:
            CheckResult with pass/fail status.
        """
        missing = []
-        for pattern in self._patterns:
-            if not re.search(pattern, text, self._flags):
+        for pattern, compiled in zip(
+            self._patterns, self._compiled_patterns, strict=True
+        ):
+            if not compiled.search(text):
                missing.append(pattern)
        passed = len(missing) == 0
@@ -291,7 +302,7 @@ class ExcludesValidator:
            case_sensitive: Whether matching is case-sensitive. Defaults to False.
        Raises:
-            InvalidThresholdError: If patterns list is empty.
+            InvalidThresholdError: If patterns list is empty or contains invalid regex.
        """
        if not patterns:
            raise InvalidThresholdError("patterns list cannot be empty")
@@ -300,6 +311,15 @@ class ExcludesValidator:
        self._case_sensitive = case_sensitive
        self._flags = 0 if case_sensitive else re.IGNORECASE
+        self._compiled_patterns: list[re.Pattern[str]] = []
+        for pattern in patterns:
+            try:
+                self._compiled_patterns.append(re.compile(pattern, self._flags))
+            except re.error as e:
+                raise InvalidThresholdError(
+                    f"Invalid regex pattern '{pattern}': {e}"
+                ) from e
    @property
    def name(self) -> str:
        """Return the name of this check."""
@@ -317,8 +337,10 @@ class ExcludesValidator:
            CheckResult with pass/fail status.
        """
        found = []
-        for pattern in self._patterns:
-            if re.search(pattern, text, self._flags):
+        for pattern, compiled in zip(
+            self._patterns, self._compiled_patterns, strict=True
+        ):
+            if compiled.search(text):
                found.append(pattern)
        passed = len(found) == 0
@@ -0,0 +1 @@
 """CLI test suite."""
@@ -0,0 +1,337 @@
 """Tests for CLI benchmark commands."""
 from pathlib import Path
 from typer.testing import CliRunner
 from veritext.cli.main import app
 runner = CliRunner()
 class TestBenchmarkRun:
    """Tests for benchmark run command."""
    def test_benchmark_run_basic(self, tmp_path: Path) -> None:
        """Test basic benchmark run."""
        data_file = tmp_path / "data.jsonl"
        data_file.write_text(
            '{"candidate": "hello world today", "reference": "hello world today"}\n'
            '{"candidate": "foo bar baz qux", "reference": "foo bar baz qux"}'
        )
        storage_path = tmp_path / "benchmarks"
        result = runner.invoke(
            app,
            [
                "benchmark",
                "run",
                "test_bench",
                "-f",
                str(data_file),
                "-m",
                "rouge_l,bleu4",
                "-s",
                str(storage_path),
            ],
        )
        assert result.exit_code == 0
        assert "Benchmark 'test_bench' completed" in result.stdout
        assert "Samples: 2" in result.stdout
        assert "rouge_l:" in result.stdout
        assert "bleu4:" in result.stdout
    def test_benchmark_run_file_not_found(self, tmp_path: Path) -> None:
        """Test benchmark run with non-existent file."""
        result = runner.invoke(
            app,
            [
                "benchmark",
                "run",
                "test_bench",
                "-f",
                "/nonexistent/file.jsonl",
                "-s",
                str(tmp_path / "benchmarks"),
            ],
        )
        assert result.exit_code == 1
        assert "Error" in result.stdout
    def test_benchmark_run_creates_storage(self, tmp_path: Path) -> None:
        """Test that benchmark run creates storage directory."""
        data_file = tmp_path / "data.jsonl"
        data_file.write_text('{"candidate": "hello", "reference": "hello"}')
        storage_path = tmp_path / "new_benchmarks"
        result = runner.invoke(
            app,
            [
                "benchmark",
                "run",
                "test_bench",
                "-f",
                str(data_file),
                "-s",
                str(storage_path),
            ],
        )
        assert result.exit_code == 0
        assert storage_path.exists()
 class TestBenchmarkShow:
    """Tests for benchmark show command."""
    def test_benchmark_show_no_runs(self, tmp_path: Path) -> None:
        """Test showing benchmark with no runs."""
        storage_path = tmp_path / "benchmarks"
        storage_path.mkdir()
        result = runner.invoke(
            app,
            [
                "benchmark",
                "show",
                "nonexistent_bench",
                "-s",
                str(storage_path),
            ],
        )
        assert result.exit_code == 0
        assert "No benchmark runs found" in result.stdout
    def test_benchmark_show_with_runs(self, tmp_path: Path) -> None:
        """Test showing benchmark history with runs."""
        # First create some runs
        data_file = tmp_path / "data.jsonl"
        data_file.write_text('{"candidate": "hello world", "reference": "hello world"}')
        storage_path = tmp_path / "benchmarks"
        # Run benchmark twice
        for _ in range(2):
            runner.invoke(
                app,
                [
                    "benchmark",
                    "run",
                    "test_bench",
                    "-f",
                    str(data_file),
                    "-s",
                    str(storage_path),
                ],
            )
        # Show history
        result = runner.invoke(
            app,
            [
                "benchmark",
                "show",
                "test_bench",
                "-s",
                str(storage_path),
            ],
        )
        assert result.exit_code == 0
        assert "Benchmark History" in result.stdout
    def test_benchmark_show_limit(self, tmp_path: Path) -> None:
        """Test showing limited benchmark history."""
        data_file = tmp_path / "data.jsonl"
        data_file.write_text('{"candidate": "hello", "reference": "hello"}')
        storage_path = tmp_path / "benchmarks"
        # Run benchmark 3 times
        for _ in range(3):
            runner.invoke(
                app,
                [
                    "benchmark",
                    "run",
                    "test_bench",
                    "-f",
                    str(data_file),
                    "-s",
                    str(storage_path),
                ],
            )
        # Show only last 2
        result = runner.invoke(
            app,
            [
                "benchmark",
                "show",
                "test_bench",
                "--last",
                "2",
                "-s",
                str(storage_path),
            ],
        )
        assert result.exit_code == 0
 class TestBenchmarkCheck:
    """Tests for benchmark check command."""
    def test_benchmark_check_no_regression(self, tmp_path: Path) -> None:
        """Test checking for regression with no regression."""
        data_file = tmp_path / "data.jsonl"
        data_file.write_text(
            '{"candidate": "hello world today", "reference": "hello world today"}'
        )
        storage_path = tmp_path / "benchmarks"
        # Run benchmark twice with same data (no regression)
        for _ in range(2):
            runner.invoke(
                app,
                [
                    "benchmark",
                    "run",
                    "test_bench",
                    "-f",
                    str(data_file),
                    "-s",
                    str(storage_path),
                ],
            )
        # Check for regression
        result = runner.invoke(
            app,
            [
                "benchmark",
                "check",
                "test_bench",
                "-s",
                str(storage_path),
            ],
        )
        assert result.exit_code == 0
        assert "No regression detected" in result.stdout
    def test_benchmark_check_with_regression(self, tmp_path: Path) -> None:
        """Test checking for regression when regression occurs."""
        storage_path = tmp_path / "benchmarks"
        # First run with good data
        good_file = tmp_path / "good.jsonl"
        good_file.write_text(
            '{"candidate": "hello world today", "reference": "hello world today"}'
        )
        runner.invoke(
            app,
            [
                "benchmark",
                "run",
                "test_bench",
                "-f",
                str(good_file),
                "-s",
                str(storage_path),
            ],
        )
        # Second run with bad data (regression)
        bad_file = tmp_path / "bad.jsonl"
        bad_file.write_text(
            '{"candidate": "completely different", "reference": "hello world today"}'
        )
        runner.invoke(
            app,
            [
                "benchmark",
                "run",
                "test_bench",
                "-f",
                str(bad_file),
                "-s",
                str(storage_path),
            ],
        )
        # Check for regression
        result = runner.invoke(
            app,
            [
                "benchmark",
                "check",
                "test_bench",
                "-t",
                "0.05",
                "-s",
                str(storage_path),
            ],
        )
        assert result.exit_code == 1
        assert "Regression detected" in result.stdout
    def test_benchmark_check_custom_tolerance(self, tmp_path: Path) -> None:
        """Test checking regression with custom tolerance."""
        data_file = tmp_path / "data.jsonl"
        data_file.write_text('{"candidate": "hello", "reference": "hello"}')
        storage_path = tmp_path / "benchmarks"
        runner.invoke(
            app,
            [
                "benchmark",
                "run",
                "test_bench",
                "-f",
                str(data_file),
                "-s",
                str(storage_path),
            ],
        )
        result = runner.invoke(
            app,
            [
                "benchmark",
                "check",
                "test_bench",
                "--tolerance",
                "0.10",
                "-s",
                str(storage_path),
            ],
        )
        assert result.exit_code == 0
        assert "10.00%" in result.stdout
 class TestBenchmarkHelp:
    """Tests for benchmark help output."""
    def test_benchmark_help(self) -> None:
        """Test benchmark help output."""
        result = runner.invoke(app, ["benchmark", "--help"])
        assert result.exit_code == 0
        assert "run" in result.stdout
        assert "show" in result.stdout
        assert "check" in result.stdout
    def test_benchmark_run_help(self) -> None:
        """Test benchmark run help output."""
        result = runner.invoke(app, ["benchmark", "run", "--help"])
        assert result.exit_code == 0
        assert "--file" in result.stdout
        assert "--metrics" in result.stdout
    def test_benchmark_show_help(self) -> None:
        """Test benchmark show help output."""
        result = runner.invoke(app, ["benchmark", "show", "--help"])
        assert result.exit_code == 0
        assert "--last" in result.stdout
    def test_benchmark_check_help(self) -> None:
        """Test benchmark check help output."""
        result = runner.invoke(app, ["benchmark", "check", "--help"])
        assert result.exit_code == 0
        assert "--tolerance" in result.stdout
        assert "--window" in result.stdout
@@ -0,0 +1,141 @@
 """Tests for CLI output formatters."""
 from datetime import UTC, datetime
 from veritext.benchmark.models import BenchmarkRun, RegressionReport
 from veritext.cli.formatters import (
    format_benchmark_history,
    format_regression_report,
    format_validation_json,
    format_validation_simple,
    format_validation_table,
 )
 class TestFormatValidationTable:
    """Tests for format_validation_table function."""
    def test_format_empty_results(self) -> None:
        """Test formatting empty results."""
        table = format_validation_table({})
        assert table.title == "Validation Results"
        assert table.row_count == 0
    def test_format_single_metric(self) -> None:
        """Test formatting a single metric."""
        results = {"bleu4": 0.8523}
        table = format_validation_table(results)
        assert table.row_count == 1
    def test_format_multiple_metrics(self) -> None:
        """Test formatting multiple metrics."""
        results = {"bleu4": 0.85, "rouge_l": 0.92, "jaccard": 0.75}
        table = format_validation_table(results)
        assert table.row_count == 3
    def test_format_with_threshold(self) -> None:
        """Test formatting with threshold for pass/fail."""
        results = {"bleu4": 0.85, "rouge_l": 0.45}
        table = format_validation_table(results, threshold=0.5)
        # A threshold adds a Status column; each metric still produces one row
        assert table.row_count == 2
 class TestFormatValidationJson:
    """Tests for format_validation_json function."""
    def test_format_empty_results(self) -> None:
        """Test formatting empty results as JSON."""
        result = format_validation_json({})
        assert result == "{}"
    def test_format_results(self) -> None:
        """Test formatting results as JSON."""
        results = {"bleu4": 0.85, "rouge_l": 0.92}
        result = format_validation_json(results)
        assert '"bleu4": 0.85' in result
        assert '"rouge_l": 0.92' in result
 class TestFormatValidationSimple:
    """Tests for format_validation_simple function."""
    def test_format_empty_results(self) -> None:
        """Test formatting empty results as simple text."""
        result = format_validation_simple({})
        assert result == ""
    def test_format_results(self) -> None:
        """Test formatting results as simple text."""
        results = {"bleu4": 0.8523, "rouge_l": 0.9234}
        result = format_validation_simple(results)
        assert "bleu4: 0.8523" in result
        assert "rouge_l: 0.9234" in result
 class TestFormatBenchmarkHistory:
    """Tests for format_benchmark_history function."""
    def test_format_empty_history(self) -> None:
        """Test formatting empty benchmark history."""
        table = format_benchmark_history([])
        assert table.title == "Benchmark History"
    def test_format_single_run(self) -> None:
        """Test formatting a single benchmark run."""
        run = BenchmarkRun(
            id="test-id",
            benchmark_name="test",
            timestamp=datetime(2024, 1, 15, 10, 30, tzinfo=UTC),
            veritext_version="0.1.0",
            metrics={"rouge_l": 0.85, "bleu4": 0.72},
            sample_count=100,
        )
        table = format_benchmark_history([run])
        assert table.row_count == 1
    def test_format_multiple_runs(self) -> None:
        """Test formatting multiple benchmark runs."""
        runs = [
            BenchmarkRun(
                id=f"test-id-{i}",
                benchmark_name="test",
                timestamp=datetime(2024, 1, i + 1, 10, 30, tzinfo=UTC),
                veritext_version="0.1.0",
                metrics={"rouge_l": 0.8 + i * 0.01},
                sample_count=100,
            )
            for i in range(3)
        ]
        table = format_benchmark_history(runs)
        assert table.row_count == 3
 class TestFormatRegressionReport:
    """Tests for format_regression_report function."""
    def test_format_no_regression(self) -> None:
        """Test formatting report with no regression."""
        report = RegressionReport(
            detected=False,
            baseline={"rouge_l": 0.85},
            current={"rouge_l": 0.86},
            deltas={"rouge_l": 0.01},
            tolerance=0.05,
        )
        panel = format_regression_report(report)
        assert panel.title == "Regression Check"
        assert panel.border_style == "green"
    def test_format_with_regression(self) -> None:
        """Test formatting report with regression detected."""
        report = RegressionReport(
            detected=True,
            baseline={"rouge_l": 0.85, "bleu4": 0.72},
            current={"rouge_l": 0.70, "bleu4": 0.70},
            deltas={"rouge_l": -0.15, "bleu4": -0.02},
            tolerance=0.05,
        )
        panel = format_regression_report(report)
        assert panel.title == "Regression Check"
        assert panel.border_style == "red"
@@ -0,0 +1,145 @@
 """Tests for CLI input readers."""
 import json
 from pathlib import Path
 import pytest
 from veritext.cli.readers import TextPair, read_jsonl, read_paired_jsonl
 class TestTextPair:
    """Tests for TextPair dataclass."""
    def test_create_text_pair(self) -> None:
        """Test creating a TextPair."""
        pair = TextPair(candidate="hello", reference="world")
        assert pair.candidate == "hello"
        assert pair.reference == "world"
 class TestReadJsonl:
    """Tests for read_jsonl function."""
    def test_read_valid_jsonl(self, tmp_path: Path) -> None:
        """Test reading a valid JSONL file."""
        data = [
            {"candidate": "foo", "reference": "bar"},
            {"candidate": "baz", "reference": "qux"},
        ]
        jsonl_file = tmp_path / "data.jsonl"
        jsonl_file.write_text("\n".join(json.dumps(d) for d in data))
        pairs = read_jsonl(jsonl_file)
        assert len(pairs) == 2
        assert pairs[0].candidate == "foo"
        assert pairs[0].reference == "bar"
        assert pairs[1].candidate == "baz"
        assert pairs[1].reference == "qux"
    def test_read_empty_file(self, tmp_path: Path) -> None:
        """Test reading an empty JSONL file."""
        jsonl_file = tmp_path / "empty.jsonl"
        jsonl_file.write_text("")
        pairs = read_jsonl(jsonl_file)
        assert pairs == []
    def test_read_file_with_blank_lines(self, tmp_path: Path) -> None:
        """Test reading a JSONL file with blank lines."""
        jsonl_file = tmp_path / "data.jsonl"
        content = '{"candidate": "a", "reference": "b"}\n\n{"candidate": "c", "reference": "d"}\n'
        jsonl_file.write_text(content)
        pairs = read_jsonl(jsonl_file)
        assert len(pairs) == 2
    def test_read_file_not_found(self, tmp_path: Path) -> None:
        """Test reading a non-existent file."""
        with pytest.raises(FileNotFoundError):
            read_jsonl(tmp_path / "nonexistent.jsonl")
    def test_read_invalid_json(self, tmp_path: Path) -> None:
        """Test reading a file with invalid JSON."""
        jsonl_file = tmp_path / "invalid.jsonl"
        jsonl_file.write_text("not valid json")
        with pytest.raises(ValueError, match="Invalid JSON on line 1"):
            read_jsonl(jsonl_file)
    def test_read_missing_candidate_key(self, tmp_path: Path) -> None:
        """Test reading a file missing the candidate key."""
        jsonl_file = tmp_path / "data.jsonl"
        jsonl_file.write_text('{"reference": "bar"}')
        with pytest.raises(ValueError, match="Missing 'candidate' key on line 1"):
            read_jsonl(jsonl_file)
    def test_read_missing_reference_key(self, tmp_path: Path) -> None:
        """Test reading a file missing the reference key."""
        jsonl_file = tmp_path / "data.jsonl"
        jsonl_file.write_text('{"candidate": "foo"}')
        with pytest.raises(ValueError, match="Missing 'reference' key on line 1"):
            read_jsonl(jsonl_file)
 class TestReadPairedJsonl:
    """Tests for read_paired_jsonl function."""
    def test_read_paired_valid(self, tmp_path: Path) -> None:
        """Test reading valid paired JSONL files."""
        candidates_file = tmp_path / "candidates.jsonl"
        references_file = tmp_path / "references.jsonl"
        candidates_file.write_text('{"text": "foo"}\n{"text": "bar"}')
        references_file.write_text('{"text": "baz"}\n{"text": "qux"}')
        pairs = read_paired_jsonl(candidates_file, references_file)
        assert len(pairs) == 2
        assert pairs[0].candidate == "foo"
        assert pairs[0].reference == "baz"
        assert pairs[1].candidate == "bar"
        assert pairs[1].reference == "qux"
    def test_read_paired_length_mismatch(self, tmp_path: Path) -> None:
        """Test reading paired files with different lengths."""
        candidates_file = tmp_path / "candidates.jsonl"
        references_file = tmp_path / "references.jsonl"
        candidates_file.write_text('{"text": "foo"}\n{"text": "bar"}')
        references_file.write_text('{"text": "baz"}')
        with pytest.raises(ValueError, match="does not match"):
            read_paired_jsonl(candidates_file, references_file)
    def test_read_paired_candidates_not_found(self, tmp_path: Path) -> None:
        """Test reading when candidates file doesn't exist."""
        references_file = tmp_path / "references.jsonl"
        references_file.write_text('{"text": "baz"}')
        with pytest.raises(FileNotFoundError, match="Candidates file not found"):
            read_paired_jsonl(tmp_path / "nonexistent.jsonl", references_file)
    def test_read_paired_references_not_found(self, tmp_path: Path) -> None:
        """Test reading when references file doesn't exist."""
        candidates_file = tmp_path / "candidates.jsonl"
        candidates_file.write_text('{"text": "foo"}')
        with pytest.raises(FileNotFoundError, match="References file not found"):
            read_paired_jsonl(candidates_file, tmp_path / "nonexistent.jsonl")
    def test_read_paired_missing_text_key(self, tmp_path: Path) -> None:
        """Test reading paired files with missing text key."""
        candidates_file = tmp_path / "candidates.jsonl"
        references_file = tmp_path / "references.jsonl"
        candidates_file.write_text('{"value": "foo"}')
        references_file.write_text('{"text": "baz"}')
        with pytest.raises(ValueError, match="Missing 'text' key in candidates file"):
            read_paired_jsonl(candidates_file, references_file)
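Review note: taken together, the `read_jsonl` tests above pin down a small contract: blank lines are skipped, errors name a 1-based line number, and both keys are required. The sketch below is one reader that would satisfy those assertions. It is an assumption inferred from the tests, not the actual `veritext.cli.readers` code; `read_jsonl_sketch`, the frozen `TextPair` stand-in, and the non-object guard are all hypothetical.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class TextPair:
    """Stand-in for veritext.cli.readers.TextPair (frozen is an assumption)."""

    candidate: str
    reference: str


def read_jsonl_sketch(path: Path) -> list[TextPair]:
    """Read candidate/reference pairs from a JSONL file, skipping blank lines."""
    if not path.exists():
        raise FileNotFoundError(f"File not found: {path}")
    pairs: list[TextPair] = []
    with path.open(encoding="utf-8") as handle:
        # Line numbers are 1-based and count blank lines too.
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError as e:
                raise ValueError(f"Invalid JSON on line {line_number}: {e}") from e
            # Assumption: non-object JSON values are rejected as well.
            if not isinstance(record, dict):
                raise ValueError(f"Expected a JSON object on line {line_number}")
            for key in ("candidate", "reference"):
                if key not in record:
                    raise ValueError(f"Missing '{key}' key on line {line_number}")
            pairs.append(TextPair(record["candidate"], record["reference"]))
    return pairs


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "data.jsonl"
    data.write_text(
        '{"candidate": "a", "reference": "b"}\n\n{"candidate": "c", "reference": "d"}\n',
        encoding="utf-8",
    )
    pairs = read_jsonl_sketch(data)

print(len(pairs))  # 2
```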
@@ -0,0 +1,233 @@
 """Tests for CLI validate command."""
 import json
 from pathlib import Path
 from typer.testing import CliRunner
 from veritext.cli.main import app
 runner = CliRunner()
 class TestValidateInline:
    """Tests for inline validation mode."""
    def test_validate_inline_basic(self) -> None:
        """Test basic inline validation."""
        result = runner.invoke(
            app,
            [
                "validate",
                "The quick brown fox jumps",
                "-r",
                "The quick brown fox jumps",
                "-m",
                "bleu",
            ],
        )
        assert result.exit_code == 0
        assert "bleu4" in result.stdout
    def test_validate_inline_with_rouge(self) -> None:
        """Test inline validation with ROUGE metric."""
        result = runner.invoke(
            app,
            [
                "validate",
                "hello world today",
                "-r",
                "hello world here",
                "-m",
                "rouge",
            ],
        )
        assert result.exit_code == 0
        assert "rouge_l" in result.stdout
    def test_validate_inline_with_lexical(self) -> None:
        """Test inline validation with lexical metric."""
        result = runner.invoke(
            app,
            [
                "validate",
                "hello world",
                "-r",
                "hello everyone",
                "-m",
                "lexical",
            ],
        )
        assert result.exit_code == 0
        assert "jaccard" in result.stdout
        assert "token_overlap" in result.stdout
    def test_validate_inline_json_output(self) -> None:
        """Test inline validation with JSON output."""
        result = runner.invoke(
            app,
            [
                "validate",
                "hello world today",
                "-r",
                "hello world today",
                "-m",
                "bleu",
                "-o",
                "json",
            ],
        )
        assert result.exit_code == 0
        data = json.loads(result.stdout)
        assert "bleu4" in data
    def test_validate_inline_simple_output(self) -> None:
        """Test inline validation with simple output."""
        result = runner.invoke(
            app,
            [
                "validate",
                "hello world today",
                "-r",
                "hello world today",
                "-m",
                "rouge",
                "-o",
                "simple",
            ],
        )
        assert result.exit_code == 0
        assert "rouge_l:" in result.stdout
    def test_validate_inline_missing_reference(self) -> None:
        """Test inline validation without reference."""
        result = runner.invoke(
            app,
            ["validate", "hello world", "-m", "bleu"],
        )
        assert result.exit_code == 1
        assert "Error" in result.stdout
    def test_validate_inline_invalid_metric(self) -> None:
        """Test inline validation with invalid metric."""
        result = runner.invoke(
            app,
            ["validate", "hello", "-r", "world", "-m", "invalid_metric"],
        )
        assert result.exit_code == 1
        assert "Unknown metrics" in result.stdout
 class TestValidateFile:
    """Tests for file-based validation mode."""
    def test_validate_file_basic(self, tmp_path: Path) -> None:
        """Test basic file-based validation."""
        data_file = tmp_path / "data.jsonl"
        data_file.write_text(
            '{"candidate": "hello world today", "reference": "hello world today"}\n'
            '{"candidate": "foo bar baz", "reference": "foo bar baz"}'
        )
        result = runner.invoke(
            app,
            ["validate", "-f", str(data_file), "-m", "bleu"],
        )
        assert result.exit_code == 0
        assert "bleu4" in result.stdout
        assert "Evaluated 2 text pairs" in result.stdout
    def test_validate_file_not_found(self) -> None:
        """Test file-based validation with non-existent file."""
        result = runner.invoke(
            app,
            ["validate", "-f", "/nonexistent/file.jsonl", "-m", "bleu"],
        )
        assert result.exit_code == 1
        assert "Error" in result.stdout
    def test_validate_paired_files(self, tmp_path: Path) -> None:
        """Test validation with separate candidate and reference files."""
        candidates_file = tmp_path / "candidates.jsonl"
        references_file = tmp_path / "references.jsonl"
        candidates_file.write_text(
            '{"text": "hello world today"}\n{"text": "foo bar baz"}'
        )
        references_file.write_text(
            '{"text": "hello world today"}\n{"text": "foo bar baz"}'
        )
        result = runner.invoke(
            app,
            [
                "validate",
                "-f",
                str(candidates_file),
                "-R",
                str(references_file),
                "-m",
                "bleu",
            ],
        )
        assert result.exit_code == 0
        assert "Evaluated 2 text pairs" in result.stdout
 class TestValidateOptions:
    """Tests for validate command options."""
    def test_validate_with_threshold(self) -> None:
        """Test validation with threshold option."""
        result = runner.invoke(
            app,
            [
                "validate",
                "hello world today",
                "-r",
                "hello world today",
                "-m",
                "bleu",
                "-t",
                "0.5",
            ],
        )
        assert result.exit_code == 0
        # Table output should include Status column
        assert "Status" in result.stdout or "PASS" in result.stdout
    def test_validate_invalid_output_format(self) -> None:
        """Test validation with invalid output format."""
        result = runner.invoke(
            app,
            [
                "validate",
                "hello",
                "-r",
                "world",
                "-m",
                "bleu",
                "-o",
                "invalid",
            ],
        )
        assert result.exit_code == 1
        assert "Invalid output format" in result.stdout
    def test_validate_multiple_metrics(self) -> None:
        """Test validation with multiple metrics."""
        result = runner.invoke(
            app,
            [
                "validate",
                "The quick brown fox",
                "-r",
                "The quick brown fox",
                "-m",
                "bleu,rouge,lexical",
            ],
        )
        assert result.exit_code == 0
        assert "bleu4" in result.stdout
        assert "rouge_l" in result.stdout
        assert "jaccard" in result.stdout
@@ -0,0 +1,73 @@
 """Tests for configuration module."""
 from pathlib import Path
 import pytest
 from veritext.core.config import VeritextSettings, get_settings
 class TestVeritextSettings:
    """Tests for VeritextSettings."""
    def test_default_log_level(self) -> None:
        """Test default log level is INFO."""
        settings = VeritextSettings()
        assert settings.log_level == "INFO"
    def test_default_log_format(self) -> None:
        """Test default log format is console."""
        settings = VeritextSettings()
        assert settings.log_format == "console"
    def test_default_benchmark_path(self) -> None:
        """Test default benchmark storage path."""
        settings = VeritextSettings()
        assert settings.benchmark_storage_path == Path("benchmarks")
    def test_default_tokeniser_lowercase(self) -> None:
        """Test default tokeniser lowercase setting."""
        settings = VeritextSettings()
        assert settings.tokeniser_lowercase is True
    def test_default_tokeniser_remove_punctuation(self) -> None:
        """Test default tokeniser remove punctuation setting."""
        settings = VeritextSettings()
        assert settings.tokeniser_remove_punctuation is True
    def test_default_semantic_model(self) -> None:
        """Test default semantic model name."""
        settings = VeritextSettings()
        assert settings.semantic_model == "all-MiniLM-L6-v2"
    def test_default_semantic_cache_enabled(self) -> None:
        """Test semantic cache is enabled by default."""
        settings = VeritextSettings()
        assert settings.semantic_cache_embeddings is True
    def test_env_var_override(self, monkeypatch: pytest.MonkeyPatch) -> None:
        """Test environment variable overrides default settings."""
        monkeypatch.setenv("VERITEXT_LOG_LEVEL", "DEBUG")
        settings = VeritextSettings()
        assert settings.log_level == "DEBUG"
    def test_env_var_override_log_format(self, monkeypatch: pytest.MonkeyPatch) -> None:
        """Test environment variable overrides log format."""
        monkeypatch.setenv("VERITEXT_LOG_FORMAT", "json")
        settings = VeritextSettings()
        assert settings.log_format == "json"
 class TestGetSettings:
    """Tests for get_settings function."""
    def test_get_settings_returns_instance(self) -> None:
        """Test get_settings returns a VeritextSettings instance."""
        settings = get_settings()
        assert isinstance(settings, VeritextSettings)
    def test_get_settings_returns_valid_defaults(self) -> None:
        """Test get_settings returns instance with valid defaults."""
        settings = get_settings()
        assert settings.log_level in ("DEBUG", "INFO", "WARNING", "ERROR")
        assert settings.log_format in ("console", "json")
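Review note: the changelog says the Settings instance is now cached via `@lru_cache`. A cached accessor keeps returning the first instance even if the environment changes afterwards, which is presumably why the env-override tests above construct `VeritextSettings()` directly rather than calling `get_settings()`. A minimal sketch of that interaction follows. `SketchSettings`, `get_sketch_settings`, and the `SKETCH_LOG_LEVEL` variable are hypothetical names for illustration, not the project's API.

```python
import os
from dataclasses import dataclass, field
from functools import lru_cache


@dataclass(frozen=True)
class SketchSettings:
    """Hypothetical settings object that reads the environment when created."""

    log_level: str = field(
        default_factory=lambda: os.environ.get("SKETCH_LOG_LEVEL", "INFO")
    )


@lru_cache(maxsize=1)
def get_sketch_settings() -> SketchSettings:
    """Return a process-wide settings instance, built on first call."""
    return SketchSettings()


os.environ.pop("SKETCH_LOG_LEVEL", None)
first = get_sketch_settings()

# Changing the environment after the first call has no effect on the cache.
os.environ["SKETCH_LOG_LEVEL"] = "DEBUG"
second = get_sketch_settings()
print(second is first, second.log_level)  # True INFO

# Clearing the cache forces the next call to re-read the environment.
get_sketch_settings.cache_clear()
third = get_sketch_settings()
print(third.log_level)  # DEBUG

os.environ.pop("SKETCH_LOG_LEVEL", None)
```

Tests that need the cached accessor to observe a patched environment can call `cache_clear()` on it before and after the test.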
@@ -0,0 +1,56 @@
 """Tests for logging module."""
 from veritext.core.logging import configure_logging, get_logger
 class TestGetLogger:
    """Tests for get_logger function."""
    def test_get_logger_returns_logger(self) -> None:
        """Test get_logger returns a logger instance."""
        logger = get_logger()
        assert logger is not None
    def test_get_logger_default_name(self) -> None:
        """Test get_logger with no name returns a usable logger."""
        logger = get_logger()
        # The default name is not asserted here; only the structlog-style
        # bound-logger interface is checked.
        assert hasattr(logger, "info")
        assert hasattr(logger, "debug")
        assert hasattr(logger, "warning")
        assert hasattr(logger, "error")
    def test_get_logger_custom_name(self) -> None:
        """Test get_logger respects custom name parameter."""
        logger = get_logger("custom.module")
        assert logger is not None
        assert hasattr(logger, "info")
 class TestConfigureLogging:
    """Tests for configure_logging function."""
    def test_configure_logging_console_format(self) -> None:
        """Test configure_logging with console format does not raise."""
        configure_logging(level="INFO", log_format="console")
        logger = get_logger()
        assert logger is not None
    def test_configure_logging_json_format(self) -> None:
        """Test configure_logging with json format does not raise."""
        configure_logging(level="DEBUG", log_format="json")
        logger = get_logger()
        assert logger is not None
    def test_configure_logging_uses_defaults(self) -> None:
        """Test configure_logging uses settings defaults when not provided."""
        configure_logging()
        logger = get_logger()
        assert logger is not None
    def test_configure_logging_different_levels(self) -> None:
        """Test configure_logging accepts different log levels."""
        for level in ("DEBUG", "INFO", "WARNING", "ERROR"):
            configure_logging(level=level)
            logger = get_logger()
            assert logger is not None
@@ -5,12 +5,11 @@ import pytest
@pytest.fixture
 def plugin_pytester(pytester: pytest.Pytester) -> pytest.Pytester:
-    """Configure pytester to use the veritext plugin."""
-    pytester.makeconftest(
-        """
-        pytest_plugins = ['veritext.pytest_plugin']
-        """
-    )
+    """Configure pytester to use the veritext plugin.
+
+    Note: The plugin is already loaded via the entry point in pyproject.toml,
+    so no explicit pytest_plugins declaration is needed.
+    """
    return pytester
@@ -263,6 +263,11 @@ class TestContainsValidator:
        with pytest.raises(InvalidThresholdError, match="cannot be empty"):
            ContainsValidator(patterns=[])
    def test_contains_validator_raises_on_invalid_regex(self) -> None:
        """Test that invalid regex pattern raises error at init time."""
        with pytest.raises(InvalidThresholdError, match="Invalid regex"):
            ContainsValidator(patterns=[r"[invalid"])
    def test_contains_factory_function(self) -> None:
        """Test the contains() factory function."""
        validator = contains(patterns=["test"], case_sensitive=True)
@@ -327,6 +332,11 @@ class TestExcludesValidator:
        with pytest.raises(InvalidThresholdError, match="cannot be empty"):
            ExcludesValidator(patterns=[])
    def test_excludes_validator_raises_on_invalid_regex(self) -> None:
        """Test that invalid regex pattern raises error at init time."""
        with pytest.raises(InvalidThresholdError, match="Invalid regex"):
            ExcludesValidator(patterns=[r"[invalid"])
    def test_excludes_factory_function(self) -> None:
        """Test the excludes() factory function."""
        validator = excludes(patterns=["test"], case_sensitive=True)
Author	SHA1	Message	Date
kschappell	0699e97e1d	refactor: CLI cleanup and documentation updates - Refactor CLI metric computation to eliminate code duplication - Update version format to PEP 440 compliance (0.1.0.dev0) - Cache Settings instance via @lru_cache for performance - Document composite validators' protocol deviation - Consolidate redundant empty checks in ROUGE-L computation - Add Phase 10 (Portfolio Demos) to implementation plan	2026-02-04 15:38:46 +00:00
kschappell	7de4505e31	fix(pytest-plugin): remove duplicate plugin registration in tests The pytest plugin is already loaded via the entry point, so explicitly declaring it in conftest causes a duplicate registration error.	2026-02-04 00:43:20 +00:00
kschappell	564d663c78	docs(changelog): update for QA fixes	2026-02-04 00:23:06 +00:00
kschappell	0b2bc6c688	test(core): add coverage for config and logging modules Adds tests for VeritextSettings defaults, env var overrides, and the get_logger/configure_logging functions.	2026-02-04 00:22:57 +00:00
kschappell	aa687f43cd	fix(validators): validate regex patterns at init time ContainsValidator and ExcludesValidator now pre-compile regex patterns during initialisation and raise InvalidThresholdError if invalid.	2026-02-04 00:22:47 +00:00
kschappell	f18427e123	fix: QA review fixes for 0.1.0 release - Fix README readability example property names - Add validation for empty references after tokenisation in ROUGE - Guard against zero sentence count in readability metric - Implement LRU cache with max size for semantic embeddings - Add .score property to LexicalResult for API consistency - Use defensive list copy in composite validators	2026-02-03 21:31:48 +00:00
kschappell	1754556c99	docs(changelog): release 0.1.0 Initial release with metrics, validators, pytest plugin, benchmark module, CLI, and comprehensive documentation.	2026-02-03 19:16:37 +00:00
kschappell	13c869f5d6	docs(readme): comprehensive documentation Expands readme with detailed coverage of metrics, validators, pytest plugin, benchmark module, CLI commands, and development setup.	2026-02-03 19:16:14 +00:00
kschappell	93515707cc	docs(examples): add benchmark regression example Demonstrates benchmark quality tracking with historical comparison and CI integration using assert_no_regression() for exit code control.	2026-02-03 19:15:12 +00:00
kschappell	3cde5aba77	docs(examples): add chatbot testing example Demonstrates pytest integration for chatbot QA with validate_text() assertions, fixtures, and parametrised content safety tests.	2026-02-03 19:14:25 +00:00
kschappell	69966d171c	docs(examples): add basic validation example Demonstrates core Veritext functionality: metrics, validators, composites, and constraint validators with runnable code.	2026-02-03 19:13:47 +00:00
kschappell	d5df8b52e6	docs: add branch creation instruction to git workflow Explicitly documents the requirement to create a new branch before starting work from a plan, consistent with the parent workspace CLAUDE.md instruction.	2026-02-03 19:06:45 +00:00
kschappell	8b7c087de7	docs(changelog): add CLI entries Document command-line interface including validate command, benchmark subcommands, and output formatting options.	2026-02-03 18:22:50 +00:00
kschappell	c54f8c3f6f	test(cli): add CLI tests Add comprehensive test suite for validate command, benchmark commands, input readers, and output formatters using Typer CliRunner.	2026-02-03 18:22:31 +00:00
kschappell	0cadfd4d23	feat(cli): add benchmark subcommands Add benchmark run, show, and check commands for quality tracking with regression detection supporting CI integration.	2026-02-03 18:20:28 +00:00
kschappell	e128720917	feat(cli): add validate command Implement validate command with inline and file-based modes supporting BLEU, ROUGE, and lexical metrics with multiple output formats.	2026-02-03 18:19:20 +00:00
kschappell	f713d5e8a6	feat(cli): add Rich output formatters Add formatters for validation results (table/json/simple) and benchmark history display with regression report panels.	2026-02-03 18:17:33 +00:00
kschappell	9853b57843	feat(cli): add JSONL and directory input readers Add TextPair dataclass and read_jsonl/read_paired_jsonl functions for parsing candidate-reference pairs from JSONL files.	2026-02-03 18:16:34 +00:00
kschappell	55faae3e1b	feat(cli): add CLI entry point with version command Initialise Typer app with --version flag and help text.	2026-02-03 18:16:07 +00:00
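
Commit aa687f43cd describes pre-compiling regex patterns during initialisation and raising InvalidThresholdError when a pattern is invalid. The sketch below shows one way that behaviour could look. It is not the project's code: `ContainsValidatorSketch`, its `check` method, the "all patterns must match" semantics, and the local `InvalidThresholdError` class are hypothetical stand-ins. Only the error message fragments ("cannot be empty", "Invalid regex") are taken from the tests in the diff above.

```python
import re
from typing import List, Sequence


class InvalidThresholdError(ValueError):
    """Hypothetical stand-in for the project's exception of the same name."""


class ContainsValidatorSketch:
    """Illustrative only; not the project's ContainsValidator."""

    def __init__(self, patterns: Sequence[str], case_sensitive: bool = False) -> None:
        if not patterns:
            raise InvalidThresholdError("patterns cannot be empty")
        flags = 0 if case_sensitive else re.IGNORECASE
        self._compiled: List[re.Pattern] = []
        for pattern in patterns:
            try:
                # Compile once here so a bad pattern fails at construction
                # time instead of on the first validation call.
                self._compiled.append(re.compile(pattern, flags))
            except re.error as exc:
                raise InvalidThresholdError(
                    f"Invalid regex {pattern!r}: {exc}"
                ) from exc

    def check(self, text: str) -> bool:
        # Assumption: every pattern must be found somewhere in the text.
        return all(p.search(text) for p in self._compiled)
```

Failing in `__init__` means a misconfigured validator surfaces when the test module is set up, and the compiled patterns are reused on every later call.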