refactor: CLI cleanup and documentation updates

- Refactor CLI metric computation to eliminate code duplication
- Update version format to PEP 440 compliance (0.1.0.dev0)
- Cache Settings instance via @lru_cache for performance
- Document composite validators' protocol deviation
- Consolidate redundant empty checks in ROUGE-L computation
- Add Phase 10 (Portfolio Demos) to implementation plan
fix(pytest-plugin): remove duplicate plugin registration in tests
2026-02-04 15:38:46 +00:00 · 2026-02-04 00:43:20 +00:00 · 2026-02-04 00:23:06 +00:00 · 2026-02-04 00:22:57 +00:00 · 2026-02-04 00:22:47 +00:00 · 2026-02-03 21:31:48 +00:00
20 changed files with 1275 additions and 103 deletions
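The commit message mentions caching the Settings instance via `@lru_cache`. A minimal sketch of that pattern follows. It assumes a hypothetical `Settings` dataclass, a `get_settings()` accessor, and a `VERITEXT_LOG_LEVEL` environment variable; the project's real class is built on pydantic-settings and may look different.

```python
import os
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class Settings:
    """Hypothetical stand-in for the project's pydantic-settings class."""

    log_level: str = "INFO"


@lru_cache(maxsize=1)
def get_settings() -> Settings:
    """Build Settings once; later calls return the same cached instance."""
    # VERITEXT_LOG_LEVEL is an assumed variable name, used only for illustration.
    return Settings(log_level=os.environ.get("VERITEXT_LOG_LEVEL", "INFO"))


first = get_settings()
second = get_settings()
print(first is second)  # True: the second call is served from the cache

# Tests that change the environment need to reset the cache explicitly.
get_settings.cache_clear()
```

The trade-off is that environment changes made after the first call are not picked up until `cache_clear()` is called, which matters mostly in test suites.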
@@ -7,35 +7,85 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 ## [Unreleased]

+### Changed
+
+- Refactored CLI metric computation to eliminate code duplication
+- Version format updated from `0.1.0-dev` to `0.1.0.dev0` (PEP 440 compliance)
+- Settings instance is now cached via `@lru_cache` for better performance
+- Documented composite validators' intentional deviation from `Check` protocol return type
+
+### Fixed
+
+- Consolidated redundant empty checks in ROUGE-L computation
+- Fixed README example using incorrect property names (`grade_level` → `flesch_kincaid_grade`, `reading_ease` → `flesch_reading_ease`)
+- Fixed potential crash in ROUGE metric when all references are empty after tokenisation
+- Fixed potential division by zero in readability metric when text has no sentence endings
+- Fixed unbounded cache growth in `SemanticSimilarity` by implementing LRU eviction with configurable max size
+- Fixed mutable list aliasing in `AllOf` and `AnyOf` composite validators
+- Fixed regex pattern validation in `ContainsValidator` and `ExcludesValidator` to fail at init time rather than during `check()`
+- Fixed pytest plugin tests failing with duplicate plugin registration error
+
+### Documentation
+
+- Added Phase 10 (Portfolio Demos) to implementation plan: Streamlit demo and Jupyter notebooks
+- Updated project plan with portfolio demo section
+
 ### Added

+- Added `.score` property to `LexicalResult` for API consistency with other result types
+- Added `cache_max_size` parameter to `SemanticSimilarity` (default: 1000 embeddings)
+- Added test coverage for `core/config.py` and `core/logging.py` modules
+
+## [0.1.0] — 2026-02-03
+
+Initial release of Veritext, a semantic text validation framework for Python.
+
+### Added
+
+#### Core
+
 - Project scaffold with pyproject.toml and development tooling
 - Core exception hierarchy (`VeritextError` and subclasses)
 - Core types: `ValidationContext`, `CheckResult`, `ValidationResult`
 - Word tokeniser with Unicode normalisation support
 - Configuration module with pydantic-settings
 - Structured logging with structlog
+
+#### Metrics
+
 - Metrics module with `Metric` protocol, `AggregateStats`, and `BatchResult` types
 - BLEU metric implementation (BLEU-1 through BLEU-4 with brevity penalty)
- Lexical similarity metric (Jaccard similarity and token overlap)
 - ROUGE metric (ROUGE-1, ROUGE-2, ROUGE-L with precision/recall/F-measure)
+- Lexical similarity metric (Jaccard similarity and token overlap)
 - Flesch-Kincaid readability metrics (grade level and reading ease)
 - Batch scoring with aggregate statistics for all metrics
+
+#### Validators
+
 - Validators module with `Check` protocol for validation checks
 - Metric-based validators: `BleuValidator`, `RougeValidator`, `LexicalValidator`
 - Constraint validators: `LengthValidator`, `ReadabilityValidator`, `ContainsValidator`, `ExcludesValidator`
 - Composite validators: `AllOf` (all checks must pass), `AnyOf` (any check must pass)
 - Factory functions for clean validator API (`bleu()`, `rouge()`, `lexical()`, `length()`, `readability()`, `contains()`, `excludes()`, `all_of()`, `any_of()`)
+
+#### Semantic Similarity
+
 - Semantic similarity module with embedding-based text comparison (requires `veritext[semantic]` extra)
 - `SemanticSimilarity` metric using sentence-transformers for semantic relatedness
 - `SemanticValidator` for threshold-based semantic similarity validation
 - `semantic()` factory function for creating semantic validators
 - Embedding caching for performance optimisation in repeated comparisons
+
+#### Pytest Plugin
+
 - Native pytest plugin for CI/CD integration (entry point: `pytest11`)
 - `validate_text()` assertion function for expressive test assertions
 - `text_validation` marker for filtering validation tests
 - Pytest fixtures: `text_validator` factory and `validation_context` helper
 - Detailed failure messages with text preview and check diagnostics
+
+#### Benchmarking
+
 - Benchmark module for quality tracking and regression detection
 - `Benchmark` class for evaluating text quality over time with metric storage
 - `BenchmarkRun` and `RegressionReport` data models for tracking runs
@@ -45,6 +95,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - `assert_no_regression()` raises `RegressionDetectedError` for CI integration
 - Customisable tolerance threshold and window size for regression detection
 - Metadata support for tracking git SHA, model versions, etc.
+
+#### CLI
+
 - Command-line interface (CLI) via `veritext` command
 - `veritext validate` command for inline and file-based text validation
 - JSONL input format support for batch validation (`--file` option)
@@ -54,3 +107,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - `veritext benchmark show` command for viewing benchmark history
 - `veritext benchmark check` command for regression detection with exit code 1 on failure
 - Rich-formatted terminal output with tables and coloured panels
+
+#### Documentation
+
+- Comprehensive readme with usage examples
+- Example scripts: basic validation, chatbot testing, benchmark regression
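The changelog above notes that `SemanticSimilarity` now bounds its embedding cache with LRU eviction and a configurable maximum size. The sketch below illustrates that general technique with an `OrderedDict`. The `LRUCache` class and its method names are hypothetical and are not taken from the Veritext source.

```python
from __future__ import annotations

from collections import OrderedDict


class LRUCache:
    """Tiny least-recently-used cache (illustrative, not Veritext's code)."""

    def __init__(self, max_size: int = 1000) -> None:
        if max_size < 1:
            raise ValueError("max_size must be at least 1")
        self._max_size = max_size
        self._items: OrderedDict[str, list[float]] = OrderedDict()

    def get(self, key: str) -> list[float] | None:
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key: str, value: list[float]) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self._max_size:
            self._items.popitem(last=False)  # evict the least recently used entry

    def __len__(self) -> int:
        return len(self._items)


cache = LRUCache(max_size=2)
cache.put("a", [0.1])
cache.put("b", [0.2])
cache.get("a")         # "a" becomes the most recently used entry
cache.put("c", [0.3])  # the cache is full, so "b" is evicted
print(cache.get("b"), len(cache))  # None 2
```

Keying the cache on the input text keeps memory use proportional to `max_size` rather than to the number of distinct texts seen.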
@@ -871,6 +871,59 @@ uv run pytest --cov=src/veritext --cov-report=term-missing

 ---

+### Phase 10: Portfolio Demos
+
+**Goal:** Interactive demos for showcasing Veritext without installation.
+
+**Step 1 — Streamlit Demo:**
+
+Build a quick interactive web UI for general visitors.
+
+- [ ] Create `demo/streamlit_app.py`
+- [ ] Text input boxes (candidate + reference)
+- [ ] Metric selector (BLEU, ROUGE, lexical, readability)
+- [ ] Threshold sliders for pass/fail validation
+- [ ] Results table with scores and status
+- [ ] Deploy to homeserver (e.g., `veritext.kschappell.com`)
+
+**Step 2 — Jupyter Notebook Collection:**
+
+Deep-dive notebooks targeting data science and ML recruiters.
+
+- [ ] Create `notebooks/` directory
+- [ ] `01-metrics-overview.ipynb` — Introduction to each metric with visualisations
+- [ ] `02-batch-evaluation.ipynb` — Evaluating model outputs at scale
+- [ ] `03-regression-detection.ipynb` — Tracking quality over time
+- [ ] `04-chatbot-validation.ipynb` — Real-world use case
+
+**Step 3 — JupyterLite Deployment:**
+
+Host notebooks as static files running in the browser.
+
+- [ ] Configure JupyterLite build with veritext pre-installed
+- [ ] Bundle notebooks into static site
+- [ ] Deploy alongside Streamlit demo
+
+**Files:**
+- `demo/streamlit_app.py`
+- `notebooks/01-metrics-overview.ipynb`
+- `notebooks/02-batch-evaluation.ipynb`
+- `notebooks/03-regression-detection.ipynb`
+- `notebooks/04-chatbot-validation.ipynb`
+- `notebooks/jupyterlite-config.json`
+
+**Verification:**
+```bash
+# Streamlit
+uv run streamlit run demo/streamlit_app.py
+
+# JupyterLite (local preview)
+jupyter lite build --contents notebooks/
+jupyter lite serve
+```
+
+---
+
 ## Dependencies

 ```toml
@@ -488,3 +488,47 @@ benchmark.assert_no_regression(tolerance=0.03)

 5. **Natural portfolio narrative** — "I was building X and needed a better way to test
   it, so I built this tool." Every interviewer has faced similar problems.
+
+---
+
+## Portfolio Demos (Future)
+
+Interactive demos to showcase Veritext without requiring installation.
+
+### Streamlit Demo
+
+A quick interactive web UI for general visitors and recruiters.
+
+**Features:**
+- Text input boxes (candidate + reference)
+- Metric selector (BLEU, ROUGE, lexical, readability)
+- Threshold sliders for pass/fail validation
+- Results table with scores and status
+
+**Deployment:** Self-hosted on homeserver (e.g., `veritext.kschappell.com`)
+
+**Effort:** ~30 minutes
+
+### Jupyter Notebook Collection
+
+Deep-dive notebooks targeting data science and ML recruiters.
+
+**Notebooks:**
+
+| Notebook | Purpose |
+|----------|---------|
+| `01-metrics-overview.ipynb` | Introduction to each metric with visualisations |
+| `02-batch-evaluation.ipynb` | Evaluating model outputs at scale, statistical analysis |
+| `03-regression-detection.ipynb` | Tracking quality over time, detecting degradation |
+| `04-chatbot-validation.ipynb` | Real-world use case: validating chatbot responses |
+
+**Hosting:** JupyterLite (static files, runs in browser via WebAssembly)
+
+**Deployment:** Self-hosted alongside Streamlit demo
+
+**Why both:**
+
+| Demo Type | Audience | Value |
+|-----------|----------|-------|
+| Streamlit | General visitors | Quick, interactive, no friction |
+| Notebooks | Data/ML recruiters | Shows analytical depth, speaks their language |
@@ -0,0 +1,135 @@
+"""Basic text validation examples.
+
+Demonstrates core Veritext functionality:
+- Single metric scoring (BLEU, ROUGE)
+- Validator usage with thresholds
+- Composite validators (all_of, any_of)
+- Constraint validators (length, readability)
+"""
+
+from veritext.core.types import ValidationContext
+from veritext.metrics import Bleu, Rouge
+from veritext.validators import (
+    all_of,
+    any_of,
+    bleu,
+    contains,
+    excludes,
+    length,
+    readability,
+    rouge,
+)
+
+
+def metric_scoring_example() -> None:
+    """Score text using individual metrics."""
+    candidate = "The quick brown fox jumps over the lazy dog."
+    reference = "A fast brown fox leaps over a sleepy dog."
+
+    # BLEU scoring (translation quality)
+    bleu_metric = Bleu()
+    bleu_result = bleu_metric.score(candidate, reference)
+    print("BLEU Scores:")
+    print(f"  BLEU-1: {bleu_result.bleu1:.3f}")
+    print(f"  BLEU-4: {bleu_result.bleu4:.3f}")
+    print(f"  Brevity penalty: {bleu_result.brevity_penalty:.3f}")
+
+    # ROUGE scoring (summary quality)
+    rouge_metric = Rouge()
+    rouge_result = rouge_metric.score(candidate, reference)
+    print("\nROUGE Scores:")
+    print(f"  ROUGE-1 F1: {rouge_result.rouge1.fmeasure:.3f}")
+    print(f"  ROUGE-L F1: {rouge_result.rouge_l.fmeasure:.3f}")
+
+
+def validator_example() -> None:
+    """Use validators to make pass/fail decisions."""
+    reference = "Machine learning models require training data."
+    candidate = "ML models need training data to learn patterns."
+
+    context = ValidationContext(reference=reference)
+
+    # BLEU validator with minimum threshold
+    bleu_validator = bleu(min_score=0.3)
+    result = bleu_validator.check(candidate, context)
+    print(f"\nBLEU validation (min 0.3): {'PASS' if result.passed else 'FAIL'}")
+
+    # ROUGE validator
+    rouge_validator = rouge(min_score=0.5)
+    result = rouge_validator.check(candidate, context)
+    print(f"ROUGE validation (min 0.5): {'PASS' if result.passed else 'FAIL'}")
+
+
+def composite_validator_example() -> None:
+    """Combine validators with all_of and any_of."""
+    reference = "The product launch exceeded all expectations."
+    candidate = "The product release performed beyond expectations."
+
+    context = ValidationContext(reference=reference)
+
+    # All checks must pass
+    strict_validator = all_of(
+        [
+            bleu(min_score=0.2),
+            rouge(min_score=0.4),
+            length(max_chars=100),
+        ]
+    )
+    result = strict_validator.check(candidate, context)
+    print(f"\nStrict (all_of): {'PASS' if result.passed else 'FAIL'}")
+    if not result.passed:
+        print(f"  Failures: {result.failure_summary}")
+
+    # At least one check must pass
+    flexible_validator = any_of(
+        [
+            bleu(min_score=0.8),  # Unlikely to pass
+            rouge(min_score=0.4),  # More likely
+        ]
+    )
+    result = flexible_validator.check(candidate, context)
+    print(f"Flexible (any_of): {'PASS' if result.passed else 'FAIL'}")
+
+
+def constraint_validator_example() -> None:
+    """Use constraint validators for text properties."""
+    text = "This short guide explains the basics clearly."
+    context = ValidationContext()  # No reference needed for constraints
+
+    # Length constraints
+    length_validator = length(min_chars=20, max_chars=100, min_words=5, max_words=20)
+    result = length_validator.check(text, context)
+    print(f"\nLength check: {'PASS' if result.passed else 'FAIL'}")
+
+    # Readability (Flesch-Kincaid)
+    readability_validator = readability(max_grade=10.0)
+    result = readability_validator.check(text, context)
+    print(f"Readability (grade <= 10): {'PASS' if result.passed else 'FAIL'}")
+
+    # Content patterns
+    contains_validator = contains(patterns=["guide", "basics"])
+    result = contains_validator.check(text, context)
+    print(f"Contains required terms: {'PASS' if result.passed else 'FAIL'}")
+
+    excludes_validator = excludes(patterns=["error", "warning"])
+    result = excludes_validator.check(text, context)
+    print(f"Excludes forbidden terms: {'PASS' if result.passed else 'FAIL'}")
+
+
+def main() -> None:
+    """Run all examples."""
+    print("=" * 60)
+    print("Veritext Basic Validation Examples")
+    print("=" * 60)
+
+    metric_scoring_example()
+    validator_example()
+    composite_validator_example()
+    constraint_validator_example()
+
+    print("\n" + "=" * 60)
+    print("All examples completed.")
+
+
+if __name__ == "__main__":
+    main()
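The `all_of` and `any_of` calls in the example file above combine several checks into one. A rough sketch of how such composites can work is shown here, including the defensive copy that avoids the list-aliasing problem mentioned in the changelog. The `Check` alias, `CompositeResult`, and both classes are simplified stand-ins, not the library's actual types.

```python
from collections.abc import Callable, Sequence
from dataclasses import dataclass

# Simplified stand-in: a check is any function from text to pass/fail.
Check = Callable[[str], bool]


@dataclass(frozen=True)
class CompositeResult:
    passed: bool
    outcomes: tuple[bool, ...]


class AllOf:
    """Passes only when every wrapped check passes."""

    def __init__(self, checks: Sequence[Check]) -> None:
        # Copy into a tuple so later changes to the caller's list
        # cannot alter this validator.
        self._checks = tuple(checks)

    def check(self, text: str) -> CompositeResult:
        outcomes = tuple(c(text) for c in self._checks)
        return CompositeResult(passed=all(outcomes), outcomes=outcomes)


class AnyOf:
    """Passes when at least one wrapped check passes."""

    def __init__(self, checks: Sequence[Check]) -> None:
        self._checks = tuple(checks)

    def check(self, text: str) -> CompositeResult:
        outcomes = tuple(c(text) for c in self._checks)
        return CompositeResult(passed=any(outcomes), outcomes=outcomes)


checks = [lambda t: len(t) <= 100, lambda t: "guide" in t]
strict = AllOf(checks)
checks.append(lambda t: False)  # mutating the original list has no effect
result = strict.check("This short guide explains the basics clearly.")
print(result.passed)  # True
```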
@@ -0,0 +1,160 @@
+"""Benchmark quality tracking with regression detection.
+
+Demonstrates Veritext's benchmark module for CI integration:
+- Creating a benchmark suite
+- Running evaluations and storing results
+- Checking for quality regression
+- CI integration pattern with exit codes
+"""
+
+import tempfile
+from pathlib import Path
+
+from veritext.benchmark import Benchmark
+from veritext.core.exceptions import RegressionDetectedError
+
+
+def create_sample_data() -> tuple[list[str], list[str]]:
+    """Create sample candidate/reference pairs for benchmarking."""
+    # Simulated summarisation outputs and references
+    candidates = [
+        "The new policy aims to reduce carbon emissions by 50% by 2030.",
+        "Scientists discovered a new species of deep-sea fish.",
+        "The company reported record profits in the third quarter.",
+        "Researchers developed a breakthrough treatment for the disease.",
+        "The city plans to expand public transportation routes.",
+    ]
+    references = [
+        "The policy targets a 50% reduction in carbon emissions by 2030.",
+        "A new deep-sea fish species was discovered by marine biologists.",
+        "Record profits were announced by the company for Q3.",
+        "A breakthrough disease treatment was developed by researchers.",
+        "Public transport expansion is planned for the city.",
+    ]
+    return candidates, references
+
+
+def run_benchmark_example() -> None:
+    """Run a benchmark evaluation and view results."""
+    # Use a temp directory for this example
+    with tempfile.TemporaryDirectory() as tmpdir:
+        storage_path = Path(tmpdir) / "benchmarks"
+
+        # Create benchmark suite
+        bench = Benchmark("summariser_quality", storage_path=storage_path)
+
+        candidates, references = create_sample_data()
+
+        # Run evaluation
+        print("Running benchmark evaluation...")
+        run = bench.evaluate(
+            candidates=candidates,
+            references=references,
+            metrics=["rouge_l", "bleu4"],
+            metadata={"model": "v1.0", "dataset": "test"},
+        )
+
+        print("\nBenchmark run completed:")
+        print(f"  Run ID: {run.id[:8]}...")
+        print(f"  Samples: {run.sample_count}")
+        print("  Metrics:")
+        for name, value in run.metrics.items():
+            print(f"    {name}: {value:.4f}")
+
+
+def regression_detection_example() -> None:
+    """Demonstrate regression detection with historical comparison."""
+    with tempfile.TemporaryDirectory() as tmpdir:
+        storage_path = Path(tmpdir) / "benchmarks"
+        bench = Benchmark("summariser_quality", storage_path=storage_path)
+
+        candidates, references = create_sample_data()
+
+        # Simulate historical runs with stable quality
+        print("\nBuilding baseline with historical runs...")
+        for i in range(5):
+            bench.evaluate(
+                candidates=candidates,
+                references=references,
+                metrics=["rouge_l", "bleu4"],
+                metadata={"run": f"baseline_{i}"},
+            )
+            print(f"  Baseline run {i + 1} recorded")
+
+        # Check regression (no degradation expected)
+        report = bench.check_regression(tolerance=0.05, window=5)
+        print(f"\nRegression check: {'DETECTED' if report.detected else 'NONE'}")
+
+        # Simulate a degraded model
+        print("\nSimulating degraded model output...")
+        degraded_candidates = [
+            "Policy carbon emissions.",  # Much shorter/worse
+            "Fish discovered.",
+            "Company profits.",
+            "Treatment developed.",
+            "Transport expansion.",
+        ]
+        bench.evaluate(
+            candidates=degraded_candidates,
+            references=references,
+            metrics=["rouge_l", "bleu4"],
+            metadata={"model": "v1.1-broken"},
+        )
+
+        # Check regression (should detect)
+        report = bench.check_regression(tolerance=0.05, window=5)
+        print(f"Regression check: {'DETECTED' if report.detected else 'NONE'}")
+        if report.detected:
+            print("\nRegression details:")
+            for metric, delta in report.deltas.items():
+                baseline = report.baseline.get(metric, 0)
+                current = report.current.get(metric, 0)
+                print(f"  {metric}: {baseline:.4f} -> {current:.4f} ({delta:+.4f})")
+
+
+def ci_integration_example() -> None:
+    """CI integration pattern using assert_no_regression()."""
+    with tempfile.TemporaryDirectory() as tmpdir:
+        storage_path = Path(tmpdir) / "benchmarks"
+        bench = Benchmark("ci_check", storage_path=storage_path)
+
+        candidates, references = create_sample_data()
+
+        # Build baseline
+        for _ in range(3):
+            bench.evaluate(candidates, references, metrics=["rouge_l"])
+
+        # Simulate CI check
+        print("\n" + "=" * 50)
+        print("CI Integration Example")
+        print("=" * 50)
+
+        print("\nRunning evaluation...")
+        bench.evaluate(candidates, references, metrics=["rouge_l"])
+
+        print("Checking for regression...")
+        try:
+            bench.assert_no_regression(tolerance=0.05, window=3)
+            print("No regression detected.")
+            print("CI status: EXIT 0")
+        except RegressionDetectedError as e:
+            print(f"Regression detected: {e}")
+            print("CI status: EXIT 1")
+
+
+def main() -> None:
+    """Run all benchmark examples."""
+    print("=" * 60)
+    print("Veritext Benchmark & Regression Detection Examples")
+    print("=" * 60)
+
+    run_benchmark_example()
+    regression_detection_example()
+    ci_integration_example()
+
+    print("\n" + "=" * 60)
+    print("All examples completed.")
+
+
+if __name__ == "__main__":
+    main()
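The benchmark example above relies on `check_regression(tolerance=..., window=...)`. One plausible way to implement such a check is sketched below. It assumes the baseline is the mean of up to `window` earlier runs and that `tolerance` is an absolute drop; the source does not confirm either detail, and `find_regressions` is a hypothetical helper.

```python
from __future__ import annotations

from statistics import fmean


def find_regressions(
    history: list[dict[str, float]],
    tolerance: float = 0.05,
    window: int = 5,
) -> dict[str, float]:
    """Return {metric: delta} for metrics that dropped by more than `tolerance`.

    `history` is ordered oldest to newest; the last entry is the current run.
    Assumption: the baseline is the mean of up to `window` preceding runs.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    if len(history) < 2:
        return {}  # nothing to compare against
    *previous, current = history
    baseline_runs = previous[-window:]
    regressions: dict[str, float] = {}
    for metric, value in current.items():
        past = [run[metric] for run in baseline_runs if metric in run]
        if not past:
            continue  # metric has no history yet
        delta = value - fmean(past)
        if delta < -tolerance:
            regressions[metric] = delta
    return regressions


stable = [{"rouge_l": 0.62}] * 5
print(find_regressions(stable + [{"rouge_l": 0.61}]))  # {} (within tolerance)
print(find_regressions(stable + [{"rouge_l": 0.30}]))  # flags rouge_l
```

Averaging over a window makes the check less sensitive to a single unusually good or bad historical run than comparing against the previous run alone.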
@@ -0,0 +1,140 @@
+"""Pytest integration for chatbot testing.
+
+Demonstrates Veritext's pytest plugin for testing chatbot responses:
+- validate_text() assertion function
+- Custom test fixtures
+- Test organisation with markers
+"""
+
+import pytest
+
+from veritext.pytest_plugin import validate_text
+
+# Sample chatbot responses for testing
+CHATBOT_RESPONSES = {
+    "greeting": {
+        "input": "Hello!",
+        "response": "Hi there! How can I help you today?",
+        "expected_keywords": ["help", "hi"],
+    },
+    "weather": {
+        "input": "What's the weather like?",
+        "response": "I don't have access to real-time weather data, but you can "
+        "check a weather service like weather.com for current conditions.",
+        "expected_keywords": ["weather", "check"],
+    },
+    "farewell": {
+        "input": "Goodbye!",
+        "response": "Goodbye! Have a great day!",
+        "expected_keywords": ["goodbye", "day"],
+    },
+}
+
+
+# Fixtures for common test setup
+@pytest.fixture
+def greeting_response() -> str:
+    """Provide a sample greeting response."""
+    return CHATBOT_RESPONSES["greeting"]["response"]
+
+
+@pytest.fixture
+def weather_response() -> str:
+    """Provide a sample weather response."""
+    return CHATBOT_RESPONSES["weather"]["response"]
+
+
+# Basic validation tests
+class TestResponseQuality:
+    """Test chatbot response quality using Veritext."""
+
+    def test_greeting_length(self, greeting_response: str) -> None:
+        """Greeting responses should be concise."""
+        validate_text(
+            greeting_response,
+            min_length=10,
+            max_length=100,
+        )
+
+    def test_greeting_readability(self, greeting_response: str) -> None:
+        """Greeting responses should be easy to read."""
+        validate_text(
+            greeting_response,
+            max_reading_grade=8.0,
+        )
+
+    def test_greeting_contains_keywords(self, greeting_response: str) -> None:
+        """Greeting should contain expected terms."""
+        validate_text(
+            greeting_response,
+            must_contain=["help"],
+        )
+
+    def test_weather_response_quality(self, weather_response: str) -> None:
+        """Weather response should be informative and readable."""
+        validate_text(
+            weather_response,
+            min_length=50,
+            max_length=500,
+            max_reading_grade=10.0,
+            must_contain=["weather"],
+        )
+
+
+# Tests with reference comparison
+class TestResponseSimilarity:
+    """Test response similarity against reference texts."""
+
+    def test_greeting_similarity(self) -> None:
+        """Greeting should match expected style."""
+        reference = "Hello! How may I assist you today?"
+        response = CHATBOT_RESPONSES["greeting"]["response"]
+
+        validate_text(
+            response,
+            reference=reference,
+            min_rouge=0.3,  # Allow variation in wording
+            min_length=10,
+        )
+
+    def test_farewell_similarity(self) -> None:
+        """Farewell should match expected style."""
+        reference = "Goodbye! Have a wonderful day!"
+        response = CHATBOT_RESPONSES["farewell"]["response"]
+
+        validate_text(
+            response,
+            reference=reference,
+            min_rouge=0.5,
+            must_contain=["goodbye"],
+        )
+
+
+# Content safety tests
+class TestContentSafety:
+    """Test responses for inappropriate content."""
+
+    @pytest.mark.parametrize("response_key", ["greeting", "weather", "farewell"])
+    def test_no_profanity(self, response_key: str) -> None:
+        """Responses should not contain profanity."""
+        response = CHATBOT_RESPONSES[response_key]["response"]
+        validate_text(
+            response,
+            must_exclude=["damn", "hell", "crap"],
+            min_length=1,
+        )
+
+    @pytest.mark.parametrize("response_key", ["greeting", "weather", "farewell"])
+    def test_no_harmful_content(self, response_key: str) -> None:
+        """Responses should not contain harmful instructions."""
+        response = CHATBOT_RESPONSES[response_key]["response"]
+        validate_text(
+            response,
+            must_exclude=["hack", "exploit", "attack"],
+            min_length=1,
+        )
+
+
+# Run tests when executed directly
+if __name__ == "__main__":
+    pytest.main([__file__, "-v"])
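The `must_contain` and `must_exclude` arguments used in the tests above take lists of patterns, and the changelog says pattern validation now fails at init time rather than during `check()`. A minimal sketch of that idea follows. `ContainsPatterns`, its `missing()` method, and the `ignore_case` flag are hypothetical; whether Veritext matches case-insensitively is not stated in the source.

```python
import re


class ContainsPatterns:
    """Illustrative stand-in: reports which required patterns are absent."""

    def __init__(self, patterns: list[str], ignore_case: bool = True) -> None:
        flags = re.IGNORECASE if ignore_case else 0
        compiled = []
        for pattern in patterns:
            try:
                compiled.append(re.compile(pattern, flags))
            except re.error as exc:
                # Surface a bad pattern at construction time, not mid-run.
                raise ValueError(f"Invalid regex pattern {pattern!r}: {exc}") from exc
        self._patterns = tuple(compiled)

    def missing(self, text: str) -> list[str]:
        """Return the source patterns that do not match anywhere in `text`."""
        return [p.pattern for p in self._patterns if p.search(text) is None]


checker = ContainsPatterns(["help", r"\bhi\b"])
print(checker.missing("Hi there! How can I help you today?"))  # []

try:
    ContainsPatterns(["(unclosed"])
except ValueError as exc:
    print(f"Rejected at init: {exc}")
```

Compiling once in the constructor also avoids recompiling the same patterns on every call.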
@@ -1,6 +1,6 @@
 [project]
 name = "veritext"
-version = "0.1.0-dev"
+version = "0.1.0.dev0"
 description = "Semantic text validation framework"
 readme = "readme.md"
 requires-python = ">=3.11"
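The version change above moves from `0.1.0-dev` to `0.1.0.dev0`. PEP 440 allows tools to normalise some alternative spellings, but only the second string is in canonical form. The sketch below checks this with the canonical-form regular expression given in Appendix B of PEP 440; the helper name `is_canonical` follows that appendix.

```python
import re

# Canonical-form pattern from PEP 440, Appendix B.
_CANONICAL = re.compile(
    r"^([1-9][0-9]*!)?(0|[1-9][0-9]*)(\.(0|[1-9][0-9]*))*"
    r"((a|b|rc)(0|[1-9][0-9]*))?(\.post(0|[1-9][0-9]*))?(\.dev(0|[1-9][0-9]*))?$"
)


def is_canonical(version: str) -> bool:
    """Return True when `version` is already in PEP 440 canonical form."""
    return _CANONICAL.match(version) is not None


print(is_canonical("0.1.0-dev"))   # False
print(is_canonical("0.1.0.dev0"))  # True
```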
@@ -2,48 +2,398 @@

 Semantic text validation framework for Python.

-Validates text outputs against quality criteria using metrics like BLEU, ROUGE,
-and semantic similarity. Designed for developers building systems that produce
-text (chatbots, content generators, summarisation tools) who need automated
-quality assurance beyond simple string matching.
+Veritext validates text outputs against quality criteria using metrics like BLEU,
+ROUGE, and semantic similarity. Designed for developers building systems that produce
+text (chatbots, content generators, summarisation tools) who need automated quality
+assurance beyond simple string matching.

-## Status
+## Features

-Under active development. See [changelog.md](changelog.md) for progress.
+- **Multiple metrics** — BLEU, ROUGE, lexical similarity, readability, semantic
+  embeddings
+- **Composable validators** — Build complex checks from simple primitives
+- **Native pytest integration** — `validate_text()` assertion for test suites
+- **Quality benchmarking** — Track metrics over time with regression detection
+- **CLI tools** — Command-line validation and benchmark management

 ## Installation

 ```bash
 pip install veritext

-# With semantic similarity support
+# With semantic similarity support (sentence-transformers)
 pip install veritext[semantic]
 ```

 ## Quick Start

 ```python
-from veritext import validators as v
 from veritext.core.types import ValidationContext
+from veritext.validators import all_of, bleu, length, rouge

-# Create validators
-validator = v.all_of([
-    v.bleu(min_score=0.7),
-    v.length(max_chars=500),
+# Create a validator
+validator = all_of([
+    bleu(min_score=0.5),
+    rouge(min_score=0.6),
+    length(max_chars=500),
 ])

 # Validate text
-context = ValidationContext(reference="The cat sat on the mat.")
-result = validator.check("A cat is sitting on the mat.", context)
+context = ValidationContext(reference="The quick brown fox jumps over the lazy dog.")
+result = validator.check("A fast brown fox leaps over a sleepy dog.", context)

-if not result.passed:
+if result.passed:
+    print("Validation passed!")
+else:
    print(result.failure_summary)
 ```

-## Documentation
+## Metrics

- [Project Plan](docs/project-plan.md)
- [Implementation Plan](docs/implementation-plan.md)
+Veritext provides several metrics for text evaluation.
+
+### BLEU
+
+Measures n-gram precision against reference text. Useful for translation and
+generation quality.
+
+```python
+from veritext.metrics import Bleu
+
+bleu = Bleu()
+result = bleu.score(
+    candidate="The cat sat on the mat.",
+    reference="A cat is sitting on the mat.",
+)
+print(f"BLEU-4: {result.bleu4:.3f}")  # Uses 1-4 gram precision
+print(f"BLEU-1: {result.bleu1:.3f}")  # Unigram precision only
+```
+
+### ROUGE
+
+Measures recall-oriented overlap with reference text. Useful for summarisation.
+
+```python
+from veritext.metrics import Rouge
+
+rouge = Rouge()
+result = rouge.score(
+    candidate="Scientists found a new planet.",
+    reference="Researchers discovered a new planet in the solar system.",
+)
+print(f"ROUGE-1 F1: {result.rouge1.fmeasure:.3f}")  # Unigram overlap
+print(f"ROUGE-L F1: {result.rouge_l.fmeasure:.3f}")  # Longest common subsequence
+```
+
+### Lexical Similarity
+
+Measures token overlap using Jaccard similarity.
+
+```python
+from veritext.metrics import Lexical
+
+lexical = Lexical()
+result = lexical.score(
+    candidate="The quick brown fox",
+    reference="The fast brown fox",
+)
+print(f"Jaccard: {result.jaccard:.3f}")
+print(f"Token overlap: {result.token_overlap:.3f}")
+```
+
+### Readability
+
+Computes Flesch-Kincaid scores for text complexity.
+
+```python
+from veritext.metrics import Readability
+
+readability = Readability()
+result = readability.score("This is a simple sentence.")
+print(f"Grade level: {result.flesch_kincaid_grade:.1f}")
+print(f"Reading ease: {result.flesch_reading_ease:.1f}")
+```
+
+### Semantic Similarity (Optional)
+
+Requires `pip install veritext[semantic]`.
+
+```python
+from veritext.semantic import SemanticSimilarity
+
+semantic = SemanticSimilarity()
+result = semantic.score(
+    candidate="The dog is running in the park.",
+    reference="A canine is jogging through the garden.",
+)
+print(f"Similarity: {result.score:.3f}")
+```
+
+## Validators
+
+Validators wrap metrics with thresholds to make pass/fail decisions.
+
+### Metric-Based Validators
+
+```python
+from veritext.core.types import ValidationContext
+from veritext.validators import bleu, lexical, rouge
+
+context = ValidationContext(reference="Reference text here.")
+
+# BLEU validation
+validator = bleu(min_score=0.5, variant=4)  # BLEU-4
+result = validator.check("Candidate text here.", context)
+
+# ROUGE validation
+validator = rouge(min_score=0.6, variant="l")  # ROUGE-L
+result = validator.check("Candidate text here.", context)
+
+# Lexical validation
+validator = lexical(min_jaccard=0.3, min_overlap=0.5)
+result = validator.check("Candidate text here.", context)
+```
+
+### Constraint Validators
+
+These don't require reference text.
+
+```python
+from veritext.core.types import ValidationContext
+from veritext.validators import contains, excludes, length, readability
+
+context = ValidationContext()  # No reference needed
+
+# Length constraints
+validator = length(min_chars=50, max_chars=500, min_words=10)
+result = validator.check("Your text here...", context)
+
+# Readability constraints
+validator = readability(max_grade=8.0, min_ease=60.0)
+result = validator.check("Your text here...", context)
+
+# Content requirements
+validator = contains(patterns=["important", "keyword"])
+result = validator.check("This important text has a keyword.", context)
+
+# Content exclusions
+validator = excludes(patterns=["forbidden", "banned"])
+result = validator.check("This text is clean.", context)
+```
+
+### Composite Validators
+
+Combine multiple checks with logical operators.
+
+```python
+from veritext.validators import all_of, any_of, bleu, length, rouge
+
+# All checks must pass
+validator = all_of([
+    bleu(min_score=0.5),
+    rouge(min_score=0.6),
+    length(max_chars=500),
+])
+
+# At least one check must pass
+validator = any_of([
+    bleu(min_score=0.7),
+    rouge(min_score=0.7),
+])
+```
+
+## Pytest Plugin
+
+Veritext provides native pytest integration for testing text quality.
+
+### Basic Usage
+
+```python
+from veritext.pytest_plugin import validate_text
+
+
+def test_response_quality():
+    response = "This is a helpful response to your question."
+
+    validate_text(
+        response,
+        min_length=20,
+        max_length=200,
+        max_reading_grade=10.0,
+        must_contain=["helpful"],
+        must_exclude=["error", "sorry"],
+    )
+
+
+def test_summary_similarity():
+    summary = "Scientists discovered a new planet."
+    reference = "Researchers found a new planet in our solar system."
+
+    validate_text(
+        summary,
+        reference=reference,
+        min_rouge=0.5,
+        min_length=10,
+    )
+```
+
+### Available Parameters
+
+| Parameter | Description |
+|-----------|-------------|
+| `reference` | Reference text for comparison metrics |
+| `min_bleu` | Minimum BLEU-4 score (0.0-1.0) |
+| `min_rouge` | Minimum ROUGE-L F1 score (0.0-1.0) |
+| `min_semantic` | Minimum semantic similarity (0.0-1.0) |
+| `min_length` | Minimum character count |
+| `max_length` | Maximum character count |
+| `max_reading_grade` | Maximum Flesch-Kincaid grade level |
+| `must_contain` | List of required patterns |
+| `must_exclude` | List of forbidden patterns |
+
+## Benchmarking
+
+Track text quality over time and detect regressions.
+
+### Running Benchmarks
+
+```python
+from veritext.benchmark import Benchmark
+
+# Create a benchmark suite
+bench = Benchmark("summariser_quality", storage_path="benchmarks/")
+
+# Evaluate a batch of outputs
+candidates = ["Summary 1...", "Summary 2...", "Summary 3..."]
+references = ["Reference 1...", "Reference 2...", "Reference 3..."]
+
+run = bench.evaluate(
+    candidates=candidates,
+    references=references,
+    metrics=["rouge_l", "bleu4"],
+    metadata={"model": "v1.2", "git_sha": "abc123"},
+)
+
+print(f"ROUGE-L: {run.metrics['rouge_l']:.4f}")
+print(f"BLEU-4: {run.metrics['bleu4']:.4f}")
+```
+
+### Regression Detection
+
+```python
+from veritext.benchmark import Benchmark
+from veritext.core.exceptions import RegressionDetectedError
+
+bench = Benchmark("summariser_quality")
+
+# Check for regression against historical baseline
+report = bench.check_regression(tolerance=0.05, window=10)
+if report.detected:
+    print("Quality regression detected!")
+    for metric, delta in report.deltas.items():
+        print(f"  {metric}: {delta:+.4f}")
+
+# Or raise an exception for CI integration
+try:
+    bench.assert_no_regression(tolerance=0.05)
+except RegressionDetectedError as e:
+    print(f"CI failure: {e}")
+    raise SystemExit(1)
+```
+
+### Viewing History
+
+```python
+from veritext.benchmark import Benchmark
+
+bench = Benchmark("summariser_quality")
+
+for run in bench.get_history(limit=10):
+    print(f"{run.timestamp}: rouge_l={run.metrics.get('rouge_l', 0):.4f}")
+```
+
+## CLI
+
+Veritext provides a command-line interface for validation and benchmarking.
+
+### Validate Text
+
+```bash
+# Inline validation
+veritext validate "Candidate text" -r "Reference text" -m bleu,rouge
+
+# File-based batch validation (JSONL with "candidate" and "reference" fields)
+veritext validate -f outputs.jsonl -m bleu,rouge,lexical
+
+# With threshold for pass/fail
+veritext validate "Text" -r "Reference" -m bleu -t 0.5 -o simple
+
+# Output formats: table (default), json, simple
+veritext validate "Text" -r "Reference" -m bleu -o json
+```
+
+### Benchmark Commands
+
+```bash
+# Run a benchmark evaluation
+veritext benchmark run my_bench -f data.jsonl -m rouge_l,bleu4
+
+# View benchmark history
+veritext benchmark show my_bench --last 10
+
+# Check for regression (exits with code 1 if detected)
+veritext benchmark check my_bench --tolerance 0.05 --window 10
+```
+
+### JSONL Format
+
+For file-based operations, use JSONL with `candidate` and `reference` fields:
+
+```json
+{"candidate": "Model output 1", "reference": "Expected output 1"}
+{"candidate": "Model output 2", "reference": "Expected output 2"}
+```
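A file in this shape can be produced with the standard library alone. The following is a minimal sketch, not part of Veritext: `write_pairs` and `read_pairs` are hypothetical helper names, and only the `candidate` and `reference` field names come from the README text above.

```python
import json
from pathlib import Path


def write_pairs(path: Path, pairs: list[tuple[str, str]]) -> None:
    """Write (candidate, reference) pairs as one JSON object per line."""
    with path.open("w", encoding="utf-8") as handle:
        for candidate, reference in pairs:
            record = {"candidate": candidate, "reference": reference}
            # json.dumps escapes embedded newlines, so each record stays on one line.
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_pairs(path: Path) -> list[tuple[str, str]]:
    """Read the pairs back, skipping blank lines."""
    pairs: list[tuple[str, str]] = []
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                record = json.loads(line)
                pairs.append((record["candidate"], record["reference"]))
    return pairs
```

Writing through `json.dumps` rather than string formatting keeps quotes and newlines inside the texts from breaking the one-record-per-line layout.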
+
+## Configuration
+
+Veritext uses environment variables for configuration:
+
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `VERITEXT_LOG_LEVEL` | `INFO` | Logging level |
+| `VERITEXT_LOG_FORMAT` | `console` | Log format (`console` or `json`) |
+
+## Development
+
+### Setup
+
+```bash
+git clone https://gitea.kschappell.com/kschappell/veritext.git
+cd veritext
+uv sync --all-extras
+```
+
+### Quality Checks
+
+```bash
+# Linting
+uv run ruff check .
+
+# Formatting
+uv run ruff format --check .
+
+# Type checking
+uv run mypy src/
+
+# Tests
+uv run pytest
+```
+
+### Running Examples
+
+```bash
+uv run python examples/basic_validation.py
+uv run pytest examples/chatbot_testing.py -v
+uv run python examples/benchmark_regression.py
+```

 ## Licence

@@ -11,11 +11,91 @@ from veritext.metrics.bleu import Bleu
 from veritext.metrics.lexical import Lexical
 from veritext.metrics.rouge import Rouge

-# Available metrics mapped to their computation functions
+# Available metrics
 AVAILABLE_METRICS = frozenset(
    {"bleu", "bleu1", "bleu2", "bleu3", "bleu4", "rouge", "rouge_l", "lexical"}
 )

+# Lazily-initialised metric instances
+_bleu: Bleu | None = None
+_rouge: Rouge | None = None
+_lexical: Lexical | None = None
+
+
+def _get_bleu() -> Bleu:
+    """Get or create the BLEU metric instance."""
+    global _bleu
+    if _bleu is None:
+        _bleu = Bleu()
+    return _bleu
+
+
+def _get_rouge() -> Rouge:
+    """Get or create the ROUGE metric instance."""
+    global _rouge
+    if _rouge is None:
+        _rouge = Rouge()
+    return _rouge
+
+
+def _get_lexical() -> Lexical:
+    """Get or create the lexical metric instance."""
+    global _lexical
+    if _lexical is None:
+        _lexical = Lexical()
+    return _lexical
+
+
+# Metric extractors: each helper returns a dict mapping result keys to scores.
+# - *_single(candidate, reference, ...): scores for one text pair
+# - *_batch(candidates, references, ...): mean scores across a batch
+#   (an empty dict if the batch has no stats for that key)
+def _bleu_single(candidate: str, reference: str, key: str) -> dict[str, float]:
+    """Extract a BLEU score for single mode."""
+    result = _get_bleu().score(candidate, reference)
+    return {key: getattr(result, key)}
+
+
+def _bleu_batch(
+    candidates: list[str], references: list[str], key: str
+) -> dict[str, float]:
+    """Extract a BLEU score for batch mode."""
+    batch = _get_bleu().batch_score(candidates, references)
+    stats = batch.stats.get(key)
+    return {key: stats.mean} if stats else {}
+
+
+def _rouge_single(candidate: str, reference: str) -> dict[str, float]:
+    """Extract ROUGE-L F-measure for single mode."""
+    result = _get_rouge().score(candidate, reference)
+    return {"rouge_l": result.rouge_l.fmeasure}
+
+
+def _rouge_batch(candidates: list[str], references: list[str]) -> dict[str, float]:
+    """Extract ROUGE-L F-measure for batch mode."""
+    batch = _get_rouge().batch_score(candidates, references)
+    stats = batch.stats.get("rouge_l_fmeasure")
+    return {"rouge_l": stats.mean} if stats else {}
+
+
+def _lexical_single(candidate: str, reference: str) -> dict[str, float]:
+    """Extract lexical scores for single mode."""
+    result = _get_lexical().score(candidate, reference)
+    return {"jaccard": result.jaccard, "token_overlap": result.token_overlap}
+
+
+def _lexical_batch(candidates: list[str], references: list[str]) -> dict[str, float]:
+    """Extract lexical scores for batch mode."""
+    batch = _get_lexical().batch_score(candidates, references)
+    results: dict[str, float] = {}
+    jaccard_stats = batch.stats.get("jaccard")
+    overlap_stats = batch.stats.get("token_overlap")
+    if jaccard_stats:
+        results["jaccard"] = jaccard_stats.mean
+    if overlap_stats:
+        results["token_overlap"] = overlap_stats.mean
+    return results
+

 def _compute_metrics(
    candidate: str,
@@ -24,30 +104,16 @@ def _compute_metrics(
 ) -> dict[str, float]:
    """Compute requested metrics for a single text pair."""
    results: dict[str, float] = {}
-    bleu = Bleu()
-    rouge = Rouge()
-    lexical = Lexical()

    for metric in metric_names:
-        if metric == "bleu" or metric == "bleu4":
-            bleu_result = bleu.score(candidate, reference)
-            results["bleu4"] = bleu_result.bleu4
-        elif metric == "bleu1":
-            bleu_result = bleu.score(candidate, reference)
-            results["bleu1"] = bleu_result.bleu1
-        elif metric == "bleu2":
-            bleu_result = bleu.score(candidate, reference)
-            results["bleu2"] = bleu_result.bleu2
-        elif metric == "bleu3":
-            bleu_result = bleu.score(candidate, reference)
-            results["bleu3"] = bleu_result.bleu3
-        elif metric == "rouge" or metric == "rouge_l":
-            rouge_result = rouge.score(candidate, reference)
-            results["rouge_l"] = rouge_result.rouge_l.fmeasure
+        if metric in ("bleu", "bleu4"):
+            results.update(_bleu_single(candidate, reference, "bleu4"))
+        elif metric in ("bleu1", "bleu2", "bleu3"):
+            results.update(_bleu_single(candidate, reference, metric))
+        elif metric in ("rouge", "rouge_l"):
+            results.update(_rouge_single(candidate, reference))
        elif metric == "lexical":
-            lexical_result = lexical.score(candidate, reference)
-            results["jaccard"] = lexical_result.jaccard
-            results["token_overlap"] = lexical_result.token_overlap
+            results.update(_lexical_single(candidate, reference))

    return results

@@ -58,46 +124,17 @@ def _compute_batch_metrics(
    metric_names: list[str],
 ) -> dict[str, float]:
    """Compute average metrics for a batch of text pairs."""
-    bleu = Bleu()
-    rouge = Rouge()
-    lexical = Lexical()
-
    results: dict[str, float] = {}

    for metric in metric_names:
-        if metric == "bleu" or metric == "bleu4":
-            bleu_batch = bleu.batch_score(candidates, references)
-            stats = bleu_batch.stats.get("bleu4")
-            if stats:
-                results["bleu4"] = stats.mean
-        elif metric == "bleu1":
-            bleu_batch = bleu.batch_score(candidates, references)
-            stats = bleu_batch.stats.get("bleu1")
-            if stats:
-                results["bleu1"] = stats.mean
-        elif metric == "bleu2":
-            bleu_batch = bleu.batch_score(candidates, references)
-            stats = bleu_batch.stats.get("bleu2")
-            if stats:
-                results["bleu2"] = stats.mean
-        elif metric == "bleu3":
-            bleu_batch = bleu.batch_score(candidates, references)
-            stats = bleu_batch.stats.get("bleu3")
-            if stats:
-                results["bleu3"] = stats.mean
-        elif metric == "rouge" or metric == "rouge_l":
-            rouge_batch = rouge.batch_score(candidates, references)
-            stats = rouge_batch.stats.get("rouge_l_fmeasure")
-            if stats:
-                results["rouge_l"] = stats.mean
+        if metric in ("bleu", "bleu4"):
+            results.update(_bleu_batch(candidates, references, "bleu4"))
+        elif metric in ("bleu1", "bleu2", "bleu3"):
+            results.update(_bleu_batch(candidates, references, metric))
+        elif metric in ("rouge", "rouge_l"):
+            results.update(_rouge_batch(candidates, references))
        elif metric == "lexical":
-            lexical_batch = lexical.batch_score(candidates, references)
-            jaccard_stats = lexical_batch.stats.get("jaccard")
-            overlap_stats = lexical_batch.stats.get("token_overlap")
-            if jaccard_stats:
-                results["jaccard"] = jaccard_stats.mean
-            if overlap_stats:
-                results["token_overlap"] = overlap_stats.mean
+            results.update(_lexical_batch(candidates, references))

    return results

@@ -1,5 +1,6 @@
 """Configuration management using pydantic-settings."""

+from functools import lru_cache
 from pathlib import Path
 from typing import Literal

@@ -54,6 +55,7 @@ class VeritextSettings(BaseSettings):
    )


+@lru_cache
 def get_settings() -> VeritextSettings:
-    """Get the current settings instance."""
+    """Get the cached settings instance."""
    return VeritextSettings()
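One consequence of caching the settings factory is worth illustrating: environment changes made after the first call are not seen until the cache is cleared. The sketch below uses a hypothetical `DemoSettings` stand-in and a made-up `DEMO_LOG_LEVEL` variable rather than the real `VeritextSettings`, so it runs without pydantic-settings; only the `@lru_cache` behaviour is the point.

```python
import os
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class DemoSettings:
    """Hypothetical stand-in for a settings class; holds one value."""

    log_level: str


@lru_cache
def get_demo_settings() -> DemoSettings:
    # The environment is read only when the cache is empty.
    return DemoSettings(log_level=os.environ.get("DEMO_LOG_LEVEL", "INFO"))


os.environ["DEMO_LOG_LEVEL"] = "INFO"
first = get_demo_settings()

os.environ["DEMO_LOG_LEVEL"] = "DEBUG"
second = get_demo_settings()  # cached: same object, still INFO

get_demo_settings.cache_clear()  # drop the cached instance
third = get_demo_settings()  # re-reads the environment, now DEBUG
```

Tests that override environment variables and then go through a cached factory would likely need a `cache_clear()` call in a fixture to stay isolated.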
@@ -137,8 +137,8 @@ class Readability:
                flesch_reading_ease=0.0,
            )

-        # Count sentences
-        sentence_count = _count_sentences(candidate)
+        # Count sentences (ensure at least 1 to avoid division by zero)
+        sentence_count = max(_count_sentences(candidate), 1)

        # Count syllables
        syllable_count = sum(_count_syllables(word) for word in words)
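The guard in this hunk matters because both Flesch formulas divide by the sentence count. A minimal sketch using the standard published Flesch constants follows; `flesch_scores` is a hypothetical helper, and whether Veritext's `Readability` class uses exactly these constants is an assumption, since the hunk does not show the formula.

```python
def flesch_scores(words: int, sentences: int, syllables: int) -> tuple[float, float]:
    """Return (flesch_reading_ease, flesch_kincaid_grade) from raw counts."""
    if words == 0:
        # Nothing to score; assumed to mirror the early return for empty text.
        return 0.0, 0.0

    # Text without sentence-ending punctuation counts as one sentence,
    # which avoids dividing by zero below.
    sentences = max(sentences, 1)

    words_per_sentence = words / sentences
    syllables_per_word = syllables / words

    reading_ease = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    grade = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return reading_ease, grade
```

With the clamp in place, a fragment such as "no full stop here" is scored as a single sentence instead of raising `ZeroDivisionError`.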
@@ -40,6 +40,11 @@ class LexicalResult(BaseModel):
    token_overlap: float
    """Proportion of candidate tokens found in reference."""

+    @property
+    def score(self) -> float:
+        """Return Jaccard similarity as the primary score."""
+        return self.jaccard
+

 class RougeScore(BaseModel):
    """Individual ROUGE variant score with precision, recall, F-measure."""
@@ -107,9 +107,6 @@ def _compute_rouge_l(
    Returns:
        RougeScore with precision, recall, and F-measure.
    """
-    if not candidate_tokens and not reference_tokens:
-        return RougeScore(precision=0.0, recall=0.0, fmeasure=0.0)
-
    if not candidate_tokens or not reference_tokens:
        return RougeScore(precision=0.0, recall=0.0, fmeasure=0.0)

@@ -209,6 +206,10 @@ class Rouge:
            rouge2_scores.append(_compute_rouge_score(candidate_tokens, ref_tokens, 2))
            rouge_l_scores.append(_compute_rouge_l(candidate_tokens, ref_tokens))

+        # All references were empty after tokenisation
+        if not rouge1_scores:
+            raise ValueError("Reference text cannot be empty")
+
        return RougeResult(
            rouge1=_max_rouge_scores(rouge1_scores),
            rouge2=_max_rouge_scores(rouge2_scores),
@@ -1,11 +1,15 @@
 """Embedding-based semantic similarity using sentence-transformers."""

+from collections import OrderedDict
 from typing import Any

 from veritext.core.exceptions import DependencyError
 from veritext.metrics.base import AggregateStats, BatchResult
 from veritext.metrics.results import SemanticResult

+# Default maximum cache size (number of embeddings to store)
+DEFAULT_CACHE_MAX_SIZE = 1000
+

 class SemanticSimilarity:
    """
@@ -21,6 +25,7 @@ class SemanticSimilarity:
        self,
        model: str = "all-MiniLM-L6-v2",
        cache_embeddings: bool = True,
+        cache_max_size: int = DEFAULT_CACHE_MAX_SIZE,
    ) -> None:
        """
        Initialise the semantic similarity metric.
@@ -30,6 +35,8 @@ class SemanticSimilarity:
                   Defaults to "all-MiniLM-L6-v2" (22MB, good quality/size tradeoff).
            cache_embeddings: Whether to cache embeddings for repeated texts.
                              Defaults to True.
+            cache_max_size: Maximum number of embeddings to cache. Least recently
+                            used entries are evicted when the limit is reached.
+                            Defaults to 1000.

        Raises:
            DependencyError: If sentence-transformers is not installed.
@@ -44,7 +51,10 @@ class SemanticSimilarity:

        self._model_name = model
        self._model: Any = SentenceTransformer(model)
-        self._cache: dict[str, Any] | None = {} if cache_embeddings else None
+        self._cache: OrderedDict[str, Any] | None = (
+            OrderedDict() if cache_embeddings else None
+        )
+        self._cache_max_size = cache_max_size

    @property
    def name(self) -> str:
@@ -58,7 +68,7 @@ class SemanticSimilarity:

    def _get_embedding(self, text: str) -> Any:
        """
-        Get embedding for text, using cache if available.
+        Get embedding for text, using LRU cache if available.

        Args:
            text: The text to embed.
@@ -67,11 +77,16 @@ class SemanticSimilarity:
            The embedding tensor.
        """
        if self._cache is not None and text in self._cache:
+            # Move to end to mark as recently used
+            self._cache.move_to_end(text)
            return self._cache[text]

        embedding = self._model.encode(text, convert_to_tensor=True)

        if self._cache is not None:
+            # Evict oldest entries if cache is full
+            while len(self._cache) >= self._cache_max_size:
+                self._cache.popitem(last=False)
            self._cache[text] = embedding

        return embedding
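The eviction pattern in this hunk can be tried in isolation. The sketch below is a self-contained illustration of an `OrderedDict`-based LRU cache, not the `SemanticSimilarity` class itself: `LruCache`, `compute`, and `keys` are hypothetical names, and the `max_size < 1` check is an addition of this sketch that the hunk does not contain.

```python
from collections import OrderedDict
from typing import Callable


class LruCache:
    """Small least-recently-used cache around an expensive function."""

    def __init__(self, compute: Callable[[str], object], max_size: int = 1000) -> None:
        if max_size < 1:
            # Assumption of this sketch: reject sizes that could never hold an entry.
            raise ValueError("max_size must be at least 1")
        self._compute = compute
        self._max_size = max_size
        self._items: OrderedDict[str, object] = OrderedDict()

    def get(self, key: str) -> object:
        if key in self._items:
            # A hit moves the key to the end, marking it most recently used.
            self._items.move_to_end(key)
            return self._items[key]

        value = self._compute(key)
        # Make room first: the front of the dict is the least recently used.
        while len(self._items) >= self._max_size:
            self._items.popitem(last=False)
        self._items[key] = value
        return value

    def keys(self) -> list[str]:
        """Return keys ordered from least to most recently used."""
        return list(self._items)
```

Because hits call `move_to_end`, the entry removed by `popitem(last=False)` is the least recently used one, not simply the oldest insertion.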
@@ -1,11 +1,20 @@
-"""Composite validators for combining multiple checks."""
+"""Composite validators for combining multiple checks.
+
+Note: CompositeCheck classes (AllOf, AnyOf) intentionally return ValidationResult
+rather than CheckResult. This allows callers to inspect individual check results
+for detailed error reporting. They implement a compatible interface but are not
+substitutable where Check is expected as a type constraint.
+"""

 from veritext.core.types import CheckResult, ValidationContext, ValidationResult
 from veritext.validators.base import Check


 class AllOf:
-    """Passes only if all checks pass."""
+    """Passes only if all checks pass.
+
+    Note: Returns ValidationResult (not CheckResult) to expose child results.
+    """

    def __init__(self, checks: list[Check]) -> None:
        """
@@ -20,7 +29,7 @@ class AllOf:
        if not checks:
            raise ValueError("checks list cannot be empty")

-        self._checks = checks
+        self._checks = list(checks)

    @property
    def name(self) -> str:
@@ -48,7 +57,10 @@ class AllOf:


 class AnyOf:
-    """Passes if any check passes."""
+    """Passes if any check passes.
+
+    Note: Returns ValidationResult (not CheckResult) to expose child results.
+    """

    def __init__(self, checks: list[Check]) -> None:
        """
@@ -63,7 +75,7 @@ class AnyOf:
        if not checks:
            raise ValueError("checks list cannot be empty")

-        self._checks = checks
+        self._checks = list(checks)

    @property
    def name(self) -> str:
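The `list(checks)` change guards against aliasing: without the copy, a caller who later mutates the list they passed in silently changes the validator too. A minimal sketch of the difference, using hypothetical `AliasedGroup` and `CopiedGroup` classes and plain strings in place of real checks:

```python
class AliasedGroup:
    """Stores the caller's list object directly."""

    def __init__(self, checks: list[str]) -> None:
        self._checks = checks  # same object as the caller's list


class CopiedGroup:
    """Stores a shallow copy, as the hunks above do."""

    def __init__(self, checks: list[str]) -> None:
        self._checks = list(checks)  # independent list, same elements


checks = ["bleu", "rouge"]
aliased = AliasedGroup(checks)
copied = CopiedGroup(checks)

# The caller reuses and mutates the list after constructing the groups.
checks.append("length")

aliased_count = len(aliased._checks)  # sees the appended item
copied_count = len(copied._checks)  # unaffected by the append
```

The copy is shallow, so the individual check objects are still shared; only the list membership is protected.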
@@ -229,7 +229,7 @@ class ContainsValidator:
            case_sensitive: Whether matching is case-sensitive. Defaults to False.

        Raises:
-            InvalidThresholdError: If patterns list is empty.
+            InvalidThresholdError: If patterns list is empty or contains invalid regex.
        """
        if not patterns:
            raise InvalidThresholdError("patterns list cannot be empty")
@@ -238,6 +238,15 @@ class ContainsValidator:
        self._case_sensitive = case_sensitive
        self._flags = 0 if case_sensitive else re.IGNORECASE

+        self._compiled_patterns: list[re.Pattern[str]] = []
+        for pattern in patterns:
+            try:
+                self._compiled_patterns.append(re.compile(pattern, self._flags))
+            except re.error as e:
+                raise InvalidThresholdError(
+                    f"Invalid regex pattern '{pattern}': {e}"
+                ) from e
+
    @property
    def name(self) -> str:
        """Return the name of this check."""
@@ -255,8 +264,10 @@ class ContainsValidator:
            CheckResult with pass/fail status.
        """
        missing = []
-        for pattern in self._patterns:
-            if not re.search(pattern, text, self._flags):
+        for pattern, compiled in zip(
+            self._patterns, self._compiled_patterns, strict=True
+        ):
+            if not compiled.search(text):
                missing.append(pattern)

        passed = len(missing) == 0
@@ -291,7 +302,7 @@ class ExcludesValidator:
            case_sensitive: Whether matching is case-sensitive. Defaults to False.

        Raises:
-            InvalidThresholdError: If patterns list is empty.
+            InvalidThresholdError: If patterns list is empty or contains invalid regex.
        """
        if not patterns:
            raise InvalidThresholdError("patterns list cannot be empty")
@@ -300,6 +311,15 @@ class ExcludesValidator:
        self._case_sensitive = case_sensitive
        self._flags = 0 if case_sensitive else re.IGNORECASE

+        self._compiled_patterns: list[re.Pattern[str]] = []
+        for pattern in patterns:
+            try:
+                self._compiled_patterns.append(re.compile(pattern, self._flags))
+            except re.error as e:
+                raise InvalidThresholdError(
+                    f"Invalid regex pattern '{pattern}': {e}"
+                ) from e
+
    @property
    def name(self) -> str:
        """Return the name of this check."""
@@ -317,8 +337,10 @@ class ExcludesValidator:
            CheckResult with pass/fail status.
        """
        found = []
-        for pattern in self._patterns:
-            if re.search(pattern, text, self._flags):
+        for pattern, compiled in zip(
+            self._patterns, self._compiled_patterns, strict=True
+        ):
+            if compiled.search(text):
                found.append(pattern)

        passed = len(found) == 0
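The fail-fast idea in these hunks, compiling every pattern once at construction time and matching with the compiled objects afterwards, can be sketched without the validator classes. `compile_patterns` and `missing_patterns` are hypothetical helpers, and `ValueError` stands in for `InvalidThresholdError` so the example stays self-contained.

```python
import re


def compile_patterns(
    patterns: list[str], case_sensitive: bool = False
) -> list[re.Pattern[str]]:
    """Compile all patterns up front, raising on the first invalid one."""
    if not patterns:
        raise ValueError("patterns list cannot be empty")

    flags = 0 if case_sensitive else re.IGNORECASE
    compiled: list[re.Pattern[str]] = []
    for pattern in patterns:
        try:
            compiled.append(re.compile(pattern, flags))
        except re.error as exc:
            # Surface the bad pattern immediately, not on first use.
            raise ValueError(f"Invalid regex pattern {pattern!r}: {exc}") from exc
    return compiled


def missing_patterns(
    text: str, patterns: list[str], compiled: list[re.Pattern[str]]
) -> list[str]:
    """Return the original pattern strings that do not match the text."""
    return [p for p, c in zip(patterns, compiled) if not c.search(text)]
```

Keeping the original strings alongside the compiled objects lets error messages report the pattern as the user wrote it.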
@@ -0,0 +1,73 @@
+"""Tests for configuration module."""
+
+from pathlib import Path
+
+import pytest
+
+from veritext.core.config import VeritextSettings, get_settings
+
+
+class TestVeritextSettings:
+    """Tests for VeritextSettings."""
+
+    def test_default_log_level(self) -> None:
+        """Test default log level is INFO."""
+        settings = VeritextSettings()
+        assert settings.log_level == "INFO"
+
+    def test_default_log_format(self) -> None:
+        """Test default log format is console."""
+        settings = VeritextSettings()
+        assert settings.log_format == "console"
+
+    def test_default_benchmark_path(self) -> None:
+        """Test default benchmark storage path."""
+        settings = VeritextSettings()
+        assert settings.benchmark_storage_path == Path("benchmarks")
+
+    def test_default_tokeniser_lowercase(self) -> None:
+        """Test default tokeniser lowercase setting."""
+        settings = VeritextSettings()
+        assert settings.tokeniser_lowercase is True
+
+    def test_default_tokeniser_remove_punctuation(self) -> None:
+        """Test default tokeniser remove punctuation setting."""
+        settings = VeritextSettings()
+        assert settings.tokeniser_remove_punctuation is True
+
+    def test_default_semantic_model(self) -> None:
+        """Test default semantic model name."""
+        settings = VeritextSettings()
+        assert settings.semantic_model == "all-MiniLM-L6-v2"
+
+    def test_default_semantic_cache_enabled(self) -> None:
+        """Test semantic cache is enabled by default."""
+        settings = VeritextSettings()
+        assert settings.semantic_cache_embeddings is True
+
+    def test_env_var_override(self, monkeypatch: pytest.MonkeyPatch) -> None:
+        """Test environment variable overrides default settings."""
+        monkeypatch.setenv("VERITEXT_LOG_LEVEL", "DEBUG")
+        settings = VeritextSettings()
+        assert settings.log_level == "DEBUG"
+
+    def test_env_var_override_log_format(self, monkeypatch: pytest.MonkeyPatch) -> None:
+        """Test environment variable overrides log format."""
+        monkeypatch.setenv("VERITEXT_LOG_FORMAT", "json")
+        settings = VeritextSettings()
+        assert settings.log_format == "json"
+
+
+class TestGetSettings:
+    """Tests for get_settings function."""
+
+    def test_get_settings_returns_instance(self) -> None:
+        """Test get_settings returns a VeritextSettings instance."""
+        settings = get_settings()
+        assert isinstance(settings, VeritextSettings)
+
+    def test_get_settings_returns_valid_defaults(self) -> None:
+        """Test get_settings returns instance with valid defaults."""
+        settings = get_settings()
+        assert settings.log_level in ("DEBUG", "INFO", "WARNING", "ERROR")
+        assert settings.log_format in ("console", "json")
@@ -0,0 +1,56 @@
+"""Tests for logging module."""
+
+from veritext.core.logging import configure_logging, get_logger
+
+
+class TestGetLogger:
+    """Tests for get_logger function."""
+
+    def test_get_logger_returns_logger(self) -> None:
+        """Test get_logger returns a logger instance."""
+        logger = get_logger()
+        assert logger is not None
+
+    def test_get_logger_default_name(self) -> None:
+        """Test get_logger without a name returns a logger with the standard methods."""
+        logger = get_logger()
+        # The logger should be a bound logger from structlog
+        assert hasattr(logger, "info")
+        assert hasattr(logger, "debug")
+        assert hasattr(logger, "warning")
+        assert hasattr(logger, "error")
+
+    def test_get_logger_custom_name(self) -> None:
+        """Test get_logger respects custom name parameter."""
+        logger = get_logger("custom.module")
+        assert logger is not None
+        assert hasattr(logger, "info")
+
+
+class TestConfigureLogging:
+    """Tests for configure_logging function."""
+
+    def test_configure_logging_console_format(self) -> None:
+        """Test configure_logging with console format does not raise."""
+        configure_logging(level="INFO", log_format="console")
+        logger = get_logger()
+        assert logger is not None
+
+    def test_configure_logging_json_format(self) -> None:
+        """Test configure_logging with json format does not raise."""
+        configure_logging(level="DEBUG", log_format="json")
+        logger = get_logger()
+        assert logger is not None
+
+    def test_configure_logging_uses_defaults(self) -> None:
+        """Test configure_logging uses settings defaults when not provided."""
+        configure_logging()
+        logger = get_logger()
+        assert logger is not None
+
+    def test_configure_logging_different_levels(self) -> None:
+        """Test configure_logging accepts different log levels."""
+        for level in ("DEBUG", "INFO", "WARNING", "ERROR"):
+            configure_logging(level=level)
+            logger = get_logger()
+            assert logger is not None
@@ -5,12 +5,11 @@ import pytest

 @pytest.fixture
 def plugin_pytester(pytester: pytest.Pytester) -> pytest.Pytester:
-    """Configure pytester to use the veritext plugin."""
-    pytester.makeconftest(
+    """Configure pytester to use the veritext plugin.
+
+    Note: The plugin is already loaded via the entry point in pyproject.toml,
+    so no explicit pytest_plugins declaration is needed.
    """
-        pytest_plugins = ['veritext.pytest_plugin']
-        """
-    )
    return pytester


@@ -263,6 +263,11 @@ class TestContainsValidator:
        with pytest.raises(InvalidThresholdError, match="cannot be empty"):
            ContainsValidator(patterns=[])

+    def test_contains_validator_raises_on_invalid_regex(self) -> None:
+        """Test that invalid regex pattern raises error at init time."""
+        with pytest.raises(InvalidThresholdError, match="Invalid regex"):
+            ContainsValidator(patterns=[r"[invalid"])
+
    def test_contains_factory_function(self) -> None:
        """Test the contains() factory function."""
        validator = contains(patterns=["test"], case_sensitive=True)
@@ -327,6 +332,11 @@ class TestExcludesValidator:
        with pytest.raises(InvalidThresholdError, match="cannot be empty"):
            ExcludesValidator(patterns=[])

+    def test_excludes_validator_raises_on_invalid_regex(self) -> None:
+        """Test that invalid regex pattern raises error at init time."""
+        with pytest.raises(InvalidThresholdError, match="Invalid regex"):
+            ExcludesValidator(patterns=[r"[invalid"])
+
    def test_excludes_factory_function(self) -> None:
        """Test the excludes() factory function."""
        validator = excludes(patterns=["test"], case_sensitive=True)
Author	SHA1	Message	Date
kschappell	0699e97e1d	refactor: CLI cleanup and documentation updates - Refactor CLI metric computation to eliminate code duplication - Update version format to PEP 440 compliance (0.1.0.dev0) - Cache Settings instance via @lru_cache for performance - Document composite validators' protocol deviation - Consolidate redundant empty checks in ROUGE-L computation - Add Phase 10 (Portfolio Demos) to implementation plan	2026-02-04 15:38:46 +00:00
kschappell	7de4505e31	fix(pytest-plugin): remove duplicate plugin registration in tests The pytest plugin is already loaded via the entry point, so explicitly declaring it in conftest causes a duplicate registration error.	2026-02-04 00:43:20 +00:00
kschappell	564d663c78	docs(changelog): update for QA fixes	2026-02-04 00:23:06 +00:00
kschappell	0b2bc6c688	test(core): add coverage for config and logging modules Adds tests for VeritextSettings defaults, env var overrides, and the get_logger/configure_logging functions.	2026-02-04 00:22:57 +00:00
kschappell	aa687f43cd	fix(validators): validate regex patterns at init time ContainsValidator and ExcludesValidator now pre-compile regex patterns during initialisation and raise InvalidThresholdError if invalid.	2026-02-04 00:22:47 +00:00
kschappell	f18427e123	fix: QA review fixes for 0.1.0 release - Fix README readability example property names - Add validation for empty references after tokenisation in ROUGE - Guard against zero sentence count in readability metric - Implement LRU cache with max size for semantic embeddings - Add .score property to LexicalResult for API consistency - Use defensive list copy in composite validators	2026-02-03 21:31:48 +00:00
kschappell	1754556c99	docs(changelog): release 0.1.0 Initial release with metrics, validators, pytest plugin, benchmark module, CLI, and comprehensive documentation.	2026-02-03 19:16:37 +00:00
kschappell	13c869f5d6	docs(readme): comprehensive documentation Expands readme with detailed coverage of metrics, validators, pytest plugin, benchmark module, CLI commands, and development setup.	2026-02-03 19:16:14 +00:00
kschappell	93515707cc	docs(examples): add benchmark regression example Demonstrates benchmark quality tracking with historical comparison and CI integration using assert_no_regression() for exit code control.	2026-02-03 19:15:12 +00:00
kschappell	3cde5aba77	docs(examples): add chatbot testing example Demonstrates pytest integration for chatbot QA with validate_text() assertions, fixtures, and parametrised content safety tests.	2026-02-03 19:14:25 +00:00
kschappell	69966d171c	docs(examples): add basic validation example Demonstrates core Veritext functionality: metrics, validators, composites, and constraint validators with runnable code.	2026-02-03 19:13:47 +00:00