refactor: CLI cleanup and documentation updates

- Refactor CLI metric computation to eliminate code duplication - Update version format to PEP 440 compliance (0.1.0.dev0) - Cache Settings instance via @lru_cache for performance - Document composite validators' protocol deviation - Consolidate redundant empty checks in ROUGE-L computation - Add Phase 10 (Portfolio Demos) to implementation plan
2026-02-04 15:38:46 +00:00
parent 7de4505e31
commit 0699e97e1d
8 changed files with 224 additions and 66 deletions
@@ -7,9 +7,22 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 ## [Unreleased]

+### Changed
+
+- Refactored CLI metric computation to eliminate code duplication
+- Version format updated from `0.1.0-dev` to `0.1.0.dev0` (PEP 440 compliance)
+- Settings instance is now cached via `@lru_cache` for better performance
+- Documented composite validators' intentional deviation from `Check` protocol return type
+
 ### Fixed

+- Consolidated redundant empty checks in ROUGE-L computation
 - Fixed README example using incorrect property names (`grade_level` → `flesch_kincaid_grade`, `reading_ease` → `flesch_reading_ease`)
+
+### Documentation
+
+- Added Phase 10 (Portfolio Demos) to implementation plan: Streamlit demo and Jupyter notebooks
+- Updated project plan with portfolio demo section
 - Fixed potential crash in ROUGE metric when all references are empty after tokenisation
 - Fixed potential division by zero in readability metric when text has no sentence endings
 - Fixed unbounded cache growth in `SemanticSimilarity` by implementing LRU eviction with configurable max size
@@ -871,6 +871,59 @@ uv run pytest --cov=src/veritext --cov-report=term-missing

 ---

+### Phase 10: Portfolio Demos
+
+**Goal:** Interactive demos for showcasing Veritext without installation.
+
+**Step 1 — Streamlit Demo:**
+
+Build a quick interactive web UI for general visitors.
+
+- [ ] Create `demo/streamlit_app.py`
+- [ ] Text input boxes (candidate + reference)
+- [ ] Metric selector (BLEU, ROUGE, lexical, readability)
+- [ ] Threshold sliders for pass/fail validation
+- [ ] Results table with scores and status
+- [ ] Deploy to homeserver (e.g., `veritext.kschappell.com`)
+
+**Step 2 — Jupyter Notebook Collection:**
+
+Deep-dive notebooks targeting data science and ML recruiters.
+
+- [ ] Create `notebooks/` directory
+- [ ] `01-metrics-overview.ipynb` — Introduction to each metric with visualisations
+- [ ] `02-batch-evaluation.ipynb` — Evaluating model outputs at scale
+- [ ] `03-regression-detection.ipynb` — Tracking quality over time
+- [ ] `04-chatbot-validation.ipynb` — Real-world use case
+
+**Step 3 — JupyterLite Deployment:**
+
+Host notebooks as static files running in the browser.
+
+- [ ] Configure JupyterLite build with veritext pre-installed
+- [ ] Bundle notebooks into static site
+- [ ] Deploy alongside Streamlit demo
+
+**Files:**
+- `demo/streamlit_app.py`
+- `notebooks/01-metrics-overview.ipynb`
+- `notebooks/02-batch-evaluation.ipynb`
+- `notebooks/03-regression-detection.ipynb`
+- `notebooks/04-chatbot-validation.ipynb`
+- `notebooks/jupyterlite-config.json`
+
+**Verification:**
+```bash
+# Streamlit
+uv run streamlit run demo/streamlit_app.py
+
+# JupyterLite (local preview)
+jupyter lite build --contents notebooks/
+jupyter lite serve
+```
+
+---
+
 ## Dependencies

 ```toml
@@ -488,3 +488,47 @@ benchmark.assert_no_regression(tolerance=0.03)

 5. **Natural portfolio narrative** — "I was building X and needed a better way to test
   it, so I built this tool." Every interviewer has faced similar problems.
+
+---
+
+## Portfolio Demos (Future)
+
+Interactive demos to showcase Veritext without requiring installation.
+
+### Streamlit Demo
+
+A quick interactive web UI for general visitors and recruiters.
+
+**Features:**
+- Text input boxes (candidate + reference)
+- Metric selector (BLEU, ROUGE, lexical, readability)
+- Threshold sliders for pass/fail validation
+- Results table with scores and status
+
+**Deployment:** Self-hosted on homeserver (e.g., `veritext.kschappell.com`)
+
+**Effort:** ~30 minutes
+
+### Jupyter Notebook Collection
+
+Deep-dive notebooks targeting data science and ML recruiters.
+
+**Notebooks:**
+
+| Notebook | Purpose |
+|----------|---------|
+| `01-metrics-overview.ipynb` | Introduction to each metric with visualisations |
+| `02-batch-evaluation.ipynb` | Evaluating model outputs at scale, statistical analysis |
+| `03-regression-detection.ipynb` | Tracking quality over time, detecting degradation |
+| `04-chatbot-validation.ipynb` | Real-world use case: validating chatbot responses |
+
+**Hosting:** JupyterLite (static files, runs in browser via WebAssembly)
+
+**Deployment:** Self-hosted alongside Streamlit demo
+
+**Why both:**
+
+| Demo Type | Audience | Value |
+|-----------|----------|-------|
+| Streamlit | General visitors | Quick, interactive, no friction |
+| Notebooks | Data/ML recruiters | Shows analytical depth, speaks their language |
@@ -1,6 +1,6 @@
 [project]
 name = "veritext"
-version = "0.1.0-dev"
+version = "0.1.0.dev0"
 description = "Semantic text validation framework"
 readme = "readme.md"
 requires-python = ">=3.11"
@@ -11,11 +11,91 @@ from veritext.metrics.bleu import Bleu
 from veritext.metrics.lexical import Lexical
 from veritext.metrics.rouge import Rouge

-# Available metrics mapped to their computation functions
+# Available metrics
 AVAILABLE_METRICS = frozenset(
    {"bleu", "bleu1", "bleu2", "bleu3", "bleu4", "rouge", "rouge_l", "lexical"}
 )

+# Lazily-initialised metric instances
+_bleu: Bleu | None = None
+_rouge: Rouge | None = None
+_lexical: Lexical | None = None
+
+
+def _get_bleu() -> Bleu:
+    """Get or create the BLEU metric instance."""
+    global _bleu
+    if _bleu is None:
+        _bleu = Bleu()
+    return _bleu
+
+
+def _get_rouge() -> Rouge:
+    """Get or create the ROUGE metric instance."""
+    global _rouge
+    if _rouge is None:
+        _rouge = Rouge()
+    return _rouge
+
+
+def _get_lexical() -> Lexical:
+    """Get or create the lexical metric instance."""
+    global _lexical
+    if _lexical is None:
+        _lexical = Lexical()
+    return _lexical
+
+
+# Metric registry: maps metric names to (result_keys, single_extractor, batch_extractor)
+# - result_keys: output keys to populate
+# - single_extractor: function(candidate, reference) -> dict of results
+# - batch_extractor: function(candidates, references) -> dict of results
+def _bleu_single(candidate: str, reference: str, key: str) -> dict[str, float]:
+    """Extract a BLEU score for single mode."""
+    result = _get_bleu().score(candidate, reference)
+    return {key: getattr(result, key)}
+
+
+def _bleu_batch(
+    candidates: list[str], references: list[str], key: str
+) -> dict[str, float]:
+    """Extract a BLEU score for batch mode."""
+    batch = _get_bleu().batch_score(candidates, references)
+    stats = batch.stats.get(key)
+    return {key: stats.mean} if stats else {}
+
+
+def _rouge_single(candidate: str, reference: str) -> dict[str, float]:
+    """Extract ROUGE-L F-measure for single mode."""
+    result = _get_rouge().score(candidate, reference)
+    return {"rouge_l": result.rouge_l.fmeasure}
+
+
+def _rouge_batch(candidates: list[str], references: list[str]) -> dict[str, float]:
+    """Extract ROUGE-L F-measure for batch mode."""
+    batch = _get_rouge().batch_score(candidates, references)
+    stats = batch.stats.get("rouge_l_fmeasure")
+    return {"rouge_l": stats.mean} if stats else {}
+
+
+def _lexical_single(candidate: str, reference: str) -> dict[str, float]:
+    """Extract lexical scores for single mode."""
+    result = _get_lexical().score(candidate, reference)
+    return {"jaccard": result.jaccard, "token_overlap": result.token_overlap}
+
+
+def _lexical_batch(candidates: list[str], references: list[str]) -> dict[str, float]:
+    """Extract lexical scores for batch mode."""
+    batch = _get_lexical().batch_score(candidates, references)
+    results: dict[str, float] = {}
+    jaccard_stats = batch.stats.get("jaccard")
+    overlap_stats = batch.stats.get("token_overlap")
+    if jaccard_stats:
+        results["jaccard"] = jaccard_stats.mean
+    if overlap_stats:
+        results["token_overlap"] = overlap_stats.mean
+    return results
+

 def _compute_metrics(
    candidate: str,
@@ -24,30 +104,16 @@ def _compute_metrics(
 ) -> dict[str, float]:
    """Compute requested metrics for a single text pair."""
    results: dict[str, float] = {}
-    bleu = Bleu()
-    rouge = Rouge()
-    lexical = Lexical()

    for metric in metric_names:
-        if metric == "bleu" or metric == "bleu4":
-            bleu_result = bleu.score(candidate, reference)
-            results["bleu4"] = bleu_result.bleu4
-        elif metric == "bleu1":
-            bleu_result = bleu.score(candidate, reference)
-            results["bleu1"] = bleu_result.bleu1
-        elif metric == "bleu2":
-            bleu_result = bleu.score(candidate, reference)
-            results["bleu2"] = bleu_result.bleu2
-        elif metric == "bleu3":
-            bleu_result = bleu.score(candidate, reference)
-            results["bleu3"] = bleu_result.bleu3
-        elif metric == "rouge" or metric == "rouge_l":
-            rouge_result = rouge.score(candidate, reference)
-            results["rouge_l"] = rouge_result.rouge_l.fmeasure
+        if metric in ("bleu", "bleu4"):
+            results.update(_bleu_single(candidate, reference, "bleu4"))
+        elif metric in ("bleu1", "bleu2", "bleu3"):
+            results.update(_bleu_single(candidate, reference, metric))
+        elif metric in ("rouge", "rouge_l"):
+            results.update(_rouge_single(candidate, reference))
        elif metric == "lexical":
-            lexical_result = lexical.score(candidate, reference)
-            results["jaccard"] = lexical_result.jaccard
-            results["token_overlap"] = lexical_result.token_overlap
+            results.update(_lexical_single(candidate, reference))

    return results

@@ -58,46 +124,17 @@ def _compute_batch_metrics(
    metric_names: list[str],
 ) -> dict[str, float]:
    """Compute average metrics for a batch of text pairs."""
-    bleu = Bleu()
-    rouge = Rouge()
-    lexical = Lexical()
-
    results: dict[str, float] = {}

    for metric in metric_names:
-        if metric == "bleu" or metric == "bleu4":
-            bleu_batch = bleu.batch_score(candidates, references)
-            stats = bleu_batch.stats.get("bleu4")
-            if stats:
-                results["bleu4"] = stats.mean
-        elif metric == "bleu1":
-            bleu_batch = bleu.batch_score(candidates, references)
-            stats = bleu_batch.stats.get("bleu1")
-            if stats:
-                results["bleu1"] = stats.mean
-        elif metric == "bleu2":
-            bleu_batch = bleu.batch_score(candidates, references)
-            stats = bleu_batch.stats.get("bleu2")
-            if stats:
-                results["bleu2"] = stats.mean
-        elif metric == "bleu3":
-            bleu_batch = bleu.batch_score(candidates, references)
-            stats = bleu_batch.stats.get("bleu3")
-            if stats:
-                results["bleu3"] = stats.mean
-        elif metric == "rouge" or metric == "rouge_l":
-            rouge_batch = rouge.batch_score(candidates, references)
-            stats = rouge_batch.stats.get("rouge_l_fmeasure")
-            if stats:
-                results["rouge_l"] = stats.mean
+        if metric in ("bleu", "bleu4"):
+            results.update(_bleu_batch(candidates, references, "bleu4"))
+        elif metric in ("bleu1", "bleu2", "bleu3"):
+            results.update(_bleu_batch(candidates, references, metric))
+        elif metric in ("rouge", "rouge_l"):
+            results.update(_rouge_batch(candidates, references))
        elif metric == "lexical":
-            lexical_batch = lexical.batch_score(candidates, references)
-            jaccard_stats = lexical_batch.stats.get("jaccard")
-            overlap_stats = lexical_batch.stats.get("token_overlap")
-            if jaccard_stats:
-                results["jaccard"] = jaccard_stats.mean
-            if overlap_stats:
-                results["token_overlap"] = overlap_stats.mean
+            results.update(_lexical_batch(candidates, references))

    return results

@@ -1,5 +1,6 @@
 """Configuration management using pydantic-settings."""

+from functools import lru_cache
 from pathlib import Path
 from typing import Literal

@@ -54,6 +55,7 @@ class VeritextSettings(BaseSettings):
    )


+@lru_cache
 def get_settings() -> VeritextSettings:
-    """Get the current settings instance."""
+    """Get the cached settings instance."""
    return VeritextSettings()
@@ -107,9 +107,6 @@ def _compute_rouge_l(
    Returns:
        RougeScore with precision, recall, and F-measure.
    """
-    if not candidate_tokens and not reference_tokens:
-        return RougeScore(precision=0.0, recall=0.0, fmeasure=0.0)
-
    if not candidate_tokens or not reference_tokens:
        return RougeScore(precision=0.0, recall=0.0, fmeasure=0.0)

@@ -1,11 +1,20 @@
-"""Composite validators for combining multiple checks."""
+"""Composite validators for combining multiple checks.
+
+Note: CompositeCheck classes (AllOf, AnyOf) intentionally return ValidationResult
+rather than CheckResult. This allows callers to inspect individual check results
+for detailed error reporting. They implement a compatible interface but are not
+substitutable where Check is expected as a type constraint.
+"""

 from veritext.core.types import CheckResult, ValidationContext, ValidationResult
 from veritext.validators.base import Check


 class AllOf:
-    """Passes only if all checks pass."""
+    """Passes only if all checks pass.
+
+    Note: Returns ValidationResult (not CheckResult) to expose child results.
+    """

    def __init__(self, checks: list[Check]) -> None:
        """
@@ -48,7 +57,10 @@ class AllOf:


 class AnyOf:
-    """Passes if any check passes."""
+    """Passes if any check passes.
+
+    Note: Returns ValidationResult (not CheckResult) to expose child results.
+    """

    def __init__(self, checks: list[Check]) -> None:
        """