refactor: CLI cleanup and documentation updates

- Refactor CLI metric computation to eliminate code duplication
- Update version format to PEP 440 compliance (0.1.0.dev0)
- Cache Settings instance via @lru_cache for performance
- Document composite validators' protocol deviation
- Consolidate redundant empty checks in ROUGE-L computation
- Add Phase 10 (Portfolio Demos) to implementation plan
2026-02-04 15:38:46 +00:00
parent 7de4505e31
commit 0699e97e1d
8 changed files with 224 additions and 66 deletions

@@ -7,9 +7,22 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [Unreleased]

+### Changed
+
+- Refactored CLI metric computation to eliminate code duplication
+- Version format updated from `0.1.0-dev` to `0.1.0.dev0` (PEP 440 compliance)
+- Settings instance is now cached via `@lru_cache` for better performance
+- Documented composite validators' intentional deviation from `Check` protocol return type
+
 ### Fixed

+- Consolidated redundant empty checks in ROUGE-L computation
 - Fixed README example using incorrect property names (`grade_level` → `flesch_kincaid_grade`, `reading_ease` → `flesch_reading_ease`)
+
+### Documentation
+
+- Added Phase 10 (Portfolio Demos) to implementation plan: Streamlit demo and Jupyter notebooks
+- Updated project plan with portfolio demo section
 - Fixed potential crash in ROUGE metric when all references are empty after tokenisation
 - Fixed potential division by zero in readability metric when text has no sentence endings
 - Fixed unbounded cache growth in `SemanticSimilarity` by implementing LRU eviction with configurable max size

@@ -871,6 +871,59 @@ uv run pytest --cov=src/veritext --cov-report=term-missing
 ---

+### Phase 10: Portfolio Demos
+
+**Goal:** Interactive demos for showcasing Veritext without installation.
+
+**Step 1 — Streamlit Demo:**
+
+Build a quick interactive web UI for general visitors.
+
+- [ ] Create `demo/streamlit_app.py`
+- [ ] Text input boxes (candidate + reference)
+- [ ] Metric selector (BLEU, ROUGE, lexical, readability)
+- [ ] Threshold sliders for pass/fail validation
+- [ ] Results table with scores and status
+- [ ] Deploy to homeserver (e.g., `veritext.kschappell.com`)
+
+**Step 2 — Jupyter Notebook Collection:**
+
+Deep-dive notebooks targeting data science and ML recruiters.
+
+- [ ] Create `notebooks/` directory
+- [ ] `01-metrics-overview.ipynb` — Introduction to each metric with visualisations
+- [ ] `02-batch-evaluation.ipynb` — Evaluating model outputs at scale
+- [ ] `03-regression-detection.ipynb` — Tracking quality over time
+- [ ] `04-chatbot-validation.ipynb` — Real-world use case
+
+**Step 3 — JupyterLite Deployment:**
+
+Host notebooks as static files running in the browser.
+
+- [ ] Configure JupyterLite build with veritext pre-installed
+- [ ] Bundle notebooks into static site
+- [ ] Deploy alongside Streamlit demo
+
+**Files:**
+
+- `demo/streamlit_app.py`
+- `notebooks/01-metrics-overview.ipynb`
+- `notebooks/02-batch-evaluation.ipynb`
+- `notebooks/03-regression-detection.ipynb`
+- `notebooks/04-chatbot-validation.ipynb`
+- `notebooks/jupyterlite-config.json`
+
+**Verification:**
+
+```bash
+# Streamlit
+uv run streamlit run demo/streamlit_app.py
+
+# JupyterLite (local preview)
+jupyter lite build --contents notebooks/
+jupyter lite serve
+```
+
+---
+
 ## Dependencies
```toml ```toml

@@ -488,3 +488,47 @@ benchmark.assert_no_regression(tolerance=0.03)
 5. **Natural portfolio narrative** — "I was building X and needed a better way to test
    it, so I built this tool." Every interviewer has faced similar problems.
+
+---
+
+## Portfolio Demos (Future)
+
+Interactive demos to showcase Veritext without requiring installation.
+
+### Streamlit Demo
+
+A quick interactive web UI for general visitors and recruiters.
+
+**Features:**
+
+- Text input boxes (candidate + reference)
+- Metric selector (BLEU, ROUGE, lexical, readability)
+- Threshold sliders for pass/fail validation
+- Results table with scores and status
+
+**Deployment:** Self-hosted on homeserver (e.g., `veritext.kschappell.com`)
+
+**Effort:** ~30 minutes
+
+### Jupyter Notebook Collection
+
+Deep-dive notebooks targeting data science and ML recruiters.
+
+**Notebooks:**
+
+| Notebook | Purpose |
+|----------|---------|
+| `01-metrics-overview.ipynb` | Introduction to each metric with visualisations |
+| `02-batch-evaluation.ipynb` | Evaluating model outputs at scale, statistical analysis |
+| `03-regression-detection.ipynb` | Tracking quality over time, detecting degradation |
+| `04-chatbot-validation.ipynb` | Real-world use case: validating chatbot responses |
+
+**Hosting:** JupyterLite (static files, runs in browser via WebAssembly)
+
+**Deployment:** Self-hosted alongside Streamlit demo
+
+**Why both:**
+
+| Demo Type | Audience | Value |
+|-----------|----------|-------|
+| Streamlit | General visitors | Quick, interactive, no friction |
+| Notebooks | Data/ML recruiters | Shows analytical depth, speaks their language |

@@ -1,6 +1,6 @@
 [project]
 name = "veritext"
-version = "0.1.0-dev"
+version = "0.1.0.dev0"
 description = "Semantic text validation framework"
 readme = "readme.md"
 requires-python = ">=3.11"
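
The old string `0.1.0-dev` is not in PEP 440 canonical form, while `0.1.0.dev0` is. A minimal sketch of that check, using the canonical-form regular expression published in PEP 440's appendix; the helper name `is_canonical` is illustrative and not part of veritext:

```python
import re

# Canonical-form pattern as given in PEP 440, Appendix B.
_CANONICAL = re.compile(
    r"^([1-9][0-9]*!)?(0|[1-9][0-9]*)(\.(0|[1-9][0-9]*))*"
    r"((a|b|rc)(0|[1-9][0-9]*))?(\.post(0|[1-9][0-9]*))?"
    r"(\.dev(0|[1-9][0-9]*))?$"
)


def is_canonical(version: str) -> bool:
    """Return True if the version string is already in canonical PEP 440 form."""
    return _CANONICAL.match(version) is not None


print(is_canonical("0.1.0-dev"))   # False
print(is_canonical("0.1.0.dev0"))  # True
```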

@@ -11,11 +11,91 @@ from veritext.metrics.bleu import Bleu
 from veritext.metrics.lexical import Lexical
 from veritext.metrics.rouge import Rouge

-# Available metrics mapped to their computation functions
+# Available metrics
 AVAILABLE_METRICS = frozenset(
     {"bleu", "bleu1", "bleu2", "bleu3", "bleu4", "rouge", "rouge_l", "lexical"}
 )

+# Lazily-initialised metric instances
+_bleu: Bleu | None = None
+_rouge: Rouge | None = None
+_lexical: Lexical | None = None
+
+
+def _get_bleu() -> Bleu:
+    """Get or create the BLEU metric instance."""
+    global _bleu
+    if _bleu is None:
+        _bleu = Bleu()
+    return _bleu
+
+
+def _get_rouge() -> Rouge:
+    """Get or create the ROUGE metric instance."""
+    global _rouge
+    if _rouge is None:
+        _rouge = Rouge()
+    return _rouge
+
+
+def _get_lexical() -> Lexical:
+    """Get or create the lexical metric instance."""
+    global _lexical
+    if _lexical is None:
+        _lexical = Lexical()
+    return _lexical
+
+
+# Metric registry: maps metric names to (result_keys, single_extractor, batch_extractor)
+# - result_keys: output keys to populate
+# - single_extractor: function(candidate, reference) -> dict of results
+# - batch_extractor: function(candidates, references) -> dict of results
+
+
+def _bleu_single(candidate: str, reference: str, key: str) -> dict[str, float]:
+    """Extract a BLEU score for single mode."""
+    result = _get_bleu().score(candidate, reference)
+    return {key: getattr(result, key)}
+
+
+def _bleu_batch(
+    candidates: list[str], references: list[str], key: str
+) -> dict[str, float]:
+    """Extract a BLEU score for batch mode."""
+    batch = _get_bleu().batch_score(candidates, references)
+    stats = batch.stats.get(key)
+    return {key: stats.mean} if stats else {}
+
+
+def _rouge_single(candidate: str, reference: str) -> dict[str, float]:
+    """Extract ROUGE-L F-measure for single mode."""
+    result = _get_rouge().score(candidate, reference)
+    return {"rouge_l": result.rouge_l.fmeasure}
+
+
+def _rouge_batch(candidates: list[str], references: list[str]) -> dict[str, float]:
+    """Extract ROUGE-L F-measure for batch mode."""
+    batch = _get_rouge().batch_score(candidates, references)
+    stats = batch.stats.get("rouge_l_fmeasure")
+    return {"rouge_l": stats.mean} if stats else {}
+
+
+def _lexical_single(candidate: str, reference: str) -> dict[str, float]:
+    """Extract lexical scores for single mode."""
+    result = _get_lexical().score(candidate, reference)
+    return {"jaccard": result.jaccard, "token_overlap": result.token_overlap}
+
+
+def _lexical_batch(candidates: list[str], references: list[str]) -> dict[str, float]:
+    """Extract lexical scores for batch mode."""
+    batch = _get_lexical().batch_score(candidates, references)
+    results: dict[str, float] = {}
+    jaccard_stats = batch.stats.get("jaccard")
+    overlap_stats = batch.stats.get("token_overlap")
+    if jaccard_stats:
+        results["jaccard"] = jaccard_stats.mean
+    if overlap_stats:
+        results["token_overlap"] = overlap_stats.mean
+    return results
+
+
 def _compute_metrics(
     candidate: str,
@@ -24,30 +104,16 @@ def _compute_metrics(
 ) -> dict[str, float]:
     """Compute requested metrics for a single text pair."""
     results: dict[str, float] = {}
-    bleu = Bleu()
-    rouge = Rouge()
-    lexical = Lexical()

     for metric in metric_names:
-        if metric == "bleu" or metric == "bleu4":
-            bleu_result = bleu.score(candidate, reference)
-            results["bleu4"] = bleu_result.bleu4
-        elif metric == "bleu1":
-            bleu_result = bleu.score(candidate, reference)
-            results["bleu1"] = bleu_result.bleu1
-        elif metric == "bleu2":
-            bleu_result = bleu.score(candidate, reference)
-            results["bleu2"] = bleu_result.bleu2
-        elif metric == "bleu3":
-            bleu_result = bleu.score(candidate, reference)
-            results["bleu3"] = bleu_result.bleu3
-        elif metric == "rouge" or metric == "rouge_l":
-            rouge_result = rouge.score(candidate, reference)
-            results["rouge_l"] = rouge_result.rouge_l.fmeasure
+        if metric in ("bleu", "bleu4"):
+            results.update(_bleu_single(candidate, reference, "bleu4"))
+        elif metric in ("bleu1", "bleu2", "bleu3"):
+            results.update(_bleu_single(candidate, reference, metric))
+        elif metric in ("rouge", "rouge_l"):
+            results.update(_rouge_single(candidate, reference))
         elif metric == "lexical":
-            lexical_result = lexical.score(candidate, reference)
-            results["jaccard"] = lexical_result.jaccard
-            results["token_overlap"] = lexical_result.token_overlap
+            results.update(_lexical_single(candidate, reference))

     return results
@@ -58,46 +124,17 @@ def _compute_batch_metrics(
     metric_names: list[str],
 ) -> dict[str, float]:
     """Compute average metrics for a batch of text pairs."""
-    bleu = Bleu()
-    rouge = Rouge()
-    lexical = Lexical()
     results: dict[str, float] = {}

     for metric in metric_names:
-        if metric == "bleu" or metric == "bleu4":
-            bleu_batch = bleu.batch_score(candidates, references)
-            stats = bleu_batch.stats.get("bleu4")
-            if stats:
-                results["bleu4"] = stats.mean
-        elif metric == "bleu1":
-            bleu_batch = bleu.batch_score(candidates, references)
-            stats = bleu_batch.stats.get("bleu1")
-            if stats:
-                results["bleu1"] = stats.mean
-        elif metric == "bleu2":
-            bleu_batch = bleu.batch_score(candidates, references)
-            stats = bleu_batch.stats.get("bleu2")
-            if stats:
-                results["bleu2"] = stats.mean
-        elif metric == "bleu3":
-            bleu_batch = bleu.batch_score(candidates, references)
-            stats = bleu_batch.stats.get("bleu3")
-            if stats:
-                results["bleu3"] = stats.mean
-        elif metric == "rouge" or metric == "rouge_l":
-            rouge_batch = rouge.batch_score(candidates, references)
-            stats = rouge_batch.stats.get("rouge_l_fmeasure")
-            if stats:
-                results["rouge_l"] = stats.mean
+        if metric in ("bleu", "bleu4"):
+            results.update(_bleu_batch(candidates, references, "bleu4"))
+        elif metric in ("bleu1", "bleu2", "bleu3"):
+            results.update(_bleu_batch(candidates, references, metric))
+        elif metric in ("rouge", "rouge_l"):
+            results.update(_rouge_batch(candidates, references))
         elif metric == "lexical":
-            lexical_batch = lexical.batch_score(candidates, references)
-            jaccard_stats = lexical_batch.stats.get("jaccard")
-            overlap_stats = lexical_batch.stats.get("token_overlap")
-            if jaccard_stats:
-                results["jaccard"] = jaccard_stats.mean
-            if overlap_stats:
-                results["token_overlap"] = overlap_stats.mean
+            results.update(_lexical_batch(candidates, references))

     return results
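
The refactored code still dispatches through an `if`/`elif` chain, although one of its comments describes a metric registry. A minimal sketch of what a table-driven variant could look like, assuming stand-in scorers instead of veritext's `Bleu`, `Rouge`, and `Lexical` classes; every name below (`REGISTRY`, `compute_metrics`, the two scorers) is hypothetical:

```python
from collections.abc import Callable

# An extractor takes (candidate, reference) and returns named scores.
Extractor = Callable[[str, str], dict[str, float]]


def _jaccard(candidate: str, reference: str) -> dict[str, float]:
    """Stand-in scorer: Jaccard similarity over whitespace tokens."""
    a, b = set(candidate.split()), set(reference.split())
    union = a | b
    return {"jaccard": len(a & b) / len(union) if union else 0.0}


def _length_ratio(candidate: str, reference: str) -> dict[str, float]:
    """Stand-in scorer: candidate length relative to reference length."""
    ref_len = len(reference.split())
    return {"length_ratio": len(candidate.split()) / ref_len if ref_len else 0.0}


# Aliases map to the same extractor, much as "bleu" and "bleu4" share a branch.
REGISTRY: dict[str, Extractor] = {
    "jaccard": _jaccard,
    "lexical": _jaccard,
    "length_ratio": _length_ratio,
}


def compute_metrics(
    candidate: str, reference: str, metric_names: list[str]
) -> dict[str, float]:
    """Look each metric up in the registry; unknown names are skipped."""
    results: dict[str, float] = {}
    for name in metric_names:
        extractor = REGISTRY.get(name)
        if extractor is not None:
            results.update(extractor(candidate, reference))
    return results


print(compute_metrics("a b c", "b c d", ["lexical", "length_ratio"]))
# {'jaccard': 0.5, 'length_ratio': 1.0}
```

With this shape, adding a metric means adding one dictionary entry rather than another branch in both the single and batch functions.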

@@ -1,5 +1,6 @@
 """Configuration management using pydantic-settings."""

+from functools import lru_cache
 from pathlib import Path
 from typing import Literal

@@ -54,6 +55,7 @@ class VeritextSettings(BaseSettings):
     )


+@lru_cache
 def get_settings() -> VeritextSettings:
-    """Get the current settings instance."""
+    """Get the cached settings instance."""
     return VeritextSettings()
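
One consequence of caching a zero-argument factory is that later environment changes are not picked up until the cache is cleared. A minimal sketch of that behaviour, assuming a plain dataclass as a stand-in for `VeritextSettings`; the `DemoSettings` class, its `log_level` field, and the `DEMO_LOG_LEVEL` variable are hypothetical:

```python
import os
from dataclasses import dataclass, field
from functools import lru_cache


@dataclass(frozen=True)
class DemoSettings:
    """Stand-in for VeritextSettings (hypothetical field and env var)."""

    log_level: str = field(
        default_factory=lambda: os.environ.get("DEMO_LOG_LEVEL", "INFO")
    )


@lru_cache
def get_settings() -> DemoSettings:
    """Build the settings once; later calls return the same object."""
    return DemoSettings()


os.environ["DEMO_LOG_LEVEL"] = "INFO"
first = get_settings()

# Changing the environment afterwards is not seen: the cached object is reused.
os.environ["DEMO_LOG_LEVEL"] = "DEBUG"
second = get_settings()
print(second is first, second.log_level)  # True INFO

# Clearing the cache forces a fresh read, which is useful in tests.
get_settings.cache_clear()
third = get_settings()
print(third.log_level)  # DEBUG
```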

@@ -107,9 +107,6 @@ def _compute_rouge_l(
     Returns:
         RougeScore with precision, recall, and F-measure.
     """
-    if not candidate_tokens and not reference_tokens:
-        return RougeScore(precision=0.0, recall=0.0, fmeasure=0.0)
-
     if not candidate_tokens or not reference_tokens:
         return RougeScore(precision=0.0, recall=0.0, fmeasure=0.0)
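
The removed check was redundant because "both empty" is a special case of "either empty", and both branches returned the same zero score. A small sketch that confirms the two guards agree on every emptiness combination; `old_guard` and `new_guard` are illustrative names that model only the early-return decision:

```python
from itertools import product


def old_guard(candidate_tokens: list[str], reference_tokens: list[str]) -> bool:
    """Return True when the pre-commit code would return the zero score early."""
    if not candidate_tokens and not reference_tokens:
        return True
    if not candidate_tokens or not reference_tokens:
        return True
    return False


def new_guard(candidate_tokens: list[str], reference_tokens: list[str]) -> bool:
    """Return True when the post-commit code would return the zero score early."""
    return not candidate_tokens or not reference_tokens


# Each argument is either empty or non-empty, giving four cases in total.
cases = list(product([[], ["a"]], repeat=2))
print(all(old_guard(c, r) == new_guard(c, r) for c, r in cases))  # True
```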

@@ -1,11 +1,20 @@
-"""Composite validators for combining multiple checks."""
+"""Composite validators for combining multiple checks.
+
+Note: CompositeCheck classes (AllOf, AnyOf) intentionally return ValidationResult
+rather than CheckResult. This allows callers to inspect individual check results
+for detailed error reporting. They implement a compatible interface but are not
+substitutable where Check is expected as a type constraint.
+"""

 from veritext.core.types import CheckResult, ValidationContext, ValidationResult
 from veritext.validators.base import Check


 class AllOf:
-    """Passes only if all checks pass."""
+    """Passes only if all checks pass.
+
+    Note: Returns ValidationResult (not CheckResult) to expose child results.
+    """

     def __init__(self, checks: list[Check]) -> None:
         """
@@ -48,7 +57,10 @@ class AllOf:


 class AnyOf:
-    """Passes if any check passes."""
+    """Passes if any check passes.
+
+    Note: Returns ValidationResult (not CheckResult) to expose child results.
+    """

     def __init__(self, checks: list[Check]) -> None:
         """
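
The docstring notes above describe a deliberate return-type difference. A rough sketch of why that can be useful, with the strong caveat that the diff does not show veritext's actual types: the field names, the `check` method name, and the `MaxLength` and `Contains` checks below are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class CheckResult:
    """Hypothetical shape: the outcome of one check."""

    name: str
    passed: bool


@dataclass
class ValidationResult:
    """Hypothetical shape: an overall outcome plus the child results."""

    passed: bool
    results: list[CheckResult] = field(default_factory=list)


class Check(Protocol):
    def check(self, text: str) -> CheckResult: ...


@dataclass
class MaxLength:
    limit: int

    def check(self, text: str) -> CheckResult:
        return CheckResult(f"max_length<={self.limit}", len(text) <= self.limit)


@dataclass
class Contains:
    needle: str

    def check(self, text: str) -> CheckResult:
        return CheckResult(f"contains:{self.needle}", self.needle in text)


class AllOf:
    """Composite: same method name as Check, but a richer return type."""

    def __init__(self, checks: list[Check]) -> None:
        self.checks = checks

    def check(self, text: str) -> ValidationResult:
        results = [c.check(text) for c in self.checks]
        return ValidationResult(all(r.passed for r in results), results)


outcome = AllOf([MaxLength(5), Contains("hello")]).check("hello world")
failed = [r.name for r in outcome.results if not r.passed]
print(outcome.passed, failed)  # False ['max_length<=5']
```

Because `AllOf.check` returns `ValidationResult`, a static type checker would not accept an `AllOf` where a `Check` is required, which matches the non-substitutability the docstring mentions.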