From 49f1e27cb164bc3c9144145b1a7712002eafbc37 Mon Sep 17 00:00:00 2001 From: Kai Chappell Date: Tue, 3 Feb 2026 15:27:00 +0000 Subject: [PATCH] docs: add project and implementation plans Comprehensive documentation for Veritext semantic text validation framework: - Project plan with architecture, use cases, and success criteria - Implementation plan with 9 phases, interfaces, and verification steps --- docs/implementation-plan.md | 861 ++++++++++++++++++++++++++++++++++++ docs/project-plan.md | 478 ++++++++++++++++++++ 2 files changed, 1339 insertions(+) create mode 100644 docs/implementation-plan.md create mode 100644 docs/project-plan.md diff --git a/docs/implementation-plan.md b/docs/implementation-plan.md new file mode 100644 index 0000000..110a6ad --- /dev/null +++ b/docs/implementation-plan.md @@ -0,0 +1,861 @@ +# Implementation Plan: Veritext + +Semantic text validation framework for Python — validates text outputs against quality criteria. + +## Project Overview + +**Location:** `/home/kai/work/dev/portfolio/veritext/` +**Remote:** `https://gitea.kschappell.com/kschappell/veritext.git` + +**Purpose:** A Python library for validating text outputs against semantic criteria. Designed for developers building systems that produce text (chatbots, content generators, summarisation tools) who need automated quality assurance beyond simple string matching. + +--- + +## Architectural Decisions + +### 1. 
Layered Architecture + +``` +┌─────────────────────────────────────────────────────┐ +│ CLI / pytest_plugin (presentation layer) │ +├─────────────────────────────────────────────────────┤ +│ validators/ (decision logic) │ +│ benchmark/ (tracking & regression) │ +├─────────────────────────────────────────────────────┤ +│ metrics/ (pure computation) │ +├─────────────────────────────────────────────────────┤ +│ core/ (shared types, tokenisation) │ +└─────────────────────────────────────────────────────┘ +``` + +**Dependency rule:** Each layer depends only on layers below it. + +### 2. Metrics vs Validators (Clear Separation) + +| Concept | Responsibility | Output | +|---------|----------------|--------| +| **Metric** | Compute a score | Typed result object (e.g., `BleuResult`) | +| **Validator** | Make pass/fail decision | `ValidationResult` with diagnostics | + +Validators wrap metrics and apply thresholds. + +### 3. Optional Heavy Dependencies + +`sentence-transformers` (~2GB with PyTorch) is optional: + +```toml +[project.optional-dependencies] +semantic = ["sentence-transformers>=2.2"] +``` + +Core library works without ML dependencies. + +### 4. Typed Result Objects + +Each metric returns a specific result type, not just `float`: + +```python +@dataclass(frozen=True) +class BleuResult: + bleu1: float + bleu2: float + bleu3: float + bleu4: float + brevity_penalty: float + +@dataclass(frozen=True) +class RougeScore: + precision: float + recall: float + fmeasure: float + +@dataclass(frozen=True) +class RougeResult: + rouge1: RougeScore + rouge2: RougeScore + rouge_l: RougeScore +``` + +### 5. Shared Tokenisation + +Single tokeniser used by all n-gram metrics: + +```python +class Tokeniser(Protocol): + def tokenise(self, text: str) -> list[str]: ... + +class WordTokeniser: + def __init__(self, lowercase: bool = True, remove_punctuation: bool = True): ... +``` + +### 6. 
Explicit Context Object + +Validation context is explicit, not `**kwargs`: + +```python +@dataclass +class ValidationContext: + reference: str | list[str] | None = None + metadata: dict[str, Any] = field(default_factory=dict) +``` + +--- + +## Directory Structure + +``` +veritext/ +├── src/ +│ └── veritext/ +│ ├── __init__.py # Public API exports +│ ├── py.typed # PEP 561 marker +│ ├── core/ +│ │ ├── __init__.py +│ │ ├── types.py # ValidationContext, CheckResult, BatchResult +│ │ ├── exceptions.py # Exception hierarchy +│ │ ├── tokenisation.py # Shared tokeniser +│ │ ├── config.py # pydantic-settings +│ │ └── logging.py # structlog configuration +│ ├── metrics/ +│ │ ├── __init__.py # Metric exports +│ │ ├── base.py # Metric protocol +│ │ ├── results.py # BleuResult, RougeResult, etc. +│ │ ├── bleu.py # BLEU implementation +│ │ ├── rouge.py # ROUGE implementation +│ │ ├── lexical.py # Jaccard, token overlap +│ │ └── readability.py # Flesch-Kincaid, etc. +│ ├── semantic/ # Optional (requires sentence-transformers) +│ │ ├── __init__.py +│ │ └── similarity.py # Embedding-based similarity +│ ├── validators/ +│ │ ├── __init__.py # Validator exports +│ │ ├── base.py # Check protocol, ValidationResult +│ │ ├── metric.py # Validators wrapping metrics +│ │ ├── constraint.py # Length, content checks +│ │ └── composite.py # Validator composition +│ ├── benchmark/ +│ │ ├── __init__.py +│ │ ├── models.py # BenchmarkRun, RegressionReport +│ │ ├── storage.py # SQLite backend +│ │ ├── runner.py # Benchmark execution +│ │ └── regression.py # Statistical detection +│ ├── pytest_plugin/ +│ │ ├── __init__.py # Plugin entry point +│ │ ├── fixtures.py # Pytest fixtures +│ │ ├── assertions.py # validate_text(), assert_similar() +│ │ └── plugin.py # Pytest hooks +│ └── cli/ +│ ├── __init__.py +│ └── main.py # Typer CLI app +├── tests/ +│ ├── conftest.py +│ ├── test_core/ +│ │ ├── test_tokenisation.py +│ │ └── test_types.py +│ ├── test_metrics/ +│ │ ├── test_bleu.py +│ │ ├── test_rouge.py 
+│ │ ├── test_lexical.py +│ │ └── test_readability.py +│ ├── test_semantic/ +│ │ └── test_similarity.py +│ ├── test_validators/ +│ │ ├── test_metric_validators.py +│ │ ├── test_constraint_validators.py +│ │ └── test_composite.py +│ ├── test_benchmark/ +│ │ ├── test_storage.py +│ │ ├── test_runner.py +│ │ └── test_regression.py +│ ├── test_pytest_plugin/ +│ │ └── test_integration.py +│ └── test_cli/ +│ └── test_commands.py +├── examples/ +│ ├── basic_validation.py +│ ├── chatbot_testing.py +│ └── benchmark_regression.py +├── docs/ +│ ├── project-plan.md +│ └── implementation-plan.md +├── pyproject.toml +├── readme.md +├── changelog.md +└── CLAUDE.md +``` + +--- + +## Exception Hierarchy + +```python +class VeritextError(Exception): + """Base exception for all Veritext errors.""" + +class MetricError(VeritextError): + """Error during metric computation.""" + +class TokenisationError(MetricError): + """Error during text tokenisation.""" + +class EmbeddingError(MetricError): + """Error computing embeddings (semantic similarity).""" + +class ValidationError(VeritextError): + """Error during validation.""" + +class InvalidThresholdError(ValidationError): + """Invalid threshold value provided.""" + +class BenchmarkError(VeritextError): + """Error during benchmarking.""" + +class StorageError(BenchmarkError): + """Error reading/writing benchmark storage.""" + +class RegressionDetectedError(BenchmarkError): + """Quality regression detected (used in CI).""" + +class ConfigurationError(VeritextError): + """Invalid configuration.""" + +class DependencyError(VeritextError): + """Optional dependency not installed.""" +``` + +--- + +## Core Interfaces + +### Metric Protocol + +```python +from typing import Protocol, TypeVar, Generic + +T = TypeVar("T") + +class Metric(Protocol[T]): + """Protocol for text comparison metrics.""" + + @property + def name(self) -> str: ... + + def score(self, candidate: str, reference: str | list[str]) -> T: ... 
+ + def batch_score( + self, + candidates: list[str], + references: list[str] | list[list[str]] + ) -> BatchResult[T]: ... + +@dataclass +class AggregateStats: + mean: float + std: float + min: float + max: float + percentiles: dict[int, float] # {25: 0.65, 50: 0.72, 75: 0.81, 95: 0.89} + +@dataclass +class BatchResult(Generic[T]): + results: list[T] # Individual results per sample + count: int + stats: dict[str, AggregateStats] # Aggregated stats for numeric fields +``` + +**Note:** Readability metrics (Flesch-Kincaid) accept but ignore the `reference` parameter since they only analyse the candidate text. + +### Validator Protocol + +```python +class Check(Protocol): + """Protocol for individual validation checks.""" + + @property + def name(self) -> str: ... + + def check(self, text: str, context: ValidationContext) -> CheckResult: ... + +@dataclass +class CheckResult: + name: str + passed: bool + actual: Any + threshold: Any | None + message: str + +@dataclass +class ValidationResult: + passed: bool + checks: list[CheckResult] + + @property + def failure_summary(self) -> str: ... + + @property + def failed_checks(self) -> list[CheckResult]: ... +``` + +### Benchmark Models + +```python +@dataclass +class BenchmarkRun: + id: str # UUID + benchmark_name: str + timestamp: datetime + veritext_version: str # Track library version + metrics: dict[str, float] # {"rouge_l": 0.82, "bleu4": 0.71} + sample_count: int + metadata: dict[str, Any] # {"git_sha": "abc123", "model": "v2"} + +@dataclass +class RegressionReport: + detected: bool + baseline: dict[str, float] + current: dict[str, float] + deltas: dict[str, float] # {"rouge_l": -0.05} + tolerance: float + + @property + def summary(self) -> str: ... 
+``` + +--- + +## Validator Naming Convention + +Consistent short names: + +```python +from veritext import validators as v + +# Metric-based validators +v.bleu(min_score=0.7) # BLEU-4 by default +v.bleu(min_score=0.7, variant=1) # BLEU-1 +v.rouge(min_score=0.7) # ROUGE-L by default +v.rouge(min_score=0.7, variant="1") # ROUGE-1 +v.semantic(min_score=0.8) # Semantic similarity + +# Constraint validators +v.length(max_chars=500) +v.length(min_chars=100, max_chars=500) +v.readability(max_grade=8) +v.contains(terms=["hello", "world"]) +v.excludes(terms=["error", "fail"]) +v.pattern(regex=r"^\d{4}-\d{2}-\d{2}$") + +# Composition +v.all_of([...]) # All must pass +v.any_of([...]) # At least one must pass +v.weighted( # Weighted score threshold + checks=[ + (v.bleu(min_score=0.7), 0.6), # (check, weight) tuples + (v.readability(max_grade=8), 0.4), + ], + min_score=0.75, # Minimum weighted score to pass +) +``` + +--- + +## Implementation Phases + +### Phase 1: Project Scaffold & Core + +**Goal:** Set up project structure with shared types and tokenisation. + +**Tasks:** +1. Create directory structure +2. Write `pyproject.toml` with optional dependencies +3. Create `CLAUDE.md` with project guidelines +4. Implement `core/exceptions.py` (full hierarchy) +5. Implement `core/types.py` (ValidationContext, CheckResult, BatchResult) +6. Implement `core/tokenisation.py` (WordTokeniser) +7. Implement `core/config.py` (pydantic-settings) +8. Implement `core/logging.py` (structlog configuration) +9. Create `__init__.py` with version +10. Write tests for tokenisation +11. 
Initial commit to Gitea + +**Files:** +- `pyproject.toml` +- `CLAUDE.md` +- `readme.md` (stub) +- `changelog.md` +- `src/veritext/__init__.py` +- `src/veritext/py.typed` +- `src/veritext/core/__init__.py` +- `src/veritext/core/exceptions.py` +- `src/veritext/core/types.py` +- `src/veritext/core/tokenisation.py` +- `src/veritext/core/config.py` +- `src/veritext/core/logging.py` +- `tests/conftest.py` +- `tests/test_core/test_tokenisation.py` +- `tests/test_core/test_types.py` + +**Verification:** +```bash +uv sync +uv run ruff check . +uv run ruff format --check . +uv run mypy src/ +uv run pytest tests/test_core/ -v +``` + +--- + +### Phase 2: Metrics — BLEU & Lexical + +**Goal:** Implement BLEU and lexical similarity metrics. + +**Tasks:** +1. Implement `metrics/base.py` (Metric protocol) +2. Implement `metrics/results.py` (BleuResult, LexicalResult) +3. Implement `metrics/bleu.py` (BLEU-1 through BLEU-4) +4. Implement `metrics/lexical.py` (Jaccard, token overlap) +5. Add batch processing with statistics +6. Write comprehensive tests with reference values +7. Update changelog + +**Key Design:** +```python +class Bleu: + def __init__(self, tokeniser: Tokeniser | None = None, max_n: int = 4): ... + + def score(self, candidate: str, reference: str | list[str]) -> BleuResult: ... +``` + +**Files:** +- `src/veritext/metrics/__init__.py` +- `src/veritext/metrics/base.py` +- `src/veritext/metrics/results.py` +- `src/veritext/metrics/bleu.py` +- `src/veritext/metrics/lexical.py` +- `tests/test_metrics/test_bleu.py` +- `tests/test_metrics/test_lexical.py` + +**Verification:** +```bash +uv run pytest tests/test_metrics/ -v --cov=src/veritext/metrics +# Verify BLEU matches nltk.translate.bleu_score reference +``` + +--- + +### Phase 3: Metrics — ROUGE & Readability + +**Goal:** Implement ROUGE and readability metrics. + +**Tasks:** +1. Implement `metrics/rouge.py` (ROUGE-1, ROUGE-2, ROUGE-L) +2. Implement `metrics/readability.py` (Flesch-Kincaid) +3. 
Add RougeResult, ReadabilityResult to results.py +4. Write comprehensive tests +5. Update changelog + +**Files:** +- `src/veritext/metrics/rouge.py` +- `src/veritext/metrics/readability.py` +- `tests/test_metrics/test_rouge.py` +- `tests/test_metrics/test_readability.py` + +**Verification:** +```bash +uv run pytest tests/test_metrics/ -v +# Verify ROUGE matches rouge-score library reference +``` + +--- + +### Phase 4: Validators + +**Goal:** Build composable validation system. + +**Tasks:** +1. Implement `validators/base.py` (Check protocol, ValidationResult) +2. Implement `validators/metric.py` (BleuValidator, RougeValidator) +3. Implement `validators/constraint.py` (LengthValidator, ContainsValidator, etc.) +4. Implement `validators/composite.py` (AllOf, AnyOf, Weighted) +5. Create validator factory functions (`v.bleu()`, `v.length()`, etc.) +6. Write comprehensive tests +7. Update changelog + +**Key Design:** +```python +# validators/metric.py +class BleuValidator: + def __init__( + self, + min_score: float, + variant: int = 4, + tokeniser: Tokeniser | None = None, + ): ... + + def check(self, text: str, context: ValidationContext) -> CheckResult: ... + +# validators/__init__.py (factory functions) +def bleu(min_score: float, variant: int = 4) -> BleuValidator: ... +def rouge(min_score: float, variant: str = "l") -> RougeValidator: ... +def length(min_chars: int | None = None, max_chars: int | None = None) -> LengthValidator: ... 
+``` + +**Files:** +- `src/veritext/validators/__init__.py` +- `src/veritext/validators/base.py` +- `src/veritext/validators/metric.py` +- `src/veritext/validators/constraint.py` +- `src/veritext/validators/composite.py` +- `tests/test_validators/test_metric_validators.py` +- `tests/test_validators/test_constraint_validators.py` +- `tests/test_validators/test_composite.py` + +**Verification:** +```bash +uv run pytest tests/test_validators/ -v --cov=src/veritext/validators +``` + +--- + +### Phase 5: Semantic Similarity (Optional Dependency) + +**Goal:** Add embedding-based semantic similarity as an optional feature. + +**Tasks:** +1. Implement `semantic/similarity.py` with lazy import +2. Add embedding caching +3. Add DependencyError for missing sentence-transformers +4. Implement SemanticValidator +5. Write tests (skipped if dependency missing) +6. Update changelog + +**Key Design:** +```python +# semantic/similarity.py +class SemanticSimilarity: + def __init__( + self, + model: str = "all-MiniLM-L6-v2", + cache_embeddings: bool = True, + ): + try: + from sentence_transformers import SentenceTransformer + except ImportError as exc: + # Chain the original error so the missing module stays visible + raise DependencyError( + "Install veritext[semantic] for semantic similarity: " + "pip install veritext[semantic]" + ) from exc + self._model = SentenceTransformer(model) + # None means caching is disabled + self._cache: dict[str, Any] | None = {} if cache_embeddings else None +``` + +**Files:** +- `src/veritext/semantic/__init__.py` +- `src/veritext/semantic/similarity.py` +- `tests/test_semantic/test_similarity.py` + +**Verification:** +```bash +# Without semantic dependency +uv run pytest tests/ -v --ignore=tests/test_semantic/ + +# With semantic dependency +uv pip install sentence-transformers +uv run pytest tests/test_semantic/ -v +``` + +--- + +### Phase 6: Pytest Plugin + +**Goal:** Native pytest integration for CI/CD. + +**Tasks:** +1. Create plugin structure with entry points +2. Implement fixtures: `text_validator` +3. Implement `validate_text()` assertion function +4. 
Create detailed failure formatting +5. Add `@pytest.mark.text_validation` marker +6. Write integration tests +7. Update changelog + +**Entry point:** +```toml +[project.entry-points.pytest11] +veritext = "veritext.pytest_plugin" +``` + +**Key Design:** +```python +# pytest_plugin/assertions.py +def validate_text( + text: str, + *, + reference: str | None = None, + min_bleu: float | None = None, + min_rouge: float | None = None, + min_semantic: float | None = None, + max_length: int | None = None, + max_reading_grade: int | None = None, + contains: list[str] | None = None, + excludes: list[str] | None = None, +) -> None: + """ + Assert text passes all specified validation criteria. + + Raises: + AssertionError: With detailed failure information if validation fails. + """ +``` + +**Files:** +- `src/veritext/pytest_plugin/__init__.py` +- `src/veritext/pytest_plugin/fixtures.py` +- `src/veritext/pytest_plugin/assertions.py` +- `src/veritext/pytest_plugin/plugin.py` +- `tests/test_pytest_plugin/test_integration.py` + +**Verification:** +```bash +uv pip install -e . +uv run pytest --trace-config # Active plugins list should include veritext +uv run pytest tests/test_pytest_plugin/ -v +``` + +--- + +### Phase 7: Benchmark & Regression + +**Goal:** Track quality over time, detect regressions. + +**Tasks:** +1. Implement `benchmark/models.py` (BenchmarkRun, RegressionReport) +2. Implement `benchmark/storage.py` (SQLite backend) +3. Implement `benchmark/runner.py` (Benchmark class) +4. Implement `benchmark/regression.py` (statistical detection) +5. Add `assert_no_regression()` for CI +6. Write tests +7. Update changelog + +**Key Interface:** +```python +class Benchmark: + def __init__(self, name: str, storage_path: str | Path = "benchmarks/"): ... + + def evaluate( + self, + candidates: list[str], + references: list[str], + metrics: Sequence[str] = ("rouge_l", "bleu4"), # immutable default; accepts lists too + ) -> BenchmarkRun: + """Evaluate candidates, store results, return the run record.""" + ... 
+ + def check_regression( + self, + tolerance: float = 0.05, + window: int = 10, + ) -> RegressionReport: + """Compare current run against historical baseline.""" + ... + + def assert_no_regression(self, tolerance: float = 0.05) -> None: + """Raise RegressionDetectedError if quality dropped.""" + ... +``` + +**SQLite Schema:** +```sql +CREATE TABLE benchmark_runs ( + id TEXT PRIMARY KEY, + benchmark_name TEXT NOT NULL, + timestamp TEXT NOT NULL, + veritext_version TEXT NOT NULL, + sample_count INTEGER NOT NULL, + metadata TEXT -- JSON +); + +CREATE TABLE benchmark_metrics ( + run_id TEXT REFERENCES benchmark_runs(id), + metric_name TEXT NOT NULL, + value REAL NOT NULL, + PRIMARY KEY (run_id, metric_name) +); + +CREATE INDEX idx_benchmark_name ON benchmark_runs(benchmark_name, timestamp); +``` + +**Files:** +- `src/veritext/benchmark/__init__.py` +- `src/veritext/benchmark/models.py` +- `src/veritext/benchmark/storage.py` +- `src/veritext/benchmark/runner.py` +- `src/veritext/benchmark/regression.py` +- `tests/test_benchmark/test_storage.py` +- `tests/test_benchmark/test_runner.py` +- `tests/test_benchmark/test_regression.py` + +**Verification:** +```bash +uv run pytest tests/test_benchmark/ -v --cov=src/veritext/benchmark +``` + +--- + +### Phase 8: CLI + +**Goal:** Command-line interface for validation and benchmarking. + +**Tasks:** +1. Implement Typer CLI app +2. Add `validate` command +3. Add `benchmark run` command +4. Add `benchmark show` and `benchmark check` commands +5. Add rich output formatting +6. Write CLI tests +7. 
Update changelog + +**Commands:** +```bash +veritext validate "text" --reference "ref" --metrics bleu,rouge +veritext validate --file outputs.jsonl --reference-file refs.jsonl +veritext benchmark run my_benchmark --inputs data/ --references refs/ +veritext benchmark show my_benchmark --last 20 +veritext benchmark check my_benchmark --tolerance 0.05 +``` + +**Input Formats:** +- **JSONL:** One JSON object per line with `candidate` and `reference` fields: + ```json + {"candidate": "The cat sat on the mat.", "reference": "A cat is sitting on a mat."} + {"candidate": "Hello world.", "reference": "Greetings, world."} + ``` +- **Directories:** Matching filenames in `--inputs` and `--references` directories: + ``` + data/sample1.txt ↔ refs/sample1.txt + data/sample2.txt ↔ refs/sample2.txt + ``` + +**Files:** +- `src/veritext/cli/__init__.py` +- `src/veritext/cli/main.py` +- `tests/test_cli/test_commands.py` + +**Verification:** +```bash +uv run veritext --help +uv run veritext validate "hello world" --reference "hello world" --metrics bleu +uv run pytest tests/test_cli/ -v +``` + +--- + +### Phase 9: Documentation & Polish + +**Goal:** Make portfolio-ready. + +**Tasks:** +1. Write comprehensive `readme.md` with examples +2. Add docstrings to all public APIs +3. Create example scripts +4. Ensure ≥80% test coverage +5. Final linting/type checking +6. Update `changelog.md` with 0.1.0 release +7. Update project docs in `docs/` + +**Files:** +- `readme.md` (comprehensive) +- `examples/basic_validation.py` +- `examples/chatbot_testing.py` +- `examples/benchmark_regression.py` +- Update all docstrings +- `docs/project-plan.md` (update) +- `docs/implementation-plan.md` (update) + +**Verification:** +```bash +uv run ruff check . +uv run ruff format --check . 
+uv run mypy src/ +uv run pytest --cov=src/veritext --cov-report=term-missing +# Verify ≥80% coverage +``` + +--- + +## Dependencies + +```toml +[project] +name = "veritext" +version = "0.1.0" +description = "Semantic text validation framework" +readme = "readme.md" +requires-python = ">=3.11" +dependencies = [ + "pydantic>=2.0", + "pydantic-settings>=2.0", + "structlog>=23.0", + "typer>=0.9", + "rich>=13.0", +] + +[project.optional-dependencies] +semantic = ["sentence-transformers>=2.2"] +dev = [ + "pytest>=7.0", + "pytest-cov>=4.0", + "mypy>=1.0", + "ruff>=0.1", +] +all = ["veritext[semantic]"] + +[project.scripts] +veritext = "veritext.cli.main:app" + +[project.entry-points.pytest11] +veritext = "veritext.pytest_plugin" +``` + +--- + +## Conventions + +### Code Quality +- `ruff check .` — zero issues +- `ruff format --check .` — zero changes +- `mypy src/` — passes (strict mode) +- `pytest --cov=src/veritext` — ≥80% coverage + +### Git +- **Author:** Kai Chappell +- **Signed commits:** GPG key 219AA60F0638489B +- **Format:** `type(scope): description` +- **Atomic:** ≤3 files, ≤150 LOC per commit +- **No AI/LLM attribution** + +### Python +- Python 3.11+ with modern type hints +- Absolute imports from package root +- structlog for logging +- UK English (colour, behaviour, summarisation) + +--- + +## Verification Checklist (Per Phase) + +```bash +cd /home/kai/work/dev/portfolio/veritext + +# Code quality +uv run ruff check . +uv run ruff format --check . +uv run mypy src/ + +# Tests +uv run pytest --cov=src/veritext --cov-report=term-missing + +# Package installation +uv pip install -e . +uv run python -c "import veritext; print(veritext.__version__)" +``` diff --git a/docs/project-plan.md b/docs/project-plan.md new file mode 100644 index 0000000..90de59e --- /dev/null +++ b/docs/project-plan.md @@ -0,0 +1,478 @@ +# Project Plan: Veritext — Semantic Text Validation Framework + +## Overview + +A Python library for validating text outputs against semantic criteria. 
Designed for +developers building any system that produces text — chatbots, content generators, +translation pipelines, summarisation tools — who need automated quality assurance +beyond simple string matching. + +**Origin story:** "I was building a feature that generated article summaries and got +tired of manually checking if they captured the key points. Existing tools could tell +me if two strings matched, but not if they *meant* the same thing. So I built a +validation framework that understands semantics." + +**Portfolio role:** A practical developer tool that demonstrates Python framework +design, NLP evaluation techniques, and test automation integration. The project +solves a real problem any developer working with text processing encounters. + +**Target users:** Developers building content pipelines, chatbot teams validating +responses, ML engineers evaluating model outputs, QA teams testing text-based features. + +--- + +## Problem Statement + +Text validation is hard. Traditional testing approaches fall short: + +| Approach | Problem | +|----------|---------| +| Exact string match | Fails on semantically equivalent variations | +| Substring/regex | Brittle, misses meaning entirely | +| Manual review | Doesn't scale, subjective | +| Generic diff tools | Show *what* changed, not *if it matters* | + +**Example:** A summarisation system produces "The CEO announced layoffs affecting 500 +employees" one day and "500 workers will lose their jobs, the company's chief executive +said" the next. These are semantically equivalent, but every traditional test would +flag this as a failure. + +Veritext answers: "Is this text output *good enough* according to my criteria?" — not +"Is it identical?" 
+ +--- + +## Core Concepts + +### Metrics (Pure Computation) + +Metrics compute scores comparing candidate text to reference text: + +```python +from veritext.metrics import Bleu, Rouge + +candidate = "The cat sat on the mat" +reference = "A cat is sitting on a mat" + +bleu = Bleu() +result = bleu.score(candidate, reference) +# BleuResult(bleu1=..., bleu2=..., bleu3=..., bleu4=..., brevity_penalty=...) + +rouge = Rouge() +result = rouge.score(candidate, reference) +# RougeResult(rouge1=RougeScore(...), rouge2=RougeScore(...), rouge_l=RougeScore(...)) +``` + +**Built-in metrics:** + +| Metric | What it measures | Use case | +|--------|------------------|----------| +| BLEU-1 to BLEU-4 | N-gram precision | Translation, generation | +| ROUGE-1, ROUGE-2 | N-gram recall | Summarisation | +| ROUGE-L | Longest common subsequence | Summarisation | +| Semantic similarity | Cosine similarity of embeddings | Any meaning comparison | +| Lexical overlap | Jaccard similarity of tokens | Simple similarity | +| Reading level | Flesch-Kincaid grade | Accessibility | + +**Note:** Reading level is a standalone metric that analyses only the candidate text, so no reference is required. 
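The reading-level metric can be sketched as follows. This is a minimal illustration and not Veritext's implementation: the names `count_syllables` and `flesch_kincaid_grade` are hypothetical, and the syllable counter is a crude vowel-group heuristic. The formula is the standard Flesch-Kincaid grade level, `0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59`.

```python
import re

_VOWEL_GROUP = re.compile(r"[aeiouy]+")
_WORD = re.compile(r"[A-Za-z']+")
_SENTENCE_END = re.compile(r"[.!?]+")


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption, not a dictionary lookup):
    count vowel groups and discount a trailing silent 'e'."""
    word = word.lower().strip("'")
    groups = len(_VOWEL_GROUP.findall(word))
    if word.endswith("e") and not word.endswith("le") and groups > 1:
        groups -= 1
    return max(groups, 1)  # every word counts as at least one syllable


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level of `text`; no reference text is involved."""
    words = _WORD.findall(text)
    if not words:
        raise ValueError("text contains no words")
    # Treat each run of terminal punctuation as one sentence boundary;
    # text without any terminal punctuation counts as a single sentence.
    sentences = max(len(_SENTENCE_END.findall(text)), 1)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59


if __name__ == "__main__":
    print(flesch_kincaid_grade("The cat sat on the mat."))
```

Because the score depends only on the candidate, a readability metric built along these lines could still satisfy the shared `Metric` protocol by accepting a `reference` argument and ignoring it.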
+ +### Validators (Decision Logic) + +Validators wrap metrics and apply thresholds to make pass/fail decisions: + +```python +from veritext import validators as v +from veritext.core.types import ValidationContext + +# Compose multiple checks +validator = v.all_of([ + v.bleu(min_score=0.7), + v.length(max_chars=500), + v.readability(max_grade=8), +]) + +context = ValidationContext(reference="The quick brown fox jumps over the lazy dog") +result = validator.validate("The fast brown fox leaped over the lazy dog", context) +# ValidationResult(passed=..., checks=[...]); the outcome depends on the thresholds above +``` + +### Pytest Integration + +Native pytest fixtures and assertions for CI/CD: + +```python +from veritext.pytest_plugin import validate_text + +def test_summary_quality(summariser, document, expected_summary): + summary = summariser.summarise(document) + + validate_text( + summary, + reference=expected_summary, + min_rouge=0.7, + min_semantic=0.85, + ) +``` + +### Regression Detection + +Track output quality over time, catch degradations before users do: + +```python +from veritext.benchmark import Benchmark + +benchmark = Benchmark("summarisation_quality", storage_path="benchmarks/") +results = benchmark.evaluate(outputs, references, metrics=["rouge_l", "bleu4"]) +benchmark.assert_no_regression(tolerance=0.05) +``` + +--- + +## Tech Stack + +| Component | Technology | Rationale | +|-----------|------------|-----------| +| Core | Python 3.11+ | Target ecosystem, modern type hints | +| Metrics | Custom implementations | Full control, understanding of algorithms | +| Embeddings | sentence-transformers | Semantic similarity (optional) | +| Test integration | pytest | Fixtures, plugins, assertions | +| CLI | typer | Consistent with portfolio projects | +| Data handling | pydantic | Validation, serialisation | +| Storage | SQLite | Benchmark history, lightweight | +| Output | rich | Terminal formatting | + +--- + +## Architecture + +### Layered Design + +``` +┌─────────────────────────────────────────────────────┐ +│ 
CLI / pytest_plugin (presentation layer) │ +├─────────────────────────────────────────────────────┤ +│ validators/ (decision logic) │ +│ benchmark/ (tracking & regression) │ +├─────────────────────────────────────────────────────┤ +│ metrics/ (pure computation) │ +├─────────────────────────────────────────────────────┤ +│ core/ (shared types, tokenisation) │ +└─────────────────────────────────────────────────────┘ +``` + +**Dependency rule:** Each layer depends only on layers below it. + +### Key Design Decisions + +1. **Metrics vs Validators separation** — Metrics compute scores; validators make + pass/fail decisions. Clear separation of concerns. + +2. **Typed result objects** — Each metric returns a specific result type (e.g., + `BleuResult`, `RougeResult`), not just `float`. Full information preserved. + +3. **Optional heavy dependencies** — `sentence-transformers` (~2GB with PyTorch) is + optional. Core library works without ML dependencies. + +4. **Shared tokenisation** — Single `Tokeniser` protocol used by all n-gram metrics. + Consistent behaviour across BLEU and ROUGE. + +5. **Explicit context** — `ValidationContext` dataclass instead of `**kwargs`. + Type-safe, discoverable API. + +--- + +## Project Components + +### Component 1: Core Module + +Shared types, exceptions, and tokenisation. + +**Types:** +- `ValidationContext` — reference text and metadata for validation +- `CheckResult` — individual check result with diagnostics +- `ValidationResult` — aggregate result with pass/fail and all checks +- `BatchResult` — statistics over multiple evaluations + +**Tokeniser:** +```python +class Tokeniser(Protocol): + def tokenise(self, text: str) -> list[str]: ... + +class WordTokeniser: + def __init__(self, lowercase: bool = True, remove_punctuation: bool = True): ... +``` + +--- + +### Component 2: Metric Engine + +Pure implementations of text evaluation metrics. + +**Interface:** +```python +class Metric(Protocol[T]): + @property + def name(self) -> str: ... 
+ + def score(self, candidate: str, reference: str | list[str]) -> T: ... + + def batch_score( + self, + candidates: list[str], + references: list[str] | list[list[str]] + ) -> BatchResult[T]: ... +``` + +**Metrics:** +- `Bleu` — BLEU-1 through BLEU-4 with brevity penalty +- `Rouge` — ROUGE-1, ROUGE-2, ROUGE-L with precision/recall/F1 +- `Lexical` — Jaccard similarity, token overlap +- `Readability` — Flesch-Kincaid grade level +- `SemanticSimilarity` — Embedding cosine similarity (optional dependency) + +--- + +### Component 3: Validator Framework + +Composable validation rules with clear pass/fail semantics. + +**Built-in validators:** + +| Validator | Description | +|-----------|-------------| +| `v.bleu(min_score, variant)` | BLEU score above minimum | +| `v.rouge(min_score, variant)` | ROUGE score above minimum | +| `v.semantic(min_score)` | Semantic similarity above threshold | +| `v.length(min_chars, max_chars)` | Length constraints | +| `v.readability(max_grade)` | Reading level constraint | +| `v.contains(terms)` | Required terms present | +| `v.excludes(terms)` | Forbidden terms absent | +| `v.pattern(regex)` | Regex pattern match | + +**Composition:** + +```python +# All validators must pass +v.all_of([v.bleu(min_score=0.7), v.length(max_chars=500)]) + +# At least one must pass +v.any_of([v.contains(["error"]), v.contains(["failed"])]) + +# Weighted scoring +v.weighted([ + (v.bleu(min_score=0.7), 0.6), + (v.readability(max_grade=8), 0.4), +], min_score=0.75) +``` + +--- + +### Component 4: Pytest Plugin + +First-class pytest integration for CI/CD pipelines. 
+ +**Features:** +- Custom assertions with detailed failure messages +- Fixtures for common validation patterns +- Markers for categorising text tests + +**Usage:** + +```python +from veritext.pytest_plugin import validate_text + +def test_chatbot_response(): + response = chatbot.respond("What are your hours?") + + validate_text( + response, + reference="We're open Monday to Friday, 9am to 5pm.", + min_bleu=0.6, + min_semantic=0.8, + max_length=500, + ) +``` + +**Failure output:** + +``` +FAILED test_summary.py::test_summary_quality + AssertionError: Text failed 2 of 4 checks: + + ✗ rouge: 0.58 (minimum: 0.70) + ✗ semantic: 0.72 (minimum: 0.85) + ✓ length: 342 (maximum: 500) + ✓ readability: 6.2 (maximum: 8) + + Candidate: "The company reported losses..." + Reference: "Financial results showed significant decline..." +``` + +--- + +### Component 5: Benchmark & Regression Detection + +Track quality over time, catch degradations automatically. + +**Features:** +- Store historical metric values in SQLite +- Statistical regression detection +- Configurable tolerance thresholds +- CI integration for blocking degradations + +**Usage:** + +```python +from veritext.benchmark import Benchmark + +benchmark = Benchmark("chatbot_quality", storage_path="benchmarks/") + +# Record current run (returns BenchmarkRun with metrics and metadata) +run = benchmark.evaluate( + candidates=current_outputs, + references=expected_outputs, + metrics=["rouge_l", "semantic"] +) +# run.metrics = {"rouge_l": 0.82, "semantic": 0.89} + +# Compare against historical baseline +regression = benchmark.check_regression(tolerance=0.05, window=10) + +if regression.detected: + print(f"Quality dropped: {regression.summary}") + +# In CI: fail the build on regression +benchmark.assert_no_regression(tolerance=0.05) +``` + +--- + +### Component 6: CLI Tool + +Command-line interface for quick validation and benchmarking. 
+ +```bash +# Validate a single text +$ veritext validate "Your text here" --reference "Expected text" --metrics bleu,rouge + +# Validate from files +$ veritext validate --file outputs.jsonl --reference-file expected.jsonl + +# Run benchmark +$ veritext benchmark run summarisation --inputs docs/ --references refs/ + +# Show benchmark history +$ veritext benchmark show summarisation --last 20 + +# Check for regression +$ veritext benchmark check summarisation --tolerance 0.05 +``` + +--- + +## Example Use Cases + +### Use Case 1: Chatbot Response Validation + +```python +from veritext import validators as v +from veritext.core.types import ValidationContext + +# Define acceptable response criteria +response_validator = v.all_of([ + v.length(max_chars=500), + v.readability(max_grade=8), + v.excludes(terms=["I don't know", "I'm not sure"]), +]) + +def test_chatbot_responds_helpfully(): + response = chatbot.respond("How do I reset my password?") + context = ValidationContext() + result = response_validator.validate(response, context) + assert result.passed, result.failure_summary +``` + +### Use Case 2: Summarisation Quality Gate + +```python +from veritext.pytest_plugin import validate_text + +def test_summary_captures_key_points(): + article = load_article("financial_report.txt") + summary = summariser.summarise(article) + + validate_text( + summary, + reference=load_reference_summary("financial_report_summary.txt"), + min_rouge=0.65, + min_semantic=0.80, + max_length=300, + ) +``` + +### Use Case 3: Translation Quality Monitoring + +```python +from veritext.benchmark import Benchmark + +benchmark = Benchmark("translation_en_de", storage_path="benchmarks/") + +# Nightly CI job +results = benchmark.evaluate( + candidates=translate_batch(test_documents), + references=human_translations, + metrics=["bleu4", "semantic"] +) + +# Block deployment if quality drops +benchmark.assert_no_regression(tolerance=0.03) +``` + +--- + +## Success Criteria + +- [ ] BLEU/ROUGE 
implementations match reference implementations (nltk, rouge-score) +- [ ] Semantic similarity correlates with human judgement on test pairs +- [ ] Pytest plugin installs cleanly via `pip install veritext` +- [ ] Validation of 1000 text pairs completes in <5 seconds (excluding embeddings) +- [ ] Benchmark regression detection has <5% false positive rate +- [ ] Documentation includes working examples for each use case +- [ ] All code passes ruff, mypy strict, and pytest with ≥80% coverage +- [ ] Can explain design decisions and metric theory in interview + +--- + +## Skills Demonstrated + +| Skill | How Veritext demonstrates it | +|-------|------------------------------| +| Python framework design | Composable validators, clean API, plugin architecture | +| Test automation | Native pytest integration, CI/CD workflows | +| NLP evaluation metrics | BLEU, ROUGE, semantic similarity implementations | +| Data analysis | Statistical regression detection, batch processing | +| CLI development | Typer-based interface, rich output | +| Software architecture | Layered design, clear separation of concerns | +| Documentation | Comprehensive readme, examples | +| Quality engineering | High test coverage, type safety, linting | + +--- + +## What Makes This Project Credible + +1. **Solves a real problem** — Anyone building text-based features faces validation + challenges. + +2. **Not tied to a specific technology** — Works with any text source (chatbots, LLMs, + translation APIs, content generators). It's a general-purpose tool, not an "LLM + testing framework." + +3. **Practical scope** — Not trying to reinvent pytest or build an ML platform. Focused + on one thing: validating text quality. + +4. **Demonstrates depth** — Implementing BLEU/ROUGE from understanding (not just + wrapping libraries) shows knowledge of how these metrics work. + +5. **Natural portfolio narrative** — "I was building X and needed a better way to test + it, so I built this tool." 
Most interviewers will have faced similar problems.
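
To make point 4 concrete, here is a minimal sketch of the core BLEU computation: clipped (modified) n-gram precision, a geometric mean over n = 1..4, and the brevity penalty. It is a simplification and not Veritext's planned implementation: it assumes a single reference, lowercased whitespace tokenisation, and no smoothing, so any n-gram order with zero matches gives a score of 0. All function names are illustrative.

```python
import math
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate: list[str], reference: list[str], n: int) -> float:
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it occurs in the reference."""
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0  # candidate is shorter than n tokens
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def brevity_penalty(candidate_len: int, reference_len: int) -> float:
    """Penalise candidates shorter than the reference."""
    if candidate_len == 0:
        return 0.0
    if candidate_len > reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)


def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Unsmoothed sentence-level BLEU against a single reference."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    precisions = [modified_precision(cand, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # geometric mean is zero without smoothing
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(len(cand), len(ref)) * math.exp(log_mean)
```

The clipping step is what stops a candidate such as "the the the" from scoring perfect unigram precision against a reference that contains "the" only once.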