# Implementation Plan: Veritext

Semantic text validation framework for Python that validates text outputs against quality criteria.

## Project Overview

**Location:** `/home/kai/work/dev/portfolio/veritext/`

**Remote:** `https://gitea.kschappell.com/kschappell/veritext.git`

**Purpose:** A Python library for validating text outputs against semantic criteria. Designed for developers building systems that produce text (chatbots, content generators, summarisation tools) who need automated quality assurance beyond simple string matching.

---

## Architectural Decisions

### 1. Layered Architecture

```
┌─────────────────────────────────────────────────────┐
│ CLI / pytest_plugin     (presentation layer)        │
├─────────────────────────────────────────────────────┤
│ validators/             (decision logic)            │
│ benchmark/              (tracking & regression)     │
├─────────────────────────────────────────────────────┤
│ metrics/                (pure computation)          │
├─────────────────────────────────────────────────────┤
│ core/                   (shared types, tokenisation)│
└─────────────────────────────────────────────────────┘
```

**Dependency rule:** Each layer depends only on layers below it.

### 2. Metrics vs Validators (Clear Separation)

| Concept | Responsibility | Output |
|---------|----------------|--------|
| **Metric** | Compute a score | Typed result object (e.g., `BleuResult`) |
| **Validator** | Make a pass/fail decision | `ValidationResult` with diagnostics |

Validators wrap metrics and apply thresholds.

### 3. Optional Heavy Dependencies

`sentence-transformers` (~2GB with PyTorch) is optional:

```toml
[project.optional-dependencies]
semantic = ["sentence-transformers>=2.2"]
```

The core library works without ML dependencies.

### 4. Typed Result Objects

Each metric returns a specific result type, not just `float`:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BleuResult:
    bleu1: float
    bleu2: float
    bleu3: float
    bleu4: float
    brevity_penalty: float


@dataclass(frozen=True)
class RougeScore:
    precision: float
    recall: float
    fmeasure: float


@dataclass(frozen=True)
class RougeResult:
    rouge1: RougeScore
    rouge2: RougeScore
    rouge_l: RougeScore
```

### 5. Shared Tokenisation

A single tokeniser is used by all n-gram metrics:

```python
from typing import Protocol


class Tokeniser(Protocol):
    def tokenise(self, text: str) -> list[str]: ...


class WordTokeniser:
    def __init__(self, lowercase: bool = True, remove_punctuation: bool = True): ...
```

### 6. Explicit Context Object

Validation context is explicit, not `**kwargs`:

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ValidationContext:
    reference: str | list[str] | None = None
    metadata: dict[str, Any] = field(default_factory=dict)
```

---

## Directory Structure

```
veritext/
├── src/
│   └── veritext/
│       ├── __init__.py             # Public API exports
│       ├── py.typed                # PEP 561 marker
│       ├── core/
│       │   ├── __init__.py
│       │   ├── types.py            # ValidationContext, CheckResult, BatchResult
│       │   ├── exceptions.py       # Exception hierarchy
│       │   ├── tokenisation.py     # Shared tokeniser
│       │   ├── config.py           # pydantic-settings
│       │   └── logging.py          # structlog configuration
│       ├── metrics/
│       │   ├── __init__.py         # Metric exports
│       │   ├── base.py             # Metric protocol
│       │   ├── results.py          # BleuResult, RougeResult, etc.
│       │   ├── bleu.py             # BLEU implementation
│       │   ├── rouge.py            # ROUGE implementation
│       │   ├── lexical.py          # Jaccard, token overlap
│       │   └── readability.py      # Flesch-Kincaid, etc.
│       ├── semantic/               # Optional (requires sentence-transformers)
│       │   ├── __init__.py
│       │   └── similarity.py       # Embedding-based similarity
│       ├── validators/
│       │   ├── __init__.py         # Validator exports
│       │   ├── base.py             # Check protocol, ValidationResult
│       │   ├── metric.py           # Validators wrapping metrics
│       │   ├── constraint.py       # Length, content checks
│       │   └── composite.py        # Validator composition
│       ├── benchmark/
│       │   ├── __init__.py
│       │   ├── models.py           # BenchmarkRun, RegressionReport
│       │   ├── storage.py          # SQLite backend
│       │   ├── runner.py           # Benchmark execution
│       │   └── regression.py       # Statistical detection
│       ├── pytest_plugin/
│       │   ├── __init__.py         # Plugin entry point
│       │   ├── fixtures.py         # Pytest fixtures
│       │   ├── assertions.py       # validate_text(), assert_similar()
│       │   └── plugin.py           # Pytest hooks
│       └── cli/
│           ├── __init__.py
│           └── main.py             # Typer CLI app
├── tests/
│   ├── conftest.py
│   ├── test_core/
│   │   ├── test_tokenisation.py
│   │   └── test_types.py
│   ├── test_metrics/
│   │   ├── test_bleu.py
│   │   ├── test_rouge.py
│   │   ├── test_lexical.py
│   │   └── test_readability.py
│   ├── test_semantic/
│   │   └── test_similarity.py
│   ├── test_validators/
│   │   ├── test_metric_validators.py
│   │   ├── test_constraint_validators.py
│   │   └── test_composite.py
│   ├── test_benchmark/
│   │   ├── test_storage.py
│   │   ├── test_runner.py
│   │   └── test_regression.py
│   ├── test_pytest_plugin/
│   │   └── test_integration.py
│   └── test_cli/
│       └── test_commands.py
├── examples/
│   ├── basic_validation.py
│   ├── chatbot_testing.py
│   └── benchmark_regression.py
├── docs/
│   ├── project-plan.md
│   └── implementation-plan.md
├── pyproject.toml
├── readme.md
├── changelog.md
└── CLAUDE.md
```

---

## Exception Hierarchy

```python
class VeritextError(Exception):
    """Base exception for all Veritext errors."""


class MetricError(VeritextError):
    """Error during metric computation."""


class TokenisationError(MetricError):
    """Error during text tokenisation."""


class EmbeddingError(MetricError):
    """Error computing embeddings (semantic similarity)."""


class ValidationError(VeritextError):
    """Error during validation."""


class InvalidThresholdError(ValidationError):
    """Invalid threshold value provided."""


class BenchmarkError(VeritextError):
    """Error during benchmarking."""


class StorageError(BenchmarkError):
    """Error reading/writing benchmark storage."""


class RegressionDetectedError(BenchmarkError):
    """Quality regression detected (used in CI)."""


class ConfigurationError(VeritextError):
    """Invalid configuration."""


class DependencyError(VeritextError):
    """Optional dependency not installed."""
```

---

## Core Interfaces

### Metric Protocol

```python
from __future__ import annotations  # BatchResult is referenced before its definition

from dataclasses import dataclass
from typing import Generic, Protocol, TypeVar

T = TypeVar("T")


class Metric(Protocol[T]):
    """Protocol for text comparison metrics."""

    @property
    def name(self) -> str: ...

    def score(self, candidate: str, reference: str | list[str]) -> T: ...

    def batch_score(
        self,
        candidates: list[str],
        references: list[str] | list[list[str]],
    ) -> BatchResult[T]: ...


@dataclass
class AggregateStats:
    mean: float
    std: float
    min: float
    max: float
    percentiles: dict[int, float]  # {25: 0.65, 50: 0.72, 75: 0.81, 95: 0.89}


@dataclass
class BatchResult(Generic[T]):
    results: list[T]  # Individual results per sample
    count: int
    stats: dict[str, AggregateStats]  # Aggregated stats for numeric fields
```

**Note:** Readability metrics (Flesch-Kincaid) accept but ignore the `reference` parameter since they only analyse the candidate text.

### Validator Protocol

```python
class Check(Protocol):
    """Protocol for individual validation checks."""

    @property
    def name(self) -> str: ...

    def check(self, text: str, context: ValidationContext) -> CheckResult: ...


@dataclass
class CheckResult:
    name: str
    passed: bool
    actual: Any
    threshold: Any | None
    message: str


@dataclass
class ValidationResult:
    passed: bool
    checks: list[CheckResult]

    @property
    def failure_summary(self) -> str: ...

    @property
    def failed_checks(self) -> list[CheckResult]: ...
```

### Benchmark Models

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any


@dataclass
class BenchmarkRun:
    id: str  # UUID
    benchmark_name: str
    timestamp: datetime
    veritext_version: str  # Track library version
    metrics: dict[str, float]  # {"rouge_l": 0.82, "bleu4": 0.71}
    sample_count: int
    metadata: dict[str, Any]  # {"git_sha": "abc123", "model": "v2"}


@dataclass
class RegressionReport:
    detected: bool
    baseline: dict[str, float]
    current: dict[str, float]
    deltas: dict[str, float]  # {"rouge_l": -0.05}
    tolerance: float

    @property
    def summary(self) -> str: ...
```

---

## Validator Naming Convention

Consistent short names:

```python
from veritext import validators as v

# Metric-based validators
v.bleu(min_score=0.7)                 # BLEU-4 by default
v.bleu(min_score=0.7, variant=1)      # BLEU-1
v.rouge(min_score=0.7)                # ROUGE-L by default
v.rouge(min_score=0.7, variant="1")   # ROUGE-1
v.semantic(min_score=0.8)             # Semantic similarity

# Constraint validators
v.length(max_chars=500)
v.length(min_chars=100, max_chars=500)
v.readability(max_grade=8)
v.contains(terms=["hello", "world"])
v.excludes(terms=["error", "fail"])
v.pattern(regex=r"^\d{4}-\d{2}-\d{2}$")

# Composition
v.all_of([...])   # All must pass
v.any_of([...])   # At least one must pass
v.weighted(       # Weighted score threshold
    checks=[
        (v.bleu(min_score=0.7), 0.6),  # (check, weight) tuples
        (v.readability(max_grade=8), 0.4),
    ],
    min_score=0.75,  # Minimum weighted score to pass
)
```

---

## Implementation Phases

### Phase 1: Project Scaffold & Core

**Goal:** Set up project structure with shared types and tokenisation.

**Tasks:**

1. Create directory structure
2. Write `pyproject.toml` with optional dependencies
3. Create `CLAUDE.md` with project guidelines
4. Implement `core/exceptions.py` (full hierarchy)
5. Implement `core/types.py` (ValidationContext, CheckResult, BatchResult)
6. Implement `core/tokenisation.py` (WordTokeniser)
7. Implement `core/config.py` (pydantic-settings)
8. Implement `core/logging.py` (structlog configuration)
9. Create `__init__.py` with version
10. Write tests for tokenisation and core types
11. Initial commit to Gitea

**Files:**

- `pyproject.toml`
- `CLAUDE.md`
- `readme.md` (stub)
- `changelog.md`
- `src/veritext/__init__.py`
- `src/veritext/py.typed`
- `src/veritext/core/__init__.py`
- `src/veritext/core/exceptions.py`
- `src/veritext/core/types.py`
- `src/veritext/core/tokenisation.py`
- `src/veritext/core/config.py`
- `src/veritext/core/logging.py`
- `tests/conftest.py`
- `tests/test_core/test_tokenisation.py`
- `tests/test_core/test_types.py`

**Verification:**

```bash
uv sync
uv run ruff check .
uv run ruff format --check .
uv run mypy src/
uv run pytest tests/test_core/ -v
```

---

### Phase 2: Metrics (BLEU & Lexical)

**Goal:** Implement BLEU and lexical similarity metrics.

**Tasks:**

1. Implement `metrics/base.py` (Metric protocol)
2. Implement `metrics/results.py` (BleuResult, LexicalResult)
3. Implement `metrics/bleu.py` (BLEU-1 through BLEU-4)
4. Implement `metrics/lexical.py` (Jaccard, token overlap)
5. Add batch processing with statistics
6. Write comprehensive tests with reference values
7. Update changelog

**Key Design:**

```python
class Bleu:
    def __init__(self, tokeniser: Tokeniser | None = None, max_n: int = 4): ...

    def score(self, candidate: str, reference: str | list[str]) -> BleuResult: ...
```

**Files:**

- `src/veritext/metrics/__init__.py`
- `src/veritext/metrics/base.py`
- `src/veritext/metrics/results.py`
- `src/veritext/metrics/bleu.py`
- `src/veritext/metrics/lexical.py`
- `tests/test_metrics/test_bleu.py`
- `tests/test_metrics/test_lexical.py`

**Verification:**

```bash
uv run pytest tests/test_metrics/ -v --cov=src/veritext/metrics
# Verify BLEU matches nltk.translate.bleu_score reference
```

---

### Phase 3: Metrics (ROUGE & Readability)

**Goal:** Implement ROUGE and readability metrics.

**Tasks:**

1. Implement `metrics/rouge.py` (ROUGE-1, ROUGE-2, ROUGE-L)
2. Implement `metrics/readability.py` (Flesch-Kincaid)
3. Add RougeResult, ReadabilityResult to `results.py`
4. Write comprehensive tests
5. Update changelog

**Files:**

- `src/veritext/metrics/rouge.py`
- `src/veritext/metrics/readability.py`
- `tests/test_metrics/test_rouge.py`
- `tests/test_metrics/test_readability.py`

**Verification:**

```bash
uv run pytest tests/test_metrics/ -v
# Verify ROUGE matches rouge-score library reference
```

---

### Phase 4: Validators

**Goal:** Build a composable validation system.

**Tasks:**

1. Implement `validators/base.py` (Check protocol, ValidationResult)
2. Implement `validators/metric.py` (BleuValidator, RougeValidator)
3. Implement `validators/constraint.py` (LengthValidator, ContainsValidator, etc.)
4. Implement `validators/composite.py` (AllOf, AnyOf, Weighted)
5. Create validator factory functions (`v.bleu()`, `v.length()`, etc.)
6. Write comprehensive tests
7. Update changelog

**Key Design:**

```python
# validators/metric.py
class BleuValidator:
    def __init__(
        self,
        min_score: float,
        variant: int = 4,
        tokeniser: Tokeniser | None = None,
    ): ...

    def check(self, text: str, context: ValidationContext) -> CheckResult: ...


# validators/__init__.py (factory functions)
def bleu(min_score: float, variant: int = 4) -> BleuValidator: ...
def rouge(min_score: float, variant: str = "l") -> RougeValidator: ...
def length(min_chars: int | None = None, max_chars: int | None = None) -> LengthValidator: ...
```

**Files:**

- `src/veritext/validators/__init__.py`
- `src/veritext/validators/base.py`
- `src/veritext/validators/metric.py`
- `src/veritext/validators/constraint.py`
- `src/veritext/validators/composite.py`
- `tests/test_validators/test_metric_validators.py`
- `tests/test_validators/test_constraint_validators.py`
- `tests/test_validators/test_composite.py`

**Verification:**

```bash
uv run pytest tests/test_validators/ -v --cov=src/veritext/validators
```

---

### Phase 5: Semantic Similarity (Optional Dependency)

**Goal:** Add embedding-based semantic similarity as an optional feature.

**Tasks:**

1. Implement `semantic/similarity.py` with lazy import
2. Add embedding caching
3. Add DependencyError for missing sentence-transformers
4. Implement SemanticValidator
5. Write tests (skipped if dependency missing)
6. Update changelog

**Key Design:**

```python
# semantic/similarity.py
from typing import Any

from veritext.core.exceptions import DependencyError


class SemanticSimilarity:
    def __init__(
        self,
        model: str = "all-MiniLM-L6-v2",
        cache_embeddings: bool = True,
    ):
        # Import lazily so the core library works without the ML stack.
        try:
            from sentence_transformers import SentenceTransformer
        except ImportError as exc:
            raise DependencyError(
                "Install veritext[semantic] for semantic similarity: "
                "pip install veritext[semantic]"
            ) from exc
        self._model = SentenceTransformer(model)
        # None means caching is disabled.
        self._cache: dict[str, Any] | None = {} if cache_embeddings else None
```

**Files:**

- `src/veritext/semantic/__init__.py`
- `src/veritext/semantic/similarity.py`
- `tests/test_semantic/test_similarity.py`

**Verification:**

```bash
# Without semantic dependency
uv run pytest tests/ -v --ignore=tests/test_semantic/

# With semantic dependency
uv pip install sentence-transformers
uv run pytest tests/test_semantic/ -v
```

---

### Phase 6: Pytest Plugin

**Goal:** Native pytest integration for CI/CD.

**Tasks:**

1. Create plugin structure with entry points
2. Implement fixtures: `text_validator`
3. Implement `validate_text()` assertion function
4. Create detailed failure formatting
5. Add `@pytest.mark.text_validation` marker
6. Write integration tests
7. Update changelog

**Entry point:**

```toml
[project.entry-points.pytest11]
veritext = "veritext.pytest_plugin"
```

**Key Design:**

```python
# pytest_plugin/assertions.py
def validate_text(
    text: str,
    *,
    reference: str | None = None,
    min_bleu: float | None = None,
    min_rouge: float | None = None,
    min_semantic: float | None = None,
    max_length: int | None = None,
    max_reading_grade: int | None = None,
    contains: list[str] | None = None,
    excludes: list[str] | None = None,
) -> None:
    """
    Assert text passes all specified validation criteria.

    Raises:
        AssertionError: With detailed failure information if validation fails.
    """
```

**Files:**

- `src/veritext/pytest_plugin/__init__.py`
- `src/veritext/pytest_plugin/fixtures.py`
- `src/veritext/pytest_plugin/assertions.py`
- `src/veritext/pytest_plugin/plugin.py`
- `tests/test_pytest_plugin/test_integration.py`

**Verification:**

```bash
uv pip install -e .
uv run pytest --co  # The "plugins:" header line should list veritext
uv run pytest tests/test_pytest_plugin/ -v
```

---

### Phase 7: Benchmark & Regression

**Goal:** Track quality over time and detect regressions.

**Tasks:**

1. Implement `benchmark/models.py` (BenchmarkRun, RegressionReport)
2. Implement `benchmark/storage.py` (SQLite backend)
3. Implement `benchmark/runner.py` (Benchmark class)
4. Implement `benchmark/regression.py` (statistical detection)
5. Add `assert_no_regression()` for CI
6. Write tests
7. Update changelog

**Key Interface:**

```python
from collections.abc import Sequence
from pathlib import Path


class Benchmark:
    def __init__(self, name: str, storage_path: str | Path = "benchmarks/"): ...

    def evaluate(
        self,
        candidates: list[str],
        references: list[str],
        metrics: Sequence[str] = ("rouge_l", "bleu4"),  # immutable default
    ) -> BenchmarkRun:
        """Evaluate candidates, store results, return the run record."""
        ...

    def check_regression(
        self,
        tolerance: float = 0.05,
        window: int = 10,
    ) -> RegressionReport:
        """Compare current run against historical baseline."""
        ...

    def assert_no_regression(self, tolerance: float = 0.05) -> None:
        """Raise RegressionDetectedError if quality dropped."""
        ...
```

**SQLite Schema:**

```sql
CREATE TABLE benchmark_runs (
    id TEXT PRIMARY KEY,
    benchmark_name TEXT NOT NULL,
    timestamp TEXT NOT NULL,
    veritext_version TEXT NOT NULL,
    sample_count INTEGER NOT NULL,
    metadata TEXT -- JSON
);

CREATE TABLE benchmark_metrics (
    run_id TEXT REFERENCES benchmark_runs(id),
    metric_name TEXT NOT NULL,
    value REAL NOT NULL,
    PRIMARY KEY (run_id, metric_name)
);

CREATE INDEX idx_benchmark_name ON benchmark_runs(benchmark_name, timestamp);
```

**Files:**

- `src/veritext/benchmark/__init__.py`
- `src/veritext/benchmark/models.py`
- `src/veritext/benchmark/storage.py`
- `src/veritext/benchmark/runner.py`
- `src/veritext/benchmark/regression.py`
- `tests/test_benchmark/test_storage.py`
- `tests/test_benchmark/test_runner.py`
- `tests/test_benchmark/test_regression.py`

**Verification:**

```bash
uv run pytest tests/test_benchmark/ -v --cov=src/veritext/benchmark
```

---

### Phase 8: CLI

**Goal:** Command-line interface for validation and benchmarking.

**Tasks:**

1. Implement Typer CLI app
2. Add `validate` command
3. Add `benchmark run` command
4. Add `benchmark show` command
5. Add `benchmark check` command
6. Add rich output formatting
7. Write CLI tests
8. Update changelog

**Commands:**

```bash
veritext validate "text" --reference "ref" --metrics bleu,rouge
veritext validate --file outputs.jsonl --reference-file refs.jsonl
veritext benchmark run my_benchmark --inputs data/ --references refs/
veritext benchmark show my_benchmark --last 20
veritext benchmark check my_benchmark --tolerance 0.05
```

**Input Formats:**

- **JSONL:** One JSON object per line with `candidate` and `reference` fields:
  ```json
  {"candidate": "The cat sat on the mat.", "reference": "A cat is sitting on a mat."}
  {"candidate": "Hello world.", "reference": "Greetings, world."}
  ```
- **Directories:** Matching filenames in the `--inputs` and `--references` directories:
  ```
  data/sample1.txt ↔ refs/sample1.txt
  data/sample2.txt ↔ refs/sample2.txt
  ```

**Files:**

- `src/veritext/cli/__init__.py`
- `src/veritext/cli/main.py`
- `tests/test_cli/test_commands.py`

**Verification:**

```bash
uv run veritext --help
uv run veritext validate "hello world" --reference "hello world" --metrics bleu
uv run pytest tests/test_cli/ -v
```

---

### Phase 9: Documentation & Polish

**Goal:** Make the project portfolio-ready.

**Tasks:**

1. Write comprehensive `readme.md` with examples
2. Add docstrings to all public APIs
3. Create example scripts
4. Ensure ≥80% test coverage
5. Final linting/type checking
6. Update `changelog.md` with 0.1.0 release
7. Update project docs in `docs/`

**Files:**

- `readme.md` (comprehensive)
- `examples/basic_validation.py`
- `examples/chatbot_testing.py`
- `examples/benchmark_regression.py`
- Update all docstrings
- `docs/project-plan.md` (update)
- `docs/implementation-plan.md` (update)

**Verification:**

```bash
uv run ruff check .
uv run ruff format --check .
uv run mypy src/
uv run pytest --cov=src/veritext --cov-report=term-missing
# Verify ≥80% coverage
```

---

## Dependencies

```toml
[project]
name = "veritext"
version = "0.1.0"
description = "Semantic text validation framework"
readme = "readme.md"
requires-python = ">=3.11"
dependencies = [
    "pydantic>=2.0",
    "pydantic-settings>=2.0",
    "structlog>=23.0",
    "typer>=0.9",
    "rich>=13.0",
]

[project.optional-dependencies]
semantic = ["sentence-transformers>=2.2"]
dev = [
    "pytest>=7.0",
    "pytest-cov>=4.0",
    "mypy>=1.0",
    "ruff>=0.1",
]
all = ["veritext[semantic]"]

[project.scripts]
veritext = "veritext.cli.main:app"

[project.entry-points.pytest11]
veritext = "veritext.pytest_plugin"
```

---

## Conventions

### Code Quality

- `ruff check .`: zero issues
- `ruff format --check .`: zero changes
- `mypy src/`: passes (strict mode)
- `pytest --cov=src/veritext`: ≥80% coverage

### Git

- **Author:** Kai Chappell
- **Signed commits:** GPG key 219AA60F0638489B
- **Format:** `type(scope): description`
- **Atomic:** ≤3 files, ≤150 LOC per commit
- **No AI/LLM attribution**

### Python

- Python 3.11+ with modern type hints
- Absolute imports from package root
- structlog for logging
- UK English (colour, behaviour, summarisation)

---

## Verification Checklist (Per Phase)

```bash
cd /home/kai/work/dev/portfolio/veritext

# Code quality
uv run ruff check .
uv run ruff format --check .
uv run mypy src/

# Tests
uv run pytest --cov=src/veritext --cov-report=term-missing

# Package installation
uv pip install -e .
uv run python -c "import veritext; print(veritext.__version__)"
```
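
---

## Illustrative Sketch: Metric and Validator Layers

The separation described under "Metrics vs Validators" can be illustrated with a small, self-contained sketch. It is not the planned implementation: `JaccardResult`, `jaccard()`, and `jaccard_check()` are hypothetical names, the regex-based tokenisation is an assumed behaviour for `WordTokeniser`, and `CheckResult` is simplified to `float` fields. The point is only to show a metric returning a typed result and a validator applying a threshold on top of it.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class JaccardResult:
    """Hypothetical typed result for a lexical metric."""

    score: float
    shared_tokens: int


@dataclass
class CheckResult:
    """Simplified version of the CheckResult listed under Core Interfaces."""

    name: str
    passed: bool
    actual: float
    threshold: float | None
    message: str


class WordTokeniser:
    """Assumed behaviour: optional lowercasing, punctuation dropped via regex."""

    def __init__(self, lowercase: bool = True, remove_punctuation: bool = True) -> None:
        self.lowercase = lowercase
        self.remove_punctuation = remove_punctuation

    def tokenise(self, text: str) -> list[str]:
        if self.lowercase:
            text = text.lower()
        if self.remove_punctuation:
            # Keep runs of word characters only.
            return re.findall(r"\w+", text)
        return text.split()


def jaccard(candidate: str, reference: str, tokeniser: WordTokeniser) -> JaccardResult:
    """Metric layer: pure computation, no pass/fail decision."""
    a = set(tokeniser.tokenise(candidate))
    b = set(tokeniser.tokenise(reference))
    shared, union = a & b, a | b
    # Two empty texts are treated as identical (a design choice of this sketch).
    score = len(shared) / len(union) if union else 1.0
    return JaccardResult(score=score, shared_tokens=len(shared))


def jaccard_check(text: str, reference: str, min_score: float) -> CheckResult:
    """Validator layer: wraps the metric and applies a threshold."""
    result = jaccard(text, reference, WordTokeniser())
    passed = result.score >= min_score
    message = "ok" if passed else f"jaccard {result.score:.2f} < {min_score:.2f}"
    return CheckResult("jaccard", passed, result.score, min_score, message)
```

Under these assumptions, `jaccard_check("The cat sat.", "the cat ran", min_score=0.4)` passes, because two of the four distinct tokens are shared. Keeping the threshold out of `jaccard()` is what would let the same metric feed both validators and benchmark tracking.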