docs(plans): improve consistency and add edge case handling

- Add requires_reference property to Metric protocol for standalone metrics
- Make reference parameter optional in score/batch_score methods
- Add comprehensive Edge Case Handling section (empty text, Unicode, etc.)
- Expand phase tasks with explicit test coverage requirements
- Fix path reference to use relative workspace path
- Add missing test_runner.py to directory structure
- Clarify SemanticValidator integration in Phase 5
- Fix tuple/list type annotation in Benchmark.evaluate()

2026-02-03 16:04:02 +00:00

Implementation Plan: Veritext

Semantic text validation framework for Python — validates text outputs against quality criteria.

Project Overview

Location: portfolio/veritext/ (relative to workspace root)

Remote: https://gitea.kschappell.com/kschappell/veritext.git

Purpose: A Python library for validating text outputs against semantic criteria. Designed for developers building systems that produce text (chatbots, content generators, summarisation tools) who need automated quality assurance beyond simple string matching.

Architectural Decisions

1. Layered Architecture

┌─────────────────────────────────────────────────────┐
│  CLI / pytest_plugin  (presentation layer)          │
├─────────────────────────────────────────────────────┤
│  validators/          (decision logic)              │
│  benchmark/           (tracking & regression)       │
├─────────────────────────────────────────────────────┤
│  metrics/             (pure computation)            │
├─────────────────────────────────────────────────────┤
│  core/                (shared types, tokenisation)  │
└─────────────────────────────────────────────────────┘

Dependency rule: Each layer depends only on layers below it.

2. Metrics vs Validators (Clear Separation)

| Concept | Responsibility | Output |
|---|---|---|
| Metric | Compute a score | Typed result object (e.g., `BleuResult`) |
| Validator | Make pass/fail decision | `ValidationResult` with diagnostics |

Validators wrap metrics and apply thresholds.
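
A minimal sketch of this separation, under stated assumptions: `jaccard`, `Outcome` and `JaccardCheck` are hypothetical names chosen for illustration and are not part of the planned API. The metric only computes; the validator owns the threshold and the pass/fail decision.

```python
from dataclasses import dataclass


def jaccard(candidate: str, reference: str) -> float:
    """Metric: pure computation, returns a score and makes no decision."""
    a = set(candidate.lower().split())
    b = set(reference.lower().split())
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


@dataclass(frozen=True)
class Outcome:
    """Hypothetical stand-in for the plan's CheckResult."""

    passed: bool
    actual: float
    threshold: float


@dataclass(frozen=True)
class JaccardCheck:
    """Validator: wraps the metric and applies a threshold."""

    min_score: float

    def check(self, text: str, reference: str) -> Outcome:
        actual = jaccard(text, reference)
        return Outcome(actual >= self.min_score, actual, self.min_score)
```

Keeping the threshold out of the metric means the same score can feed several validators, or a benchmark run, without being recomputed under different rules.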

3. Optional Heavy Dependencies

sentence-transformers (~2GB with PyTorch) is optional:

[project.optional-dependencies]
semantic = ["sentence-transformers>=2.2"]

Core library works without ML dependencies.

4. Typed Result Objects

Each metric returns a specific result type, not just float:

@dataclass(frozen=True)
class BleuResult:
    bleu1: float
    bleu2: float
    bleu3: float
    bleu4: float
    brevity_penalty: float

@dataclass(frozen=True)
class RougeScore:
    precision: float
    recall: float
    fmeasure: float

@dataclass(frozen=True)
class RougeResult:
    rouge1: RougeScore
    rouge2: RougeScore
    rouge_l: RougeScore

5. Shared Tokenisation

Single tokeniser used by all n-gram metrics:

class Tokeniser(Protocol):
    def tokenise(self, text: str) -> list[str]: ...

class WordTokeniser:
    def __init__(self, lowercase: bool = True, remove_punctuation: bool = True): ...
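
One way the `WordTokeniser` body might look, as a sketch only. The plan fixes the constructor signature but not the splitting rules, so whitespace splitting and the use of Unicode category `P*` to define "punctuation" are assumptions made here.

```python
import unicodedata


class WordTokeniser:
    """Sketch of a whitespace tokeniser with NFC normalisation."""

    def __init__(self, lowercase: bool = True, remove_punctuation: bool = True) -> None:
        self.lowercase = lowercase
        self.remove_punctuation = remove_punctuation

    def tokenise(self, text: str) -> list:
        # Normalise first so composed and decomposed forms tokenise identically.
        text = unicodedata.normalize("NFC", text)
        if self.lowercase:
            text = text.lower()
        if self.remove_punctuation:
            # Assumption: punctuation is any character in a Unicode "P" category.
            # Symbols such as emoji (category "So") are left in place.
            text = "".join(
                " " if unicodedata.category(ch).startswith("P") else ch
                for ch in text
            )
        # str.split() with no argument drops empty strings, so empty and
        # whitespace-only input both yield an empty token list.
        return text.split()
```

Note that replacing punctuation with a space splits contractions (`don't` becomes `don` and `t`); whether that is acceptable for the n-gram metrics is a design decision the plan leaves open.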

6. Explicit Context Object

Validation context is explicit, not **kwargs:

@dataclass
class ValidationContext:
    reference: str | list[str] | None = None
    metadata: dict[str, Any] = field(default_factory=dict)

Directory Structure

veritext/
├── src/
│   └── veritext/
│       ├── __init__.py              # Public API exports
│       ├── py.typed                 # PEP 561 marker
│       ├── core/
│       │   ├── __init__.py
│       │   ├── types.py             # ValidationContext, CheckResult, BatchResult
│       │   ├── exceptions.py        # Exception hierarchy
│       │   ├── tokenisation.py      # Shared tokeniser
│       │   ├── config.py            # pydantic-settings
│       │   └── logging.py           # structlog configuration
│       ├── metrics/
│       │   ├── __init__.py          # Metric exports
│       │   ├── base.py              # Metric protocol
│       │   ├── results.py           # BleuResult, RougeResult, etc.
│       │   ├── bleu.py              # BLEU implementation
│       │   ├── rouge.py             # ROUGE implementation
│       │   ├── lexical.py           # Jaccard, token overlap
│       │   └── readability.py       # Flesch-Kincaid, etc.
│       ├── semantic/                # Optional (requires sentence-transformers)
│       │   ├── __init__.py
│       │   └── similarity.py        # Embedding-based similarity
│       ├── validators/
│       │   ├── __init__.py          # Validator exports
│       │   ├── base.py              # Check protocol, ValidationResult
│       │   ├── metric.py            # Validators wrapping metrics
│       │   ├── constraint.py        # Length, content checks
│       │   └── composite.py         # Validator composition
│       ├── benchmark/
│       │   ├── __init__.py
│       │   ├── models.py            # BenchmarkRun, RegressionReport
│       │   ├── storage.py           # SQLite backend
│       │   ├── runner.py            # Benchmark execution
│       │   └── regression.py        # Statistical detection
│       ├── pytest_plugin/
│       │   ├── __init__.py          # Plugin entry point
│       │   ├── fixtures.py          # Pytest fixtures
│       │   ├── assertions.py        # validate_text(), assert_similar()
│       │   └── plugin.py            # Pytest hooks
│       └── cli/
│           ├── __init__.py
│           └── main.py              # Typer CLI app
├── tests/
│   ├── conftest.py
│   ├── test_core/
│   │   ├── test_tokenisation.py
│   │   └── test_types.py
│   ├── test_metrics/
│   │   ├── test_bleu.py
│   │   ├── test_rouge.py
│   │   ├── test_lexical.py
│   │   └── test_readability.py
│   ├── test_semantic/
│   │   └── test_similarity.py
│   ├── test_validators/
│   │   ├── test_metric_validators.py
│   │   ├── test_constraint_validators.py
│   │   └── test_composite.py
│   ├── test_benchmark/
│   │   ├── test_storage.py
│   │   ├── test_runner.py
│   │   └── test_regression.py
│   ├── test_pytest_plugin/
│   │   └── test_integration.py
│   └── test_cli/
│       └── test_commands.py
├── examples/
│   ├── basic_validation.py
│   ├── chatbot_testing.py
│   └── benchmark_regression.py
├── docs/
│   ├── project-plan.md
│   └── implementation-plan.md
├── pyproject.toml
├── readme.md
├── changelog.md
└── CLAUDE.md

Exception Hierarchy

class VeritextError(Exception):
    """Base exception for all Veritext errors."""

class MetricError(VeritextError):
    """Error during metric computation."""

class TokenisationError(MetricError):
    """Error during text tokenisation."""

class EmbeddingError(MetricError):
    """Error computing embeddings (semantic similarity)."""

class ValidationError(VeritextError):
    """Error during validation."""

class InvalidThresholdError(ValidationError):
    """Invalid threshold value provided."""

class BenchmarkError(VeritextError):
    """Error during benchmarking."""

class StorageError(BenchmarkError):
    """Error reading/writing benchmark storage."""

class RegressionDetectedError(BenchmarkError):
    """Quality regression detected (used in CI)."""

class ConfigurationError(VeritextError):
    """Invalid configuration."""

class DependencyError(VeritextError):
    """Optional dependency not installed."""

Core Interfaces

Metric Protocol

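from __future__ import annotations  # BatchResult is referenced before it is defined

from dataclasses import dataclass
from typing import Generic, Protocol, TypeVar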

T = TypeVar("T")

class Metric(Protocol[T]):
    """Protocol for text comparison metrics."""

    @property
    def name(self) -> str: ...

    @property
    def requires_reference(self) -> bool:
        """Whether this metric requires a reference text."""
        ...

    def score(self, candidate: str, reference: str | list[str] | None = None) -> T:
        """
        Compute metric score.

        Args:
            candidate: The text to evaluate.
            reference: Reference text(s) for comparison. Required for comparison
                       metrics (BLEU, ROUGE, semantic). Ignored for standalone
                       metrics (readability).

        Raises:
            ValueError: If reference is required but not provided.
        """
        ...

    def batch_score(
        self,
        candidates: list[str],
        references: list[str] | list[list[str]] | None = None,
    ) -> BatchResult[T]: ...

@dataclass
class AggregateStats:
    mean: float
    std: float
    min: float
    max: float
    percentiles: dict[int, float]  # {25: 0.65, 50: 0.72, 75: 0.81, 95: 0.89}

@dataclass
class BatchResult(Generic[T]):
    results: list[T]                     # Individual results per sample
    count: int
    stats: dict[str, AggregateStats]     # Aggregated stats for numeric fields

Note: Standalone metrics like readability return False for requires_reference and ignore the reference parameter. Comparison metrics (BLEU, ROUGE, semantic) return True and raise ValueError if reference is None.
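
The two behaviours in the note can be sketched with a pair of toy metrics. `WordCount` and `ExactMatch` are hypothetical and exist only to illustrate the `requires_reference` contract; plain class attributes are used in place of properties, which still satisfies the protocol structurally.

```python
from __future__ import annotations


class WordCount:
    """Hypothetical standalone metric: needs no reference."""

    name = "word_count"
    requires_reference = False

    def score(self, candidate: str, reference: str | list[str] | None = None) -> int:
        # The reference is accepted for interface compatibility but ignored.
        return len(candidate.split())


class ExactMatch:
    """Hypothetical comparison metric: a reference is mandatory."""

    name = "exact_match"
    requires_reference = True

    def score(self, candidate: str, reference: str | list[str] | None = None) -> bool:
        if reference is None:
            raise ValueError(f"Reference required for {self.name}")
        # Accept a single reference or a list of references.
        references = [reference] if isinstance(reference, str) else reference
        return candidate in references
```

Callers such as validators can inspect `requires_reference` up front and fail early, instead of relying on the `ValueError` raised inside `score`.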

Validator Protocol

class Check(Protocol):
    """Protocol for individual validation checks."""

    @property
    def name(self) -> str: ...

    def check(self, text: str, context: ValidationContext) -> CheckResult: ...

@dataclass
class CheckResult:
    name: str
    passed: bool
    actual: Any
    threshold: Any | None
    message: str

@dataclass
class ValidationResult:
    passed: bool
    checks: list[CheckResult]

    @property
    def failure_summary(self) -> str: ...

    @property
    def failed_checks(self) -> list[CheckResult]: ...
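
The two properties on `ValidationResult` are left abstract above. A possible implementation is sketched below; the one-line-per-failure format of `failure_summary` is an assumption, not something the plan specifies.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass
class CheckResult:
    name: str
    passed: bool
    actual: Any
    threshold: Any | None
    message: str


@dataclass
class ValidationResult:
    passed: bool
    checks: list[CheckResult]

    @property
    def failed_checks(self) -> list[CheckResult]:
        """Only the checks that did not pass, in their original order."""
        return [check for check in self.checks if not check.passed]

    @property
    def failure_summary(self) -> str:
        """Assumed format: one 'name: message' line per failed check."""
        return "\n".join(f"{c.name}: {c.message}" for c in self.failed_checks)
```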

Benchmark Models

@dataclass
class BenchmarkRun:
    id: str                          # UUID
    benchmark_name: str
    timestamp: datetime
    veritext_version: str            # Track library version
    metrics: dict[str, float]        # {"rouge_l": 0.82, "bleu4": 0.71}
    sample_count: int
    metadata: dict[str, Any]         # {"git_sha": "abc123", "model": "v2"}

@dataclass
class RegressionReport:
    detected: bool
    baseline: dict[str, float]
    current: dict[str, float]
    deltas: dict[str, float]         # {"rouge_l": -0.05}
    tolerance: float

    @property
    def summary(self) -> str: ...
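
How `deltas` and `detected` relate to `tolerance` is not spelled out above. The sketch below assumes that higher scores are better, that `tolerance` is an absolute drop, and that only metrics present in both runs are compared; `compare` is a hypothetical helper name.

```python
from __future__ import annotations


def compare(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float,
) -> tuple[dict[str, float], bool]:
    """Return per-metric deltas and whether any drop exceeds the tolerance."""
    shared = baseline.keys() & current.keys()
    deltas = {name: current[name] - baseline[name] for name in shared}
    # Assumption: a regression is a drop larger than the absolute tolerance.
    detected = any(delta < -tolerance for delta in deltas.values())
    return deltas, detected
```

With an empty baseline (the first run) there are no shared metrics, so the function reports no regression.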

Edge Case Handling

All components must handle edge cases consistently:

Empty Text

| Input | Behaviour |
|---|---|
| Empty candidate (`""`) | Metrics return zero scores; validators fail unless explicitly configured to accept empty text |
| Empty reference (`""`) | Comparison metrics raise `ValueError` |
| Whitespace-only text | Treated as empty after tokenisation |

None Reference

| Component | Behaviour |
|---|---|
| Comparison metrics (BLEU, ROUGE, semantic) | Raise `ValueError("Reference required for {metric_name}")` |
| Standalone metrics (readability) | Ignore, compute normally |
| Validators wrapping comparison metrics | Raise `ValidationError` if `context.reference` is `None` |

Unicode & Encoding

All text assumed to be valid UTF-8 strings
Normalisation: NFC by default (configurable in Tokeniser)
Emoji and non-Latin scripts: Supported, tokenised as words where applicable
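
A short illustration of why NFC normalisation matters for token matching, using only the standard library. The helper name `nfc` is illustrative.

```python
import unicodedata

composed = "caf\u00e9"      # 'é' as a single code point
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent


def nfc(text: str) -> str:
    """Normalise text to Unicode Normalisation Form C."""
    return unicodedata.normalize("NFC", text)


# The raw strings render identically but compare unequal, so an n-gram
# metric would treat them as different tokens without normalisation.
same_before = composed == decomposed
same_after = nfc(composed) == nfc(decomposed)
```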

Very Long Text

No hard limits enforced by default
Tokeniser can accept max_tokens: int | None for truncation
Semantic similarity: Truncates to the model's max sequence length (model-dependent, often 256 or 512 tokens) with warning logged

Multiple References

BLEU and ROUGE support multiple references (list[str]):

BLEU: Clips each candidate n-gram count at its maximum count in any single reference
ROUGE: Computes against each reference, returns best score
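
The multi-reference rule for BLEU is the standard clipped (modified) n-gram precision. A sketch on pre-tokenised input follows; the function names are illustrative and brevity penalty is deliberately left out.

```python
from __future__ import annotations

from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count the n-grams in a token list (empty if the list is too short)."""
    return Counter(tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate: list[str], references: list[list[str]], n: int) -> float:
    """Modified n-gram precision against one or more references."""
    cand_counts = ngrams(candidate, n)
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0  # empty or too-short candidate scores zero
    # For each n-gram, keep the largest count seen in any single reference.
    max_ref: Counter = Counter()
    for reference in references:
        for gram, count in ngrams(reference, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    # A candidate n-gram is credited at most as often as that maximum.
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / total
```

Clipping is what stops a degenerate candidate such as "the the the the the the the" from scoring perfectly on unigrams: against references where "the" appears at most twice, only two of its seven tokens are credited.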

Validator Naming Convention

Consistent short names:

from veritext import validators as v

# Metric-based validators
v.bleu(min_score=0.7)                    # BLEU-4 by default
v.bleu(min_score=0.7, variant=1)         # BLEU-1
v.rouge(min_score=0.7)                   # ROUGE-L by default
v.rouge(min_score=0.7, variant="1")      # ROUGE-1
v.semantic(min_score=0.8)                # Semantic similarity

# Constraint validators
v.length(max_chars=500)
v.length(min_chars=100, max_chars=500)
v.readability(max_grade=8)
v.contains(terms=["hello", "world"])
v.excludes(terms=["error", "fail"])
v.pattern(regex=r"^\d{4}-\d{2}-\d{2}$")

# Composition
v.all_of([...])                          # All must pass
v.any_of([...])                          # At least one must pass
v.weighted(                              # Weighted score threshold
    checks=[
        (v.bleu(min_score=0.7), 0.6),    # (check, weight) tuples
        (v.readability(max_grade=8), 0.4),
    ],
    min_score=0.75,                      # Minimum weighted score to pass
)
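
The plan does not define how `v.weighted` turns individual checks into a score. One plausible reading, sketched here as an assumption, is that each check contributes its full weight when it passes and nothing when it fails, normalised by the total weight. `weighted_pass` is a hypothetical helper operating on already-evaluated results.

```python
from __future__ import annotations


def weighted_pass(results: list[tuple[bool, float]], min_score: float) -> tuple[float, bool]:
    """Combine (passed, weight) pairs into a weighted score and a verdict."""
    total = sum(weight for _, weight in results)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Assumption: binary contribution, i.e. a passing check earns its weight.
    score = sum(weight for passed, weight in results if passed) / total
    return score, score >= min_score
```

Under this reading, the example above (weights 0.6 and 0.4, `min_score=0.75`) passes only when both checks pass. An alternative design would weight the underlying metric scores instead of the pass/fail outcomes.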

Implementation Phases

Phase 1: Project Scaffold & Core

Goal: Set up project structure with shared types and tokenisation.

Tasks:

Create directory structure
Write pyproject.toml with optional dependencies
Create CLAUDE.md with project guidelines
Implement core/exceptions.py (full hierarchy)
Implement core/types.py (ValidationContext, CheckResult, ValidationResult)
Implement core/tokenisation.py (WordTokeniser with NFC normalisation)
Implement core/config.py (pydantic-settings)
Implement core/logging.py (structlog configuration)
Create __init__.py with __version__ and __all__ exports
Write tests for tokenisation (including Unicode, empty input, whitespace-only)
Write tests for types (including edge cases)
Initial commit to Gitea

Files:

pyproject.toml
CLAUDE.md
readme.md (stub)
changelog.md
src/veritext/__init__.py
src/veritext/py.typed
src/veritext/core/__init__.py
src/veritext/core/exceptions.py
src/veritext/core/types.py
src/veritext/core/tokenisation.py
src/veritext/core/config.py
src/veritext/core/logging.py
tests/conftest.py
tests/test_core/test_tokenisation.py
tests/test_core/test_types.py

Verification:

uv sync
uv run ruff check .
uv run ruff format --check .
uv run mypy src/
uv run pytest tests/test_core/ -v

Phase 2: Metrics — BLEU & Lexical

Goal: Implement BLEU and lexical similarity metrics.

Tasks:

Implement metrics/base.py (Metric protocol, BatchResult, AggregateStats)
Implement metrics/results.py (BleuResult, LexicalResult)
Implement metrics/bleu.py (BLEU-1 through BLEU-4)
Implement metrics/lexical.py (Jaccard, token overlap)
Add batch processing with aggregate statistics (mean, std, percentiles)
Write comprehensive tests:
- Single-pair scoring with reference values from NLTK
- Batch scoring with statistical aggregation
- Edge cases: empty text, single-word inputs, identical texts
- Multiple references support
Define __all__ exports in each module's __init__.py
Update changelog
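
The batch aggregation task could be sketched with the standard library alone. Population standard deviation and inclusive percentile interpolation are assumptions here; the plan names the statistics but not the estimators. `aggregate` is a hypothetical helper.

```python
from __future__ import annotations

import statistics
from dataclasses import dataclass


@dataclass
class AggregateStats:
    mean: float
    std: float
    min: float
    max: float
    percentiles: dict[int, float]


def aggregate(values: list[float], points: tuple[int, ...] = (25, 50, 75, 95)) -> AggregateStats:
    """Summarise one numeric field across a batch of results."""
    if not values:
        raise ValueError("cannot aggregate an empty batch")
    if len(values) > 1:
        # n=100 yields 99 cut points; index p - 1 is the p-th percentile.
        cuts = statistics.quantiles(values, n=100, method="inclusive")
        percentiles = {p: cuts[p - 1] for p in points}
    else:
        # A single sample is its own percentile at every point.
        percentiles = {p: float(values[0]) for p in points}
    return AggregateStats(
        mean=statistics.fmean(values),
        std=statistics.pstdev(values),  # assumption: population std
        min=min(values),
        max=max(values),
        percentiles=percentiles,
    )
```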

Key Design:

class Bleu:
    def __init__(self, tokeniser: Tokeniser | None = None, max_n: int = 4): ...

    def score(self, candidate: str, reference: str | list[str]) -> BleuResult: ...

Files:

src/veritext/metrics/__init__.py
src/veritext/metrics/base.py
src/veritext/metrics/results.py
src/veritext/metrics/bleu.py
src/veritext/metrics/lexical.py
tests/test_metrics/test_bleu.py
tests/test_metrics/test_lexical.py

Verification:

uv run pytest tests/test_metrics/ -v --cov=src/veritext/metrics
# Verify BLEU matches nltk.translate.bleu_score reference

Phase 3: Metrics — ROUGE & Readability

Goal: Implement ROUGE and readability metrics.

Tasks:

Implement metrics/rouge.py (ROUGE-1, ROUGE-2, ROUGE-L with precision/recall/F1)
Implement metrics/readability.py (Flesch-Kincaid grade level)
- Set requires_reference = False for standalone operation
Add RougeResult, RougeScore, ReadabilityResult to results.py
Write comprehensive tests:
- Single-pair scoring with reference values from rouge-score library
- Batch scoring with statistical aggregation
- Edge cases: empty text, very short text, identical texts
- Readability on various grade levels (children's text → academic)
Update changelog
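
For reference, the Flesch-Kincaid grade level is 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59. The sketch below applies that formula; the regex-based sentence splitting and the vowel-group syllable counter are rough heuristics assumed for illustration, and a real implementation would likely need something more careful.

```python
import re


def count_syllables(word: str) -> int:
    """Heuristic: one syllable per run of vowels, with a minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def fk_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid grade level from raw counts."""
    if words == 0 or sentences == 0:
        return 0.0  # empty text scores zero, matching the edge-case rules
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def readability_grade(text: str) -> float:
    """Grade level for a text, using naive sentence and word splitting."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(word) for word in words)
    return fk_grade(len(words), len(sentences), syllables)
```

Very simple text can produce a negative grade, which is expected from the formula and worth covering in the "children's text" test cases.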

Files:

src/veritext/metrics/rouge.py
src/veritext/metrics/readability.py
tests/test_metrics/test_rouge.py
tests/test_metrics/test_readability.py

Verification:

uv run pytest tests/test_metrics/ -v
# Verify ROUGE matches rouge-score library reference

Phase 4: Validators

Goal: Build composable validation system.

Tasks:

Implement validators/base.py (Check protocol, ValidationResult)
Implement validators/metric.py (BleuValidator, RougeValidator)
- Raise ValidationError if context.reference is None
Implement validators/constraint.py (LengthValidator, ContainsValidator, etc.)
Implement validators/composite.py (AllOf, AnyOf, Weighted)
Create validator factory functions (v.bleu(), v.length(), etc.)
Define __all__ exports in validators/__init__.py
Write comprehensive tests:
- Individual validators with passing/failing cases
- Composition (all_of, any_of, weighted)
- Edge cases: missing reference, empty text, boundary thresholds
Update changelog

Key Design:

# validators/metric.py
class BleuValidator:
    def __init__(
        self,
        min_score: float,
        variant: int = 4,
        tokeniser: Tokeniser | None = None,
    ): ...

    def check(self, text: str, context: ValidationContext) -> CheckResult: ...

# validators/__init__.py (factory functions)
def bleu(min_score: float, variant: int = 4) -> BleuValidator: ...
def rouge(min_score: float, variant: str = "l") -> RougeValidator: ...
def length(min_chars: int | None = None, max_chars: int | None = None) -> LengthValidator: ...

Files:

src/veritext/validators/__init__.py
src/veritext/validators/base.py
src/veritext/validators/metric.py
src/veritext/validators/constraint.py
src/veritext/validators/composite.py
tests/test_validators/test_metric_validators.py
tests/test_validators/test_constraint_validators.py
tests/test_validators/test_composite.py

Verification:

uv run pytest tests/test_validators/ -v --cov=src/veritext/validators

Phase 5: Semantic Similarity (Optional Dependency)

Goal: Add embedding-based semantic similarity as optional feature.

Tasks:

Implement semantic/similarity.py with lazy import
Add embedding caching for repeated texts
Add DependencyError for missing sentence-transformers
Add SemanticResult to metrics/results.py
Add SemanticValidator to validators/metric.py (extends existing file)
Add v.semantic() factory function to validators/__init__.py
Write tests (skipped if dependency missing via pytest.importorskip)
Update changelog

Key Design:

# semantic/similarity.py
class SemanticSimilarity:
    def __init__(
        self,
        model: str = "all-MiniLM-L6-v2",
        cache_embeddings: bool = True,
    ):
        try:
            from sentence_transformers import SentenceTransformer
        except ImportError as exc:
            # Chain the original error so the missing module stays visible.
            raise DependencyError(
                "Install veritext[semantic] for semantic similarity: "
                "pip install veritext[semantic]"
            ) from exc
        self._model = SentenceTransformer(model)
        # None disables caching, so the annotation must allow it.
        self._cache: dict[str, Any] | None = {} if cache_embeddings else None
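
How the cache might be used at scoring time is sketched below. To keep the example runnable without the optional dependency, the model is replaced by an injected `encode` callable; that parameter, the class name `CachedSimilarity` and the pure-Python cosine similarity are all assumptions for illustration.

```python
from __future__ import annotations

import math
from typing import Callable, Sequence

Vector = Sequence[float]


class CachedSimilarity:
    """Cosine similarity over embeddings, with an optional per-text cache."""

    def __init__(self, encode: Callable[[str], Vector], cache_embeddings: bool = True) -> None:
        self._encode = encode
        self._cache: dict[str, Vector] | None = {} if cache_embeddings else None

    def _embed(self, text: str) -> Vector:
        if self._cache is None:
            return self._encode(text)
        if text not in self._cache:
            # Encode each distinct text once; repeats are served from the cache.
            self._cache[text] = self._encode(text)
        return self._cache[text]

    def score(self, candidate: str, reference: str) -> float:
        a, b = self._embed(candidate), self._embed(reference)
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0
```

An unbounded dictionary is the simplest cache; for long benchmark runs a size limit or LRU policy may be worth considering.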

Files:

src/veritext/semantic/__init__.py
src/veritext/semantic/similarity.py
src/veritext/metrics/results.py (add SemanticResult)
src/veritext/validators/metric.py (add SemanticValidator)
src/veritext/validators/__init__.py (add semantic() factory)
tests/test_semantic/test_similarity.py

Verification:

# Without semantic dependency — tests should skip gracefully
uv run pytest tests/ -v

# With semantic dependency
uv sync --extra semantic
uv run pytest tests/test_semantic/ -v

Phase 6: Pytest Plugin

Goal: Native pytest integration for CI/CD.

Tasks:

Create plugin structure with entry points
Implement fixtures: text_validator
Implement validate_text() assertion function
Create detailed failure formatting
Add @pytest.mark.text_validation marker
Write integration tests
Update changelog

Entry point:

[project.entry-points.pytest11]
veritext = "veritext.pytest_plugin"

Key Design:

# pytest_plugin/assertions.py
def validate_text(
    text: str,
    *,
    reference: str | None = None,
    min_bleu: float | None = None,
    min_rouge: float | None = None,
    min_semantic: float | None = None,
    max_length: int | None = None,
    max_reading_grade: float | None = None,
    contains: list[str] | None = None,
    excludes: list[str] | None = None,
) -> None:
    """
    Assert text passes all specified validation criteria.

    Raises:
        AssertionError: With detailed failure information if validation fails.
        ValueError: If comparison metrics requested but reference not provided.
    """

Error handling: If min_bleu, min_rouge, or min_semantic is specified without a reference, raise ValueError immediately with a clear message rather than failing inside the metric.
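
The early check described above could be factored into a small guard that runs before any metric is constructed. `require_reference` is a hypothetical helper name and the message wording is an assumption.

```python
from __future__ import annotations


def require_reference(reference: str | None, **thresholds: float | None) -> None:
    """Raise ValueError if any comparison threshold is set without a reference."""
    requested = [name for name, value in thresholds.items() if value is not None]
    if requested and reference is None:
        raise ValueError(
            f"reference is required when {', '.join(requested)} is set"
        )
```

Inside `validate_text` this might be called as `require_reference(reference, min_bleu=min_bleu, min_rouge=min_rouge, min_semantic=min_semantic)`, so the error names exactly the arguments that triggered it.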

Files:

src/veritext/pytest_plugin/__init__.py
src/veritext/pytest_plugin/fixtures.py
src/veritext/pytest_plugin/assertions.py
src/veritext/pytest_plugin/plugin.py
tests/test_pytest_plugin/test_integration.py

Verification:

uv pip install -e .
uv run pytest --co  # Session header "plugins:" line should list veritext (-q would hide the header)
uv run pytest tests/test_pytest_plugin/ -v

Phase 7: Benchmark & Regression

Goal: Track quality over time, detect regressions.

Tasks:

Implement benchmark/models.py (BenchmarkRun, RegressionReport)
Implement benchmark/storage.py (SQLite backend)
- Handle concurrent writes gracefully (SQLite WAL mode)
- Raise StorageError on corruption with recovery guidance
Implement benchmark/runner.py (Benchmark class)
Implement benchmark/regression.py (statistical detection using rolling window)
Add assert_no_regression() for CI integration
Write comprehensive tests:
- Storage CRUD operations
- Regression detection with known degradation
- Edge cases: first run (no baseline), empty metrics
Update changelog

Key Interface:

class Benchmark:
    def __init__(self, name: str, storage_path: str | Path = "benchmarks/"): ...

    def evaluate(
        self,
        candidates: list[str],
        references: list[str],
        metrics: list[str] | None = None,  # Default: ["rouge_l", "bleu4"]
    ) -> BenchmarkRun:
        """Evaluate candidates, store results, return the run record."""
        ...

    def check_regression(
        self,
        tolerance: float = 0.05,
        window: int = 10,
    ) -> RegressionReport:
        """Compare current run against historical baseline."""
        ...

    def assert_no_regression(self, tolerance: float = 0.05) -> None:
        """Raise RegressionDetectedError if quality dropped."""
        ...
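
The "historical baseline" used by `check_regression` is not defined in detail. One option, sketched here as an assumption, is the per-metric mean over the most recent `window` runs. `rolling_baseline` is a hypothetical helper taking runs ordered oldest first.

```python
from __future__ import annotations

from statistics import fmean


def rolling_baseline(history: list[dict[str, float]], window: int = 10) -> dict[str, float]:
    """Mean of each metric over the last `window` runs (oldest first)."""
    if window < 1:
        raise ValueError("window must be at least 1")
    recent = history[-window:]
    names: set[str] = set()
    for run in recent:
        names.update(run)
    # A metric missing from some runs is averaged over the runs that have it.
    return {
        name: fmean([run[name] for run in recent if name in run])
        for name in names
    }
```

An empty history yields an empty baseline, which gives the "first run (no baseline)" edge case a natural representation.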

SQLite Schema:

CREATE TABLE benchmark_runs (
    id TEXT PRIMARY KEY,
    benchmark_name TEXT NOT NULL,
    timestamp TEXT NOT NULL,
    veritext_version TEXT NOT NULL,
    sample_count INTEGER NOT NULL,
    metadata TEXT  -- JSON
);

CREATE TABLE benchmark_metrics (
    run_id TEXT REFERENCES benchmark_runs(id),
    metric_name TEXT NOT NULL,
    value REAL NOT NULL,
    PRIMARY KEY (run_id, metric_name)
);

CREATE INDEX idx_benchmark_name ON benchmark_runs(benchmark_name, timestamp);
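
A sketch of how the storage backend might apply this schema with the standard `sqlite3` module. `open_store` and `save_run` are hypothetical names, `IF NOT EXISTS` is added so that opening an existing database is idempotent, and the default version string is a placeholder.

```python
from __future__ import annotations

import json
import sqlite3
import uuid
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE IF NOT EXISTS benchmark_runs (
    id TEXT PRIMARY KEY,
    benchmark_name TEXT NOT NULL,
    timestamp TEXT NOT NULL,
    veritext_version TEXT NOT NULL,
    sample_count INTEGER NOT NULL,
    metadata TEXT
);
CREATE TABLE IF NOT EXISTS benchmark_metrics (
    run_id TEXT REFERENCES benchmark_runs(id),
    metric_name TEXT NOT NULL,
    value REAL NOT NULL,
    PRIMARY KEY (run_id, metric_name)
);
CREATE INDEX IF NOT EXISTS idx_benchmark_name
    ON benchmark_runs(benchmark_name, timestamp);
"""


def open_store(path: str) -> sqlite3.Connection:
    """Open (or create) the benchmark database."""
    conn = sqlite3.connect(path)
    # WAL lets readers proceed while a writer is active. It has no effect
    # on in-memory databases, which keep their own journal mode.
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA foreign_keys=ON")
    conn.executescript(SCHEMA)
    return conn


def save_run(
    conn: sqlite3.Connection,
    name: str,
    metrics: dict[str, float],
    sample_count: int,
    version: str = "0.1.0",  # placeholder value
    metadata: dict | None = None,
) -> str:
    """Insert one run and its metrics in a single transaction."""
    run_id = str(uuid.uuid4())
    timestamp = datetime.now(timezone.utc).isoformat()
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO benchmark_runs VALUES (?, ?, ?, ?, ?, ?)",
            (run_id, name, timestamp, version, sample_count, json.dumps(metadata or {})),
        )
        conn.executemany(
            "INSERT INTO benchmark_metrics VALUES (?, ?, ?)",
            [(run_id, metric, value) for metric, value in metrics.items()],
        )
    return run_id
```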

Files:

src/veritext/benchmark/__init__.py
src/veritext/benchmark/models.py
src/veritext/benchmark/storage.py
src/veritext/benchmark/runner.py
src/veritext/benchmark/regression.py
tests/test_benchmark/test_storage.py
tests/test_benchmark/test_runner.py
tests/test_benchmark/test_regression.py

Verification:

uv run pytest tests/test_benchmark/ -v --cov=src/veritext/benchmark

Phase 8: CLI

Goal: Command-line interface for validation and benchmarking.

Tasks:

Implement Typer CLI app
Add validate command
Add benchmark run command
Add benchmark show command
Add benchmark check command
Add rich output formatting
Write CLI tests
Update changelog

Commands:

veritext validate "text" --reference "ref" --metrics bleu,rouge
veritext validate --file outputs.jsonl --reference-file refs.jsonl
veritext benchmark run my_benchmark --inputs data/ --references refs/
veritext benchmark show my_benchmark --last 20
veritext benchmark check my_benchmark --tolerance 0.05

Input Formats:

JSONL: One JSON object per line with candidate and reference fields:

{"candidate": "The cat sat on the mat.", "reference": "A cat is sitting on a mat."}
{"candidate": "Hello world.", "reference": "Greetings, world."}
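
Reading this JSONL format could look like the sketch below. `read_pairs` is a hypothetical helper; skipping blank lines and reporting the line number of a malformed record are assumptions about the desired behaviour.

```python
from __future__ import annotations

import json
from typing import Iterable


def read_pairs(lines: Iterable[str]) -> list[tuple[str, str]]:
    """Parse JSONL lines into (candidate, reference) pairs."""
    pairs: list[tuple[str, str]] = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines, e.g. a trailing newline
        record = json.loads(line)
        try:
            pairs.append((record["candidate"], record["reference"]))
        except KeyError as exc:
            raise ValueError(f"line {number}: missing field {exc}") from exc
    return pairs
```

Because the function accepts any iterable of strings, an open file handle can be passed directly, for example `with open("outputs.jsonl", encoding="utf-8") as fh: pairs = read_pairs(fh)`.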

Directories: Matching filenames in --inputs and --references directories:

data/sample1.txt ↔ refs/sample1.txt
data/sample2.txt ↔ refs/sample2.txt

Files:

src/veritext/cli/__init__.py
src/veritext/cli/main.py
tests/test_cli/test_commands.py

Verification:

uv run veritext --help
uv run veritext validate "hello world" --reference "hello world" --metrics bleu
uv run pytest tests/test_cli/ -v

Phase 9: Documentation & Polish

Goal: Make portfolio-ready.

Tasks:

Write comprehensive readme.md with examples
Add docstrings to all public APIs
Create example scripts
Ensure ≥80% test coverage
Final linting/type checking
Update changelog.md with 0.1.0 release
Update project docs in docs/

Files:

readme.md (comprehensive)
examples/basic_validation.py
examples/chatbot_testing.py
examples/benchmark_regression.py
Update all docstrings
docs/project-plan.md (update)
docs/implementation-plan.md (update)

Verification:

uv run ruff check .
uv run ruff format --check .
uv run mypy src/
uv run pytest --cov=src/veritext --cov-report=term-missing
# Verify ≥80% coverage

Dependencies

[project]
name = "veritext"
version = "0.1.0"
description = "Semantic text validation framework"
readme = "readme.md"
requires-python = ">=3.11"
dependencies = [
    "pydantic>=2.0",
    "pydantic-settings>=2.0",
    "structlog>=23.0",
    "typer>=0.9",
    "rich>=13.0",
]

[project.optional-dependencies]
semantic = ["sentence-transformers>=2.2"]
dev = [
    "pytest>=7.0",
    "pytest-cov>=4.0",
    "mypy>=1.0",
    "ruff>=0.1",
]
all = ["veritext[semantic]"]

[project.scripts]
veritext = "veritext.cli.main:app"

[project.entry-points.pytest11]
veritext = "veritext.pytest_plugin"

Conventions

Code Quality

ruff check . — zero issues
ruff format --check . — zero changes
mypy src/ — passes (strict mode)
pytest --cov=src/veritext — ≥80% coverage

Git

Author: Kai Chappell git@kschappell.com
Signed commits: GPG key 219AA60F0638489B
Format: type(scope): description
Atomic: ≤3 files, ≤150 LOC per commit
No AI/LLM attribution

Python

Python 3.11+ with modern type hints
Absolute imports from package root
structlog for logging
UK English (colour, behaviour, summarisation)

Verification Checklist (Per Phase)

cd portfolio/veritext  # relative to workspace root

# Code quality
uv run ruff check .
uv run ruff format --check .
uv run mypy src/

# Tests
uv run pytest --cov=src/veritext --cov-report=term-missing

# Package installation
uv pip install -e .
uv run python -c "import veritext; print(veritext.__version__)"
