# Implementation Plan: Veritext

Semantic text validation framework for Python that validates text outputs against quality criteria.

## Project Overview

**Location:** `portfolio/veritext/` (relative to workspace root)
**Remote:** `https://gitea.kschappell.com/kschappell/veritext.git`

**Purpose:** A Python library for validating text outputs against semantic criteria. Designed for developers building systems that produce text (chatbots, content generators, summarisation tools) who need automated quality assurance beyond simple string matching.

---

## Architectural Decisions

### 1. Layered Architecture

```
┌─────────────────────────────────────────────────────┐
│  CLI / pytest_plugin   (presentation layer)         │
├─────────────────────────────────────────────────────┤
│  validators/           (decision logic)             │
│  benchmark/            (tracking & regression)      │
├─────────────────────────────────────────────────────┤
│  metrics/              (pure computation)           │
├─────────────────────────────────────────────────────┤
│  core/                 (shared types, tokenisation) │
└─────────────────────────────────────────────────────┘
```

**Dependency rule:** Each layer depends only on layers below it.

### 2. Metrics vs Validators (Clear Separation)

| Concept | Responsibility | Output |
|---------|----------------|--------|
| **Metric** | Compute a score | Typed result object (e.g., `BleuResult`) |
| **Validator** | Make pass/fail decision | `ValidationResult` with diagnostics |

Validators wrap metrics and apply thresholds.
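
The split can be sketched as follows. This is a minimal illustration rather than Veritext's actual code: `JaccardResult`, `jaccard_score` and `JaccardCheck` are hypothetical names, the check returns a plain tuple instead of a `CheckResult`, and whitespace splitting stands in for the shared tokeniser.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JaccardResult:
    """Hypothetical typed result for a lexical metric."""

    score: float


def jaccard_score(candidate: str, reference: str) -> JaccardResult:
    """Metric: pure computation, no pass/fail decision."""
    a = set(candidate.lower().split())
    b = set(reference.lower().split())
    union = a | b
    return JaccardResult(score=len(a & b) / len(union) if union else 0.0)


@dataclass(frozen=True)
class JaccardCheck:
    """Validator: wraps the metric and applies a threshold."""

    min_score: float

    def check(self, text: str, reference: str) -> tuple[bool, str]:
        result = jaccard_score(text, reference)
        passed = result.score >= self.min_score
        return passed, f"jaccard={result.score:.2f} (min {self.min_score:.2f})"


passed, message = JaccardCheck(min_score=0.5).check("the cat sat", "the cat slept")
```

Keeping the threshold out of the metric means the same score can feed several validators with different pass criteria.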

### 3. Optional Heavy Dependencies

`sentence-transformers` (~2 GB with PyTorch) is optional:

```toml
[project.optional-dependencies]
semantic = ["sentence-transformers>=2.2"]
```

The core library works without ML dependencies.

### 4. Typed Result Objects

Each metric returns a specific result type, not just `float`:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BleuResult:
    bleu1: float
    bleu2: float
    bleu3: float
    bleu4: float
    brevity_penalty: float


@dataclass(frozen=True)
class RougeScore:
    precision: float
    recall: float
    fmeasure: float


@dataclass(frozen=True)
class RougeResult:
    rouge1: RougeScore
    rouge2: RougeScore
    rouge_l: RougeScore
```

### 5. Shared Tokenisation

A single tokeniser is shared by all n-gram metrics:

```python
from typing import Protocol


class Tokeniser(Protocol):
    def tokenise(self, text: str) -> list[str]: ...


class WordTokeniser:
    def __init__(self, lowercase: bool = True, remove_punctuation: bool = True): ...
```
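
One way the `WordTokeniser` body could look is sketched below. The regex-based definition of punctuation is an assumption made for illustration; the plan only fixes the constructor signature, the `tokenise` method and NFC normalisation. Note that this particular regex would also drop emoji, so a real implementation would need a different rule to meet the Unicode requirements listed under edge cases.

```python
import re
import unicodedata


class WordTokeniser:
    """Sketch of a whitespace tokeniser with NFC normalisation."""

    def __init__(self, lowercase: bool = True, remove_punctuation: bool = True):
        self.lowercase = lowercase
        self.remove_punctuation = remove_punctuation

    def tokenise(self, text: str) -> list[str]:
        # Compose characters first so "e" + combining accent becomes one code point.
        text = unicodedata.normalize("NFC", text)
        if self.lowercase:
            text = text.lower()
        if self.remove_punctuation:
            # Assumption: punctuation is anything that is neither a word
            # character nor whitespace; it is replaced by a space.
            text = re.sub(r"[^\w\s]", " ", text)
        # split() with no arguments also discards whitespace-only input.
        return text.split()


tokens = WordTokeniser().tokenise("Hello, World!  Caf\u00e9")
```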

### 6. Explicit Context Object

Validation context is passed as an explicit object rather than `**kwargs`:

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ValidationContext:
    reference: str | list[str] | None = None
    metadata: dict[str, Any] = field(default_factory=dict)
```

---

## Directory Structure

```
veritext/
├── src/
│   └── veritext/
│       ├── __init__.py            # Public API exports
│       ├── py.typed               # PEP 561 marker
│       ├── core/
│       │   ├── __init__.py
│       │   ├── types.py           # ValidationContext, CheckResult, BatchResult
│       │   ├── exceptions.py      # Exception hierarchy
│       │   ├── tokenisation.py    # Shared tokeniser
│       │   ├── config.py          # pydantic-settings
│       │   └── logging.py         # structlog configuration
│       ├── metrics/
│       │   ├── __init__.py        # Metric exports
│       │   ├── base.py            # Metric protocol
│       │   ├── results.py         # BleuResult, RougeResult, etc.
│       │   ├── bleu.py            # BLEU implementation
│       │   ├── rouge.py           # ROUGE implementation
│       │   ├── lexical.py         # Jaccard, token overlap
│       │   └── readability.py     # Flesch-Kincaid, etc.
│       ├── semantic/              # Optional (requires sentence-transformers)
│       │   ├── __init__.py
│       │   └── similarity.py      # Embedding-based similarity
│       ├── validators/
│       │   ├── __init__.py        # Validator exports
│       │   ├── base.py            # Check protocol, ValidationResult
│       │   ├── metric.py          # Validators wrapping metrics
│       │   ├── constraint.py      # Length, content checks
│       │   └── composite.py       # Validator composition
│       ├── benchmark/
│       │   ├── __init__.py
│       │   ├── models.py          # BenchmarkRun, RegressionReport
│       │   ├── storage.py         # SQLite backend
│       │   ├── runner.py          # Benchmark execution
│       │   └── regression.py      # Statistical detection
│       ├── pytest_plugin/
│       │   ├── __init__.py        # Plugin entry point
│       │   ├── fixtures.py        # Pytest fixtures
│       │   ├── assertions.py      # validate_text(), assert_similar()
│       │   └── plugin.py          # Pytest hooks
│       └── cli/
│           ├── __init__.py
│           └── main.py            # Typer CLI app
├── tests/
│   ├── conftest.py
│   ├── test_core/
│   │   ├── test_tokenisation.py
│   │   └── test_types.py
│   ├── test_metrics/
│   │   ├── test_bleu.py
│   │   ├── test_rouge.py
│   │   ├── test_lexical.py
│   │   └── test_readability.py
│   ├── test_semantic/
│   │   └── test_similarity.py
│   ├── test_validators/
│   │   ├── test_metric_validators.py
│   │   ├── test_constraint_validators.py
│   │   └── test_composite.py
│   ├── test_benchmark/
│   │   ├── test_storage.py
│   │   ├── test_runner.py
│   │   └── test_regression.py
│   ├── test_pytest_plugin/
│   │   └── test_integration.py
│   └── test_cli/
│       └── test_commands.py
├── examples/
│   ├── basic_validation.py
│   ├── chatbot_testing.py
│   └── benchmark_regression.py
├── docs/
│   ├── project-plan.md
│   └── implementation-plan.md
├── pyproject.toml
├── readme.md
├── changelog.md
└── CLAUDE.md
```

---

## Exception Hierarchy

```python
class VeritextError(Exception):
    """Base exception for all Veritext errors."""


class MetricError(VeritextError):
    """Error during metric computation."""


class TokenisationError(MetricError):
    """Error during text tokenisation."""


class EmbeddingError(MetricError):
    """Error computing embeddings (semantic similarity)."""


class ValidationError(VeritextError):
    """Error during validation."""


class InvalidThresholdError(ValidationError):
    """Invalid threshold value provided."""


class BenchmarkError(VeritextError):
    """Error during benchmarking."""


class StorageError(BenchmarkError):
    """Error reading/writing benchmark storage."""


class RegressionDetectedError(BenchmarkError):
    """Quality regression detected (used in CI)."""


class ConfigurationError(VeritextError):
    """Invalid configuration."""


class DependencyError(VeritextError):
    """Optional dependency not installed."""
```

---

## Core Interfaces

### Metric Protocol

```python
# Postponed evaluation lets Metric refer to BatchResult before it is defined.
from __future__ import annotations

from dataclasses import dataclass
from typing import Generic, Protocol, TypeVar

T = TypeVar("T")


class Metric(Protocol[T]):
    """Protocol for text comparison metrics."""

    @property
    def name(self) -> str: ...

    @property
    def requires_reference(self) -> bool:
        """Whether this metric requires a reference text."""
        ...

    def score(self, candidate: str, reference: str | list[str] | None = None) -> T:
        """
        Compute metric score.

        Args:
            candidate: The text to evaluate.
            reference: Reference text(s) for comparison. Required for comparison
                metrics (BLEU, ROUGE, semantic). Ignored for standalone
                metrics (readability).

        Raises:
            ValueError: If reference is required but not provided.
        """
        ...

    def batch_score(
        self,
        candidates: list[str],
        references: list[str] | list[list[str]] | None = None,
    ) -> BatchResult[T]: ...


@dataclass
class AggregateStats:
    mean: float
    std: float
    min: float
    max: float
    percentiles: dict[int, float]  # {25: 0.65, 50: 0.72, 75: 0.81, 95: 0.89}


@dataclass
class BatchResult(Generic[T]):
    results: list[T]  # Individual results per sample
    count: int
    stats: dict[str, AggregateStats]  # Aggregated stats for numeric fields
```

**Note:** Standalone metrics like readability return `False` for `requires_reference` and ignore the `reference` parameter. Comparison metrics (BLEU, ROUGE, semantic) return `True` and raise `ValueError` if `reference` is `None`.
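
As a rough sketch of how one numeric field could be summarised across a batch, the helper below fills the same fields as `AggregateStats` using only the standard library. The `aggregate` function is hypothetical, and two choices are assumptions the plan does not settle: population standard deviation (`pstdev`) and the "inclusive" percentile method.

```python
import statistics
from dataclasses import dataclass


@dataclass
class AggregateStats:
    mean: float
    std: float
    min: float
    max: float
    percentiles: dict[int, float]


def aggregate(values: list[float]) -> AggregateStats:
    """Summarise one numeric field across a batch (hypothetical helper)."""
    if not values:
        raise ValueError("Cannot aggregate an empty batch")
    wanted = (25, 50, 75, 95)
    if len(values) > 1:
        # quantiles(n=100) returns the 99 cut points for percentiles 1..99.
        cuts = statistics.quantiles(values, n=100, method="inclusive")
        percentiles = {p: cuts[p - 1] for p in wanted}
    else:
        # A single sample is its own percentile at every level.
        percentiles = {p: values[0] for p in wanted}
    return AggregateStats(
        mean=statistics.fmean(values),
        std=statistics.pstdev(values),
        min=min(values),
        max=max(values),
        percentiles=percentiles,
    )


stats = aggregate([0.2, 0.4, 0.6, 0.8])
```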

### Validator Protocol

```python
# Postponed evaluation lets Check refer to CheckResult before it is defined.
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Protocol

from veritext.core.types import ValidationContext


class Check(Protocol):
    """Protocol for individual validation checks."""

    @property
    def name(self) -> str: ...

    def check(self, text: str, context: ValidationContext) -> CheckResult: ...


@dataclass
class CheckResult:
    name: str
    passed: bool
    actual: Any
    threshold: Any | None
    message: str


@dataclass
class ValidationResult:
    passed: bool
    checks: list[CheckResult]

    @property
    def failure_summary(self) -> str: ...

    @property
    def failed_checks(self) -> list[CheckResult]: ...
```

### Benchmark Models

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any


@dataclass
class BenchmarkRun:
    id: str  # UUID
    benchmark_name: str
    timestamp: datetime
    veritext_version: str  # Track library version
    metrics: dict[str, float]  # {"rouge_l": 0.82, "bleu4": 0.71}
    sample_count: int
    metadata: dict[str, Any]  # {"git_sha": "abc123", "model": "v2"}


@dataclass
class RegressionReport:
    detected: bool
    baseline: dict[str, float]
    current: dict[str, float]
    deltas: dict[str, float]  # {"rouge_l": -0.05}
    tolerance: float

    @property
    def summary(self) -> str: ...
```

---

## Edge Case Handling

All components must handle edge cases consistently:

### Empty Text

| Input | Behaviour |
|-------|-----------|
| Empty candidate (`""`) | Metrics return zero scores; validators fail unless explicitly configured otherwise |
| Empty reference (`""`) | Comparison metrics raise `ValueError` |
| Whitespace-only text | Treated as empty after tokenisation |

### None Reference

| Component | Behaviour |
|-----------|-----------|
| Comparison metrics (BLEU, ROUGE, semantic) | Raise `ValueError("Reference required for {metric_name}")` |
| Standalone metrics (readability) | Ignore, compute normally |
| Validators wrapping comparison metrics | Raise `ValidationError` if `context.reference` is `None` |

### Unicode & Encoding

- All text is assumed to be valid UTF-8 strings
- Normalisation: NFC by default (configurable in `Tokeniser`)
- Emoji and non-Latin scripts: Supported, tokenised as words where applicable

### Very Long Text

- No hard limits enforced by default
- `Tokeniser` can accept `max_tokens: int | None` for truncation
- Semantic similarity: Truncates to the model's max sequence length (typically 512 tokens) and logs a warning

### Multiple References

BLEU and ROUGE support multiple references (`list[str]`):
- BLEU: Counts n-gram matches against every reference and clips each n-gram at its maximum count in any single reference
- ROUGE: Computes against each reference, returns best score
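
For BLEU, taking the maximum n-gram matches across references is the clipping step of modified n-gram precision. The sketch below shows that step alone, on pre-tokenised input, and leaves out the brevity penalty and the geometric mean over n-gram orders. The function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter[tuple[str, ...]]:
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate: list[str], references: list[list[str]], n: int) -> float:
    """Modified n-gram precision with counts clipped per n-gram."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref: Counter[tuple[str, ...]] = Counter()
    for ref in references:
        # Counter union (|) keeps the maximum count seen for each n-gram.
        max_ref |= ngrams(ref, n)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


p1 = clipped_precision(
    "the the the cat".split(),
    ["the cat sat".split(), "the the dog".split()],
    n=1,
)
```

Here "the" appears three times in the candidate but at most twice in any one reference, so only two of those occurrences count, giving 3 matches out of 4 unigrams.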

---

## Validator Naming Convention

Consistent short names:

```python
from veritext import validators as v

# Metric-based validators
v.bleu(min_score=0.7)                  # BLEU-4 by default
v.bleu(min_score=0.7, variant=1)       # BLEU-1
v.rouge(min_score=0.7)                 # ROUGE-L by default
v.rouge(min_score=0.7, variant="1")    # ROUGE-1
v.semantic(min_score=0.8)              # Semantic similarity

# Constraint validators
v.length(max_chars=500)
v.length(min_chars=100, max_chars=500)
v.readability(max_grade=8)
v.contains(terms=["hello", "world"])
v.excludes(terms=["error", "fail"])
v.pattern(regex=r"^\d{4}-\d{2}-\d{2}$")

# Composition
v.all_of([...])                        # All must pass
v.any_of([...])                        # At least one must pass
v.weighted(                            # Weighted score threshold
    checks=[
        (v.bleu(min_score=0.7), 0.6),  # (check, weight) tuples
        (v.readability(max_grade=8), 0.4),
    ],
    min_score=0.75,                    # Minimum weighted score to pass
)
```
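
The plan does not spell out how the weighted score is computed. One plausible reading, sketched here purely as an assumption, is that each passing check contributes its weight and the validator passes when the earned share of total weight reaches `min_score`. Plain callables stand in for the `Check` protocol, and `weighted_pass` is a hypothetical name.

```python
from typing import Callable

# A check here is any callable returning True (pass) or False (fail);
# it stands in for the Check protocol for illustration only.
CheckFn = Callable[[str], bool]


def weighted_pass(text: str, checks: list[tuple[CheckFn, float]], min_score: float) -> bool:
    """Pass when the weight share of passing checks reaches min_score."""
    total = sum(weight for _, weight in checks)
    if total <= 0:
        raise ValueError("Weights must sum to a positive value")
    earned = sum(weight for check, weight in checks if check(text))
    return earned / total >= min_score


checks: list[tuple[CheckFn, float]] = [
    (lambda t: len(t) <= 20, 0.6),
    (lambda t: "hello" in t, 0.4),
]
both = weighted_pass("hello world", checks, min_score=0.75)
one = weighted_pass("goodbye world", checks, min_score=0.75)
```

Under this reading, a threshold of 0.75 with weights 0.6 and 0.4 requires both checks to pass. An alternative design would weight the underlying metric scores rather than the pass/fail outcomes.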

---

## Implementation Phases

### Phase 1: Project Scaffold & Core

**Goal:** Set up project structure with shared types and tokenisation.

**Tasks:**
1. Create directory structure
2. Write `pyproject.toml` with optional dependencies
3. Create `CLAUDE.md` with project guidelines
4. Implement `core/exceptions.py` (full hierarchy)
5. Implement `core/types.py` (`ValidationContext`, `CheckResult`, `ValidationResult`)
6. Implement `core/tokenisation.py` (`WordTokeniser` with NFC normalisation)
7. Implement `core/config.py` (pydantic-settings)
8. Implement `core/logging.py` (structlog configuration)
9. Create `__init__.py` with `__version__` and `__all__` exports
10. Write tests for tokenisation (including Unicode, empty input, whitespace-only)
11. Write tests for types (including edge cases)
12. Initial commit to Gitea

**Files:**
- `pyproject.toml`
- `CLAUDE.md`
- `readme.md` (stub)
- `changelog.md`
- `src/veritext/__init__.py`
- `src/veritext/py.typed`
- `src/veritext/core/__init__.py`
- `src/veritext/core/exceptions.py`
- `src/veritext/core/types.py`
- `src/veritext/core/tokenisation.py`
- `src/veritext/core/config.py`
- `src/veritext/core/logging.py`
- `tests/conftest.py`
- `tests/test_core/test_tokenisation.py`
- `tests/test_core/test_types.py`

**Verification:**
```bash
uv sync
uv run ruff check .
uv run ruff format --check .
uv run mypy src/
uv run pytest tests/test_core/ -v
```

---

### Phase 2: Metrics (BLEU & Lexical)

**Goal:** Implement BLEU and lexical similarity metrics.

**Tasks:**
1. Implement `metrics/base.py` (Metric protocol, `BatchResult`, `AggregateStats`)
2. Implement `metrics/results.py` (`BleuResult`, `LexicalResult`)
3. Implement `metrics/bleu.py` (BLEU-1 through BLEU-4)
4. Implement `metrics/lexical.py` (Jaccard, token overlap)
5. Add batch processing with aggregate statistics (mean, std, percentiles)
6. Write comprehensive tests:
   - Single-pair scoring with reference values from NLTK
   - Batch scoring with statistical aggregation
   - Edge cases: empty text, single-word inputs, identical texts
   - Multiple references support
7. Define `__all__` exports in each module's `__init__.py`
8. Update changelog

**Key Design:**
```python
class Bleu:
    def __init__(self, tokeniser: Tokeniser | None = None, max_n: int = 4): ...

    def score(self, candidate: str, reference: str | list[str]) -> BleuResult: ...
```

**Files:**
- `src/veritext/metrics/__init__.py`
- `src/veritext/metrics/base.py`
- `src/veritext/metrics/results.py`
- `src/veritext/metrics/bleu.py`
- `src/veritext/metrics/lexical.py`
- `tests/test_metrics/test_bleu.py`
- `tests/test_metrics/test_lexical.py`

**Verification:**
```bash
uv run pytest tests/test_metrics/ -v --cov=src/veritext/metrics
# Verify BLEU matches nltk.translate.bleu_score reference
```

---

### Phase 3: Metrics (ROUGE & Readability)

**Goal:** Implement ROUGE and readability metrics.

**Tasks:**
1. Implement `metrics/rouge.py` (ROUGE-1, ROUGE-2, ROUGE-L with precision/recall/F1)
2. Implement `metrics/readability.py` (Flesch-Kincaid grade level)
   - Set `requires_reference = False` for standalone operation
3. Add `RougeResult`, `RougeScore`, `ReadabilityResult` to `results.py`
4. Write comprehensive tests:
   - Single-pair scoring with reference values from `rouge-score` library
   - Batch scoring with statistical aggregation
   - Edge cases: empty text, very short text, identical texts
   - Readability on various grade levels (children's text → academic)
5. Update changelog

**Files:**
- `src/veritext/metrics/rouge.py`
- `src/veritext/metrics/readability.py`
- `tests/test_metrics/test_rouge.py`
- `tests/test_metrics/test_readability.py`

**Verification:**
```bash
uv run pytest tests/test_metrics/ -v
# Verify ROUGE matches rouge-score library reference
```
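
ROUGE-L is based on the longest common subsequence (LCS) between candidate and reference tokens. The following is a minimal single-reference sketch on pre-tokenised input, assuming an equally weighted F-measure; `lcs_length` and `rouge_l` are hypothetical names, and the empty-input behaviour follows the edge-case tables in this plan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RougeScore:
    precision: float
    recall: float
    fmeasure: float


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, using one DP row."""
    row = [0] * (len(b) + 1)
    for token in a:
        prev_diag = 0  # value of the cell up and to the left
        for j, other in enumerate(b, start=1):
            current = row[j]
            row[j] = prev_diag + 1 if token == other else max(row[j], row[j - 1])
            prev_diag = current
    return row[-1]


def rouge_l(candidate: list[str], reference: list[str]) -> RougeScore:
    if not reference:
        raise ValueError("Reference required for rouge_l")
    if not candidate:
        return RougeScore(0.0, 0.0, 0.0)
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return RougeScore(0.0, 0.0, 0.0)
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    fmeasure = 2 * precision * recall / (precision + recall)
    return RougeScore(precision, recall, fmeasure)


score = rouge_l("the cat sat on the mat".split(), "the cat is on the mat".split())
```

The single-row table keeps memory proportional to the reference length, which matters for long texts since the plan enforces no hard length limits by default.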

---

### Phase 4: Validators

**Goal:** Build composable validation system.

**Tasks:**
1. Implement `validators/base.py` (`Check` protocol, `ValidationResult`)
2. Implement `validators/metric.py` (`BleuValidator`, `RougeValidator`)
   - Raise `ValidationError` if `context.reference` is `None`
3. Implement `validators/constraint.py` (`LengthValidator`, `ContainsValidator`, etc.)
4. Implement `validators/composite.py` (`AllOf`, `AnyOf`, `Weighted`)
5. Create validator factory functions (`v.bleu()`, `v.length()`, etc.)
6. Define `__all__` exports in `validators/__init__.py`
7. Write comprehensive tests:
   - Individual validators with passing/failing cases
   - Composition (all_of, any_of, weighted)
   - Edge cases: missing reference, empty text, boundary thresholds
8. Update changelog

**Key Design:**
```python
# validators/metric.py
class BleuValidator:
    def __init__(
        self,
        min_score: float,
        variant: int = 4,
        tokeniser: Tokeniser | None = None,
    ): ...

    def check(self, text: str, context: ValidationContext) -> CheckResult: ...


# validators/__init__.py (factory functions)
def bleu(min_score: float, variant: int = 4) -> BleuValidator: ...
def rouge(min_score: float, variant: str = "l") -> RougeValidator: ...
def length(min_chars: int | None = None, max_chars: int | None = None) -> LengthValidator: ...
```

**Files:**
- `src/veritext/validators/__init__.py`
- `src/veritext/validators/base.py`
- `src/veritext/validators/metric.py`
- `src/veritext/validators/constraint.py`
- `src/veritext/validators/composite.py`
- `tests/test_validators/test_metric_validators.py`
- `tests/test_validators/test_constraint_validators.py`
- `tests/test_validators/test_composite.py`

**Verification:**
```bash
uv run pytest tests/test_validators/ -v --cov=src/veritext/validators
```

---

### Phase 5: Semantic Similarity (Optional Dependency)

**Goal:** Add embedding-based semantic similarity as an optional feature.

**Tasks:**
1. Implement `semantic/similarity.py` with lazy import
2. Add embedding caching for repeated texts
3. Add `DependencyError` for missing sentence-transformers
4. Add `SemanticResult` to `metrics/results.py`
5. Add `SemanticValidator` to `validators/metric.py` (extends existing file)
6. Add `v.semantic()` factory function to `validators/__init__.py`
7. Write tests (skipped if dependency missing via `pytest.importorskip`)
8. Update changelog

**Key Design:**
```python
# semantic/similarity.py
class SemanticSimilarity:
    def __init__(
        self,
        model: str = "all-MiniLM-L6-v2",
        cache_embeddings: bool = True,
    ):
        try:
            # Imported lazily so the core library works without the extra.
            from sentence_transformers import SentenceTransformer
        except ImportError as exc:
            raise DependencyError(
                "Install veritext[semantic] for semantic similarity: "
                "pip install veritext[semantic]"
            ) from exc
        self._model = SentenceTransformer(model)
        # None disables caching; otherwise maps text to its embedding.
        self._cache: dict[str, Any] | None = {} if cache_embeddings else None
```

**Files:**
- `src/veritext/semantic/__init__.py`
- `src/veritext/semantic/similarity.py`
- `src/veritext/metrics/results.py` (add `SemanticResult`)
- `src/veritext/validators/metric.py` (add `SemanticValidator`)
- `src/veritext/validators/__init__.py` (add `semantic()` factory)
- `tests/test_semantic/test_similarity.py`

**Verification:**
```bash
# Without semantic dependency: tests should skip gracefully
uv run pytest tests/ -v

# With semantic dependency
uv sync --extra semantic
uv run pytest tests/test_semantic/ -v
```

---

### Phase 6: Pytest Plugin

**Goal:** Native pytest integration for CI/CD.

**Tasks:**
1. Create plugin structure with entry points
2. Implement fixtures: `text_validator`
3. Implement `validate_text()` assertion function
4. Create detailed failure formatting
5. Add `@pytest.mark.text_validation` marker
6. Write integration tests
7. Update changelog

**Entry point:**
```toml
[project.entry-points.pytest11]
veritext = "veritext.pytest_plugin"
```

**Key Design:**
```python
# pytest_plugin/assertions.py
def validate_text(
    text: str,
    *,
    reference: str | None = None,
    min_bleu: float | None = None,
    min_rouge: float | None = None,
    min_semantic: float | None = None,
    max_length: int | None = None,
    max_reading_grade: float | None = None,
    contains: list[str] | None = None,
    excludes: list[str] | None = None,
) -> None:
    """
    Assert text passes all specified validation criteria.

    Raises:
        AssertionError: With detailed failure information if validation fails.
        ValueError: If comparison metrics requested but reference not provided.
    """
```

**Error handling:** If `min_bleu`, `min_rouge`, or `min_semantic` is specified without a `reference`, raise `ValueError` immediately with a clear message rather than failing inside the metric.

**Files:**
- `src/veritext/pytest_plugin/__init__.py`
- `src/veritext/pytest_plugin/fixtures.py`
- `src/veritext/pytest_plugin/assertions.py`
- `src/veritext/pytest_plugin/plugin.py`
- `tests/test_pytest_plugin/test_integration.py`

**Verification:**
```bash
uv pip install -e .
uv run pytest --co  # The "plugins:" header line should list veritext (-q would hide the header)
uv run pytest tests/test_pytest_plugin/ -v
```

---

### Phase 7: Benchmark & Regression

**Goal:** Track quality over time, detect regressions.

**Tasks:**
1. Implement `benchmark/models.py` (`BenchmarkRun`, `RegressionReport`)
2. Implement `benchmark/storage.py` (SQLite backend)
   - Handle concurrent writes gracefully (SQLite WAL mode)
   - Raise `StorageError` on corruption with recovery guidance
3. Implement `benchmark/runner.py` (`Benchmark` class)
4. Implement `benchmark/regression.py` (statistical detection using rolling window)
5. Add `assert_no_regression()` for CI integration
6. Write comprehensive tests:
   - Storage CRUD operations
   - Regression detection with known degradation
   - Edge cases: first run (no baseline), empty metrics
7. Update changelog

**Key Interface:**
```python
from pathlib import Path


class Benchmark:
    def __init__(self, name: str, storage_path: str | Path = "benchmarks/"): ...

    def evaluate(
        self,
        candidates: list[str],
        references: list[str],
        metrics: list[str] | None = None,  # Default: ["rouge_l", "bleu4"]
    ) -> BenchmarkRun:
        """Evaluate candidates, store results, return the run record."""
        ...

    def check_regression(
        self,
        tolerance: float = 0.05,
        window: int = 10,
    ) -> RegressionReport:
        """Compare current run against historical baseline."""
        ...

    def assert_no_regression(self, tolerance: float = 0.05) -> None:
        """Raise RegressionDetectedError if quality dropped."""
        ...
```
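
The rolling-window comparison behind `check_regression()` might work roughly as below. This sketch makes several assumptions the plan leaves open: the baseline is the mean of the last `window` runs, `tolerance` is an absolute difference, and higher values are better for every metric. The function name `detect_regression` is hypothetical.

```python
from statistics import fmean


def detect_regression(
    history: list[dict[str, float]],
    current: dict[str, float],
    tolerance: float = 0.05,
    window: int = 10,
) -> tuple[bool, dict[str, float]]:
    """Compare `current` with the mean of the last `window` runs.

    Returns (detected, deltas). A metric regresses when it falls more
    than `tolerance` below its baseline.
    """
    recent = history[-window:]
    deltas: dict[str, float] = {}
    for name, value in current.items():
        past = [run[name] for run in recent if name in run]
        if past:  # first run for this metric: no baseline, so no delta
            deltas[name] = value - fmean(past)
    detected = any(delta < -tolerance for delta in deltas.values())
    return detected, deltas


history = [{"rouge_l": 0.80}, {"rouge_l": 0.82}, {"rouge_l": 0.84}]
detected, deltas = detect_regression(history, {"rouge_l": 0.70})
```

With an empty history there is nothing to compare against, so the first run never reports a regression, which matches the "first run (no baseline)" edge case in the test list.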

**SQLite Schema:**
```sql
CREATE TABLE benchmark_runs (
    id TEXT PRIMARY KEY,
    benchmark_name TEXT NOT NULL,
    timestamp TEXT NOT NULL,
    veritext_version TEXT NOT NULL,
    sample_count INTEGER NOT NULL,
    metadata TEXT  -- JSON
);

CREATE TABLE benchmark_metrics (
    run_id TEXT REFERENCES benchmark_runs(id),
    metric_name TEXT NOT NULL,
    value REAL NOT NULL,
    PRIMARY KEY (run_id, metric_name)
);

CREATE INDEX idx_benchmark_name ON benchmark_runs(benchmark_name, timestamp);
```
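
A storage write against this schema could be sketched with the standard-library `sqlite3` module as follows. The `save_run` helper, the `benchmarks.db` filename and the hard-coded version string are placeholders for illustration, and the index is omitted for brevity.

```python
import json
import sqlite3
import tempfile
import uuid
from datetime import datetime, timezone
from pathlib import Path

SCHEMA = """
CREATE TABLE IF NOT EXISTS benchmark_runs (
    id TEXT PRIMARY KEY,
    benchmark_name TEXT NOT NULL,
    timestamp TEXT NOT NULL,
    veritext_version TEXT NOT NULL,
    sample_count INTEGER NOT NULL,
    metadata TEXT
);
CREATE TABLE IF NOT EXISTS benchmark_metrics (
    run_id TEXT REFERENCES benchmark_runs(id),
    metric_name TEXT NOT NULL,
    value REAL NOT NULL,
    PRIMARY KEY (run_id, metric_name)
);
"""


def save_run(db_path: Path, name: str, metrics: dict[str, float], sample_count: int) -> str:
    """Insert one run and its metrics in a single transaction."""
    run_id = str(uuid.uuid4())
    conn = sqlite3.connect(db_path)
    try:
        conn.execute("PRAGMA journal_mode=WAL")  # readers do not block the writer
        conn.executescript(SCHEMA)
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO benchmark_runs VALUES (?, ?, ?, ?, ?, ?)",
                (
                    run_id,
                    name,
                    datetime.now(timezone.utc).isoformat(),
                    "0.1.0",  # placeholder for veritext.__version__
                    sample_count,
                    json.dumps({}),
                ),
            )
            conn.executemany(
                "INSERT INTO benchmark_metrics VALUES (?, ?, ?)",
                [(run_id, metric, value) for metric, value in metrics.items()],
            )
    finally:
        conn.close()
    return run_id


db = Path(tempfile.mkdtemp()) / "benchmarks.db"
run_id = save_run(db, "demo", {"rouge_l": 0.82, "bleu4": 0.71}, sample_count=2)
```

Writing the run row and its metric rows in one transaction avoids leaving a run without metrics if the process is interrupted.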

**Files:**
- `src/veritext/benchmark/__init__.py`
- `src/veritext/benchmark/models.py`
- `src/veritext/benchmark/storage.py`
- `src/veritext/benchmark/runner.py`
- `src/veritext/benchmark/regression.py`
- `tests/test_benchmark/test_storage.py`
- `tests/test_benchmark/test_runner.py`
- `tests/test_benchmark/test_regression.py`

**Verification:**
```bash
uv run pytest tests/test_benchmark/ -v --cov=src/veritext/benchmark
```

---

### Phase 8: CLI

**Goal:** Command-line interface for validation and benchmarking.

**Tasks:**
1. Implement Typer CLI app
2. Add `validate` command
3. Add `benchmark run` command
4. Add `benchmark show` command
5. Add rich output formatting
6. Write CLI tests
7. Update changelog

**Commands:**
```bash
veritext validate "text" --reference "ref" --metrics bleu,rouge
veritext validate --file outputs.jsonl --reference-file refs.jsonl
veritext benchmark run my_benchmark --inputs data/ --references refs/
veritext benchmark show my_benchmark --last 20
veritext benchmark check my_benchmark --tolerance 0.05
```

**Input Formats:**
- **JSONL:** One JSON object per line with `candidate` and `reference` fields:
  ```json
  {"candidate": "The cat sat on the mat.", "reference": "A cat is sitting on a mat."}
  {"candidate": "Hello world.", "reference": "Greetings, world."}
  ```
- **Directories:** Matching filenames in `--inputs` and `--references` directories:
  ```
  data/sample1.txt ↔ refs/sample1.txt
  data/sample2.txt ↔ refs/sample2.txt
  ```
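
Reading the JSONL format could look like the sketch below. `load_pairs` is a hypothetical helper; it assumes both fields live in the same file, as in the sample records above, and reports the line number of any malformed record.

```python
import json
import tempfile
from pathlib import Path


def load_pairs(path: Path) -> list[tuple[str, str]]:
    """Read (candidate, reference) pairs from a JSONL file."""
    pairs: list[tuple[str, str]] = []
    with path.open(encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
                pairs.append((record["candidate"], record["reference"]))
            except (json.JSONDecodeError, KeyError, TypeError) as exc:
                raise ValueError(f"{path}:{line_no}: invalid record ({exc})") from exc
    return pairs


sample = Path(tempfile.mkdtemp()) / "outputs.jsonl"
sample.write_text(
    '{"candidate": "Hello world.", "reference": "Greetings, world."}\n',
    encoding="utf-8",
)
pairs = load_pairs(sample)
```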

**Files:**
- `src/veritext/cli/__init__.py`
- `src/veritext/cli/main.py`
- `tests/test_cli/test_commands.py`

**Verification:**
```bash
uv run veritext --help
uv run veritext validate "hello world" --reference "hello world" --metrics bleu
uv run pytest tests/test_cli/ -v
```

---

### Phase 9: Documentation & Polish

**Goal:** Make the project portfolio-ready.

**Tasks:**
1. Write comprehensive `readme.md` with examples
2. Add docstrings to all public APIs
3. Create example scripts
4. Ensure ≥80% test coverage
5. Final linting/type checking
6. Update `changelog.md` with 0.1.0 release
7. Update project docs in `docs/`

**Files:**
- `readme.md` (comprehensive)
- `examples/basic_validation.py`
- `examples/chatbot_testing.py`
- `examples/benchmark_regression.py`
- Update all docstrings
- `docs/project-plan.md` (update)
- `docs/implementation-plan.md` (update)

**Verification:**
```bash
uv run ruff check .
uv run ruff format --check .
uv run mypy src/
uv run pytest --cov=src/veritext --cov-report=term-missing
# Verify ≥80% coverage
```

---

### Phase 10: Portfolio Demos

**Goal:** Interactive demos for showcasing Veritext without installation.

**Step 1: Streamlit Demo**

Build a quick interactive web UI for general visitors.

- [ ] Create `demo/streamlit_app.py`
- [ ] Text input boxes (candidate + reference)
- [ ] Metric selector (BLEU, ROUGE, lexical, readability)
- [ ] Threshold sliders for pass/fail validation
- [ ] Results table with scores and status
- [ ] Deploy to homeserver (e.g., `veritext.kschappell.com`)

**Step 2: Jupyter Notebook Collection**

Deep-dive notebooks targeting data science and ML recruiters.

- [ ] Create `notebooks/` directory
- [ ] `01-metrics-overview.ipynb`: Introduction to each metric with visualisations
- [ ] `02-batch-evaluation.ipynb`: Evaluating model outputs at scale
- [ ] `03-regression-detection.ipynb`: Tracking quality over time
- [ ] `04-chatbot-validation.ipynb`: Real-world use case

**Step 3: JupyterLite Deployment**

Host notebooks as static files running in the browser.

- [ ] Configure JupyterLite build with veritext pre-installed
- [ ] Bundle notebooks into static site
- [ ] Deploy alongside Streamlit demo

**Files:**
- `demo/streamlit_app.py`
- `notebooks/01-metrics-overview.ipynb`
- `notebooks/02-batch-evaluation.ipynb`
- `notebooks/03-regression-detection.ipynb`
- `notebooks/04-chatbot-validation.ipynb`
- `notebooks/jupyterlite-config.json`

**Verification:**
```bash
# Streamlit
uv run streamlit run demo/streamlit_app.py

# JupyterLite (local preview)
jupyter lite build --contents notebooks/
jupyter lite serve
```

---

## Dependencies

```toml
[project]
name = "veritext"
version = "0.1.0"
description = "Semantic text validation framework"
readme = "readme.md"
requires-python = ">=3.11"
dependencies = [
    "pydantic>=2.0",
    "pydantic-settings>=2.0",
    "structlog>=23.0",
    "typer>=0.9",
    "rich>=13.0",
]

[project.optional-dependencies]
semantic = ["sentence-transformers>=2.2"]
dev = [
    "pytest>=7.0",
    "pytest-cov>=4.0",
    "mypy>=1.0",
    "ruff>=0.1",
]
all = ["veritext[semantic]"]

[project.scripts]
veritext = "veritext.cli.main:app"

[project.entry-points.pytest11]
veritext = "veritext.pytest_plugin"
```

---

## Conventions

### Code Quality
- `ruff check .`: zero issues
- `ruff format --check .`: zero changes
- `mypy src/`: passes (strict mode)
- `pytest --cov=src/veritext`: ≥80% coverage

### Git
- **Author:** Kai Chappell <git@kschappell.com>
- **Signed commits:** GPG key 219AA60F0638489B
- **Format:** `type(scope): description`
- **Atomic:** ≤3 files, ≤150 LOC per commit
- **No AI/LLM attribution**

### Python
- Python 3.11+ with modern type hints
- Absolute imports from package root
- structlog for logging
- UK English (colour, behaviour, summarisation)

---

## Verification Checklist (Per Phase)

```bash
cd /home/kai/work/dev/portfolio/veritext

# Code quality
uv run ruff check .
uv run ruff format --check .
uv run mypy src/

# Tests
uv run pytest --cov=src/veritext --cov-report=term-missing

# Package installation
uv pip install -e .
uv run python -c "import veritext; print(veritext.__version__)"
```