Implementation Plan: Veritext
Semantic text validation framework for Python — validates text outputs against quality criteria.
Project Overview
Location: portfolio/veritext/ (relative to workspace root)
Remote: https://gitea.kschappell.com/kschappell/veritext.git
Purpose: A Python library for validating text outputs against semantic criteria. Designed for developers building systems that produce text (chatbots, content generators, summarisation tools) who need automated quality assurance beyond simple string matching.
Architectural Decisions
1. Layered Architecture
┌─────────────────────────────────────────────────────┐
│ CLI / pytest_plugin (presentation layer) │
├─────────────────────────────────────────────────────┤
│ validators/ (decision logic) │
│ benchmark/ (tracking & regression) │
├─────────────────────────────────────────────────────┤
│ metrics/ (pure computation) │
├─────────────────────────────────────────────────────┤
│ core/ (shared types, tokenisation) │
└─────────────────────────────────────────────────────┘
Dependency rule: Each layer depends only on layers below it.
2. Metrics vs Validators (Clear Separation)
| Concept | Responsibility | Output |
|---|---|---|
| Metric | Compute a score | Typed result object (e.g., BleuResult) |
| Validator | Make pass/fail decision | ValidationResult with diagnostics |
Validators wrap metrics and apply thresholds.
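As a rough illustration of this split, the sketch below pairs a pure metric with a validator that applies a threshold to it. The names `JaccardResult`, `jaccard`, `CheckOutcome` and `min_jaccard_check` are hypothetical stand-ins, not the planned API, and whitespace splitting replaces the shared tokeniser.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JaccardResult:
    # Hypothetical typed result, following the typed-result idea in this plan.
    score: float


def jaccard(candidate: str, reference: str) -> JaccardResult:
    """Metric: pure computation with no pass/fail opinion."""
    a, b = set(candidate.lower().split()), set(reference.lower().split())
    union = a | b
    return JaccardResult(score=len(a & b) / len(union) if union else 0.0)


@dataclass(frozen=True)
class CheckOutcome:
    # Simplified stand-in for the plan's CheckResult.
    passed: bool
    actual: float
    threshold: float
    message: str


def min_jaccard_check(candidate: str, reference: str, min_score: float) -> CheckOutcome:
    """Validator: wraps the metric and applies a threshold."""
    result = jaccard(candidate, reference)
    passed = result.score >= min_score
    message = "ok" if passed else f"jaccard {result.score:.2f} < {min_score:.2f}"
    return CheckOutcome(passed, result.score, min_score, message)
```

The metric can be reused by the benchmark layer without dragging in any threshold logic, which is the point of keeping the two apart.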
3. Optional Heavy Dependencies
sentence-transformers (~2GB with PyTorch) is optional:
[project.optional-dependencies]
semantic = ["sentence-transformers>=2.2"]
Core library works without ML dependencies.
4. Typed Result Objects
Each metric returns a specific result type, not just float:
@dataclass(frozen=True)
class BleuResult:
    bleu1: float
    bleu2: float
    bleu3: float
    bleu4: float
    brevity_penalty: float

@dataclass(frozen=True)
class RougeScore:
    precision: float
    recall: float
    fmeasure: float

@dataclass(frozen=True)
class RougeResult:
    rouge1: RougeScore
    rouge2: RougeScore
    rouge_l: RougeScore
5. Shared Tokenisation
Single tokeniser used by all n-gram metrics:
class Tokeniser(Protocol):
    def tokenise(self, text: str) -> list[str]: ...

class WordTokeniser:
    def __init__(self, lowercase: bool = True, remove_punctuation: bool = True): ...
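One possible shape for `WordTokeniser.tokenise` is sketched below. The regular expression and the order of the steps are assumptions for illustration; the plan only fixes the constructor signature and NFC normalisation.

```python
import re
import unicodedata


class WordTokeniser:
    """Illustrative word tokeniser; the punctuation regex is an assumption."""

    def __init__(self, lowercase: bool = True, remove_punctuation: bool = True) -> None:
        self.lowercase = lowercase
        self.remove_punctuation = remove_punctuation

    def tokenise(self, text: str) -> list[str]:
        # Normalise to NFC so composed and decomposed forms compare equal.
        text = unicodedata.normalize("NFC", text)
        if self.lowercase:
            text = text.lower()
        if self.remove_punctuation:
            # Replace anything that is not a word character or whitespace.
            text = re.sub(r"[^\w\s]", " ", text)
        # Whitespace-only input yields an empty token list.
        return text.split()
```

Note that this simple pattern also strips emoji, which the edge-case rules in this plan say should be supported, so a real implementation would need a wider character class.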
6. Explicit Context Object
Validation context is explicit, not **kwargs:
@dataclass
class ValidationContext:
    reference: str | list[str] | None = None
    metadata: dict[str, Any] = field(default_factory=dict)
Directory Structure
veritext/
├── src/
│   └── veritext/
│       ├── __init__.py  # Public API exports
│       ├── py.typed  # PEP 561 marker
│       ├── core/
│       │   ├── __init__.py
│       │   ├── types.py  # ValidationContext, CheckResult, BatchResult
│       │   ├── exceptions.py  # Exception hierarchy
│       │   ├── tokenisation.py  # Shared tokeniser
│       │   ├── config.py  # pydantic-settings
│       │   └── logging.py  # structlog configuration
│       ├── metrics/
│       │   ├── __init__.py  # Metric exports
│       │   ├── base.py  # Metric protocol
│       │   ├── results.py  # BleuResult, RougeResult, etc.
│       │   ├── bleu.py  # BLEU implementation
│       │   ├── rouge.py  # ROUGE implementation
│       │   ├── lexical.py  # Jaccard, token overlap
│       │   └── readability.py  # Flesch-Kincaid, etc.
│       ├── semantic/  # Optional (requires sentence-transformers)
│       │   ├── __init__.py
│       │   └── similarity.py  # Embedding-based similarity
│       ├── validators/
│       │   ├── __init__.py  # Validator exports
│       │   ├── base.py  # Check protocol, ValidationResult
│       │   ├── metric.py  # Validators wrapping metrics
│       │   ├── constraint.py  # Length, content checks
│       │   └── composite.py  # Validator composition
│       ├── benchmark/
│       │   ├── __init__.py
│       │   ├── models.py  # BenchmarkRun, RegressionReport
│       │   ├── storage.py  # SQLite backend
│       │   ├── runner.py  # Benchmark execution
│       │   └── regression.py  # Statistical detection
│       ├── pytest_plugin/
│       │   ├── __init__.py  # Plugin entry point
│       │   ├── fixtures.py  # Pytest fixtures
│       │   ├── assertions.py  # validate_text(), assert_similar()
│       │   └── plugin.py  # Pytest hooks
│       └── cli/
│           ├── __init__.py
│           └── main.py  # Typer CLI app
├── tests/
│   ├── conftest.py
│   ├── test_core/
│   │   ├── test_tokenisation.py
│   │   └── test_types.py
│   ├── test_metrics/
│   │   ├── test_bleu.py
│   │   ├── test_rouge.py
│   │   ├── test_lexical.py
│   │   └── test_readability.py
│   ├── test_semantic/
│   │   └── test_similarity.py
│   ├── test_validators/
│   │   ├── test_metric_validators.py
│   │   ├── test_constraint_validators.py
│   │   └── test_composite.py
│   ├── test_benchmark/
│   │   ├── test_storage.py
│   │   ├── test_runner.py
│   │   └── test_regression.py
│   ├── test_pytest_plugin/
│   │   └── test_integration.py
│   └── test_cli/
│       └── test_commands.py
├── examples/
│   ├── basic_validation.py
│   ├── chatbot_testing.py
│   └── benchmark_regression.py
├── docs/
│   ├── project-plan.md
│   └── implementation-plan.md
├── pyproject.toml
├── readme.md
├── changelog.md
└── CLAUDE.md
Exception Hierarchy
class VeritextError(Exception):
    """Base exception for all Veritext errors."""

class MetricError(VeritextError):
    """Error during metric computation."""

class TokenisationError(MetricError):
    """Error during text tokenisation."""

class EmbeddingError(MetricError):
    """Error computing embeddings (semantic similarity)."""

class ValidationError(VeritextError):
    """Error during validation."""

class InvalidThresholdError(ValidationError):
    """Invalid threshold value provided."""

class BenchmarkError(VeritextError):
    """Error during benchmarking."""

class StorageError(BenchmarkError):
    """Error reading/writing benchmark storage."""

class RegressionDetectedError(BenchmarkError):
    """Quality regression detected (used in CI)."""

class ConfigurationError(VeritextError):
    """Invalid configuration."""

class DependencyError(VeritextError):
    """Optional dependency not installed."""
Core Interfaces
Metric Protocol
from __future__ import annotations  # BatchResult is referenced before it is defined

from dataclasses import dataclass
from typing import Generic, Protocol, TypeVar

T = TypeVar("T")

class Metric(Protocol[T]):
    """Protocol for text comparison metrics."""

    @property
    def name(self) -> str: ...

    @property
    def requires_reference(self) -> bool:
        """Whether this metric requires a reference text."""
        ...

    def score(self, candidate: str, reference: str | list[str] | None = None) -> T:
        """
        Compute metric score.

        Args:
            candidate: The text to evaluate.
            reference: Reference text(s) for comparison. Required for comparison
                metrics (BLEU, ROUGE, semantic). Ignored for standalone
                metrics (readability).

        Raises:
            ValueError: If reference is required but not provided.
        """
        ...

    def batch_score(
        self,
        candidates: list[str],
        references: list[str] | list[list[str]] | None = None,
    ) -> BatchResult[T]: ...

@dataclass
class AggregateStats:
    mean: float
    std: float
    min: float
    max: float
    percentiles: dict[int, float]  # {25: 0.65, 50: 0.72, 75: 0.81, 95: 0.89}

@dataclass
class BatchResult(Generic[T]):
    results: list[T]  # Individual results per sample
    count: int
    stats: dict[str, AggregateStats]  # Aggregated stats for numeric fields
Note: Standalone metrics like readability return False for requires_reference and ignore the reference parameter. Comparison metrics (BLEU, ROUGE, semantic) return True and raise ValueError if reference is None.
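The `stats` field of `BatchResult` could be filled along the following lines. This is a sketch only: the linear-interpolation percentile method, the use of population standard deviation, and the helper names `percentile` and `aggregate` are assumptions, since the plan does not specify them.

```python
import statistics
from dataclasses import dataclass


@dataclass
class AggregateStats:
    mean: float
    std: float
    min: float
    max: float
    percentiles: dict[int, float]


def percentile(sorted_values: list[float], p: int) -> float:
    """Linear-interpolation percentile over pre-sorted values (assumed method)."""
    if len(sorted_values) == 1:
        return sorted_values[0]
    rank = (len(sorted_values) - 1) * p / 100
    lower = int(rank)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = rank - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction


def aggregate(values: list[float], points: tuple[int, ...] = (25, 50, 75, 95)) -> AggregateStats:
    """Summarise one numeric field (for example bleu4) across a batch."""
    if not values:
        raise ValueError("Cannot aggregate an empty batch")
    ordered = sorted(values)
    return AggregateStats(
        mean=statistics.fmean(ordered),
        # Population standard deviation, so a batch of one has std 0.0.
        std=statistics.pstdev(ordered),
        min=ordered[0],
        max=ordered[-1],
        percentiles={p: percentile(ordered, p) for p in points},
    )
```

Calling `aggregate` once per numeric field of the result type would produce the `dict[str, AggregateStats]` shape shown above.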
Validator Protocol
class Check(Protocol):
    """Protocol for individual validation checks."""

    @property
    def name(self) -> str: ...

    def check(self, text: str, context: ValidationContext) -> CheckResult: ...

@dataclass
class CheckResult:
    name: str
    passed: bool
    actual: Any
    threshold: Any | None
    message: str

@dataclass
class ValidationResult:
    passed: bool
    checks: list[CheckResult]

    @property
    def failure_summary(self) -> str: ...

    @property
    def failed_checks(self) -> list[CheckResult]: ...
Benchmark Models
@dataclass
class BenchmarkRun:
    id: str  # UUID
    benchmark_name: str
    timestamp: datetime
    veritext_version: str  # Track library version
    metrics: dict[str, float]  # {"rouge_l": 0.82, "bleu4": 0.71}
    sample_count: int
    metadata: dict[str, Any]  # {"git_sha": "abc123", "model": "v2"}

@dataclass
class RegressionReport:
    detected: bool
    baseline: dict[str, float]
    current: dict[str, float]
    deltas: dict[str, float]  # {"rouge_l": -0.05}
    tolerance: float

    @property
    def summary(self) -> str: ...
Edge Case Handling
All components must handle edge cases consistently:
Empty Text
| Input | Behaviour |
|---|---|
| Empty candidate ("") | Metrics return zero scores; validators fail unless explicitly configured |
| Empty reference ("") | Comparison metrics raise ValueError |
| Whitespace-only text | Treated as empty after tokenisation |
None Reference
| Component | Behaviour |
|---|---|
| Comparison metrics (BLEU, ROUGE, semantic) | Raise ValueError("Reference required for {metric_name}") |
| Standalone metrics (readability) | Ignore, compute normally |
| Validators wrapping comparison metrics | Raise ValidationError if context.reference is None |
Unicode & Encoding
- All text assumed to be valid UTF-8 strings
- Normalisation: NFC by default (configurable in Tokeniser)
- Emoji and non-Latin scripts: Supported, tokenised as words where applicable
Very Long Text
- No hard limits enforced by default
- Tokeniser can accept max_tokens: int | None for truncation
- Semantic similarity: Truncates to model's max sequence length (typically 512 tokens) with warning logged
Multiple References
BLEU and ROUGE support multiple references (list[str]):
- BLEU: Counts n-gram matches against all references, clipping each n-gram at its maximum count in any single reference
- ROUGE: Computes against each reference, returns best score
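For BLEU, the multi-reference rule amounts to the standard modified n-gram precision. The sketch below shows only that step (no brevity penalty or geometric mean); `ngrams` and `clipped_precision` are illustrative names, and inputs are assumed to be already tokenised.

```python
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter[tuple[str, ...]]:
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate: list[str], references: list[list[str]], n: int) -> float:
    """Modified n-gram precision with counts clipped across all references."""
    cand_counts = ngrams(candidate, n)
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0  # Empty or too-short candidate scores zero.
    # For each n-gram, the ceiling is its highest count in any single reference.
    max_ref_counts: Counter[tuple[str, ...]] = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    matched = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total
```

With the candidate "the the the the the the the" and the references "the cat is on the mat" and "there is a cat on the mat", the unigram precision is 2/7, because "the" appears at most twice in any one reference.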
Validator Naming Convention
Consistent short names:
from veritext import validators as v
# Metric-based validators
v.bleu(min_score=0.7) # BLEU-4 by default
v.bleu(min_score=0.7, variant=1) # BLEU-1
v.rouge(min_score=0.7) # ROUGE-L by default
v.rouge(min_score=0.7, variant="1") # ROUGE-1
v.semantic(min_score=0.8) # Semantic similarity
# Constraint validators
v.length(max_chars=500)
v.length(min_chars=100, max_chars=500)
v.readability(max_grade=8)
v.contains(terms=["hello", "world"])
v.excludes(terms=["error", "fail"])
v.pattern(regex=r"^\d{4}-\d{2}-\d{2}$")
# Composition
v.all_of([...]) # All must pass
v.any_of([...]) # At least one must pass
v.weighted(  # Weighted score threshold
    checks=[
        (v.bleu(min_score=0.7), 0.6),  # (check, weight) tuples
        (v.readability(max_grade=8), 0.4),
    ],
    min_score=0.75,  # Minimum weighted score to pass
)
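The plan does not say how the weighted score is computed. One plausible reading, sketched below, is that each check contributes its full weight when it passes and nothing otherwise. Under that assumption, the example above (weights 0.6 and 0.4, `min_score=0.75`) passes only if both checks pass. Checks are modelled as plain callables here rather than `Check` objects.

```python
from typing import Callable

# A check is any callable returning True (pass) or False (fail); the planned
# library would use Check objects returning CheckResult instead.
WeightedCheck = tuple[Callable[[str], bool], float]


def weighted_score(text: str, checks: list[WeightedCheck]) -> float:
    """Fraction of total weight carried by passing checks (assumed scoring rule)."""
    total_weight = sum(weight for _, weight in checks)
    if total_weight <= 0:
        raise ValueError("Weights must sum to a positive value")
    earned = sum(weight for check, weight in checks if check(text))
    return earned / total_weight


def weighted_passes(text: str, checks: list[WeightedCheck], min_score: float) -> bool:
    """Apply the min_score threshold to the weighted score."""
    return weighted_score(text, checks) >= min_score
```

An alternative design would weight the normalised metric scores themselves rather than pass/fail outcomes; that choice should be settled before Phase 4.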
Implementation Phases
Phase 1: Project Scaffold & Core
Goal: Set up project structure with shared types and tokenisation.
Tasks:
- Create directory structure
- Write pyproject.toml with optional dependencies
- Create CLAUDE.md with project guidelines
- Implement core/exceptions.py (full hierarchy)
- Implement core/types.py (ValidationContext, CheckResult, ValidationResult)
- Implement core/tokenisation.py (WordTokeniser with NFC normalisation)
- Implement core/config.py (pydantic-settings)
- Implement core/logging.py (structlog configuration)
- Create __init__.py with __version__ and __all__ exports
- Write tests for tokenisation (including Unicode, empty input, whitespace-only)
- Write tests for types (including edge cases)
- Initial commit to Gitea
Files:
- pyproject.toml
- CLAUDE.md
- readme.md (stub)
- changelog.md
- src/veritext/__init__.py
- src/veritext/py.typed
- src/veritext/core/__init__.py
- src/veritext/core/exceptions.py
- src/veritext/core/types.py
- src/veritext/core/tokenisation.py
- src/veritext/core/config.py
- src/veritext/core/logging.py
- tests/conftest.py
- tests/test_core/test_tokenisation.py
- tests/test_core/test_types.py
Verification:
uv sync
uv run ruff check .
uv run ruff format --check .
uv run mypy src/
uv run pytest tests/test_core/ -v
Phase 2: Metrics — BLEU & Lexical
Goal: Implement BLEU and lexical similarity metrics.
Tasks:
- Implement metrics/base.py (Metric protocol, BatchResult, AggregateStats)
- Implement metrics/results.py (BleuResult, LexicalResult)
- Implement metrics/bleu.py (BLEU-1 through BLEU-4)
- Implement metrics/lexical.py (Jaccard, token overlap)
- Add batch processing with aggregate statistics (mean, std, percentiles)
- Write comprehensive tests:
  - Single-pair scoring with reference values from NLTK
  - Batch scoring with statistical aggregation
  - Edge cases: empty text, single-word inputs, identical texts
  - Multiple references support
- Define __all__ exports in each module's __init__.py
- Update changelog
Key Design:
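class Bleu:
    def __init__(self, tokeniser: Tokeniser | None = None, max_n: int = 4): ...
    # reference stays optional to match the Metric protocol; passing None raises ValueError.
    def score(self, candidate: str, reference: str | list[str] | None = None) -> BleuResult: ...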
Files:
- src/veritext/metrics/__init__.py
- src/veritext/metrics/base.py
- src/veritext/metrics/results.py
- src/veritext/metrics/bleu.py
- src/veritext/metrics/lexical.py
- tests/test_metrics/test_bleu.py
- tests/test_metrics/test_lexical.py
Verification:
uv run pytest tests/test_metrics/ -v --cov=src/veritext/metrics
# Verify BLEU matches nltk.translate.bleu_score reference
Phase 3: Metrics — ROUGE & Readability
Goal: Implement ROUGE and readability metrics.
Tasks:
- Implement metrics/rouge.py (ROUGE-1, ROUGE-2, ROUGE-L with precision/recall/F1)
- Implement metrics/readability.py (Flesch-Kincaid grade level)
  - Set requires_reference = False for standalone operation
- Add RougeResult, RougeScore, ReadabilityResult to results.py
- Write comprehensive tests:
  - Single-pair scoring with reference values from rouge-score library
  - Batch scoring with statistical aggregation
  - Edge cases: empty text, very short text, identical texts
  - Readability on various grade levels (children's text → academic)
- Update changelog
Files:
- src/veritext/metrics/rouge.py
- src/veritext/metrics/readability.py
- tests/test_metrics/test_rouge.py
- tests/test_metrics/test_readability.py
Verification:
uv run pytest tests/test_metrics/ -v
# Verify ROUGE matches rouge-score library reference
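For the readability metric in this phase, the Flesch-Kincaid grade level is 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The sketch below applies that formula with a deliberately crude syllable heuristic. The heuristic, the ASCII-only word pattern, and the function names are assumptions; a real implementation would likely need a better syllable counter.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count vowel groups, discounting a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level; empty text scores 0.0."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / len(sentences)) + 11.8 * (syllables / len(words)) - 15.59
```

Very simple text can produce a negative grade, so validators such as `v.readability(max_grade=8)` should not assume a lower bound of zero.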
Phase 4: Validators
Goal: Build composable validation system.
Tasks:
- Implement validators/base.py (Check protocol, ValidationResult)
- Implement validators/metric.py (BleuValidator, RougeValidator)
  - Raise ValidationError if context.reference is None
- Implement validators/constraint.py (LengthValidator, ContainsValidator, etc.)
- Implement validators/composite.py (AllOf, AnyOf, Weighted)
- Create validator factory functions (v.bleu(), v.length(), etc.)
- Define __all__ exports in validators/__init__.py
- Write comprehensive tests:
  - Individual validators with passing/failing cases
  - Composition (all_of, any_of, weighted)
  - Edge cases: missing reference, empty text, boundary thresholds
- Update changelog
Key Design:
# validators/metric.py
class BleuValidator:
    def __init__(
        self,
        min_score: float,
        variant: int = 4,
        tokeniser: Tokeniser | None = None,
    ): ...

    def check(self, text: str, context: ValidationContext) -> CheckResult: ...

# validators/__init__.py (factory functions)
def bleu(min_score: float, variant: int = 4) -> BleuValidator: ...
def rouge(min_score: float, variant: str = "l") -> RougeValidator: ...
def length(min_chars: int | None = None, max_chars: int | None = None) -> LengthValidator: ...
Files:
- src/veritext/validators/__init__.py
- src/veritext/validators/base.py
- src/veritext/validators/metric.py
- src/veritext/validators/constraint.py
- src/veritext/validators/composite.py
- tests/test_validators/test_metric_validators.py
- tests/test_validators/test_constraint_validators.py
- tests/test_validators/test_composite.py
Verification:
uv run pytest tests/test_validators/ -v --cov=src/veritext/validators
Phase 5: Semantic Similarity (Optional Dependency)
Goal: Add embedding-based semantic similarity as optional feature.
Tasks:
- Implement semantic/similarity.py with lazy import
- Add embedding caching for repeated texts
- Raise DependencyError when sentence-transformers is missing
- Add SemanticResult to metrics/results.py
- Add SemanticValidator to validators/metric.py (extends existing file)
- Add v.semantic() factory function to validators/__init__.py
- Write tests (skipped if dependency missing via pytest.importorskip)
- Update changelog
Key Design:
# semantic/similarity.py
class SemanticSimilarity:
    def __init__(
        self,
        model: str = "all-MiniLM-L6-v2",
        cache_embeddings: bool = True,
    ):
        # Import lazily so the core library works without sentence-transformers.
        try:
            from sentence_transformers import SentenceTransformer
        except ImportError as exc:
            raise DependencyError(
                "Install veritext[semantic] for semantic similarity: "
                "pip install veritext[semantic]"
            ) from exc
        self._model = SentenceTransformer(model)
        # None disables caching, so the annotation must allow it.
        self._cache: dict[str, Any] | None = {} if cache_embeddings else None
Files:
- src/veritext/semantic/__init__.py
- src/veritext/semantic/similarity.py
- src/veritext/metrics/results.py (add SemanticResult)
- src/veritext/validators/metric.py (add SemanticValidator)
- src/veritext/validators/__init__.py (add semantic() factory)
- tests/test_semantic/test_similarity.py
Verification:
# Without semantic dependency — tests should skip gracefully
uv run pytest tests/ -v
# With semantic dependency
uv sync --extra semantic
uv run pytest tests/test_semantic/ -v
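The embedding cache in this phase can be prototyped without the heavy dependency. The sketch below wraps any text-to-vector function in a dictionary cache and computes cosine similarity in pure Python. `CachedEmbedder`, `cosine_similarity` and `toy_encode` are hypothetical names, and `toy_encode` is a stand-in for a real sentence-transformers model.

```python
import math
from typing import Callable

Vector = list[float]


class CachedEmbedder:
    """Wraps any text -> vector function with an in-memory cache."""

    def __init__(self, encode: Callable[[str], Vector]) -> None:
        self._encode = encode
        self._cache: dict[str, Vector] = {}
        self.misses = 0  # Counts calls that reached the underlying encoder.

    def embed(self, text: str) -> Vector:
        if text not in self._cache:
            self.misses += 1
            self._cache[text] = self._encode(text)
        return self._cache[text]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def toy_encode(text: str) -> Vector:
    # Stand-in for a real model: counts of 'a', 'e' and 'o' in the text.
    return [float(text.count(ch)) for ch in "aeo"]
```

Injecting the encoder this way also lets the caching logic be tested in the default test run, leaving only the model-specific tests behind `pytest.importorskip`.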
Phase 6: Pytest Plugin
Goal: Native pytest integration for CI/CD.
Tasks:
- Create plugin structure with entry points
- Implement fixtures: text_validator
- Implement validate_text() assertion function
- Create detailed failure formatting
- Add @pytest.mark.text_validation marker
- Write integration tests
- Update changelog
Entry point:
[project.entry-points.pytest11]
veritext = "veritext.pytest_plugin"
Key Design:
# pytest_plugin/assertions.py
def validate_text(
    text: str,
    *,
    reference: str | None = None,
    min_bleu: float | None = None,
    min_rouge: float | None = None,
    min_semantic: float | None = None,
    max_length: int | None = None,
    max_reading_grade: float | None = None,
    contains: list[str] | None = None,
    excludes: list[str] | None = None,
) -> None:
    """
    Assert text passes all specified validation criteria.

    Raises:
        AssertionError: With detailed failure information if validation fails.
        ValueError: If comparison metrics requested but reference not provided.
    """
Error handling: If min_bleu, min_rouge, or min_semantic is specified without a reference, raise ValueError immediately with a clear message rather than failing inside the metric.
Files:
- src/veritext/pytest_plugin/__init__.py
- src/veritext/pytest_plugin/fixtures.py
- src/veritext/pytest_plugin/assertions.py
- src/veritext/pytest_plugin/plugin.py
- tests/test_pytest_plugin/test_integration.py
Verification:
uv pip install -e .
uv run pytest --co # The "plugins:" header line should list veritext (-q would hide the header)
uv run pytest tests/test_pytest_plugin/ -v
Phase 7: Benchmark & Regression
Goal: Track quality over time, detect regressions.
Tasks:
- Implement benchmark/models.py (BenchmarkRun, RegressionReport)
- Implement benchmark/storage.py (SQLite backend)
  - Handle concurrent writes gracefully (SQLite WAL mode)
  - Raise StorageError on corruption with recovery guidance
- Implement benchmark/runner.py (Benchmark class)
- Implement benchmark/regression.py (statistical detection using rolling window)
- Add assert_no_regression() for CI integration
- Write comprehensive tests:
  - Storage CRUD operations
  - Regression detection with known degradation
  - Edge cases: first run (no baseline), empty metrics
- Update changelog
Key Interface:
class Benchmark:
    def __init__(self, name: str, storage_path: str | Path = "benchmarks/"): ...

    def evaluate(
        self,
        candidates: list[str],
        references: list[str],
        metrics: list[str] | None = None,  # Default: ["rouge_l", "bleu4"]
    ) -> BenchmarkRun:
        """Evaluate candidates, store results, return the run record."""
        ...

    def check_regression(
        self,
        tolerance: float = 0.05,
        window: int = 10,
    ) -> RegressionReport:
        """Compare current run against historical baseline."""
        ...

    def assert_no_regression(self, tolerance: float = 0.05) -> None:
        """Raise RegressionDetectedError if quality dropped."""
        ...
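The plan leaves the statistics behind `check_regression` open. A simple interpretation, sketched below, takes the baseline as the mean of each metric over the last `window` runs and flags a regression when the current value falls more than `tolerance` below it. The function name `detect_regression`, the mean baseline, and the assumption that higher scores are better are all illustrative choices.

```python
from statistics import fmean


def detect_regression(
    history: list[dict[str, float]],
    current: dict[str, float],
    tolerance: float = 0.05,
    window: int = 10,
) -> tuple[bool, dict[str, float]]:
    """Return (detected, deltas) for current vs the mean of the last `window` runs."""
    if window < 1:
        raise ValueError("window must be at least 1")
    recent = history[-window:]
    deltas: dict[str, float] = {}
    for name, value in current.items():
        past = [run[name] for run in recent if name in run]
        if past:  # On a first run there is no baseline, so nothing to compare.
            deltas[name] = value - fmean(past)
    detected = any(delta < -tolerance for delta in deltas.values())
    return detected, deltas
```

A first run returns `(False, {})`, which matches the "first run (no baseline)" edge case listed in the tasks for this phase.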
SQLite Schema:
CREATE TABLE benchmark_runs (
    id TEXT PRIMARY KEY,
    benchmark_name TEXT NOT NULL,
    timestamp TEXT NOT NULL,
    veritext_version TEXT NOT NULL,
    sample_count INTEGER NOT NULL,
    metadata TEXT -- JSON
);

CREATE TABLE benchmark_metrics (
    run_id TEXT REFERENCES benchmark_runs(id),
    metric_name TEXT NOT NULL,
    value REAL NOT NULL,
    PRIMARY KEY (run_id, metric_name)
);

CREATE INDEX idx_benchmark_name ON benchmark_runs(benchmark_name, timestamp);
Files:
- src/veritext/benchmark/__init__.py
- src/veritext/benchmark/models.py
- src/veritext/benchmark/storage.py
- src/veritext/benchmark/runner.py
- src/veritext/benchmark/regression.py
- tests/test_benchmark/test_storage.py
- tests/test_benchmark/test_runner.py
- tests/test_benchmark/test_regression.py
Verification:
uv run pytest tests/test_benchmark/ -v --cov=src/veritext/benchmark
Phase 8: CLI
Goal: Command-line interface for validation and benchmarking.
Tasks:
- Implement Typer CLI app
- Add validate command
- Add benchmark run command
- Add benchmark show command
- Add benchmark check command (listed under Commands below)
- Add rich output formatting
- Write CLI tests
- Update changelog
Commands:
veritext validate "text" --reference "ref" --metrics bleu,rouge
veritext validate --file outputs.jsonl --reference-file refs.jsonl
veritext benchmark run my_benchmark --inputs data/ --references refs/
veritext benchmark show my_benchmark --last 20
veritext benchmark check my_benchmark --tolerance 0.05
Input Formats:
- JSONL: One JSON object per line with candidate and reference fields:
  {"candidate": "The cat sat on the mat.", "reference": "A cat is sitting on a mat."}
  {"candidate": "Hello world.", "reference": "Greetings, world."}
- Directories: Matching filenames in --inputs and --references directories:
  data/sample1.txt ↔ refs/sample1.txt
  data/sample2.txt ↔ refs/sample2.txt
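Reading the JSONL format might be handled by a small parser like the one below. `parse_jsonl_pairs` is a hypothetical helper; skipping blank lines and reporting the failing line number are assumed behaviours, not requirements stated in the plan.

```python
import json
from typing import Iterable


def parse_jsonl_pairs(lines: Iterable[str]) -> list[tuple[str, str]]:
    """Parse candidate/reference pairs, reporting the line number on bad input."""
    pairs: list[tuple[str, str]] = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # Skip blank lines (an assumed leniency).
        try:
            record = json.loads(line)
            pairs.append((record["candidate"], record["reference"]))
        except (json.JSONDecodeError, KeyError, TypeError) as exc:
            raise ValueError(f"Line {number}: expected candidate and reference fields") from exc
    return pairs
```

Because it accepts any iterable of strings, the same function works on an open file handle and on in-memory test data.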
Files:
- src/veritext/cli/__init__.py
- src/veritext/cli/main.py
- tests/test_cli/test_commands.py
Verification:
uv run veritext --help
uv run veritext validate "hello world" --reference "hello world" --metrics bleu
uv run pytest tests/test_cli/ -v
Phase 9: Documentation & Polish
Goal: Make portfolio-ready.
Tasks:
- Write comprehensive readme.md with examples
- Add docstrings to all public APIs
- Create example scripts
- Ensure ≥80% test coverage
- Final linting/type checking
- Update changelog.md with 0.1.0 release
- Update project docs in docs/
Files:
- readme.md (comprehensive)
- examples/basic_validation.py
- examples/chatbot_testing.py
- examples/benchmark_regression.py
- Update all docstrings
- docs/project-plan.md (update)
- docs/implementation-plan.md (update)
Verification:
uv run ruff check .
uv run ruff format --check .
uv run mypy src/
uv run pytest --cov=src/veritext --cov-report=term-missing
# Verify ≥80% coverage
Dependencies
[project]
name = "veritext"
version = "0.1.0"
description = "Semantic text validation framework"
readme = "readme.md"
requires-python = ">=3.11"
dependencies = [
"pydantic>=2.0",
"pydantic-settings>=2.0",
"structlog>=23.0",
"typer>=0.9",
"rich>=13.0",
]
[project.optional-dependencies]
semantic = ["sentence-transformers>=2.2"]
dev = [
"pytest>=7.0",
"pytest-cov>=4.0",
"mypy>=1.0",
"ruff>=0.1",
]
all = ["veritext[semantic]"]
[project.scripts]
veritext = "veritext.cli.main:app"
[project.entry-points.pytest11]
veritext = "veritext.pytest_plugin"
Conventions
Code Quality
- ruff check . (zero issues)
- ruff format --check . (zero changes)
- mypy src/ (passes, strict mode)
- pytest --cov=src/veritext (≥80% coverage)
Git
- Author: Kai Chappell git@kschappell.com
- Signed commits: GPG key 219AA60F0638489B
- Format: type(scope): description
- Atomic: ≤3 files, ≤150 LOC per commit
- No AI/LLM attribution
Python
- Python 3.11+ with modern type hints
- Absolute imports from package root
- structlog for logging
- UK English (colour, behaviour, summarisation)
Verification Checklist (Per Phase)
cd portfolio/veritext # relative to workspace root
# Code quality
uv run ruff check .
uv run ruff format --check .
uv run mypy src/
# Tests
uv run pytest --cov=src/veritext --cov-report=term-missing
# Package installation
uv pip install -e .
uv run python -c "import veritext; print(veritext.__version__)"