docs: add project and implementation plans

Comprehensive documentation for Veritext semantic text validation framework:
- Project plan with architecture, use cases, and success criteria
- Implementation plan with 9 phases, interfaces, and verification steps
2026-02-03 15:27:00 +00:00
commit 49f1e27cb1
2 changed files with 1339 additions and 0 deletions
--- a/docs/project-plan.md
+++ b/docs/project-plan.md
@@ -0,0 +1,478 @@
+# Project Plan: Veritext — Semantic Text Validation Framework
+
+## Overview
+
+A Python library for validating text outputs against semantic criteria. Designed for
+developers building any system that produces text — chatbots, content generators,
+translation pipelines, summarisation tools — who need automated quality assurance
+beyond simple string matching.
+
+**Origin story:** "I was building a feature that generated article summaries and got
+tired of manually checking if they captured the key points. Existing tools could tell
+me if two strings matched, but not if they *meant* the same thing. So I built a
+validation framework that understands semantics."
+
+**Portfolio role:** A practical developer tool that demonstrates Python framework
+design, NLP evaluation techniques, and test automation integration. The project
+solves a real problem any developer working with text processing encounters.
+
+**Target users:** Developers building content pipelines, chatbot teams validating
+responses, ML engineers evaluating model outputs, QA teams testing text-based features.
+
+---
+
+## Problem Statement
+
+Text validation is hard. Traditional testing approaches fall short:
+
+| Approach | Problem |
+|----------|---------|
+| Exact string match | Fails on semantically equivalent variations |
+| Substring/regex | Brittle, misses meaning entirely |
+| Manual review | Doesn't scale, subjective |
+| Generic diff tools | Show *what* changed, not *if it matters* |
+
+**Example:** A summarisation system produces "The CEO announced layoffs affecting 500
+employees" one day and "500 workers will lose their jobs, the company's chief executive
+said" the next. The two outputs are semantically equivalent, but an exact-match or
+regex-based test would flag the change as a failure.
+
+Veritext answers: "Is this text output *good enough* according to my criteria?" — not
+"Is it identical?"
+
+---
+
+## Core Concepts
+
+### Metrics (Pure Computation)
+
+Metrics compute scores comparing candidate text to reference text:
+
+```python
+from veritext.metrics import Bleu, Rouge
+
+candidate = "The cat sat on the mat"
+reference = "A cat is sitting on a mat"
+
+bleu = Bleu()
+result = bleu.score(candidate=candidate, reference=reference)
+# Returns a BleuResult with fields bleu1, bleu2, bleu3, bleu4, brevity_penalty.
+# The candidate (6 tokens) is shorter than the reference (7 tokens),
+# so brevity_penalty is below 1.0 for this pair.
+
+rouge = Rouge()
+result = rouge.score(candidate, reference)
+# RougeResult(rouge1=RougeScore(...), rouge2=RougeScore(...), rouge_l=RougeScore(...))
+```
+
+**Built-in metrics:**
+
+| Metric | What it measures | Use case |
+|--------|------------------|----------|
+| BLEU-1 to BLEU-4 | N-gram precision | Translation, generation |
+| ROUGE-1, ROUGE-2 | N-gram recall | Summarisation |
+| ROUGE-L | Longest common subsequence | Summarisation |
+| Semantic similarity | Cosine similarity of embeddings | Any meaning comparison |
+| Lexical overlap | Jaccard similarity of tokens | Simple similarity |
+| Reading level | Flesch-Kincaid grade | Accessibility |
+
+**Note:** Reading level is a standalone metric that analyses only the candidate text — no reference required.
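+
+As a rough illustration of the n-gram precision behind BLEU, the sketch below computes
+clipped unigram precision and the brevity penalty against a single reference. The
+function names and the whitespace tokenisation are assumptions made for this example,
+not Veritext's actual API.
+
+```python
+import math
+from collections import Counter
+
+
+def clipped_precision(candidate: list[str], reference: list[str], n: int = 1) -> float:
+    """Modified n-gram precision: candidate counts are clipped to reference counts."""
+    def ngrams(tokens: list[str]) -> Counter:
+        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
+
+    cand, ref = ngrams(candidate), ngrams(reference)
+    total = sum(cand.values())
+    if total == 0:
+        return 0.0
+    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
+    return matched / total
+
+
+def brevity_penalty(candidate_len: int, reference_len: int) -> float:
+    """Penalise candidates that are shorter than the reference."""
+    if candidate_len == 0:
+        return 0.0
+    if candidate_len >= reference_len:
+        return 1.0
+    return math.exp(1 - reference_len / candidate_len)
+
+
+cand = "the cat sat on the mat".split()
+ref = "a cat is sitting on a mat".split()
+p1 = clipped_precision(cand, ref, n=1)     # 3 of 6 unigrams match: 0.5
+bp = brevity_penalty(len(cand), len(ref))  # exp(1 - 7/6), about 0.85
+```
+
+Clipping stops a candidate from scoring well by repeating a common word, and the
+brevity penalty offsets the advantage that very short candidates would otherwise have
+on a precision-only score.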
+
+### Validators (Decision Logic)
+
+Validators wrap metrics and apply thresholds to make pass/fail decisions:
+
+```python
+from veritext import validators as v
+from veritext.core.types import ValidationContext
+
+# Compose multiple checks
+validator = v.all_of([
+    v.bleu(min_score=0.7),
+    v.length(max_chars=500),
+    v.readability(max_grade=8),
+])
+
+context = ValidationContext(reference="The quick brown fox jumps over the lazy dog")
+result = validator.validate("The fast brown fox leaped over the lazy dog", context)
+# Returns ValidationResult(passed=..., checks=[...]).
+# Whether the BLEU check passes depends on the variant: for this pair BLEU-1
+# is 7/9 ≈ 0.78, while BLEU-4 without smoothing is about 0.37.
+```
+
+### Pytest Integration
+
+Native pytest fixtures and assertions for CI/CD:
+
+```python
+from veritext.pytest_plugin import validate_text
+
+def test_summary_quality(summariser, document, expected_summary):
+    summary = summariser.summarise(document)
+
+    validate_text(
+        summary,
+        reference=expected_summary,
+        min_rouge=0.7,
+        min_semantic=0.85,
+    )
+```
+
+### Regression Detection
+
+Track output quality over time and catch degradations before users do:
+
+```python
+from veritext.benchmark import Benchmark
+
+benchmark = Benchmark("summarisation_quality", storage_path="benchmarks/")
+results = benchmark.evaluate(outputs, references, metrics=["rouge_l", "bleu4"])
+benchmark.assert_no_regression(tolerance=0.05)
+```
+
+---
+
+## Tech Stack
+
+| Component | Technology | Rationale |
+|-----------|------------|-----------|
+| Core | Python 3.11+ | Target ecosystem, modern type hints |
+| Metrics | Custom implementations | Full control, understanding of algorithms |
+| Embeddings | sentence-transformers | Semantic similarity (optional) |
+| Test integration | pytest | Fixtures, plugins, assertions |
+| CLI | typer | Consistent with portfolio projects |
+| Data handling | pydantic | Validation, serialisation |
+| Storage | SQLite | Benchmark history, lightweight |
+| Output | rich | Terminal formatting |
+
+---
+
+## Architecture
+
+### Layered Design
+
+```
+┌─────────────────────────────────────────────────────┐
+│  CLI / pytest_plugin  (presentation layer)          │
+├─────────────────────────────────────────────────────┤
+│  validators/          (decision logic)              │
+│  benchmark/           (tracking & regression)       │
+├─────────────────────────────────────────────────────┤
+│  metrics/             (pure computation)            │
+├─────────────────────────────────────────────────────┤
+│  core/                (shared types, tokenisation)  │
+└─────────────────────────────────────────────────────┘
+```
+
+**Dependency rule:** Each layer depends only on layers below it.
+
+### Key Design Decisions
+
+1. **Metrics vs Validators separation** — Metrics compute scores; validators make
+   pass/fail decisions. Clear separation of concerns.
+
+2. **Typed result objects** — Each metric returns a specific result type (e.g.,
+   `BleuResult`, `RougeResult`), not just `float`. Full information preserved.
+
+3. **Optional heavy dependencies** — `sentence-transformers` (~2GB with PyTorch) is
+   optional. Core library works without ML dependencies.
+
+4. **Shared tokenisation** — Single `Tokeniser` protocol used by all n-gram metrics.
+   Consistent behaviour across BLEU and ROUGE.
+
+5. **Explicit context** — `ValidationContext` dataclass instead of `**kwargs`.
+   Type-safe, discoverable API.
+
+---
+
+## Project Components
+
+### Component 1: Core Module
+
+Shared types, exceptions, and tokenisation.
+
+**Types:**
+- `ValidationContext` — reference text and metadata for validation
+- `CheckResult` — individual check result with diagnostics
+- `ValidationResult` — aggregate result with pass/fail and all checks
+- `BatchResult` — statistics over multiple evaluations
+
+**Tokeniser:**
+```python
+from typing import Protocol
+
+class Tokeniser(Protocol):
+    def tokenise(self, text: str) -> list[str]: ...
+
+class WordTokeniser:
+    def __init__(self, lowercase: bool = True, remove_punctuation: bool = True): ...
+```
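+
+One possible shape for `WordTokeniser`, shown as a minimal sketch: it splits on
+whitespace and strips ASCII punctuation only. The body is an assumption made for
+illustration; the plan itself specifies just the constructor signature.
+
+```python
+import string
+
+
+class WordTokeniser:
+    """Whitespace tokeniser with optional lowercasing and punctuation removal."""
+
+    def __init__(self, lowercase: bool = True, remove_punctuation: bool = True) -> None:
+        self.lowercase = lowercase
+        self.remove_punctuation = remove_punctuation
+        # Translation table that deletes ASCII punctuation characters.
+        self._strip = str.maketrans("", "", string.punctuation)
+
+    def tokenise(self, text: str) -> list[str]:
+        if self.lowercase:
+            text = text.lower()
+        if self.remove_punctuation:
+            text = text.translate(self._strip)
+        return text.split()
+
+
+tokens = WordTokeniser().tokenise("The cat sat, on the mat!")
+# ['the', 'cat', 'sat', 'on', 'the', 'mat']
+```
+
+Deleting punctuation outright turns "we're" into "were", so a fuller implementation
+would probably need to treat apostrophes and hyphens separately.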
+
+---
+
+### Component 2: Metric Engine
+
+Pure implementations of text evaluation metrics.
+
+**Interface:**
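+```python
+from typing import Protocol, TypeVar
+
+from veritext.core.types import BatchResult
+
+T = TypeVar("T")  # metric-specific result type, e.g. BleuResult
+
+class Metric(Protocol[T]):
+    @property
+    def name(self) -> str: ...
+
+    def score(self, candidate: str, reference: str | list[str]) -> T: ...
+
+    def batch_score(
+        self,
+        candidates: list[str],
+        references: list[str] | list[list[str]]
+    ) -> BatchResult[T]: ...
+```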
+
+**Metrics:**
+- `Bleu` — BLEU-1 through BLEU-4 with brevity penalty
+- `Rouge` — ROUGE-1, ROUGE-2, ROUGE-L with precision/recall/F1
+- `Lexical` — Jaccard similarity, token overlap
+- `Readability` — Flesch-Kincaid grade level
+- `SemanticSimilarity` — Embedding cosine similarity (optional dependency)
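+
+To make the ROUGE-L entry concrete, here is a minimal sketch that derives precision,
+recall, and F1 from the longest common subsequence (LCS) of two token lists. The
+function names and the plain-tuple return value are assumptions; the planned `Rouge`
+metric returns typed `RougeScore` objects instead.
+
+```python
+def lcs_length(a: list[str], b: list[str]) -> int:
+    """Length of the longest common subsequence, via row-by-row dynamic programming."""
+    previous = [0] * (len(b) + 1)
+    for token_a in a:
+        current = [0]
+        for j, token_b in enumerate(b, start=1):
+            if token_a == token_b:
+                current.append(previous[j - 1] + 1)
+            else:
+                current.append(max(previous[j], current[j - 1]))
+        previous = current
+    return previous[-1]
+
+
+def rouge_l(candidate: list[str], reference: list[str]) -> tuple[float, float, float]:
+    """Return (precision, recall, f1) based on the LCS length."""
+    if not candidate or not reference:
+        return 0.0, 0.0, 0.0
+    lcs = lcs_length(candidate, reference)
+    precision = lcs / len(candidate)
+    recall = lcs / len(reference)
+    f1 = 0.0 if lcs == 0 else 2 * precision * recall / (precision + recall)
+    return precision, recall, f1
+
+
+cand = "the fast brown fox leaped over the lazy dog".split()
+ref = "the quick brown fox jumps over the lazy dog".split()
+precision, recall, f1 = rouge_l(cand, ref)  # LCS has 7 tokens, so all three are 7/9
+```
+
+Because the LCS only requires tokens to appear in the same order, not adjacently,
+ROUGE-L tolerates single-word substitutions better than higher-order n-gram overlap.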
+
+---
+
+### Component 3: Validator Framework
+
+Composable validation rules with clear pass/fail semantics.
+
+**Built-in validators:**
+
+| Validator | Description |
+|-----------|-------------|
+| `v.bleu(min_score, variant)` | BLEU score above minimum |
+| `v.rouge(min_score, variant)` | ROUGE score above minimum |
+| `v.semantic(min_score)` | Semantic similarity above threshold |
+| `v.length(min_chars, max_chars)` | Length constraints |
+| `v.readability(max_grade)` | Reading level constraint |
+| `v.contains(terms)` | Required terms present |
+| `v.excludes(terms)` | Forbidden terms absent |
+| `v.pattern(regex)` | Regex pattern match |
+
+**Composition:**
+
+```python
+# All validators must pass
+v.all_of([v.bleu(min_score=0.7), v.length(max_chars=500)])
+
+# At least one must pass
+v.any_of([v.contains(["error"]), v.contains(["failed"])])
+
+# Weighted scoring
+v.weighted([
+    (v.bleu(min_score=0.7), 0.6),
+    (v.readability(max_grade=8), 0.4),
+], min_score=0.75)
+```
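+
+The composition helpers could be built from plain callables, as in the sketch below.
+Everything here is hypothetical: checks are reduced to `str -> bool` functions, and
+`weighted` is assumed to pass when the weight of the passing checks, as a share of the
+total weight, reaches `min_score`. The planned validators return richer
+`ValidationResult` objects.
+
+```python
+from typing import Callable
+
+# A check takes the candidate text and returns True when it passes.
+Check = Callable[[str], bool]
+
+
+def length(max_chars: int) -> Check:
+    return lambda text: len(text) <= max_chars
+
+
+def contains(terms: list[str]) -> Check:
+    return lambda text: all(term in text for term in terms)
+
+
+def all_of(checks: list[Check]) -> Check:
+    return lambda text: all(check(text) for check in checks)
+
+
+def any_of(checks: list[Check]) -> Check:
+    return lambda text: any(check(text) for check in checks)
+
+
+def weighted(pairs: list[tuple[Check, float]], min_score: float) -> Check:
+    """Pass when the weighted share of passing checks reaches min_score."""
+    total = sum(weight for _, weight in pairs)
+
+    def run(text: str) -> bool:
+        passed = sum(weight for check, weight in pairs if check(text))
+        return total > 0 and passed / total >= min_score
+
+    return run
+
+
+validator = all_of([
+    length(max_chars=20),
+    any_of([contains(["error"]), contains(["failed"])]),
+])
+ok = validator("request failed")  # short enough and contains "failed"
+```
+
+Since every combinator returns the same kind of callable it accepts, compositions
+nest to any depth without special cases.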
+
+---
+
+### Component 4: Pytest Plugin
+
+First-class pytest integration for CI/CD pipelines.
+
+**Features:**
+- Custom assertions with detailed failure messages
+- Fixtures for common validation patterns
+- Markers for categorising text tests
+
+**Usage:**
+
+```python
+from veritext.pytest_plugin import validate_text
+
+def test_chatbot_response():
+    response = chatbot.respond("What are your hours?")
+
+    validate_text(
+        response,
+        reference="We're open Monday to Friday, 9am to 5pm.",
+        min_bleu=0.6,
+        min_semantic=0.8,
+        max_length=500,
+    )
+```
+
+**Failure output:**
+
+```
+FAILED test_summary.py::test_summary_quality
+    AssertionError: Text failed 2 of 4 checks:
+
+    ✗ rouge: 0.58 (minimum: 0.70)
+    ✗ semantic: 0.72 (minimum: 0.85)
+    ✓ length: 342 (maximum: 500)
+    ✓ readability: 6.2 (maximum: 8)
+
+    Candidate: "The company reported losses..."
+    Reference: "Financial results showed significant decline..."
+```
+
+---
+
+### Component 5: Benchmark & Regression Detection
+
+Track quality over time and catch degradations automatically.
+
+**Features:**
+- Store historical metric values in SQLite
+- Statistical regression detection
+- Configurable tolerance thresholds
+- CI integration for blocking degradations
+
+**Usage:**
+
+```python
+from veritext.benchmark import Benchmark
+
+benchmark = Benchmark("chatbot_quality", storage_path="benchmarks/")
+
+# Record current run (returns BenchmarkRun with metrics and metadata)
+run = benchmark.evaluate(
+    candidates=current_outputs,
+    references=expected_outputs,
+    metrics=["rouge_l", "semantic"]
+)
+# run.metrics = {"rouge_l": 0.82, "semantic": 0.89}
+
+# Compare against historical baseline
+regression = benchmark.check_regression(tolerance=0.05, window=10)
+
+if regression.detected:
+    print(f"Quality dropped: {regression.summary}")
+
+# In CI: fail the build on regression
+benchmark.assert_no_regression(tolerance=0.05)
+```
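+
+One simple way `check_regression` might work is to compare the latest score with the
+mean of the preceding `window` runs. The sketch below assumes that rule, treats
+`tolerance` as an absolute drop in score, and uses a plain list in place of the SQLite
+history; none of these choices is fixed by the plan.
+
+```python
+from dataclasses import dataclass
+from statistics import mean
+
+
+@dataclass
+class Regression:
+    detected: bool
+    summary: str
+
+
+def check_regression(
+    history: list[float], tolerance: float = 0.05, window: int = 10
+) -> Regression:
+    """Compare the latest score with the mean of up to `window` earlier runs."""
+    if len(history) < 2:
+        return Regression(False, "not enough history")
+    latest = history[-1]
+    baseline = mean(history[-(window + 1):-1])
+    drop = baseline - latest
+    summary = f"baseline {baseline:.3f}, latest {latest:.3f}, drop {drop:.3f}"
+    return Regression(drop > tolerance, summary)
+
+
+scores = [0.82, 0.81, 0.83, 0.82, 0.74]  # one metric across five runs
+result = check_regression(scores, tolerance=0.05)
+```
+
+A mean-based baseline is easy to explain but sensitive to noisy runs; a median or a
+significance test over the window would be a natural refinement.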
+
+---
+
+### Component 6: CLI Tool
+
+Command-line interface for quick validation and benchmarking.
+
+```bash
+# Validate a single text
+$ veritext validate "Your text here" --reference "Expected text" --metrics bleu,rouge
+
+# Validate from files
+$ veritext validate --file outputs.jsonl --reference-file expected.jsonl
+
+# Run benchmark
+$ veritext benchmark run summarisation --inputs docs/ --references refs/
+
+# Show benchmark history
+$ veritext benchmark show summarisation --last 20
+
+# Check for regression
+$ veritext benchmark check summarisation --tolerance 0.05
+```
+
+---
+
+## Example Use Cases
+
+### Use Case 1: Chatbot Response Validation
+
+```python
+from veritext import validators as v
+from veritext.core.types import ValidationContext
+
+# Define acceptable response criteria
+response_validator = v.all_of([
+    v.length(max_chars=500),
+    v.readability(max_grade=8),
+    v.excludes(terms=["I don't know", "I'm not sure"]),
+])
+
+def test_chatbot_responds_helpfully():
+    response = chatbot.respond("How do I reset my password?")
+    context = ValidationContext()
+    result = response_validator.validate(response, context)
+    assert result.passed, result.failure_summary
+```
+
+### Use Case 2: Summarisation Quality Gate
+
+```python
+from veritext.pytest_plugin import validate_text
+
+def test_summary_captures_key_points():
+    article = load_article("financial_report.txt")
+    summary = summariser.summarise(article)
+
+    validate_text(
+        summary,
+        reference=load_reference_summary("financial_report_summary.txt"),
+        min_rouge=0.65,
+        min_semantic=0.80,
+        max_length=300,
+    )
+```
+
+### Use Case 3: Translation Quality Monitoring
+
+```python
+from veritext.benchmark import Benchmark
+
+benchmark = Benchmark("translation_en_de", storage_path="benchmarks/")
+
+# Nightly CI job
+results = benchmark.evaluate(
+    candidates=translate_batch(test_documents),
+    references=human_translations,
+    metrics=["bleu4", "semantic"]
+)
+
+# Block deployment if quality drops
+benchmark.assert_no_regression(tolerance=0.03)
+```
+
+---
+
+## Success Criteria
+
+- [ ] BLEU/ROUGE implementations match reference implementations (nltk, rouge-score)
+- [ ] Semantic similarity correlates with human judgement on test pairs
+- [ ] Pytest plugin installs cleanly via `pip install veritext`
+- [ ] Validation of 1000 text pairs completes in <5 seconds (excluding embeddings)
+- [ ] Benchmark regression detection has <5% false positive rate
+- [ ] Documentation includes working examples for each use case
+- [ ] All code passes ruff, mypy strict, and pytest with ≥80% coverage
+- [ ] Can explain design decisions and metric theory in an interview
+
+---
+
+## Skills Demonstrated
+
+| Skill | How Veritext demonstrates it |
+|-------|------------------------------|
+| Python framework design | Composable validators, clean API, plugin architecture |
+| Test automation | Native pytest integration, CI/CD workflows |
+| NLP evaluation metrics | BLEU, ROUGE, semantic similarity implementations |
+| Data analysis | Statistical regression detection, batch processing |
+| CLI development | Typer-based interface, rich output |
+| Software architecture | Layered design, clear separation of concerns |
+| Documentation | Comprehensive README, examples |
+| Quality engineering | High test coverage, type safety, linting |
+
+---
+
+## What Makes This Project Credible
+
+1. **Solves a real problem** — Anyone building text-based features faces validation
+   challenges.
+
+2. **Not tied to a specific technology** — Works with any text source (chatbots, LLMs,
+   translation APIs, content generators). It's a general-purpose tool, not an "LLM
+   testing framework."
+
+3. **Practical scope** — Not trying to reinvent pytest or build an ML platform. Focused
+   on one thing: validating text quality.
+
+4. **Demonstrates depth** — Implementing BLEU/ROUGE from understanding (not just
+   wrapping libraries) shows knowledge of how these metrics work.
+
+5. **Natural portfolio narrative** — "I was building X and needed a better way to test
+   it, so I built this tool." Every interviewer has faced similar problems.