
# Project Plan: Veritext — Semantic Text Validation Framework

## Overview

A Python library for validating text outputs against semantic criteria. Designed for
developers building any system that produces text — chatbots, content generators,
translation pipelines, summarisation tools — who need automated quality assurance
beyond simple string matching.

**Origin story:** "I was building a feature that generated article summaries and got
tired of manually checking if they captured the key points. Existing tools could tell
me if two strings matched, but not if they *meant* the same thing. So I built a
validation framework that understands semantics."

**Portfolio role:** A practical developer tool that demonstrates Python framework
design, NLP evaluation techniques, and test automation integration. The project
solves a real problem any developer working with text processing encounters.

**Target users:** Developers building content pipelines, chatbot teams validating
responses, ML engineers evaluating model outputs, QA teams testing text-based features.

---

## Problem Statement

Text validation is hard. Traditional testing approaches fall short:

| Approach | Problem |
|----------|---------|
| Exact string match | Fails on semantically equivalent variations |
| Substring/regex | Brittle, misses meaning entirely |
| Manual review | Doesn't scale, subjective |
| Generic diff tools | Show *what* changed, not *if it matters* |

**Example:** A summarisation system produces "The CEO announced layoffs affecting 500
employees" one day and "500 workers will lose their jobs, the company's chief executive
said" the next. These are semantically equivalent, but an exact-match or regex-based
test would flag the change as a failure.

Veritext answers: "Is this text output *good enough* according to my criteria?" — not
"Is it identical?"

---

## Core Concepts

### Metrics (Pure Computation)

Metrics compute scores comparing candidate text to reference text:

```python
from veritext.metrics import Bleu, Rouge

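candidate = "The cat sat on the mat"
reference = "A cat is sitting on a mat"

bleu = Bleu()
bleu_result = bleu.score(candidate=candidate, reference=reference)
# Returns a BleuResult with bleu1..bleu4 and brevity_penalty fields.
# For this pair, unigram precision is 3/6 = 0.5 (cat, on, mat), and because the
# candidate (6 tokens) is shorter than the reference (7 tokens), the standard
# brevity penalty is exp(1 - 7/6) ≈ 0.85.

rouge = Rouge()
rouge_result = rouge.score(candidate, reference)
# RougeResult(rouge1=RougeScore(...), rouge2=RougeScore(...), rouge_l=RougeScore(...))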
```
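
As a rough illustration of what could sit behind `Bleu.score`, the sketch below computes clipped n-gram precision and the brevity penalty for pre-tokenised input. The function names are hypothetical rather than Veritext's API, and no smoothing is applied, so any n-gram order with zero matches drives the combined score to 0.

```python
import math
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate: list[str], reference: list[str], n: int) -> float:
    """Clipped precision: a candidate n-gram counts at most as often as it occurs in the reference."""
    cand_counts = ngrams(candidate, n)
    ref_counts = ngrams(reference, n)
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


def brevity_penalty(candidate_len: int, reference_len: int) -> float:
    """Penalise candidates that are shorter than the reference."""
    if candidate_len == 0:
        return 0.0
    if candidate_len >= reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)


def bleu(candidate: list[str], reference: list[str], max_n: int = 4) -> float:
    """Unsmoothed BLEU: brevity penalty times the geometric mean of the n-gram precisions."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(len(candidate), len(reference)) * math.exp(log_mean)


cand = "the cat sat on the mat".split()
ref = "a cat is sitting on a mat".split()
print(modified_precision(cand, ref, 1))  # 0.5 (cat, on, mat out of six tokens)
print(bleu(cand, ref, max_n=1))
```

Because this pair shares no bigrams, `bleu(cand, ref)` with the default `max_n=4` is 0.0, which is one reason short-text validation often relies on lower-order variants or a smoothing scheme.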

**Built-in metrics:**

| Metric | What it measures | Use case |
|--------|------------------|----------|
| BLEU-1 to BLEU-4 | N-gram precision | Translation, generation |
| ROUGE-1, ROUGE-2 | N-gram recall | Summarisation |
| ROUGE-L | Longest common subsequence | Summarisation |
| Semantic similarity | Cosine similarity of embeddings | Any meaning comparison |
| Lexical overlap | Jaccard similarity of tokens | Simple similarity |
| Reading level | Flesch-Kincaid grade | Accessibility |

**Note:** Reading level is a standalone metric that analyses only the candidate text — no reference required.
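
To make the reference-free case concrete, here is a minimal sketch of the Flesch-Kincaid grade formula, `0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59`. The syllable counter is a crude vowel-run heuristic chosen for brevity, and the function names are illustrative assumptions rather than Veritext's API.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: one syllable per run of vowels, with a floor of one."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level of a text; no reference text is involved."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


simple = flesch_kincaid_grade("The cat sat on the mat.")
dense = flesch_kincaid_grade(
    "Institutional accountability necessitates comprehensive documentation."
)
print(simple, dense)
```

Very simple text can score below zero with this formula, so a validator built on it would likely compare against an upper bound only.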

### Validators (Decision Logic)

Validators wrap metrics and apply thresholds to make pass/fail decisions:

```python
from veritext import validators as v
from veritext.core.types import ValidationContext

# Compose multiple checks
validator = v.all_of([
    v.bleu(min_score=0.7),
    v.length(max_chars=500),
    v.readability(max_grade=8),
])

context = ValidationContext(reference="The quick brown fox jumps over the lazy dog")
result = validator.validate("The fast brown fox leaped over the lazy dog", context)
# ValidationResult(passed=True, checks=[...])
```
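
The metric/threshold split can be sketched in a few lines. This is an assumption-laden illustration, not the planned implementation: `MinScoreValidator`, `CheckResult` as defined here, and the `jaccard` helper are hypothetical stand-ins, and the reference is passed directly instead of through a `ValidationContext` to keep the example short.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class CheckResult:
    """Outcome of one check, keeping the score that produced the decision."""
    name: str
    passed: bool
    score: float
    threshold: float


def jaccard(candidate: str, reference: str) -> float:
    """Token-set overlap: |A ∩ B| / |A ∪ B| on lowercased whitespace tokens."""
    a, b = set(candidate.lower().split()), set(reference.lower().split())
    return len(a & b) / len(a | b) if a | b else 1.0


@dataclass(frozen=True)
class MinScoreValidator:
    """Wraps a pure metric function and turns its score into a pass/fail decision."""
    name: str
    metric: Callable[[str, str], float]
    min_score: float

    def validate(self, candidate: str, reference: str) -> CheckResult:
        score = self.metric(candidate, reference)
        return CheckResult(self.name, score >= self.min_score, score, self.min_score)


check = MinScoreValidator("jaccard", jaccard, min_score=0.5).validate(
    "The fast brown fox leaped over the lazy dog",
    "The quick brown fox jumps over the lazy dog",
)
print(check)
```

The two sentences share six of ten distinct tokens, so the score is 0.6 and the check passes at a threshold of 0.5 but would fail at 0.7.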

### Pytest Integration

Native pytest fixtures and assertions for CI/CD:

```python
from veritext.pytest_plugin import validate_text

def test_summary_quality(summariser, document, expected_summary):
    summary = summariser.summarise(document)

    validate_text(
        summary,
        reference=expected_summary,
        min_rouge=0.7,
        min_semantic=0.85,
    )
```

### Regression Detection

Track output quality over time, catch degradations before users do:

```python
from veritext.benchmark import Benchmark

benchmark = Benchmark("summarisation_quality", storage_path="benchmarks/")
results = benchmark.evaluate(outputs, references, metrics=["rouge_l", "bleu4"])
benchmark.assert_no_regression(tolerance=0.05)
```

---

## Tech Stack

| Component | Technology | Rationale |
|-----------|------------|-----------|
| Core | Python 3.11+ | Target ecosystem, modern type hints |
| Metrics | Custom implementations | Full control, understanding of algorithms |
| Embeddings | sentence-transformers | Semantic similarity (optional) |
| Test integration | pytest | Fixtures, plugins, assertions |
| CLI | typer | Consistent with portfolio projects |
| Data handling | pydantic | Validation, serialisation |
| Storage | SQLite | Benchmark history, lightweight |
| Output | rich | Terminal formatting |

---

## Architecture

### Layered Design

```
┌─────────────────────────────────────────────────────┐
│  CLI / pytest_plugin  (presentation layer)          │
├─────────────────────────────────────────────────────┤
│  validators/          (decision logic)              │
│  benchmark/           (tracking & regression)       │
├─────────────────────────────────────────────────────┤
│  metrics/             (pure computation)            │
├─────────────────────────────────────────────────────┤
│  core/                (shared types, tokenisation)  │
└─────────────────────────────────────────────────────┘
```

**Dependency rule:** Each layer depends only on layers below it.

### Key Design Decisions

1. **Metrics vs Validators separation** — Metrics compute scores; validators make
   pass/fail decisions. Clear separation of concerns.

2. **Typed result objects** — Each metric returns a specific result type (e.g.,
   `BleuResult`, `RougeResult`), not just `float`. Full information preserved.

3. **Optional heavy dependencies** — `sentence-transformers` (~2GB with PyTorch) is
   optional. Core library works without ML dependencies.

4. **Shared tokenisation** — Single `Tokeniser` protocol used by all n-gram metrics.
   Consistent behaviour across BLEU and ROUGE.

5. **Explicit context** — `ValidationContext` dataclass instead of `**kwargs`.
   Type-safe, discoverable API.
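
One way the optional-dependency decision could look in code is sketched below. The helper names are hypothetical; `SentenceTransformer(...).encode(...)` is the real sentence-transformers call, and the model name `all-MiniLM-L6-v2` is only an example choice, not something this plan specifies. The heavy import is deferred so the core library loads without PyTorch.

```python
import importlib.util
import math
from typing import Sequence


def embeddings_available() -> bool:
    """True if the optional sentence-transformers package is installed."""
    return importlib.util.find_spec("sentence_transformers") is not None


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (1.0 means same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def semantic_similarity(
    candidate: str, reference: str, model_name: str = "all-MiniLM-L6-v2"
) -> float:
    """Embed both texts and compare them; fails clearly if the extra is missing."""
    if not embeddings_available():
        raise ImportError(
            "Semantic similarity needs the optional 'sentence-transformers' package"
        )
    from sentence_transformers import SentenceTransformer  # deferred heavy import

    model = SentenceTransformer(model_name)
    cand_vec, ref_vec = model.encode([candidate, reference])
    return cosine_similarity(cand_vec.tolist(), ref_vec.tolist())


print(cosine_similarity([1.0, 1.0], [1.0, 0.0]))
```

A real implementation would presumably cache the loaded model rather than constructing it on every call.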

---

## Project Components

### Component 1: Core Module

Shared types, exceptions, and tokenisation.

**Types:**
- `ValidationContext` — reference text and metadata for validation
- `CheckResult` — individual check result with diagnostics
- `ValidationResult` — aggregate result with pass/fail and all checks
- `BatchResult` — statistics over multiple evaluations
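
A minimal sketch of how these types might be shaped, assuming frozen dataclasses. The field names (`metadata`, `message`) and the derived `failure_summary` property are assumptions made for illustration; only the type names come from the list above.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass(frozen=True)
class ValidationContext:
    """Explicit inputs for validation, replacing free-form keyword arguments."""
    reference: Optional[str] = None
    metadata: dict[str, Any] = field(default_factory=dict)


@dataclass(frozen=True)
class CheckResult:
    """Result of a single check, with a human-readable diagnostic."""
    name: str
    passed: bool
    message: str = ""


@dataclass(frozen=True)
class ValidationResult:
    """Aggregate result: passes only if every individual check passed."""
    checks: tuple[CheckResult, ...]

    @property
    def passed(self) -> bool:
        return all(check.passed for check in self.checks)

    @property
    def failure_summary(self) -> str:
        failed = [c for c in self.checks if not c.passed]
        return "; ".join(f"{c.name}: {c.message}" for c in failed)


result = ValidationResult((
    CheckResult("length", True),
    CheckResult("readability", False, "grade 9.1 exceeds 8"),
))
print(result.passed, result.failure_summary)
```

Deriving `passed` from the checks, instead of storing it separately, keeps the aggregate from drifting out of sync with its parts.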

**Tokeniser:**
```python
from typing import Protocol


class Tokeniser(Protocol):
    def tokenise(self, text: str) -> list[str]: ...

class WordTokeniser:
    def __init__(self, lowercase: bool = True, remove_punctuation: bool = True): ...
```
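
A possible body for `WordTokeniser` is sketched here, assuming whitespace splitting and ASCII punctuation stripping via `string.punctuation`. Those choices are assumptions: a production tokeniser would need a decision on apostrophes, hyphens, and non-ASCII punctuation.

```python
import string
from typing import Protocol


class Tokeniser(Protocol):
    def tokenise(self, text: str) -> list[str]: ...


class WordTokeniser:
    """Whitespace tokeniser with optional lowercasing and punctuation removal."""

    def __init__(self, lowercase: bool = True, remove_punctuation: bool = True) -> None:
        self.lowercase = lowercase
        self.remove_punctuation = remove_punctuation
        # Translation table that deletes every ASCII punctuation character.
        self._strip_table = str.maketrans("", "", string.punctuation)

    def tokenise(self, text: str) -> list[str]:
        if self.lowercase:
            text = text.lower()
        if self.remove_punctuation:
            text = text.translate(self._strip_table)
        return text.split()


def count_tokens(tokeniser: Tokeniser, text: str) -> int:
    """Accepts anything with a matching tokenise method (structural typing)."""
    return len(tokeniser.tokenise(text))


print(WordTokeniser().tokenise("The cat, the Mat!"))  # ['the', 'cat', 'the', 'mat']
```

Note that stripping punctuation this way turns "don't" into "dont", which is consistent across metrics but may not be what every caller expects.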

---

### Component 2: Metric Engine

Pure implementations of text evaluation metrics.

**Interface:**
```python
from typing import Protocol, TypeVar

T = TypeVar("T")


class Metric(Protocol[T]):
    @property
    def name(self) -> str: ...

    def score(self, candidate: str, reference: str | list[str]) -> T: ...

    def batch_score(
        self,
        candidates: list[str],
        references: list[str] | list[list[str]]
    ) -> BatchResult[T]: ...
```
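
To show how a concrete class might satisfy this interface, the sketch below implements a Jaccard-based metric. It assumes that multiple references are handled by taking the best score across them and that `BatchResult` carries the per-item results plus a mean; both are illustrative guesses, as is the `LexicalResult` shape.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Generic, TypeVar, Union

T = TypeVar("T")


@dataclass(frozen=True)
class BatchResult(Generic[T]):
    """Per-item results plus one summary statistic (hypothetical shape)."""
    results: list[T]
    mean_score: float


@dataclass(frozen=True)
class LexicalResult:
    jaccard: float


class Lexical:
    name = "lexical"

    @staticmethod
    def _jaccard(a: set[str], b: set[str]) -> float:
        union = a | b
        return len(a & b) / len(union) if union else 1.0

    def score(self, candidate: str, reference: Union[str, list[str]]) -> LexicalResult:
        # Normalise the single-reference case to a list, then keep the best match.
        references = [reference] if isinstance(reference, str) else reference
        if not references:
            raise ValueError("at least one reference is required")
        cand = set(candidate.lower().split())
        best = max(self._jaccard(cand, set(r.lower().split())) for r in references)
        return LexicalResult(jaccard=best)

    def batch_score(
        self,
        candidates: list[str],
        references: Union[list[str], list[list[str]]],
    ) -> BatchResult[LexicalResult]:
        if not candidates or len(candidates) != len(references):
            raise ValueError("candidates and references must be non-empty and aligned")
        results = [self.score(c, r) for c, r in zip(candidates, references)]
        return BatchResult(results, mean(r.jaccard for r in results))


metric = Lexical()
batch = metric.batch_score(["the cat", "a dog"], ["the cat", ["a cat", "the dog"]])
print(batch.mean_score)
```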

**Metrics:**
- `Bleu` — BLEU-1 through BLEU-4 with brevity penalty
- `Rouge` — ROUGE-1, ROUGE-2, ROUGE-L with precision/recall/F1
- `Lexical` — Jaccard similarity, token overlap
- `Readability` — Flesch-Kincaid grade level
- `SemanticSimilarity` — Embedding cosine similarity (optional dependency)
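
For ROUGE-L specifically, the core is a longest-common-subsequence computation. The following is a sketch over pre-tokenised input with hypothetical function names; it reports precision, recall, and an unweighted F1, which is one common convention but not the only one.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, using two rolling DP rows."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate: list[str], reference: list[str]) -> tuple[float, float, float]:
    """Return (precision, recall, f1) based on the LCS of the two token lists."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


cand = "the cat sat on the mat".split()
ref = "a cat is sitting on a mat".split()
print(lcs_length(cand, ref))  # 3 (cat, on, mat)
print(rouge_l(cand, ref))
```

Unlike n-gram overlap, the subsequence does not need to be contiguous, which is why this pair scores on ROUGE-L despite sharing no bigrams.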

---

### Component 3: Validator Framework

Composable validation rules with clear pass/fail semantics.

**Built-in validators:**

| Validator | Description |
|-----------|-------------|
| `v.bleu(min_score, variant)` | BLEU score above minimum |
| `v.rouge(min_score, variant)` | ROUGE score above minimum |
| `v.semantic(min_score)` | Semantic similarity above threshold |
| `v.length(min_chars, max_chars)` | Length constraints |
| `v.readability(max_grade)` | Reading level constraint |
| `v.contains(terms)` | Required terms present |
| `v.excludes(terms)` | Forbidden terms absent |
| `v.pattern(regex)` | Regex pattern match |

**Composition:**

```python
# All validators must pass
v.all_of([v.bleu(min_score=0.7), v.length(max_chars=500)])

# At least one must pass
v.any_of([v.contains(["error"]), v.contains(["failed"])])

# Weighted scoring
v.weighted([
    (v.bleu(min_score=0.7), 0.6),
    (v.readability(max_grade=8), 0.4),
], min_score=0.75)
```
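
The combinators can be modelled as plain higher-order functions. In this sketch a validator is simply a callable from text to `bool`, and `weighted` is assumed to take scorers returning values in [0, 1] and compare their weighted average with `min_score`; the plan does not pin down those semantics, so treat them as one possible reading.

```python
from typing import Callable

Validator = Callable[[str], bool]
Scorer = Callable[[str], float]


def max_length(max_chars: int) -> Validator:
    return lambda text: len(text) <= max_chars


def contains(terms: list[str]) -> Validator:
    """Passes when every term occurs in the text, ignoring case."""
    return lambda text: all(term.lower() in text.lower() for term in terms)


def all_of(validators: list[Validator]) -> Validator:
    return lambda text: all(v(text) for v in validators)


def any_of(validators: list[Validator]) -> Validator:
    return lambda text: any(v(text) for v in validators)


def weighted(parts: list[tuple[Scorer, float]], min_score: float) -> Validator:
    """Passes when the weight-normalised average of the scores reaches min_score."""
    total_weight = sum(weight for _, weight in parts)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")

    def validate(text: str) -> bool:
        combined = sum(scorer(text) * weight for scorer, weight in parts) / total_weight
        return combined >= min_score

    return validate


is_short_error = all_of([max_length(20), contains(["error"])])
mentions_problem = any_of([contains(["error"]), contains(["failed"])])
print(is_short_error("Disk error"), mentions_problem("Job failed"))  # True True
```

Because `all` and `any` short-circuit, this version stops at the first decisive check. A framework that reports every check result, as the failure output in this plan does, would need to evaluate all validators before aggregating.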

---

### Component 4: Pytest Plugin

First-class pytest integration for CI/CD pipelines.

**Features:**
- Custom assertions with detailed failure messages
- Fixtures for common validation patterns
- Markers for categorising text tests

**Usage:**

```python
from veritext.pytest_plugin import validate_text

def test_chatbot_response():
    response = chatbot.respond("What are your hours?")

    validate_text(
        response,
        reference="We're open Monday to Friday, 9am to 5pm.",
        min_bleu=0.6,
        min_semantic=0.8,
        max_length=500,
    )
```

**Failure output:**

```
FAILED test_summary.py::test_summary_quality
    AssertionError: Text failed 2 of 4 checks:

    ✗ rouge: 0.58 (minimum: 0.70)
    ✗ semantic: 0.72 (minimum: 0.85)
    ✓ length: 342 (maximum: 500)
    ✓ readability: 6.2 (maximum: 8)

    Candidate: "The company reported losses..."
    Reference: "Financial results showed significant decline..."
```
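
A report in this style could be assembled roughly as follows. The `Check` dataclass and both function names are hypothetical, and numbers are rendered with the `g` format, so a limit of 0.70 prints as `0.7` rather than matching the sample above character for character.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    name: str
    value: float
    limit: float
    kind: str  # "minimum" or "maximum"

    @property
    def passed(self) -> bool:
        if self.kind == "minimum":
            return self.value >= self.limit
        return self.value <= self.limit


def format_report(checks: list[Check]) -> str:
    """List failed checks first, then passed ones, under a one-line summary."""
    failed = [c for c in checks if not c.passed]
    passed = [c for c in checks if c.passed]
    lines = [f"Text failed {len(failed)} of {len(checks)} checks:", ""]
    for check in failed + passed:
        mark = "✓" if check.passed else "✗"
        lines.append(f"{mark} {check.name}: {check.value:g} ({check.kind}: {check.limit:g})")
    return "\n".join(lines)


def assert_checks(checks: list[Check]) -> None:
    """Raise AssertionError with the full report so pytest shows it on failure."""
    if any(not c.passed for c in checks):
        raise AssertionError(format_report(checks))


checks = [
    Check("rouge", 0.58, 0.70, "minimum"),
    Check("semantic", 0.72, 0.85, "minimum"),
    Check("length", 342, 500, "maximum"),
    Check("readability", 6.2, 8, "maximum"),
]
message = format_report(checks)
print(message)
```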

---

### Component 5: Benchmark & Regression Detection

Track quality over time, catch degradations automatically.

**Features:**
- Store historical metric values in SQLite
- Statistical regression detection
- Configurable tolerance thresholds
- CI integration for blocking degradations

**Usage:**

```python
from veritext.benchmark import Benchmark

benchmark = Benchmark("chatbot_quality", storage_path="benchmarks/")

# Record current run (returns BenchmarkRun with metrics and metadata)
run = benchmark.evaluate(
    candidates=current_outputs,
    references=expected_outputs,
    metrics=["rouge_l", "semantic"]
)
# run.metrics = {"rouge_l": 0.82, "semantic": 0.89}

# Compare against historical baseline
regression = benchmark.check_regression(tolerance=0.05, window=10)

if regression.detected:
    print(f"Quality dropped: {regression.summary}")

# In CI: fail the build on regression
benchmark.assert_no_regression(tolerance=0.05)
```
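
The plan leaves the detection method open, so the sketch below shows one simple possibility: store each run's metric in SQLite and flag a regression when the latest value falls more than `tolerance` (treated here as an absolute difference) below the mean of the previous `window` runs. The table layout and function names are assumptions for illustration only.

```python
import sqlite3
from statistics import mean


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the benchmark history database."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS runs ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "benchmark TEXT NOT NULL, metric TEXT NOT NULL, value REAL NOT NULL)"
    )
    return conn


def record(conn: sqlite3.Connection, benchmark: str, metric: str, value: float) -> None:
    conn.execute(
        "INSERT INTO runs (benchmark, metric, value) VALUES (?, ?, ?)",
        (benchmark, metric, value),
    )
    conn.commit()


def regression_detected(
    conn: sqlite3.Connection,
    benchmark: str,
    metric: str,
    tolerance: float = 0.05,
    window: int = 10,
) -> bool:
    """Compare the newest run with the mean of up to `window` earlier runs."""
    rows = conn.execute(
        "SELECT value FROM runs WHERE benchmark = ? AND metric = ? "
        "ORDER BY id DESC LIMIT ?",
        (benchmark, metric, window + 1),
    ).fetchall()
    if len(rows) < 2:
        return False  # no baseline yet
    latest = rows[0][0]
    baseline = mean(row[0] for row in rows[1:])
    return latest < baseline - tolerance


conn = open_store()
for value in (0.82, 0.81, 0.83, 0.80):
    record(conn, "chatbot_quality", "rouge_l", value)
print(regression_detected(conn, "chatbot_quality", "rouge_l"))  # False
```

A fixed tolerance around a rolling mean is easy to reason about but ignores run-to-run variance; meeting the false-positive target in the success criteria may call for a variance-aware test instead.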

---

### Component 6: CLI Tool

Command-line interface for quick validation and benchmarking.

```bash
# Validate a single text
$ veritext validate "Your text here" --reference "Expected text" --metrics bleu,rouge

# Validate from files
$ veritext validate --file outputs.jsonl --reference-file expected.jsonl

# Run benchmark
$ veritext benchmark run summarisation --inputs docs/ --references refs/

# Show benchmark history
$ veritext benchmark show summarisation --last 20

# Check for regression
$ veritext benchmark check summarisation --tolerance 0.05
```

---

## Example Use Cases

### Use Case 1: Chatbot Response Validation

```python
from veritext import validators as v
from veritext.core.types import ValidationContext

# Define acceptable response criteria
response_validator = v.all_of([
    v.length(max_chars=500),
    v.readability(max_grade=8),
    v.excludes(terms=["I don't know", "I'm not sure"]),
])

def test_chatbot_responds_helpfully():
    response = chatbot.respond("How do I reset my password?")
    context = ValidationContext()
    result = response_validator.validate(response, context)
    assert result.passed, result.failure_summary
```

### Use Case 2: Summarisation Quality Gate

```python
from veritext.pytest_plugin import validate_text

def test_summary_captures_key_points():
    article = load_article("financial_report.txt")
    summary = summariser.summarise(article)

    validate_text(
        summary,
        reference=load_reference_summary("financial_report_summary.txt"),
        min_rouge=0.65,
        min_semantic=0.80,
        max_length=300,
    )
```

### Use Case 3: Translation Quality Monitoring

```python
from veritext.benchmark import Benchmark

benchmark = Benchmark("translation_en_de", storage_path="benchmarks/")

# Nightly CI job
results = benchmark.evaluate(
    candidates=translate_batch(test_documents),
    references=human_translations,
    metrics=["bleu4", "semantic"]
)

# Block deployment if quality drops
benchmark.assert_no_regression(tolerance=0.03)
```

---

## Success Criteria

- [ ] BLEU/ROUGE implementations match reference implementations (nltk, rouge-score)
- [ ] Semantic similarity correlates with human judgement on test pairs
- [ ] Pytest plugin installs cleanly via `pip install veritext`
- [ ] Validation of 1000 text pairs completes in <5 seconds (excluding embeddings)
- [ ] Benchmark regression detection has <5% false positive rate
- [ ] Documentation includes working examples for each use case
- [ ] All code passes ruff, mypy strict, and pytest with ≥80% coverage
- [ ] Can explain design decisions and metric theory in interview

---

## Skills Demonstrated

| Skill | How Veritext demonstrates it |
|-------|------------------------------|
| Python framework design | Composable validators, clean API, plugin architecture |
| Test automation | Native pytest integration, CI/CD workflows |
| NLP evaluation metrics | BLEU, ROUGE, semantic similarity implementations |
| Data analysis | Statistical regression detection, batch processing |
| CLI development | Typer-based interface, rich output |
| Software architecture | Layered design, clear separation of concerns |
| Documentation | Comprehensive README, examples |
| Quality engineering | High test coverage, type safety, linting |

---

## What Makes This Project Credible

1. **Solves a real problem** — Anyone building text-based features faces validation
   challenges.

2. **Not tied to a specific technology** — Works with any text source (chatbots, LLMs,
   translation APIs, content generators). It's a general-purpose tool, not an "LLM
   testing framework."

3. **Practical scope** — Not trying to reinvent pytest or build an ML platform. Focused
   on one thing: validating text quality.

4. **Demonstrates depth** — Implementing BLEU/ROUGE from first principles (rather than
   wrapping existing libraries) shows knowledge of how these metrics work.

5. **Natural portfolio narrative** — "I was building X and needed a better way to test
   it, so I built this tool." Most interviewers will recognise the problem.