docs: add project and implementation plans
Comprehensive documentation for Veritext semantic text validation framework:
- Project plan with architecture, use cases, and success criteria
- Implementation plan with 9 phases, interfaces, and verification steps
docs/project-plan.md
@@ -0,0 +1,478 @@
# Project Plan: Veritext, a Semantic Text Validation Framework

## Overview

A Python library for validating text outputs against semantic criteria. Designed for
developers building any system that produces text (chatbots, content generators,
translation pipelines, summarisation tools) who need automated quality assurance
beyond simple string matching.

**Origin story:** "I was building a feature that generated article summaries and got
tired of manually checking if they captured the key points. Existing tools could tell
me if two strings matched, but not if they *meant* the same thing. So I built a
validation framework that understands semantics."

**Portfolio role:** A practical developer tool that demonstrates Python framework
design, NLP evaluation techniques, and test automation integration. The project
solves a real problem any developer working with text processing encounters.

**Target users:** Developers building content pipelines, chatbot teams validating
responses, ML engineers evaluating model outputs, QA teams testing text-based features.

---

## Problem Statement

Text validation is hard. Traditional testing approaches fall short:

| Approach | Problem |
|----------|---------|
| Exact string match | Fails on semantically equivalent variations |
| Substring/regex | Brittle, misses meaning entirely |
| Manual review | Doesn't scale, subjective |
| Generic diff tools | Show *what* changed, not *if it matters* |

**Example:** A summarisation system produces "The CEO announced layoffs affecting 500
employees" one day and "500 workers will lose their jobs, the company's chief executive
said" the next. These are semantically equivalent, but exact-match and diff-based tests
would flag the change as a failure.

Veritext answers "Is this text output *good enough* according to my criteria?", not
"Is it identical?"

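
To make the gap concrete, the sketch below runs two traditional checks on the example pair: exact equality and a word-set overlap. The `jaccard` helper and the whitespace tokenisation are assumptions for illustration only, not part of the planned Veritext API. Both checks score the pair poorly even though the sentences mean the same thing.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of lowercased, whitespace-split word sets (illustrative helper)."""
    set_a = set(a.lower().split())
    set_b = set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


monday = "The CEO announced layoffs affecting 500 employees"
tuesday = "500 workers will lose their jobs, the company's chief executive said"

# Exact matching rejects the pair outright.
exact_match = monday == tuesday

# Word overlap is also low: only "the" and "500" are shared.
overlap = jaccard(monday, tuesday)

print(f"exact match: {exact_match}, lexical overlap: {overlap:.3f}")
```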

---

## Core Concepts

### Metrics (Pure Computation)

Metrics compute scores comparing candidate text to reference text:

```python
from veritext.metrics import Bleu, Rouge

candidate = "The cat sat on the mat"
reference = "A cat is sitting on a mat"

bleu = Bleu()
result = bleu.score(candidate=candidate, reference=reference)
# Result shape (actual values depend on tokenisation and smoothing):
# BleuResult(bleu1=..., bleu2=..., bleu3=..., bleu4=..., brevity_penalty=...)

rouge = Rouge()
result = rouge.score(candidate, reference)
# RougeResult(rouge1=RougeScore(...), rouge2=RougeScore(...), rouge_l=RougeScore(...))
```

**Built-in metrics:**

| Metric | What it measures | Use case |
|--------|------------------|----------|
| BLEU-1 to BLEU-4 | N-gram precision | Translation, generation |
| ROUGE-1, ROUGE-2 | N-gram recall | Summarisation |
| ROUGE-L | Longest common subsequence | Summarisation |
| Semantic similarity | Cosine similarity of embeddings | Any meaning comparison |
| Lexical overlap | Jaccard similarity of tokens | Simple similarity |
| Reading level | Flesch-Kincaid grade | Accessibility |

**Note:** Reading level is a standalone metric that analyses only the candidate text; no reference is required.
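
Because reading level needs only the candidate text, it can be sketched as a single function. The following is a minimal illustration of the Flesch-Kincaid grade formula, not Veritext's implementation: `count_syllables` and `flesch_kincaid_grade` are hypothetical names, and the vowel-group syllable count is a rough heuristic that a real implementation would likely refine.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: one syllable per run of consecutive vowels, minimum one."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade: 0.39 * words/sentences + 11.8 * syllables/words - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


simple = flesch_kincaid_grade("The cat sat on the mat.")
dense = flesch_kincaid_grade(
    "Comprehensive documentation facilitates organisational understanding."
)
print(f"simple: {simple:.2f}, dense: {dense:.2f}")
```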

### Validators (Decision Logic)

Validators wrap metrics and apply thresholds to make pass/fail decisions:

```python
from veritext import validators as v
from veritext.core.types import ValidationContext

# Compose multiple checks
validator = v.all_of([
    v.bleu(min_score=0.7),
    v.length(max_chars=500),
    v.readability(max_grade=8),
])

context = ValidationContext(reference="The quick brown fox jumps over the lazy dog")
result = validator.validate("The fast brown fox leaped over the lazy dog", context)
# ValidationResult(passed=True, checks=[...])
```

### Pytest Integration

Native pytest fixtures and assertions for CI/CD:

```python
from veritext.pytest_plugin import validate_text

# summariser, document and expected_summary are project-specific fixtures
def test_summary_quality(summariser, document, expected_summary):
    summary = summariser.summarise(document)

    validate_text(
        summary,
        reference=expected_summary,
        min_rouge=0.7,
        min_semantic=0.85,
    )
```

### Regression Detection

Track output quality over time and catch degradations before users do:

```python
from veritext.benchmark import Benchmark

# outputs and references are parallel lists of strings from the system under test
benchmark = Benchmark("summarisation_quality", storage_path="benchmarks/")
results = benchmark.evaluate(outputs, references, metrics=["rouge_l", "bleu4"])
benchmark.assert_no_regression(tolerance=0.05)
```

---

## Tech Stack

| Component | Technology | Rationale |
|-----------|------------|-----------|
| Core | Python 3.11+ | Target ecosystem, modern type hints |
| Metrics | Custom implementations | Full control, understanding of algorithms |
| Embeddings | sentence-transformers | Semantic similarity (optional) |
| Test integration | pytest | Fixtures, plugins, assertions |
| CLI | typer | Consistent with portfolio projects |
| Data handling | pydantic | Validation, serialisation |
| Storage | SQLite | Benchmark history, lightweight |
| Output | rich | Terminal formatting |

---

## Architecture

### Layered Design

```
┌─────────────────────────────────────────────────────┐
│  CLI / pytest_plugin  (presentation layer)          │
├─────────────────────────────────────────────────────┤
│  validators/          (decision logic)              │
│  benchmark/           (tracking & regression)       │
├─────────────────────────────────────────────────────┤
│  metrics/             (pure computation)            │
├─────────────────────────────────────────────────────┤
│  core/                (shared types, tokenisation)  │
└─────────────────────────────────────────────────────┘
```

**Dependency rule:** Each layer depends only on layers below it.

### Key Design Decisions

1. **Metrics vs Validators separation:** Metrics compute scores; validators make
   pass/fail decisions. Clear separation of concerns.

2. **Typed result objects:** Each metric returns a specific result type (e.g.,
   `BleuResult`, `RougeResult`), not just `float`. Full information preserved.

3. **Optional heavy dependencies:** `sentence-transformers` (~2GB with PyTorch) is
   optional. Core library works without ML dependencies.

4. **Shared tokenisation:** Single `Tokeniser` protocol used by all n-gram metrics.
   Consistent behaviour across BLEU and ROUGE.

5. **Explicit context:** `ValidationContext` dataclass instead of `**kwargs`.
   Type-safe, discoverable API.
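
One way the optional-dependency decision could be realised is a guarded, lazy import, sketched below. The names `has_semantic_extra`, `MissingDependencyError`, and `load_embedding_model`, the `veritext[semantic]` extra, and the default model name are all assumptions for illustration; only `importlib.util.find_spec` and `sentence_transformers.SentenceTransformer` are existing APIs.

```python
import importlib.util


class MissingDependencyError(ImportError):
    """Hypothetical error raised when an optional extra is not installed."""


def has_semantic_extra() -> bool:
    """Return True if the optional sentence-transformers package is importable."""
    return importlib.util.find_spec("sentence_transformers") is not None


def load_embedding_model(model_name: str = "all-MiniLM-L6-v2"):
    """Load an embedding model lazily so the core library never imports PyTorch."""
    if not has_semantic_extra():
        # The extra name below is an assumption, not a published package extra.
        raise MissingDependencyError(
            "Semantic similarity needs sentence-transformers; "
            "try: pip install 'veritext[semantic]'"
        )
    # Imported inside the function so importing this module stays cheap.
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(model_name)


print("semantic extra available:", has_semantic_extra())
```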

---

## Project Components

### Component 1: Core Module

Shared types, exceptions, and tokenisation.

**Types:**
- `ValidationContext`: reference text and metadata for validation
- `CheckResult`: individual check result with diagnostics
- `ValidationResult`: aggregate result with pass/fail and all checks
- `BatchResult`: statistics over multiple evaluations

**Tokeniser:**
```python
from typing import Protocol

class Tokeniser(Protocol):
    def tokenise(self, text: str) -> list[str]: ...

class WordTokeniser:
    def __init__(self, lowercase: bool = True, remove_punctuation: bool = True): ...
```
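
The `WordTokeniser` body is left open above. A minimal sketch of one possible implementation follows, assuming whitespace splitting and ASCII punctuation stripping via `string.punctuation`; the plan does not specify either choice, so treat both as placeholders.

```python
import string
from typing import Protocol


class Tokeniser(Protocol):
    def tokenise(self, text: str) -> list[str]: ...


class WordTokeniser:
    """Whitespace tokeniser with optional lowercasing and punctuation removal (sketch)."""

    def __init__(self, lowercase: bool = True, remove_punctuation: bool = True) -> None:
        self.lowercase = lowercase
        self.remove_punctuation = remove_punctuation
        # Translation table that deletes every ASCII punctuation character.
        self._strip_table = str.maketrans("", "", string.punctuation)

    def tokenise(self, text: str) -> list[str]:
        if self.lowercase:
            text = text.lower()
        if self.remove_punctuation:
            # Note: this also drops apostrophes, so "We're" becomes "were".
            text = text.translate(self._strip_table)
        return text.split()


def count_tokens(tokeniser: Tokeniser, text: str) -> int:
    """Any object with a matching tokenise() method satisfies the protocol."""
    return len(tokeniser.tokenise(text))


print(WordTokeniser().tokenise("The cat, the mat!"))
```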

---

### Component 2: Metric Engine

Pure implementations of text evaluation metrics.

**Interface:**
```python
from typing import Protocol, TypeVar

from veritext.core.types import BatchResult

T = TypeVar("T")  # metric-specific result type, e.g. BleuResult

class Metric(Protocol[T]):
    @property
    def name(self) -> str: ...

    def score(self, candidate: str, reference: str | list[str]) -> T: ...

    def batch_score(
        self,
        candidates: list[str],
        references: list[str] | list[list[str]]
    ) -> BatchResult[T]: ...
```

**Metrics:**
- `Bleu`: BLEU-1 through BLEU-4 with brevity penalty
- `Rouge`: ROUGE-1, ROUGE-2, ROUGE-L with precision/recall/F1
- `Lexical`: Jaccard similarity, token overlap
- `Readability`: Flesch-Kincaid grade level
- `SemanticSimilarity`: Embedding cosine similarity (optional dependency)
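
As a rough illustration of what the `Bleu` metric has to compute, the sketch below implements sentence-level BLEU for a single reference: clipped n-gram precision, a geometric mean over n = 1..4, and the brevity penalty. The function names are hypothetical, there is no smoothing, and the inputs are assumed to be pre-tokenised; the planned implementation may differ on all three points.

```python
import math
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate: list[str], reference: list[str], n: int) -> float:
    """Clipped precision: each candidate n-gram counts at most as often as in the reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def brevity_penalty(cand_len: int, ref_len: int) -> float:
    """Penalise candidates shorter than the reference; 1.0 otherwise."""
    if cand_len == 0:
        return 0.0
    if cand_len > ref_len:
        return 1.0
    return math.exp(1 - ref_len / cand_len)


def bleu(candidate: list[str], reference: list[str], max_n: int = 4) -> float:
    """Unsmoothed sentence BLEU: brevity penalty times geometric mean of precisions."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # without smoothing, any zero precision zeroes the score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(len(candidate), len(reference)) * math.exp(log_mean)


cand = "the fast brown fox leaped over the lazy dog".split()
ref = "the quick brown fox jumps over the lazy dog".split()
print(f"BLEU-1 precision: {modified_precision(cand, ref, 1):.3f}, BLEU-4: {bleu(cand, ref):.3f}")
```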

---

### Component 3: Validator Framework

Composable validation rules with clear pass/fail semantics.

**Built-in validators:**

| Validator | Description |
|-----------|-------------|
| `v.bleu(min_score, variant)` | BLEU score above minimum |
| `v.rouge(min_score, variant)` | ROUGE score above minimum |
| `v.semantic(min_score)` | Semantic similarity above threshold |
| `v.length(min_chars, max_chars)` | Length constraints |
| `v.readability(max_grade)` | Reading level constraint |
| `v.contains(terms)` | Required terms present |
| `v.excludes(terms)` | Forbidden terms absent |
| `v.pattern(regex)` | Regex pattern match |

**Composition:**

```python
# All validators must pass
v.all_of([v.bleu(min_score=0.7), v.length(max_chars=500)])

# At least one must pass
v.any_of([v.contains(["error"]), v.contains(["failed"])])

# Weighted scoring
v.weighted([
    (v.bleu(min_score=0.7), 0.6),
    (v.readability(max_grade=8), 0.4),
], min_score=0.75)
```
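
The combinators above could be built from plain functions, as in the sketch below. It is a simplified model under stated assumptions: checks take only the candidate text (no `ValidationContext`), each check scores 1.0 on pass and 0.0 on fail, and `Outcome`, `max_chars`, and `contains` are hypothetical stand-ins for the real validators.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Outcome:
    passed: bool
    score: float  # in this sketch: 1.0 for a pass, 0.0 for a fail


Check = Callable[[str], Outcome]


def max_chars(limit: int) -> Check:
    def check(text: str) -> Outcome:
        ok = len(text) <= limit
        return Outcome(ok, 1.0 if ok else 0.0)
    return check


def contains(term: str) -> Check:
    def check(text: str) -> Outcome:
        ok = term in text
        return Outcome(ok, 1.0 if ok else 0.0)
    return check


def all_of(checks: list[Check]) -> Check:
    """Pass only if every child check passes."""
    def check(text: str) -> Outcome:
        results = [c(text) for c in checks]
        return Outcome(all(r.passed for r in results),
                       min((r.score for r in results), default=1.0))
    return check


def any_of(checks: list[Check]) -> Check:
    """Pass if at least one child check passes."""
    def check(text: str) -> Outcome:
        results = [c(text) for c in checks]
        return Outcome(any(r.passed for r in results),
                       max((r.score for r in results), default=0.0))
    return check


def weighted(pairs: list[tuple[Check, float]], min_score: float) -> Check:
    """Pass if the weight-normalised average of child scores reaches min_score."""
    total_weight = sum(weight for _, weight in pairs)
    def check(text: str) -> Outcome:
        score = sum(c(text).score * weight for c, weight in pairs) / total_weight
        return Outcome(score >= min_score, score)
    return check


validator = all_of([max_chars(20), contains("error")])
print(validator("disk error"))
```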

---

### Component 4: Pytest Plugin

First-class pytest integration for CI/CD pipelines.

**Features:**
- Custom assertions with detailed failure messages
- Fixtures for common validation patterns
- Markers for categorising text tests

**Usage:**

```python
from veritext.pytest_plugin import validate_text

# chatbot is a project-specific fixture
def test_chatbot_response(chatbot):
    response = chatbot.respond("What are your hours?")

    validate_text(
        response,
        reference="We're open Monday to Friday, 9am to 5pm.",
        min_bleu=0.6,
        min_semantic=0.8,
        max_length=500,
    )
```

**Failure output:**

```
FAILED test_summary.py::test_summary_quality
AssertionError: Text failed 2 of 4 checks:

  ✗ rouge: 0.58 (minimum: 0.70)
  ✗ semantic: 0.72 (minimum: 0.85)
  ✓ length: 342 (maximum: 500)
  ✓ readability: 6.2 (maximum: 8)

Candidate: "The company reported losses..."
Reference: "Financial results showed significant decline..."
```
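
A failure report like the one above could be assembled from per-check records, roughly as sketched here. The `Check` dataclass, `failure_message`, and `assert_checks` are hypothetical names, and the number formatting (`:g`) is an assumption that prints `0.7` where the sample output shows `0.70`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    name: str
    value: float
    limit: float
    kind: str  # "minimum" or "maximum"

    @property
    def passed(self) -> bool:
        if self.kind == "minimum":
            return self.value >= self.limit
        return self.value <= self.limit


def failure_message(checks: list[Check]) -> str:
    """Summarise all checks, listing failures before passes."""
    failed = [c for c in checks if not c.passed]
    lines = [f"Text failed {len(failed)} of {len(checks)} checks:", ""]
    # sorted() is stable and False sorts before True, so failures come first.
    for c in sorted(checks, key=lambda c: c.passed):
        mark = "✓" if c.passed else "✗"
        lines.append(f"  {mark} {c.name}: {c.value:g} ({c.kind}: {c.limit:g})")
    return "\n".join(lines)


def assert_checks(checks: list[Check]) -> None:
    """Raise AssertionError with the full report if any check failed."""
    if any(not c.passed for c in checks):
        raise AssertionError(failure_message(checks))


checks = [
    Check("rouge", 0.58, 0.70, "minimum"),
    Check("semantic", 0.72, 0.85, "minimum"),
    Check("length", 342, 500, "maximum"),
    Check("readability", 6.2, 8, "maximum"),
]
print(failure_message(checks))
```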

---

### Component 5: Benchmark & Regression Detection

Track quality over time and catch degradations automatically.

**Features:**
- Store historical metric values in SQLite
- Statistical regression detection
- Configurable tolerance thresholds
- CI integration for blocking degradations

**Usage:**

```python
from veritext.benchmark import Benchmark

benchmark = Benchmark("chatbot_quality", storage_path="benchmarks/")

# Record current run (returns BenchmarkRun with metrics and metadata)
run = benchmark.evaluate(
    candidates=current_outputs,
    references=expected_outputs,
    metrics=["rouge_l", "semantic"]
)
# e.g. run.metrics == {"rouge_l": 0.82, "semantic": 0.89}

# Compare against historical baseline
regression = benchmark.check_regression(tolerance=0.05, window=10)

if regression.detected:
    print(f"Quality dropped: {regression.summary}")

# In CI: fail the build on regression
benchmark.assert_no_regression(tolerance=0.05)
```
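
The plan does not define how `tolerance` and `window` are interpreted. One plausible reading, sketched below with the standard-library `sqlite3` module, treats a regression as the latest value falling more than `tolerance` (an absolute amount) below the mean of the previous `window` runs. The table layout and the function names are assumptions for illustration.

```python
import sqlite3
from statistics import mean


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a run-history table; the schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS runs ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "benchmark TEXT NOT NULL, metric TEXT NOT NULL, value REAL NOT NULL)"
    )
    return conn


def record(conn: sqlite3.Connection, benchmark: str, metrics: dict[str, float]) -> None:
    """Store one run's metric values."""
    conn.executemany(
        "INSERT INTO runs (benchmark, metric, value) VALUES (?, ?, ?)",
        [(benchmark, name, value) for name, value in metrics.items()],
    )
    conn.commit()


def check_regression(conn: sqlite3.Connection, benchmark: str, metric: str,
                     tolerance: float = 0.05, window: int = 10) -> bool:
    """True if the latest value is more than `tolerance` below the recent mean."""
    rows = conn.execute(
        "SELECT value FROM runs WHERE benchmark = ? AND metric = ? "
        "ORDER BY id DESC LIMIT ?",
        (benchmark, metric, window + 1),
    ).fetchall()
    values = [row[0] for row in rows]
    if len(values) < 2:
        return False  # no history to compare against yet
    latest, history = values[0], values[1:]
    return latest < mean(history) - tolerance


conn = open_store()
for value in (0.82, 0.81, 0.83):
    record(conn, "chatbot_quality", {"rouge_l": value})
stable = check_regression(conn, "chatbot_quality", "rouge_l")

record(conn, "chatbot_quality", {"rouge_l": 0.70})
dropped = check_regression(conn, "chatbot_quality", "rouge_l")
print(f"stable run flagged: {stable}, degraded run flagged: {dropped}")
```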

---

### Component 6: CLI Tool

Command-line interface for quick validation and benchmarking.

```bash
# Validate a single text
$ veritext validate "Your text here" --reference "Expected text" --metrics bleu,rouge

# Validate from files
$ veritext validate --file outputs.jsonl --reference-file expected.jsonl

# Run benchmark
$ veritext benchmark run summarisation --inputs docs/ --references refs/

# Show benchmark history
$ veritext benchmark show summarisation --last 20

# Check for regression
$ veritext benchmark check summarisation --tolerance 0.05
```

---

## Example Use Cases

### Use Case 1: Chatbot Response Validation

```python
from veritext import validators as v
from veritext.core.types import ValidationContext

# Define acceptable response criteria
response_validator = v.all_of([
    v.length(max_chars=500),
    v.readability(max_grade=8),
    v.excludes(terms=["I don't know", "I'm not sure"]),
])

# chatbot is a project-specific fixture
def test_chatbot_responds_helpfully(chatbot):
    response = chatbot.respond("How do I reset my password?")
    context = ValidationContext()
    result = response_validator.validate(response, context)
    assert result.passed, result.failure_summary
```

### Use Case 2: Summarisation Quality Gate

```python
from veritext.pytest_plugin import validate_text

# summariser is a project-specific fixture; load_article and
# load_reference_summary are project-specific helpers
def test_summary_captures_key_points(summariser):
    article = load_article("financial_report.txt")
    summary = summariser.summarise(article)

    validate_text(
        summary,
        reference=load_reference_summary("financial_report_summary.txt"),
        min_rouge=0.65,
        min_semantic=0.80,
        max_length=300,
    )
```

### Use Case 3: Translation Quality Monitoring

```python
from veritext.benchmark import Benchmark

benchmark = Benchmark("translation_en_de", storage_path="benchmarks/")

# Nightly CI job; translate_batch, test_documents and human_translations
# come from the project under test
results = benchmark.evaluate(
    candidates=translate_batch(test_documents),
    references=human_translations,
    metrics=["bleu4", "semantic"]
)

# Block deployment if quality drops
benchmark.assert_no_regression(tolerance=0.03)
```

---

## Success Criteria

- [ ] BLEU/ROUGE implementations match reference implementations (nltk, rouge-score)
- [ ] Semantic similarity correlates with human judgement on test pairs
- [ ] Pytest plugin installs cleanly via `pip install veritext`
- [ ] Validation of 1000 text pairs completes in <5 seconds (excluding embeddings)
- [ ] Benchmark regression detection has <5% false positive rate
- [ ] Documentation includes working examples for each use case
- [ ] All code passes ruff, mypy strict, and pytest with ≥80% coverage
- [ ] Can explain design decisions and metric theory in an interview

---

## Skills Demonstrated

| Skill | How Veritext demonstrates it |
|-------|------------------------------|
| Python framework design | Composable validators, clean API, plugin architecture |
| Test automation | Native pytest integration, CI/CD workflows |
| NLP evaluation metrics | BLEU, ROUGE, semantic similarity implementations |
| Data analysis | Statistical regression detection, batch processing |
| CLI development | Typer-based interface, rich output |
| Software architecture | Layered design, clear separation of concerns |
| Documentation | Comprehensive README, examples |
| Quality engineering | High test coverage, type safety, linting |

---

## What Makes This Project Credible

1. **Solves a real problem:** Anyone building text-based features faces validation
   challenges.

2. **Not tied to a specific technology:** Works with any text source (chatbots, LLMs,
   translation APIs, content generators). It's a general-purpose tool, not an "LLM
   testing framework."

3. **Practical scope:** Not trying to reinvent pytest or build an ML platform. Focused
   on one thing: validating text quality.

4. **Demonstrates depth:** Implementing BLEU/ROUGE from first principles (rather than
   wrapping libraries) shows knowledge of how these metrics work.

5. **Natural portfolio narrative:** "I was building X and needed a better way to test
   it, so I built this tool." Every interviewer has faced similar problems.