# Project Plan: Veritext — Semantic Text Validation Framework

## Overview

A Python library for validating text outputs against semantic criteria. Designed for
developers building any system that produces text — chatbots, content generators,
translation pipelines, summarisation tools — who need automated quality assurance
beyond simple string matching.

**Origin story:** "I was building a feature that generated article summaries and got
tired of manually checking if they captured the key points. Existing tools could tell
me if two strings matched, but not if they *meant* the same thing. So I built a
validation framework that understands semantics."

**Portfolio role:** A practical developer tool that demonstrates Python framework
design, NLP evaluation techniques, and test automation integration. The project
solves a real problem any developer working with text processing encounters.

**Target users:** Developers building content pipelines, chatbot teams validating
responses, ML engineers evaluating model outputs, QA teams testing text-based features.

---

## Problem Statement

Text validation is hard. Traditional testing approaches fall short:

| Approach | Problem |
|----------|---------|
| Exact string match | Fails on semantically equivalent variations |
| Substring/regex | Brittle, misses meaning entirely |
| Manual review | Doesn't scale, subjective |
| Generic diff tools | Show *what* changed, not *if it matters* |

**Example:** A summarisation system produces "The CEO announced layoffs affecting 500
employees" one day and "500 workers will lose their jobs, the company's chief executive
said" the next. These are semantically equivalent, but an exact-match, regex, or
diff-based check would flag the second output as a failure.

Veritext answers: "Is this text output *good enough* according to my criteria?" — not
"Is it identical?"

---

## Core Concepts

### Metrics (Pure Computation)

Metrics compute scores comparing candidate text to reference text:

```python
from veritext.metrics import Bleu, Rouge

candidate = "The cat sat on the mat"
reference = "A cat is sitting on a mat"

bleu = Bleu()
result = bleu.score(candidate=candidate, reference=reference)
# BleuResult(bleu1=..., bleu2=..., bleu3=..., bleu4=..., brevity_penalty=...)
# The candidate (6 tokens) is shorter than the reference (7 tokens),
# so brevity_penalty is below 1.0 for this pair.

rouge = Rouge()
result = rouge.score(candidate, reference)
# RougeResult(rouge1=RougeScore(...), rouge2=RougeScore(...), rouge_l=RougeScore(...))
```

**Built-in metrics:**

| Metric | What it measures | Use case |
|--------|------------------|----------|
| BLEU-1 to BLEU-4 | N-gram precision | Translation, generation |
| ROUGE-1, ROUGE-2 | N-gram recall | Summarisation |
| ROUGE-L | Longest common subsequence | Summarisation |
| Semantic similarity | Cosine similarity of embeddings | Any meaning comparison |
| Lexical overlap | Jaccard similarity of tokens | Simple similarity |
| Reading level | Flesch-Kincaid grade | Accessibility |

**Note:** Reading level is a standalone metric that analyses only the candidate text — no reference required.

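To make "n-gram precision" concrete, here is a minimal sketch of the two building blocks of BLEU: clipped (modified) n-gram precision and the brevity penalty. It is an illustration of the standard definitions and is not Veritext's implementation. The function names are hypothetical, and whitespace splitting stands in for the library's tokeniser.

```python
import math
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate: list[str], reference: list[str], n: int) -> float:
    """Clipped n-gram precision: a candidate n-gram is credited at most
    as many times as it occurs in the reference."""
    cand_counts = ngrams(candidate, n)
    ref_counts = ngrams(reference, n)
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


def brevity_penalty(candidate_len: int, reference_len: int) -> float:
    """Penalise candidates that are shorter than the reference."""
    if candidate_len == 0:
        return 0.0
    if candidate_len > reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)


# Hypothetical example pair, lowercased and split on whitespace.
candidate = "the cat sat on the mat".split()
reference = "a cat is sitting on a mat".split()

p1 = modified_precision(candidate, reference, 1)  # "cat", "on", "mat" match: 3 of 6
p2 = modified_precision(candidate, reference, 2)  # no bigram is shared
bp = brevity_penalty(len(candidate), len(reference))  # exp(1 - 7/6)
```

A BLEU-n score combines the brevity penalty with the geometric mean of the precisions up to order n. Because `p2` is zero for this pair, an unsmoothed BLEU-2 or higher would also be zero, which is why implementations commonly offer smoothing.
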
### Validators (Decision Logic)

Validators wrap metrics and apply thresholds to make pass/fail decisions:

```python
from veritext import validators as v
from veritext.core.types import ValidationContext

# Compose multiple checks
validator = v.all_of([
    v.bleu(min_score=0.7),
    v.length(max_chars=500),
    v.readability(max_grade=8),
])

context = ValidationContext(reference="The quick brown fox jumps over the lazy dog")
result = validator.validate("The fast brown fox leaped over the lazy dog", context)
# ValidationResult(passed=True, checks=[...]) when every check meets its threshold
```

### Pytest Integration

Native pytest fixtures and assertions for CI/CD:

```python
from veritext.pytest_plugin import validate_text


# summariser, document and expected_summary are project-specific fixtures
def test_summary_quality(summariser, document, expected_summary):
    summary = summariser.summarise(document)

    validate_text(
        summary,
        reference=expected_summary,
        min_rouge=0.7,
        min_semantic=0.85,
    )
```

### Regression Detection

Track output quality over time and catch degradations before users do:

```python
from veritext.benchmark import Benchmark

benchmark = Benchmark("summarisation_quality", storage_path="benchmarks/")
# outputs and references are parallel lists of strings
results = benchmark.evaluate(
    candidates=outputs,
    references=references,
    metrics=["rouge_l", "bleu4"],
)
benchmark.assert_no_regression(tolerance=0.05)
```

---

## Tech Stack

| Component | Technology | Rationale |
|-----------|------------|-----------|
| Core | Python 3.11+ | Target ecosystem, modern type hints |
| Metrics | Custom implementations | Full control, understanding of algorithms |
| Embeddings | sentence-transformers | Semantic similarity (optional) |
| Test integration | pytest | Fixtures, plugins, assertions |
| CLI | typer | Consistent with portfolio projects |
| Data handling | pydantic | Validation, serialisation |
| Storage | SQLite | Benchmark history, lightweight |
| Output | rich | Terminal formatting |

---

## Architecture

### Layered Design

```
┌─────────────────────────────────────────────────────┐
│  CLI / pytest_plugin   (presentation layer)         │
├─────────────────────────────────────────────────────┤
│  validators/           (decision logic)             │
│  benchmark/            (tracking & regression)      │
├─────────────────────────────────────────────────────┤
│  metrics/              (pure computation)           │
├─────────────────────────────────────────────────────┤
│  core/                 (shared types, tokenisation) │
└─────────────────────────────────────────────────────┘
```

**Dependency rule:** Each layer depends only on layers below it.

### Key Design Decisions

1. **Metrics vs Validators separation** — Metrics compute scores; validators make
   pass/fail decisions. Clear separation of concerns.

2. **Typed result objects** — Each metric returns a specific result type (e.g.,
   `BleuResult`, `RougeResult`), not just `float`. Full information preserved.

3. **Optional heavy dependencies** — `sentence-transformers` (~2GB with PyTorch) is
   optional. Core library works without ML dependencies.

4. **Shared tokenisation** — Single `Tokeniser` protocol used by all n-gram metrics.
   Consistent behaviour across BLEU and ROUGE.

5. **Explicit context** — `ValidationContext` dataclass instead of `**kwargs`.
   Type-safe, discoverable API.

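Decision 3 is commonly implemented by deferring the heavy import until a feature needs it. The sketch below shows one way to do that with the standard library. The helper name `require_optional` and the extra name `semantic` are assumptions made for illustration and are not part of the plan above. The helper expects a top-level module name.

```python
import importlib
import importlib.util
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency on first use, or fail with an
    actionable message naming the (hypothetical) extra to install."""
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"'{module_name}' is required for this feature. "
            f"Install it with: pip install 'veritext[{extra}]'"
        )
    return importlib.import_module(module_name)


# A metric that needs embeddings would call this inside its constructor,
# so that importing the core package never touches the heavy dependency:
#     st = require_optional("sentence_transformers", extra="semantic")
json_module = require_optional("json", extra="semantic")  # stdlib stand-in
```
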
---

## Project Components

### Component 1: Core Module

Shared types, exceptions, and tokenisation.

**Types:**
- `ValidationContext` — reference text and metadata for validation
- `CheckResult` — individual check result with diagnostics
- `ValidationResult` — aggregate result with pass/fail and all checks
- `BatchResult` — statistics over multiple evaluations

**Tokeniser:**
```python
from typing import Protocol


class Tokeniser(Protocol):
    def tokenise(self, text: str) -> list[str]: ...


class WordTokeniser:
    def __init__(self, lowercase: bool = True, remove_punctuation: bool = True): ...
```

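The interface above leaves the tokenisation rules open. As a minimal sketch, assuming a regex-based word splitter that keeps internal apostrophes, a `WordTokeniser` satisfying the protocol could look like this. The regular expressions are an illustrative choice and are not specified by the plan.

```python
import re
from typing import Protocol


class Tokeniser(Protocol):
    def tokenise(self, text: str) -> list[str]: ...


class WordTokeniser:
    """Regex-based word tokeniser (illustrative rules, not the final design)."""

    def __init__(self, lowercase: bool = True, remove_punctuation: bool = True) -> None:
        self.lowercase = lowercase
        self.remove_punctuation = remove_punctuation

    def tokenise(self, text: str) -> list[str]:
        if self.lowercase:
            text = text.lower()
        if self.remove_punctuation:
            # Keep runs of word characters, allowing internal apostrophes.
            return re.findall(r"\w+(?:'\w+)*", text)
        # Otherwise emit words and punctuation marks as separate tokens.
        return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)


tokeniser: Tokeniser = WordTokeniser()
tokens = tokeniser.tokenise("The cat's mat, again!")
# tokens == ["the", "cat's", "mat", "again"]
```
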
---

### Component 2: Metric Engine

Pure implementations of text evaluation metrics.

**Interface:**
```python
from typing import Protocol, TypeVar

from veritext.core.types import BatchResult  # assumed location of the core types

T = TypeVar("T")  # metric-specific result type, e.g. BleuResult


class Metric(Protocol[T]):
    @property
    def name(self) -> str: ...

    def score(self, candidate: str, reference: str | list[str]) -> T: ...

    def batch_score(
        self,
        candidates: list[str],
        references: list[str] | list[list[str]],
    ) -> BatchResult[T]: ...
```

**Metrics:**
- `Bleu` — BLEU-1 through BLEU-4 with brevity penalty
- `Rouge` — ROUGE-1, ROUGE-2, ROUGE-L with precision/recall/F1
- `Lexical` — Jaccard similarity, token overlap
- `Readability` — Flesch-Kincaid grade level
- `SemanticSimilarity` — Embedding cosine similarity (optional dependency)

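As an illustration of what a metric computes, here is a minimal sketch of ROUGE-L: precision, recall, and F1 derived from the longest common subsequence (LCS) of the two token lists. It follows the standard definition with a balanced F1. The function names are hypothetical, and the real `Rouge` class would return a typed result object instead of a tuple.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, via dynamic programming
    that keeps only the previous row of the table."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l(candidate: list[str], reference: list[str]) -> tuple[float, float, float]:
    """Return (precision, recall, f1) based on the LCS length."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example pair; the LCS is "cat on mat" (length 3).
candidate = "the cat sat on the mat".split()
reference = "a cat is sitting on a mat".split()
precision, recall, f1 = rouge_l(candidate, reference)
```
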
---

### Component 3: Validator Framework

Composable validation rules with clear pass/fail semantics.

**Built-in validators:**

| Validator | Description |
|-----------|-------------|
| `v.bleu(min_score, variant)` | BLEU score above minimum |
| `v.rouge(min_score, variant)` | ROUGE score above minimum |
| `v.semantic(min_score)` | Semantic similarity above threshold |
| `v.length(min_chars, max_chars)` | Length constraints |
| `v.readability(max_grade)` | Reading level constraint |
| `v.contains(terms)` | Required terms present |
| `v.excludes(terms)` | Forbidden terms absent |
| `v.pattern(regex)` | Regex pattern match |

**Composition:**

```python
# All validators must pass
v.all_of([v.bleu(min_score=0.7), v.length(max_chars=500)])

# At least one must pass
v.any_of([v.contains(["error"]), v.contains(["failed"])])

# Weighted scoring
v.weighted([
    (v.bleu(min_score=0.7), 0.6),
    (v.readability(max_grade=8), 0.4),
], min_score=0.75)
```

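The combinators can be understood as higher-order functions over validators. The sketch below deliberately simplifies: a validator is a plain `str -> bool` callable, and `weighted` is read as "the weights of the passing validators, as a share of the total weight, must reach `min_score`". Both are assumptions made for illustration. The planned API returns `ValidationResult` objects, takes a `ValidationContext`, and would evaluate every check for diagnostics instead of short-circuiting.

```python
from typing import Callable, Sequence

Validator = Callable[[str], bool]  # simplified stand-in for the real validator type


def all_of(validators: Sequence[Validator]) -> Validator:
    return lambda text: all(validator(text) for validator in validators)


def any_of(validators: Sequence[Validator]) -> Validator:
    return lambda text: any(validator(text) for validator in validators)


def weighted(pairs: Sequence[tuple[Validator, float]], min_score: float) -> Validator:
    total = sum(weight for _, weight in pairs)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")

    def validate(text: str) -> bool:
        earned = sum(weight for validator, weight in pairs if validator(text))
        return earned / total >= min_score

    return validate


# Two reference-free leaf validators, also simplified.
def max_chars(limit: int) -> Validator:
    return lambda text: len(text) <= limit


def contains(terms: Sequence[str]) -> Validator:
    return lambda text: all(term.lower() in text.lower() for term in terms)


strict = all_of([max_chars(20), contains(["error"])])
lenient = any_of([contains(["error"]), contains(["failed"])])
scored = weighted([(max_chars(20), 0.6), (contains(["error"]), 0.4)], min_score=0.75)
```
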
---

### Component 4: Pytest Plugin

First-class pytest integration for CI/CD pipelines.

**Features:**
- Custom assertions with detailed failure messages
- Fixtures for common validation patterns
- Markers for categorising text tests

**Usage:**

```python
from veritext.pytest_plugin import validate_text


# chatbot is a project-specific fixture
def test_chatbot_response(chatbot):
    response = chatbot.respond("What are your hours?")

    validate_text(
        response,
        reference="We're open Monday to Friday, 9am to 5pm.",
        min_bleu=0.6,
        min_semantic=0.8,
        max_length=500,
    )
```

**Failure output:**

```
FAILED test_summary.py::test_summary_quality
AssertionError: Text failed 2 of 4 checks:

  ✗ rouge: 0.58 (minimum: 0.70)
  ✗ semantic: 0.72 (minimum: 0.85)
  ✓ length: 342 (maximum: 500)
  ✓ readability: 6.2 (maximum: 8)

Candidate: "The company reported losses..."
Reference: "Financial results showed significant decline..."
```

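A failure report of this shape can be assembled from per-check records. The following is a minimal sketch: the `CheckResult` fields shown here are hypothetical (the plan names the type but not its attributes), and the number formatting is simplified, so `0.70` prints as `0.7`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    """Hypothetical record for one threshold check."""
    name: str
    value: float
    limit: float
    kind: str  # "minimum" or "maximum"

    @property
    def passed(self) -> bool:
        if self.kind == "minimum":
            return self.value >= self.limit
        return self.value <= self.limit


def failure_message(checks: list[CheckResult]) -> str:
    """Summarise the checks, listing failures before passes."""
    failed = [check for check in checks if not check.passed]
    passed = [check for check in checks if check.passed]
    lines = [f"Text failed {len(failed)} of {len(checks)} checks:", ""]
    for check in failed + passed:
        mark = "✓" if check.passed else "✗"
        lines.append(f"  {mark} {check.name}: {check.value:g} ({check.kind}: {check.limit:g})")
    return "\n".join(lines)


checks = [
    CheckResult("rouge", 0.58, 0.70, "minimum"),
    CheckResult("semantic", 0.72, 0.85, "minimum"),
    CheckResult("length", 342, 500, "maximum"),
    CheckResult("readability", 6.2, 8, "maximum"),
]
message = failure_message(checks)

# An assertion helper would raise with this message when any check fails:
#     if any(not check.passed for check in checks):
#         raise AssertionError(message)
```
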
---

### Component 5: Benchmark & Regression Detection

Track quality over time, catch degradations automatically.

**Features:**
- Store historical metric values in SQLite
- Statistical regression detection
- Configurable tolerance thresholds
- CI integration for blocking degradations

**Usage:**

```python
from veritext.benchmark import Benchmark

benchmark = Benchmark("chatbot_quality", storage_path="benchmarks/")

# Record current run (returns BenchmarkRun with metrics and metadata)
run = benchmark.evaluate(
    candidates=current_outputs,
    references=expected_outputs,
    metrics=["rouge_l", "semantic"],
)
# run.metrics = {"rouge_l": 0.82, "semantic": 0.89}

# Compare against historical baseline
regression = benchmark.check_regression(tolerance=0.05, window=10)

if regression.detected:
    print(f"Quality dropped: {regression.summary}")

# In CI: fail the build on regression
benchmark.assert_no_regression(tolerance=0.05)
```

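The plan does not fix the statistical rule behind `check_regression`. One simple reading, sketched below, treats the mean of the last `window` runs as the baseline and flags a regression when the current score falls more than `tolerance` (an absolute difference) below it. Both choices are assumptions for illustration, and the free function `is_regression` is hypothetical.

```python
from statistics import mean


def is_regression(
    history: list[float],
    current: float,
    tolerance: float = 0.05,
    window: int = 10,
) -> bool:
    """Return True if `current` is more than `tolerance` below the mean
    of the most recent `window` historical scores."""
    if window <= 0:
        raise ValueError("window must be positive")
    recent = history[-window:]
    if not recent:
        return False  # no baseline yet, so nothing to regress from
    baseline = mean(recent)
    return current < baseline - tolerance


# Hypothetical ROUGE-L history from three earlier runs (baseline mean 0.82).
history = [0.82, 0.80, 0.84]
dropped = is_regression(history, current=0.75)  # 0.75 is below 0.82 - 0.05
stable = is_regression(history, current=0.80)   # within tolerance
```
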
---

### Component 6: CLI Tool

Command-line interface for quick validation and benchmarking.

```bash
# Validate a single text
$ veritext validate "Your text here" --reference "Expected text" --metrics bleu,rouge

# Validate from files
$ veritext validate --file outputs.jsonl --reference-file expected.jsonl

# Run benchmark
$ veritext benchmark run summarisation --inputs docs/ --references refs/

# Show benchmark history
$ veritext benchmark show summarisation --last 20

# Check for regression
$ veritext benchmark check summarisation --tolerance 0.05
```

---

## Example Use Cases

### Use Case 1: Chatbot Response Validation

```python
from veritext import validators as v
from veritext.core.types import ValidationContext

# Define acceptable response criteria
response_validator = v.all_of([
    v.length(max_chars=500),
    v.readability(max_grade=8),
    v.excludes(terms=["I don't know", "I'm not sure"]),
])


# chatbot is a project-specific fixture
def test_chatbot_responds_helpfully(chatbot):
    response = chatbot.respond("How do I reset my password?")
    context = ValidationContext()  # no reference needed for these checks
    result = response_validator.validate(response, context)
    assert result.passed, result.failure_summary
```

### Use Case 2: Summarisation Quality Gate

```python
from veritext.pytest_plugin import validate_text


# summariser is a project-specific fixture; load_article and
# load_reference_summary are project-specific helpers
def test_summary_captures_key_points(summariser):
    article = load_article("financial_report.txt")
    summary = summariser.summarise(article)

    validate_text(
        summary,
        reference=load_reference_summary("financial_report_summary.txt"),
        min_rouge=0.65,
        min_semantic=0.80,
        max_length=300,
    )
```

### Use Case 3: Translation Quality Monitoring

```python
from veritext.benchmark import Benchmark

benchmark = Benchmark("translation_en_de", storage_path="benchmarks/")

# Nightly CI job
results = benchmark.evaluate(
    candidates=translate_batch(test_documents),
    references=human_translations,
    metrics=["bleu4", "semantic"],
)

# Block deployment if quality drops
benchmark.assert_no_regression(tolerance=0.03)
```

---

## Success Criteria

- [ ] BLEU/ROUGE implementations match reference implementations (nltk, rouge-score)
- [ ] Semantic similarity correlates with human judgement on test pairs
- [ ] Pytest plugin installs cleanly via `pip install veritext`
- [ ] Validation of 1000 text pairs completes in <5 seconds (excluding embeddings)
- [ ] Benchmark regression detection has <5% false positive rate
- [ ] Documentation includes working examples for each use case
- [ ] All code passes ruff, mypy strict, and pytest with ≥80% coverage
- [ ] Can explain design decisions and metric theory in an interview

---

## Skills Demonstrated

| Skill | How Veritext demonstrates it |
|-------|------------------------------|
| Python framework design | Composable validators, clean API, plugin architecture |
| Test automation | Native pytest integration, CI/CD workflows |
| NLP evaluation metrics | BLEU, ROUGE, semantic similarity implementations |
| Data analysis | Statistical regression detection, batch processing |
| CLI development | Typer-based interface, rich output |
| Software architecture | Layered design, clear separation of concerns |
| Documentation | Comprehensive README, examples |
| Quality engineering | High test coverage, type safety, linting |

---

## What Makes This Project Credible

1. **Solves a real problem** — Anyone building text-based features faces validation
   challenges.

2. **Not tied to a specific technology** — Works with any text source (chatbots, LLMs,
   translation APIs, content generators). It's a general-purpose tool, not an "LLM
   testing framework."

3. **Practical scope** — Not trying to reinvent pytest or build an ML platform. Focused
   on one thing: validating text quality.

4. **Demonstrates depth** — Implementing BLEU/ROUGE from understanding (not just
   wrapping libraries) shows knowledge of how these metrics work.

5. **Natural portfolio narrative** — "I was building X and needed a better way to test
   it, so I built this tool." Every interviewer has faced similar problems.