refactor: CLI cleanup and documentation updates

- Refactor CLI metric computation to eliminate code duplication
- Update version format to PEP 440 compliance (0.1.0.dev0)
- Cache Settings instance via @lru_cache for performance
- Document composite validators' protocol deviation
- Consolidate redundant empty checks in ROUGE-L computation
- Add Phase 10 (Portfolio Demos) to implementation plan

2026-02-04 15:38:46 +00:00

Project Plan: Veritext — Semantic Text Validation Framework

Overview

A Python library for validating text outputs against semantic criteria. Designed for developers building any system that produces text — chatbots, content generators, translation pipelines, summarisation tools — who need automated quality assurance beyond simple string matching.

Origin story: "I was building a feature that generated article summaries and got tired of manually checking if they captured the key points. Existing tools could tell me if two strings matched, but not if they meant the same thing. So I built a validation framework that understands semantics."

Portfolio role: A practical developer tool that demonstrates Python framework design, NLP evaluation techniques, and test automation integration. The project solves a real problem any developer working with text processing encounters.

Target users: Developers building content pipelines, chatbot teams validating responses, ML engineers evaluating model outputs, QA teams testing text-based features.

Problem Statement

Text validation is hard. Traditional testing approaches fall short:

Approach	Problem
Exact string match	Fails on semantically equivalent variations
Substring/regex	Brittle, misses meaning entirely
Manual review	Doesn't scale, subjective
Generic diff tools	Show what changed, not if it matters

Example: A summarisation system produces "The CEO announced layoffs affecting 500 employees" one day and "500 workers will lose their jobs, the company's chief executive said" the next. The two sentences are semantically equivalent, yet an exact-match or regex-based test would flag the change as a failure.

Veritext answers: "Is this text output good enough according to my criteria?" — not "Is it identical?"

Core Concepts

Metrics (Pure Computation)

Metrics compute scores comparing candidate text to reference text:

from veritext.metrics import Bleu, Rouge

bleu = Bleu()
result = bleu.score(
    candidate="The cat sat on the mat",
    reference="A cat is sitting on a mat"
)
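# Returns a BleuResult with fields bleu1, bleu2, bleu3, bleu4 and brevity_penalty.
# The candidate (6 tokens) is shorter than the reference (7 tokens), so
# brevity_penalty is below 1.0 under the standard BLEU definition.

rouge = Rouge()
result = rouge.score(
    candidate="The cat sat on the mat",
    reference="A cat is sitting on a mat",
)
# RougeResult(rouge1=RougeScore(...), rouge2=RougeScore(...), rouge_l=RougeScore(...))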

Built-in metrics:

Metric	What it measures	Use case
BLEU-1 to BLEU-4	N-gram precision	Translation, generation
ROUGE-1, ROUGE-2	N-gram recall	Summarisation
ROUGE-L	Longest common subsequence	Summarisation
Semantic similarity	Cosine similarity of embeddings	Any meaning comparison
Lexical overlap	Jaccard similarity of tokens	Simple similarity
Reading level	Flesch-Kincaid grade	Accessibility

Note: Reading level is a standalone metric (requires_reference = False) that analyses only the candidate text. Comparison metrics (BLEU, ROUGE, lexical overlap, semantic similarity) require a reference and raise ValueError if none is provided.
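To make the "n-gram precision" row concrete, here is a minimal sketch of the two building blocks of BLEU: clipped (modified) n-gram precision and the brevity penalty. The function names are illustrative and are not Veritext's API; whitespace splitting stands in for the shared tokeniser.

```python
import math
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count the n-grams in a token list (empty if the list is shorter than n)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate: list[str], reference: list[str], n: int) -> float:
    """Modified n-gram precision: each candidate n-gram is credited at most
    as many times as it occurs in the reference."""
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0  # empty or too-short candidate scores zero rather than raising
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


def brevity_penalty(candidate_len: int, reference_len: int) -> float:
    """Penalise candidates that are shorter than the reference."""
    if candidate_len == 0:
        return 0.0
    if candidate_len >= reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)


# Illustrative inputs, already lowercased; a real run would use the tokeniser.
candidate = "the cat sat on the mat".split()
reference = "a cat is sitting on a mat".split()

p1 = clipped_precision(candidate, reference, 1)  # "cat", "on", "mat" match: 3 of 6
bp = brevity_penalty(len(candidate), len(reference))  # exp(1 - 7/6), about 0.846
```

A full BLEU-N score combines the geometric mean of the precisions for n = 1..N with the brevity penalty; smoothing for zero counts is left out of this sketch.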

Validators (Decision Logic)

Validators wrap metrics and apply thresholds to make pass/fail decisions:

from veritext import validators as v

# Compose multiple checks
validator = v.all_of([
    v.bleu(min_score=0.7),
    v.length(max_chars=500),
    v.readability(max_grade=8),
])

from veritext.core.types import ValidationContext

context = ValidationContext(reference="The quick brown fox jumps over the lazy dog")
result = validator.validate("The fast brown fox leaped over the lazy dog", context)
# ValidationResult(passed=True, checks=[...])

Pytest Integration

Native pytest fixtures and assertions for CI/CD:

from veritext.pytest_plugin import validate_text

def test_summary_quality(summariser, document, expected_summary):
    summary = summariser.summarise(document)

    validate_text(
        summary,
        reference=expected_summary,
        min_rouge=0.7,
        min_semantic=0.85,
    )

Regression Detection

Track output quality over time, catch degradations before users do:

from veritext.benchmark import Benchmark

benchmark = Benchmark("summarisation_quality", storage_path="benchmarks/")
results = benchmark.evaluate(outputs, references, metrics=["rouge_l", "bleu4"])
benchmark.assert_no_regression(tolerance=0.05)
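The core comparison behind a regression check can be sketched in a few lines. This assumes that tolerance means an absolute drop below the mean of earlier runs; the plan does not say whether the real check is absolute, relative, or statistical, and has_regressed is a hypothetical name.

```python
from statistics import mean


def has_regressed(history: list[float], current: float, tolerance: float = 0.05) -> bool:
    """Return True when `current` is more than `tolerance` below the mean
    of the previously recorded scores (assumed: an absolute difference)."""
    if not history:
        return False  # first run: nothing to compare against
    baseline = mean(history)
    return current < baseline - tolerance


previous_rouge_l = [0.80, 0.82, 0.81]  # made-up scores for illustration
small_dip = has_regressed(previous_rouge_l, 0.78)  # within tolerance
large_dip = has_regressed(previous_rouge_l, 0.70)  # well below the baseline
```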

Tech Stack

Component	Technology	Rationale
Core	Python 3.11+	Target ecosystem, modern type hints
Metrics	Custom implementations	Full control, understanding of algorithms
Embeddings	sentence-transformers	Semantic similarity (optional)
Test integration	pytest	Fixtures, plugins, assertions
CLI	typer	Consistent with portfolio projects
Data handling	pydantic	Validation, serialisation
Storage	SQLite	Benchmark history, lightweight
Output	rich	Terminal formatting

Architecture

Layered Design

┌─────────────────────────────────────────────────────┐
│  CLI / pytest_plugin  (presentation layer)          │
├─────────────────────────────────────────────────────┤
│  validators/          (decision logic)              │
│  benchmark/           (tracking & regression)       │
├─────────────────────────────────────────────────────┤
│  metrics/             (pure computation)            │
├─────────────────────────────────────────────────────┤
│  core/                (shared types, tokenisation)  │
└─────────────────────────────────────────────────────┘

Dependency rule: Each layer depends only on layers below it.
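One way such a rule could be checked in CI is to parse each module's imports and compare layer ranks. The sketch below is an assumption about how that might look, not part of the plan: the rank table mirrors the diagram, the package name "cli" is a guess at how the CLI layer is named, and same-layer imports between sibling packages (validators and benchmark) are treated as violations.

```python
import ast

# Ranks mirror the diagram: a higher number is a higher layer.
LAYER_RANK = {
    "core": 0,
    "metrics": 1,
    "validators": 2,
    "benchmark": 2,
    "pytest_plugin": 3,
    "cli": 3,  # assumed package name for the CLI layer
}


def layer_violations(layer: str, source: str, package: str = "veritext") -> list[str]:
    """Return imported package modules that do not sit strictly below `layer`."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            if node.module == package:
                # "from veritext import validators" names the subpackage directly.
                modules = [f"{package}.{alias.name}" for alias in node.names]
            else:
                modules = [node.module]
        else:
            continue
        for module in modules:
            parts = module.split(".")
            if parts[0] != package or len(parts) < 2:
                continue  # third-party or stdlib import
            target = parts[1]
            if target != layer and LAYER_RANK.get(target, -1) >= LAYER_RANK[layer]:
                violations.append(module)
    return violations
```

Calling layer_violations("metrics", source_text) on every file under metrics/ and failing the build on a non-empty result would enforce the rule mechanically.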

Key Design Decisions

Metrics vs Validators separation — Metrics compute scores; validators make pass/fail decisions. Clear separation of concerns.
Typed result objects — Each metric returns a specific result type (e.g., BleuResult, RougeResult), not just float. Full information preserved.
Optional heavy dependencies — sentence-transformers (~2GB with PyTorch) is optional. Core library works without ML dependencies.
Shared tokenisation — Single Tokeniser protocol used by all n-gram metrics. Consistent behaviour across BLEU and ROUGE.
Explicit context — ValidationContext dataclass instead of **kwargs. Type-safe, discoverable API.
Graceful edge case handling — Empty text returns zero scores (not errors). Missing reference raises clear ValueError for comparison metrics. Unicode normalised to NFC by default.
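The edge-case policy in the last item can be illustrated with a toy comparison metric. This is a sketch under assumptions: token_recall and normalise are hypothetical names, and the real metrics may apply these rules in a different place.

```python
import unicodedata
from typing import Optional


def normalise(text: str) -> str:
    """Normalise to NFC so composed and decomposed characters compare equal."""
    return unicodedata.normalize("NFC", text)


def token_recall(candidate: str, reference: Optional[str]) -> float:
    """Toy comparison metric: share of reference tokens found in the candidate."""
    if reference is None:
        # Missing reference is a caller error for comparison metrics.
        raise ValueError("token_recall requires a reference text")
    cand = set(normalise(candidate).split())
    ref = set(normalise(reference).split())
    if not cand or not ref:
        return 0.0  # empty text scores zero instead of raising
    return len(cand & ref) / len(ref)


# "e" followed by a combining acute accent matches the precomposed character.
same = token_recall("cafe\u0301 open", "caf\u00e9 open")
```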

Project Components

Component 1: Core Module

Shared types, exceptions, and tokenisation.

Types:

ValidationContext — reference text and metadata for validation
CheckResult — individual check result with diagnostics
ValidationResult — aggregate result with pass/fail and all checks
BatchResult — statistics over multiple evaluations

Tokeniser:

from typing import Protocol

class Tokeniser(Protocol):
    def tokenise(self, text: str) -> list[str]: ...

class WordTokeniser:
    def __init__(self, lowercase: bool = True, remove_punctuation: bool = True): ...
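A possible body for WordTokeniser is sketched below. The regular expressions are assumptions about what "word" and "punctuation" mean here (internal apostrophes are kept inside a word); the plan only fixes the constructor signature and the tokenise method.

```python
import re
from typing import Protocol


class Tokeniser(Protocol):
    def tokenise(self, text: str) -> list[str]: ...


class WordTokeniser:
    def __init__(self, lowercase: bool = True, remove_punctuation: bool = True):
        self.lowercase = lowercase
        self.remove_punctuation = remove_punctuation

    def tokenise(self, text: str) -> list[str]:
        if self.lowercase:
            text = text.lower()
        if self.remove_punctuation:
            # Keep runs of word characters, allowing internal apostrophes.
            return re.findall(r"\w+(?:'\w+)*", text)
        # Otherwise emit each punctuation mark as its own token.
        return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)


def count_tokens(tokeniser: Tokeniser, text: str) -> int:
    """Any object with a matching tokenise() method satisfies the protocol."""
    return len(tokeniser.tokenise(text))
```

Because Tokeniser is a structural protocol, WordTokeniser does not need to inherit from it; a character-level or subword tokeniser could be swapped in the same way.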

Component 2: Metric Engine

Pure implementations of text evaluation metrics.

Interface:

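from typing import Protocol, TypeVar

from veritext.core.types import BatchResult  # assumed to live alongside ValidationContext

T = TypeVar("T")  # the metric's typed result, e.g. BleuResult

class Metric(Protocol[T]):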
    @property
    def name(self) -> str: ...

    @property
    def requires_reference(self) -> bool: ...

    def score(self, candidate: str, reference: str | list[str] | None = None) -> T:
        """Raises ValueError if reference required but not provided."""
        ...

    def batch_score(
        self,
        candidates: list[str],
        references: list[str] | list[list[str]] | None = None,
    ) -> BatchResult[T]: ...
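As an illustration of what a class satisfying this interface might look like, here is a sketch of a Jaccard metric. JaccardResult and SimpleBatch are hypothetical stand-ins (the real BatchResult is not specified in this plan), and the sketch handles a single reference string only, not the list-of-references form.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass(frozen=True)
class JaccardResult:
    score: float


@dataclass(frozen=True)
class SimpleBatch:
    """Hypothetical stand-in for BatchResult: per-item results plus their mean."""
    results: list
    mean_score: float


class Jaccard:
    # Plain attributes; type checkers accept these for read-only protocol properties.
    name = "jaccard"
    requires_reference = True

    def score(self, candidate: str, reference: Optional[str] = None) -> JaccardResult:
        if reference is None:
            raise ValueError(f"{self.name} requires a reference")
        cand = set(candidate.lower().split())
        ref = set(reference.lower().split())
        union = cand | ref
        # Two empty texts share nothing measurable, so score zero.
        return JaccardResult(len(cand & ref) / len(union) if union else 0.0)

    def batch_score(self, candidates: list, references: Optional[list] = None) -> SimpleBatch:
        if references is None or len(references) != len(candidates):
            raise ValueError(f"{self.name} needs one reference per candidate")
        results = [self.score(c, r) for c, r in zip(candidates, references)]
        average = mean([r.score for r in results]) if results else 0.0
        return SimpleBatch(results, average)
```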

Metrics:

Bleu — BLEU-1 through BLEU-4 with brevity penalty
Rouge — ROUGE-1, ROUGE-2, ROUGE-L with precision/recall/F1
Lexical — Jaccard similarity, token overlap
Readability — Flesch-Kincaid grade level
SemanticSimilarity — Embedding cosine similarity (optional dependency)
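For ROUGE-L, the precision, recall and F1 values all derive from the length of the longest common subsequence (LCS) of the two token lists. A minimal sketch follows, using the plain F1 form; the function names are illustrative and a weighted F-measure is an equally valid choice.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l(candidate: list[str], reference: list[str]) -> tuple[float, float, float]:
    """Return (precision, recall, f1) based on the LCS of the token lists."""
    if not candidate or not reference:
        return 0.0, 0.0, 0.0  # single empty-input check covers both sides
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# LCS here is "the cat on the mat": 5 of 6 tokens on each side.
scores = rouge_l("the cat sat on the mat".split(), "the cat is on the mat".split())
```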

Component 3: Validator Framework

Composable validation rules with clear pass/fail semantics.

Built-in validators:

Validator	Description
`v.bleu(min_score, variant)`	BLEU score above minimum
`v.rouge(min_score, variant)`	ROUGE score above minimum
`v.semantic(min_score)`	Semantic similarity above threshold
`v.length(min_chars, max_chars)`	Length constraints
`v.readability(max_grade)`	Reading level constraint
`v.contains(terms)`	Required terms present
`v.excludes(terms)`	Forbidden terms absent
`v.pattern(regex)`	Regex pattern match

Composition:

# All validators must pass
v.all_of([v.bleu(min_score=0.7), v.length(max_chars=500)])

# At least one must pass
v.any_of([v.contains(["error"]), v.contains(["failed"])])

# Weighted scoring
v.weighted([
    (v.bleu(min_score=0.7), 0.6),
    (v.readability(max_grade=8), 0.4),
], min_score=0.75)

Reference requirements: Validators wrapping comparison metrics (bleu, rouge, semantic) require context.reference to be set; if it is None, they raise ValidationError with a clear message. Constraint validators (length, readability, contains, excludes, pattern) do not require a reference.
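The composition helpers can be pictured as higher-order functions. The sketch below is a simplification under stated assumptions: a validator is reduced to a function of the text alone (the real API takes a ValidationContext), Check is a hypothetical result type, and weighted is assumed to mean a weighted average of per-check scores in [0, 1] compared against min_score.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Check:
    name: str
    passed: bool
    score: float  # in [0, 1]; simple constraint checks report 1.0 or 0.0


Validator = Callable[[str], Check]


def length(max_chars: int) -> Validator:
    def validate(text: str) -> Check:
        ok = len(text) <= max_chars
        return Check("length", ok, float(ok))
    return validate


def contains(terms: list[str]) -> Validator:
    def validate(text: str) -> Check:
        ok = all(term in text for term in terms)
        return Check("contains", ok, float(ok))
    return validate


def all_of(validators: list[Validator]) -> Validator:
    def validate(text: str) -> Check:
        ok = all(v(text).passed for v in validators)
        return Check("all_of", ok, float(ok))
    return validate


def any_of(validators: list[Validator]) -> Validator:
    def validate(text: str) -> Check:
        ok = any(v(text).passed for v in validators)
        return Check("any_of", ok, float(ok))
    return validate


def weighted(pairs: list[tuple[Validator, float]], min_score: float) -> Validator:
    def validate(text: str) -> Check:
        total = sum(weight for _, weight in pairs)  # weights assumed positive
        score = sum(v(text).score * weight for v, weight in pairs) / total
        return Check("weighted", score >= min_score, score)
    return validate
```

Because every helper returns another validator, compositions nest freely, for example all_of([length(500), any_of([contains(["error"]), contains(["failed"])])]).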

Component 4: Pytest Plugin

First-class pytest integration for CI/CD pipelines.

Features:

Custom assertions with detailed failure messages
Fixtures for common validation patterns
Markers for categorising text tests

Usage:

from veritext.pytest_plugin import validate_text

def test_chatbot_response(chatbot):
    response = chatbot.respond("What are your hours?")

    validate_text(
        response,
        reference="We're open Monday to Friday, 9am to 5pm.",
        min_bleu=0.6,
        min_semantic=0.8,
        max_length=500,
    )

Failure output:

FAILED test_summary.py::test_summary_quality
    AssertionError: Text failed 2 of 4 checks:

    ✗ rouge: 0.58 (minimum: 0.70)
    ✗ semantic: 0.72 (minimum: 0.85)
    ✓ length: 342 (maximum: 500)
    ✓ readability: 6.2 (maximum: 8)

    Candidate: "The company reported losses..."
    Reference: "Financial results showed significant decline..."
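A failure report of this shape could be assembled roughly as follows. This is a sketch, not the plugin's code: Check and assert_checks are hypothetical names, and number formatting uses Python's general format, so 0.70 prints as 0.7.

```python
from dataclasses import dataclass


@dataclass
class Check:
    name: str
    value: float
    limit: float
    kind: str  # "minimum" or "maximum"

    @property
    def passed(self) -> bool:
        if self.kind == "minimum":
            return self.value >= self.limit
        return self.value <= self.limit


def assert_checks(checks: list[Check], candidate: str, reference: str) -> None:
    """Raise AssertionError listing every check, failures first."""
    failed = [c for c in checks if not c.passed]
    if not failed:
        return
    lines = [f"Text failed {len(failed)} of {len(checks)} checks:", ""]
    # sorted() is stable and False sorts before True, so failures lead.
    for c in sorted(checks, key=lambda c: c.passed):
        mark = "✓" if c.passed else "✗"
        lines.append(f"{mark} {c.name}: {c.value:g} ({c.kind}: {c.limit:g})")
    lines += ["", f'Candidate: "{candidate}"', f'Reference: "{reference}"']
    raise AssertionError("\n".join(lines))
```

Raising a plain AssertionError keeps the helper usable outside pytest while still producing a readable report in the test summary.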

Component 5: Benchmark & Regression Detection

Track quality over time, catch degradations automatically.

Features:

Store historical metric values in SQLite
Statistical regression detection
Configurable tolerance thresholds
CI integration for blocking degradations

Usage:

from veritext.benchmark import Benchmark

benchmark = Benchmark("chatbot_quality", storage_path="benchmarks/")

# Record current run (returns BenchmarkRun with metrics and metadata)
run = benchmark.evaluate(
    candidates=current_outputs,
    references=expected_outputs,
    metrics=["rouge_l", "semantic"]
)
# run.metrics = {"rouge_l": 0.82, "semantic": 0.89}

# Compare against historical baseline
regression = benchmark.check_regression(tolerance=0.05, window=10)

if regression.detected:
    print(f"Quality dropped: {regression.summary}")

# In CI: fail the build on regression
benchmark.assert_no_regression(tolerance=0.05)
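The storage side of this workflow might look like the sketch below, which uses the standard-library sqlite3 module. The table layout and the function names open_store, record and baseline are assumptions for illustration; the plan specifies SQLite and a window parameter but not a schema.

```python
import sqlite3
from statistics import mean
from typing import Optional


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the benchmark history database."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS runs ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " benchmark TEXT NOT NULL,"
        " metric TEXT NOT NULL,"
        " value REAL NOT NULL)"
    )
    return conn


def record(conn: sqlite3.Connection, benchmark: str, metric: str, value: float) -> None:
    """Append one metric value for a benchmark run."""
    conn.execute(
        "INSERT INTO runs (benchmark, metric, value) VALUES (?, ?, ?)",
        (benchmark, metric, value),
    )
    conn.commit()


def baseline(conn: sqlite3.Connection, benchmark: str, metric: str,
             window: int = 10) -> Optional[float]:
    """Mean of the most recent `window` values, or None if there is no history."""
    rows = conn.execute(
        "SELECT value FROM runs WHERE benchmark = ? AND metric = ?"
        " ORDER BY id DESC LIMIT ?",
        (benchmark, metric, window),
    ).fetchall()
    return mean([value for (value,) in rows]) if rows else None
```

A regression check would compute the baseline before recording the current run, so that the new value is compared against history rather than against itself.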

Component 6: CLI Tool

Command-line interface for quick validation and benchmarking.

# Validate a single text
$ veritext validate "Your text here" --reference "Expected text" --metrics bleu,rouge

# Validate from files
$ veritext validate --file outputs.jsonl --reference-file expected.jsonl

# Run benchmark
$ veritext benchmark run summarisation --inputs docs/ --references refs/

# Show benchmark history
$ veritext benchmark show summarisation --last 20

# Check for regression
$ veritext benchmark check summarisation --tolerance 0.05

Example Use Cases

Use Case 1: Chatbot Response Validation

from veritext import validators as v
from veritext.core.types import ValidationContext

# Define acceptable response criteria
response_validator = v.all_of([
    v.length(max_chars=500),
    v.readability(max_grade=8),
    v.excludes(terms=["I don't know", "I'm not sure"]),
])

def test_chatbot_responds_helpfully(chatbot):
    response = chatbot.respond("How do I reset my password?")
    context = ValidationContext()
    result = response_validator.validate(response, context)
    assert result.passed, result.failure_summary

Use Case 2: Summarisation Quality Gate

from veritext.pytest_plugin import validate_text

def test_summary_captures_key_points(summariser):
    article = load_article("financial_report.txt")
    summary = summariser.summarise(article)

    validate_text(
        summary,
        reference=load_reference_summary("financial_report_summary.txt"),
        min_rouge=0.65,
        min_semantic=0.80,
        max_length=300,
    )

Use Case 3: Translation Quality Monitoring

from veritext.benchmark import Benchmark

benchmark = Benchmark("translation_en_de", storage_path="benchmarks/")

# Nightly CI job
results = benchmark.evaluate(
    candidates=translate_batch(test_documents),
    references=human_translations,
    metrics=["bleu4", "semantic"]
)

# Block deployment if quality drops
benchmark.assert_no_regression(tolerance=0.03)

Success Criteria

BLEU/ROUGE implementations match reference implementations (nltk, rouge-score)
Semantic similarity correlates with human judgement on test pairs
Pytest plugin installs cleanly via uv pip install veritext
Validation of 1000 text pairs completes in <5 seconds (excluding embeddings)
Benchmark regression detection has <5% false positive rate
Edge cases handled gracefully (empty text, None reference, Unicode)
Documentation includes working examples for each use case
All code passes ruff, mypy strict, and pytest with ≥80% coverage
Can explain design decisions and metric theory in interview

Skills Demonstrated

Skill	How Veritext demonstrates it
Python framework design	Composable validators, clean API, plugin architecture
Test automation	Native pytest integration, CI/CD workflows
NLP evaluation metrics	BLEU, ROUGE, semantic similarity implementations
Data analysis	Statistical regression detection, batch processing
CLI development	Typer-based interface, rich output
Software architecture	Layered design, clear separation of concerns
Documentation	Comprehensive readme, examples
Quality engineering	High test coverage, type safety, linting

What Makes This Project Credible

Solves a real problem — Anyone building text-based features faces validation challenges.
Not tied to a specific technology — Works with any text source (chatbots, LLMs, translation APIs, content generators). It's a general-purpose tool, not an "LLM testing framework."
Practical scope — Not trying to reinvent pytest or build an ML platform. Focused on one thing: validating text quality.
Demonstrates depth — Implementing BLEU/ROUGE from understanding (not just wrapping libraries) shows knowledge of how these metrics work.
Natural portfolio narrative — "I was building X and needed a better way to test it, so I built this tool." Every interviewer has faced similar problems.

Portfolio Demos (Future)

Interactive demos to showcase Veritext without requiring installation.

Streamlit Demo

A quick interactive web UI for general visitors and recruiters.

Features:

Text input boxes (candidate + reference)
Metric selector (BLEU, ROUGE, lexical, readability)
Threshold sliders for pass/fail validation
Results table with scores and status

Deployment: Self-hosted on homeserver (e.g., veritext.kschappell.com)

Effort: ~30 minutes

Jupyter Notebook Collection

Deep-dive notebooks targeting data science and ML recruiters.

Notebooks:

Notebook	Purpose
`01-metrics-overview.ipynb`	Introduction to each metric with visualisations
`02-batch-evaluation.ipynb`	Evaluating model outputs at scale, statistical analysis
`03-regression-detection.ipynb`	Tracking quality over time, detecting degradation
`04-chatbot-validation.ipynb`	Real-world use case: validating chatbot responses

Hosting: JupyterLite (static files, runs in browser via WebAssembly)

Deployment: Self-hosted alongside Streamlit demo

Why both:

Demo Type	Audience	Value
Streamlit	General visitors	Quick, interactive, no friction
Notebooks	Data/ML recruiters	Shows analytical depth, speaks their language
