docs(plans): improve consistency and add edge case handling

- Add requires_reference property to Metric protocol for standalone metrics
- Make reference parameter optional in score/batch_score methods
- Add comprehensive Edge Case Handling section (empty text, Unicode, etc.)
- Expand phase tasks with explicit test coverage requirements
- Fix path reference to use relative workspace path
- Add missing test_runner.py to directory structure
- Clarify SemanticValidator integration in Phase 5
- Fix tuple/list type annotation in Benchmark.evaluate()
2026-02-03 16:04:02 +00:00
parent 49f1e27cb1
commit 818e241ab2
2 changed files with 143 additions and 43 deletions


@@ -4,7 +4,7 @@ Semantic text validation framework for Python — validates text outputs against
## Project Overview
**Location:** `/home/kai/work/dev/portfolio/veritext/`
**Location:** `portfolio/veritext/` (relative to workspace root)
**Remote:** `https://gitea.kschappell.com/kschappell/veritext.git`
**Purpose:** A Python library for validating text outputs against semantic criteria. Designed for developers building systems that produce text (chatbots, content generators, summarisation tools) who need automated quality assurance beyond simple string matching.
@@ -165,6 +165,7 @@ veritext/
│ │ └── test_composite.py
│ ├── test_benchmark/
│ │ ├── test_storage.py
│ │ ├── test_runner.py
│ │ └── test_regression.py
│ ├── test_pytest_plugin/
│ │ └── test_integration.py
@@ -239,12 +240,30 @@ class Metric(Protocol[T]):
@property
def name(self) -> str: ...
def score(self, candidate: str, reference: str | list[str]) -> T: ...
@property
def requires_reference(self) -> bool:
"""Whether this metric requires a reference text."""
...
def score(self, candidate: str, reference: str | list[str] | None = None) -> T:
"""
Compute metric score.
Args:
candidate: The text to evaluate.
reference: Reference text(s) for comparison. Required for comparison
metrics (BLEU, ROUGE, semantic). Ignored for standalone
metrics (readability).
Raises:
ValueError: If reference is required but not provided.
"""
...
def batch_score(
self,
candidates: list[str],
references: list[str] | list[list[str]]
references: list[str] | list[list[str]] | None = None,
) -> BatchResult[T]: ...
@dataclass
@@ -262,7 +281,7 @@ class BatchResult(Generic[T]):
stats: dict[str, AggregateStats] # Aggregated stats for numeric fields
```
**Note:** Readability metrics (Flesch-Kincaid) accept but ignore the `reference` parameter since they only analyse the candidate text.
**Note:** Standalone metrics like readability return `False` for `requires_reference` and ignore the `reference` parameter. Comparison metrics (BLEU, ROUGE, semantic) return `True` and raise `ValueError` if `reference` is `None`.
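To illustrate the `requires_reference` contract in the note above, here is a minimal sketch rather than the project's implementation. `WordCount` and `TokenOverlap` are hypothetical toy metrics that return plain floats instead of the plan's result types, and whitespace splitting stands in for the real tokeniser.

```python
from __future__ import annotations


class WordCount:
    """Hypothetical standalone metric: only inspects the candidate."""

    name = "word_count"
    requires_reference = False

    def score(self, candidate: str, reference: str | list[str] | None = None) -> float:
        # The reference is accepted for interface compatibility but ignored.
        return float(len(candidate.split()))


class TokenOverlap:
    """Hypothetical comparison metric: cannot run without a reference."""

    name = "token_overlap"
    requires_reference = True

    def score(self, candidate: str, reference: str | list[str] | None = None) -> float:
        if reference is None:
            raise ValueError(f"Reference required for {self.name}")
        refs = [reference] if isinstance(reference, str) else list(reference)
        if not refs:
            raise ValueError(f"Reference required for {self.name}")
        cand = set(candidate.split())
        if not cand:
            return 0.0  # empty candidate scores zero instead of raising
        # Fraction of candidate tokens found in the best-matching reference.
        return max(len(cand & set(ref.split())) / len(cand) for ref in refs)
```

A caller can branch on `metric.requires_reference` before scoring, which lets it report a missing reference up front instead of catching `ValueError` afterwards.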
### Validator Protocol
@@ -322,6 +341,46 @@ class RegressionReport:
---
## Edge Case Handling
All components must handle edge cases consistently:
### Empty Text
| Input | Behaviour |
|-------|-----------|
| Empty candidate (`""`) | Metrics return zero scores; validators fail unless explicitly configured to accept empty text |
| Empty reference (`""`) | Comparison metrics raise `ValueError` |
| Whitespace-only text | Treated as empty after tokenisation |
### None Reference
| Component | Behaviour |
|-----------|-----------|
| Comparison metrics (BLEU, ROUGE, semantic) | Raise `ValueError("Reference required for {metric_name}")` |
| Standalone metrics (readability) | Ignore the reference and compute normally |
| Validators wrapping comparison metrics | Raise `ValidationError` if `context.reference` is `None` |
### Unicode & Encoding
- All text assumed to be valid UTF-8 strings
- Normalisation: NFC by default (configurable in `Tokeniser`)
- Emoji and non-Latin scripts: Supported, tokenised as words where applicable
### Very Long Text
- No hard limits enforced by default
- `Tokeniser` can accept `max_tokens: int | None` for truncation
- Semantic similarity: Truncates to model's max sequence length (typically 512 tokens) with warning logged
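The normalisation, whitespace, and truncation rules above could fit into one small function. This is a sketch under assumptions: `tokenise` is a hypothetical free function, not the plan's `Tokeniser` class, and it splits on whitespace only.

```python
from __future__ import annotations

import unicodedata


def tokenise(text: str, form: str = "NFC", max_tokens: int | None = None) -> list[str]:
    """Normalise, split on whitespace, and optionally truncate."""
    # NFC composes sequences such as "e" + combining acute into one code point,
    # so visually identical strings produce identical tokens.
    normalised = unicodedata.normalize(form, text)
    # str.split() with no arguments drops all whitespace runs, so
    # whitespace-only input yields an empty list ("treated as empty").
    tokens = normalised.split()
    if max_tokens is not None:
        tokens = tokens[:max_tokens]
    return tokens
```

Under this sketch, emoji and non-Latin words survive as ordinary tokens because nothing is filtered by character class.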
### Multiple References
BLEU and ROUGE support multiple references (`list[str]`):
- BLEU: Clips each candidate n-gram count to the maximum count of that n-gram in any single reference
- ROUGE: Computes against each reference, returns best score
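A sketch of how the two multi-reference strategies differ, assuming pre-tokenised input. Both function names are hypothetical, and only the unigram case of BLEU's clipping is shown.

```python
from __future__ import annotations

from collections import Counter
from collections.abc import Callable


def clipped_unigram_precision(candidate: list[str], references: list[list[str]]) -> float:
    """BLEU-style modified precision for unigrams with multiple references."""
    if not candidate or not references:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        # Counter union keeps, per token, the maximum count seen so far.
        max_ref_counts |= Counter(ref)
    cand_counts = Counter(candidate)
    clipped = sum(min(n, max_ref_counts[tok]) for tok, n in cand_counts.items())
    return clipped / len(candidate)


def best_reference_score(
    score_fn: Callable[[list[str], list[str]], float],
    candidate: list[str],
    references: list[list[str]],
) -> float:
    """ROUGE-style handling: score against each reference, keep the best."""
    return max(score_fn(candidate, ref) for ref in references)
```

The clipping step is what stops a degenerate candidate such as "the the the the the the the" from earning full precision: it is credited with at most as many occurrences of "the" as the most generous single reference contains.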
---
## Validator Naming Convention
Consistent short names:
@@ -369,13 +428,14 @@ v.weighted( # Weighted score threshold
2. Write `pyproject.toml` with optional dependencies
3. Create `CLAUDE.md` with project guidelines
4. Implement `core/exceptions.py` (full hierarchy)
5. Implement `core/types.py` (ValidationContext, CheckResult, BatchResult)
6. Implement `core/tokenisation.py` (WordTokeniser)
5. Implement `core/types.py` (`ValidationContext`, `CheckResult`, `ValidationResult`)
6. Implement `core/tokenisation.py` (`WordTokeniser` with NFC normalisation)
7. Implement `core/config.py` (pydantic-settings)
8. Implement `core/logging.py` (structlog configuration)
9. Create `__init__.py` with version
10. Write tests for tokenisation
11. Initial commit to Gitea
9. Create `__init__.py` with `__version__` and `__all__` exports
10. Write tests for tokenisation (including Unicode, empty input, whitespace-only)
11. Write tests for types (including edge cases)
12. Initial commit to Gitea
**Files:**
- `pyproject.toml`
@@ -410,13 +470,18 @@ uv run pytest tests/test_core/ -v
**Goal:** Implement BLEU and lexical similarity metrics.
**Tasks:**
1. Implement `metrics/base.py` (Metric protocol)
2. Implement `metrics/results.py` (BleuResult, LexicalResult)
1. Implement `metrics/base.py` (Metric protocol, `BatchResult`, `AggregateStats`)
2. Implement `metrics/results.py` (`BleuResult`, `LexicalResult`)
3. Implement `metrics/bleu.py` (BLEU-1 through BLEU-4)
4. Implement `metrics/lexical.py` (Jaccard, token overlap)
5. Add batch processing with statistics
6. Write comprehensive tests with reference values
7. Update changelog
5. Add batch processing with aggregate statistics (mean, std, percentiles)
6. Write comprehensive tests:
- Single-pair scoring with reference values from NLTK
- Batch scoring with statistical aggregation
- Edge cases: empty text, single-word inputs, identical texts
- Multiple references support
7. Define `__all__` exports in each module's `__init__.py`
8. Update changelog
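For task 5, the aggregation step could look roughly like the sketch below. The field names on `AggregateStats` and the nearest-rank percentile method are assumptions; the plan names the class but does not show its fields or the percentile definition.

```python
from __future__ import annotations

import math
import statistics
from dataclasses import dataclass


@dataclass(frozen=True)
class AggregateStats:
    # Field names are illustrative, not taken from the plan.
    mean: float
    std: float
    p50: float
    p95: float


def percentile(sorted_values: list[float], q: float) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(q / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def aggregate(values: list[float]) -> AggregateStats:
    if not values:
        raise ValueError("Cannot aggregate an empty batch")
    ordered = sorted(values)
    # Sample standard deviation needs at least two points; report 0.0 otherwise.
    std = statistics.stdev(ordered) if len(ordered) > 1 else 0.0
    return AggregateStats(
        mean=statistics.fmean(ordered),
        std=std,
        p50=percentile(ordered, 50),
        p95=percentile(ordered, 95),
    )
```

Whether an empty batch should raise or return zeroed statistics is a design choice the plan leaves open; raising is used here only to keep the sketch unambiguous.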
**Key Design:**
```python
@@ -448,10 +513,15 @@ uv run pytest tests/test_metrics/ -v --cov=src/veritext/metrics
**Goal:** Implement ROUGE and readability metrics.
**Tasks:**
1. Implement `metrics/rouge.py` (ROUGE-1, ROUGE-2, ROUGE-L)
2. Implement `metrics/readability.py` (Flesch-Kincaid)
3. Add RougeResult, ReadabilityResult to results.py
4. Write comprehensive tests
1. Implement `metrics/rouge.py` (ROUGE-1, ROUGE-2, ROUGE-L with precision/recall/F1)
2. Implement `metrics/readability.py` (Flesch-Kincaid grade level)
- Set `requires_reference = False` for standalone operation
3. Add `RougeResult`, `RougeScore`, `ReadabilityResult` to results.py
4. Write comprehensive tests:
- Single-pair scoring with reference values from `rouge-score` library
- Batch scoring with statistical aggregation
- Edge cases: empty text, very short text, identical texts
- Readability on various grade levels (children's text → academic)
5. Update changelog
**Files:**
@@ -473,13 +543,18 @@ uv run pytest tests/test_metrics/ -v
**Goal:** Build composable validation system.
**Tasks:**
1. Implement `validators/base.py` (Check protocol, ValidationResult)
2. Implement `validators/metric.py` (BleuValidator, RougeValidator)
3. Implement `validators/constraint.py` (LengthValidator, ContainsValidator, etc.)
4. Implement `validators/composite.py` (AllOf, AnyOf, Weighted)
1. Implement `validators/base.py` (`Check` protocol, `ValidationResult`)
2. Implement `validators/metric.py` (`BleuValidator`, `RougeValidator`)
- Raise `ValidationError` if `context.reference` is `None`
3. Implement `validators/constraint.py` (`LengthValidator`, `ContainsValidator`, etc.)
4. Implement `validators/composite.py` (`AllOf`, `AnyOf`, `Weighted`)
5. Create validator factory functions (`v.bleu()`, `v.length()`, etc.)
6. Write comprehensive tests
7. Update changelog
6. Define `__all__` exports in `validators/__init__.py`
7. Write comprehensive tests:
- Individual validators with passing/failing cases
- Composition (all_of, any_of, weighted)
- Edge cases: missing reference, empty text, boundary thresholds
8. Update changelog
**Key Design:**
```python
@@ -523,11 +598,13 @@ uv run pytest tests/test_validators/ -v --cov=src/veritext/validators
**Tasks:**
1. Implement `semantic/similarity.py` with lazy import
2. Add embedding caching
3. Add DependencyError for missing sentence-transformers
4. Implement SemanticValidator
5. Write tests (skipped if dependency missing)
6. Update changelog
2. Add embedding caching for repeated texts
3. Add `DependencyError` for missing sentence-transformers
4. Add `SemanticResult` to `metrics/results.py`
5. Add `SemanticValidator` to `validators/metric.py` (extends existing file)
6. Add `v.semantic()` factory function to `validators/__init__.py`
7. Write tests (skipped if dependency missing via `pytest.importorskip`)
8. Update changelog
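Tasks 1 and 3 can be sketched together as a lazy import that raises `DependencyError`. The helper name `lazy_import`, the `ImportError` base class, and the message wording are assumptions; the plan only names the exception.

```python
from __future__ import annotations

import importlib
from types import ModuleType


class DependencyError(ImportError):
    """Raised when an optional dependency is not installed (assumed base class)."""


def lazy_import(module_name: str, extra: str) -> ModuleType:
    """Import an optional module on first use, with an actionable error."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise DependencyError(
            f"'{module_name}' is required for this feature; "
            f"install the '{extra}' optional dependency group."
        ) from exc
```

Calling `lazy_import("sentence_transformers", "semantic")` inside the similarity class, not at module import time, keeps the base package importable without the extra. In the test module, `pytest.importorskip("sentence_transformers")` at the top skips the whole file when the package is missing.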
**Key Design:**
```python
@@ -552,15 +629,18 @@ class SemanticSimilarity:
**Files:**
- `src/veritext/semantic/__init__.py`
- `src/veritext/semantic/similarity.py`
- `src/veritext/metrics/results.py` (add `SemanticResult`)
- `src/veritext/validators/metric.py` (add `SemanticValidator`)
- `src/veritext/validators/__init__.py` (add `semantic()` factory)
- `tests/test_semantic/test_similarity.py`
**Verification:**
```bash
# Without semantic dependency
uv run pytest tests/ -v --ignore=tests/test_semantic/
# Without semantic dependency — tests should skip gracefully
uv run pytest tests/ -v
# With semantic dependency
uv pip install sentence-transformers
uv sync --extra semantic
uv run pytest tests/test_semantic/ -v
```
@@ -596,7 +676,7 @@ def validate_text(
min_rouge: float | None = None,
min_semantic: float | None = None,
max_length: int | None = None,
max_reading_grade: int | None = None,
max_reading_grade: float | None = None,
contains: list[str] | None = None,
excludes: list[str] | None = None,
) -> None:
@@ -605,9 +685,12 @@ def validate_text(
Raises:
AssertionError: With detailed failure information if validation fails.
ValueError: If comparison metrics requested but reference not provided.
"""
```
**Error handling:** If `min_bleu`, `min_rouge`, or `min_semantic` is specified without a `reference`, raise `ValueError` immediately with a clear message rather than failing inside the metric.
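The up-front check described above might be factored out as follows. This is a sketch: `check_reference_required` is a hypothetical helper, and the keyword names simply mirror the `validate_text` parameters.

```python
from __future__ import annotations


def check_reference_required(reference: str | None, **thresholds: float | None) -> None:
    """Raise ValueError if any comparison threshold is set without a reference."""
    requested = [name for name, value in thresholds.items() if value is not None]
    if requested and reference is None:
        raise ValueError(
            "reference is required when using: " + ", ".join(requested)
        )
```

`validate_text` could call `check_reference_required(reference, min_bleu=min_bleu, min_rouge=min_rouge, min_semantic=min_semantic)` as its first statement, so the error names the offending arguments before any metric runs.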
**Files:**
- `src/veritext/pytest_plugin/__init__.py`
- `src/veritext/pytest_plugin/fixtures.py`
@@ -629,12 +712,17 @@ uv run pytest tests/test_pytest_plugin/ -v
**Goal:** Track quality over time, detect regressions.
**Tasks:**
1. Implement `benchmark/models.py` (BenchmarkRun, RegressionReport)
1. Implement `benchmark/models.py` (`BenchmarkRun`, `RegressionReport`)
2. Implement `benchmark/storage.py` (SQLite backend)
3. Implement `benchmark/runner.py` (Benchmark class)
4. Implement `benchmark/regression.py` (statistical detection)
5. Add `assert_no_regression()` for CI
6. Write tests
- Handle concurrent writes gracefully (SQLite WAL mode)
- Raise `StorageError` on corruption with recovery guidance
3. Implement `benchmark/runner.py` (`Benchmark` class)
4. Implement `benchmark/regression.py` (statistical detection using rolling window)
5. Add `assert_no_regression()` for CI integration
6. Write comprehensive tests:
- Storage CRUD operations
- Regression detection with known degradation
- Edge cases: first run (no baseline), empty metrics
7. Update changelog
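One way to read "statistical detection using rolling window" in task 4 is to compare the current score with the mean of the last few runs. The sketch below assumes higher scores are better and that `tolerance` is an absolute drop; the plan specifies neither, and `is_regression` is a hypothetical name.

```python
from __future__ import annotations

from statistics import mean


def is_regression(
    history: list[float],
    current: float,
    window: int = 5,
    tolerance: float = 0.05,
) -> bool:
    """True if `current` falls more than `tolerance` below the rolling baseline."""
    if not history:
        return False  # first run: no baseline to regress from
    baseline = mean(history[-window:])  # mean of the most recent `window` runs
    return current < baseline - tolerance
```

Returning `False` on an empty history covers the "first run (no baseline)" edge case listed in the tests.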
**Key Interface:**
@@ -646,7 +734,7 @@ class Benchmark:
self,
candidates: list[str],
references: list[str],
metrics: list[str] = ("rouge_l", "bleu4"),
metrics: list[str] | None = None, # Default: ["rouge_l", "bleu4"]
) -> BenchmarkRun:
"""Evaluate candidates, store results, return the run record."""
...


@@ -74,7 +74,7 @@ result = rouge.score(candidate, reference)
| Lexical overlap | Jaccard similarity of tokens | Simple similarity |
| Reading level | Flesch-Kincaid grade | Accessibility |
**Note:** Reading level is a standalone metric that analyses only the candidate text — no reference required.
**Note:** Reading level is a standalone metric (`requires_reference = False`) that analyses only the candidate text. Comparison metrics (BLEU, ROUGE, semantic) require a reference and raise `ValueError` if none provided.
### Validators (Decision Logic)
@@ -180,6 +180,10 @@ benchmark.assert_no_regression(tolerance=0.05)
5. **Explicit context**: `ValidationContext` dataclass instead of `**kwargs`.
Type-safe, discoverable API.
6. **Graceful edge case handling** — Empty text returns zero scores (not errors).
Missing reference raises clear `ValueError` for comparison metrics. Unicode
normalised to NFC by default.
---
## Project Components
@@ -215,12 +219,17 @@ class Metric(Protocol[T]):
@property
def name(self) -> str: ...
def score(self, candidate: str, reference: str | list[str]) -> T: ...
@property
def requires_reference(self) -> bool: ...
def score(self, candidate: str, reference: str | list[str] | None = None) -> T:
"""Raises ValueError if reference required but not provided."""
...
def batch_score(
self,
candidates: list[str],
references: list[str] | list[list[str]]
references: list[str] | list[list[str]] | None = None,
) -> BatchResult[T]: ...
```
@@ -266,6 +275,8 @@ v.weighted([
], min_score=0.75)
```
**Reference requirements:** Validators wrapping comparison metrics (`bleu`, `rouge`, `semantic`) require `context.reference` to be set. If `None`, they raise `ValidationError` with a clear message. Constraint validators (`length`, `readability`, `contains`) do not require a reference.
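The split between the two validator families might look like this. It is a sketch: the `candidate` field on `ValidationContext`, the `check` method name, and both validator classes are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


class ValidationError(Exception):
    """Stand-in for the library's exception of the same name."""


@dataclass
class ValidationContext:
    candidate: str  # assumed field name
    reference: str | None = None


class MaxLength:
    """Constraint validator: never looks at the reference."""

    def __init__(self, limit: int) -> None:
        self.limit = limit

    def check(self, context: ValidationContext) -> bool:
        return len(context.candidate) <= self.limit


class MinTokenOverlap:
    """Comparison validator: refuses to run without a reference."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold

    def check(self, context: ValidationContext) -> bool:
        if context.reference is None:
            raise ValidationError("MinTokenOverlap requires context.reference")
        cand = set(context.candidate.split())
        ref = set(context.reference.split())
        overlap = len(cand & ref) / len(cand) if cand else 0.0
        return overlap >= self.threshold
```

Raising instead of returning a failed check keeps a misconfigured test (no reference supplied) distinguishable from a genuinely low score.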
---
### Component 4: Pytest Plugin
@@ -435,9 +446,10 @@ benchmark.assert_no_regression(tolerance=0.03)
- [ ] BLEU/ROUGE implementations match reference implementations (nltk, rouge-score)
- [ ] Semantic similarity correlates with human judgement on test pairs
- [ ] Pytest plugin installs cleanly via `pip install veritext`
- [ ] Pytest plugin installs cleanly via `uv pip install veritext`
- [ ] Validation of 1000 text pairs completes in <5 seconds (excluding embeddings)
- [ ] Benchmark regression detection has <5% false positive rate
- [ ] Edge cases handled gracefully (empty text, None reference, Unicode)
- [ ] Documentation includes working examples for each use case
- [ ] All code passes ruff, mypy strict, and pytest with ≥80% coverage
- [ ] Can explain design decisions and metric theory in interview