docs(plans): improve consistency and add edge case handling

- Add requires_reference property to Metric protocol for standalone metrics
- Make reference parameter optional in score/batch_score methods
- Add comprehensive Edge Case Handling section (empty text, Unicode, etc.)
- Expand phase tasks with explicit test coverage requirements
- Fix path reference to use relative workspace path
- Add missing test_runner.py to directory structure
- Clarify SemanticValidator integration in Phase 5
- Fix tuple/list type annotation in Benchmark.evaluate()
2026-02-03 16:04:02 +00:00
parent 49f1e27cb1
commit 818e241ab2
2 changed files with 143 additions and 43 deletions

@@ -74,7 +74,7 @@ result = rouge.score(candidate, reference)
 | Lexical overlap | Jaccard similarity of tokens | Simple similarity |
 | Reading level | Flesch-Kincaid grade | Accessibility |
-**Note:** Reading level is a standalone metric that analyses only the candidate text — no reference required.
+**Note:** Reading level is a standalone metric (`requires_reference = False`) that analyses only the candidate text. Comparison metrics (BLEU, ROUGE, semantic) require a reference and raise `ValueError` if none provided.
 ### Validators (Decision Logic)
@@ -180,6 +180,10 @@ benchmark.assert_no_regression(tolerance=0.05)
 5. **Explicit context**: `ValidationContext` dataclass instead of `**kwargs`.
    Type-safe, discoverable API.
+6. **Graceful edge case handling** — Empty text returns zero scores (not errors).
+   Missing reference raises clear `ValueError` for comparison metrics. Unicode
+   normalised to NFC by default.
 ---
 ## Project Components
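The edge-case policy introduced as design principle 6 in this hunk could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the function names `normalise` and `jaccard` are hypothetical, and whitespace tokenisation is an assumption made only to keep the example short.

```python
from __future__ import annotations

import unicodedata


def normalise(text: str) -> str:
    """Normalise to NFC so composed and decomposed forms compare equal."""
    return unicodedata.normalize("NFC", text)


def jaccard(candidate: str, reference: str | None = None) -> float:
    """Token-level Jaccard similarity with the edge-case policy applied.

    Hypothetical helper: whitespace tokenisation is an assumption.
    """
    if reference is None:
        # Comparison metric: a missing reference is a caller error.
        raise ValueError("jaccard requires a reference text")
    cand_tokens = set(normalise(candidate).split())
    ref_tokens = set(normalise(reference).split())
    if not cand_tokens or not ref_tokens:
        # Empty or whitespace-only text scores zero instead of raising.
        return 0.0
    return len(cand_tokens & ref_tokens) / len(cand_tokens | ref_tokens)
```

Under these assumptions, `jaccard("", "hello")` returns `0.0`, while `jaccard("hello")` raises `ValueError`, which keeps "nothing to compare" distinct from "caller forgot the reference".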
@@ -215,12 +219,17 @@ class Metric(Protocol[T]):
     @property
     def name(self) -> str: ...
-    def score(self, candidate: str, reference: str | list[str]) -> T: ...
+    @property
+    def requires_reference(self) -> bool: ...
+    def score(self, candidate: str, reference: str | list[str] | None = None) -> T:
+        """Raises ValueError if reference required but not provided."""
+        ...
     def batch_score(
         self,
         candidates: list[str],
-        references: list[str] | list[list[str]]
+        references: list[str] | list[list[str]] | None = None,
     ) -> BatchResult[T]: ...
 ```
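To illustrate how the revised protocol might be satisfied, here is a small sketch with one standalone and one comparison metric. `WordCount`, `ExactMatch`, and `run` are hypothetical names that do not appear in the plan, and `batch_score` is omitted for brevity, so the `Metric` class below is a reduced form of the protocol in the diff.

```python
from __future__ import annotations

from typing import Protocol, TypeVar

T = TypeVar("T", covariant=True)


class Metric(Protocol[T]):
    """Reduced form of the plan's protocol (batch_score left out)."""

    @property
    def name(self) -> str: ...
    @property
    def requires_reference(self) -> bool: ...
    def score(self, candidate: str, reference: str | list[str] | None = None) -> T: ...


class WordCount:
    """Hypothetical standalone metric: any reference is ignored."""

    name = "word_count"
    requires_reference = False

    def score(self, candidate: str, reference: str | list[str] | None = None) -> int:
        return len(candidate.split())


class ExactMatch:
    """Hypothetical comparison metric: a reference is mandatory."""

    name = "exact_match"
    requires_reference = True

    def score(self, candidate: str, reference: str | list[str] | None = None) -> float:
        if reference is None:
            raise ValueError(f"{self.name} requires a reference")
        # Accept a single reference or a list of acceptable references.
        refs = [reference] if isinstance(reference, str) else reference
        return 1.0 if candidate in refs else 0.0


def run(metric: Metric[T], candidate: str, reference: str | list[str] | None = None) -> T:
    """Fail fast on a missing reference before delegating to the metric."""
    if metric.requires_reference and reference is None:
        raise ValueError(f"{metric.name} requires a reference")
    return metric.score(candidate, reference)
```

Exposing `requires_reference` as data lets callers such as `run` check the requirement up front instead of discovering it through an exception raised deep inside `score`.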
@@ -266,6 +275,8 @@ v.weighted([
 ], min_score=0.75)
 ```
+**Reference requirements:** Validators wrapping comparison metrics (`bleu`, `rouge`, `semantic`) require `context.reference` to be set. If `None`, they raise `ValidationError` with a clear message. Constraint validators (`length`, `readability`, `contains`) do not require a reference.
 ---
 ### Component 4: Pytest Plugin
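The reference-requirements note added in the hunk above could look something like the sketch below. Only `ValidationContext`, `ValidationError`, and `context.reference` come from the plan. The `candidate` field, the factory functions `metric_validator` and `length_validator`, and the callable-based design are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


class ValidationError(Exception):
    """Raised when a validator cannot run with the given context."""


@dataclass
class ValidationContext:
    candidate: str  # assumed field name
    reference: str | None = None


Validator = Callable[[ValidationContext], bool]


def metric_validator(
    name: str, score_fn: Callable[[str, str], float], min_score: float
) -> Validator:
    """Wrap a comparison metric; the context must carry a reference."""

    def validate(context: ValidationContext) -> bool:
        if context.reference is None:
            raise ValidationError(f"{name} validator requires context.reference")
        return score_fn(context.candidate, context.reference) >= min_score

    return validate


def length_validator(max_chars: int) -> Validator:
    """Constraint validator: inspects only the candidate text."""

    def validate(context: ValidationContext) -> bool:
        return len(context.candidate) <= max_chars

    return validate
```

With these definitions, `length_validator(10)(ValidationContext("short"))` passes without a reference, whereas any validator built by `metric_validator` raises `ValidationError` for the same context.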
@@ -435,9 +446,10 @@ benchmark.assert_no_regression(tolerance=0.03)
 - [ ] BLEU/ROUGE implementations match reference implementations (nltk, rouge-score)
 - [ ] Semantic similarity correlates with human judgement on test pairs
-- [ ] Pytest plugin installs cleanly via `pip install veritext`
+- [ ] Pytest plugin installs cleanly via `uv pip install veritext`
 - [ ] Validation of 1000 text pairs completes in <5 seconds (excluding embeddings)
 - [ ] Benchmark regression detection has <5% false positive rate
+- [ ] Edge cases handled gracefully (empty text, None reference, Unicode)
 - [ ] Documentation includes working examples for each use case
 - [ ] All code passes ruff, mypy strict, and pytest with ≥80% coverage
 - [ ] Can explain design decisions and metric theory in interview