docs(plans): improve consistency and add edge case handling

- Add requires_reference property to Metric protocol for standalone metrics
- Make reference parameter optional in score/batch_score methods
- Add comprehensive Edge Case Handling section (empty text, Unicode, etc.)
- Expand phase tasks with explicit test coverage requirements
- Fix path reference to use relative workspace path
- Add missing test_runner.py to directory structure
- Clarify SemanticValidator integration in Phase 5
- Fix tuple/list type annotation in Benchmark.evaluate()
2026-02-03 16:04:02 +00:00
parent 49f1e27cb1
commit 818e241ab2
2 changed files with 143 additions and 43 deletions
--- a/docs/project-plan.md
+++ b/docs/project-plan.md
@@ -74,7 +74,7 @@ result = rouge.score(candidate, reference)
 | Lexical overlap | Jaccard similarity of tokens | Simple similarity |
 | Reading level | Flesch-Kincaid grade | Accessibility |

-**Note:** Reading level is a standalone metric that analyses only the candidate text — no reference required.
+**Note:** Reading level is a standalone metric (`requires_reference = False`) that analyses only the candidate text. Comparison metrics (BLEU, ROUGE, semantic) require a reference and raise `ValueError` if none is provided.

 ### Validators (Decision Logic)

@@ -180,6 +180,10 @@ benchmark.assert_no_regression(tolerance=0.05)
 5. **Explicit context** — `ValidationContext` dataclass instead of `**kwargs`.
   Type-safe, discoverable API.

+6. **Graceful edge case handling** — Empty text returns zero scores (not errors).
+   Missing reference raises clear `ValueError` for comparison metrics. Unicode
+   normalised to NFC by default.
+
 ---

 ## Project Components
@@ -215,12 +219,17 @@ class Metric(Protocol[T]):
    @property
    def name(self) -> str: ...

-    def score(self, candidate: str, reference: str | list[str]) -> T: ...
+    @property
+    def requires_reference(self) -> bool: ...
+
+    def score(self, candidate: str, reference: str | list[str] | None = None) -> T:
+        """Raises ValueError if reference required but not provided."""
+        ...

    def batch_score(
        self,
        candidates: list[str],
-        references: list[str] | list[list[str]]
+        references: list[str] | list[list[str]] | None = None,
    ) -> BatchResult[T]: ...
 ```

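A minimal sketch of how two classes might satisfy the revised `Metric` protocol in the hunk above, one comparison metric and one standalone metric. The class names `JaccardOverlap` and `MeanWordLength` are hypothetical and do not come from the plan; `MeanWordLength` is only a simple stand-in for a standalone metric such as reading level, not a Flesch-Kincaid implementation.

```python
from __future__ import annotations


class JaccardOverlap:
    """Hypothetical comparison metric: Jaccard similarity of whitespace tokens."""

    name = "jaccard"
    requires_reference = True

    def score(self, candidate: str, reference: str | list[str] | None = None) -> float:
        if reference is None:
            # Comparison metrics cannot run without something to compare against.
            raise ValueError(f"{self.name} requires a reference, got None")
        refs = [reference] if isinstance(reference, str) else reference
        cand_tokens = set(candidate.lower().split())
        best = 0.0
        for ref in refs:
            ref_tokens = set(ref.lower().split())
            union = cand_tokens | ref_tokens
            if union:  # both texts empty: leave the score at zero
                best = max(best, len(cand_tokens & ref_tokens) / len(union))
        return best


class MeanWordLength:
    """Hypothetical standalone metric: ignores any reference it is given."""

    name = "mean_word_length"
    requires_reference = False

    def score(self, candidate: str, reference: str | list[str] | None = None) -> float:
        words = candidate.split()
        if not words:
            return 0.0  # empty text scores zero rather than raising
        return sum(len(w) for w in words) / len(words)
```

Under these assumptions, `JaccardOverlap().score("the cat sat")` raises `ValueError`, while `MeanWordLength().score("the cat sat")` runs without a reference. A caller could check `requires_reference` up front to decide whether a missing reference is a problem.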
@@ -266,6 +275,8 @@ v.weighted([
 ], min_score=0.75)
 ```

+**Reference requirements:** Validators wrapping comparison metrics (`bleu`, `rouge`, `semantic`) require `context.reference` to be set. If `None`, they raise `ValidationError` with a clear message. Constraint validators (`length`, `readability`, `contains`) do not require a reference.
+
 ---

 ### Component 4: Pytest Plugin
@@ -435,9 +446,10 @@ benchmark.assert_no_regression(tolerance=0.03)

 - [ ] BLEU/ROUGE implementations match reference implementations (nltk, rouge-score)
 - [ ] Semantic similarity correlates with human judgement on test pairs
-- [ ] Pytest plugin installs cleanly via `pip install veritext`
+- [ ] Pytest plugin installs cleanly via `uv pip install veritext`
 - [ ] Validation of 1000 text pairs completes in <5 seconds (excluding embeddings)
 - [ ] Benchmark regression detection has <5% false positive rate
+- [ ] Edge cases handled gracefully (empty text, None reference, Unicode)
 - [ ] Documentation includes working examples for each use case
 - [ ] All code passes ruff, mypy strict, and pytest with ≥80% coverage
 - [ ] Can explain design decisions and metric theory in interview
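The edge-case criterion added in this commit (empty text, `None` reference, Unicode) could be exercised with something like the sketch below. It assumes a simple token-overlap score; the helper names `normalise` and `token_overlap` are illustrative and are not part of the plan. NFC normalisation uses the standard library's `unicodedata.normalize`.

```python
from __future__ import annotations

import unicodedata


def normalise(text: str, form: str = "NFC") -> str:
    """Normalise Unicode so composed and decomposed spellings compare equal."""
    return unicodedata.normalize(form, text)


def token_overlap(candidate: str, reference: str | None) -> float:
    """Hypothetical comparison score: Jaccard overlap of whitespace tokens."""
    if reference is None:
        # Missing reference is a caller error for comparison metrics.
        raise ValueError("token_overlap requires a reference, got None")
    cand = set(normalise(candidate).split())
    ref = set(normalise(reference).split())
    if not cand or not ref:
        return 0.0  # empty text yields a zero score, not an exception
    return len(cand & ref) / len(cand | ref)


composed = "caf\u00e9"     # 'é' as a single code point
decomposed = "cafe\u0301"  # 'e' followed by a combining acute accent
```

Here `composed` and `decomposed` differ as raw strings but score as identical once both are normalised to NFC, which is the behaviour the design principle describes.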