docs(plans): improve consistency and add edge case handling

- Add requires_reference property to Metric protocol for standalone metrics - Make reference parameter optional in score/batch_score methods - Add comprehensive Edge Case Handling section (empty text, Unicode, etc.) - Expand phase tasks with explicit test coverage requirements - Fix path reference to use relative workspace path - Add missing test_runner.py to directory structure - Clarify SemanticValidator integration in Phase 5 - Fix tuple/list type annotation in Benchmark.evaluate()
2026-02-03 16:04:02 +00:00
parent 49f1e27cb1
commit 818e241ab2
2 changed files with 143 additions and 43 deletions
@@ -4,7 +4,7 @@ Semantic text validation framework for Python — validates text outputs against
 ## Project Overview
-**Location:** `/home/kai/work/dev/portfolio/veritext/`
+**Location:** `portfolio/veritext/` (relative to workspace root)
 **Remote:** `https://gitea.kschappell.com/kschappell/veritext.git`
 **Purpose:** A Python library for validating text outputs against semantic criteria. Designed for developers building systems that produce text (chatbots, content generators, summarisation tools) who need automated quality assurance beyond simple string matching.
@@ -165,6 +165,7 @@ veritext/
 │   │   └── test_composite.py
 │   ├── test_benchmark/
 │   │   ├── test_storage.py
 │   │   ├── test_runner.py
 │   │   └── test_regression.py
 │   ├── test_pytest_plugin/
 │   │   └── test_integration.py
@@ -239,12 +240,30 @@ class Metric(Protocol[T]):
    @property
    def name(self) -> str: ...
-    def score(self, candidate: str, reference: str | list[str]) -> T: ...
+    @property
    def requires_reference(self) -> bool:
        """Whether this metric requires a reference text."""
        ...
    def score(self, candidate: str, reference: str | list[str] | None = None) -> T:
        """
        Compute metric score.
        Args:
            candidate: The text to evaluate.
            reference: Reference text(s) for comparison. Required for comparison
                       metrics (BLEU, ROUGE, semantic). Ignored for standalone
                       metrics (readability).
        Raises:
            ValueError: If reference is required but not provided.
        """
        ...
    def batch_score(
        self,
        candidates: list[str],
-        references: list[str] | list[list[str]]
+        references: list[str] | list[list[str]] | None = None,
    ) -> BatchResult[T]: ...
@dataclass
@@ -262,7 +281,7 @@ class BatchResult(Generic[T]):
    stats: dict[str, AggregateStats]     # Aggregated stats for numeric fields
 ```
-**Note:** Readability metrics (Flesch-Kincaid) accept but ignore the `reference` parameter since they only analyse the candidate text.
+**Note:** Standalone metrics like readability return `False` for `requires_reference` and ignore the `reference` parameter. Comparison metrics (BLEU, ROUGE, semantic) return `True` and raise `ValueError` if `reference` is `None`.
 ### Validator Protocol
@@ -322,6 +341,46 @@ class RegressionReport:
 ---
 ## Edge Case Handling
 All components must handle edge cases consistently:
 ### Empty Text
 | Input | Behaviour |
 |-------|-----------|
 | Empty candidate (`""`) | Metrics return zero scores; validators fail unless explicitly configured |
 | Empty reference (`""`) | Comparison metrics raise `ValueError` |
 | Whitespace-only text | Treated as empty after tokenisation |
 ### None Reference
 | Component | Behaviour |
 |-----------|-----------|
 | Comparison metrics (BLEU, ROUGE, semantic) | Raise `ValueError("Reference required for {metric_name}")` |
 | Standalone metrics (readability) | Ignore, compute normally |
 | Validators wrapping comparison metrics | Raise `ValidationError` if `context.reference` is `None` |
 ### Unicode & Encoding
 - All text assumed to be valid UTF-8 strings
 - Normalisation: NFC by default (configurable in `Tokeniser`)
 - Emoji and non-Latin scripts: Supported, tokenised as words where applicable
 ### Very Long Text
 - No hard limits enforced by default
 - `Tokeniser` can accept `max_tokens: int | None` for truncation
 - Semantic similarity: Truncates to model's max sequence length (typically 512 tokens) with warning logged
 ### Multiple References
 BLEU and ROUGE support multiple references (`list[str]`):
 - BLEU: Computes against each reference, uses maximum n-gram matches
 - ROUGE: Computes against each reference, returns best score
 ---
 ## Validator Naming Convention
 Consistent short names:
@@ -369,13 +428,14 @@ v.weighted(                              # Weighted score threshold
 2. Write `pyproject.toml` with optional dependencies
 3. Create `CLAUDE.md` with project guidelines
 4. Implement `core/exceptions.py` (full hierarchy)
-5. Implement `core/types.py` (ValidationContext, CheckResult, BatchResult)
+5. Implement `core/types.py` (`ValidationContext`, `CheckResult`, `ValidationResult`)
-6. Implement `core/tokenisation.py` (WordTokeniser)
+6. Implement `core/tokenisation.py` (`WordTokeniser` with NFC normalisation)
 7. Implement `core/config.py` (pydantic-settings)
 8. Implement `core/logging.py` (structlog configuration)
-9. Create `__init__.py` with version
+9. Create `__init__.py` with `__version__` and `__all__` exports
-10. Write tests for tokenisation
+10. Write tests for tokenisation (including Unicode, empty input, whitespace-only)
-11. Initial commit to Gitea
+11. Write tests for types (including edge cases)
 12. Initial commit to Gitea
 **Files:**
 - `pyproject.toml`
@@ -410,13 +470,18 @@ uv run pytest tests/test_core/ -v
 **Goal:** Implement BLEU and lexical similarity metrics.
 **Tasks:**
-1. Implement `metrics/base.py` (Metric protocol)
+1. Implement `metrics/base.py` (Metric protocol, `BatchResult`, `AggregateStats`)
-2. Implement `metrics/results.py` (BleuResult, LexicalResult)
+2. Implement `metrics/results.py` (`BleuResult`, `LexicalResult`)
 3. Implement `metrics/bleu.py` (BLEU-1 through BLEU-4)
 4. Implement `metrics/lexical.py` (Jaccard, token overlap)
-5. Add batch processing with statistics
+5. Add batch processing with aggregate statistics (mean, std, percentiles)
-6. Write comprehensive tests with reference values
+6. Write comprehensive tests:
-7. Update changelog
+   - Single-pair scoring with reference values from NLTK
   - Batch scoring with statistical aggregation
   - Edge cases: empty text, single-word inputs, identical texts
   - Multiple references support
 7. Define `__all__` exports in each module's `__init__.py`
 8. Update changelog
 **Key Design:**
 ```python
@@ -448,10 +513,15 @@ uv run pytest tests/test_metrics/ -v --cov=src/veritext/metrics
 **Goal:** Implement ROUGE and readability metrics.
 **Tasks:**
-1. Implement `metrics/rouge.py` (ROUGE-1, ROUGE-2, ROUGE-L)
+1. Implement `metrics/rouge.py` (ROUGE-1, ROUGE-2, ROUGE-L with precision/recall/F1)
-2. Implement `metrics/readability.py` (Flesch-Kincaid)
+2. Implement `metrics/readability.py` (Flesch-Kincaid grade level)
-3. Add RougeResult, ReadabilityResult to results.py
+   - Set `requires_reference = False` for standalone operation
-4. Write comprehensive tests
+3. Add `RougeResult`, `RougeScore`, `ReadabilityResult` to results.py
 4. Write comprehensive tests:
   - Single-pair scoring with reference values from `rouge-score` library
   - Batch scoring with statistical aggregation
   - Edge cases: empty text, very short text, identical texts
   - Readability on various grade levels (children's text → academic)
 5. Update changelog
 **Files:**
@@ -473,13 +543,18 @@ uv run pytest tests/test_metrics/ -v
 **Goal:** Build composable validation system.
 **Tasks:**
-1. Implement `validators/base.py` (Check protocol, ValidationResult)
+1. Implement `validators/base.py` (`Check` protocol, `ValidationResult`)
-2. Implement `validators/metric.py` (BleuValidator, RougeValidator)
+2. Implement `validators/metric.py` (`BleuValidator`, `RougeValidator`)
-3. Implement `validators/constraint.py` (LengthValidator, ContainsValidator, etc.)
+   - Raise `ValidationError` if `context.reference` is `None`
-4. Implement `validators/composite.py` (AllOf, AnyOf, Weighted)
+3. Implement `validators/constraint.py` (`LengthValidator`, `ContainsValidator`, etc.)
 4. Implement `validators/composite.py` (`AllOf`, `AnyOf`, `Weighted`)
 5. Create validator factory functions (`v.bleu()`, `v.length()`, etc.)
-6. Write comprehensive tests
+6. Define `__all__` exports in `validators/__init__.py`
-7. Update changelog
+7. Write comprehensive tests:
   - Individual validators with passing/failing cases
   - Composition (all_of, any_of, weighted)
   - Edge cases: missing reference, empty text, boundary thresholds
 8. Update changelog
 **Key Design:**
 ```python
@@ -523,11 +598,13 @@ uv run pytest tests/test_validators/ -v --cov=src/veritext/validators
 **Tasks:**
 1. Implement `semantic/similarity.py` with lazy import
-2. Add embedding caching
+2. Add embedding caching for repeated texts
-3. Add DependencyError for missing sentence-transformers
+3. Add `DependencyError` for missing sentence-transformers
-4. Implement SemanticValidator
+4. Add `SemanticResult` to `metrics/results.py`
-5. Write tests (skipped if dependency missing)
+5. Add `SemanticValidator` to `validators/metric.py` (extends existing file)
-6. Update changelog
+6. Add `v.semantic()` factory function to `validators/__init__.py`
 7. Write tests (skipped if dependency missing via `pytest.importorskip`)
 8. Update changelog
 **Key Design:**
 ```python
@@ -552,15 +629,18 @@ class SemanticSimilarity:
 **Files:**
 - `src/veritext/semantic/__init__.py`
 - `src/veritext/semantic/similarity.py`
 - `src/veritext/metrics/results.py` (add `SemanticResult`)
 - `src/veritext/validators/metric.py` (add `SemanticValidator`)
 - `src/veritext/validators/__init__.py` (add `semantic()` factory)
 - `tests/test_semantic/test_similarity.py`
 **Verification:**
 ```bash
-# Without semantic dependency
+# Without semantic dependency — tests should skip gracefully
-uv run pytest tests/ -v --ignore=tests/test_semantic/
+uv run pytest tests/ -v
 # With semantic dependency
-uv pip install sentence-transformers
+uv sync --extra semantic
 uv run pytest tests/test_semantic/ -v
 ```
@@ -596,7 +676,7 @@ def validate_text(
    min_rouge: float | None = None,
    min_semantic: float | None = None,
    max_length: int | None = None,
-    max_reading_grade: int | None = None,
+    max_reading_grade: float | None = None,
    contains: list[str] | None = None,
    excludes: list[str] | None = None,
 ) -> None:
@@ -605,9 +685,12 @@ def validate_text(
    Raises:
        AssertionError: With detailed failure information if validation fails.
        ValueError: If comparison metrics requested but reference not provided.
    """
 ```
 **Error handling:** If `min_bleu`, `min_rouge`, or `min_semantic` is specified without a `reference`, raise `ValueError` immediately with a clear message rather than failing inside the metric.
 **Files:**
 - `src/veritext/pytest_plugin/__init__.py`
 - `src/veritext/pytest_plugin/fixtures.py`
@@ -629,12 +712,17 @@ uv run pytest tests/test_pytest_plugin/ -v
 **Goal:** Track quality over time, detect regressions.
 **Tasks:**
-1. Implement `benchmark/models.py` (BenchmarkRun, RegressionReport)
+1. Implement `benchmark/models.py` (`BenchmarkRun`, `RegressionReport`)
 2. Implement `benchmark/storage.py` (SQLite backend)
-3. Implement `benchmark/runner.py` (Benchmark class)
+   - Handle concurrent writes gracefully (SQLite WAL mode)
-4. Implement `benchmark/regression.py` (statistical detection)
+   - Raise `StorageError` on corruption with recovery guidance
-5. Add `assert_no_regression()` for CI
+3. Implement `benchmark/runner.py` (`Benchmark` class)
-6. Write tests
+4. Implement `benchmark/regression.py` (statistical detection using rolling window)
 5. Add `assert_no_regression()` for CI integration
 6. Write comprehensive tests:
   - Storage CRUD operations
   - Regression detection with known degradation
   - Edge cases: first run (no baseline), empty metrics
 7. Update changelog
 **Key Interface:**
@@ -646,7 +734,7 @@ class Benchmark:
        self,
        candidates: list[str],
        references: list[str],
-        metrics: list[str] = ("rouge_l", "bleu4"),
+        metrics: list[str] | None = None,  # Default: ["rouge_l", "bleu4"]
    ) -> BenchmarkRun:
        """Evaluate candidates, store results, return the run record."""
        ...
@@ -74,7 +74,7 @@ result = rouge.score(candidate, reference)
 | Lexical overlap | Jaccard similarity of tokens | Simple similarity |
 | Reading level | Flesch-Kincaid grade | Accessibility |
-**Note:** Reading level is a standalone metric that analyses only the candidate text — no reference required.
+**Note:** Reading level is a standalone metric (`requires_reference = False`) that analyses only the candidate text. Comparison metrics (BLEU, ROUGE, semantic) require a reference and raise `ValueError` if none provided.
 ### Validators (Decision Logic)
@@ -180,6 +180,10 @@ benchmark.assert_no_regression(tolerance=0.05)
 5. **Explicit context** — `ValidationContext` dataclass instead of `**kwargs`.
   Type-safe, discoverable API.
 6. **Graceful edge case handling** — Empty text returns zero scores (not errors).
   Missing reference raises clear `ValueError` for comparison metrics. Unicode
   normalised to NFC by default.
 ---
 ## Project Components
@@ -215,12 +219,17 @@ class Metric(Protocol[T]):
    @property
    def name(self) -> str: ...
-    def score(self, candidate: str, reference: str | list[str]) -> T: ...
+    @property
    def requires_reference(self) -> bool: ...
    def score(self, candidate: str, reference: str | list[str] | None = None) -> T:
        """Raises ValueError if reference required but not provided."""
        ...
    def batch_score(
        self,
        candidates: list[str],
-        references: list[str] | list[list[str]]
+        references: list[str] | list[list[str]] | None = None,
    ) -> BatchResult[T]: ...
 ```
@@ -266,6 +275,8 @@ v.weighted([
 ], min_score=0.75)
 ```
 **Reference requirements:** Validators wrapping comparison metrics (`bleu`, `rouge`, `semantic`) require `context.reference` to be set. If `None`, they raise `ValidationError` with a clear message. Constraint validators (`length`, `readability`, `contains`) do not require a reference.
 ---
 ### Component 4: Pytest Plugin
@@ -435,9 +446,10 @@ benchmark.assert_no_regression(tolerance=0.03)
 - [ ] BLEU/ROUGE implementations match reference implementations (nltk, rouge-score)
 - [ ] Semantic similarity correlates with human judgement on test pairs
- [ ] Pytest plugin installs cleanly via `pip install veritext`
+- [ ] Pytest plugin installs cleanly via `uv pip install veritext`
 - [ ] Validation of 1000 text pairs completes in <5 seconds (excluding embeddings)
 - [ ] Benchmark regression detection has <5% false positive rate
 - [ ] Edge cases handled gracefully (empty text, None reference, Unicode)
 - [ ] Documentation includes working examples for each use case
 - [ ] All code passes ruff, mypy strict, and pytest with ≥80% coverage
 - [ ] Can explain design decisions and metric theory in interview