docs(plans): improve consistency and add edge case handling
- Add requires_reference property to Metric protocol for standalone metrics
- Make reference parameter optional in score/batch_score methods
- Add comprehensive Edge Case Handling section (empty text, Unicode, etc.)
- Expand phase tasks with explicit test coverage requirements
- Fix path reference to use relative workspace path
- Add missing test_runner.py to directory structure
- Clarify SemanticValidator integration in Phase 5
- Fix tuple/list type annotation in Benchmark.evaluate()
@@ -4,7 +4,7 @@ Semantic text validation framework for Python — validates text outputs against
|
|||||||
|
|
||||||
## Project Overview
|
## Project Overview
|
||||||
|
|
||||||
**Location:** `/home/kai/work/dev/portfolio/veritext/`
|
**Location:** `portfolio/veritext/` (relative to workspace root)
|
||||||
**Remote:** `https://gitea.kschappell.com/kschappell/veritext.git`
|
**Remote:** `https://gitea.kschappell.com/kschappell/veritext.git`
|
||||||
|
|
||||||
**Purpose:** A Python library for validating text outputs against semantic criteria. Designed for developers building systems that produce text (chatbots, content generators, summarisation tools) who need automated quality assurance beyond simple string matching.
|
**Purpose:** A Python library for validating text outputs against semantic criteria. Designed for developers building systems that produce text (chatbots, content generators, summarisation tools) who need automated quality assurance beyond simple string matching.
|
||||||
@@ -165,6 +165,7 @@ veritext/
|
|||||||
│ │ └── test_composite.py
|
│ │ └── test_composite.py
|
||||||
│ ├── test_benchmark/
|
│ ├── test_benchmark/
|
||||||
│ │ ├── test_storage.py
|
│ │ ├── test_storage.py
|
||||||
|
│ │ ├── test_runner.py
|
||||||
│ │ └── test_regression.py
|
│ │ └── test_regression.py
|
||||||
│ ├── test_pytest_plugin/
|
│ ├── test_pytest_plugin/
|
||||||
│ │ └── test_integration.py
|
│ │ └── test_integration.py
|
||||||
@@ -239,12 +240,30 @@ class Metric(Protocol[T]):
|
|||||||
@property
|
@property
|
||||||
def name(self) -> str: ...
|
def name(self) -> str: ...
|
||||||
|
|
||||||
def score(self, candidate: str, reference: str | list[str]) -> T: ...
|
@property
|
||||||
|
def requires_reference(self) -> bool:
|
||||||
|
"""Whether this metric requires a reference text."""
|
||||||
|
...
|
||||||
|
|
||||||
|
def score(self, candidate: str, reference: str | list[str] | None = None) -> T:
|
||||||
|
"""
|
||||||
|
Compute metric score.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
candidate: The text to evaluate.
|
||||||
|
reference: Reference text(s) for comparison. Required for comparison
|
||||||
|
metrics (BLEU, ROUGE, semantic). Ignored for standalone
|
||||||
|
metrics (readability).
|
||||||
|
|
||||||
|
Raises:
|
||||||
|
ValueError: If reference is required but not provided.
|
||||||
|
"""
|
||||||
|
...
|
||||||
|
|
||||||
def batch_score(
|
def batch_score(
|
||||||
self,
|
self,
|
||||||
candidates: list[str],
|
candidates: list[str],
|
||||||
references: list[str] | list[list[str]]
|
references: list[str] | list[list[str]] | None = None,
|
||||||
) -> BatchResult[T]: ...
|
) -> BatchResult[T]: ...
|
||||||
|
|
||||||
@dataclass
|
@dataclass
|
||||||
@@ -262,7 +281,7 @@ class BatchResult(Generic[T]):
|
|||||||
stats: dict[str, AggregateStats] # Aggregated stats for numeric fields
|
stats: dict[str, AggregateStats] # Aggregated stats for numeric fields
|
||||||
```
|
```
|
||||||
|
|
||||||
**Note:** Readability metrics (Flesch-Kincaid) accept but ignore the `reference` parameter since they only analyse the candidate text.
|
**Note:** Standalone metrics like readability return `False` for `requires_reference` and ignore the `reference` parameter. Comparison metrics (BLEU, ROUGE, semantic) return `True` and raise `ValueError` if `reference` is `None`.
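A minimal sketch of how two metrics might honour this contract. `WordCount` and `ExactMatch` are hypothetical stand-ins, not metrics from this plan, and they return plain floats instead of result dataclasses to keep the example short.

```python
from __future__ import annotations


class WordCount:
    """Hypothetical standalone metric: inspects the candidate only."""

    # Plain class attributes are used here for brevity; the protocol
    # declares these as read-only properties.
    name = "word_count"
    requires_reference = False

    def score(self, candidate: str, reference: str | list[str] | None = None) -> float:
        # The reference is accepted for interface compatibility and ignored.
        return float(len(candidate.split()))


class ExactMatch:
    """Hypothetical comparison metric: cannot run without a reference."""

    name = "exact_match"
    requires_reference = True

    def score(self, candidate: str, reference: str | list[str] | None = None) -> float:
        if reference is None:
            raise ValueError(f"Reference required for {self.name}")
        # Accept either a single reference or a list of references.
        references = [reference] if isinstance(reference, str) else reference
        return 1.0 if candidate in references else 0.0
```

Callers that handle mixed metric lists could check `requires_reference` up front and report a missing reference once, before any scoring starts.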
|
||||||
|
|
||||||
### Validator Protocol
|
### Validator Protocol
|
||||||
|
|
||||||
@@ -322,6 +341,46 @@ class RegressionReport:
|
|||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
## Edge Case Handling
|
||||||
|
|
||||||
|
All components must handle edge cases consistently:
|
||||||
|
|
||||||
|
### Empty Text
|
||||||
|
|
||||||
|
| Input | Behaviour |
|
||||||
|
|-------|-----------|
|
||||||
|
| Empty candidate (`""`) | Metrics return zero scores; validators fail unless explicitly configured |
|
||||||
|
| Empty reference (`""`) | Comparison metrics raise `ValueError` |
|
||||||
|
| Whitespace-only text | Treated as empty after tokenisation |
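The rules in this table could look roughly like the following for a Jaccard-style metric. This is a sketch under assumptions: `jaccard` is a hypothetical helper, tokenisation is a plain whitespace split standing in for `WordTokeniser`, and a whitespace-only reference is assumed to be rejected in the same way as an empty one.

```python
def jaccard(candidate: str, reference: str) -> float:
    """Jaccard similarity over whitespace tokens (hypothetical helper)."""
    candidate_tokens = set(candidate.split())
    reference_tokens = set(reference.split())
    if not reference_tokens:
        # Empty or whitespace-only reference: nothing to compare against.
        raise ValueError("Reference must contain at least one token")
    if not candidate_tokens:
        # Empty or whitespace-only candidate: zero score, not an error.
        return 0.0
    overlap = candidate_tokens & reference_tokens
    union = candidate_tokens | reference_tokens
    return len(overlap) / len(union)
```

Because `str.split()` with no arguments drops all whitespace, `""` and `"  \n"` both tokenise to an empty list, so the two cases share one code path.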
|
||||||
|
|
||||||
|
### None Reference
|
||||||
|
|
||||||
|
| Component | Behaviour |
|
||||||
|
|-----------|-----------|
|
||||||
|
| Comparison metrics (BLEU, ROUGE, semantic) | Raise `ValueError("Reference required for {metric_name}")` |
|
||||||
|
| Standalone metrics (readability) | Ignore, compute normally |
|
||||||
|
| Validators wrapping comparison metrics | Raise `ValidationError` if `context.reference` is `None` |
|
||||||
|
|
||||||
|
### Unicode & Encoding
|
||||||
|
|
||||||
|
- All text assumed to be valid UTF-8 strings
|
||||||
|
- Normalisation: NFC by default (configurable in `Tokeniser`)
|
||||||
|
- Emoji and non-Latin scripts: Supported, tokenised as words where applicable
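To illustrate why normalisation matters before tokenisation, the sketch below uses the standard library's `unicodedata.normalize`. The `normalise` wrapper is a hypothetical name; how `Tokeniser` exposes the setting is not specified here.

```python
import unicodedata


def normalise(text: str, form: str = "NFC") -> str:
    """Return text in the given Unicode normalisation form (NFC by default)."""
    return unicodedata.normalize(form, text)


decomposed = "cafe\u0301"  # 'e' followed by a combining acute accent
composed = "caf\u00e9"     # precomposed 'é'

# The two strings render identically but differ code point by code point,
# so token comparison would treat them as different words without NFC.
same_before = decomposed == composed
same_after = normalise(decomposed) == normalise(composed)
```

Without this step, an n-gram metric would count visually identical words as mismatches.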
|
||||||
|
|
||||||
|
### Very Long Text
|
||||||
|
|
||||||
|
- No hard limits enforced by default
|
||||||
|
- `Tokeniser` can accept `max_tokens: int | None` for truncation
|
||||||
|
- Semantic similarity: Truncates to model's max sequence length (typically 512 tokens) with warning logged
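A possible shape for optional truncation with a logged warning. This is only a sketch: `truncate_tokens` is a hypothetical helper, it uses the standard `logging` module where the plan configures structlog, and it counts list items, whereas an embedding model counts its own subword tokens.

```python
from __future__ import annotations

import logging

logger = logging.getLogger("veritext.sketch")


def truncate_tokens(tokens: list[str], max_tokens: int | None = None) -> list[str]:
    """Return at most max_tokens tokens; None means no limit."""
    if max_tokens is None or len(tokens) <= max_tokens:
        return tokens
    # Warn so that silently shortened inputs can be traced later.
    logger.warning("Truncating input from %d to %d tokens", len(tokens), max_tokens)
    return tokens[:max_tokens]
```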
|
||||||
|
|
||||||
|
### Multiple References
|
||||||
|
|
||||||
|
BLEU and ROUGE support multiple references (`list[str]`):
|
||||||
|
- BLEU: Computes against each reference, uses maximum n-gram matches
|
||||||
|
- ROUGE: Computes against each reference, returns best score
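One reading of "maximum n-gram matches" is the standard BLEU clipping rule: each candidate n-gram count is capped at the highest count of that n-gram in any single reference. The sketch below shows that rule for one n-gram order; `ngrams` and `clipped_precision` are hypothetical helpers, and the brevity penalty and the combination across orders are left out.

```python
from __future__ import annotations

from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate: list[str], references: list[list[str]], n: int = 1) -> float:
    """Modified n-gram precision against several references."""
    candidate_counts = ngrams(candidate, n)
    total = sum(candidate_counts.values())
    if total == 0:
        return 0.0
    # For every n-gram, keep the largest count seen in any one reference.
    max_ref_counts: Counter = Counter()
    for reference in references:
        for gram, count in ngrams(reference, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    # Clip each candidate count so repeated words cannot inflate the score.
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in candidate_counts.items())
    return clipped / total
```

For the candidate `the the the the the the the` with references `the cat is on the mat` and `there is a cat on the mat`, "the" appears at most twice in a single reference, so the unigram precision is 2/7.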
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
## Validator Naming Convention
|
## Validator Naming Convention
|
||||||
|
|
||||||
Consistent short names:
|
Consistent short names:
|
||||||
@@ -369,13 +428,14 @@ v.weighted( # Weighted score threshold
|
|||||||
2. Write `pyproject.toml` with optional dependencies
|
2. Write `pyproject.toml` with optional dependencies
|
||||||
3. Create `CLAUDE.md` with project guidelines
|
3. Create `CLAUDE.md` with project guidelines
|
||||||
4. Implement `core/exceptions.py` (full hierarchy)
|
4. Implement `core/exceptions.py` (full hierarchy)
|
||||||
5. Implement `core/types.py` (ValidationContext, CheckResult, BatchResult)
|
5. Implement `core/types.py` (`ValidationContext`, `CheckResult`, `ValidationResult`)
|
||||||
6. Implement `core/tokenisation.py` (WordTokeniser)
|
6. Implement `core/tokenisation.py` (`WordTokeniser` with NFC normalisation)
|
||||||
7. Implement `core/config.py` (pydantic-settings)
|
7. Implement `core/config.py` (pydantic-settings)
|
||||||
8. Implement `core/logging.py` (structlog configuration)
|
8. Implement `core/logging.py` (structlog configuration)
|
||||||
9. Create `__init__.py` with version
|
9. Create `__init__.py` with `__version__` and `__all__` exports
|
||||||
10. Write tests for tokenisation
|
10. Write tests for tokenisation (including Unicode, empty input, whitespace-only)
|
||||||
11. Initial commit to Gitea
|
11. Write tests for types (including edge cases)
|
||||||
|
12. Initial commit to Gitea
|
||||||
|
|
||||||
**Files:**
|
**Files:**
|
||||||
- `pyproject.toml`
|
- `pyproject.toml`
|
||||||
@@ -410,13 +470,18 @@ uv run pytest tests/test_core/ -v
|
|||||||
**Goal:** Implement BLEU and lexical similarity metrics.
|
**Goal:** Implement BLEU and lexical similarity metrics.
|
||||||
|
|
||||||
**Tasks:**
|
**Tasks:**
|
||||||
1. Implement `metrics/base.py` (Metric protocol)
|
1. Implement `metrics/base.py` (Metric protocol, `BatchResult`, `AggregateStats`)
|
||||||
2. Implement `metrics/results.py` (BleuResult, LexicalResult)
|
2. Implement `metrics/results.py` (`BleuResult`, `LexicalResult`)
|
||||||
3. Implement `metrics/bleu.py` (BLEU-1 through BLEU-4)
|
3. Implement `metrics/bleu.py` (BLEU-1 through BLEU-4)
|
||||||
4. Implement `metrics/lexical.py` (Jaccard, token overlap)
|
4. Implement `metrics/lexical.py` (Jaccard, token overlap)
|
||||||
5. Add batch processing with statistics
|
5. Add batch processing with aggregate statistics (mean, std, percentiles)
|
||||||
6. Write comprehensive tests with reference values
|
6. Write comprehensive tests:
|
||||||
7. Update changelog
|
- Single-pair scoring with reference values from NLTK
|
||||||
|
- Batch scoring with statistical aggregation
|
||||||
|
- Edge cases: empty text, single-word inputs, identical texts
|
||||||
|
- Multiple references support
|
||||||
|
7. Define `__all__` exports in each module's `__init__.py`
|
||||||
|
8. Update changelog
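For task 5, the aggregate statistics could be computed along these lines. This is a sketch under assumptions: `AggregateStatsSketch` and its fields are illustrative and may not match the real `AggregateStats`, the standard deviation is the population form, and percentiles use linear interpolation between neighbouring ranks.

```python
import statistics
from dataclasses import dataclass


@dataclass(frozen=True)
class AggregateStatsSketch:
    mean: float
    std: float
    p50: float
    p95: float


def percentile(sorted_values: list, q: float) -> float:
    """Linear-interpolation percentile of pre-sorted values, q in [0, 100]."""
    position = (len(sorted_values) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction


def aggregate(scores: list) -> AggregateStatsSketch:
    """Summarise one numeric field across a batch of results."""
    if not scores:
        raise ValueError("Cannot aggregate an empty batch")
    ordered = sorted(scores)
    return AggregateStatsSketch(
        mean=statistics.fmean(ordered),
        std=statistics.pstdev(ordered),
        p50=percentile(ordered, 50),
        p95=percentile(ordered, 95),
    )
```

Comparing against reference values from another library would need the same percentile convention, since the common conventions disagree on small batches.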
|
||||||
|
|
||||||
**Key Design:**
|
**Key Design:**
|
||||||
```python
|
```python
|
||||||
@@ -448,10 +513,15 @@ uv run pytest tests/test_metrics/ -v --cov=src/veritext/metrics
|
|||||||
**Goal:** Implement ROUGE and readability metrics.
|
**Goal:** Implement ROUGE and readability metrics.
|
||||||
|
|
||||||
**Tasks:**
|
**Tasks:**
|
||||||
1. Implement `metrics/rouge.py` (ROUGE-1, ROUGE-2, ROUGE-L)
|
1. Implement `metrics/rouge.py` (ROUGE-1, ROUGE-2, ROUGE-L with precision/recall/F1)
|
||||||
2. Implement `metrics/readability.py` (Flesch-Kincaid)
|
2. Implement `metrics/readability.py` (Flesch-Kincaid grade level)
|
||||||
3. Add RougeResult, ReadabilityResult to results.py
|
- Set `requires_reference = False` for standalone operation
|
||||||
4. Write comprehensive tests
|
3. Add `RougeResult`, `RougeScore`, `ReadabilityResult` to results.py
|
||||||
|
4. Write comprehensive tests:
|
||||||
|
- Single-pair scoring with reference values from `rouge-score` library
|
||||||
|
- Batch scoring with statistical aggregation
|
||||||
|
- Edge cases: empty text, very short text, identical texts
|
||||||
|
- Readability on various grade levels (children's text → academic)
|
||||||
5. Update changelog
|
5. Update changelog
|
||||||
|
|
||||||
**Files:**
|
**Files:**
|
||||||
@@ -473,13 +543,18 @@ uv run pytest tests/test_metrics/ -v
|
|||||||
**Goal:** Build composable validation system.
|
**Goal:** Build composable validation system.
|
||||||
|
|
||||||
**Tasks:**
|
**Tasks:**
|
||||||
1. Implement `validators/base.py` (Check protocol, ValidationResult)
|
1. Implement `validators/base.py` (`Check` protocol, `ValidationResult`)
|
||||||
2. Implement `validators/metric.py` (BleuValidator, RougeValidator)
|
2. Implement `validators/metric.py` (`BleuValidator`, `RougeValidator`)
|
||||||
3. Implement `validators/constraint.py` (LengthValidator, ContainsValidator, etc.)
|
- Raise `ValidationError` if `context.reference` is `None`
|
||||||
4. Implement `validators/composite.py` (AllOf, AnyOf, Weighted)
|
3. Implement `validators/constraint.py` (`LengthValidator`, `ContainsValidator`, etc.)
|
||||||
|
4. Implement `validators/composite.py` (`AllOf`, `AnyOf`, `Weighted`)
|
||||||
5. Create validator factory functions (`v.bleu()`, `v.length()`, etc.)
|
5. Create validator factory functions (`v.bleu()`, `v.length()`, etc.)
|
||||||
6. Write comprehensive tests
|
6. Define `__all__` exports in `validators/__init__.py`
|
||||||
7. Update changelog
|
7. Write comprehensive tests:
|
||||||
|
- Individual validators with passing/failing cases
|
||||||
|
- Composition (all_of, any_of, weighted)
|
||||||
|
- Edge cases: missing reference, empty text, boundary thresholds
|
||||||
|
8. Update changelog
|
||||||
|
|
||||||
**Key Design:**
|
**Key Design:**
|
||||||
```python
|
```python
|
||||||
@@ -523,11 +598,13 @@ uv run pytest tests/test_validators/ -v --cov=src/veritext/validators
|
|||||||
|
|
||||||
**Tasks:**
|
**Tasks:**
|
||||||
1. Implement `semantic/similarity.py` with lazy import
|
1. Implement `semantic/similarity.py` with lazy import
|
||||||
2. Add embedding caching
|
2. Add embedding caching for repeated texts
|
||||||
3. Add DependencyError for missing sentence-transformers
|
3. Add `DependencyError` for missing sentence-transformers
|
||||||
4. Implement SemanticValidator
|
4. Add `SemanticResult` to `metrics/results.py`
|
||||||
5. Write tests (skipped if dependency missing)
|
5. Add `SemanticValidator` to `validators/metric.py` (extends existing file)
|
||||||
6. Update changelog
|
6. Add `v.semantic()` factory function to `validators/__init__.py`
|
||||||
|
7. Write tests (skipped if dependency missing via `pytest.importorskip`)
|
||||||
|
8. Update changelog
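Tasks 1 and 3 together suggest a lazy import that converts a missing package into a `DependencyError`. The sketch below is one way to do that. The `require` helper is hypothetical, and deriving `DependencyError` from `ImportError` is an assumption, since the exception hierarchy is defined elsewhere in the plan.

```python
import importlib
from types import ModuleType


class DependencyError(ImportError):
    """Raised when an optional dependency is not installed (assumed base class)."""


def require(module_name: str, extra: str) -> ModuleType:
    """Import a module on first use, with an actionable error if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise DependencyError(
            f"'{module_name}' is required for this feature. "
            f"Install the '{extra}' extra to enable it."
        ) from exc
```

Calling `require("sentence_transformers", "semantic")` inside the method that first needs the model keeps `import veritext` cheap for users who never touch semantic similarity. On the test side, `pytest.importorskip("sentence_transformers")` at the top of a test module skips it when the package is absent.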
|
||||||
|
|
||||||
**Key Design:**
|
**Key Design:**
|
||||||
```python
|
```python
|
||||||
@@ -552,15 +629,18 @@ class SemanticSimilarity:
|
|||||||
**Files:**
|
**Files:**
|
||||||
- `src/veritext/semantic/__init__.py`
|
- `src/veritext/semantic/__init__.py`
|
||||||
- `src/veritext/semantic/similarity.py`
|
- `src/veritext/semantic/similarity.py`
|
||||||
|
- `src/veritext/metrics/results.py` (add `SemanticResult`)
|
||||||
|
- `src/veritext/validators/metric.py` (add `SemanticValidator`)
|
||||||
|
- `src/veritext/validators/__init__.py` (add `semantic()` factory)
|
||||||
- `tests/test_semantic/test_similarity.py`
|
- `tests/test_semantic/test_similarity.py`
|
||||||
|
|
||||||
**Verification:**
|
**Verification:**
|
||||||
```bash
|
```bash
|
||||||
# Without semantic dependency
|
# Without semantic dependency — tests should skip gracefully
|
||||||
uv run pytest tests/ -v --ignore=tests/test_semantic/
|
uv run pytest tests/ -v
|
||||||
|
|
||||||
# With semantic dependency
|
# With semantic dependency
|
||||||
uv pip install sentence-transformers
|
uv sync --extra semantic
|
||||||
uv run pytest tests/test_semantic/ -v
|
uv run pytest tests/test_semantic/ -v
|
||||||
```
|
```
|
||||||
|
|
||||||
@@ -596,7 +676,7 @@ def validate_text(
|
|||||||
min_rouge: float | None = None,
|
min_rouge: float | None = None,
|
||||||
min_semantic: float | None = None,
|
min_semantic: float | None = None,
|
||||||
max_length: int | None = None,
|
max_length: int | None = None,
|
||||||
max_reading_grade: int | None = None,
|
max_reading_grade: float | None = None,
|
||||||
contains: list[str] | None = None,
|
contains: list[str] | None = None,
|
||||||
excludes: list[str] | None = None,
|
excludes: list[str] | None = None,
|
||||||
) -> None:
|
) -> None:
|
||||||
@@ -605,9 +685,12 @@ def validate_text(
|
|||||||
|
|
||||||
Raises:
|
Raises:
|
||||||
AssertionError: With detailed failure information if validation fails.
|
AssertionError: With detailed failure information if validation fails.
|
||||||
|
ValueError: If comparison metrics requested but reference not provided.
|
||||||
"""
|
"""
|
||||||
```
|
```
|
||||||
|
|
||||||
|
**Error handling:** If `min_bleu`, `min_rouge`, or `min_semantic` is specified without a `reference`, raise `ValueError` immediately with a clear message rather than failing inside the metric.
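The early check described above might look like this. `ensure_reference` is a hypothetical helper that `validate_text` could call before building any validators, and the message wording is illustrative.

```python
from __future__ import annotations


def ensure_reference(
    reference: str | list[str] | None,
    *,
    min_bleu: float | None = None,
    min_rouge: float | None = None,
    min_semantic: float | None = None,
) -> None:
    """Fail fast if a comparison threshold is set without a reference."""
    thresholds = {"min_bleu": min_bleu, "min_rouge": min_rouge, "min_semantic": min_semantic}
    # A threshold of 0.0 still counts as requested, so test against None.
    requested = [name for name, value in thresholds.items() if value is not None]
    if requested and reference is None:
        raise ValueError(
            f"A reference is required when {', '.join(requested)} is set, "
            "but none was provided"
        )
```

Naming the offending arguments in the message points the test author at the call site, not at a traceback from inside a metric.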
|
||||||
|
|
||||||
**Files:**
|
**Files:**
|
||||||
- `src/veritext/pytest_plugin/__init__.py`
|
- `src/veritext/pytest_plugin/__init__.py`
|
||||||
- `src/veritext/pytest_plugin/fixtures.py`
|
- `src/veritext/pytest_plugin/fixtures.py`
|
||||||
@@ -629,12 +712,17 @@ uv run pytest tests/test_pytest_plugin/ -v
|
|||||||
**Goal:** Track quality over time, detect regressions.
|
**Goal:** Track quality over time, detect regressions.
|
||||||
|
|
||||||
**Tasks:**
|
**Tasks:**
|
||||||
1. Implement `benchmark/models.py` (BenchmarkRun, RegressionReport)
|
1. Implement `benchmark/models.py` (`BenchmarkRun`, `RegressionReport`)
|
||||||
2. Implement `benchmark/storage.py` (SQLite backend)
|
2. Implement `benchmark/storage.py` (SQLite backend)
|
||||||
3. Implement `benchmark/runner.py` (Benchmark class)
|
- Handle concurrent writes gracefully (SQLite WAL mode)
|
||||||
4. Implement `benchmark/regression.py` (statistical detection)
|
- Raise `StorageError` on corruption with recovery guidance
|
||||||
5. Add `assert_no_regression()` for CI
|
3. Implement `benchmark/runner.py` (`Benchmark` class)
|
||||||
6. Write tests
|
4. Implement `benchmark/regression.py` (statistical detection using rolling window)
|
||||||
|
5. Add `assert_no_regression()` for CI integration
|
||||||
|
6. Write comprehensive tests:
|
||||||
|
- Storage CRUD operations
|
||||||
|
- Regression detection with known degradation
|
||||||
|
- Edge cases: first run (no baseline), empty metrics
|
||||||
7. Update changelog
|
7. Update changelog
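For the WAL requirement in task 2, enabling write-ahead logging with the standard `sqlite3` module could be sketched as follows. `open_store` is a hypothetical helper, and raising `RuntimeError` stands in for the plan's `StorageError`.

```python
import sqlite3
from pathlib import Path


def open_store(path: Path) -> sqlite3.Connection:
    """Open a SQLite database and switch it to WAL journal mode."""
    # The timeout makes a writer wait for a lock instead of failing at once.
    connection = sqlite3.connect(path, timeout=5.0)
    # The pragma returns the journal mode actually in effect.
    mode = connection.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    if mode.lower() != "wal":
        connection.close()
        raise RuntimeError(f"Could not enable WAL mode (got {mode!r})")
    return connection
```

WAL lets readers proceed while a write is in progress, but writers are still serialised, so the timeout is what keeps two benchmark runs from failing each other immediately. In-memory databases do not support WAL, so tests for this path need a file on disk.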
|
||||||
|
|
||||||
**Key Interface:**
|
**Key Interface:**
|
||||||
@@ -646,7 +734,7 @@ class Benchmark:
|
|||||||
self,
|
self,
|
||||||
candidates: list[str],
|
candidates: list[str],
|
||||||
references: list[str],
|
references: list[str],
|
||||||
metrics: list[str] = ("rouge_l", "bleu4"),
|
metrics: list[str] | None = None, # Default: ["rouge_l", "bleu4"]
|
||||||
) -> BenchmarkRun:
|
) -> BenchmarkRun:
|
||||||
"""Evaluate candidates, store results, return the run record."""
|
"""Evaluate candidates, store results, return the run record."""
|
||||||
...
|
...
|
||||||
|
|||||||
@@ -74,7 +74,7 @@ result = rouge.score(candidate, reference)
|
|||||||
| Lexical overlap | Jaccard similarity of tokens | Simple similarity |
|
| Lexical overlap | Jaccard similarity of tokens | Simple similarity |
|
||||||
| Reading level | Flesch-Kincaid grade | Accessibility |
|
| Reading level | Flesch-Kincaid grade | Accessibility |
|
||||||
|
|
||||||
**Note:** Reading level is a standalone metric that analyses only the candidate text — no reference required.
|
**Note:** Reading level is a standalone metric (`requires_reference = False`) that analyses only the candidate text. Comparison metrics (BLEU, ROUGE, semantic) require a reference and raise `ValueError` if none provided.
|
||||||
|
|
||||||
### Validators (Decision Logic)
|
### Validators (Decision Logic)
|
||||||
|
|
||||||
@@ -180,6 +180,10 @@ benchmark.assert_no_regression(tolerance=0.05)
|
|||||||
5. **Explicit context** — `ValidationContext` dataclass instead of `**kwargs`.
|
5. **Explicit context** — `ValidationContext` dataclass instead of `**kwargs`.
|
||||||
Type-safe, discoverable API.
|
Type-safe, discoverable API.
|
||||||
|
|
||||||
|
6. **Graceful edge case handling** — Empty text returns zero scores (not errors).
|
||||||
|
Missing reference raises clear `ValueError` for comparison metrics. Unicode
|
||||||
|
normalised to NFC by default.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Project Components
|
## Project Components
|
||||||
@@ -215,12 +219,17 @@ class Metric(Protocol[T]):
|
|||||||
@property
|
@property
|
||||||
def name(self) -> str: ...
|
def name(self) -> str: ...
|
||||||
|
|
||||||
def score(self, candidate: str, reference: str | list[str]) -> T: ...
|
@property
|
||||||
|
def requires_reference(self) -> bool: ...
|
||||||
|
|
||||||
|
def score(self, candidate: str, reference: str | list[str] | None = None) -> T:
|
||||||
|
"""Raises ValueError if reference required but not provided."""
|
||||||
|
...
|
||||||
|
|
||||||
def batch_score(
|
def batch_score(
|
||||||
self,
|
self,
|
||||||
candidates: list[str],
|
candidates: list[str],
|
||||||
references: list[str] | list[list[str]]
|
references: list[str] | list[list[str]] | None = None,
|
||||||
) -> BatchResult[T]: ...
|
) -> BatchResult[T]: ...
|
||||||
```
|
```
|
||||||
|
|
||||||
@@ -266,6 +275,8 @@ v.weighted([
|
|||||||
], min_score=0.75)
|
], min_score=0.75)
|
||||||
```
|
```
|
||||||
|
|
||||||
|
**Reference requirements:** Validators wrapping comparison metrics (`bleu`, `rouge`, `semantic`) require `context.reference` to be set. If `None`, they raise `ValidationError` with a clear message. Constraint validators (`length`, `readability`, `contains`) do not require a reference.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
### Component 4: Pytest Plugin
|
### Component 4: Pytest Plugin
|
||||||
@@ -435,9 +446,10 @@ benchmark.assert_no_regression(tolerance=0.03)
|
|||||||
|
|
||||||
- [ ] BLEU/ROUGE implementations match reference implementations (nltk, rouge-score)
|
- [ ] BLEU/ROUGE implementations match reference implementations (nltk, rouge-score)
|
||||||
- [ ] Semantic similarity correlates with human judgement on test pairs
|
- [ ] Semantic similarity correlates with human judgement on test pairs
|
||||||
- [ ] Pytest plugin installs cleanly via `pip install veritext`
|
- [ ] Pytest plugin installs cleanly via `uv pip install veritext`
|
||||||
- [ ] Validation of 1000 text pairs completes in <5 seconds (excluding embeddings)
|
- [ ] Validation of 1000 text pairs completes in <5 seconds (excluding embeddings)
|
||||||
- [ ] Benchmark regression detection has <5% false positive rate
|
- [ ] Benchmark regression detection has <5% false positive rate
|
||||||
|
- [ ] Edge cases handled gracefully (empty text, None reference, Unicode)
|
||||||
- [ ] Documentation includes working examples for each use case
|
- [ ] Documentation includes working examples for each use case
|
||||||
- [ ] All code passes ruff, mypy strict, and pytest with ≥80% coverage
|
- [ ] All code passes ruff, mypy strict, and pytest with ≥80% coverage
|
||||||
- [ ] Can explain design decisions and metric theory in interview
|
- [ ] Can explain design decisions and metric theory in interview
|
||||||
|
|||||||