Compare commits
11 Commits
feat/cli...docs/polis
| Author | SHA1 | Date |
|---|---|---|
| | 0699e97e1d | |
| | 7de4505e31 | |
| | 564d663c78 | |
| | 0b2bc6c688 | |
| | aa687f43cd | |
| | f18427e123 | |
| | 1754556c99 | |
| | 13c869f5d6 | |
| | 93515707cc | |
| | 3cde5aba77 | |
| | 69966d171c | |
60
changelog.md
@@ -7,35 +7,85 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

### Changed

- Refactored CLI metric computation to eliminate code duplication
- Version format updated from `0.1.0-dev` to `0.1.0.dev0` (PEP 440 compliance)
- Settings instance is now cached via `@lru_cache` for better performance
- Documented composite validators' intentional deviation from `Check` protocol return type
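
The settings-caching entry above can be illustrated with a minimal sketch. It assumes a plain dataclass in place of the project's pydantic-settings model; the `Settings` fields, the `get_settings()` name, and the `VERITEXT_LOG_LEVEL` variable are hypothetical and not taken from the Veritext source.

```python
import os
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class Settings:
    """Hypothetical stand-in for the project's settings model."""

    log_level: str = "INFO"
    cache_max_size: int = 1000


@lru_cache(maxsize=1)
def get_settings() -> Settings:
    """Build the settings once; later calls return the same cached instance."""
    # The environment variable name is an assumption for illustration only.
    return Settings(log_level=os.environ.get("VERITEXT_LOG_LEVEL", "INFO"))


first = get_settings()
second = get_settings()
print(first is second)  # both names refer to the single cached object
```

In tests, `get_settings.cache_clear()` resets the cache so that changed environment variables are picked up again.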

### Fixed

- Consolidated redundant empty checks in ROUGE-L computation
- Fixed README example using incorrect property names (`grade_level` → `flesch_kincaid_grade`, `reading_ease` → `flesch_reading_ease`)

### Documentation

- Added Phase 10 (Portfolio Demos) to implementation plan: Streamlit demo and Jupyter notebooks
- Updated project plan with portfolio demo section
- Fixed potential crash in ROUGE metric when all references are empty after tokenisation
- Fixed potential division by zero in readability metric when text has no sentence endings
- Fixed unbounded cache growth in `SemanticSimilarity` by implementing LRU eviction with configurable max size
- Fixed mutable list aliasing in `AllOf` and `AnyOf` composite validators
- Fixed regex pattern validation in `ContainsValidator` and `ExcludesValidator` to fail at init time rather than during `check()`
- Fixed pytest plugin tests failing with duplicate plugin registration error
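
The regex-validation entry in the list above describes moving pattern errors from `check()` to construction time. A minimal sketch of that idea follows, assuming eager compilation with the standard `re` module; the `ContainsCheck` class and its error message are hypothetical and do not reproduce the real `ContainsValidator`.

```python
import re


class ContainsCheck:
    """Hypothetical stand-in for a pattern-based validator."""

    def __init__(self, patterns: list[str]) -> None:
        # Compile eagerly so an invalid pattern raises here, not inside check().
        try:
            self._compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
        except re.error as exc:
            raise ValueError(f"Invalid regex pattern: {exc}") from exc

    def check(self, text: str) -> bool:
        """Pass only when every pattern matches somewhere in the text."""
        return all(p.search(text) for p in self._compiled)


check = ContainsCheck(["guide", r"basics?"])
print(check.check("This short guide explains the basics clearly."))

try:
    ContainsCheck(["(unclosed"])
except ValueError as exc:
    print(f"Rejected at init: {exc}")
```

Failing at construction means a misconfigured validator surfaces when the test suite is collected or the validator is built, rather than on the first text that happens to reach it.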

### Added

- Added `.score` property to `LexicalResult` for API consistency with other result types
- Added `cache_max_size` parameter to `SemanticSimilarity` (default: 1000 embeddings)
- Added test coverage for `core/config.py` and `core/logging.py` modules

## [0.1.0] — 2026-02-03

Initial release of Veritext, a semantic text validation framework for Python.

### Added

#### Core

- Project scaffold with pyproject.toml and development tooling
- Core exception hierarchy (`VeritextError` and subclasses)
- Core types: `ValidationContext`, `CheckResult`, `ValidationResult`
- Word tokeniser with Unicode normalisation support
- Configuration module with pydantic-settings
- Structured logging with structlog

#### Metrics

- Metrics module with `Metric` protocol, `AggregateStats`, and `BatchResult` types
- BLEU metric implementation (BLEU-1 through BLEU-4 with brevity penalty)
- Lexical similarity metric (Jaccard similarity and token overlap)
- ROUGE metric (ROUGE-1, ROUGE-2, ROUGE-L with precision/recall/F-measure)
- Flesch-Kincaid readability metrics (grade level and reading ease)
- Batch scoring with aggregate statistics for all metrics
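
The batch-scoring entry above mentions aggregate statistics. A rough sketch of what such a summary might compute is shown here; the `AggregateSketch` fields and the choice of population standard deviation are assumptions, and the real `AggregateStats` type may differ.

```python
from dataclasses import dataclass
from statistics import fmean, pstdev


@dataclass(frozen=True)
class AggregateSketch:
    """Hypothetical summary of per-sample scores for one metric."""

    mean: float
    minimum: float
    maximum: float
    std_dev: float
    count: int


def aggregate(scores: list[float]) -> AggregateSketch:
    """Summarise a batch of scores; an empty batch is rejected explicitly."""
    if not scores:
        raise ValueError("cannot aggregate an empty batch")
    return AggregateSketch(
        mean=fmean(scores),
        minimum=min(scores),
        maximum=max(scores),
        std_dev=pstdev(scores),  # population standard deviation
        count=len(scores),
    )


stats = aggregate([0.5, 0.75, 1.0])
print(stats)
```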

#### Validators

- Validators module with `Check` protocol for validation checks
- Metric-based validators: `BleuValidator`, `RougeValidator`, `LexicalValidator`
- Constraint validators: `LengthValidator`, `ReadabilityValidator`, `ContainsValidator`, `ExcludesValidator`
- Composite validators: `AllOf` (all checks must pass), `AnyOf` (any check must pass)
- Factory functions for clean validator API (`bleu()`, `rouge()`, `lexical()`, `length()`, `readability()`, `contains()`, `excludes()`, `all_of()`, `any_of()`)

#### Semantic Similarity

- Semantic similarity module with embedding-based text comparison (requires `veritext[semantic]` extra)
- `SemanticSimilarity` metric using sentence-transformers for semantic relatedness
- `SemanticValidator` for threshold-based semantic similarity validation
- `semantic()` factory function for creating semantic validators
- Embedding caching for performance optimisation in repeated comparisons

#### Pytest Plugin

- Native pytest plugin for CI/CD integration (entry point: `pytest11`)
- `validate_text()` assertion function for expressive test assertions
- `text_validation` marker for filtering validation tests
- Pytest fixtures: `text_validator` factory and `validation_context` helper
- Detailed failure messages with text preview and check diagnostics

#### Benchmarking

- Benchmark module for quality tracking and regression detection
- `Benchmark` class for evaluating text quality over time with metric storage
- `BenchmarkRun` and `RegressionReport` data models for tracking runs
@@ -45,6 +95,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- `assert_no_regression()` raises `RegressionDetectedError` for CI integration
- Customisable tolerance threshold and window size for regression detection
- Metadata support for tracking git SHA, model versions, etc.
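
The tolerance and window options listed above can be pictured with a small sketch. It assumes the latest score is compared against the mean of up to `window` preceding runs and that `tolerance` is an absolute drop; both are assumptions, and the standalone `check_regression()` function below is hypothetical rather than the `Benchmark` method of the same name.

```python
from statistics import fmean


def check_regression(
    history: list[float], tolerance: float = 0.05, window: int = 5
) -> tuple[bool, float]:
    """Return (detected, delta) for the latest score against a rolling baseline."""
    if len(history) < 2:
        return False, 0.0  # nothing to compare against yet
    *previous, current = history
    baseline = fmean(previous[-window:])
    delta = current - baseline
    # A regression is a drop larger than the allowed tolerance.
    return delta < -tolerance, delta


stable = [0.62, 0.61, 0.63, 0.62, 0.62]
print(check_regression(stable))
print(check_regression(stable + [0.40]))
```

A relative tolerance (a percentage of the baseline) would be an equally plausible design; the absolute form is used here only because it is the simplest to read.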

#### CLI

- Command-line interface (CLI) via `veritext` command
- `veritext validate` command for inline and file-based text validation
- JSONL input format support for batch validation (`--file` option)
@@ -54,3 +107,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- `veritext benchmark show` command for viewing benchmark history
- `veritext benchmark check` command for regression detection with exit code 1 on failure
- Rich-formatted terminal output with tables and coloured panels

#### Documentation

- Comprehensive readme with usage examples
- Example scripts: basic validation, chatbot testing, benchmark regression
@@ -871,6 +871,59 @@ uv run pytest --cov=src/veritext --cov-report=term-missing

---

### Phase 10: Portfolio Demos

**Goal:** Interactive demos for showcasing Veritext without installation.

**Step 1 — Streamlit Demo:**

Build a quick interactive web UI for general visitors.

- [ ] Create `demo/streamlit_app.py`
- [ ] Text input boxes (candidate + reference)
- [ ] Metric selector (BLEU, ROUGE, lexical, readability)
- [ ] Threshold sliders for pass/fail validation
- [ ] Results table with scores and status
- [ ] Deploy to homeserver (e.g., `veritext.kschappell.com`)

**Step 2 — Jupyter Notebook Collection:**

Deep-dive notebooks targeting data science and ML recruiters.

- [ ] Create `notebooks/` directory
- [ ] `01-metrics-overview.ipynb` — Introduction to each metric with visualisations
- [ ] `02-batch-evaluation.ipynb` — Evaluating model outputs at scale
- [ ] `03-regression-detection.ipynb` — Tracking quality over time
- [ ] `04-chatbot-validation.ipynb` — Real-world use case

**Step 3 — JupyterLite Deployment:**

Host notebooks as static files running in the browser.

- [ ] Configure JupyterLite build with veritext pre-installed
- [ ] Bundle notebooks into static site
- [ ] Deploy alongside Streamlit demo

**Files:**
- `demo/streamlit_app.py`
- `notebooks/01-metrics-overview.ipynb`
- `notebooks/02-batch-evaluation.ipynb`
- `notebooks/03-regression-detection.ipynb`
- `notebooks/04-chatbot-validation.ipynb`
- `notebooks/jupyterlite-config.json`

**Verification:**
```bash
# Streamlit
uv run streamlit run demo/streamlit_app.py

# JupyterLite (local preview)
jupyter lite build --contents notebooks/
jupyter lite serve
```

---

## Dependencies

```toml

@@ -488,3 +488,47 @@ benchmark.assert_no_regression(tolerance=0.03)

5. **Natural portfolio narrative** — "I was building X and needed a better way to test
   it, so I built this tool." Every interviewer has faced similar problems.

---

## Portfolio Demos (Future)

Interactive demos to showcase Veritext without requiring installation.

### Streamlit Demo

A quick interactive web UI for general visitors and recruiters.

**Features:**
- Text input boxes (candidate + reference)
- Metric selector (BLEU, ROUGE, lexical, readability)
- Threshold sliders for pass/fail validation
- Results table with scores and status

**Deployment:** Self-hosted on homeserver (e.g., `veritext.kschappell.com`)

**Effort:** ~30 minutes

### Jupyter Notebook Collection

Deep-dive notebooks targeting data science and ML recruiters.

**Notebooks:**

| Notebook | Purpose |
|----------|---------|
| `01-metrics-overview.ipynb` | Introduction to each metric with visualisations |
| `02-batch-evaluation.ipynb` | Evaluating model outputs at scale, statistical analysis |
| `03-regression-detection.ipynb` | Tracking quality over time, detecting degradation |
| `04-chatbot-validation.ipynb` | Real-world use case: validating chatbot responses |

**Hosting:** JupyterLite (static files, runs in browser via WebAssembly)

**Deployment:** Self-hosted alongside Streamlit demo

**Why both:**

| Demo Type | Audience | Value |
|-----------|----------|-------|
| Streamlit | General visitors | Quick, interactive, no friction |
| Notebooks | Data/ML recruiters | Shows analytical depth, speaks their language |

135
examples/basic_validation.py
Normal file
@@ -0,0 +1,135 @@
"""Basic text validation examples.

Demonstrates core Veritext functionality:
- Single metric scoring (BLEU, ROUGE)
- Validator usage with thresholds
- Composite validators (all_of, any_of)
- Constraint validators (length, readability)
"""

from veritext.core.types import ValidationContext
from veritext.metrics import Bleu, Rouge
from veritext.validators import (
    all_of,
    any_of,
    bleu,
    contains,
    excludes,
    length,
    readability,
    rouge,
)


def metric_scoring_example() -> None:
    """Score text using individual metrics."""
    candidate = "The quick brown fox jumps over the lazy dog."
    reference = "A fast brown fox leaps over a sleepy dog."

    # BLEU scoring (translation quality)
    bleu_metric = Bleu()
    bleu_result = bleu_metric.score(candidate, reference)
    print("BLEU Scores:")
    print(f" BLEU-1: {bleu_result.bleu1:.3f}")
    print(f" BLEU-4: {bleu_result.bleu4:.3f}")
    print(f" Brevity penalty: {bleu_result.brevity_penalty:.3f}")

    # ROUGE scoring (summary quality)
    rouge_metric = Rouge()
    rouge_result = rouge_metric.score(candidate, reference)
    print("\nROUGE Scores:")
    print(f" ROUGE-1 F1: {rouge_result.rouge1.fmeasure:.3f}")
    print(f" ROUGE-L F1: {rouge_result.rouge_l.fmeasure:.3f}")


def validator_example() -> None:
    """Use validators to make pass/fail decisions."""
    reference = "Machine learning models require training data."
    candidate = "ML models need training data to learn patterns."

    context = ValidationContext(reference=reference)

    # BLEU validator with minimum threshold
    bleu_validator = bleu(min_score=0.3)
    result = bleu_validator.check(candidate, context)
    print(f"\nBLEU validation (min 0.3): {'PASS' if result.passed else 'FAIL'}")

    # ROUGE validator
    rouge_validator = rouge(min_score=0.5)
    result = rouge_validator.check(candidate, context)
    print(f"ROUGE validation (min 0.5): {'PASS' if result.passed else 'FAIL'}")


def composite_validator_example() -> None:
    """Combine validators with all_of and any_of."""
    reference = "The product launch exceeded all expectations."
    candidate = "The product release performed beyond expectations."

    context = ValidationContext(reference=reference)

    # All checks must pass
    strict_validator = all_of(
        [
            bleu(min_score=0.2),
            rouge(min_score=0.4),
            length(max_chars=100),
        ]
    )
    result = strict_validator.check(candidate, context)
    print(f"\nStrict (all_of): {'PASS' if result.passed else 'FAIL'}")
    if not result.passed:
        print(f" Failures: {result.failure_summary}")

    # At least one check must pass
    flexible_validator = any_of(
        [
            bleu(min_score=0.8),  # Unlikely to pass
            rouge(min_score=0.4),  # More likely
        ]
    )
    result = flexible_validator.check(candidate, context)
    print(f"Flexible (any_of): {'PASS' if result.passed else 'FAIL'}")


def constraint_validator_example() -> None:
    """Use constraint validators for text properties."""
    text = "This short guide explains the basics clearly."
    context = ValidationContext()  # No reference needed for constraints

    # Length constraints
    length_validator = length(min_chars=20, max_chars=100, min_words=5, max_words=20)
    result = length_validator.check(text, context)
    print(f"\nLength check: {'PASS' if result.passed else 'FAIL'}")

    # Readability (Flesch-Kincaid)
    readability_validator = readability(max_grade=10.0)
    result = readability_validator.check(text, context)
    print(f"Readability (grade <= 10): {'PASS' if result.passed else 'FAIL'}")

    # Content patterns
    contains_validator = contains(patterns=["guide", "basics"])
    result = contains_validator.check(text, context)
    print(f"Contains required terms: {'PASS' if result.passed else 'FAIL'}")

    excludes_validator = excludes(patterns=["error", "warning"])
    result = excludes_validator.check(text, context)
    print(f"Excludes forbidden terms: {'PASS' if result.passed else 'FAIL'}")


def main() -> None:
    """Run all examples."""
    print("=" * 60)
    print("Veritext Basic Validation Examples")
    print("=" * 60)

    metric_scoring_example()
    validator_example()
    composite_validator_example()
    constraint_validator_example()

    print("\n" + "=" * 60)
    print("All examples completed.")


if __name__ == "__main__":
    main()
160
examples/benchmark_regression.py
Normal file
@@ -0,0 +1,160 @@
"""Benchmark quality tracking with regression detection.

Demonstrates Veritext's benchmark module for CI integration:
- Creating a benchmark suite
- Running evaluations and storing results
- Checking for quality regression
- CI integration pattern with exit codes
"""

import tempfile
from pathlib import Path

from veritext.benchmark import Benchmark
from veritext.core.exceptions import RegressionDetectedError


def create_sample_data() -> tuple[list[str], list[str]]:
    """Create sample candidate/reference pairs for benchmarking."""
    # Simulated summarisation outputs and references
    candidates = [
        "The new policy aims to reduce carbon emissions by 50% by 2030.",
        "Scientists discovered a new species of deep-sea fish.",
        "The company reported record profits in the third quarter.",
        "Researchers developed a breakthrough treatment for the disease.",
        "The city plans to expand public transportation routes.",
    ]
    references = [
        "The policy targets a 50% reduction in carbon emissions by 2030.",
        "A new deep-sea fish species was discovered by marine biologists.",
        "Record profits were announced by the company for Q3.",
        "A breakthrough disease treatment was developed by researchers.",
        "Public transport expansion is planned for the city.",
    ]
    return candidates, references


def run_benchmark_example() -> None:
    """Run a benchmark evaluation and view results."""
    # Use a temp directory for this example
    with tempfile.TemporaryDirectory() as tmpdir:
        storage_path = Path(tmpdir) / "benchmarks"

        # Create benchmark suite
        bench = Benchmark("summariser_quality", storage_path=storage_path)

        candidates, references = create_sample_data()

        # Run evaluation
        print("Running benchmark evaluation...")
        run = bench.evaluate(
            candidates=candidates,
            references=references,
            metrics=["rouge_l", "bleu4"],
            metadata={"model": "v1.0", "dataset": "test"},
        )

        print("\nBenchmark run completed:")
        print(f" Run ID: {run.id[:8]}...")
        print(f" Samples: {run.sample_count}")
        print(" Metrics:")
        for name, value in run.metrics.items():
            print(f" {name}: {value:.4f}")


def regression_detection_example() -> None:
    """Demonstrate regression detection with historical comparison."""
    with tempfile.TemporaryDirectory() as tmpdir:
        storage_path = Path(tmpdir) / "benchmarks"
        bench = Benchmark("summariser_quality", storage_path=storage_path)

        candidates, references = create_sample_data()

        # Simulate historical runs with stable quality
        print("\nBuilding baseline with historical runs...")
        for i in range(5):
            bench.evaluate(
                candidates=candidates,
                references=references,
                metrics=["rouge_l", "bleu4"],
                metadata={"run": f"baseline_{i}"},
            )
            print(f" Baseline run {i + 1} recorded")

        # Check regression (no degradation expected)
        report = bench.check_regression(tolerance=0.05, window=5)
        print(f"\nRegression check: {'DETECTED' if report.detected else 'NONE'}")

        # Simulate a degraded model
        print("\nSimulating degraded model output...")
        degraded_candidates = [
            "Policy carbon emissions.",  # Much shorter/worse
            "Fish discovered.",
            "Company profits.",
            "Treatment developed.",
            "Transport expansion.",
        ]
        bench.evaluate(
            candidates=degraded_candidates,
            references=references,
            metrics=["rouge_l", "bleu4"],
            metadata={"model": "v1.1-broken"},
        )

        # Check regression (should detect)
        report = bench.check_regression(tolerance=0.05, window=5)
        print(f"Regression check: {'DETECTED' if report.detected else 'NONE'}")
        if report.detected:
            print("\nRegression details:")
            for metric, delta in report.deltas.items():
                baseline = report.baseline.get(metric, 0)
                current = report.current.get(metric, 0)
                print(f" {metric}: {baseline:.4f} -> {current:.4f} ({delta:+.4f})")


def ci_integration_example() -> None:
    """CI integration pattern using assert_no_regression()."""
    with tempfile.TemporaryDirectory() as tmpdir:
        storage_path = Path(tmpdir) / "benchmarks"
        bench = Benchmark("ci_check", storage_path=storage_path)

        candidates, references = create_sample_data()

        # Build baseline
        for _ in range(3):
            bench.evaluate(candidates, references, metrics=["rouge_l"])

        # Simulate CI check
        print("\n" + "=" * 50)
        print("CI Integration Example")
        print("=" * 50)

        print("\nRunning evaluation...")
        bench.evaluate(candidates, references, metrics=["rouge_l"])

        print("Checking for regression...")
        try:
            bench.assert_no_regression(tolerance=0.05, window=3)
            print("No regression detected.")
            print("CI status: EXIT 0")
        except RegressionDetectedError as e:
            print(f"Regression detected: {e}")
            print("CI status: EXIT 1")


def main() -> None:
    """Run all benchmark examples."""
    print("=" * 60)
    print("Veritext Benchmark & Regression Detection Examples")
    print("=" * 60)

    run_benchmark_example()
    regression_detection_example()
    ci_integration_example()

    print("\n" + "=" * 60)
    print("All examples completed.")


if __name__ == "__main__":
    main()
140
examples/chatbot_testing.py
Normal file
@@ -0,0 +1,140 @@
"""Pytest integration for chatbot testing.

Demonstrates Veritext's pytest plugin for testing chatbot responses:
- validate_text() assertion function
- Custom test fixtures
- Test organisation with markers
"""

import pytest

from veritext.pytest_plugin import validate_text

# Sample chatbot responses for testing
CHATBOT_RESPONSES = {
    "greeting": {
        "input": "Hello!",
        "response": "Hi there! How can I help you today?",
        "expected_keywords": ["help", "hi"],
    },
    "weather": {
        "input": "What's the weather like?",
        "response": "I don't have access to real-time weather data, but you can "
        "check a weather service like weather.com for current conditions.",
        "expected_keywords": ["weather", "check"],
    },
    "farewell": {
        "input": "Goodbye!",
        "response": "Goodbye! Have a great day!",
        "expected_keywords": ["goodbye", "day"],
    },
}


# Fixtures for common test setup
@pytest.fixture
def greeting_response() -> str:
    """Provide a sample greeting response."""
    return CHATBOT_RESPONSES["greeting"]["response"]


@pytest.fixture
def weather_response() -> str:
    """Provide a sample weather response."""
    return CHATBOT_RESPONSES["weather"]["response"]


# Basic validation tests
class TestResponseQuality:
    """Test chatbot response quality using Veritext."""

    def test_greeting_length(self, greeting_response: str) -> None:
        """Greeting responses should be concise."""
        validate_text(
            greeting_response,
            min_length=10,
            max_length=100,
        )

    def test_greeting_readability(self, greeting_response: str) -> None:
        """Greeting responses should be easy to read."""
        validate_text(
            greeting_response,
            max_reading_grade=8.0,
        )

    def test_greeting_contains_keywords(self, greeting_response: str) -> None:
        """Greeting should contain expected terms."""
        validate_text(
            greeting_response,
            must_contain=["help"],
        )

    def test_weather_response_quality(self, weather_response: str) -> None:
        """Weather response should be informative and readable."""
        validate_text(
            weather_response,
            min_length=50,
            max_length=500,
            max_reading_grade=10.0,
            must_contain=["weather"],
        )


# Tests with reference comparison
class TestResponseSimilarity:
    """Test response similarity against reference texts."""

    def test_greeting_similarity(self) -> None:
        """Greeting should match expected style."""
        reference = "Hello! How may I assist you today?"
        response = CHATBOT_RESPONSES["greeting"]["response"]

        validate_text(
            response,
            reference=reference,
            min_rouge=0.3,  # Allow variation in wording
            min_length=10,
        )

    def test_farewell_similarity(self) -> None:
        """Farewell should match expected style."""
        reference = "Goodbye! Have a wonderful day!"
        response = CHATBOT_RESPONSES["farewell"]["response"]

        validate_text(
            response,
            reference=reference,
            min_rouge=0.5,
            must_contain=["goodbye"],
        )


# Content safety tests
class TestContentSafety:
    """Test responses for inappropriate content."""

    @pytest.mark.parametrize("response_key", ["greeting", "weather", "farewell"])
    def test_no_profanity(self, response_key: str) -> None:
        """Responses should not contain profanity."""
        response = CHATBOT_RESPONSES[response_key]["response"]
        validate_text(
            response,
            must_exclude=["damn", "hell", "crap"],
            min_length=1,
        )

    @pytest.mark.parametrize("response_key", ["greeting", "weather", "farewell"])
    def test_no_harmful_content(self, response_key: str) -> None:
        """Responses should not contain harmful instructions."""
        response = CHATBOT_RESPONSES[response_key]["response"]
        validate_text(
            response,
            must_exclude=["hack", "exploit", "attack"],
            min_length=1,
        )


# Run tests when executed directly
if __name__ == "__main__":
    pytest.main([__file__, "-v"])
@@ -1,6 +1,6 @@
 [project]
 name = "veritext"
-version = "0.1.0-dev"
+version = "0.1.0.dev0"
 description = "Semantic text validation framework"
 readme = "readme.md"
 requires-python = ">=3.11"

386
readme.md
@@ -2,48 +2,398 @@

 Semantic text validation framework for Python.

-Validates text outputs against quality criteria using metrics like BLEU, ROUGE,
-and semantic similarity. Designed for developers building systems that produce
-text (chatbots, content generators, summarisation tools) who need automated
-quality assurance beyond simple string matching.
+Veritext validates text outputs against quality criteria using metrics like BLEU,
+ROUGE, and semantic similarity. Designed for developers building systems that produce
+text (chatbots, content generators, summarisation tools) who need automated quality
+assurance beyond simple string matching.

-## Status
+## Features

-Under active development. See [changelog.md](changelog.md) for progress.
+- **Multiple metrics** — BLEU, ROUGE, lexical similarity, readability, semantic
+  embeddings
+- **Composable validators** — Build complex checks from simple primitives
+- **Native pytest integration** — `validate_text()` assertion for test suites
+- **Quality benchmarking** — Track metrics over time with regression detection
+- **CLI tools** — Command-line validation and benchmark management

 ## Installation

 ```bash
 pip install veritext

-# With semantic similarity support
+# With semantic similarity support (sentence-transformers)
 pip install veritext[semantic]
 ```

 ## Quick Start

 ```python
-from veritext import validators as v
 from veritext.core.types import ValidationContext
+from veritext.validators import all_of, bleu, length, rouge

-# Create validators
-validator = v.all_of([
-    v.bleu(min_score=0.7),
-    v.length(max_chars=500),
+# Create a validator
+validator = all_of([
+    bleu(min_score=0.5),
+    rouge(min_score=0.6),
+    length(max_chars=500),
 ])

 # Validate text
-context = ValidationContext(reference="The cat sat on the mat.")
-result = validator.check("A cat is sitting on the mat.", context)
+context = ValidationContext(reference="The quick brown fox jumps over the lazy dog.")
+result = validator.check("A fast brown fox leaps over a sleepy dog.", context)

-if not result.passed:
+if result.passed:
+    print("Validation passed!")
+else:
     print(result.failure_summary)
 ```

-## Documentation
+## Metrics

-- [Project Plan](docs/project-plan.md)
-- [Implementation Plan](docs/implementation-plan.md)
+Veritext provides several metrics for text evaluation.

### BLEU

Measures n-gram precision against reference text. Useful for translation and
generation quality.

```python
from veritext.metrics import Bleu

bleu = Bleu()
result = bleu.score(
    candidate="The cat sat on the mat.",
    reference="A cat is sitting on the mat.",
)
print(f"BLEU-4: {result.bleu4:.3f}")  # Uses 1-4 gram precision
print(f"BLEU-1: {result.bleu1:.3f}")  # Unigram precision only
```

### ROUGE

Measures recall-oriented overlap with reference text. Useful for summarisation.

```python
from veritext.metrics import Rouge

rouge = Rouge()
result = rouge.score(
    candidate="Scientists found a new planet.",
    reference="Researchers discovered a new planet in the solar system.",
)
print(f"ROUGE-1 F1: {result.rouge1.fmeasure:.3f}")  # Unigram overlap
print(f"ROUGE-L F1: {result.rouge_l.fmeasure:.3f}")  # Longest common subsequence
```
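
To make the "longest common subsequence" comment concrete, here is a standalone sketch of how a ROUGE-L score can be derived from first principles. It assumes simple lowercased whitespace tokenisation and is not Veritext's implementation; `lcs_length()` and `rouge_l()` are illustrative names.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l(candidate: str, reference: str) -> tuple[float, float, float]:
    """Return (precision, recall, F1) based on the LCS of the two token lists."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0, 0.0, 0.0  # guard against empty input after tokenisation
    lcs = lcs_length(cand, ref)
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


print(rouge_l("the cat sat on the mat", "the cat is on the mat"))
```

Here the common subsequence is "the cat on the mat" (five of six tokens on each side), so precision, recall, and F1 all come out at 5/6.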

### Lexical Similarity

Measures token overlap using Jaccard similarity.

```python
from veritext.metrics import Lexical

lexical = Lexical()
result = lexical.score(
    candidate="The quick brown fox",
    reference="The fast brown fox",
)
print(f"Jaccard: {result.jaccard:.3f}")
print(f"Token overlap: {result.token_overlap:.3f}")
```
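
For reference, Jaccard similarity is the size of the intersection of the two token sets divided by the size of their union. The sketch below computes it by hand, assuming lowercased whitespace tokens; the library's tokeniser may normalise text differently, so its numbers need not match.

```python
def jaccard(candidate: str, reference: str) -> float:
    """Jaccard similarity over lowercased whitespace-separated token sets."""
    cand = set(candidate.lower().split())
    ref = set(reference.lower().split())
    union = cand | ref
    if not union:
        return 0.0  # both texts empty: define similarity as zero
    return len(cand & ref) / len(union)


# Shared tokens: the, brown, fox (3). Union adds quick and fast (5 in total).
print(jaccard("The quick brown fox", "The fast brown fox"))
```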

### Readability

Computes Flesch-Kincaid scores for text complexity.

```python
from veritext.metrics import Readability

readability = Readability()
result = readability.score("This is a simple sentence.")
print(f"Grade level: {result.flesch_kincaid_grade:.1f}")
print(f"Reading ease: {result.flesch_reading_ease:.1f}")
```
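
The two scores follow the standard published formulas, which depend only on words per sentence and syllables per word. The sketch below applies them with a deliberately crude syllable heuristic (counting vowel groups), so its output is an approximation and will not necessarily match the library; `count_syllables()` and `flesch_scores()` are illustrative names.

```python
import re


def count_syllables(word: str) -> int:
    """Crude heuristic: one syllable per run of vowels, and at least one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_scores(text: str) -> tuple[float, float]:
    """Return (Flesch-Kincaid grade level, Flesch reading ease)."""
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0, 0.0
    # Treat text without sentence-ending punctuation as a single sentence,
    # which avoids dividing by zero.
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    syllables = sum(count_syllables(word) for word in words)
    words_per_sentence = len(words) / sentences
    syllables_per_word = syllables / len(words)
    grade = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    ease = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    return grade, ease


grade, ease = flesch_scores("This is a simple sentence.")
print(f"Grade level: {grade:.1f}")
print(f"Reading ease: {ease:.1f}")
```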
|
||||
|
||||
### Semantic Similarity (Optional)
|
||||
|
||||
Requires `pip install veritext[semantic]`.
|
||||
|
||||
```python
|
||||
from veritext.semantic import SemanticSimilarity
|
||||
|
||||
semantic = SemanticSimilarity()
|
||||
result = semantic.score(
|
||||
candidate="The dog is running in the park.",
|
||||
reference="A canine is jogging through the garden.",
|
||||
)
|
||||
print(f"Similarity: {result.score:.3f}")
|
||||
```

## Validators

Validators wrap metrics with thresholds to make pass/fail decisions.

### Metric-Based Validators

```python
from veritext.core.types import ValidationContext
from veritext.validators import bleu, lexical, rouge

context = ValidationContext(reference="Reference text here.")

# BLEU validation
validator = bleu(min_score=0.5, variant=4)  # BLEU-4
result = validator.check("Candidate text here.", context)

# ROUGE validation
validator = rouge(min_score=0.6, variant="l")  # ROUGE-L
result = validator.check("Candidate text here.", context)

# Lexical validation
validator = lexical(min_jaccard=0.3, min_overlap=0.5)
result = validator.check("Candidate text here.", context)
```
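
The "metric plus threshold" idea can be sketched independently of the library. Everything below is hypothetical and for illustration only: `ThresholdOutcome`, `threshold_check`, and the toy `exact_match` metric are not part of Veritext's API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ThresholdOutcome:
    """Hypothetical result type: the raw score plus the pass/fail decision."""

    passed: bool
    score: float
    threshold: float


def threshold_check(
    metric: Callable[[str, str], float], min_score: float
) -> Callable[[str, str], ThresholdOutcome]:
    """Wrap a metric so it yields a pass/fail decision instead of a bare score."""
    if not 0.0 <= min_score <= 1.0:
        raise ValueError("min_score must be between 0.0 and 1.0")

    def check(candidate: str, reference: str) -> ThresholdOutcome:
        score = metric(candidate, reference)
        return ThresholdOutcome(passed=score >= min_score, score=score, threshold=min_score)

    return check


def exact_match(candidate: str, reference: str) -> float:
    """Toy metric: 1.0 when the texts are identical, otherwise 0.0."""
    return 1.0 if candidate == reference else 0.0


check = threshold_check(exact_match, min_score=0.5)
print(check("same text", "same text").passed)
print(check("one text", "another text").passed)
```

Keeping the score alongside the decision makes a failure easier to diagnose than a bare boolean would.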

### Constraint Validators

Constraint validators inspect the candidate text alone, so they don't require reference text.

```python
from veritext.core.types import ValidationContext
from veritext.validators import contains, excludes, length, readability

context = ValidationContext()  # No reference needed

# Length constraints
validator = length(min_chars=50, max_chars=500, min_words=10)
result = validator.check("Your text here...", context)

# Readability constraints
validator = readability(max_grade=8.0, min_ease=60.0)
result = validator.check("Your text here...", context)

# Content requirements
validator = contains(patterns=["important", "keyword"])
result = validator.check("This important text has a keyword.", context)

# Content exclusions
validator = excludes(patterns=["forbidden", "banned"])
result = validator.check("This text is clean.", context)
```
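
The `contains` and `excludes` patterns are treated as regular expressions, matched case-insensitively by default, and this changeset makes an invalid pattern fail at construction time rather than during `check()`. A stand-alone sketch of that behaviour follows; the helper names are hypothetical, and it raises `ValueError` where the library raises `InvalidThresholdError`.

```python
import re


def compile_patterns(patterns: list[str], case_sensitive: bool = False) -> list[re.Pattern[str]]:
    """Compile every pattern up front so a bad regex fails immediately."""
    if not patterns:
        raise ValueError("patterns list cannot be empty")
    flags = 0 if case_sensitive else re.IGNORECASE
    compiled = []
    for pattern in patterns:
        try:
            compiled.append(re.compile(pattern, flags))
        except re.error as exc:
            raise ValueError(f"Invalid regex pattern '{pattern}': {exc}") from exc
    return compiled


def missing_patterns(text: str, patterns: list[str]) -> list[str]:
    """Patterns that match nowhere in the text (a 'contains' failure list)."""
    compiled = compile_patterns(patterns)
    return [p for p, rx in zip(patterns, compiled) if not rx.search(text)]


def found_patterns(text: str, patterns: list[str]) -> list[str]:
    """Patterns that match somewhere in the text (an 'excludes' failure list)."""
    compiled = compile_patterns(patterns)
    return [p for p, rx in zip(patterns, compiled) if rx.search(text)]


print(missing_patterns("This important text has a keyword.", ["important", "keyword"]))
print(found_patterns("This text is clean.", ["forbidden", "banned"]))
```

Both example calls return an empty list, meaning the corresponding check would pass.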

### Composite Validators

Combine multiple checks with logical operators.

```python
from veritext.validators import all_of, any_of, bleu, length, rouge

# All checks must pass
validator = all_of([
    bleu(min_score=0.5),
    rouge(min_score=0.6),
    length(max_chars=500),
])

# At least one check must pass
validator = any_of([
    bleu(min_score=0.7),
    rouge(min_score=0.7),
])
```
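
The combining logic itself is small. The sketch below reduces a check to a plain function from text to `bool`, which is a simplification: the library's composites return a `ValidationResult` that exposes each child result. The `shorter_than` and `mentions` helpers are hypothetical examples.

```python
from typing import Callable

# Hypothetical simplified check: takes the text, returns True on pass.
Check = Callable[[str], bool]


def all_of(checks: list[Check]) -> Check:
    """Build a check that passes only when every child check passes."""
    if not checks:
        raise ValueError("checks list cannot be empty")
    children = list(checks)  # copy, so later changes to the caller's list have no effect
    return lambda text: all(check(text) for check in children)


def any_of(checks: list[Check]) -> Check:
    """Build a check that passes when at least one child check passes."""
    if not checks:
        raise ValueError("checks list cannot be empty")
    children = list(checks)
    return lambda text: any(check(text) for check in children)


def shorter_than(limit: int) -> Check:
    return lambda text: len(text) <= limit


def mentions(word: str) -> Check:
    return lambda text: word.lower() in text.lower()


strict = all_of([shorter_than(40), mentions("planet")])
lenient = any_of([shorter_than(10), mentions("star")])
print(strict("A new planet was found."))
print(lenient("A new star was found."))
```

Copying the list on construction mirrors the aliasing fix in this changeset: mutating the original list afterwards does not alter the composite.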

## Pytest Plugin

Veritext provides native pytest integration for testing text quality.

### Basic Usage

```python
from veritext.pytest_plugin import validate_text


def test_response_quality():
    response = "This is a helpful response to your question."

    validate_text(
        response,
        min_length=20,
        max_length=200,
        max_reading_grade=10.0,
        must_contain=["helpful"],
        must_exclude=["error", "sorry"],
    )


def test_summary_similarity():
    summary = "Scientists discovered a new planet."
    reference = "Researchers found a new planet in our solar system."

    validate_text(
        summary,
        reference=reference,
        min_rouge=0.5,
        min_length=10,
    )
```

### Available Parameters

| Parameter | Description |
|-----------|-------------|
| `reference` | Reference text for comparison metrics |
| `min_bleu` | Minimum BLEU-4 score (0.0-1.0) |
| `min_rouge` | Minimum ROUGE-L F1 score (0.0-1.0) |
| `min_semantic` | Minimum semantic similarity (0.0-1.0) |
| `min_length` | Minimum character count |
| `max_length` | Maximum character count |
| `max_reading_grade` | Maximum Flesch-Kincaid grade level |
| `must_contain` | List of required patterns |
| `must_exclude` | List of forbidden patterns |

## Benchmarking

Track text quality over time and detect regressions.

### Running Benchmarks

```python
from veritext.benchmark import Benchmark

# Create a benchmark suite
bench = Benchmark("summariser_quality", storage_path="benchmarks/")

# Evaluate a batch of outputs
candidates = ["Summary 1...", "Summary 2...", "Summary 3..."]
references = ["Reference 1...", "Reference 2...", "Reference 3..."]

run = bench.evaluate(
    candidates=candidates,
    references=references,
    metrics=["rouge_l", "bleu4"],
    metadata={"model": "v1.2", "git_sha": "abc123"},
)

print(f"ROUGE-L: {run.metrics['rouge_l']:.4f}")
print(f"BLEU-4: {run.metrics['bleu4']:.4f}")
```

### Regression Detection

```python
from veritext.benchmark import Benchmark
from veritext.core.exceptions import RegressionDetectedError

bench = Benchmark("summariser_quality")

# Check for regression against historical baseline
report = bench.check_regression(tolerance=0.05, window=10)
if report.detected:
    print("Quality regression detected!")
    for metric, delta in report.deltas.items():
        print(f"  {metric}: {delta:+.4f}")

# Or raise an exception for CI integration
try:
    bench.assert_no_regression(tolerance=0.05)
except RegressionDetectedError as e:
    print(f"CI failure: {e}")
    raise SystemExit(1)
```
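
One plausible reading of `tolerance` and `window` is sketched below: the baseline is the mean of up to `window` earlier runs, and a regression is flagged when the latest score falls more than `tolerance` below it. This baseline rule is an assumption made for illustration, not a description of how `check_regression` is implemented, and `detect_regression` is a hypothetical helper.

```python
from statistics import mean


def detect_regression(
    history: list[float], tolerance: float = 0.05, window: int = 10
) -> tuple[bool, float]:
    """Compare the latest score with the mean of up to `window` earlier scores.

    Returns (detected, delta), where delta is latest minus baseline.
    """
    if len(history) < 2:
        return False, 0.0  # nothing to compare against yet
    latest = history[-1]
    baseline = mean(history[-(window + 1):-1])
    delta = latest - baseline
    return delta < -tolerance, delta


detected, delta = detect_regression([0.80, 0.82, 0.81, 0.70])
print(f"detected={detected} delta={delta:+.4f}")
```

Averaging over a window rather than comparing with the single previous run makes the check less sensitive to one unusually good or bad result.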

### Viewing History

```python
from veritext.benchmark import Benchmark

bench = Benchmark("summariser_quality")

for run in bench.get_history(limit=10):
    print(f"{run.timestamp}: rouge_l={run.metrics.get('rouge_l', 0):.4f}")
```

## CLI

Veritext provides a command-line interface for validation and benchmarking.

### Validate Text

```bash
# Inline validation
veritext validate "Candidate text" -r "Reference text" -m bleu,rouge

# File-based batch validation (JSONL with "candidate" and "reference" fields)
veritext validate -f outputs.jsonl -m bleu,rouge,lexical

# With threshold for pass/fail
veritext validate "Text" -r "Reference" -m bleu -t 0.5 -o simple

# Output formats: table (default), json, simple
veritext validate "Text" -r "Reference" -m bleu -o json
```

### Benchmark Commands

```bash
# Run a benchmark evaluation
veritext benchmark run my_bench -f data.jsonl -m rouge_l,bleu4

# View benchmark history
veritext benchmark show my_bench --last 10

# Check for regression (exits with code 1 if detected)
veritext benchmark check my_bench --tolerance 0.05 --window 10
```

### JSONL Format

For file-based operations, use JSONL: one JSON object per line, each with `candidate` and `reference` fields.

```json
{"candidate": "Model output 1", "reference": "Expected output 1"}
{"candidate": "Model output 2", "reference": "Expected output 2"}
```
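
A file in this shape can be produced or inspected with the standard library alone. The sketch below parses JSONL records into the parallel lists that batch evaluation expects; `parse_jsonl` is a hypothetical helper, not part of Veritext.

```python
import json
from typing import Iterable


def parse_jsonl(lines: Iterable[str]) -> tuple[list[str], list[str]]:
    """Split JSONL records into parallel candidate and reference lists."""
    candidates: list[str] = []
    references: list[str] = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        for field in ("candidate", "reference"):
            if field not in record:
                raise ValueError(f"line {number}: missing field '{field}'")
        candidates.append(record["candidate"])
        references.append(record["reference"])
    return candidates, references


sample = [
    '{"candidate": "Model output 1", "reference": "Expected output 1"}',
    '{"candidate": "Model output 2", "reference": "Expected output 2"}',
]
candidates, references = parse_jsonl(sample)
print(candidates)
print(references)
```

An open file object can be passed in place of `sample`, since iterating over a text file yields one line at a time.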

## Configuration

Veritext uses environment variables for configuration:

| Variable | Default | Description |
|----------|---------|-------------|
| `VERITEXT_LOG_LEVEL` | `INFO` | Logging level |
| `VERITEXT_LOG_FORMAT` | `console` | Log format (`console` or `json`) |
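
The lookup-with-default behaviour in the table can be illustrated with the standard library. Veritext itself loads settings through pydantic-settings, so the function below is only a stand-in sketch; `read_log_settings` is a hypothetical name and the validation it performs is an assumption.

```python
import os
from typing import Mapping, Optional


def read_log_settings(environ: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Read the two logging variables, falling back to the documented defaults."""
    env = os.environ if environ is None else environ
    log_level = env.get("VERITEXT_LOG_LEVEL", "INFO")
    log_format = env.get("VERITEXT_LOG_FORMAT", "console")
    if log_format not in ("console", "json"):
        raise ValueError(f"unsupported log format: {log_format!r}")
    return {"log_level": log_level, "log_format": log_format}


# Pass an explicit mapping to show an override without touching the real environment.
print(read_log_settings({"VERITEXT_LOG_LEVEL": "DEBUG"}))
```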

## Development

### Setup

```bash
git clone https://gitea.kschappell.com/kschappell/veritext.git
cd veritext
uv sync --all-extras
```

### Quality Checks

```bash
# Linting
uv run ruff check .

# Formatting
uv run ruff format --check .

# Type checking
uv run mypy src/

# Tests
uv run pytest
```

### Running Examples

```bash
uv run python examples/basic_validation.py
uv run pytest examples/chatbot_testing.py -v
uv run python examples/benchmark_regression.py
```

## Licence

@@ -11,11 +11,91 @@ from veritext.metrics.bleu import Bleu
|
||||
from veritext.metrics.lexical import Lexical
|
||||
from veritext.metrics.rouge import Rouge
|
||||
|
||||
# Available metrics mapped to their computation functions
|
||||
# Available metrics
|
||||
AVAILABLE_METRICS = frozenset(
|
||||
{"bleu", "bleu1", "bleu2", "bleu3", "bleu4", "rouge", "rouge_l", "lexical"}
|
||||
)
|
||||
|
||||
# Lazily-initialised metric instances
|
||||
_bleu: Bleu | None = None
|
||||
_rouge: Rouge | None = None
|
||||
_lexical: Lexical | None = None
|
||||
|
||||
|
||||
def _get_bleu() -> Bleu:
|
||||
"""Get or create the BLEU metric instance."""
|
||||
global _bleu
|
||||
if _bleu is None:
|
||||
_bleu = Bleu()
|
||||
return _bleu
|
||||
|
||||
|
||||
def _get_rouge() -> Rouge:
|
||||
"""Get or create the ROUGE metric instance."""
|
||||
global _rouge
|
||||
if _rouge is None:
|
||||
_rouge = Rouge()
|
||||
return _rouge
|
||||
|
||||
|
||||
def _get_lexical() -> Lexical:
|
||||
"""Get or create the lexical metric instance."""
|
||||
global _lexical
|
||||
if _lexical is None:
|
||||
_lexical = Lexical()
|
||||
return _lexical
|
||||
|
||||
|
||||
# Metric registry: maps metric names to (result_keys, single_extractor, batch_extractor)
|
||||
# - result_keys: output keys to populate
|
||||
# - single_extractor: function(candidate, reference) -> dict of results
|
||||
# - batch_extractor: function(candidates, references) -> dict of results
|
||||
def _bleu_single(candidate: str, reference: str, key: str) -> dict[str, float]:
|
||||
"""Extract a BLEU score for single mode."""
|
||||
result = _get_bleu().score(candidate, reference)
|
||||
return {key: getattr(result, key)}
|
||||
|
||||
|
||||
def _bleu_batch(
|
||||
candidates: list[str], references: list[str], key: str
|
||||
) -> dict[str, float]:
|
||||
"""Extract a BLEU score for batch mode."""
|
||||
batch = _get_bleu().batch_score(candidates, references)
|
||||
stats = batch.stats.get(key)
|
||||
return {key: stats.mean} if stats else {}
|
||||
|
||||
|
||||
def _rouge_single(candidate: str, reference: str) -> dict[str, float]:
|
||||
"""Extract ROUGE-L F-measure for single mode."""
|
||||
result = _get_rouge().score(candidate, reference)
|
||||
return {"rouge_l": result.rouge_l.fmeasure}
|
||||
|
||||
|
||||
def _rouge_batch(candidates: list[str], references: list[str]) -> dict[str, float]:
|
||||
"""Extract ROUGE-L F-measure for batch mode."""
|
||||
batch = _get_rouge().batch_score(candidates, references)
|
||||
stats = batch.stats.get("rouge_l_fmeasure")
|
||||
return {"rouge_l": stats.mean} if stats else {}
|
||||
|
||||
|
||||
def _lexical_single(candidate: str, reference: str) -> dict[str, float]:
|
||||
"""Extract lexical scores for single mode."""
|
||||
result = _get_lexical().score(candidate, reference)
|
||||
return {"jaccard": result.jaccard, "token_overlap": result.token_overlap}
|
||||
|
||||
|
||||
def _lexical_batch(candidates: list[str], references: list[str]) -> dict[str, float]:
|
||||
"""Extract lexical scores for batch mode."""
|
||||
batch = _get_lexical().batch_score(candidates, references)
|
||||
results: dict[str, float] = {}
|
||||
jaccard_stats = batch.stats.get("jaccard")
|
||||
overlap_stats = batch.stats.get("token_overlap")
|
||||
if jaccard_stats:
|
||||
results["jaccard"] = jaccard_stats.mean
|
||||
if overlap_stats:
|
||||
results["token_overlap"] = overlap_stats.mean
|
||||
return results
|
||||
|
||||
|
||||
def _compute_metrics(
|
||||
candidate: str,
|
||||
@@ -24,30 +104,16 @@ def _compute_metrics(
|
||||
) -> dict[str, float]:
|
||||
"""Compute requested metrics for a single text pair."""
|
||||
results: dict[str, float] = {}
|
||||
bleu = Bleu()
|
||||
rouge = Rouge()
|
||||
lexical = Lexical()
|
||||
|
||||
for metric in metric_names:
|
||||
if metric == "bleu" or metric == "bleu4":
|
||||
bleu_result = bleu.score(candidate, reference)
|
||||
results["bleu4"] = bleu_result.bleu4
|
||||
elif metric == "bleu1":
|
||||
bleu_result = bleu.score(candidate, reference)
|
||||
results["bleu1"] = bleu_result.bleu1
|
||||
elif metric == "bleu2":
|
||||
bleu_result = bleu.score(candidate, reference)
|
||||
results["bleu2"] = bleu_result.bleu2
|
||||
elif metric == "bleu3":
|
||||
bleu_result = bleu.score(candidate, reference)
|
||||
results["bleu3"] = bleu_result.bleu3
|
||||
elif metric == "rouge" or metric == "rouge_l":
|
||||
rouge_result = rouge.score(candidate, reference)
|
||||
results["rouge_l"] = rouge_result.rouge_l.fmeasure
|
||||
if metric in ("bleu", "bleu4"):
|
||||
results.update(_bleu_single(candidate, reference, "bleu4"))
|
||||
elif metric in ("bleu1", "bleu2", "bleu3"):
|
||||
results.update(_bleu_single(candidate, reference, metric))
|
||||
elif metric in ("rouge", "rouge_l"):
|
||||
results.update(_rouge_single(candidate, reference))
|
||||
elif metric == "lexical":
|
||||
lexical_result = lexical.score(candidate, reference)
|
||||
results["jaccard"] = lexical_result.jaccard
|
||||
results["token_overlap"] = lexical_result.token_overlap
|
||||
results.update(_lexical_single(candidate, reference))
|
||||
|
||||
return results
|
||||
|
||||
@@ -58,46 +124,17 @@ def _compute_batch_metrics(
|
||||
metric_names: list[str],
|
||||
) -> dict[str, float]:
|
||||
"""Compute average metrics for a batch of text pairs."""
|
||||
bleu = Bleu()
|
||||
rouge = Rouge()
|
||||
lexical = Lexical()
|
||||
|
||||
results: dict[str, float] = {}
|
||||
|
||||
for metric in metric_names:
|
||||
if metric == "bleu" or metric == "bleu4":
|
||||
bleu_batch = bleu.batch_score(candidates, references)
|
||||
stats = bleu_batch.stats.get("bleu4")
|
||||
if stats:
|
||||
results["bleu4"] = stats.mean
|
||||
elif metric == "bleu1":
|
||||
bleu_batch = bleu.batch_score(candidates, references)
|
||||
stats = bleu_batch.stats.get("bleu1")
|
||||
if stats:
|
||||
results["bleu1"] = stats.mean
|
||||
elif metric == "bleu2":
|
||||
bleu_batch = bleu.batch_score(candidates, references)
|
||||
stats = bleu_batch.stats.get("bleu2")
|
||||
if stats:
|
||||
results["bleu2"] = stats.mean
|
||||
elif metric == "bleu3":
|
||||
bleu_batch = bleu.batch_score(candidates, references)
|
||||
stats = bleu_batch.stats.get("bleu3")
|
||||
if stats:
|
||||
results["bleu3"] = stats.mean
|
||||
elif metric == "rouge" or metric == "rouge_l":
|
||||
rouge_batch = rouge.batch_score(candidates, references)
|
||||
stats = rouge_batch.stats.get("rouge_l_fmeasure")
|
||||
if stats:
|
||||
results["rouge_l"] = stats.mean
|
||||
if metric in ("bleu", "bleu4"):
|
||||
results.update(_bleu_batch(candidates, references, "bleu4"))
|
||||
elif metric in ("bleu1", "bleu2", "bleu3"):
|
||||
results.update(_bleu_batch(candidates, references, metric))
|
||||
elif metric in ("rouge", "rouge_l"):
|
||||
results.update(_rouge_batch(candidates, references))
|
||||
elif metric == "lexical":
|
||||
lexical_batch = lexical.batch_score(candidates, references)
|
||||
jaccard_stats = lexical_batch.stats.get("jaccard")
|
||||
overlap_stats = lexical_batch.stats.get("token_overlap")
|
||||
if jaccard_stats:
|
||||
results["jaccard"] = jaccard_stats.mean
|
||||
if overlap_stats:
|
||||
results["token_overlap"] = overlap_stats.mean
|
||||
results.update(_lexical_batch(candidates, references))
|
||||
|
||||
return results
|
||||
|
||||
|
||||
@@ -1,5 +1,6 @@
|
||||
"""Configuration management using pydantic-settings."""
|
||||
|
||||
from functools import lru_cache
|
||||
from pathlib import Path
|
||||
from typing import Literal
|
||||
|
||||
@@ -54,6 +55,7 @@ class VeritextSettings(BaseSettings):
|
||||
)
|
||||
|
||||
|
||||
@lru_cache
|
||||
def get_settings() -> VeritextSettings:
|
||||
"""Get the current settings instance."""
|
||||
"""Get the cached settings instance."""
|
||||
return VeritextSettings()
|
||||
|
||||
@@ -137,8 +137,8 @@ class Readability:
|
||||
flesch_reading_ease=0.0,
|
||||
)
|
||||
|
||||
# Count sentences
|
||||
sentence_count = _count_sentences(candidate)
|
||||
# Count sentences (ensure at least 1 to avoid division by zero)
|
||||
sentence_count = max(_count_sentences(candidate), 1)
|
||||
|
||||
# Count syllables
|
||||
syllable_count = sum(_count_syllables(word) for word in words)
|
||||
|
||||
@@ -40,6 +40,11 @@ class LexicalResult(BaseModel):
|
||||
token_overlap: float
|
||||
"""Proportion of candidate tokens found in reference."""
|
||||
|
||||
@property
|
||||
def score(self) -> float:
|
||||
"""Return Jaccard similarity as the primary score."""
|
||||
return self.jaccard
|
||||
|
||||
|
||||
class RougeScore(BaseModel):
|
||||
"""Individual ROUGE variant score with precision, recall, F-measure."""
|
||||
|
||||
@@ -107,9 +107,6 @@ def _compute_rouge_l(
|
||||
Returns:
|
||||
RougeScore with precision, recall, and F-measure.
|
||||
"""
|
||||
if not candidate_tokens and not reference_tokens:
|
||||
return RougeScore(precision=0.0, recall=0.0, fmeasure=0.0)
|
||||
|
||||
if not candidate_tokens or not reference_tokens:
|
||||
return RougeScore(precision=0.0, recall=0.0, fmeasure=0.0)
|
||||
|
||||
@@ -209,6 +206,10 @@ class Rouge:
|
||||
rouge2_scores.append(_compute_rouge_score(candidate_tokens, ref_tokens, 2))
|
||||
rouge_l_scores.append(_compute_rouge_l(candidate_tokens, ref_tokens))
|
||||
|
||||
# All references were empty after tokenisation
|
||||
if not rouge1_scores:
|
||||
raise ValueError("Reference text cannot be empty")
|
||||
|
||||
return RougeResult(
|
||||
rouge1=_max_rouge_scores(rouge1_scores),
|
||||
rouge2=_max_rouge_scores(rouge2_scores),
|
||||
|
||||
@@ -1,11 +1,15 @@
|
||||
"""Embedding-based semantic similarity using sentence-transformers."""
|
||||
|
||||
from collections import OrderedDict
|
||||
from typing import Any
|
||||
|
||||
from veritext.core.exceptions import DependencyError
|
||||
from veritext.metrics.base import AggregateStats, BatchResult
|
||||
from veritext.metrics.results import SemanticResult
|
||||
|
||||
# Default maximum cache size (number of embeddings to store)
|
||||
DEFAULT_CACHE_MAX_SIZE = 1000
|
||||
|
||||
|
||||
class SemanticSimilarity:
|
||||
"""
|
||||
@@ -21,6 +25,7 @@ class SemanticSimilarity:
|
||||
self,
|
||||
model: str = "all-MiniLM-L6-v2",
|
||||
cache_embeddings: bool = True,
|
||||
cache_max_size: int = DEFAULT_CACHE_MAX_SIZE,
|
||||
) -> None:
|
||||
"""
|
||||
Initialise the semantic similarity metric.
|
||||
@@ -30,6 +35,8 @@ class SemanticSimilarity:
|
||||
Defaults to "all-MiniLM-L6-v2" (22MB, good quality/size tradeoff).
|
||||
cache_embeddings: Whether to cache embeddings for repeated texts.
|
||||
Defaults to True.
|
||||
cache_max_size: Maximum number of embeddings to cache. Oldest entries
|
||||
are evicted when the limit is reached. Defaults to 1000.
|
||||
|
||||
Raises:
|
||||
DependencyError: If sentence-transformers is not installed.
|
||||
@@ -44,7 +51,10 @@ class SemanticSimilarity:
|
||||
|
||||
self._model_name = model
|
||||
self._model: Any = SentenceTransformer(model)
|
||||
self._cache: dict[str, Any] | None = {} if cache_embeddings else None
|
||||
self._cache: OrderedDict[str, Any] | None = (
|
||||
OrderedDict() if cache_embeddings else None
|
||||
)
|
||||
self._cache_max_size = cache_max_size
|
||||
|
||||
@property
|
||||
def name(self) -> str:
|
||||
@@ -58,7 +68,7 @@ class SemanticSimilarity:
|
||||
|
||||
def _get_embedding(self, text: str) -> Any:
|
||||
"""
|
||||
Get embedding for text, using cache if available.
|
||||
Get embedding for text, using LRU cache if available.
|
||||
|
||||
Args:
|
||||
text: The text to embed.
|
||||
@@ -67,11 +77,16 @@ class SemanticSimilarity:
|
||||
The embedding tensor.
|
||||
"""
|
||||
if self._cache is not None and text in self._cache:
|
||||
# Move to end to mark as recently used
|
||||
self._cache.move_to_end(text)
|
||||
return self._cache[text]
|
||||
|
||||
embedding = self._model.encode(text, convert_to_tensor=True)
|
||||
|
||||
if self._cache is not None:
|
||||
# Evict oldest entries if cache is full
|
||||
while len(self._cache) >= self._cache_max_size:
|
||||
self._cache.popitem(last=False)
|
||||
self._cache[text] = embedding
|
||||
|
||||
return embedding
|
||||
|
||||
@@ -1,11 +1,20 @@
|
||||
"""Composite validators for combining multiple checks."""
|
||||
"""Composite validators for combining multiple checks.
|
||||
|
||||
Note: CompositeCheck classes (AllOf, AnyOf) intentionally return ValidationResult
|
||||
rather than CheckResult. This allows callers to inspect individual check results
|
||||
for detailed error reporting. They implement a compatible interface but are not
|
||||
substitutable where Check is expected as a type constraint.
|
||||
"""
|
||||
|
||||
from veritext.core.types import CheckResult, ValidationContext, ValidationResult
|
||||
from veritext.validators.base import Check
|
||||
|
||||
|
||||
class AllOf:
|
||||
"""Passes only if all checks pass."""
|
||||
"""Passes only if all checks pass.
|
||||
|
||||
Note: Returns ValidationResult (not CheckResult) to expose child results.
|
||||
"""
|
||||
|
||||
def __init__(self, checks: list[Check]) -> None:
|
||||
"""
|
||||
@@ -20,7 +29,7 @@ class AllOf:
|
||||
if not checks:
|
||||
raise ValueError("checks list cannot be empty")
|
||||
|
||||
self._checks = checks
|
||||
self._checks = list(checks)
|
||||
|
||||
@property
|
||||
def name(self) -> str:
|
||||
@@ -48,7 +57,10 @@ class AllOf:
|
||||
|
||||
|
||||
class AnyOf:
|
||||
"""Passes if any check passes."""
|
||||
"""Passes if any check passes.
|
||||
|
||||
Note: Returns ValidationResult (not CheckResult) to expose child results.
|
||||
"""
|
||||
|
||||
def __init__(self, checks: list[Check]) -> None:
|
||||
"""
|
||||
@@ -63,7 +75,7 @@ class AnyOf:
|
||||
if not checks:
|
||||
raise ValueError("checks list cannot be empty")
|
||||
|
||||
self._checks = checks
|
||||
self._checks = list(checks)
|
||||
|
||||
@property
|
||||
def name(self) -> str:
|
||||
|
||||
@@ -229,7 +229,7 @@ class ContainsValidator:
|
||||
case_sensitive: Whether matching is case-sensitive. Defaults to False.
|
||||
|
||||
Raises:
|
||||
InvalidThresholdError: If patterns list is empty.
|
||||
InvalidThresholdError: If patterns list is empty or contains invalid regex.
|
||||
"""
|
||||
if not patterns:
|
||||
raise InvalidThresholdError("patterns list cannot be empty")
|
||||
@@ -238,6 +238,15 @@ class ContainsValidator:
|
||||
self._case_sensitive = case_sensitive
|
||||
self._flags = 0 if case_sensitive else re.IGNORECASE
|
||||
|
||||
self._compiled_patterns: list[re.Pattern[str]] = []
|
||||
for pattern in patterns:
|
||||
try:
|
||||
self._compiled_patterns.append(re.compile(pattern, self._flags))
|
||||
except re.error as e:
|
||||
raise InvalidThresholdError(
|
||||
f"Invalid regex pattern '{pattern}': {e}"
|
||||
) from e
|
||||
|
||||
@property
|
||||
def name(self) -> str:
|
||||
"""Return the name of this check."""
|
||||
@@ -255,8 +264,10 @@ class ContainsValidator:
|
||||
CheckResult with pass/fail status.
|
||||
"""
|
||||
missing = []
|
||||
for pattern in self._patterns:
|
||||
if not re.search(pattern, text, self._flags):
|
||||
for pattern, compiled in zip(
|
||||
self._patterns, self._compiled_patterns, strict=True
|
||||
):
|
||||
if not compiled.search(text):
|
||||
missing.append(pattern)
|
||||
|
||||
passed = len(missing) == 0
|
||||
@@ -291,7 +302,7 @@ class ExcludesValidator:
|
||||
case_sensitive: Whether matching is case-sensitive. Defaults to False.
|
||||
|
||||
Raises:
|
||||
InvalidThresholdError: If patterns list is empty.
|
||||
InvalidThresholdError: If patterns list is empty or contains invalid regex.
|
||||
"""
|
||||
if not patterns:
|
||||
raise InvalidThresholdError("patterns list cannot be empty")
|
||||
@@ -300,6 +311,15 @@ class ExcludesValidator:
|
||||
self._case_sensitive = case_sensitive
|
||||
self._flags = 0 if case_sensitive else re.IGNORECASE
|
||||
|
||||
self._compiled_patterns: list[re.Pattern[str]] = []
|
||||
for pattern in patterns:
|
||||
try:
|
||||
self._compiled_patterns.append(re.compile(pattern, self._flags))
|
||||
except re.error as e:
|
||||
raise InvalidThresholdError(
|
||||
f"Invalid regex pattern '{pattern}': {e}"
|
||||
) from e
|
||||
|
||||
@property
|
||||
def name(self) -> str:
|
||||
"""Return the name of this check."""
|
||||
@@ -317,8 +337,10 @@ class ExcludesValidator:
|
||||
CheckResult with pass/fail status.
|
||||
"""
|
||||
found = []
|
||||
for pattern in self._patterns:
|
||||
if re.search(pattern, text, self._flags):
|
||||
for pattern, compiled in zip(
|
||||
self._patterns, self._compiled_patterns, strict=True
|
||||
):
|
||||
if compiled.search(text):
|
||||
found.append(pattern)
|
||||
|
||||
passed = len(found) == 0
|
||||
|
||||
73
tests/test_core/test_config.py
Normal file
@@ -0,0 +1,73 @@
|
||||
"""Tests for configuration module."""
|
||||
|
||||
from pathlib import Path
|
||||
|
||||
import pytest
|
||||
|
||||
from veritext.core.config import VeritextSettings, get_settings
|
||||
|
||||
|
||||
class TestVeritextSettings:
|
||||
"""Tests for VeritextSettings."""
|
||||
|
||||
def test_default_log_level(self) -> None:
|
||||
"""Test default log level is INFO."""
|
||||
settings = VeritextSettings()
|
||||
assert settings.log_level == "INFO"
|
||||
|
||||
def test_default_log_format(self) -> None:
|
||||
"""Test default log format is console."""
|
||||
settings = VeritextSettings()
|
||||
assert settings.log_format == "console"
|
||||
|
||||
def test_default_benchmark_path(self) -> None:
|
||||
"""Test default benchmark storage path."""
|
||||
settings = VeritextSettings()
|
||||
assert settings.benchmark_storage_path == Path("benchmarks")
|
||||
|
||||
def test_default_tokeniser_lowercase(self) -> None:
|
||||
"""Test default tokeniser lowercase setting."""
|
||||
settings = VeritextSettings()
|
||||
assert settings.tokeniser_lowercase is True
|
||||
|
||||
def test_default_tokeniser_remove_punctuation(self) -> None:
|
||||
"""Test default tokeniser remove punctuation setting."""
|
||||
settings = VeritextSettings()
|
||||
assert settings.tokeniser_remove_punctuation is True
|
||||
|
||||
def test_default_semantic_model(self) -> None:
|
||||
"""Test default semantic model name."""
|
||||
settings = VeritextSettings()
|
||||
assert settings.semantic_model == "all-MiniLM-L6-v2"
|
||||
|
||||
def test_default_semantic_cache_enabled(self) -> None:
|
||||
"""Test semantic cache is enabled by default."""
|
||||
settings = VeritextSettings()
|
||||
assert settings.semantic_cache_embeddings is True
|
||||
|
||||
def test_env_var_override(self, monkeypatch: pytest.MonkeyPatch) -> None:
|
||||
"""Test environment variable overrides default settings."""
|
||||
monkeypatch.setenv("VERITEXT_LOG_LEVEL", "DEBUG")
|
||||
settings = VeritextSettings()
|
||||
assert settings.log_level == "DEBUG"
|
||||
|
||||
def test_env_var_override_log_format(self, monkeypatch: pytest.MonkeyPatch) -> None:
|
||||
"""Test environment variable overrides log format."""
|
||||
monkeypatch.setenv("VERITEXT_LOG_FORMAT", "json")
|
||||
settings = VeritextSettings()
|
||||
assert settings.log_format == "json"
|
||||
|
||||
|
||||
class TestGetSettings:
|
||||
"""Tests for get_settings function."""
|
||||
|
||||
def test_get_settings_returns_instance(self) -> None:
|
||||
"""Test get_settings returns a VeritextSettings instance."""
|
||||
settings = get_settings()
|
||||
assert isinstance(settings, VeritextSettings)
|
||||
|
||||
def test_get_settings_returns_valid_defaults(self) -> None:
|
||||
"""Test get_settings returns instance with valid defaults."""
|
||||
settings = get_settings()
|
||||
assert settings.log_level in ("DEBUG", "INFO", "WARNING", "ERROR")
|
||||
assert settings.log_format in ("console", "json")
|
||||
56
tests/test_core/test_logging.py
Normal file
@@ -0,0 +1,56 @@
|
||||
"""Tests for logging module."""
|
||||
|
||||
from veritext.core.logging import configure_logging, get_logger
|
||||
|
||||
|
||||
class TestGetLogger:
|
||||
"""Tests for get_logger function."""
|
||||
|
||||
def test_get_logger_returns_logger(self) -> None:
|
||||
"""Test get_logger returns a logger instance."""
|
||||
logger = get_logger()
|
||||
assert logger is not None
|
||||
|
||||
def test_get_logger_default_name(self) -> None:
|
||||
"""Test get_logger uses 'veritext' as default name."""
|
||||
logger = get_logger()
|
||||
# The logger should be a bound logger from structlog
|
||||
assert hasattr(logger, "info")
|
||||
assert hasattr(logger, "debug")
|
||||
assert hasattr(logger, "warning")
|
||||
assert hasattr(logger, "error")
|
||||
|
||||
def test_get_logger_custom_name(self) -> None:
|
||||
"""Test get_logger respects custom name parameter."""
|
||||
logger = get_logger("custom.module")
|
||||
assert logger is not None
|
||||
assert hasattr(logger, "info")
|
||||
|
||||
|
||||
class TestConfigureLogging:
|
||||
"""Tests for configure_logging function."""
|
||||
|
||||
def test_configure_logging_console_format(self) -> None:
|
||||
"""Test configure_logging with console format does not raise."""
|
||||
configure_logging(level="INFO", log_format="console")
|
||||
logger = get_logger()
|
||||
assert logger is not None
|
||||
|
||||
def test_configure_logging_json_format(self) -> None:
|
||||
"""Test configure_logging with json format does not raise."""
|
||||
configure_logging(level="DEBUG", log_format="json")
|
||||
logger = get_logger()
|
||||
assert logger is not None
|
||||
|
||||
def test_configure_logging_uses_defaults(self) -> None:
|
||||
"""Test configure_logging uses settings defaults when not provided."""
|
||||
configure_logging()
|
||||
logger = get_logger()
|
||||
assert logger is not None
|
||||
|
||||
def test_configure_logging_different_levels(self) -> None:
|
||||
"""Test configure_logging accepts different log levels."""
|
||||
for level in ("DEBUG", "INFO", "WARNING", "ERROR"):
|
||||
configure_logging(level=level)
|
||||
logger = get_logger()
|
||||
assert logger is not None
|
||||
@@ -5,12 +5,11 @@ import pytest
|
||||
|
||||
@pytest.fixture
|
||||
def plugin_pytester(pytester: pytest.Pytester) -> pytest.Pytester:
|
||||
"""Configure pytester to use the veritext plugin."""
|
||||
pytester.makeconftest(
|
||||
"""Configure pytester to use the veritext plugin.
|
||||
|
||||
Note: The plugin is already loaded via the entry point in pyproject.toml,
|
||||
so no explicit pytest_plugins declaration is needed.
|
||||
"""
|
||||
pytest_plugins = ['veritext.pytest_plugin']
|
||||
"""
|
||||
)
|
||||
return pytester
|
||||
|
||||
|
||||
|
||||
@@ -263,6 +263,11 @@ class TestContainsValidator:
|
||||
with pytest.raises(InvalidThresholdError, match="cannot be empty"):
|
||||
ContainsValidator(patterns=[])
|
||||
|
||||
def test_contains_validator_raises_on_invalid_regex(self) -> None:
|
||||
"""Test that invalid regex pattern raises error at init time."""
|
||||
with pytest.raises(InvalidThresholdError, match="Invalid regex"):
|
||||
ContainsValidator(patterns=[r"[invalid"])
|
||||
|
||||
def test_contains_factory_function(self) -> None:
|
||||
"""Test the contains() factory function."""
|
||||
validator = contains(patterns=["test"], case_sensitive=True)
|
||||
@@ -327,6 +332,11 @@ class TestExcludesValidator:
|
||||
with pytest.raises(InvalidThresholdError, match="cannot be empty"):
|
||||
ExcludesValidator(patterns=[])
|
||||
|
||||
def test_excludes_validator_raises_on_invalid_regex(self) -> None:
|
||||
"""Test that invalid regex pattern raises error at init time."""
|
||||
with pytest.raises(InvalidThresholdError, match="Invalid regex"):
|
||||
ExcludesValidator(patterns=[r"[invalid"])
|
||||
|
||||
def test_excludes_factory_function(self) -> None:
|
||||
"""Test the excludes() factory function."""
|
||||
validator = excludes(patterns=["test"], case_sensitive=True)
|
||||
|
||||