11 Commits

Author SHA1 Message Date
0699e97e1d refactor: CLI cleanup and documentation updates
- Refactor CLI metric computation to eliminate code duplication
- Update version format to PEP 440 compliance (0.1.0.dev0)
- Cache Settings instance via @lru_cache for performance
- Document composite validators' protocol deviation
- Consolidate redundant empty checks in ROUGE-L computation
- Add Phase 10 (Portfolio Demos) to implementation plan
2026-02-04 15:38:46 +00:00
7de4505e31 fix(pytest-plugin): remove duplicate plugin registration in tests
The pytest plugin is already loaded via the entry point, so explicitly
declaring it in conftest causes a duplicate registration error.
2026-02-04 00:43:20 +00:00
564d663c78 docs(changelog): update for QA fixes
2026-02-04 00:23:06 +00:00
0b2bc6c688 test(core): add coverage for config and logging modules
Adds tests for VeritextSettings defaults, env var overrides, and the
get_logger/configure_logging functions.
2026-02-04 00:22:57 +00:00
aa687f43cd fix(validators): validate regex patterns at init time
ContainsValidator and ExcludesValidator now pre-compile regex patterns
during initialisation and raise InvalidThresholdError if invalid.
2026-02-04 00:22:47 +00:00
f18427e123 fix: QA review fixes for 0.1.0 release
- Fix README readability example property names
- Add validation for empty references after tokenisation in ROUGE
- Guard against zero sentence count in readability metric
- Implement LRU cache with max size for semantic embeddings
- Add .score property to LexicalResult for API consistency
- Use defensive list copy in composite validators
2026-02-03 21:31:48 +00:00
1754556c99 docs(changelog): release 0.1.0
Initial release with metrics, validators, pytest plugin, benchmark
module, CLI, and comprehensive documentation.
2026-02-03 19:16:37 +00:00
13c869f5d6 docs(readme): comprehensive documentation
Expands readme with detailed coverage of metrics, validators, pytest
plugin, benchmark module, CLI commands, and development setup.
2026-02-03 19:16:14 +00:00
93515707cc docs(examples): add benchmark regression example
Demonstrates benchmark quality tracking with historical comparison and
CI integration using assert_no_regression() for exit code control.
2026-02-03 19:15:12 +00:00
3cde5aba77 docs(examples): add chatbot testing example
Demonstrates pytest integration for chatbot QA with validate_text()
assertions, fixtures, and parametrised content safety tests.
2026-02-03 19:14:25 +00:00
69966d171c docs(examples): add basic validation example
Demonstrates core Veritext functionality: metrics, validators, composites,
and constraint validators with runnable code.
2026-02-03 19:13:47 +00:00
20 changed files with 1275 additions and 103 deletions


@@ -7,35 +7,85 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
## [Unreleased]
### Changed
- Refactored CLI metric computation to eliminate code duplication
- Version format updated from `0.1.0-dev` to `0.1.0.dev0` (PEP 440 compliance)
- Settings instance is now cached via `@lru_cache` for better performance
- Documented composite validators' intentional deviation from `Check` protocol return type
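The settings caching noted in the list above can be illustrated with a minimal sketch. It assumes a plain dataclass in place of the project's pydantic-settings model; `DemoSettings`, `get_settings` and the `DEMO_LOG_LEVEL` variable are hypothetical names chosen for illustration, not Veritext's actual API.

```python
import os
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class DemoSettings:
    """Hypothetical settings object standing in for the real settings class."""

    log_level: str = "INFO"


@lru_cache(maxsize=1)
def get_settings() -> DemoSettings:
    """Build the settings once; later calls return the cached instance."""
    return DemoSettings(log_level=os.environ.get("DEMO_LOG_LEVEL", "INFO"))


first = get_settings()
second = get_settings()
print(first is second)  # repeated calls return the same object

# Tests that change environment variables can reset the cache explicitly.
get_settings.cache_clear()
```

One consequence of this design is that environment changes made after the first call are ignored until `cache_clear()` is called, which is usually what tests need to account for.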
### Fixed
- Consolidated redundant empty checks in ROUGE-L computation
- Fixed README example using incorrect property names (`grade_level` → `flesch_kincaid_grade`, `reading_ease` → `flesch_reading_ease`)
### Documentation
- Added Phase 10 (Portfolio Demos) to implementation plan: Streamlit demo and Jupyter notebooks
- Updated project plan with portfolio demo section
- Fixed potential crash in ROUGE metric when all references are empty after tokenisation
- Fixed potential division by zero in readability metric when text has no sentence endings
- Fixed unbounded cache growth in `SemanticSimilarity` by implementing LRU eviction with configurable max size
- Fixed mutable list aliasing in `AllOf` and `AnyOf` composite validators
- Fixed regex pattern validation in `ContainsValidator` and `ExcludesValidator` to fail at init time rather than during `check()`
- Fixed pytest plugin tests failing with duplicate plugin registration error
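The init-time regex validation described in the list above could look roughly like the sketch below. It is not the project's implementation: `PatternCheck` is a hypothetical class, `ValueError` stands in for the `InvalidThresholdError` named in the changelog, and case-insensitive matching is an assumption.

```python
import re


class PatternCheck:
    """Hypothetical validator that compiles its patterns eagerly."""

    def __init__(self, patterns: list[str]) -> None:
        compiled = []
        for pattern in patterns:
            try:
                # Assumption: matching is case-insensitive.
                compiled.append(re.compile(pattern, re.IGNORECASE))
            except re.error as exc:
                # Fail fast: a bad pattern is reported at construction time,
                # not on the first call to check().
                raise ValueError(f"invalid pattern {pattern!r}: {exc}") from exc
        self._compiled = compiled

    def check(self, text: str) -> bool:
        """Return True when every pattern is found in the text."""
        return all(p.search(text) for p in self._compiled)


print(PatternCheck(["guide", r"basic\w*"]).check("A guide to the basics."))

try:
    PatternCheck(["(unclosed"])
except ValueError as err:
    print(f"rejected at init: {err}")
```

Compiling once in `__init__` also avoids recompiling the same patterns on every `check()` call.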
### Added
- Added `.score` property to `LexicalResult` for API consistency with other result types
- Added `cache_max_size` parameter to `SemanticSimilarity` (default: 1000 embeddings)
- Added test coverage for `core/config.py` and `core/logging.py` modules
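The bounded embedding cache behind `cache_max_size` can be sketched with an `OrderedDict`. This is only an illustration of LRU eviction under stated assumptions: `LruEmbeddingCache` and `fake_embed` are hypothetical, and the real `SemanticSimilarity` class may organise its cache differently.

```python
from collections import OrderedDict
from typing import Callable


class LruEmbeddingCache:
    """Hypothetical cache that keeps at most `max_size` embeddings."""

    def __init__(self, embed: Callable[[str], list[float]], max_size: int = 1000) -> None:
        self._embed = embed
        self._max_size = max_size
        self._items: OrderedDict[str, list[float]] = OrderedDict()

    def get(self, text: str) -> list[float]:
        if text in self._items:
            self._items.move_to_end(text)  # mark as most recently used
            return self._items[text]
        vector = self._embed(text)
        self._items[text] = vector
        if len(self._items) > self._max_size:
            self._items.popitem(last=False)  # evict the least recently used entry
        return vector

    def __contains__(self, text: str) -> bool:
        return text in self._items

    def __len__(self) -> int:
        return len(self._items)


def fake_embed(text: str) -> list[float]:
    """Stand-in for a real embedding model."""
    return [float(len(text))]


cache = LruEmbeddingCache(fake_embed, max_size=2)
for text in ["a", "bb", "a", "ccc"]:
    cache.get(text)
print(len(cache), "bb" in cache)
```

Because `"a"` is read again before `"ccc"` arrives, `"bb"` becomes the least recently used entry and is the one evicted.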
## [0.1.0] — 2026-02-03
Initial release of Veritext, a semantic text validation framework for Python.
### Added
#### Core
- Project scaffold with pyproject.toml and development tooling
- Core exception hierarchy (`VeritextError` and subclasses)
- Core types: `ValidationContext`, `CheckResult`, `ValidationResult`
- Word tokeniser with Unicode normalisation support
- Configuration module with pydantic-settings
- Structured logging with structlog
#### Metrics
- Metrics module with `Metric` protocol, `AggregateStats`, and `BatchResult` types
- BLEU metric implementation (BLEU-1 through BLEU-4 with brevity penalty)
- Lexical similarity metric (Jaccard similarity and token overlap)
- ROUGE metric (ROUGE-1, ROUGE-2, ROUGE-L with precision/recall/F-measure)
- Lexical similarity metric (Jaccard similarity and token overlap)
- Flesch-Kincaid readability metrics (grade level and reading ease)
- Batch scoring with aggregate statistics for all metrics
#### Validators
- Validators module with `Check` protocol for validation checks
- Metric-based validators: `BleuValidator`, `RougeValidator`, `LexicalValidator`
- Constraint validators: `LengthValidator`, `ReadabilityValidator`, `ContainsValidator`, `ExcludesValidator`
- Composite validators: `AllOf` (all checks must pass), `AnyOf` (any check must pass)
- Factory functions for clean validator API (`bleu()`, `rouge()`, `lexical()`, `length()`, `readability()`, `contains()`, `excludes()`, `all_of()`, `any_of()`)
#### Semantic Similarity
- Semantic similarity module with embedding-based text comparison (requires `veritext[semantic]` extra)
- `SemanticSimilarity` metric using sentence-transformers for semantic relatedness
- `SemanticValidator` for threshold-based semantic similarity validation
- `semantic()` factory function for creating semantic validators
- Embedding caching for performance optimisation in repeated comparisons
#### Pytest Plugin
- Native pytest plugin for CI/CD integration (entry point: `pytest11`)
- `validate_text()` assertion function for expressive test assertions
- `text_validation` marker for filtering validation tests
- Pytest fixtures: `text_validator` factory and `validation_context` helper
- Detailed failure messages with text preview and check diagnostics
#### Benchmarking
- Benchmark module for quality tracking and regression detection
- `Benchmark` class for evaluating text quality over time with metric storage
- `BenchmarkRun` and `RegressionReport` data models for tracking runs
@@ -45,6 +95,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- `assert_no_regression()` raises `RegressionDetectedError` for CI integration
- Customisable tolerance threshold and window size for regression detection
- Metadata support for tracking git SHA, model versions, etc.
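How a tolerance and a window might interact in regression detection is sketched below. The semantics are an assumption, not taken from the Veritext source: the baseline is taken as the mean of up to `window` earlier scores, and a regression is flagged when the latest score falls more than `tolerance` below it. `detect_regression` is a hypothetical helper.

```python
from statistics import mean


def detect_regression(
    history: list[float], tolerance: float = 0.05, window: int = 5
) -> tuple[bool, float]:
    """Compare the latest score with the mean of up to `window` earlier scores.

    Returns (regression_detected, delta), where delta = latest - baseline.
    """
    if len(history) < 2:
        return False, 0.0  # nothing to compare against yet
    *earlier, current = history
    baseline = mean(earlier[-window:])
    delta = current - baseline
    return delta < -tolerance, delta


stable = [0.62, 0.61, 0.63, 0.62, 0.62]
print(detect_regression(stable + [0.61]))  # small dip, within tolerance
print(detect_regression(stable + [0.40]))  # large drop, flagged
```

A larger window smooths out noisy runs at the cost of reacting more slowly to a gradual decline.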
#### CLI
- Command-line interface (CLI) via `veritext` command
- `veritext validate` command for inline and file-based text validation
- JSONL input format support for batch validation (`--file` option)
@@ -54,3 +107,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- `veritext benchmark show` command for viewing benchmark history
- `veritext benchmark check` command for regression detection with exit code 1 on failure
- Rich-formatted terminal output with tables and coloured panels
#### Documentation
- Comprehensive readme with usage examples
- Example scripts: basic validation, chatbot testing, benchmark regression


@@ -871,6 +871,59 @@ uv run pytest --cov=src/veritext --cov-report=term-missing
---
### Phase 10: Portfolio Demos
**Goal:** Interactive demos for showcasing Veritext without installation.
**Step 1 — Streamlit Demo:**
Build a quick interactive web UI for general visitors.
- [ ] Create `demo/streamlit_app.py`
- [ ] Text input boxes (candidate + reference)
- [ ] Metric selector (BLEU, ROUGE, lexical, readability)
- [ ] Threshold sliders for pass/fail validation
- [ ] Results table with scores and status
- [ ] Deploy to homeserver (e.g., `veritext.kschappell.com`)
**Step 2 — Jupyter Notebook Collection:**
Deep-dive notebooks targeting data science and ML recruiters.
- [ ] Create `notebooks/` directory
- [ ] `01-metrics-overview.ipynb` — Introduction to each metric with visualisations
- [ ] `02-batch-evaluation.ipynb` — Evaluating model outputs at scale
- [ ] `03-regression-detection.ipynb` — Tracking quality over time
- [ ] `04-chatbot-validation.ipynb` — Real-world use case
**Step 3 — JupyterLite Deployment:**
Host notebooks as static files running in the browser.
- [ ] Configure JupyterLite build with veritext pre-installed
- [ ] Bundle notebooks into static site
- [ ] Deploy alongside Streamlit demo
**Files:**
- `demo/streamlit_app.py`
- `notebooks/01-metrics-overview.ipynb`
- `notebooks/02-batch-evaluation.ipynb`
- `notebooks/03-regression-detection.ipynb`
- `notebooks/04-chatbot-validation.ipynb`
- `notebooks/jupyterlite-config.json`
**Verification:**
```bash
# Streamlit
uv run streamlit run demo/streamlit_app.py
# JupyterLite (local preview)
jupyter lite build --contents notebooks/
jupyter lite serve
```
---
## Dependencies
```toml


@@ -488,3 +488,47 @@ benchmark.assert_no_regression(tolerance=0.03)
5. **Natural portfolio narrative** — "I was building X and needed a better way to test
it, so I built this tool." Every interviewer has faced similar problems.
---
## Portfolio Demos (Future)
Interactive demos to showcase Veritext without requiring installation.
### Streamlit Demo
A quick interactive web UI for general visitors and recruiters.
**Features:**
- Text input boxes (candidate + reference)
- Metric selector (BLEU, ROUGE, lexical, readability)
- Threshold sliders for pass/fail validation
- Results table with scores and status
**Deployment:** Self-hosted on homeserver (e.g., `veritext.kschappell.com`)
**Effort:** ~30 minutes
### Jupyter Notebook Collection
Deep-dive notebooks targeting data science and ML recruiters.
**Notebooks:**
| Notebook | Purpose |
|----------|---------|
| `01-metrics-overview.ipynb` | Introduction to each metric with visualisations |
| `02-batch-evaluation.ipynb` | Evaluating model outputs at scale, statistical analysis |
| `03-regression-detection.ipynb` | Tracking quality over time, detecting degradation |
| `04-chatbot-validation.ipynb` | Real-world use case: validating chatbot responses |
**Hosting:** JupyterLite (static files, runs in browser via WebAssembly)
**Deployment:** Self-hosted alongside Streamlit demo
**Why both:**
| Demo Type | Audience | Value |
|-----------|----------|-------|
| Streamlit | General visitors | Quick, interactive, no friction |
| Notebooks | Data/ML recruiters | Shows analytical depth, speaks their language |


@@ -0,0 +1,135 @@
"""Basic text validation examples.
Demonstrates core Veritext functionality:
- Single metric scoring (BLEU, ROUGE)
- Validator usage with thresholds
- Composite validators (all_of, any_of)
- Constraint validators (length, readability)
"""

from veritext.core.types import ValidationContext
from veritext.metrics import Bleu, Rouge
from veritext.validators import (
    all_of,
    any_of,
    bleu,
    contains,
    excludes,
    length,
    readability,
    rouge,
)


def metric_scoring_example() -> None:
    """Score text using individual metrics."""
    candidate = "The quick brown fox jumps over the lazy dog."
    reference = "A fast brown fox leaps over a sleepy dog."

    # BLEU scoring (translation quality)
    bleu_metric = Bleu()
    bleu_result = bleu_metric.score(candidate, reference)
    print("BLEU Scores:")
    print(f" BLEU-1: {bleu_result.bleu1:.3f}")
    print(f" BLEU-4: {bleu_result.bleu4:.3f}")
    print(f" Brevity penalty: {bleu_result.brevity_penalty:.3f}")

    # ROUGE scoring (summary quality)
    rouge_metric = Rouge()
    rouge_result = rouge_metric.score(candidate, reference)
    print("\nROUGE Scores:")
    print(f" ROUGE-1 F1: {rouge_result.rouge1.fmeasure:.3f}")
    print(f" ROUGE-L F1: {rouge_result.rouge_l.fmeasure:.3f}")


def validator_example() -> None:
    """Use validators to make pass/fail decisions."""
    reference = "Machine learning models require training data."
    candidate = "ML models need training data to learn patterns."
    context = ValidationContext(reference=reference)

    # BLEU validator with minimum threshold
    bleu_validator = bleu(min_score=0.3)
    result = bleu_validator.check(candidate, context)
    print(f"\nBLEU validation (min 0.3): {'PASS' if result.passed else 'FAIL'}")

    # ROUGE validator
    rouge_validator = rouge(min_score=0.5)
    result = rouge_validator.check(candidate, context)
    print(f"ROUGE validation (min 0.5): {'PASS' if result.passed else 'FAIL'}")


def composite_validator_example() -> None:
    """Combine validators with all_of and any_of."""
    reference = "The product launch exceeded all expectations."
    candidate = "The product release performed beyond expectations."
    context = ValidationContext(reference=reference)

    # All checks must pass
    strict_validator = all_of(
        [
            bleu(min_score=0.2),
            rouge(min_score=0.4),
            length(max_chars=100),
        ]
    )
    result = strict_validator.check(candidate, context)
    print(f"\nStrict (all_of): {'PASS' if result.passed else 'FAIL'}")
    if not result.passed:
        print(f" Failures: {result.failure_summary}")

    # At least one check must pass
    flexible_validator = any_of(
        [
            bleu(min_score=0.8),  # Unlikely to pass
            rouge(min_score=0.4),  # More likely
        ]
    )
    result = flexible_validator.check(candidate, context)
    print(f"Flexible (any_of): {'PASS' if result.passed else 'FAIL'}")


def constraint_validator_example() -> None:
    """Use constraint validators for text properties."""
    text = "This short guide explains the basics clearly."
    context = ValidationContext()  # No reference needed for constraints

    # Length constraints
    length_validator = length(min_chars=20, max_chars=100, min_words=5, max_words=20)
    result = length_validator.check(text, context)
    print(f"\nLength check: {'PASS' if result.passed else 'FAIL'}")

    # Readability (Flesch-Kincaid)
    readability_validator = readability(max_grade=10.0)
    result = readability_validator.check(text, context)
    print(f"Readability (grade <= 10): {'PASS' if result.passed else 'FAIL'}")

    # Content patterns
    contains_validator = contains(patterns=["guide", "basics"])
    result = contains_validator.check(text, context)
    print(f"Contains required terms: {'PASS' if result.passed else 'FAIL'}")

    excludes_validator = excludes(patterns=["error", "warning"])
    result = excludes_validator.check(text, context)
    print(f"Excludes forbidden terms: {'PASS' if result.passed else 'FAIL'}")


def main() -> None:
    """Run all examples."""
    print("=" * 60)
    print("Veritext Basic Validation Examples")
    print("=" * 60)
    metric_scoring_example()
    validator_example()
    composite_validator_example()
    constraint_validator_example()
    print("\n" + "=" * 60)
    print("All examples completed.")


if __name__ == "__main__":
    main()


@@ -0,0 +1,160 @@
"""Benchmark quality tracking with regression detection.
Demonstrates Veritext's benchmark module for CI integration:
- Creating a benchmark suite
- Running evaluations and storing results
- Checking for quality regression
- CI integration pattern with exit codes
"""

import tempfile
from pathlib import Path

from veritext.benchmark import Benchmark
from veritext.core.exceptions import RegressionDetectedError


def create_sample_data() -> tuple[list[str], list[str]]:
    """Create sample candidate/reference pairs for benchmarking."""
    # Simulated summarisation outputs and references
    candidates = [
        "The new policy aims to reduce carbon emissions by 50% by 2030.",
        "Scientists discovered a new species of deep-sea fish.",
        "The company reported record profits in the third quarter.",
        "Researchers developed a breakthrough treatment for the disease.",
        "The city plans to expand public transportation routes.",
    ]
    references = [
        "The policy targets a 50% reduction in carbon emissions by 2030.",
        "A new deep-sea fish species was discovered by marine biologists.",
        "Record profits were announced by the company for Q3.",
        "A breakthrough disease treatment was developed by researchers.",
        "Public transport expansion is planned for the city.",
    ]
    return candidates, references


def run_benchmark_example() -> None:
    """Run a benchmark evaluation and view results."""
    # Use a temp directory for this example
    with tempfile.TemporaryDirectory() as tmpdir:
        storage_path = Path(tmpdir) / "benchmarks"

        # Create benchmark suite
        bench = Benchmark("summariser_quality", storage_path=storage_path)
        candidates, references = create_sample_data()

        # Run evaluation
        print("Running benchmark evaluation...")
        run = bench.evaluate(
            candidates=candidates,
            references=references,
            metrics=["rouge_l", "bleu4"],
            metadata={"model": "v1.0", "dataset": "test"},
        )
        print("\nBenchmark run completed:")
        print(f" Run ID: {run.id[:8]}...")
        print(f" Samples: {run.sample_count}")
        print(" Metrics:")
        for name, value in run.metrics.items():
            print(f" {name}: {value:.4f}")


def regression_detection_example() -> None:
    """Demonstrate regression detection with historical comparison."""
    with tempfile.TemporaryDirectory() as tmpdir:
        storage_path = Path(tmpdir) / "benchmarks"
        bench = Benchmark("summariser_quality", storage_path=storage_path)
        candidates, references = create_sample_data()

        # Simulate historical runs with stable quality
        print("\nBuilding baseline with historical runs...")
        for i in range(5):
            bench.evaluate(
                candidates=candidates,
                references=references,
                metrics=["rouge_l", "bleu4"],
                metadata={"run": f"baseline_{i}"},
            )
            print(f" Baseline run {i + 1} recorded")

        # Check regression (no degradation expected)
        report = bench.check_regression(tolerance=0.05, window=5)
        print(f"\nRegression check: {'DETECTED' if report.detected else 'NONE'}")

        # Simulate a degraded model
        print("\nSimulating degraded model output...")
        degraded_candidates = [
            "Policy carbon emissions.",  # Much shorter/worse
            "Fish discovered.",
            "Company profits.",
            "Treatment developed.",
            "Transport expansion.",
        ]
        bench.evaluate(
            candidates=degraded_candidates,
            references=references,
            metrics=["rouge_l", "bleu4"],
            metadata={"model": "v1.1-broken"},
        )

        # Check regression (should detect)
        report = bench.check_regression(tolerance=0.05, window=5)
        print(f"Regression check: {'DETECTED' if report.detected else 'NONE'}")
        if report.detected:
            print("\nRegression details:")
            for metric, delta in report.deltas.items():
                baseline = report.baseline.get(metric, 0)
                current = report.current.get(metric, 0)
                print(f" {metric}: {baseline:.4f} -> {current:.4f} ({delta:+.4f})")


def ci_integration_example() -> None:
    """CI integration pattern using assert_no_regression()."""
    with tempfile.TemporaryDirectory() as tmpdir:
        storage_path = Path(tmpdir) / "benchmarks"
        bench = Benchmark("ci_check", storage_path=storage_path)
        candidates, references = create_sample_data()

        # Build baseline
        for _ in range(3):
            bench.evaluate(candidates, references, metrics=["rouge_l"])

        # Simulate CI check
        print("\n" + "=" * 50)
        print("CI Integration Example")
        print("=" * 50)
        print("\nRunning evaluation...")
        bench.evaluate(candidates, references, metrics=["rouge_l"])
        print("Checking for regression...")
        try:
            bench.assert_no_regression(tolerance=0.05, window=3)
            print("No regression detected.")
            print("CI status: EXIT 0")
        except RegressionDetectedError as e:
            print(f"Regression detected: {e}")
            print("CI status: EXIT 1")


def main() -> None:
    """Run all benchmark examples."""
    print("=" * 60)
    print("Veritext Benchmark & Regression Detection Examples")
    print("=" * 60)
    run_benchmark_example()
    regression_detection_example()
    ci_integration_example()
    print("\n" + "=" * 60)
    print("All examples completed.")


if __name__ == "__main__":
    main()

examples/chatbot_testing.py (new file, 140 lines changed)

@@ -0,0 +1,140 @@
"""Pytest integration for chatbot testing.
Demonstrates Veritext's pytest plugin for testing chatbot responses:
- validate_text() assertion function
- Custom test fixtures
- Test organisation with markers
"""

import pytest

from veritext.pytest_plugin import validate_text

# Sample chatbot responses for testing
CHATBOT_RESPONSES = {
    "greeting": {
        "input": "Hello!",
        "response": "Hi there! How can I help you today?",
        "expected_keywords": ["help", "hi"],
    },
    "weather": {
        "input": "What's the weather like?",
        "response": "I don't have access to real-time weather data, but you can "
        "check a weather service like weather.com for current conditions.",
        "expected_keywords": ["weather", "check"],
    },
    "farewell": {
        "input": "Goodbye!",
        "response": "Goodbye! Have a great day!",
        "expected_keywords": ["goodbye", "day"],
    },
}


# Fixtures for common test setup
@pytest.fixture
def greeting_response() -> str:
    """Provide a sample greeting response."""
    return CHATBOT_RESPONSES["greeting"]["response"]


@pytest.fixture
def weather_response() -> str:
    """Provide a sample weather response."""
    return CHATBOT_RESPONSES["weather"]["response"]


# Basic validation tests
class TestResponseQuality:
    """Test chatbot response quality using Veritext."""

    def test_greeting_length(self, greeting_response: str) -> None:
        """Greeting responses should be concise."""
        validate_text(
            greeting_response,
            min_length=10,
            max_length=100,
        )

    def test_greeting_readability(self, greeting_response: str) -> None:
        """Greeting responses should be easy to read."""
        validate_text(
            greeting_response,
            max_reading_grade=8.0,
        )

    def test_greeting_contains_keywords(self, greeting_response: str) -> None:
        """Greeting should contain expected terms."""
        validate_text(
            greeting_response,
            must_contain=["help"],
        )

    def test_weather_response_quality(self, weather_response: str) -> None:
        """Weather response should be informative and readable."""
        validate_text(
            weather_response,
            min_length=50,
            max_length=500,
            max_reading_grade=10.0,
            must_contain=["weather"],
        )


# Tests with reference comparison
class TestResponseSimilarity:
    """Test response similarity against reference texts."""

    def test_greeting_similarity(self) -> None:
        """Greeting should match expected style."""
        reference = "Hello! How may I assist you today?"
        response = CHATBOT_RESPONSES["greeting"]["response"]
        validate_text(
            response,
            reference=reference,
            min_rouge=0.3,  # Allow variation in wording
            min_length=10,
        )

    def test_farewell_similarity(self) -> None:
        """Farewell should match expected style."""
        reference = "Goodbye! Have a wonderful day!"
        response = CHATBOT_RESPONSES["farewell"]["response"]
        validate_text(
            response,
            reference=reference,
            min_rouge=0.5,
            must_contain=["goodbye"],
        )


# Content safety tests
class TestContentSafety:
    """Test responses for inappropriate content."""

    @pytest.mark.parametrize("response_key", ["greeting", "weather", "farewell"])
    def test_no_profanity(self, response_key: str) -> None:
        """Responses should not contain profanity."""
        response = CHATBOT_RESPONSES[response_key]["response"]
        validate_text(
            response,
            must_exclude=["damn", "hell", "crap"],
            min_length=1,
        )

    @pytest.mark.parametrize("response_key", ["greeting", "weather", "farewell"])
    def test_no_harmful_content(self, response_key: str) -> None:
        """Responses should not contain harmful instructions."""
        response = CHATBOT_RESPONSES[response_key]["response"]
        validate_text(
            response,
            must_exclude=["hack", "exploit", "attack"],
            min_length=1,
        )


# Run tests when executed directly
if __name__ == "__main__":
    pytest.main([__file__, "-v"])


@@ -1,6 +1,6 @@
 [project]
 name = "veritext"
-version = "0.1.0-dev"
+version = "0.1.0.dev0"
 description = "Semantic text validation framework"
 readme = "readme.md"
 requires-python = ">=3.11"

readme.md (386 lines changed)

@@ -2,48 +2,398 @@
 Semantic text validation framework for Python.
-Validates text outputs against quality criteria using metrics like BLEU, ROUGE,
-and semantic similarity. Designed for developers building systems that produce
-text (chatbots, content generators, summarisation tools) who need automated
-quality assurance beyond simple string matching.
+Veritext validates text outputs against quality criteria using metrics like BLEU,
+ROUGE, and semantic similarity. Designed for developers building systems that produce
+text (chatbots, content generators, summarisation tools) who need automated quality
+assurance beyond simple string matching.
-## Status
+## Features
-Under active development. See [changelog.md](changelog.md) for progress.
+- **Multiple metrics** — BLEU, ROUGE, lexical similarity, readability, semantic
+  embeddings
+- **Composable validators** — Build complex checks from simple primitives
+- **Native pytest integration** — `validate_text()` assertion for test suites
+- **Quality benchmarking** — Track metrics over time with regression detection
+- **CLI tools** — Command-line validation and benchmark management
 ## Installation
 ```bash
 pip install veritext
-# With semantic similarity support
+# With semantic similarity support (sentence-transformers)
 pip install veritext[semantic]
 ```
 ## Quick Start
 ```python
-from veritext import validators as v
 from veritext.core.types import ValidationContext
+from veritext.validators import all_of, bleu, length, rouge
-# Create validators
-validator = v.all_of([
-    v.bleu(min_score=0.7),
-    v.length(max_chars=500),
+# Create a validator
+validator = all_of([
+    bleu(min_score=0.5),
+    rouge(min_score=0.6),
+    length(max_chars=500),
 ])
 # Validate text
-context = ValidationContext(reference="The cat sat on the mat.")
-result = validator.check("A cat is sitting on the mat.", context)
-if not result.passed:
+context = ValidationContext(reference="The quick brown fox jumps over the lazy dog.")
+result = validator.check("A fast brown fox leaps over a sleepy dog.", context)
+if result.passed:
+    print("Validation passed!")
+else:
     print(result.failure_summary)
 ```
-## Documentation
-- [Project Plan](docs/project-plan.md)
-- [Implementation Plan](docs/implementation-plan.md)
+## Metrics
+Veritext provides several metrics for text evaluation.
### BLEU
Measures n-gram precision against reference text. Useful for translation and
generation quality.
```python
from veritext.metrics import Bleu
bleu = Bleu()
result = bleu.score(
    candidate="The cat sat on the mat.",
    reference="A cat is sitting on the mat.",
)
print(f"BLEU-4: {result.bleu4:.3f}") # Uses 1-4 gram precision
print(f"BLEU-1: {result.bleu1:.3f}") # Unigram precision only
```
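To make "n-gram precision" and "brevity penalty" concrete, here is a minimal standalone sketch of the two ingredients as they are usually defined for BLEU. It assumes whitespace tokenisation and a single reference; the helper names are hypothetical, and Veritext's own tokeniser and smoothing may differ.

```python
import math
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate: list[str], reference: list[str], n: int) -> float:
    """Fraction of candidate n-grams found in the reference, with counts clipped."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


def brevity_penalty(candidate_len: int, reference_len: int) -> float:
    """Penalise candidates that are shorter than the reference."""
    if candidate_len == 0:
        return 0.0
    if candidate_len >= reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)


candidate = "the cat sat on the mat".split()
reference = "a cat is sitting on the mat".split()
p1 = clipped_precision(candidate, reference, 1)
bp = brevity_penalty(len(candidate), len(reference))
print(f"unigram precision: {p1:.3f}, brevity penalty: {bp:.3f}")
```

Clipping matters here: "the" appears twice in the candidate but only once in the reference, so it contributes one match rather than two.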
### ROUGE
Measures recall-oriented overlap with reference text. Useful for summarisation.
```python
from veritext.metrics import Rouge
rouge = Rouge()
result = rouge.score(
    candidate="Scientists found a new planet.",
    reference="Researchers discovered a new planet in the solar system.",
)
print(f"ROUGE-1 F1: {result.rouge1.fmeasure:.3f}") # Unigram overlap
print(f"ROUGE-L F1: {result.rouge_l.fmeasure:.3f}") # Longest common subsequence
```
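ROUGE-L is built on the longest common subsequence (LCS) between candidate and reference. The sketch below shows one common formulation, assuming whitespace tokens and an F-measure that weights precision and recall equally; the helper names are hypothetical and the library's own computation may differ in tokenisation and weighting.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for token_a in a:
        curr = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate: list[str], reference: list[str]) -> tuple[float, float, float]:
    """Return (precision, recall, F1) based on the LCS length."""
    if not candidate or not reference:
        return 0.0, 0.0, 0.0  # empty input: nothing to compare
    lcs = lcs_length(candidate, reference)
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


candidate = "scientists found a new planet".split()
reference = "researchers discovered a new planet in the solar system".split()
precision, recall, f1 = rouge_l(candidate, reference)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

The early return for empty token lists mirrors the kind of guard the changelog describes for references that are empty after tokenisation.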
### Lexical Similarity
Measures token overlap using Jaccard similarity.
```python
from veritext.metrics import Lexical
lexical = Lexical()
result = lexical.score(
    candidate="The quick brown fox",
    reference="The fast brown fox",
)
print(f"Jaccard: {result.jaccard:.3f}")
print(f"Token overlap: {result.token_overlap:.3f}")
```
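Jaccard similarity is the size of the intersection of the two token sets divided by the size of their union. A minimal sketch, assuming lowercased whitespace tokens (the library's tokeniser also applies Unicode normalisation, which is omitted here); `jaccard` is a hypothetical helper:

```python
def jaccard(candidate: str, reference: str) -> float:
    """Jaccard similarity between the token sets of two texts."""
    a = set(candidate.lower().split())
    b = set(reference.lower().split())
    if not a and not b:
        return 1.0  # assumption: two empty texts count as identical
    return len(a & b) / len(a | b)


score = jaccard("The quick brown fox", "The fast brown fox")
print(f"Jaccard: {score:.3f}")
```

For these two sentences, three tokens are shared out of five distinct tokens overall.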
### Readability
Computes Flesch-Kincaid scores for text complexity.
```python
from veritext.metrics import Readability
readability = Readability()
result = readability.score("This is a simple sentence.")
print(f"Grade level: {result.flesch_kincaid_grade:.1f}")
print(f"Reading ease: {result.flesch_reading_ease:.1f}")
```
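For reference, the standard Flesch formulas can be computed with a short standalone sketch. The syllable counter is a crude vowel-group heuristic and the handling of empty text is an assumption; both are for illustration only and are not Veritext's implementation. The sentence count is floored at one, which avoids dividing by zero when the text has no sentence-ending punctuation.

```python
import re


def count_syllables(word: str) -> int:
    """Crude heuristic: count runs of vowels, with a minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_scores(text: str) -> tuple[float, float]:
    """Return (Flesch-Kincaid grade level, Flesch reading ease)."""
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0, 0.0  # assumption: empty text scores zero rather than raising
    # Text without terminal punctuation still counts as one sentence.
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    syllables = sum(count_syllables(word) for word in words)
    words_per_sentence = len(words) / sentences
    syllables_per_word = syllables / len(words)
    grade = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    ease = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    return grade, ease


grade, ease = flesch_scores("The cat sat.")
print(f"Grade level: {grade:.2f}, reading ease: {ease:.2f}")
print(flesch_scores("no sentence ending here"))  # does not raise
```

Very short, simple sentences can produce a negative grade level and a reading ease above 100; that is a property of the formulas, not a bug.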
### Semantic Similarity (Optional)
Requires `pip install veritext[semantic]`.
```python
from veritext.semantic import SemanticSimilarity
semantic = SemanticSimilarity()
result = semantic.score(
    candidate="The dog is running in the park.",
    reference="A canine is jogging through the garden.",
)
print(f"Similarity: {result.score:.3f}")
```
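Embedding-based similarity is commonly scored as the cosine of the angle between two embedding vectors. Whether `SemanticSimilarity` uses exactly this measure is an assumption; the sketch below only illustrates the calculation on toy vectors, with no embedding model involved, and `cosine_similarity` is a hypothetical helper.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0  # zero vectors are treated as dissimilar


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # identical direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```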
## Validators
Validators wrap metrics with thresholds to make pass/fail decisions.
### Metric-Based Validators
```python
from veritext.core.types import ValidationContext
from veritext.validators import bleu, lexical, rouge
context = ValidationContext(reference="Reference text here.")
# BLEU validation
validator = bleu(min_score=0.5, variant=4) # BLEU-4
result = validator.check("Candidate text here.", context)
# ROUGE validation
validator = rouge(min_score=0.6, variant="l") # ROUGE-L
result = validator.check("Candidate text here.", context)
# Lexical validation
validator = lexical(min_jaccard=0.3, min_overlap=0.5)
result = validator.check("Candidate text here.", context)
```
### Constraint Validators
These don't require reference text.
```python
from veritext.core.types import ValidationContext
from veritext.validators import contains, excludes, length, readability
context = ValidationContext() # No reference needed
# Length constraints
validator = length(min_chars=50, max_chars=500, min_words=10)
result = validator.check("Your text here...", context)
# Readability constraints
validator = readability(max_grade=8.0, min_ease=60.0)
result = validator.check("Your text here...", context)
# Content requirements
validator = contains(patterns=["important", "keyword"])
result = validator.check("This important text has a keyword.", context)
# Content exclusions
validator = excludes(patterns=["forbidden", "banned"])
result = validator.check("This text is clean.", context)
```
### Composite Validators
Combine multiple checks with logical operators.
```python
from veritext.validators import all_of, any_of, bleu, length, rouge
# All checks must pass
validator = all_of([
    bleu(min_score=0.5),
    rouge(min_score=0.6),
    length(max_chars=500),
])
# At least one check must pass
validator = any_of([
    bleu(min_score=0.7),
    rouge(min_score=0.7),
])
```
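The combination logic itself is small. The sketch below illustrates it with plain boolean callables rather than Veritext's `Check` objects; `CompositeOutcome`, `all_of_sketch` and `any_of_sketch` are hypothetical names, and the real validators also take a `ValidationContext`.

```python
from dataclasses import dataclass
from typing import Callable

CheckFn = Callable[[str], bool]  # hypothetical stand-in for a validator


@dataclass
class CompositeOutcome:
    passed: bool
    child_results: list[bool]  # one entry per check, in order


def _combine(checks: list[CheckFn], reducer: Callable[[list[bool]], bool]):
    if not checks:
        raise ValueError("checks list cannot be empty")
    checks = list(checks)  # defensive copy: later mutation by the caller has no effect

    def run(text: str) -> CompositeOutcome:
        results = [check(text) for check in checks]
        return CompositeOutcome(passed=reducer(results), child_results=results)

    return run


def all_of_sketch(checks: list[CheckFn]):
    """Passes only if every check passes."""
    return _combine(checks, all)


def any_of_sketch(checks: list[CheckFn]):
    """Passes if at least one check passes."""
    return _combine(checks, any)
```

Every check is run even when the outcome is already decided, so the per-check results stay available for error reporting.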
## Pytest Plugin
Veritext provides native pytest integration for testing text quality.
### Basic Usage
```python
from veritext.pytest_plugin import validate_text


def test_response_quality():
    response = "This is a helpful response to your question."
    validate_text(
        response,
        min_length=20,
        max_length=200,
        max_reading_grade=10.0,
        must_contain=["helpful"],
        must_exclude=["error", "sorry"],
    )


def test_summary_similarity():
    summary = "Scientists discovered a new planet."
    reference = "Researchers found a new planet in our solar system."
    validate_text(
        summary,
        reference=reference,
        min_rouge=0.5,
        min_length=10,
    )
```
### Available Parameters
| Parameter | Description |
|-----------|-------------|
| `reference` | Reference text for comparison metrics |
| `min_bleu` | Minimum BLEU-4 score (0.0-1.0) |
| `min_rouge` | Minimum ROUGE-L F1 score (0.0-1.0) |
| `min_semantic` | Minimum semantic similarity (0.0-1.0) |
| `min_length` | Minimum character count |
| `max_length` | Maximum character count |
| `max_reading_grade` | Maximum Flesch-Kincaid grade level |
| `must_contain` | List of required patterns |
| `must_exclude` | List of forbidden patterns |
## Benchmarking
Track text quality over time and detect regressions.
### Running Benchmarks
```python
from veritext.benchmark import Benchmark
# Create a benchmark suite
bench = Benchmark("summariser_quality", storage_path="benchmarks/")
# Evaluate a batch of outputs
candidates = ["Summary 1...", "Summary 2...", "Summary 3..."]
references = ["Reference 1...", "Reference 2...", "Reference 3..."]
run = bench.evaluate(
    candidates=candidates,
    references=references,
    metrics=["rouge_l", "bleu4"],
    metadata={"model": "v1.2", "git_sha": "abc123"},
)
print(f"ROUGE-L: {run.metrics['rouge_l']:.4f}")
print(f"BLEU-4: {run.metrics['bleu4']:.4f}")
```
### Regression Detection
```python
from veritext.benchmark import Benchmark
from veritext.core.exceptions import RegressionDetectedError
bench = Benchmark("summariser_quality")
# Check for regression against historical baseline
report = bench.check_regression(tolerance=0.05, window=10)
if report.detected:
    print("Quality regression detected!")
    for metric, delta in report.deltas.items():
        print(f" {metric}: {delta:+.4f}")
# Or raise an exception for CI integration
try:
    bench.assert_no_regression(tolerance=0.05)
except RegressionDetectedError as e:
    print(f"CI failure: {e}")
    raise SystemExit(1)
```
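One plausible reading of `tolerance` and `window` is sketched below. It is an assumption, not the library's documented behaviour: the baseline is taken as the mean of the previous `window` runs, and a regression is flagged when the latest score falls more than `tolerance` below it. The function name `detect_regression_sketch` is hypothetical.

```python
from statistics import mean


def detect_regression_sketch(
    history: list[float], tolerance: float = 0.05, window: int = 10
) -> tuple[bool, float]:
    """Return (detected, delta) for the newest score versus a rolling baseline."""
    if len(history) < 2:
        return False, 0.0  # nothing to compare against yet
    latest = history[-1]
    baseline = mean(history[-(window + 1):-1])  # up to `window` earlier runs
    delta = latest - baseline
    return delta < -tolerance, delta


detected, delta = detect_regression_sketch([0.60, 0.62, 0.61, 0.50])
print(detected, f"{delta:+.4f}")
```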
### Viewing History
```python
from veritext.benchmark import Benchmark
bench = Benchmark("summariser_quality")
for run in bench.get_history(limit=10):
    print(f"{run.timestamp}: rouge_l={run.metrics.get('rouge_l', 0):.4f}")
```
## CLI
Veritext provides a command-line interface for validation and benchmarking.
### Validate Text
```bash
# Inline validation
veritext validate "Candidate text" -r "Reference text" -m bleu,rouge
# File-based batch validation (JSONL with "candidate" and "reference" fields)
veritext validate -f outputs.jsonl -m bleu,rouge,lexical
# With threshold for pass/fail
veritext validate "Text" -r "Reference" -m bleu -t 0.5 -o simple
# Output formats: table (default), json, simple
veritext validate "Text" -r "Reference" -m bleu -o json
```
### Benchmark Commands
```bash
# Run a benchmark evaluation
veritext benchmark run my_bench -f data.jsonl -m rouge_l,bleu4
# View benchmark history
veritext benchmark show my_bench --last 10
# Check for regression (exits with code 1 if detected)
veritext benchmark check my_bench --tolerance 0.05 --window 10
```
### JSONL Format
For file-based operations, use JSONL with `candidate` and `reference` fields:
```json
{"candidate": "Model output 1", "reference": "Expected output 1"}
{"candidate": "Model output 2", "reference": "Expected output 2"}
```
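If you need the same data in Python, for example to pass to `Benchmark.evaluate`, a file in this format can be split into parallel lists with the standard library alone. This is a minimal sketch; `parse_pairs` is a hypothetical helper, not part of Veritext.

```python
import json
from collections.abc import Iterable


def parse_pairs(lines: Iterable[str]) -> tuple[list[str], list[str]]:
    """Split JSONL records into parallel candidate and reference lists."""
    candidates: list[str] = []
    references: list[str] = []
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = {"candidate", "reference"} - set(record)
        if missing:
            raise ValueError(f"line {line_number}: missing fields {sorted(missing)}")
        candidates.append(record["candidate"])
        references.append(record["reference"])
    return candidates, references
```

An open file handle is an iterable of lines, so `parse_pairs(handle)` works directly inside a `with open("outputs.jsonl", encoding="utf-8") as handle:` block.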
## Configuration
Veritext uses environment variables for configuration:
| Variable | Default | Description |
|----------|---------|-------------|
| `VERITEXT_LOG_LEVEL` | `INFO` | Logging level |
| `VERITEXT_LOG_FORMAT` | `console` | Log format (`console` or `json`) |
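
Veritext itself loads these values through pydantic-settings. The standard-library sketch below only illustrates the lookup-with-default behaviour the table describes; `read_logging_config` is a hypothetical helper, and the upper/lower-casing is an assumption of the sketch.

```python
import os
from collections.abc import Mapping


def read_logging_config(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Resolve the two logging variables, falling back to the documented defaults."""
    log_level = env.get("VERITEXT_LOG_LEVEL", "INFO").upper()
    log_format = env.get("VERITEXT_LOG_FORMAT", "console").lower()
    if log_format not in ("console", "json"):
        raise ValueError(f"unsupported log format: {log_format!r}")
    return {"log_level": log_level, "log_format": log_format}


print(read_logging_config({"VERITEXT_LOG_LEVEL": "debug"}))
```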
## Development
### Setup
```bash
git clone https://gitea.kschappell.com/kschappell/veritext.git
cd veritext
uv sync --all-extras
```
### Quality Checks
```bash
# Linting
uv run ruff check .
# Formatting
uv run ruff format --check .
# Type checking
uv run mypy src/
# Tests
uv run pytest
```
### Running Examples
```bash
uv run python examples/basic_validation.py
uv run pytest examples/chatbot_testing.py -v
uv run python examples/benchmark_regression.py
```
## Licence


@@ -11,11 +11,91 @@ from veritext.metrics.bleu import Bleu
 from veritext.metrics.lexical import Lexical
 from veritext.metrics.rouge import Rouge
-# Available metrics mapped to their computation functions
+# Available metrics
 AVAILABLE_METRICS = frozenset(
     {"bleu", "bleu1", "bleu2", "bleu3", "bleu4", "rouge", "rouge_l", "lexical"}
 )
+# Lazily-initialised metric instances
+_bleu: Bleu | None = None
+_rouge: Rouge | None = None
+_lexical: Lexical | None = None
+def _get_bleu() -> Bleu:
+    """Get or create the BLEU metric instance."""
+    global _bleu
+    if _bleu is None:
+        _bleu = Bleu()
+    return _bleu
+def _get_rouge() -> Rouge:
+    """Get or create the ROUGE metric instance."""
+    global _rouge
+    if _rouge is None:
+        _rouge = Rouge()
+    return _rouge
+def _get_lexical() -> Lexical:
+    """Get or create the lexical metric instance."""
+    global _lexical
+    if _lexical is None:
+        _lexical = Lexical()
+    return _lexical
+# Metric registry: maps metric names to (result_keys, single_extractor, batch_extractor)
+# - result_keys: output keys to populate
+# - single_extractor: function(candidate, reference) -> dict of results
+# - batch_extractor: function(candidates, references) -> dict of results
+def _bleu_single(candidate: str, reference: str, key: str) -> dict[str, float]:
+    """Extract a BLEU score for single mode."""
+    result = _get_bleu().score(candidate, reference)
+    return {key: getattr(result, key)}
+def _bleu_batch(
+    candidates: list[str], references: list[str], key: str
+) -> dict[str, float]:
+    """Extract a BLEU score for batch mode."""
+    batch = _get_bleu().batch_score(candidates, references)
+    stats = batch.stats.get(key)
+    return {key: stats.mean} if stats else {}
+def _rouge_single(candidate: str, reference: str) -> dict[str, float]:
+    """Extract ROUGE-L F-measure for single mode."""
+    result = _get_rouge().score(candidate, reference)
+    return {"rouge_l": result.rouge_l.fmeasure}
+def _rouge_batch(candidates: list[str], references: list[str]) -> dict[str, float]:
+    """Extract ROUGE-L F-measure for batch mode."""
+    batch = _get_rouge().batch_score(candidates, references)
+    stats = batch.stats.get("rouge_l_fmeasure")
+    return {"rouge_l": stats.mean} if stats else {}
+def _lexical_single(candidate: str, reference: str) -> dict[str, float]:
+    """Extract lexical scores for single mode."""
+    result = _get_lexical().score(candidate, reference)
+    return {"jaccard": result.jaccard, "token_overlap": result.token_overlap}
+def _lexical_batch(candidates: list[str], references: list[str]) -> dict[str, float]:
+    """Extract lexical scores for batch mode."""
+    batch = _get_lexical().batch_score(candidates, references)
+    results: dict[str, float] = {}
+    jaccard_stats = batch.stats.get("jaccard")
+    overlap_stats = batch.stats.get("token_overlap")
+    if jaccard_stats:
+        results["jaccard"] = jaccard_stats.mean
+    if overlap_stats:
+        results["token_overlap"] = overlap_stats.mean
+    return results
 def _compute_metrics(
     candidate: str,
@@ -24,30 +104,16 @@ def _compute_metrics(
 ) -> dict[str, float]:
     """Compute requested metrics for a single text pair."""
     results: dict[str, float] = {}
-    bleu = Bleu()
-    rouge = Rouge()
-    lexical = Lexical()
     for metric in metric_names:
-        if metric == "bleu" or metric == "bleu4":
-            bleu_result = bleu.score(candidate, reference)
-            results["bleu4"] = bleu_result.bleu4
-        elif metric == "bleu1":
-            bleu_result = bleu.score(candidate, reference)
-            results["bleu1"] = bleu_result.bleu1
-        elif metric == "bleu2":
-            bleu_result = bleu.score(candidate, reference)
-            results["bleu2"] = bleu_result.bleu2
-        elif metric == "bleu3":
-            bleu_result = bleu.score(candidate, reference)
-            results["bleu3"] = bleu_result.bleu3
-        elif metric == "rouge" or metric == "rouge_l":
-            rouge_result = rouge.score(candidate, reference)
-            results["rouge_l"] = rouge_result.rouge_l.fmeasure
+        if metric in ("bleu", "bleu4"):
+            results.update(_bleu_single(candidate, reference, "bleu4"))
+        elif metric in ("bleu1", "bleu2", "bleu3"):
+            results.update(_bleu_single(candidate, reference, metric))
+        elif metric in ("rouge", "rouge_l"):
+            results.update(_rouge_single(candidate, reference))
         elif metric == "lexical":
-            lexical_result = lexical.score(candidate, reference)
-            results["jaccard"] = lexical_result.jaccard
-            results["token_overlap"] = lexical_result.token_overlap
+            results.update(_lexical_single(candidate, reference))
     return results
@@ -58,46 +124,17 @@ def _compute_batch_metrics(
     metric_names: list[str],
 ) -> dict[str, float]:
     """Compute average metrics for a batch of text pairs."""
-    bleu = Bleu()
-    rouge = Rouge()
-    lexical = Lexical()
     results: dict[str, float] = {}
     for metric in metric_names:
-        if metric == "bleu" or metric == "bleu4":
-            bleu_batch = bleu.batch_score(candidates, references)
-            stats = bleu_batch.stats.get("bleu4")
-            if stats:
-                results["bleu4"] = stats.mean
-        elif metric == "bleu1":
-            bleu_batch = bleu.batch_score(candidates, references)
-            stats = bleu_batch.stats.get("bleu1")
-            if stats:
-                results["bleu1"] = stats.mean
-        elif metric == "bleu2":
-            bleu_batch = bleu.batch_score(candidates, references)
-            stats = bleu_batch.stats.get("bleu2")
-            if stats:
-                results["bleu2"] = stats.mean
-        elif metric == "bleu3":
-            bleu_batch = bleu.batch_score(candidates, references)
-            stats = bleu_batch.stats.get("bleu3")
-            if stats:
-                results["bleu3"] = stats.mean
-        elif metric == "rouge" or metric == "rouge_l":
-            rouge_batch = rouge.batch_score(candidates, references)
-            stats = rouge_batch.stats.get("rouge_l_fmeasure")
-            if stats:
-                results["rouge_l"] = stats.mean
+        if metric in ("bleu", "bleu4"):
+            results.update(_bleu_batch(candidates, references, "bleu4"))
+        elif metric in ("bleu1", "bleu2", "bleu3"):
+            results.update(_bleu_batch(candidates, references, metric))
+        elif metric in ("rouge", "rouge_l"):
+            results.update(_rouge_batch(candidates, references))
         elif metric == "lexical":
-            lexical_batch = lexical.batch_score(candidates, references)
-            jaccard_stats = lexical_batch.stats.get("jaccard")
-            overlap_stats = lexical_batch.stats.get("token_overlap")
-            if jaccard_stats:
-                results["jaccard"] = jaccard_stats.mean
-            if overlap_stats:
-                results["token_overlap"] = overlap_stats.mean
+            results.update(_lexical_batch(candidates, references))
     return results


@@ -1,5 +1,6 @@
 """Configuration management using pydantic-settings."""
+from functools import lru_cache
 from pathlib import Path
 from typing import Literal
@@ -54,6 +55,7 @@ class VeritextSettings(BaseSettings):
     )
+@lru_cache
 def get_settings() -> VeritextSettings:
-    """Get the current settings instance."""
+    """Get the cached settings instance."""
     return VeritextSettings()


@@ -137,8 +137,8 @@ class Readability:
                 flesch_reading_ease=0.0,
             )
-        # Count sentences
-        sentence_count = _count_sentences(candidate)
+        # Count sentences (ensure at least 1 to avoid division by zero)
+        sentence_count = max(_count_sentences(candidate), 1)
         # Count syllables
         syllable_count = sum(_count_syllables(word) for word in words)


@@ -40,6 +40,11 @@ class LexicalResult(BaseModel):
     token_overlap: float
     """Proportion of candidate tokens found in reference."""
+    @property
+    def score(self) -> float:
+        """Return Jaccard similarity as the primary score."""
+        return self.jaccard
 class RougeScore(BaseModel):
     """Individual ROUGE variant score with precision, recall, F-measure."""


@@ -107,9 +107,6 @@ def _compute_rouge_l(
     Returns:
         RougeScore with precision, recall, and F-measure.
     """
-    if not candidate_tokens and not reference_tokens:
-        return RougeScore(precision=0.0, recall=0.0, fmeasure=0.0)
     if not candidate_tokens or not reference_tokens:
         return RougeScore(precision=0.0, recall=0.0, fmeasure=0.0)
@@ -209,6 +206,10 @@ class Rouge:
             rouge2_scores.append(_compute_rouge_score(candidate_tokens, ref_tokens, 2))
             rouge_l_scores.append(_compute_rouge_l(candidate_tokens, ref_tokens))
+        # All references were empty after tokenisation
+        if not rouge1_scores:
+            raise ValueError("Reference text cannot be empty")
         return RougeResult(
             rouge1=_max_rouge_scores(rouge1_scores),
             rouge2=_max_rouge_scores(rouge2_scores),


@@ -1,11 +1,15 @@
 """Embedding-based semantic similarity using sentence-transformers."""
+from collections import OrderedDict
 from typing import Any
 from veritext.core.exceptions import DependencyError
 from veritext.metrics.base import AggregateStats, BatchResult
 from veritext.metrics.results import SemanticResult
+# Default maximum cache size (number of embeddings to store)
+DEFAULT_CACHE_MAX_SIZE = 1000
 class SemanticSimilarity:
     """
@@ -21,6 +25,7 @@ class SemanticSimilarity:
         self,
         model: str = "all-MiniLM-L6-v2",
         cache_embeddings: bool = True,
+        cache_max_size: int = DEFAULT_CACHE_MAX_SIZE,
     ) -> None:
         """
         Initialise the semantic similarity metric.
@@ -30,6 +35,8 @@ class SemanticSimilarity:
                 Defaults to "all-MiniLM-L6-v2" (22MB, good quality/size tradeoff).
             cache_embeddings: Whether to cache embeddings for repeated texts.
                 Defaults to True.
+            cache_max_size: Maximum number of embeddings to cache. Oldest entries
+                are evicted when the limit is reached. Defaults to 1000.
         Raises:
             DependencyError: If sentence-transformers is not installed.
@@ -44,7 +51,10 @@ class SemanticSimilarity:
         self._model_name = model
         self._model: Any = SentenceTransformer(model)
-        self._cache: dict[str, Any] | None = {} if cache_embeddings else None
+        self._cache: OrderedDict[str, Any] | None = (
+            OrderedDict() if cache_embeddings else None
+        )
+        self._cache_max_size = cache_max_size
     @property
     def name(self) -> str:
@@ -58,7 +68,7 @@ class SemanticSimilarity:
     def _get_embedding(self, text: str) -> Any:
         """
-        Get embedding for text, using cache if available.
+        Get embedding for text, using LRU cache if available.
         Args:
             text: The text to embed.
@@ -67,11 +77,16 @@ class SemanticSimilarity:
             The embedding tensor.
         """
         if self._cache is not None and text in self._cache:
+            # Move to end to mark as recently used
+            self._cache.move_to_end(text)
             return self._cache[text]
         embedding = self._model.encode(text, convert_to_tensor=True)
         if self._cache is not None:
+            # Evict oldest entries if cache is full
+            while len(self._cache) >= self._cache_max_size:
+                self._cache.popitem(last=False)
             self._cache[text] = embedding
         return embedding


@@ -1,11 +1,20 @@
-"""Composite validators for combining multiple checks."""
+"""Composite validators for combining multiple checks.
+Note: CompositeCheck classes (AllOf, AnyOf) intentionally return ValidationResult
+rather than CheckResult. This allows callers to inspect individual check results
+for detailed error reporting. They implement a compatible interface but are not
+substitutable where Check is expected as a type constraint.
+"""
 from veritext.core.types import CheckResult, ValidationContext, ValidationResult
 from veritext.validators.base import Check
 class AllOf:
-    """Passes only if all checks pass."""
+    """Passes only if all checks pass.
+    Note: Returns ValidationResult (not CheckResult) to expose child results.
+    """
     def __init__(self, checks: list[Check]) -> None:
         """
@@ -20,7 +29,7 @@ class AllOf:
         if not checks:
             raise ValueError("checks list cannot be empty")
-        self._checks = checks
+        self._checks = list(checks)
     @property
     def name(self) -> str:
@@ -48,7 +57,10 @@ class AllOf:
 class AnyOf:
-    """Passes if any check passes."""
+    """Passes if any check passes.
+    Note: Returns ValidationResult (not CheckResult) to expose child results.
+    """
     def __init__(self, checks: list[Check]) -> None:
         """
@@ -63,7 +75,7 @@ class AnyOf:
         if not checks:
             raise ValueError("checks list cannot be empty")
-        self._checks = checks
+        self._checks = list(checks)
     @property
     def name(self) -> str:


@@ -229,7 +229,7 @@ class ContainsValidator:
             case_sensitive: Whether matching is case-sensitive. Defaults to False.
         Raises:
-            InvalidThresholdError: If patterns list is empty.
+            InvalidThresholdError: If patterns list is empty or contains invalid regex.
         """
         if not patterns:
             raise InvalidThresholdError("patterns list cannot be empty")
@@ -238,6 +238,15 @@ class ContainsValidator:
         self._case_sensitive = case_sensitive
         self._flags = 0 if case_sensitive else re.IGNORECASE
+        self._compiled_patterns: list[re.Pattern[str]] = []
+        for pattern in patterns:
+            try:
+                self._compiled_patterns.append(re.compile(pattern, self._flags))
+            except re.error as e:
+                raise InvalidThresholdError(
+                    f"Invalid regex pattern '{pattern}': {e}"
+                ) from e
     @property
     def name(self) -> str:
         """Return the name of this check."""
@@ -255,8 +264,10 @@ class ContainsValidator:
             CheckResult with pass/fail status.
         """
         missing = []
-        for pattern in self._patterns:
-            if not re.search(pattern, text, self._flags):
+        for pattern, compiled in zip(
+            self._patterns, self._compiled_patterns, strict=True
+        ):
+            if not compiled.search(text):
                 missing.append(pattern)
         passed = len(missing) == 0
@@ -291,7 +302,7 @@ class ExcludesValidator:
             case_sensitive: Whether matching is case-sensitive. Defaults to False.
         Raises:
-            InvalidThresholdError: If patterns list is empty.
+            InvalidThresholdError: If patterns list is empty or contains invalid regex.
         """
         if not patterns:
             raise InvalidThresholdError("patterns list cannot be empty")
@@ -300,6 +311,15 @@ class ExcludesValidator:
         self._case_sensitive = case_sensitive
         self._flags = 0 if case_sensitive else re.IGNORECASE
+        self._compiled_patterns: list[re.Pattern[str]] = []
+        for pattern in patterns:
+            try:
+                self._compiled_patterns.append(re.compile(pattern, self._flags))
+            except re.error as e:
+                raise InvalidThresholdError(
+                    f"Invalid regex pattern '{pattern}': {e}"
+                ) from e
     @property
     def name(self) -> str:
         """Return the name of this check."""
@@ -317,8 +337,10 @@ class ExcludesValidator:
             CheckResult with pass/fail status.
         """
         found = []
-        for pattern in self._patterns:
-            if re.search(pattern, text, self._flags):
+        for pattern, compiled in zip(
+            self._patterns, self._compiled_patterns, strict=True
+        ):
+            if compiled.search(text):
                 found.append(pattern)
         passed = len(found) == 0


@@ -0,0 +1,73 @@
+"""Tests for configuration module."""
+from pathlib import Path
+import pytest
+from veritext.core.config import VeritextSettings, get_settings
+class TestVeritextSettings:
+    """Tests for VeritextSettings."""
+    def test_default_log_level(self) -> None:
+        """Test default log level is INFO."""
+        settings = VeritextSettings()
+        assert settings.log_level == "INFO"
+    def test_default_log_format(self) -> None:
+        """Test default log format is console."""
+        settings = VeritextSettings()
+        assert settings.log_format == "console"
+    def test_default_benchmark_path(self) -> None:
+        """Test default benchmark storage path."""
+        settings = VeritextSettings()
+        assert settings.benchmark_storage_path == Path("benchmarks")
+    def test_default_tokeniser_lowercase(self) -> None:
+        """Test default tokeniser lowercase setting."""
+        settings = VeritextSettings()
+        assert settings.tokeniser_lowercase is True
+    def test_default_tokeniser_remove_punctuation(self) -> None:
+        """Test default tokeniser remove punctuation setting."""
+        settings = VeritextSettings()
+        assert settings.tokeniser_remove_punctuation is True
+    def test_default_semantic_model(self) -> None:
+        """Test default semantic model name."""
+        settings = VeritextSettings()
+        assert settings.semantic_model == "all-MiniLM-L6-v2"
+    def test_default_semantic_cache_enabled(self) -> None:
+        """Test semantic cache is enabled by default."""
+        settings = VeritextSettings()
+        assert settings.semantic_cache_embeddings is True
+    def test_env_var_override(self, monkeypatch: pytest.MonkeyPatch) -> None:
+        """Test environment variable overrides default settings."""
+        monkeypatch.setenv("VERITEXT_LOG_LEVEL", "DEBUG")
+        settings = VeritextSettings()
+        assert settings.log_level == "DEBUG"
+    def test_env_var_override_log_format(self, monkeypatch: pytest.MonkeyPatch) -> None:
+        """Test environment variable overrides log format."""
+        monkeypatch.setenv("VERITEXT_LOG_FORMAT", "json")
+        settings = VeritextSettings()
+        assert settings.log_format == "json"
+class TestGetSettings:
+    """Tests for get_settings function."""
+    def test_get_settings_returns_instance(self) -> None:
+        """Test get_settings returns a VeritextSettings instance."""
+        settings = get_settings()
+        assert isinstance(settings, VeritextSettings)
+    def test_get_settings_returns_valid_defaults(self) -> None:
+        """Test get_settings returns instance with valid defaults."""
+        settings = get_settings()
+        assert settings.log_level in ("DEBUG", "INFO", "WARNING", "ERROR")
+        assert settings.log_format in ("console", "json")


@@ -0,0 +1,56 @@
+"""Tests for logging module."""
+from veritext.core.logging import configure_logging, get_logger
+class TestGetLogger:
+    """Tests for get_logger function."""
+    def test_get_logger_returns_logger(self) -> None:
+        """Test get_logger returns a logger instance."""
+        logger = get_logger()
+        assert logger is not None
+    def test_get_logger_default_name(self) -> None:
+        """Test get_logger uses 'veritext' as default name."""
+        logger = get_logger()
+        # The logger should be a bound logger from structlog
+        assert hasattr(logger, "info")
+        assert hasattr(logger, "debug")
+        assert hasattr(logger, "warning")
+        assert hasattr(logger, "error")
+    def test_get_logger_custom_name(self) -> None:
+        """Test get_logger respects custom name parameter."""
+        logger = get_logger("custom.module")
+        assert logger is not None
+        assert hasattr(logger, "info")
+class TestConfigureLogging:
+    """Tests for configure_logging function."""
+    def test_configure_logging_console_format(self) -> None:
+        """Test configure_logging with console format does not raise."""
+        configure_logging(level="INFO", log_format="console")
+        logger = get_logger()
+        assert logger is not None
+    def test_configure_logging_json_format(self) -> None:
+        """Test configure_logging with json format does not raise."""
+        configure_logging(level="DEBUG", log_format="json")
+        logger = get_logger()
+        assert logger is not None
+    def test_configure_logging_uses_defaults(self) -> None:
+        """Test configure_logging uses settings defaults when not provided."""
+        configure_logging()
+        logger = get_logger()
+        assert logger is not None
+    def test_configure_logging_different_levels(self) -> None:
+        """Test configure_logging accepts different log levels."""
+        for level in ("DEBUG", "INFO", "WARNING", "ERROR"):
+            configure_logging(level=level)
+            logger = get_logger()
+            assert logger is not None


@@ -5,12 +5,11 @@ import pytest
 @pytest.fixture
 def plugin_pytester(pytester: pytest.Pytester) -> pytest.Pytester:
-    """Configure pytester to use the veritext plugin."""
-    pytester.makeconftest(
-        """
-    pytest_plugins = ['veritext.pytest_plugin']
+    """Configure pytester to use the veritext plugin.
+
+    Note: The plugin is already loaded via the entry point in pyproject.toml,
+    so no explicit pytest_plugins declaration is needed.
     """
-    )
     return pytester


@@ -263,6 +263,11 @@ class TestContainsValidator:
         with pytest.raises(InvalidThresholdError, match="cannot be empty"):
             ContainsValidator(patterns=[])
+    def test_contains_validator_raises_on_invalid_regex(self) -> None:
+        """Test that invalid regex pattern raises error at init time."""
+        with pytest.raises(InvalidThresholdError, match="Invalid regex"):
+            ContainsValidator(patterns=[r"[invalid"])
     def test_contains_factory_function(self) -> None:
         """Test the contains() factory function."""
         validator = contains(patterns=["test"], case_sensitive=True)
@@ -327,6 +332,11 @@ class TestExcludesValidator:
         with pytest.raises(InvalidThresholdError, match="cannot be empty"):
             ExcludesValidator(patterns=[])
+    def test_excludes_validator_raises_on_invalid_regex(self) -> None:
+        """Test that invalid regex pattern raises error at init time."""
+        with pytest.raises(InvalidThresholdError, match="Invalid regex"):
+            ExcludesValidator(patterns=[r"[invalid"])
     def test_excludes_factory_function(self) -> None:
         """Test the excludes() factory function."""
         validator = excludes(patterns=["test"], case_sensitive=True)