# Veritext

Semantic text validation framework for Python.

Veritext validates text outputs against quality criteria using metrics like BLEU, ROUGE, and semantic similarity. Designed for developers building systems that produce text (chatbots, content generators, summarisation tools) who need automated quality assurance beyond simple string matching.

## Features

- **Multiple metrics**: BLEU, ROUGE, lexical similarity, readability, semantic embeddings
- **Composable validators**: Build complex checks from simple primitives
- **Native pytest integration**: `validate_text()` assertion for test suites
- **Quality benchmarking**: Track metrics over time with regression detection
- **CLI tools**: Command-line validation and benchmark management

## Installation

```bash
pip install veritext

# With semantic similarity support (sentence-transformers)
pip install veritext[semantic]
```

## Quick Start

```python
from veritext.core.types import ValidationContext
from veritext.validators import all_of, bleu, length, rouge

# Create a validator
validator = all_of([
    bleu(min_score=0.5),
    rouge(min_score=0.6),
    length(max_chars=500),
])

# Validate text
context = ValidationContext(reference="The quick brown fox jumps over the lazy dog.")
result = validator.check("A fast brown fox leaps over a sleepy dog.", context)

if result.passed:
    print("Validation passed!")
else:
    print(result.failure_summary)
```

## Metrics

Veritext provides several metrics for text evaluation.

### BLEU

Measures n-gram precision against reference text. Useful for translation and generation quality.

```python
from veritext.metrics import Bleu

bleu = Bleu()
result = bleu.score(
    candidate="The cat sat on the mat.",
    reference="A cat is sitting on the mat.",
)

print(f"BLEU-4: {result.bleu4:.3f}")  # Uses 1-4 gram precision
print(f"BLEU-1: {result.bleu1:.3f}")  # Unigram precision only
```

### ROUGE

Measures recall-oriented overlap with reference text. Useful for summarisation.

```python
from veritext.metrics import Rouge

rouge = Rouge()
result = rouge.score(
    candidate="Scientists found a new planet.",
    reference="Researchers discovered a new planet in the solar system.",
)

print(f"ROUGE-1 F1: {result.rouge1.fmeasure:.3f}")  # Unigram overlap
print(f"ROUGE-L F1: {result.rouge_l.fmeasure:.3f}")  # Longest common subsequence
```

### Lexical Similarity

Measures token overlap using Jaccard similarity.

```python
from veritext.metrics import Lexical

lexical = Lexical()
result = lexical.score(
    candidate="The quick brown fox",
    reference="The fast brown fox",
)

print(f"Jaccard: {result.jaccard:.3f}")
print(f"Token overlap: {result.token_overlap:.3f}")
```

### Readability

Computes Flesch-Kincaid scores for text complexity.

```python
from veritext.metrics import Readability

readability = Readability()
result = readability.score("This is a simple sentence.")

print(f"Grade level: {result.flesch_kincaid_grade:.1f}")
print(f"Reading ease: {result.flesch_reading_ease:.1f}")
```

### Semantic Similarity (Optional)

Requires `pip install veritext[semantic]`.

```python
from veritext.semantic import SemanticSimilarity

semantic = SemanticSimilarity()
result = semantic.score(
    candidate="The dog is running in the park.",
    reference="A canine is jogging through the garden.",
)

print(f"Similarity: {result.score:.3f}")
```

## Validators

Validators wrap metrics with thresholds to make pass/fail decisions.

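To make the "metric plus threshold" idea concrete, here is a minimal standalone sketch in plain Python. It is not Veritext's implementation: `jaccard`, `ThresholdResult` and `threshold_validator` are hypothetical names introduced for illustration, and the tokenisation (lowercased whitespace split) is an assumption.

```python
from dataclasses import dataclass
from typing import Callable


def jaccard(candidate: str, reference: str) -> float:
    """Jaccard similarity over lowercased whitespace tokens (illustrative metric)."""
    a = set(candidate.lower().split())
    b = set(reference.lower().split())
    if not a and not b:
        return 1.0  # two empty texts are treated as identical
    return len(a & b) / len(a | b)


@dataclass
class ThresholdResult:
    """Hypothetical result type: the decision plus the numbers behind it."""
    passed: bool
    score: float
    threshold: float


def threshold_validator(
    metric: Callable[[str, str], float], min_score: float
) -> Callable[[str, str], ThresholdResult]:
    """Wrap a metric function so that it returns a pass/fail decision."""

    def check(candidate: str, reference: str) -> ThresholdResult:
        score = metric(candidate, reference)
        return ThresholdResult(score >= min_score, score, min_score)

    return check


check = threshold_validator(jaccard, min_score=0.5)
result = check("The quick brown fox", "The fast brown fox")
print(result.passed, round(result.score, 3))  # True 0.6
```

Keeping the score and threshold on the result, rather than returning a bare boolean, is what lets a failure message explain how far a text fell short.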
### Metric-Based Validators

```python
from veritext.core.types import ValidationContext
from veritext.validators import bleu, lexical, rouge

context = ValidationContext(reference="Reference text here.")

# BLEU validation
validator = bleu(min_score=0.5, variant=4)  # BLEU-4
result = validator.check("Candidate text here.", context)

# ROUGE validation
validator = rouge(min_score=0.6, variant="l")  # ROUGE-L
result = validator.check("Candidate text here.", context)

# Lexical validation
validator = lexical(min_jaccard=0.3, min_overlap=0.5)
result = validator.check("Candidate text here.", context)
```

### Constraint Validators

These don't require reference text.

```python
from veritext.core.types import ValidationContext
from veritext.validators import contains, excludes, length, readability

context = ValidationContext()  # No reference needed

# Length constraints
validator = length(min_chars=50, max_chars=500, min_words=10)
result = validator.check("Your text here...", context)

# Readability constraints
validator = readability(max_grade=8.0, min_ease=60.0)
result = validator.check("Your text here...", context)

# Content requirements
validator = contains(patterns=["important", "keyword"])
result = validator.check("This important text has a keyword.", context)

# Content exclusions
validator = excludes(patterns=["forbidden", "banned"])
result = validator.check("This text is clean.", context)
```

### Composite Validators

Combine multiple checks with logical operators.

```python
from veritext.validators import all_of, any_of, bleu, length, rouge

# All checks must pass
validator = all_of([
    bleu(min_score=0.5),
    rouge(min_score=0.6),
    length(max_chars=500),
])

# At least one check must pass
validator = any_of([
    bleu(min_score=0.7),
    rouge(min_score=0.7),
])
```

## Pytest Plugin

Veritext provides native pytest integration for testing text quality.

### Basic Usage

```python
from veritext.pytest_plugin import validate_text


def test_response_quality():
    response = "This is a helpful response to your question."

    validate_text(
        response,
        min_length=20,
        max_length=200,
        max_reading_grade=10.0,
        must_contain=["helpful"],
        must_exclude=["error", "sorry"],
    )


def test_summary_similarity():
    summary = "Scientists discovered a new planet."
    reference = "Researchers found a new planet in our solar system."

    validate_text(
        summary,
        reference=reference,
        min_rouge=0.5,
        min_length=10,
    )
```

### Available Parameters

| Parameter | Description |
|-----------|-------------|
| `reference` | Reference text for comparison metrics |
| `min_bleu` | Minimum BLEU-4 score (0.0-1.0) |
| `min_rouge` | Minimum ROUGE-L F1 score (0.0-1.0) |
| `min_semantic` | Minimum semantic similarity (0.0-1.0) |
| `min_length` | Minimum character count |
| `max_length` | Maximum character count |
| `max_reading_grade` | Maximum Flesch-Kincaid grade level |
| `must_contain` | List of required patterns |
| `must_exclude` | List of forbidden patterns |

## Benchmarking

Track text quality over time and detect regressions.

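As a rough picture of what detecting a regression can mean, the sketch below compares the latest score with the mean of the most recent runs. This is an assumption-laden illustration, not Veritext's algorithm: `detect_regression` is a hypothetical helper, and although the `tolerance` and `window` names echo the parameters used in this README's regression examples, treating `tolerance` as an absolute drop against a windowed mean is a guess.

```python
from statistics import mean


def detect_regression(
    history: list[float],
    latest: float,
    tolerance: float = 0.05,
    window: int = 10,
) -> tuple[bool, float]:
    """Return (detected, delta), where delta = latest - baseline.

    Assumed semantics: the baseline is the mean of the last `window`
    historical scores, and a regression is a drop larger than `tolerance`.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    baseline_scores = history[-window:]
    if not baseline_scores:
        return False, 0.0  # nothing to compare against yet
    delta = latest - mean(baseline_scores)
    return delta < -tolerance, delta


history = [0.50, 0.50, 0.50, 0.50]
detected, delta = detect_regression(history, latest=0.25, tolerance=0.05, window=3)
print(detected, f"{delta:+.4f}")  # True -0.2500
```

Averaging over a window rather than comparing with only the previous run makes the check less sensitive to a single unusually good or bad run.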
### Running Benchmarks

```python
from veritext.benchmark import Benchmark

# Create a benchmark suite
bench = Benchmark("summariser_quality", storage_path="benchmarks/")

# Evaluate a batch of outputs
candidates = ["Summary 1...", "Summary 2...", "Summary 3..."]
references = ["Reference 1...", "Reference 2...", "Reference 3..."]

run = bench.evaluate(
    candidates=candidates,
    references=references,
    metrics=["rouge_l", "bleu4"],
    metadata={"model": "v1.2", "git_sha": "abc123"},
)

print(f"ROUGE-L: {run.metrics['rouge_l']:.4f}")
print(f"BLEU-4: {run.metrics['bleu4']:.4f}")
```

### Regression Detection

```python
import sys

from veritext.benchmark import Benchmark
from veritext.core.exceptions import RegressionDetectedError

bench = Benchmark("summariser_quality")

# Check for regression against historical baseline
report = bench.check_regression(tolerance=0.05, window=10)

if report.detected:
    print("Quality regression detected!")
    for metric, delta in report.deltas.items():
        print(f"  {metric}: {delta:+.4f}")

# Or raise an exception for CI integration
try:
    bench.assert_no_regression(tolerance=0.05)
except RegressionDetectedError as e:
    print(f"CI failure: {e}")
    sys.exit(1)  # Non-zero exit code fails the CI job
```

### Viewing History

```python
from veritext.benchmark import Benchmark

bench = Benchmark("summariser_quality")

for run in bench.get_history(limit=10):
    print(f"{run.timestamp}: rouge_l={run.metrics.get('rouge_l', 0):.4f}")
```

## CLI

Veritext provides a command-line interface for validation and benchmarking.

### Validate Text

```bash
# Inline validation
veritext validate "Candidate text" -r "Reference text" -m bleu,rouge

# File-based batch validation (JSONL with "candidate" and "reference" fields)
veritext validate -f outputs.jsonl -m bleu,rouge,lexical

# With threshold for pass/fail
veritext validate "Text" -r "Reference" -m bleu -t 0.5 -o simple

# Output formats: table (default), json, simple
veritext validate "Text" -r "Reference" -m bleu -o json
```

### Benchmark Commands

```bash
# Run a benchmark evaluation
veritext benchmark run my_bench -f data.jsonl -m rouge_l,bleu4

# View benchmark history
veritext benchmark show my_bench --last 10

# Check for regression (exits with code 1 if detected)
veritext benchmark check my_bench --tolerance 0.05 --window 10
```

### JSONL Format

For file-based operations, use JSONL with `candidate` and `reference` fields:

```json
{"candidate": "Model output 1", "reference": "Expected output 1"}
{"candidate": "Model output 2", "reference": "Expected output 2"}
```

## Configuration

Veritext uses environment variables for configuration:

| Variable | Default | Description |
|----------|---------|-------------|
| `VERITEXT_LOG_LEVEL` | `INFO` | Logging level |
| `VERITEXT_LOG_FORMAT` | `console` | Log format (`console` or `json`) |

## Development

### Setup

```bash
git clone https://gitea.kschappell.com/kschappell/veritext.git
cd veritext
uv sync --all-extras
```

### Quality Checks

```bash
# Linting
uv run ruff check .

# Formatting
uv run ruff format --check .

# Type checking
uv run mypy src/

# Tests
uv run pytest
```

### Running Examples

```bash
uv run python examples/basic_validation.py
uv run pytest examples/chatbot_testing.py -v
uv run python examples/benchmark_regression.py
```

## Licence

MIT