commit 72268ff440c2b4993f7248a69f23eee572e2e458
Author: Kai Chappell
Date:   Sat Mar 8 15:18:11 2025 +0000

    initial project scaffold

diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..1c64721
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,68 @@
+# Python
+__pycache__/
+*.py[cod]
+*$py.class
+*.so
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+.venv/
+venv/
+ENV/
+
+# Node
+node_modules/
+npm-debug.log*
+yarn-debug.log*
+yarn-error.log*
+.npm
+
+# Build outputs
+*.js.map
+.next/
+out/
+
+# IDE
+.idea/
+.vscode/
+*.swp
+*.swo
+*~
+
+# Environment
+.env
+.env.local
+.env.*.local
+
+# Testing
+.coverage
+htmlcov/
+.pytest_cache/
+.mypy_cache/
+.ruff_cache/
+coverage/
+
+# Database
+*.db
+*.sqlite3
+
+# OS
+.DS_Store
+Thumbs.db
+
+# Logs
+*.log
+logs/
diff --git a/changelog.md b/changelog.md
new file mode 100644
index 0000000..2a5306a
--- /dev/null
+++ b/changelog.md
@@ -0,0 +1,224 @@
+# Changelog
+
+All notable changes to Arbiter will be documented in this file.
+
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+## [Unreleased]
+
+## [0.5.0] - 2025-06-16
+
+### Added
+
+- Conversational follow-up system for PR comment Q&A
+  - Question detection in PR comments with confidence scoring
+  - Agent routing based on keywords, finding references, and context
+  - Agent explain() method for providing detailed follow-up explanations
+  - Conversation storage in database (ConversationModel, ConversationMessageModel)
+  - Webhook handling for GitHub issue_comment and GitLab note events
+  - Worker task process_followup for async question processing
+  - REST API endpoints for conversations
+    - `GET /api/conversations` - List conversations with filtering
+    - `GET /api/conversations/{id}` - Conversation detail with messages
+    - `GET /api/conversations/review/{id}` - Conversation for specific review
+  - Explain prompt templates for each agent (security, style, complexity)
+  - Configuration settings for follow-up behaviour
+    - followup_enabled - Enable/disable follow-up processing
+    - followup_confidence_threshold - Minimum confidence to respond
+    - followup_max_tokens_per_response - Token limit for responses
+- React dashboard for exploring reviews and monitoring metrics
+  - Review list page with filtering (repository, status, verdict, author) and pagination
+  - Review detail page with findings grouped by severity and expandable cards
+  - Deliberation timeline showing step-by-step decision process
+  - Metrics dashboard with charts for verdicts, severities, reviews over time
+  - VerdictBadge and SeverityBadge components with color-coded indicators
+  - TanStack Query for data fetching and caching
+  - Tailwind CSS for styling with responsive layouts
+  - Docker configuration with nginx for production builds
+- `GET /api/reviews/metrics` endpoint for aggregate statistics
+  - Total/completed review counts
+  - Average cost per review
+  - Verdict and severity distribution
+  - Reviews over last 30 days
+  - Cost breakdown by agent
+- Idempotent comment updates for re-reviews
+  - Comment model for representing PR/MR comments
+  - `get_comments()` method on platform clients to list PR comments
+  - `update_comment()` method to edit existing comments
+  - Arbiter marker (``) embedded in review comments
+  - Re-reviews now update existing Arbiter comment instead of posting new ones
+  - Graceful fallback: posts new comment if fetching comments fails
+- Documentation
+  - Dashboard README with component architecture and usage
+  - Deployment guide with Docker, production, and scaling guidance
+  - API reference with endpoint documentation and webhook schemas
+  - Environment variable reference (.env.example)
+  - Troubleshooting section in main README
+
+### Changed
+
+- Increased test coverage from 74% to 86%+
+  - LiteLLMClient initialization and complete method tests
+  - Static analysis runner tests with mocked subprocess execution
+  - API dependency injection tests (get_db, get_redis, close_redis)
+  - Extended test fixtures for platform client mocking
+- Version bumped to 0.5.0
+
+## [0.4.0] - 2025-04-13
+
+### Added
+
+- GitHub integration client for fetching diffs and posting review comments
+  - Fetch PR diffs via GitHub API
+  - Post review comments on pull requests
+  - Update commit status checks (pending/success/failure/error)
+  - Automatic retry with exponential backoff on transient errors
+  - Rate limit monitoring with warnings when limits are low
+- GitLab integration client with equivalent functionality
+  - Fetch MR diffs via GitLab API
+  - Post notes on merge requests
+  - Update commit pipeline status
+  - URL-encoded project path handling
+- Review comment formatter for Markdown output
+  - Verdict header with icon (approve/changes/comment)
+  - Summary statistics table by severity
+  - Findings grouped by severity level
+  - Conflicts section for agent disagreements
+  - Optional cost/tokens footer
+  - Automatic truncation to stay within GitHub's 65535 char limit
+- Platform integration exceptions
+  - AuthenticationError for 401/403 responses
+  - RateLimitError for 429 responses with retry_after
+  - NotFoundError for 404 responses
+  - PlatformError for 5xx server errors
+- Extended configuration settings
+  - github_token and gitlab_token for API authentication
+  - github_base_url and gitlab_base_url for enterprise instances
+  - integration_timeout and integration_max_retries
+  - status_check_context for commit status naming
+  - post_comments and update_status feature flags
+- Worker integration with platform clients
+  - Automatic diff fetching when not provided
+  - Commit status set to pending on job start
+  - Review comment posted on completion
+  - Status updated to success/failure based on verdict
+  - Graceful degradation on platform API failures
+
+### Changed
+
+- Worker process_review now accepts platform parameter
+- Webhook routes pass platform identifier to queue
+- Version bumped to 0.4.0
+
+### Dependencies
+
+- Added httpx>=0.26.0 to main dependencies
+- Added tenacity>=8.2.0 for retry logic
+- Added respx>=0.21.0 to dev dependencies for HTTP mocking
+
+## [0.3.0] - 2025-03-29
+
+### Added
+
+- FastAPI application with lifespan management, CORS, and exception handlers
+- PostgreSQL database schema with SQLAlchemy async ORM
+  - ReviewModel for storing PR review metadata and results
+  - FindingModel for individual agent findings
+  - ConflictModel for detected conflicts between agents
+  - DeliberationStepModel for audit trail
+  - PolicyModel for organisation configurations
+- Alembic migrations with async PostgreSQL support
+- Redis-backed arq worker for async job processing
+  - Review task with full pipeline execution
+  - Job deduplication by repository/PR/commit
+  - Priority queuing (draft PRs get lower priority)
+- REST API endpoints
+  - `GET /api/reviews` - List reviews with pagination and filtering
+  - `GET /api/reviews/{id}` - Review detail with findings
+  - `GET /api/reviews/{id}/deliberation` - Deliberation log
+  - `POST /api/reviews` - Trigger manual review
+- Webhook routes for GitHub and GitLab
+  - `POST /webhooks/github` - GitHub PR events with HMAC-SHA256 signature validation
+  - `POST /webhooks/gitlab` - GitLab MR events with token validation
+- Health and metrics endpoints
+  - `GET /health` - Liveness check
+  - `GET /health/ready` - Readiness check (database, Redis)
+  - `GET /health/live` - Kubernetes liveness probe
+  - `GET /metrics` - Prometheus metrics
+- LLM response cache with Redis backend
+  - Cache key based on diff, agent, prompt version, and policy
+  - Configurable TTL (default 24 hours)
+  - Cache statistics tracking
+- Cost tracking models
+  - ReviewCost for aggregate cost tracking
+  - AgentCost for per-agent breakdown
+  - CostEstimate for pre-review cost estimation
+- Docker configuration
+  - Multi-stage Dockerfile for API and worker
+  - docker-compose.yml with PostgreSQL, Redis, API, worker, and migrate services
+- Extended configuration settings
+  - Database connection (URL, pool size, max overflow)
+  - Redis connection (URL, max connections)
+  - Webhook secrets (GitHub HMAC, GitLab token)
+  - API settings (rate limits, CORS origins)
+  - Worker settings (max jobs, timeout, retries)
+
+### Changed
+
+- Configuration now uses `@lru_cache` for settings singleton
+- Version bumped to 0.3.0
+
+## [0.2.0] - 2025-03-15
+
+### Added
+
+- Diff parser with line mapping for unified diff format
+- Static analysis runners for ruff, mypy, bandit, and radon
+- Finding merger with deduplication and proximity-based grouping
+- Algorithmic conflict detection between agents
+- LLM synthesis for ambiguous semantic conflicts
+- Deliberation coordinator with merge, conflict detection, synthesis, and verdict determination
+- Verdict rules based on severity counts and configurable thresholds
+- CLI flags for static analysis (--static-analysis/--no-static-analysis) and work directory (--work-dir)
+- Test fixtures for conflict scenarios (security vs complexity, overlapping, contradictory)
+- Test suites for deliberation and static analysis (85% coverage)
+
+### Changed
+
+- CLI review command now outputs deliberation results with verdict, findings, conflicts, and resolution log
+- Output formats (rich, JSON, markdown) updated to display verdict and deliberation steps
+
+## [0.1.0] - 2025-03-09
+
+### Added
+
+- Initial project scaffold
+- Project plan with architecture documentation
+- Core data models: Finding, Policy, AgentConfig, ReviewResult, Severity, Verdict enums
+- LLM client abstraction with LiteLLM integration
+- Prompt template system with versioned templates and registry
+- Agent framework with base Agent class and ReviewContext
+- Security, Style, and Complexity review agents
+- CLI with `arbiter review` command supporting diff files and stdin
+- Output formats: rich (terminal), JSON, markdown
+- Policy configuration via YAML files
+- Model override via CLI flag
+- Test suite with 92% coverage
+
+### Changed
+
+- Consolidated to single Python service (removed separate Node.js webhook handler)
+- Reduced MVP scope to 3 agents: Security, Style, Complexity
+- Added static analysis pre-pass (ruff, mypy, bandit, radon)
+- Added cost management section with token budgets and model selection
+- Added error handling strategy with circuit breaker
+- Added observability section with Prometheus metrics
+- Added security section for prompt injection mitigation
+- Changed deliberation to use algorithmic conflict detection first
+- Specified typed DeliberationStep structure with documented metadata keys
+- Updated conflict examples to show real Security vs Complexity trade-offs
+- Added prompt_version field to Finding model for traceability
+- Moved caching and cost tracking to Phase 3 (requires database)
+- Clarified false positive evaluation methodology
+- Documented prompts versioning strategy (deployment-time config)
diff --git a/readme.md b/readme.md
new file mode 100644
index 0000000..4d19ae8
--- /dev/null
+++ b/readme.md
@@ -0,0 +1,278 @@
+# Arbiter
+
+A multi-agent code review system that shows its work.
+
+## What is this?
+
+Arbiter is a code review tool where specialised AI agents independently analyse pull
+requests, then deliberate to produce unified feedback. Unlike black-box AI reviewers,
+Arbiter exposes the reasoning process — you see how agents disagree, weigh trade-offs,
+and reach consensus.
+
+## Why?
+
+Current AI code review tools give you a verdict but hide their reasoning. When they
+flag something, you can't tell if it's a security expert's concern or a style nitpick.
+Arbiter surfaces the editorial board's discussion.
+
+## Features
+
+- **Static analysis pre-pass** — ruff, mypy, bandit, radon run first
+- **Specialised agents** — Security, Style, Complexity (LLM-powered)
+- **Transparent deliberation** — See how agents reason and resolve conflicts
+- **Configurable policies** — Adapt to your team's standards
+- **Cost controls** — Token budgets, model selection, response caching
+- **GitHub/GitLab integration** — Webhook-driven, posts comments to PRs
+
+## Architecture
+
+```
+GitHub/GitLab
+      │
+      │ Webhook (PR opened/updated)
+      ▼
+┌─────────────────────────────────────────────┐
+│             FastAPI Application             │
+│                                             │
+│  Webhook ──► Redis Queue ──► Worker         │
+│                                │            │
+│                                ▼            │
+│  ┌───────────────────────────────────────┐  │
+│  │          Review Orchestrator          │  │
+│  │                                       │  │
+│  │  1. Static analysis (ruff, mypy...)   │  │
+│  │  2. Agents in parallel                │  │
+│  │  3. Deliberation                      │  │
+│  │  4. Post results                      │  │
+│  │                                       │  │
+│  │  ┌──────────┐ ┌───────┐ ┌──────────┐  │  │
+│  │  │ Security │ │ Style │ │Complexity│  │  │
+│  │  └────┬─────┘ └───┬───┘ └────┬─────┘  │  │
+│  │       └───────────┼──────────┘        │  │
+│  │                   ▼                   │  │
+│  │            ┌─────────────┐            │  │
+│  │            │ Coordinator │            │  │
+│  │            └─────────────┘            │  │
+│  └───────────────────────────────────────┘  │
+└─────────────────────────────────────────────┘
+      │
+      ├──► PR Comment
+      ├──► Database (history)
+      └──► Metrics
+```
+
+## Tech Stack
+
+| Component | Technology |
+|-----------|------------|
+| Backend | Python 3.12, FastAPI |
+| Queue | Redis, arq |
+| Database | PostgreSQL |
+| LLM | LiteLLM (OpenAI, Anthropic, local) |
+| Static analysis | ruff, mypy, bandit, radon |
+
+## Quick Start
+
+```bash
+# Clone the repository
+git clone https://gitea.kschappell.com/kschappell/arbiter.git
+cd arbiter
+
+# Start infrastructure
+docker compose up -d db redis
+
+# Install dependencies
+pip install -e ".[dev]"
+
+# Run migrations
+alembic upgrade head
+
+# Start API server
+uvicorn src.arbiter.main:app --reload
+
+# Start worker (separate terminal)
+arq src.arbiter.worker.tasks.WorkerSettings
+```
+
+## CLI Usage
+
+Review a local diff without running the full server:
+
+```bash
+# Review a diff file
+arbiter review changes.diff --policy .arbiter/policy.yaml
+
+# Review staged changes
+git diff --cached | arbiter review - --policy .arbiter/policy.yaml
+```
+
+## Configuration
+
+Create `.arbiter/policy.yaml` in your repository:
+
+```yaml
+version: "1.0"
+
+static_analysis:
+  ruff:
+    enabled: true
+  mypy:
+    enabled: true
+  bandit:
+    enabled: true
+    severity_threshold: medium
+
+agents:
+  security:
+    enabled: true
+    model: "gpt-4o"
+    severity_threshold: medium
+
+  style:
+    enabled: true
+    model: "gpt-4o-mini"
+    config:
+      naming_convention: snake_case
+
+  complexity:
+    enabled: true
+    model: "gpt-4o-mini"
+    thresholds:
+      max_cyclomatic: 10
+
+deliberation:
+  conflict_resolution: security_first
+  minimum_confidence: 0.7
+
+cost_controls:
+  max_tokens_per_review: 50000
+  max_cost_per_review_usd: 0.50
+  cache_similar_diffs: true
+```
+
+## Example Output
+
+```markdown
+## Arbiter Review
+
+**Verdict:** Request changes (confidence: 92%)
+
+### Static Analysis
+- **bandit** B105: Possible hardcoded password (line 52)
+- **radon** CC: Function `process_data` has complexity 12 (threshold: 10)
+
+### Agent Findings
+
+🔒 **Security** (High)
+Line 47: Endpoint `/api/admin/export` has no authentication decorator.
+→ All admin endpoints should use `@require_admin` per project patterns.
+
+📐 **Style** (Low)
+Line 23: Function name `getData` doesn't match snake_case convention.
+
+### Deliberation
+
+All agents agree authentication is missing. Static analysis confirms
+hardcoded password on line 52. Both issues require resolution.
+```
+
+## Dashboard
+
+Arbiter includes a React dashboard for exploring reviews and monitoring metrics:
+
+- **Review List** — Browse all reviews with filtering by repository, status, verdict, and author
+- **Review Detail** — View findings grouped by severity with expandable cards
+- **Deliberation Explorer** — Step-by-step timeline of how agents reached their verdict
+- **Metrics** — Charts showing verdicts, severities, and review trends over time
+
+Start the dashboard:
+
+```bash
+cd dashboard
+npm install
+npm run dev
+```
+
+Access at `http://localhost:5173`. Configure the API URL via `VITE_API_URL` environment variable.
+
+## API Documentation
+
+The API server provides interactive documentation:
+
+- **Swagger UI** — `http://localhost:8000/docs`
+- **ReDoc** — `http://localhost:8000/redoc`
+- **OpenAPI Schema** — `http://localhost:8000/openapi.json`
+
+For detailed endpoint documentation, see [docs/api.md](docs/api.md).
+
+## Environment Variables
+
+Quick reference of key environment variables (prefix with `ARBITER_`):
+
+| Variable | Description | Default |
+|----------|-------------|---------|
+| `DATABASE_URL` | PostgreSQL connection URL | `postgresql+asyncpg://arbiter:arbiter@localhost:5432/arbiter` |
+| `REDIS_URL` | Redis connection URL | `redis://localhost:6379/0` |
+| `DEFAULT_MODEL` | LLM model for agents | `gpt-4o` |
+| `GITHUB_TOKEN` | GitHub API token | - |
+| `GITHUB_WEBHOOK_SECRET` | Webhook HMAC secret | - |
+| `GITLAB_TOKEN` | GitLab API token | - |
+| `GITLAB_WEBHOOK_TOKEN` | Webhook verification token | - |
+| `POST_COMMENTS` | Post review comments to PRs | `true` |
+| `UPDATE_STATUS` | Update commit status checks | `true` |
+
+See [.env.example](.env.example) for the complete list.
+
+## Deployment
+
+For production deployment instructions, see [docs/deployment.md](docs/deployment.md).
+
+## Troubleshooting
+
+### Worker not processing jobs
+
+Check that Redis is running and accessible:
+
+```bash
+redis-cli ping  # Should return PONG
+```
+
+Verify the worker is connected:
+
+```bash
+arq src.arbiter.worker.tasks.WorkerSettings --check
+```
+
+### Webhook not receiving events
+
+1. Verify the webhook URL is publicly accessible
+2. Check that webhook secrets match between GitHub/GitLab and your configuration
+3. Inspect webhook deliveries in GitHub/GitLab settings for error responses
+
+### LLM timeouts
+
+Increase the timeout and switch to a smaller model:
+
+```bash
+export ARBITER_LLM_TIMEOUT=120
+export ARBITER_DEFAULT_MODEL=gpt-4o-mini
+```
+
+### Database connection errors
+
+Ensure PostgreSQL is running and the connection URL is correct:
+
+```bash
+# psql only accepts postgresql:// URLs, so strip the +asyncpg driver suffix
+psql "${ARBITER_DATABASE_URL/+asyncpg/}" -c "SELECT 1"  # Test connection
+alembic upgrade head  # Run pending migrations
+```
+
+### Review not appearing in dashboard
+
+1. Check that the API server is running
+2. Verify CORS settings include your dashboard URL
+3. Check browser console for API errors
+
+## License
+
+MIT
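
Note on the "Webhook not receiving events" troubleshooting step: when checking that the GitHub webhook secret matches, it can help to recompute the signature locally. The sketch below is not taken from the Arbiter codebase. It assumes GitHub's documented scheme, in which the `X-Hub-Signature-256` header carries `sha256=` followed by the hex HMAC-SHA256 of the raw request body, keyed with the value configured as `ARBITER_GITHUB_WEBHOOK_SECRET`. The function name `verify_github_signature` is hypothetical.

```python
import hashlib
import hmac


def verify_github_signature(secret: str, body: bytes, signature_header: str) -> bool:
    """Return True if signature_header matches the HMAC-SHA256 of body.

    Hypothetical helper for illustration only; Arbiter's real handler may differ.
    """
    prefix = "sha256="
    if not signature_header.startswith(prefix):
        # Missing or unexpected algorithm prefix: treat as invalid.
        return False
    expected = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    received = signature_header[len(prefix):]
    # Constant-time comparison; bytes are used so non-ASCII input cannot raise.
    return hmac.compare_digest(expected.encode("utf-8"), received.encode("utf-8"))
```

Feeding this the raw payload and signature header shown for a delivery in the GitHub webhook settings should return `True` when the secrets match. The body must be the exact bytes received, since re-serialised JSON will generally not produce the same digest.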