# Evaluation Guide
Complete guide to running SF-Bench evaluations, from basic usage to advanced configurations.
## Prerequisites

Before running evaluations, ensure you have:

### Required Setup
- **API Key** - Set the environment variable for your provider:

  ```bash
  # RouteLLM (for Grok 4, GPT-5, Claude Opus 4)
  export ROUTELLM_API_KEY="your-key"

  # OpenRouter (for Claude, GPT-4, Llama)
  export OPENROUTER_API_KEY="your-key"

  # Google Gemini
  export GOOGLE_API_KEY="your-key"

  # Anthropic (direct)
  export ANTHROPIC_API_KEY="your-key"

  # OpenAI (direct)
  export OPENAI_API_KEY="your-key"
  ```

- **DevHub Authentication**:

  ```bash
  sf org login web --alias DevHub --set-default-dev-hub
  ```

- **Scratch Org Capacity**:

  - Minimum: 1 scratch org (with `--max-workers 1`)
  - Recommended: 2-3 scratch orgs (with `--max-workers 2-3`)
  - Maximum: 5 scratch orgs (with `--max-workers 5`)
  - Note: Each worker needs its own scratch org. For 12 tasks, you'll create 12 orgs in total (sequentially or in parallel, depending on the worker count).
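If you want to sanity-check the setup before launching a run, a minimal script like the sketch below can catch a missing key or unauthenticated CLI early. It is not part of SF-Bench; it only assumes the environment-variable names listed above and a working `sf` CLI on your PATH.

```python
import os
import subprocess

# Any one of the provider keys listed above is enough for its provider.
PROVIDER_KEYS = [
    "ROUTELLM_API_KEY", "OPENROUTER_API_KEY", "GOOGLE_API_KEY",
    "ANTHROPIC_API_KEY", "OPENAI_API_KEY",
]

found = [k for k in PROVIDER_KEYS if os.environ.get(k)]
print("Provider keys set:", ", ".join(found) or "none")

# Confirm the sf CLI is installed and at least one org is authenticated.
result = subprocess.run(["sf", "org", "list"], capture_output=True, text=True)
if result.returncode != 0:
    print("sf CLI check failed:", result.stderr.strip())
else:
    print("sf CLI reachable; look for your DevHub alias in the listing below:")
    print(result.stdout)
```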
### Resource Estimates

**Full Evaluation (12 tasks, `--functional`):**

- Scratch Orgs: 1-5 concurrent orgs (based on `--max-workers`)
- Token Usage: ~96,000 tokens (~0.1M tokens)
  - Per task: ~8,000 tokens (input + output + context)
- Time: 1-2 hours
- Cost: $0.10-$2 (varies by model)

**Lite Evaluation (5 tasks):**

- Scratch Orgs: 1-3 concurrent orgs
- Token Usage: ~40,000 tokens
- Time: ~10-15 minutes
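The token figure is simply tasks × per-task usage (12 × ~8,000 ≈ 96,000). If you want a rough cost estimate for your own model, a back-of-the-envelope calculation like the following works; the per-million-token price is a placeholder, so substitute your provider's actual rate.

```python
# Rough cost estimate: replace the price with your provider's actual rate.
tasks = 12
tokens_per_task = 8_000      # approximate input + output + context
price_per_million = 3.00     # USD per 1M tokens; placeholder, varies by model

total_tokens = tasks * tokens_per_task
estimated_cost = total_tokens / 1_000_000 * price_per_million
print(f"~{total_tokens:,} tokens, ~${estimated_cost:.2f} at ${price_per_million}/1M tokens")
```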
## Quick Start

```bash
# Basic evaluation
python scripts/evaluate.py \
  --model "anthropic/claude-3.5-sonnet" \
  --tasks data/tasks/verified.json

# With functional validation
python scripts/evaluate.py \
  --model "anthropic/claude-3.5-sonnet" \
  --tasks data/tasks/verified.json \
  --functional
```
## Command-Line Options

### Required Arguments

- `--model`: Model name or identifier
  - Examples: `"anthropic/claude-3.5-sonnet"`, `"gemini-2.5-flash"`, `"gpt-4"`
- `--tasks`: Path to tasks JSON file
  - Examples: `data/tasks/verified.json`, `data/tasks/lite.json`
### Optional Arguments

- `--functional`: Enable functional validation (default: disabled)
  - Validates actual business outcomes, not just deployment
  - Adds 50% weight to functional tests
- `--max-workers`: Number of parallel workers (default: 3)
  - Higher is faster, but needs more scratch orgs
  - Recommended: 2-4 for most setups
- `--output`: Output directory (default: `results/<model-name>`)
  - Custom path for results
- `--provider`: Explicitly specify the AI provider
  - Options: `openrouter`, `routellm`, `gemini`, `anthropic`, `openai`, `ollama`
  - Auto-detected if not specified
- `--skip-devhub`: Skip the DevHub connectivity check
  - Use if the DevHub check fails but orgs still work
- `--skip-preflight`: Skip all pre-flight checks
  - Not recommended; may waste resources
- `--skip-llm-check`: Skip LLM format validation
  - Faster startup, but may fail later if the format is wrong
- `--interactive`: Enable interactive prompts for missing config
  - Helpful for first-time setup
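As a sketch of how these options combine, the snippet below drives one full run from Python, which can be handy when scripting several evaluations in a row. Only the flags come from the list above; the specific values are illustrative.

```python
import subprocess

# Illustrative invocation combining the options documented above.
cmd = [
    "python", "scripts/evaluate.py",
    "--model", "anthropic/claude-3.5-sonnet",
    "--tasks", "data/tasks/verified.json",
    "--functional",                      # score business outcomes, not just deployment
    "--max-workers", "2",                # requires 2 scratch orgs of capacity
    "--provider", "anthropic",           # optional; auto-detected if omitted
    "--output", "results/claude-3.5-sonnet-functional",
]
subprocess.run(cmd, check=True)
```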
## Evaluation Modes

### 1. Deployment-Only Validation (Default)

**What it checks:**

- ✅ Code deploys successfully
- ✅ No syntax errors
- ✅ Metadata is valid

**Use when:**

- Quick smoke test
- Testing the deployment pipeline
- Limited scratch org capacity

**Command:**

```bash
python scripts/evaluate.py \
  --model "your-model" \
  --tasks data/tasks/verified.json
```

**Time:** ~30-45 minutes for 12 tasks
### 2. Functional Validation (Recommended)

**What it checks:**

- ✅ Code deploys (10%)
- ✅ Unit tests pass (20%)
- ✅ Business outcome achieved (50%)
- ✅ Bulk operations work (10%)
- ✅ No manual tweaks needed (10%)

**Use when:**

- Realistic performance measurement
- Comparing models objectively
- Production readiness assessment

**Command:**

```bash
python scripts/evaluate.py \
  --model "your-model" \
  --tasks data/tasks/verified.json \
  --functional
```

**Time:** ~1-2 hours for 12 tasks
## Understanding Results

### Result Files

After an evaluation, you'll find:

```
results/<model-name>/
├── report.json          # Schema v2 report (machine-readable)
├── summary.md           # Human-readable summary
├── evaluation_*.json    # Legacy format (backward compatibility)
└── <task-id>.json       # Individual task results
```
### Reading the Report

**Schema v2 Report (`report.json`):**

```json
{
  "schema_version": "2.0",
  "run_id": "claude-sonnet-4.5-20251229",
  "model_name": "anthropic/claude-3.5-sonnet",
  "summary": {
    "total_instances": 12,
    "resolved_instances": 5,
    "resolve_rate": 41.67,
    "avg_score": 6.0
  },
  "instances": [...]
}
```
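If you post-process results, the summary block can be read directly. The sketch below relies only on the fields shown above; the report path assumes the default output layout, so adjust the model directory name to match your run.

```python
import json
from pathlib import Path

# Assumes the default output layout: results/<model-name>/report.json
report_path = Path("results") / "claude-3.5-sonnet" / "report.json"
report = json.loads(report_path.read_text())

summary = report["summary"]
print(f"Model:        {report['model_name']}")
print(f"Resolved:     {summary['resolved_instances']}/{summary['total_instances']}")
print(f"Resolve rate: {summary['resolve_rate']}%")
print(f"Avg score:    {summary['avg_score']}")
```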
**Markdown Summary (`summary.md`):**

- Overall results table
- Component breakdown
- Individual task details
- Performance metrics
## Scoring System

### Component Weights

| Component | Weight | What It Measures |
|---|---|---|
| Deployment | 10% | Code deploys without errors |
| Unit Tests | 20% | All tests pass, coverage ≥80% |
| Functional | 50% | Business outcome achieved |
| Bulk Operations | 10% | Handles 200+ records |
| No Manual Tweaks | 10% | Works in one shot |

**Total: 100 points**
### Success Criteria

A task is **RESOLVED** if all of the following hold (a worked example follows the list):

- ✅ Deployment: PASS
- ✅ Unit Tests: PASS
- ✅ Functional: PASS
- ✅ Total Score ≥ 80/100
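As a worked example of how the weights and criteria combine, the snippet below recomputes a total score from per-component pass/fail results. The component names and weights come from the table above; the particular outcome and the helper function are purely illustrative.

```python
# Weights from the table above (100 points total).
WEIGHTS = {
    "deployment": 10,
    "unit_tests": 20,
    "functional": 50,
    "bulk_operations": 10,
    "no_manual_tweaks": 10,
}

def total_score(results: dict) -> int:
    """Sum the weight of every component that passed."""
    return sum(WEIGHTS[name] for name, passed in results.items() if passed)

# Hypothetical task outcome: everything passed except bulk operations.
results = {
    "deployment": True,
    "unit_tests": True,
    "functional": True,
    "bulk_operations": False,
    "no_manual_tweaks": True,
}

score = total_score(results)
# RESOLVED requires deployment, unit tests, and functional to pass, plus score >= 80.
resolved = (results["deployment"] and results["unit_tests"]
            and results["functional"] and score >= 80)
print(f"score={score}/100, resolved={resolved}")  # score=90/100, resolved=True
```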
## Advanced Usage

### Using Pre-Generated Solutions

```bash
# Generate solutions first
python scripts/generate_solutions.py \
  --model "your-model" \
  --tasks data/tasks/verified.json \
  --output solutions/

# Then evaluate
python scripts/evaluate.py \
  --model "your-model" \
  --tasks data/tasks/verified.json \
  --solutions solutions/
```
### Parallel Evaluation

```bash
# Use 4 workers for faster evaluation
python scripts/evaluate.py \
  --model "your-model" \
  --tasks data/tasks/verified.json \
  --max-workers 4
```

**Note:** Each worker needs a scratch org. Ensure you have enough capacity.
### Custom Scratch Org

```bash
# Use an existing scratch org
python scripts/evaluate.py \
  --model "your-model" \
  --tasks data/tasks/verified.json \
  --scratch-org-alias "my-existing-org"
```
## Performance Tips

### Faster Evaluations

- Use fewer workers if scratch org creation is slow
- Skip the LLM format check (`--skip-llm-check`) if you've already tested the model
- Use deployment-only mode for quick tests
- Pre-generate solutions to avoid API rate limits

### Resource Management

- Monitor scratch org limits before starting (see the sketch below)
- Clean up old orgs regularly
- Use multiple DevHubs for parallel runs
- Check API quotas for your provider
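One way to check scratch org headroom before a run is to ask the DevHub for its limits. The sketch below shells out to the `sf` CLI and prints the scratch-org-related lines; it assumes the limits command is available in your CLI installation and that an org aliased `DevHub` is authenticated. (SF-Bench also ships its own inventory report, shown under Best Practices below.)

```python
import subprocess

# Query DevHub limits and surface the scratch-org entries.
result = subprocess.run(
    ["sf", "limits", "api", "display", "--target-org", "DevHub"],
    capture_output=True, text=True,
)
if result.returncode != 0:
    print("Could not read limits:", result.stderr.strip())
else:
    for line in result.stdout.splitlines():
        if "ScratchOrg" in line:  # e.g. ActiveScratchOrgs, DailyScratchOrgs
            print(line)
```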
## Troubleshooting

### Common Issues

**"All tasks failed with ERROR"**

- Check the pre-flight check output
- Verify DevHub authentication
- Check scratch org limits

**"Corrupt patch errors"**

- Some models generate invalid patches
- Try a different model or provider
- See the Troubleshooting Guide for details

**"Scratch org creation timeout"**

- Check network connectivity
- Verify DevHub limits
- Try with fewer workers

See the Troubleshooting Guide for more solutions.
## Best Practices

### 1. Always Use Pre-Flight Checks

Pre-flight checks run automatically and catch issues before you waste resources.

### 2. Start with the Lite Dataset

```bash
# Test with 5 tasks first
python scripts/evaluate.py \
  --model "your-model" \
  --tasks data/tasks/lite.json
```

### 3. Use Functional Validation for Real Results

Deployment-only results can be misleading; add `--functional` to measure real capability.

### 4. Monitor Resources

```bash
# Check scratch org capacity
python -c "from sfbench.utils.inventory import ScratchOrgInventory; ScratchOrgInventory().print_inventory_report()"
```
## Next Steps

*Last updated: December 2025*