SF-Bench Leaderboard
Last updated: 2026-01-06 UTC
Status: SF-Bench now runs full evaluations with functional validation and weighted scoring (0-100 points). Not every run listed here used functional validation; the Validation Mode line under each detailed result states which mode applied.
Overall Rankings
| Rank | Model | Overall | Functional | LWC | Deploy | Apex | Flow | Lightning | Experience | Architecture |
|---|---|---|---|---|---|---|---|---|---|---|
| 🥇 | Claude Sonnet 4.5 | 41.67% | 6.0% | 100% | 100% | 100% | 0%* | 0% | 0% | 0% |
| 🥈 | Gemini 2.5 Flash | 25.0% | - | 100% | 100% | 0%* | 0%* | 0% | 0% | 0% |
| 🥉 | Grok 4.1 Fast | 4.0% | - | 0% | 4.0% | 14.3% | 0% | 0% | 0% | 0% |
* Flow tasks failed due to platform limitations (Flow package dependencies are not available in Developer Edition scratch orgs). This is a Salesforce platform constraint, not a tool issue. For the Gemini 2.5 Flash Apex result, the detailed results report scratch org creation issues.
Known Issues
Claude Opus 4.5 (RouteLLM) - 2025-12-29: 0% pass rate (0/12 tasks). All tasks failed at patch application. This run used the previous patch application system, so re-evaluation with improved patch handling is recommended.
Grok 4.1 Fast (RouteLLM) - 2026-01-06: 4.0% pass rate (1/25 tasks). The run completed with no tool errors (passed + failed = 100%, error = 0%). All 5 Flow tasks failed due to platform limitations (Flow package dependencies are not available in Developer Edition scratch orgs). Patch application failures are categorized as model issues (FAIL), not tool errors.
Note: Functional Score (0-100) is calculated using weighted validation: Deploy(10%) + Unit Tests(20%) + Functional(50%) + Bulk(10%) + No Tweaks(10%). See VALIDATION_METHODOLOGY.md for details.
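The weighting in the note above can be sketched as a small function. This is an illustration only, not the benchmark's implementation: the component keys and the `functional_score` name are hypothetical, and it assumes each component is scored all-or-nothing. VALIDATION_METHODOLOGY.md remains the authoritative description.

```python
from typing import Mapping

# Weights copied from the note above. The key names are illustrative,
# not identifiers taken from the SF-Bench code base.
WEIGHTS = {
    "deploy": 10,
    "unit_tests": 20,
    "functional": 50,
    "bulk": 10,
    "no_tweaks": 10,
}


def functional_score(passed: Mapping[str, bool]) -> int:
    """Return a 0-100 score by summing the weight of each passed component.

    Assumption: a component either earns its full weight or nothing.
    Components missing from `passed` count as failed.
    """
    return sum(weight for name, weight in WEIGHTS.items() if passed.get(name, False))


if __name__ == "__main__":
    # A solution that only deploys earns the Deploy weight.
    print(functional_score({"deploy": True}))
    # A solution that passes every component earns the full score.
    print(functional_score({name: True for name in WEIGHTS}))
```

Under this reading, a task cannot exceed 50 points unless the Functional component passes, which is why that component dominates the score.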
Detailed Results
Gemini 2.5 Flash (Run: 2025-12-28)
| Segment | Tasks | Passed | Pass Rate | Notes |
|---|---|---|---|---|
| LWC | 2 | 2 | ✅ 100% | Jest tests passed (local validation) |
| Deploy | 1 | 1 | ✅ 100% | Metadata deployment succeeded |
| Apex | 2 | 0 | ❌ 0% | Scratch org creation issues |
| Flow | 2 | 0 | ❌ 0% | Scratch org creation issues |
| Lightning Pages | 1 | 0 | ❌ 0% | Outcome validation failed |
| Page Layouts | 1 | 0 | ❌ 0% | Scratch org creation issues |
| Experience Cloud | 1 | 0 | ❌ 0% | Outcome validation failed |
| Architecture | 1 | 0 | ❌ 0% | Outcome validation failed |
| Agentforce | 1 | 0 | ❌ 0% | Scratch org creation issues |
| Total | 12 | 3 | 25.0% | Deployment-only validation |
Validation Mode: Deployment-only (functional validation pending systematic testing)
Claude Sonnet 4.5 (Run: 2025-12-28)
| Segment | Tasks | Passed | Pass Rate | Functional Score | Notes |
|---|---|---|---|---|---|
| LWC | 2 | 2 | ✅ 100% | 10.0% | Jest tests passed, bulk tests passed |
| Deploy | 1 | 1 | ✅ 100% | 10.0% | Metadata deployment succeeded |
| Apex | 2 | 2 | ✅ 100% | 0.0% | Deployment passed, functional tests failed |
| Flow | 2 | 0 | ❌ 0% | - | Scratch org creation failed |
| Lightning Pages | 1 | 0 | ❌ 0% | - | Outcome validation failed |
| Page Layouts | 1 | 0 | ❌ 0% | - | Deployment failed |
| Experience Cloud | 1 | 0 | ❌ 0% | - | Outcome validation failed |
| Architecture | 1 | 0 | ❌ 0% | - | Outcome validation failed |
| Agentforce | 1 | 0 | ❌ 0% | - | Outcome validation failed |
| Total | 12 | 5 | 41.67% | 6.0% | Functional validation enabled |
Validation Mode: Functional validation with weighted scoring (0-100 points)
Average Functional Score: 6.0% (out of 100)
- Deploy: 10% ✅
- Unit Tests: 20% ❌
- Functional: 50% ❌ (core requirement)
- Bulk: 10% ✅
- No Tweaks: 10% ❌
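The totals in the Claude Sonnet 4.5 table can be reproduced with a few lines of arithmetic. The sketch below assumes the average functional score is weighted by task count and covers only the segments that report a score (LWC, Deploy, Apex); that assumption is not stated in the source, but it does yield the reported 6.0. The function names are hypothetical.

```python
# (segment, tasks with a functional score, score per task), from the table above.
SCORED_SEGMENTS = [
    ("LWC", 2, 10.0),
    ("Deploy", 1, 10.0),
    ("Apex", 2, 0.0),
]


def average_functional_score(segments):
    """Task-weighted mean of the per-segment functional scores."""
    total_tasks = sum(tasks for _, tasks, _ in segments)
    total_points = sum(tasks * score for _, tasks, score in segments)
    return total_points / total_tasks


def pass_rate(passed, total):
    """Pass rate as a percentage, rounded to two decimals."""
    return round(100 * passed / total, 2)


if __name__ == "__main__":
    print(average_functional_score(SCORED_SEGMENTS))  # (2*10 + 1*10 + 2*0) / 5
    print(pass_rate(5, 12))  # 5 of 12 tasks passed
```

A plain mean over the three segments would give about 6.67 instead, so the task-weighted reading is the one consistent with the table.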
Current Status
SF-Bench now runs full evaluations with functional validation. All three testing stages are complete:
- ✅ Atomic Testing: Each component tested individually (completed)
- ✅ E2E Validation: Single model, single task end-to-end test (completed)
- ✅ Full Evaluation: Complete benchmark run with functional validation (completed)
Recent Evaluations
Grok 4.1 Fast (RouteLLM) - 2026-01-06
- Result: 4.0% pass rate (1/25 tasks passed, 24 failed, 0 errors)
- Robustness: ✅ PASS - All tasks evaluated (passed+failed=100%, error=0%)
- Status: Final evaluation with all reliability fixes applied:
  - ✅ Platform limitations correctly categorized as FAIL (not ERROR)
  - ✅ Patch application failures correctly categorized as model issues (FAIL)
  - ✅ All Flow tasks failed due to platform constraints (Flow package dependencies)
  - ✅ No tool errors - evaluation is robust and production-ready
- Breakdown: 1 Apex task passed (apex-platform-events-001), 24 tasks failed (19 patch failures, 5 platform limitations)
- Note: Re-evaluation recommended with improved error handling
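The robustness criterion used for this run (passed + failed = 100%, error = 0%) can be expressed as a short tally. This is a sketch under assumptions: the `PASS`/`FAIL`/`ERROR` labels follow the wording above, while the `summarize` function and its output keys are hypothetical rather than part of the SF-Bench tooling.

```python
from collections import Counter


def summarize(outcomes):
    """Tally task outcomes and check the robustness criterion.

    A run counts as robust here when no task ended in ERROR, so that
    passed + failed accounts for every task.
    """
    if not outcomes:
        raise ValueError("at least one task outcome is required")
    counts = Counter(outcomes)
    total = len(outcomes)
    passed, failed, errors = counts["PASS"], counts["FAIL"], counts["ERROR"]
    return {
        "pass_rate": round(100 * passed / total, 1),
        "error_rate": round(100 * errors / total, 1),
        "robust": errors == 0 and passed + failed == total,
    }


if __name__ == "__main__":
    # Breakdown reported above: 1 pass, 19 patch failures, 5 platform limitations.
    grok_run = ["PASS"] + ["FAIL"] * 19 + ["FAIL"] * 5
    print(summarize(grok_run))
```

Counting platform limitations and patch failures as `FAIL` rather than `ERROR` is what keeps the error rate at zero in this tally.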
Claude Opus 4.5 (RouteLLM) - 2025-12-29
- Result: 0% pass rate (0/12 tasks)
- Issue: Patch application failures with previous patch system
- Status: Re-evaluation recommended with improved multi-strategy patch application
Evaluation Methodology
SF-Bench uses a multi-level validation approach:
- Syntax Validation: Code compiles without errors
- Deployment Validation: Metadata deploys successfully
- Unit Test Validation: Apex unit tests pass
- Functional Validation: Actual business outcomes verified (bulk operations, negative cases)
- Production-Ready: Security, error handling, governor limits
See VALIDATION_METHODOLOGY.md for details.
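One way to picture the levels listed above is as an ordered pipeline. The sketch below assumes the levels run in the listed order and that evaluation stops at the first failing level; the source does not confirm that ordering or short-circuit behavior, and the level keys and `highest_level_reached` function are hypothetical.

```python
from typing import Callable, Dict, Optional

# Levels in the order they are listed above. Key names are illustrative.
LEVELS = ["syntax", "deployment", "unit_tests", "functional", "production_ready"]


def highest_level_reached(checks: Dict[str, Callable[[], bool]]) -> Optional[str]:
    """Run the checks in order and return the last level that passed.

    Assumption: a missing or failing check stops the pipeline, so later
    levels are never evaluated. Returns None if the first level fails.
    """
    reached = None
    for level in LEVELS:
        check = checks.get(level)
        if check is None or not check():
            break
        reached = level
    return reached


if __name__ == "__main__":
    # Example: code compiles and deploys, but unit tests fail.
    checks = {
        "syntax": lambda: True,
        "deployment": lambda: True,
        "unit_tests": lambda: False,
        "functional": lambda: True,
        "production_ready": lambda: True,
    }
    print(highest_level_reached(checks))
```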
How to Submit Results
- Run SF-Bench on your model
- Submit the results by opening an issue
- Results are verified and then added to the leaderboard
See CONTRIBUTING.md for details.