SF-Bench Leaderboard

Last updated: 2026-01-06 UTC

Status: SF-Bench now runs full evaluations with functional validation and weighted scoring (0-100 points). With functional validation enabled, results reflect realistic performance.

Overall Rankings

Rank	Model	Overall	Functional	LWC	Deploy	Apex	Flow	Lightning	Experience	Architecture
🥇	Claude Sonnet 4.5	41.67%	6.0%	100%	100%	100%	0%*	0%	0%	0%
🥈	Gemini 2.5 Flash	25.0%	-	100%	100%	0%*	0%*	0%	0%	0%
🥉	Grok 4.1 Fast	4.0%	-	0%	4.0%	14.3%	0%	0%	0%	0%

* Flow tasks failed due to platform limitations (Flow package dependencies not available in Developer Edition scratch orgs). This is a Salesforce platform constraint, not a tool issue.

Known Issues

Claude Opus 4.5 (RouteLLM) - 2025-12-29: 0% pass rate. All tasks failed at patch application. This run used the previous patch application system, so re-evaluation with the improved patch handling is recommended.

Grok 4.1 Fast (RouteLLM) - 2026-01-06: 4.0% pass rate (1/25 tasks). Every task reached a pass or fail verdict (passed+failed=100%, error=0%). All 5 Flow tasks failed due to platform limitations (Flow package dependencies not available in Developer Edition scratch orgs). Patch application failures are categorized as model issues (FAIL), not tool errors.

Note: Functional Score (0-100) is calculated using weighted validation: Deploy(10%) + Unit Tests(20%) + Functional(50%) + Bulk(10%) + No Tweaks(10%). See VALIDATION_METHODOLOGY.md for details.
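
The weighting in the note above can be sketched as follows. This is a minimal illustration, not SF-Bench's actual implementation: the weights come from the note, while the component keys, the `functional_score` name, and the assumption that each component is scored all-or-nothing are hypothetical.

```python
from typing import Dict

# Weights (in points) taken from the note above; the key names are illustrative.
WEIGHTS: Dict[str, int] = {
    "deploy": 10,
    "unit_tests": 20,
    "functional": 50,
    "bulk": 10,
    "no_tweaks": 10,
}


def functional_score(passed: Dict[str, bool]) -> int:
    """Sum the weights of every component that passed (assumed all-or-nothing)."""
    return sum(weight for name, weight in WEIGHTS.items() if passed.get(name, False))


# A task that deploys but passes nothing else earns only the Deploy weight.
deploy_only = {"deploy": True}
print(functional_score(deploy_only))  # 10
```

Under this reading, a task cannot exceed 50 points without passing the Functional component, which is why the note's weighting treats it as the core requirement.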

Detailed Results

Gemini 2.5 Flash (Run: 2025-12-28)

Segment	Tasks	Passed	Pass Rate	Notes
LWC	2	2	✅ 100%	Jest tests passed (local validation)
Deploy	1	1	✅ 100%	Metadata deployment succeeded
Apex	2	0	❌ 0%	Scratch org creation issues
Flow	2	0	❌ 0%	Scratch org creation issues
Lightning Pages	1	0	❌ 0%	Outcome validation failed
Page Layouts	1	0	❌ 0%	Scratch org creation issues
Experience Cloud	1	0	❌ 0%	Outcome validation failed
Architecture	1	0	❌ 0%	Outcome validation failed
Agentforce	1	0	❌ 0%	Scratch org creation issues
Total	12	3	25.0%	Deployment-only validation

Validation Mode: Deployment-only (functional validation pending systematic testing)

Claude Sonnet 4.5 (Run: 2025-12-28)

Segment	Tasks	Passed	Pass Rate	Functional Score	Notes
LWC	2	2	✅ 100%	10.0%	Jest tests passed, bulk tests passed
Deploy	1	1	✅ 100%	10.0%	Metadata deployment succeeded
Apex	2	2	✅ 100%	0.0%	Deployment passed, functional tests failed
Flow	2	0	❌ 0%	-	Scratch org creation failed
Lightning Pages	1	0	❌ 0%	-	Outcome validation failed
Page Layouts	1	0	❌ 0%	-	Deployment failed
Experience Cloud	1	0	❌ 0%	-	Outcome validation failed
Architecture	1	0	❌ 0%	-	Outcome validation failed
Agentforce	1	0	❌ 0%	-	Outcome validation failed
Total	12	5	41.67%	6.0%	Functional validation enabled

Validation Mode: Functional validation with weighted scoring (0-100 points)
Average Functional Score: 6.0% (out of 100)

Deploy: 10% ✅
Unit Tests: 20% ❌
Functional: 50% ❌ (core requirement)
Bulk: 10% ✅
No Tweaks: 10% ❌
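
The 6.0% average is consistent with a per-task mean over the five tasks that received a Functional Score in the table above (two LWC, one Deploy, two Apex). The sketch below reproduces that arithmetic; treating the average as task-weighted is an assumption inferred from the numbers, and `task_weighted_mean` is a hypothetical helper.

```python
from typing import List, Tuple

# (segment, number of tasks, functional score per task) from the Claude Sonnet 4.5 table.
# Segments shown with "-" in the table have no functional score and are left out.
SCORED_SEGMENTS: List[Tuple[str, int, float]] = [
    ("LWC", 2, 10.0),
    ("Deploy", 1, 10.0),
    ("Apex", 2, 0.0),
]


def task_weighted_mean(rows: List[Tuple[str, int, float]]) -> float:
    """Average the functional score over tasks, not over segments."""
    total_tasks = sum(count for _, count, _ in rows)
    total_points = sum(count * score for _, count, score in rows)
    return total_points / total_tasks


print(task_weighted_mean(SCORED_SEGMENTS))  # 6.0
```

Averaging over the three segments instead would give roughly 6.67, which does not match the reported figure.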

Current Status

SF-Bench now runs full evaluations with functional validation. All three testing stages are complete:

✅ Atomic Testing: Each component tested individually (completed)
✅ E2E Validation: Single model, single task end-to-end test (completed)
✅ Full Evaluation: Complete benchmark run with functional validation (completed)

Recent Evaluations

Grok 4.1 Fast (RouteLLM) - 2026-01-06

Result: 4.0% pass rate (1/25 tasks passed, 24 failed, 0 errors)
Robustness: ✅ PASS - All tasks evaluated (passed+failed=100%, error=0%)
Status: Final evaluation with all reliability fixes applied:
- ✅ Platform limitations correctly categorized as FAIL (not ERROR)
- ✅ Patch application failures correctly categorized as model issues (FAIL)
- ✅ All Flow tasks failed due to platform constraints (Flow package dependencies)
- ✅ No tool errors - evaluation is robust and production-ready
Breakdown: 1 Apex task passed (apex-platform-events-001), 24 tasks failed (19 patch failures, 5 platform limitations)
Note: Re-evaluation recommended with improved error handling
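
The robustness criterion above (passed+failed=100%, error=0%) and the pass rate can be checked with a few lines. This is a sketch under stated assumptions: the `PASS`/`FAIL`/`ERROR` labels mirror the categories named in this entry, while `summarize` and the outcome list format are hypothetical rather than SF-Bench's real result schema.

```python
from collections import Counter
from typing import Dict, List, Union


def summarize(outcomes: List[str]) -> Dict[str, Union[float, bool]]:
    """Compute pass rate and the robustness check for a list of task outcomes."""
    counts = Counter(outcomes)
    total = len(outcomes)
    if total == 0:
        raise ValueError("no task outcomes to summarize")
    return {
        "pass_rate": round(100 * counts["PASS"] / total, 1),
        "error_rate": round(100 * counts["ERROR"] / total, 1),
        # Robust means every task ended in PASS or FAIL, with no tool errors.
        "robust": counts["PASS"] + counts["FAIL"] == total,
    }


# The Grok 4.1 Fast run: 1 passed, 24 failed, 0 errors.
grok_run = ["PASS"] + ["FAIL"] * 24
print(summarize(grok_run))
# {'pass_rate': 4.0, 'error_rate': 0.0, 'robust': True}
```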

Claude Opus 4.5 (RouteLLM) - 2025-12-29

Result: 0% pass rate (0/12 tasks)
Issue: Patch application failures with previous patch system
Status: Re-evaluation recommended with improved multi-strategy patch application

Evaluation Methodology

SF-Bench uses a multi-level validation approach:

Syntax Validation: Code compiles without errors
Deployment Validation: Metadata deploys successfully
Unit Test Validation: Apex unit tests pass
Functional Validation: Actual business outcomes verified (bulk operations, negative cases)
Production-Ready: Security, error handling, governor limits

See VALIDATION_METHODOLOGY.md for details.
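
The levels listed above can be pictured as an ordered sequence of checks. The sketch below assumes the levels run in order and that evaluation stops at the first failing level; that ordering rule, the level keys, and `highest_level_reached` are illustrative assumptions, not the documented behavior of SF-Bench.

```python
from typing import Callable, Dict, Optional

# Level names follow the list above; the keys themselves are illustrative.
LEVELS = ["syntax", "deployment", "unit_tests", "functional", "production_ready"]


def highest_level_reached(checks: Dict[str, Callable[[], bool]]) -> Optional[str]:
    """Run checks in level order and return the last level that passed.

    Assumes a failed (or missing) check ends the run, so later levels are skipped.
    """
    reached = None
    for level in LEVELS:
        check = checks.get(level)
        if check is None or not check():
            break
        reached = level
    return reached


# Example: code compiles and deploys, but its unit tests fail.
checks = {
    "syntax": lambda: True,
    "deployment": lambda: True,
    "unit_tests": lambda: False,
    "functional": lambda: True,  # never executed under the stop-at-first-failure rule
}
print(highest_level_reached(checks))  # deployment
```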

How to Submit Results

Run SF-Bench on your model
Submit results via issue
Results will be verified and added to the leaderboard

See CONTRIBUTING.md for details.