SF-Bench Evaluation Results

Leaderboard by Segment

Segment	Description	GPT-4o	Claude 3.5	Gemini 2.0	Llama 3.3
Apex	Triggers, Classes, Tests	-%	-%	-%	-%
LWC	Lightning Web Components	-%	-%	-%	-%
Flow	Screen Components, Invocable Actions	-%	-%	-%	-%
Lightning Pages	FlexiPages, Dynamic Forms	-%	-%	-%	-%
Page Layouts	Record Layouts, Compact Layouts	-%	-%	-%	-%
Experience Cloud	Sites, Communities	-%	-%	-%	-%
Architecture	Full-stack, System Design	-%	-%	-%	-%
Deployment	Metadata, Dependencies	-%	-%	-%	-%
Agentforce	Agent Scripts, Prompts	-%	-%	-%	-%

Overall	All Tasks	-%	-%	-%	-%

Difficulty	Total Tasks	Description
Easy	2	Basic configurations, simple fixes
Medium	5	Multi-step implementations, integrations
Hard	4	Complex components, advanced patterns
Expert	1	Full architecture, multi-layer solutions

Repository	Stars	Categories	Status
trailheadapps/apex-recipes	1,059	Apex	✅ Active
trailheadapps/lwc-recipes	2,805	LWC	✅ Active
trailheadapps/dreamhouse-lwc	469	LWC, Architecture	✅ Active
trailheadapps/automation-components	384	Flow	✅ Active
trailheadapps/ebikes-lwc	830	Experience Cloud	✅ Active
trailheadapps/agent-script-recipes	53	Agentforce	✅ Active
trailheadapps/coral-cloud	138	Data Cloud, AI	✅ Active

Each task is scored on four weighted dimensions:

Functional Correctness (40%)
- Tests pass
- Deployment succeeds
- Expected behavior achieved
Code Quality (30%)
- No hardcoded values
- Proper error handling
- Follows Salesforce best practices
Anti-Gaming Checks (20%)
- No test-specific hacks
- Solution addresses root cause
- Maintainable code
Documentation (10%)
- Clear comments
- README updates where applicable
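
The weights above can be combined into a single per-task score. The sketch below is one plausible reading, not the benchmark's actual scoring code: it assumes each dimension is first graded on a 0 to 1 scale and that the final score is a plain weighted sum. The dimension keys and the `weighted_score` helper are hypothetical names chosen for illustration; only the four weights come from the rubric.

```python
# Weights (in percentage points) come from the rubric above.
# The 0-1 per-dimension scale and the linear combination are assumptions.
WEIGHTS = {
    "functional_correctness": 40,
    "code_quality": 30,
    "anti_gaming": 20,
    "documentation": 10,
}


def weighted_score(dimension_scores):
    """Combine per-dimension scores (each in [0, 1]) into a 0-100 task score."""
    missing = set(WEIGHTS) - set(dimension_scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    unknown = set(dimension_scores) - set(WEIGHTS)
    if unknown:
        raise ValueError(f"unknown dimensions: {sorted(unknown)}")
    for name, value in dimension_scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    # Each weight is the maximum number of points its dimension can contribute.
    return float(sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS))


# Hypothetical task: fully correct, half marks on quality, no documentation.
example = {
    "functional_correctness": 1.0,
    "code_quality": 0.5,
    "anti_gaming": 1.0,
    "documentation": 0.0,
}
print(weighted_score(example))
```

Under these assumptions a solution that passes every functional check but ships no documentation tops out at 90 points, which shows how the weights bound each dimension's contribution.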

To evaluate your own model, run SF-Bench against the verified task set:

python scripts/evaluate.py --model <your-model> --tasks data/tasks/verified.json

Then submit your results to be added to the leaderboard.
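
To evaluate several models in one pass, the command above can be wrapped in a small script. This is a sketch only: it assumes the script accepts exactly the `--model` and `--tasks` flags shown above and is run from the repository root. `build_command` and `run_all` are hypothetical helper names, and the model identifiers in the usage note are placeholders.

```python
import subprocess
import sys

TASKS_PATH = "data/tasks/verified.json"  # path taken from the command above


def build_command(model, tasks_path=TASKS_PATH):
    """Return the argv list for one evaluation run, using the documented flags."""
    if not model:
        raise ValueError("model must be a non-empty string")
    return [
        sys.executable,
        "scripts/evaluate.py",
        "--model", model,
        "--tasks", tasks_path,
    ]


def run_all(models):
    """Run the evaluation once per model, stopping at the first failing run."""
    for model in models:
        # check=True raises CalledProcessError if the script exits non-zero.
        subprocess.run(build_command(model), check=True)
```

For example, `run_all(["my-model-a", "my-model-b"])` would invoke the evaluation script twice, once per placeholder model name.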

Last updated: December 2025