# Salesforce AI Benchmark Guide
Understand why Salesforce teams, AI researchers, and platform vendors rely on SF-Bench to evaluate coding agents.
## Why Salesforce Needs Its Own AI Benchmark
Generic coding benchmarks (HumanEval, SWE-bench) rarely capture the realities of Salesforce development:
- Multi-artifact delivery: Real solutions span Apex classes, triggers, Lightning Web Components, Flows, metadata, and declarative configs.
- Platform limits: Governor limits, security requirements, and dependency ordering often break naive AI output.
- Functional validation: Deploying code is not enough; teams must prove that business outcomes occur inside a scratch org.
SF-Bench is purpose-built to close that gap with domain-specific tasks, verified scripts, and repeatable scoring.
## What SF-Bench Measures
| Layer | What We Validate | Why It Matters |
|---|---|---|
| Deployment | Metadata deploys to a fresh scratch org | Confirms generated artifacts compile and can be installed |
| Unit & Integration Tests | Auto-generated or provided tests pass | Ensures code integrity and coverage |
| Functional Outcomes | Business requirement succeeds via CLI checks | Captures real-world success criteria |
| Bulk & Resilience | Tasks run at scale without governor violations | Proves readiness for enterprise workloads |
👉 See the Validation Methodology for the exact scoring rubric.
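The layered validation above can be rolled up into a per-task score. The exact rubric is defined in the Validation Methodology; the sketch below is illustrative only, assuming each layer is a simple pass/fail with equal weight (the layer keys and weighting here are stand-ins, not SF-Bench's actual schema):

```python
# Illustrative only: layer names mirror the table above, but the
# equal weighting and dict shape are assumptions for this sketch.
LAYERS = ["deployment", "tests", "functional", "bulk"]

def score_task(results: dict) -> float:
    """Fraction of validation layers a task passed (0.0 to 1.0)."""
    return sum(bool(results.get(layer)) for layer in LAYERS) / len(LAYERS)

# A task that deploys and passes tests but fails functional and bulk checks:
task = {"deployment": True, "tests": True, "functional": False, "bulk": False}
print(score_task(task))  # 0.5
```

The point of the layering is that a task can "compile" yet still fail: deployment success alone scores well below 1.0 unless the functional and bulk checks also pass.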
## Datasets for Every Journey
| Dataset | Tasks | Time | Perfect For |
|---|---|---|---|
| Lite (`data/tasks/lite.json`) | 5 | ~10 min | Quick proof-of-concept, demos |
| Verified (`data/tasks/verified.json`) | 12 | ~60 min | Official leaderboard submissions |
| Realistic (`data/tasks/realistic.json`) | 30+ | 2-3 hrs | Deep vendor or research evaluations |
Each dataset ships with task prompts, acceptance tests, and validation scripts so that results remain comparable across runs.
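Because every dataset is a plain JSON file, a quick sanity check before a run is easy. This minimal sketch assumes only that each dataset file is a JSON array of task objects; the actual per-task fields are documented in the Result Schema Reference:

```python
import json
from pathlib import Path

def count_tasks(path: str) -> int:
    """Return the number of tasks in a dataset file.

    Assumes only that the file is a JSON array of task objects;
    see the Result Schema Reference for the real per-task fields.
    """
    return len(json.loads(Path(path).read_text()))
```

Run `count_tasks("data/tasks/lite.json")` from a repo checkout and the result should match the task counts in the table above.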
## How to Run a Salesforce AI Benchmark in 5 Steps
1. Install & Authenticate: Follow the Quick Start to install SF-Bench, log in to your DevHub, and set API keys.
2. Select a Dataset: Start with `data/tasks/lite.json` for a rapid sanity check, then graduate to `data/tasks/verified.json` for leaderboard-ready numbers.
3. Pick a Model Provider: Use RouteLLM, OpenRouter, Google Gemini, Anthropic, OpenAI, or local Ollama models. Set the appropriate environment variables.
4. Run the Evaluate Script:

```bash
python scripts/evaluate.py \
  --model "anthropic/claude-3.5-sonnet" \
  --tasks data/tasks/verified.json \
  --functional \
  --max-workers 2
```

5. Review Reports: SF-Bench emits JSON + Markdown summaries under `results/` and `evaluation_results/` for easy sharing, diffing, and submissions.
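Because reports land as JSON, comparing two runs (say, two models on the Verified dataset) is a small scripting exercise. In this sketch the key names (`task_id`, `passed`, `tasks`) are hypothetical stand-ins; map them to the actual fields in the Result Schema Reference before relying on it:

```python
# Illustrative diff of two report.json payloads. The field names used
# here are assumptions, not SF-Bench's confirmed schema.

def diff_runs(report_a: dict, report_b: dict) -> list:
    """Return task ids whose pass/fail status changed between two runs."""
    a = {t["task_id"]: t["passed"] for t in report_a["tasks"]}
    b = {t["task_id"]: t["passed"] for t in report_b["tasks"]}
    # Only compare tasks present in both runs, sorted for stable output.
    return sorted(tid for tid in a.keys() & b.keys() if a[tid] != b[tid])

run1 = {"tasks": [{"task_id": "t1", "passed": True}, {"task_id": "t2", "passed": False}]}
run2 = {"tasks": [{"task_id": "t1", "passed": True}, {"task_id": "t2", "passed": True}]}
print(diff_runs(run1, run2))  # ['t2']
```

A diff like this makes regressions obvious when you rerun the same dataset after changing the model, prompt, or worker count.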
Need more detail? Jump to the Evaluation Guide for advanced orchestration tips.
## Comparing SF-Bench to SWE-bench
| Area | SF-Bench | SWE-bench |
|---|---|---|
| Domain | Salesforce (Apex, LWC, Flow) | Python OSS issues |
| Execution | Scratch orgs, CLI validators | Docker containers |
| Functional Checks | Business outcome verification | Patch applies + tests |
| Audience | Enterprises, Salesforce partners, AI vendors | General-purpose LLM researchers |
Read the full comparison guide to understand how the two benchmarks complement each other.
## Use Cases
- Salesforce COEs: Compare multiple AI copilots before rolling out to thousands of admins and developers.
- Model Vendors: Publish transparent Salesforce-specific scores to win enterprise trust.
- Researchers: Stress-test agent frameworks on metadata-heavy, multi-step deployments.
- Consultants & ISVs: Validate that accelerators and packaged AI assistants meet client-grade standards.
## Results & Submissions
- Explore the live Leaderboard for current model standings.
- Package your `report.json` and `summary.md` from `evaluation_results/`.
- Submit results with reproduction steps and model metadata.
Every accepted submission improves community trust and builds the public record for Salesforce AI benchmarking.
## Resources & Next Steps
- 📚 What is SF-Bench?
- 🚀 Quick Start Guide
- 🧪 Evaluation Guide
- 🧰 Troubleshooting
- 🧾 Result Schema Reference
✨ Need talking points for stakeholders? Share this page or link directly using /salesforce-ai-benchmark/ for an easy-to-remember URL.