SF-Bench: The Salesforce AI Benchmark
An open, objective benchmark for measuring AI coding agents on Salesforce development tasks.
Note: SF-Bench evaluates AI coding agents for Salesforce development (Apex, LWC, Flow), while Salesforce's CRM benchmark evaluates AI models for business use cases. SF-Bench is the complementary benchmark for developers.
🎯 I Am A…
Choose your path:
| 🤖 I'm… | 🎯 I Want To… | ➡️ Go To… |
|---|---|---|
| New to SF-Bench | Understand what this is | What is SF-Bench? |
| New to Salesforce | Learn about Salesforce | What is Salesforce? |
| Company/Enterprise | Evaluate AI tools for my team | For Companies |
| Salesforce Developer | Test AI models on Salesforce | Quick Start |
| Researcher | Benchmark AI models | Evaluation Guide |
| SWE-bench User | Compare with SWE-bench | Comparison |
| Open Source Enthusiast | Contribute to SF-Bench | Contributing |
| Need a Salesforce AI Benchmark? | Share the story & methodology | Salesforce AI Benchmark Guide |
🔍 Quick Navigation
| I want to… | Link |
|---|---|
| 🚀 Get started in 5 min | Quick Start |
| 📣 Understand the Salesforce AI Benchmark | Salesforce AI Benchmark Guide |
| 📊 See results | Leaderboard |
| 🧪 Test my model | Testing Your Model |
| ❓ Get help | FAQ / Troubleshooting |
| ➕ Add tasks | Contributing |
| 📤 Submit results | Submit Results |
📖 Salesforce AI Benchmark Overview
The Salesforce AI Benchmark Guide provides a comprehensive overview of SF-Bench, including:
- Purpose and methodology: Why SF-Bench exists and how it works
- Dataset structure: How tasks are organized and validated
- Salesforce-specific validation: How scoring differs from general coding benchmarks
- Technical details: Links to Evaluation Guide and Validation Methodology
This guide serves as a central reference for understanding SF-Bench's approach to evaluating AI models on Salesforce development tasks.
🎯 What We Do
SF-Bench measures and reports. We don't predict or claim expected outcomes.
| We Do | We Don't |
|---|---|
| ✅ Measure actual performance | ❌ Predict success rates |
| ✅ Report objective results | ❌ Claim what models "should" score |
| ✅ Verify functional outcomes | ❌ Interpret or editorialize |
| ✅ Test in real Salesforce orgs | ❌ Just check syntax |
🏆 Leaderboard
December 2025
| Rank | Model | Overall | Functional Score | LWC | Deploy | Apex | Flow | Lightning Pages | Experience Cloud | Architecture |
|---|---|---|---|---|---|---|---|---|---|---|
| 🥇 | Claude Sonnet 4.5 | 41.67% | 6.0% | 100% | 100% | 100% | 0%* | 0% | 0% | 0% |
| 🥈 | Gemini 2.5 Flash | 25.0% | - | 100% | 100% | 0%* | 0%* | 0% | 0% | 0% |
| - | More results pending | -% | - | -% | -% | -% | -% | -% | -% | -% |
* Flow tasks failed due to scratch org creation issues (being fixed)
Note: Functional Score (0-100) uses weighted validation. See VALIDATION_METHODOLOGY.md for details.
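For intuition, a weighted functional score can be sketched as below. The stage names and weights here are made up for illustration; the real ones are defined in VALIDATION_METHODOLOGY.md:

```python
# Hypothetical stage weights, for illustration only; the real weights
# are defined in VALIDATION_METHODOLOGY.md.
WEIGHTS = {"deploy": 0.2, "tests": 0.3, "outcome": 0.5}

def functional_score(stage_results: dict) -> float:
    """Combine per-stage pass/fail results into a 0-100 weighted score."""
    earned = sum(WEIGHTS[stage] for stage, passed in stage_results.items() if passed)
    return round(100 * earned, 1)

# A solution that deploys and passes tests but misses the business outcome:
print(functional_score({"deploy": True, "tests": True, "outcome": False}))  # 50.0
```

The weighting is what lets a partially working solution score above zero without being counted as a pass.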
🧪 Testing Your Model
Supported Providers
| Provider | Models | Setup |
|---|---|---|
| OpenRouter | 100+ models | OPENROUTER_API_KEY |
| RouteLLM | Gemini 3, Grok, GPT-5 | ROUTELLM_API_KEY |
| OpenAI | GPT-4, GPT-3.5 | OPENAI_API_KEY |
| Anthropic | Claude 3.5, 3 | ANTHROPIC_API_KEY |
| Google | Gemini 2.5, Pro | GOOGLE_API_KEY |
| Ollama | Local models | No key needed |
Quick Start
```bash
git clone https://github.com/yasarshaikh/SF-bench.git
cd SF-bench
pip install -e .

# Run with OpenRouter (access to all models)
export OPENROUTER_API_KEY="your-key"
python scripts/evaluate.py --model anthropic/claude-3.5-sonnet

# Run with Gemini
export GOOGLE_API_KEY="your-key"
python scripts/evaluate.py --model gemini-2.5-flash

# Run with local Ollama
python scripts/evaluate.py --model codellama --provider ollama
```
🏗️ SF-Bench Architecture
🔧 How Validation Works
We Check Outcomes, Not Just Deployment

```
Standard Benchmark:
Deploy succeeded? → PASS ✅

SF-Bench:
Deploy succeeded?          → Step 1 of 3
Tests passed?              → Step 2 of 3
Business outcome achieved? → Step 3 of 3 → PASS/FAIL
```
Example: Flow Task

```bash
# Step 1: Deploy
sf project deploy start   # ✅

# Step 2: Create test data
sf apex run -c "insert new Account(Name='Test', Type='Customer');"

# Step 3: Verify the outcome
sf data query -q "SELECT Id FROM Task WHERE WhatId = :accId"
# 1 Task created → PASS
# 0 Tasks        → FAIL (the Flow didn't work)
```
📋 Task Categories
SF-Bench includes 12 verified tasks across Salesforce development domains:
| Category | Tasks | Description |
|---|---|---|
| Apex | 2 | Triggers, Classes, Integrations |
| LWC | 2 | Lightning Components |
| Flow | 2 | Record-Triggered Flows, Invocable Actions |
| Lightning Pages | 1 | Dynamic Forms |
| Experience Cloud | 1 | Guest Access |
| Architecture | 4 | Full-stack Design |
Datasets
- Lite (5 tasks): Quick validation in ~10 minutes (`data/tasks/lite.json`)
- Verified (12 tasks): Full evaluation in ~1 hour (`data/tasks/verified.json`)
- Realistic: Challenging scenarios (`data/tasks/realistic.json`)
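For programmatic access, a dataset can be loaded with a few lines like the sketch below. It assumes each file is a JSON array of task objects, which is a guess at the schema rather than a documented guarantee:

```python
import json
from pathlib import Path

def load_tasks(path: str) -> list[dict]:
    """Load an SF-Bench task file, assumed to be a JSON array of task objects."""
    tasks = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(tasks, list):
        raise ValueError(f"expected a JSON array of tasks in {path}")
    return tasks

# Hypothetical usage; the 'id' and 'category' field names are assumptions:
# for task in load_tasks("data/tasks/lite.json"):
#     print(task["id"], task["category"])
```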
📚 Documentation
Getting Started
- 🚀 Quick Start Guide - Get running in 5 minutes
- 📖 What is SF-Bench? - Complete overview
- 🏢 What is Salesforce? - For beginners
- ❓ FAQ - Common questions and answers
- 🔧 Troubleshooting - Common issues and solutions
For Different Audiences
- 💼 For Companies - Business case & ROI
- 👨‍💻 For Salesforce Developers - Evaluation guide
- 🔬 For Researchers - Methodology details
- 📊 SWE-bench Comparison - Benchmark comparison
Reference
- 🔍 Validation Methodology - How we validate results
- 📋 Benchmark Details - Technical specifications
- 🏆 Full Leaderboard - Complete model rankings
- 📖 Evaluation Guide - How to run evaluations
- 📄 Result Schema - Result format reference
Contributing
- ➕ Contributing Guide
- 🎯 Task Guidelines - Creating new tasks
- 📤 Submitting Results
🤝 Get Involved
| Action | Link |
|---|---|
| ⭐ Star the repo | GitHub |
| 📤 Submit results | Submit |
| 🐛 Report bugs | Issues |
| ➕ Add tasks | Contributing |
⭐ Star us on GitHub if you find SF-Bench useful!