SF-Bench: The Salesforce AI Benchmark
An open, objective benchmark for measuring AI coding agents on Salesforce development tasks.
Note: SF-Bench evaluates AI coding agents for Salesforce development (Apex, LWC, Flow), while Salesforce's CRM benchmark evaluates AI models for business use cases. SF-Bench is the complementary benchmark for developers.
🎯 I Am A…
Choose your path:
| 🤖 I'm… | 🎯 I Want To… | ➡️ Go To… |
|---|---|---|
| New to SF-Bench | Understand what this is | What is SF-Bench? |
| New to Salesforce | Learn about Salesforce | What is Salesforce? |
| Company/Enterprise | Evaluate AI tools for my team | For Companies |
| Salesforce Developer | Test AI models on Salesforce | Quick Start |
| Researcher | Benchmark AI models | Evaluation Guide |
| SWE-bench User | Compare with SWE-bench | Comparison |
| Open Source Enthusiast | Contribute to SF-Bench | Contributing |
| Need a Salesforce AI Benchmark? | Share the story & methodology | Salesforce AI Benchmark Guide |
🔍 Quick Navigation
| I want to… | Link |
|---|---|
| 🚀 Get started in 5 min | Quick Start |
| 📣 Understand the Salesforce AI Benchmark | Salesforce AI Benchmark Guide |
| 📊 See results | Leaderboard |
| 🧪 Test my model | Testing Your Model |
| ❓ Get help | FAQ / Troubleshooting |
| ➕ Add tasks | Contributing |
| 📤 Submit results | Submit Results |
📖 Salesforce AI Benchmark Overview
The Salesforce AI Benchmark Guide provides a comprehensive overview of SF-Bench, including:
- Purpose and methodology: Why SF-Bench exists and how it works
- Dataset structure: How tasks are organized and validated
- Salesforce-specific validation: How scoring differs from general coding benchmarks
- Technical details: Links to Evaluation Guide and Validation Methodology
This guide serves as a central reference for understanding SF-Bench's approach to evaluating AI models on Salesforce development tasks.
🎯 What We Do
SF-Bench measures and reports. We don't predict or claim expected outcomes.
| We Do | We Don't |
|---|---|
| ✅ Measure actual performance | ❌ Predict success rates |
| ✅ Report objective results | ❌ Claim what models "should" score |
| ✅ Verify functional outcomes | ❌ Interpret or editorialize |
| ✅ Test in real Salesforce orgs | ❌ Just check syntax |
🏆 Leaderboard
December 2025
| Rank | Model | Overall | Functional Score | LWC | Deploy | Apex | Flow | Lightning Pages | Experience Cloud | Architecture |
|---|---|---|---|---|---|---|---|---|---|---|
| 🥇 | Claude Sonnet 4.5 | 41.67% | 6.0% | 100% | 100% | 100% | 0%* | 0% | 0% | 0% |
| 🥈 | Gemini 2.5 Flash | 25.0% | - | 100% | 100% | 0%* | 0%* | 0% | 0% | 0% |
| - | More results pending | -% | - | -% | -% | -% | -% | -% | -% | -% |
* Flow tasks failed due to scratch org creation issues (being fixed)
Note: Functional Score (0-100) uses weighted validation. See VALIDATION_METHODOLOGY.md for details.
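For intuition, a weighted functional score can be sketched as below. The stage names and weights here are made up for illustration; the real ones are defined in VALIDATION_METHODOLOGY.md:

```python
# Hypothetical stage weights, for illustration only; the real weights
# are defined in VALIDATION_METHODOLOGY.md.
WEIGHTS = {"deploy": 0.2, "tests": 0.3, "outcome": 0.5}

def functional_score(stage_results: dict) -> float:
    """Combine per-stage pass/fail results into a 0-100 weighted score."""
    earned = sum(WEIGHTS[stage] for stage, passed in stage_results.items() if passed)
    return round(100 * earned, 1)

# A solution that deploys and passes tests but misses the business outcome:
print(functional_score({"deploy": True, "tests": True, "outcome": False}))  # 50.0
```

The weighting is what lets a partially working solution score above zero without being counted as a pass.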
🧪 Testing Your Model
Supported Providers
| Provider | Models | Setup |
|---|---|---|
| OpenRouter | 100+ models | OPENROUTER_API_KEY |
| RouteLLM | Gemini 3, Grok, GPT-5 | ROUTELLM_API_KEY |
| OpenAI | GPT-4, GPT-3.5 | OPENAI_API_KEY |
| Anthropic | Claude 3.5, 3 | ANTHROPIC_API_KEY |
| Google | Gemini 2.5, Pro | GOOGLE_API_KEY |
| Ollama | Local models | No key needed |
Quick Start
```bash
git clone https://github.com/yasarshaikh/SF-bench.git
cd SF-bench
pip install -e .

# Run with OpenRouter (access to all models)
export OPENROUTER_API_KEY="your-key"
python scripts/evaluate.py --model anthropic/claude-3.5-sonnet

# Run with Gemini
export GOOGLE_API_KEY="your-key"
python scripts/evaluate.py --model gemini-2.5-flash

# Run with local Ollama
python scripts/evaluate.py --model codellama --provider ollama
```
🏗️ SF-Bench Architecture
🔧 How Validation Works
We Check Outcomes, Not Just Deployment

```
Standard Benchmark:
Deploy succeeded? → PASS ✅

SF-Bench:
Deploy succeeded?          → Step 1 of 3
Tests passed?              → Step 2 of 3
Business outcome achieved? → Step 3 of 3 → PASS/FAIL
```
Example: Flow Task

```bash
# Step 1: Deploy
sf project deploy start   # ✅

# Step 2: Create test data
sf apex run -c "insert new Account(Name='Test', Type='Customer');"

# Step 3: Verify the outcome
sf data query -q "SELECT Id FROM Task WHERE WhatId = :accId"
# 1 Task created → PASS
# 0 Tasks        → FAIL (the Flow didn't work)
```
📋 Task Categories
SF-Bench includes 12 verified tasks across Salesforce development domains:
| Category | Tasks | Description |
|---|---|---|
| Apex | 2 | Triggers, Classes, Integrations |
| LWC | 2 | Lightning Components |
| Flow | 2 | Record-Triggered Flows, Invocable Actions |
| Lightning Pages | 1 | Dynamic Forms |
| Experience Cloud | 1 | Guest Access |
| Architecture | 4 | Full-stack Design |
Datasets
- Lite (5 tasks): Quick validation in ~10 minutes (`data/tasks/lite.json`)
- Verified (12 tasks): Full evaluation in ~1 hour (`data/tasks/verified.json`)
- Realistic: Challenging scenarios (`data/tasks/realistic.json`)
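For programmatic access, a dataset can be loaded with a few lines like the sketch below. It assumes each file is a JSON array of task objects, which is a guess at the schema rather than a documented guarantee:

```python
import json
from pathlib import Path

def load_tasks(path: str) -> list[dict]:
    """Load an SF-Bench task file, assumed to be a JSON array of task objects."""
    tasks = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(tasks, list):
        raise ValueError(f"expected a JSON array of tasks in {path}")
    return tasks

# Hypothetical usage; the 'id' and 'category' field names are assumptions:
# for task in load_tasks("data/tasks/lite.json"):
#     print(task["id"], task["category"])
```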
📚 Documentation
Getting Started
- 🚀 Quick Start Guide - Get running in 5 minutes
- 📖 What is SF-Bench? - Complete overview
- 🏢 What is Salesforce? - For beginners
- ❓ FAQ - Common questions and answers
- 🔧 Troubleshooting - Common issues and solutions
For Different Audiences
- 💼 For Companies - Business case & ROI
- 👨‍💻 For Salesforce Developers - Evaluation guide
- 🔬 For Researchers - Methodology details
- 📊 SWE-bench Comparison - Benchmark comparison
Reference
- 🔍 Validation Methodology - How we validate results
- 📋 Benchmark Details - Technical specifications
- 🏆 Full Leaderboard - Complete model rankings
- 📖 Evaluation Guide - How to run evaluations
- 📄 Result Schema - Result format reference
Contributing
- ➕ Contributing Guide
- 🎯 Task Guidelines - Creating new tasks
- 📤 Submitting Results
🤝 Get Involved
| Action | Link |
|---|---|
| ⭐ Star the repo | GitHub |
| 📤 Submit results | Submit |
| 🐛 Report bugs | Issues |
| ➕ Add tasks | Contributing |
⭐ Star us on GitHub if you find SF-Bench useful!