SF-Bench: The Salesforce AI Benchmark

The open, objective Salesforce AI benchmark for measuring AI coding agents on Salesforce development tasks.

Note: SF-Bench evaluates AI coding agents for Salesforce development (Apex, LWC, Flow), while Salesforce’s CRM benchmark evaluates AI models for business use cases. SF-Bench is the complementary benchmark for developers.


🎯 I Am A…

Choose your path:

| 👀 I’m… | 🎯 I Want To… | ➡️ Go To… |
|---|---|---|
| New to SF-Bench | Understand what this is | What is SF-Bench? |
| New to Salesforce | Learn about Salesforce | What is Salesforce? |
| Company/Enterprise | Evaluate AI tools for my team | For Companies |
| Salesforce Developer | Test AI models on Salesforce | Quick Start |
| Researcher | Benchmark AI models | Evaluation Guide |
| SWE-bench User | Compare with SWE-bench | Comparison |
| Open Source Enthusiast | Contribute to SF-Bench | Contributing |
| Need a Salesforce AI Benchmark? | Share the story & methodology | Salesforce AI Benchmark Guide |

📌 Quick Navigation

| I want to… | Link |
|---|---|
| 🚀 Get started in 5 min | Quick Start |
| 📣 Understand the Salesforce AI Benchmark | Salesforce AI Benchmark Guide |
| 🏆 See results | Leaderboard |
| 🧪 Test my model | Testing Your Model |
| ❓ Get help | FAQ, Troubleshooting |
| ➕ Add tasks | Contributing |
| 📊 Submit results | Submit Results |

🔍 Salesforce AI Benchmark Overview

The Salesforce AI Benchmark Guide provides a comprehensive overview of SF-Bench, including:

  • Purpose and methodology: Why SF-Bench exists and how it works
  • Dataset structure: How tasks are organized and validated
  • Salesforce-specific validation: How scoring differs from general coding benchmarks
  • Technical details: Links to Evaluation Guide and Validation Methodology

This guide serves as a central reference for understanding SF-Bench’s approach to evaluating AI models on Salesforce development tasks.


🎯 What We Do

SF-Bench measures and reports. We don’t predict or claim expected outcomes.

| We Do | We Don’t |
|---|---|
| ✅ Measure actual performance | ❌ Predict success rates |
| ✅ Report objective results | ❌ Claim what models “should” score |
| ✅ Verify functional outcomes | ❌ Interpret or editorialize |
| ✅ Test in real Salesforce orgs | ❌ Just check syntax |

πŸ† Leaderboard

December 2025

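| Rank | Model | Overall | Functional Score | LWC | Deploy | Apex | Flow | Lightning Pages | Experience Cloud | Architecture |
|---|---|---|---|---|---|---|---|---|---|---|
| 🥇 | Claude Sonnet 4.5 | 41.67% | 6.0% | 100% | 100% | 100% | 0%* | 0% | 0% | 0% |
| 🥈 | Gemini 2.5 Flash | 25.0% | - | 100% | 100% | 0%* | 0%* | 0% | 0% | 0% |
| - | More results pending | -% | - | -% | -% | -% | -% | -% | -% | -% |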

* Flow tasks failed due to scratch org creation issues (being fixed)

Note: Functional Score (0-100) uses weighted validation. See VALIDATION_METHODOLOGY.md for details.

Full Leaderboard →
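
The Overall figures above are consistent with a plain pass rate over the 12 benchmark tasks (5 of 12 is 41.67%, 3 of 12 is 25.0%). That reading is inferred from the numbers rather than taken from a documented formula, so treat the sketch below as an assumption. `overall_pass_rate` is a hypothetical helper, not part of SF-Bench, and it does not model the weighted Functional Score.

```python
def overall_pass_rate(passed: int, total: int) -> str:
    """Format passed/total as a percentage with two decimals.

    Assumption: Overall = tasks passed / tasks attempted.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return f"{100 * passed / total:.2f}%"


print(overall_pass_rate(5, 12))  # 41.67%
print(overall_pass_rate(3, 12))  # 25.00%
```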


🧪 Testing Your Model

Supported Providers

| Provider | Models | Setup |
|---|---|---|
| OpenRouter | 100+ models | OPENROUTER_API_KEY |
| RouteLLM | Gemini 3, Grok, GPT-5 | ROUTELLM_API_KEY |
| OpenAI | GPT-4, GPT-3.5 | OPENAI_API_KEY |
| Anthropic | Claude 3.5, 3 | ANTHROPIC_API_KEY |
| Google | Gemini 2.5, Pro | GOOGLE_API_KEY |
| Ollama | Local models | No key needed |

Quick Start

git clone https://github.com/yasarshaikh/SF-bench.git
cd SF-bench
pip install -e .

# Run with OpenRouter (access to all models)
export OPENROUTER_API_KEY="your-key"
python scripts/evaluate.py --model anthropic/claude-3.5-sonnet

# Run with Gemini
export GOOGLE_API_KEY="your-key"
python scripts/evaluate.py --model gemini-2.5-flash

# Run with local Ollama
python scripts/evaluate.py --model codellama --provider ollama
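
Before running the commands above, it can help to check which provider keys are actually set in your shell. The following is a minimal sketch, not part of SF-Bench: the environment variable names come from the Supported Providers table, while the lowercase provider labels (other than `ollama`) and the `configured_providers` helper are assumptions made for illustration.

```python
import os

# Environment variable per provider, copied from the Supported Providers table.
# Ollama runs locally and needs no key. Provider labels are illustrative.
PROVIDER_KEYS = {
    "openrouter": "OPENROUTER_API_KEY",
    "routellm": "ROUTELLM_API_KEY",
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "google": "GOOGLE_API_KEY",
    "ollama": None,
}


def configured_providers(env=None):
    """Return providers whose API key is set and non-empty, or that need none."""
    env = os.environ if env is None else env
    return [
        name
        for name, key in PROVIDER_KEYS.items()
        if key is None or env.get(key)
    ]


if __name__ == "__main__":
    print("Ready to use:", ", ".join(configured_providers()))
```

With no keys exported, only `ollama` is listed, which matches the "No key needed" row.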

πŸ—οΈ SF-Bench Architecture

View System Architecture
SF-Bench Evaluation Flow Task Definition JSON Schema AI Agent Generate Solution Patch Application Multi-Strategy Scratch Org Salesforce Env Deployment sf project deploy Unit Tests Coverage β‰₯80% Functional Validation Business Outcome Verification Results Schema v2 Components Setup/Deploy AI Generation Validation Results

🔧 How Validation Works

We Check Outcomes, Not Just Deployment

Standard Benchmark:
  Deploy succeeded? → PASS ✅

SF-Bench:
  Deploy succeeded? → Step 1 of 3
  Tests passed? → Step 2 of 3
  Business outcome achieved? → Step 3 of 3 → PASS/FAIL
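
The difference between the two approaches can be sketched in a few lines of Python. This is an illustration of the three-step gate described above, not the SF-Bench implementation; `TaskResult` and both verdict functions are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    deployed: bool          # Step 1: deployment succeeded
    tests_passed: bool      # Step 2: unit tests passed
    outcome_achieved: bool  # Step 3: business outcome verified


def deploy_only_verdict(result: TaskResult) -> str:
    """A benchmark that stops at deployment."""
    return "PASS" if result.deployed else "FAIL"


def sf_bench_verdict(result: TaskResult) -> str:
    """All three steps must succeed for a PASS."""
    steps = (result.deployed, result.tests_passed, result.outcome_achieved)
    return "PASS" if all(steps) else "FAIL"


# A solution that deploys and passes tests but never produces the expected record.
silent_failure = TaskResult(deployed=True, tests_passed=True, outcome_achieved=False)
print(deploy_only_verdict(silent_failure))  # PASS
print(sf_bench_verdict(silent_failure))     # FAIL
```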

Example: Flow Task

# Step 1: Deploy
sf project deploy start  # ✅

# Step 2: Create test data
sf apex run -c "insert new Account(Name='Test', Type='Customer');"

# Step 3: Verify outcome
# (:accId stands for the Id of the Account created in Step 2)
sf data query -q "SELECT Id FROM Task WHERE WhatId = :accId"
# 1 Task created → PASS
# 0 Tasks → FAIL (Flow didn't work)
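
Step 3 can be automated by requesting JSON output from the CLI and counting the returned rows. The sketch below assumes that `sf data query --json` wraps its results as `{"result": {"totalSize": N, "records": [...]}}`; check that shape against your CLI version. `flow_verdict` is a hypothetical helper, and the sample payloads are hand-written rather than real CLI output.

```python
import json


def flow_verdict(query_json: str) -> str:
    """Return PASS if the query found at least one Task, else FAIL.

    Assumes the payload shape {"result": {"totalSize": <int>, ...}}.
    """
    payload = json.loads(query_json)
    task_count = payload.get("result", {}).get("totalSize", 0)
    return "PASS" if task_count >= 1 else "FAIL"


# Hand-written sample payloads, not real CLI output.
one_task = '{"status": 0, "result": {"totalSize": 1, "records": [{"Id": "00T000000000001"}]}}'
no_tasks = '{"status": 0, "result": {"totalSize": 0, "records": []}}'

print(flow_verdict(one_task))  # PASS
print(flow_verdict(no_tasks))  # FAIL
```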

📊 Task Categories

SF-Bench includes 12 verified tasks across Salesforce development domains:

| Category | Tasks | Description | Lite Dataset |
|---|---|---|---|
| Apex | 2 | Triggers, Classes, Integrations | ✅ |
| LWC | 2 | Lightning Components | ✅ |
| Flow | 2 | Record-Triggered Flows, Invocable Actions | ✅ |
| Lightning Pages | 1 | Dynamic Forms | ✅ |
| Experience Cloud | 1 | Guest Access | ❌ |
| Architecture | 4 | Full-stack Design | ✅ |
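
As a quick consistency check, the per-category counts above do add up to the stated 12 tasks. The snippet below simply re-encodes the table as data; the tuple layout is an illustrative choice and not the SF-Bench dataset format.

```python
# (category, task count, in Lite dataset) copied from the table above.
CATEGORIES = [
    ("Apex", 2, True),
    ("LWC", 2, True),
    ("Flow", 2, True),
    ("Lightning Pages", 1, True),
    ("Experience Cloud", 1, False),
    ("Architecture", 4, True),
]

total_tasks = sum(count for _, count, _ in CATEGORIES)
lite_categories = [name for name, _, in_lite in CATEGORIES if in_lite]

print(total_tasks)           # 12
print(len(lite_categories))  # 5
```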

Datasets


📖 Documentation

Getting Started

For Different Audiences

Reference

Contributing


🀝 Get Involved

Action Link
⭐ Star the repo GitHub
πŸ“Š Submit results Submit
πŸ› Report bugs Issues
βž• Add tasks Contributing

⭐ Star us on GitHub if you find SF-Bench useful!