Frequently Asked Questions (FAQ)

Quick answers to common questions about SF-Bench.


General Questions

What is SF-Bench?

SF-Bench is a benchmark for evaluating AI models on real-world Salesforce development tasks. It tests whether models can generate working Apex, Flow, and Lightning Web Component code that meets functional business requirements.

Why SF-Bench?

Existing AI benchmarks (SWE-bench, HumanEval, etc.) focus on general programming. SF-Bench focuses on Salesforce-specific challenges:

  • Domain-specific knowledge (Salesforce APIs, governor limits, best practices)
  • Multi-file solutions (Apex classes, triggers, test classes)
  • Functional validation (does it actually work in a scratch org?)
  • Real business outcomes (not just syntax correctness)

Who is SF-Bench for?

  • AI researchers: Benchmark model performance on domain-specific tasks
  • Salesforce teams: Evaluate AI models for Salesforce development using objective metrics
  • Model providers: Test and improve models for enterprise use cases
  • Developers: Understand AI capabilities and limitations for Salesforce

Getting Started

What do I need to run SF-Bench?

  1. Salesforce DevHub with scratch org allocation
  2. Python 3.10+
  3. Salesforce CLI (sf command)
  4. API key for an AI model (OpenRouter, Gemini, Claude, etc.)

See the Quick Start Guide for setup instructions.
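
A quick way to sanity-check these prerequisites from a terminal (the API key value below is a placeholder):

# Check tool versions
python --version   # expect 3.10 or newer
sf --version

# Confirm a DevHub is authenticated
sf org list

# Expose your model API key to the evaluation scripts (placeholder value)
export OPENROUTER_API_KEY="sk-or-..."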

How long does an evaluation take?

  • Single task: 5-10 minutes
  • Lite dataset (5 tasks): ~30 minutes
  • Verified dataset (12 tasks): ~1 hour
  • Full dataset (50+ tasks): 3-5 hours

Time varies based on scratch org creation speed and model response time.

How much does it cost?

Scratch Orgs: Free (included with Developer Edition or DevHub)

AI Model Costs:

  • Gemini 2.5 Flash: Free tier available (AI Studio)
  • OpenRouter: $0.10-$2 per evaluation (depending on model)
  • Claude Sonnet: ~$1-3 per evaluation
  • Ollama (local): Free (requires local GPU)

Typical cost: $0.50-$2 per full evaluation


Technical Questions

What tasks are included?

SF-Bench includes tasks across three categories:

  1. Apex: Classes, triggers, bulk processing
  2. Flow: Screen flows, record-triggered flows, scheduled flows
  3. Lightning Web Components: UI components, event handling, data binding

See data/tasks/verified.json for the full list.

How are tasks scored?

Each task is scored out of 100 points:

  • Deploy (10%): Code deploys successfully
  • Unit Tests (20%): All tests pass
  • Functional (50%): Business requirement met ← Most important!
  • Bulk Data (10%): Handles 200+ records
  • No Manual Tweaks (10%): Works without human fixes

A task is RESOLVED if:

  • All tests pass
  • Functional validation passes
  • Score ≥ 80/100
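
For example, a submission that deploys (10), passes all unit tests (20), meets the functional requirement (50), and handles bulk data (10) but needs one manual tweak (0) scores 90/100 and is RESOLVED; the same submission failing functional validation drops to 40/100 and is not resolved, even though everything else passed.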

What is “functional validation”?

Functional validation checks if the business outcome was achieved, not just if the code compiles.

Example:

Task: "Create a Flow that creates a Task when Account Type changes to Customer"

❌ FAIL: Flow deploys, tests pass, but no Task is created
✅ PASS: Flow deploys, tests pass, AND Task is created

This is what makes SF-Bench different from syntax-only benchmarks.
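
As an illustration, the outcome above could be checked with a query against the scratch org (the WHERE clause and org alias are placeholders; the harness performs its own validation automatically):

# Did the Flow actually create the Task? (illustrative query)
sf data query \
  --query "SELECT Id, Subject FROM Task WHERE WhatId = '<account-id>'" \
  --target-org <scratch-org-alias>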

Can I add my own tasks?

Yes! SF-Bench is extensible. To add tasks:

  1. Create a JSON file following the schema in data/tasks/verified.json
  2. Include test cases for functional validation
  3. Submit a PR or use it locally

See CONTRIBUTING.md for guidelines.
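
A hypothetical, minimal task entry is sketched below; the field names are illustrative only, so copy the actual structure from an existing entry in data/tasks/verified.json:

# Hypothetical sketch only -- mirror the real schema in data/tasks/verified.json
cat > data/tasks/my_task.json << 'EOF'
[
  {
    "task_id": "custom-apex-001",
    "category": "apex",
    "requirement": "Set Account Rating to 'Hot' when AnnualRevenue exceeds 1,000,000",
    "functional_check": "Insert an Account with AnnualRevenue = 2000000 and assert Rating = 'Hot'"
  }
]
EOF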


Model Questions

Which models perform best?

SF-Bench measures and reports objective results. We don’t make judgments about which models are “best.”

See the Leaderboard for current evaluation results and rankings.

Why do some models fail functional validation?

Common reasons:

  1. Misunderstood requirements: Generated code doesn’t match the business logic
  2. Salesforce-specific errors: Violates governor limits, uses deprecated APIs
  3. Incomplete solutions: Only partial implementation (e.g., class without trigger)
  4. Test-only solutions: Passes unit tests but doesn’t work in practice

This is expected and shows the benchmark is challenging!

Can I use local models (Ollama)?

Yes! SF-Bench supports Ollama for local model testing:

# Pull the model, then start Ollama
ollama pull codellama
ollama serve

# Run evaluation
python scripts/evaluate.py --model "codellama" --provider ollama

Note: Local models generally have lower success rates (~20-40%) due to smaller parameter counts.

How do I test multiple models?

Run the models one after another with a shell loop:

# Test multiple models in sequence
for model in "gemini-2.5-flash" "anthropic/claude-3.5-sonnet"; do
  python scripts/evaluate.py --model "$model" --tasks data/tasks/verified.json
done

For parallel evaluations, run multiple instances with different models:

# Terminal 1
python scripts/evaluate.py --model model1 --tasks data/tasks/verified.json

# Terminal 2
python scripts/evaluate.py --model model2 --tasks data/tasks/verified.json
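
Alternatively, both runs can be launched from a single shell in the background (a sketch; redirect output so the logs don't interleave, and keep an eye on your daily scratch org allocation):

# Launch two evaluations in parallel from one shell
python scripts/evaluate.py --model model1 --tasks data/tasks/verified.json > model1.log 2>&1 &
python scripts/evaluate.py --model model2 --tasks data/tasks/verified.json > model2.log 2>&1 &
wait   # block until both finish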

Results & Leaderboard

How do I submit results to the leaderboard?

  1. Run an evaluation and save results
  2. Submit a GitHub issue with:
    • Model name and version
    • Evaluation date
    • Link to results JSON
    • Reproduction steps
  3. We’ll verify and add to the leaderboard

See Submitting Results for details.

Are results reproducible?

Mostly, but not always. Factors affecting reproducibility:

  • Model version: Models are updated frequently
  • Temperature: Non-zero temperature introduces randomness
  • Scratch org state: Subtle differences in org setup
  • Timing: Rate limits, org creation delays

For reproducibility:

  • Use temperature=0
  • Specify exact model version
  • Document environment (Python, sf CLI versions)
  • Run multiple times and report average
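
For example, a reproducible run might capture the environment alongside the results and pin an exact model identifier (the model string below is a placeholder):

# Record the environment next to the results
python --version  > environment.txt
sf --version     >> environment.txt

# Pin an exact model version and run
python scripts/evaluate.py \
  --model "anthropic/claude-3.5-sonnet-20241022" \
  --tasks data/tasks/verified.json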

Can I compare my results to others?

Yes! All results use a standardized schema (v2) for easy comparison.

Tools:

  • scripts/leaderboard.py - Generate comparison tables
  • sfbench/utils/reporting.py - Compare two reports

Troubleshooting

“Scratch org creation failed”

Common causes:

  1. Daily scratch org limit reached (wait 24 hours)
  2. DevHub not authenticated (re-run sf org login)
  3. Invalid scratch org definition (check data/templates/project-scratch-def.json)

See Troubleshooting Guide for solutions.
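
For instance, causes 1 and 2 can be checked directly with the Salesforce CLI (the DevHub alias is an example):

# Re-authenticate the DevHub (example alias)
sf org login web --set-default-dev-hub --alias DevHub

# Inspect remaining scratch org allocations
sf limits api display --target-org DevHub | grep -i scratch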

“API rate limit exceeded”

Solutions:

  1. Reduce parallelization: --max-workers 1
  2. Use a different provider (e.g., Gemini instead of OpenRouter)
  3. Wait and retry
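
For example, the first suggestion applied to a full run:

# Run with a single worker to stay under provider rate limits
python scripts/evaluate.py --model "gemini-2.5-flash" --tasks data/tasks/verified.json --max-workers 1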

Tasks fail with “Deployment failed”

Common causes:

  1. Invalid Apex syntax (model generated broken code)
  2. Missing dependencies (e.g., a referenced class or @future method that was never generated)
  3. Test coverage too low (<75%)

Check logs/run_evaluation/<run-id>/<model>/<task-id>/deployment.log for details.

Results show “ERROR” status

This means something went wrong before validation:

  • Scratch org creation failed
  • API error
  • Patch application failed

Check logs/run_evaluation/<run-id>/<model>/<task-id>/run_instance.log for root cause.


Advanced Usage

Can I customize the scoring system?

Yes! Edit sfbench/utils/schema.py:

# Change point allocation (keep the total at 100)
deployment_points: int = 10  # e.g., reduce to 5
functional_points: int = 50  # e.g., increase to 55

Then re-run evaluations with the new schema.

Can I run evaluations in CI/CD?

Yes! Example GitHub Actions workflow:

- name: Run SF-Bench
  run: |
    python scripts/evaluate.py \
      --model "${{ matrix.model }}" \
      --tasks data/tasks/lite.json
  env:
    OPENROUTER_API_KEY: ${{ secrets.OPENROUTER_API_KEY }}
    SF_USERNAME: ${{ secrets.SF_USERNAME }}

CI/CD integration guide coming soon. For now, see Evaluation Guide for running evaluations.

How do I analyze failure patterns?

Use the reporting tools:

from sfbench.utils.reporting import compare_reports, load_report

# Load two reports
report1 = load_report("results/model-a/report.json")
report2 = load_report("results/model-b/report.json")

# Generate comparison
comparison = compare_reports(report1, report2)
print(comparison)

Contributing

How can I contribute?

We welcome contributions!

  • Add tasks: Create new Salesforce scenarios
  • Improve validation: Enhance functional validation logic
  • Fix bugs: Report issues or submit PRs
  • Documentation: Improve guides and examples

See CONTRIBUTING.md for guidelines.

Can I use SF-Bench in my research?

Yes! SF-Bench is open source and free to use for academic research.

Citation:

@misc{sfbench2025,
  title={SF-Bench: A Benchmark for Salesforce Development Tasks},
  author={Yasar Shaikh},
  year={2025},
  url={https://github.com/yasarshaikh/SF-bench}
}

How is SF-Bench different from SWE-bench?

Similarities:

  • Real-world task focus
  • Functional validation (not just syntax)
  • Standardized result schema (v2, SWE-bench compatible)
  • Multi-strategy patch application for robust evaluation
  • Hierarchical log organization

Differences:

  • Domain: Salesforce vs. Python open-source
  • Validation: Scratch orgs vs. Docker containers
  • Tasks: 12 verified Salesforce scenarios (expanding) vs. 2,000+ GitHub issues
  • Focus: Enterprise development vs. open-source contributions

SF-Bench is aligned with SWE-bench standards and best practices, tailored for Salesforce.


Future Plans

What’s coming next?

Phase 2 (Q1 2025):

  • Lite dataset (5 tasks for quick validation)
  • Enhanced analysis tools
  • Web-based result viewer

Phase 3 (Q2 2025):

  • Integration test scenarios
  • Multi-org workflows
  • Community task contributions

See GitHub Issues for roadmap discussions.

Will there be a hosted version?

We’re considering a hosted version where you can:

  • Run evaluations without a DevHub
  • View real-time results
  • Compare models instantly

Interested? Open an issue to vote or discuss.


Contact & Support

Where can I get help?

  • Open a GitHub issue for bugs or setup problems
  • Start a GitHub discussion for general questions
  • Check the Troubleshooting Guide for common errors

How do I stay updated?

  • ⭐ Star the repo on GitHub
  • 👀 Watch for releases
  • 📧 Subscribe to discussions

More Questions?

If your question isn’t answered here:

  1. Check the Troubleshooting Guide
  2. Search GitHub Issues
  3. Ask in GitHub Issues

Last updated: December 2025