SF-Bench Validation Methodology
Core Philosophy
“If a benchmark says 100% pass rate, it must mean the solution works 100% of the time in production, first try.”
This is NOT just about:
- ❌ Code compiles
- ❌ Deployment succeeds
- ❌ Unit tests pass
This IS about:
- ✅ Business requirement is met
- ✅ Functionality works as intended
- ✅ Solution handles edge cases
- ✅ No manual tweaks needed
Validation Levels
Level 1: Syntax Validation (Necessary but NOT Sufficient)
- Code parses without errors
- Metadata XML is valid
- Dependencies are resolved
Level 2: Deployment Validation (Still NOT Sufficient)
- Deploys to scratch org without errors
- All metadata types recognized
- No missing references
Level 3: Functional Validation (REQUIRED for SF-Bench)
- Solution actually DOES what it’s supposed to do
- Tested with real data in real org
- Outcome matches expected business result
Level 4: Production-Ready Validation (IDEAL)
- Handles bulk data (200+ records)
- Error handling works correctly
- Performance is acceptable
- Security model is correct
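The four levels read as cumulative gates: a solution that fails deployment cannot be credited with functional validation. A minimal sketch of that reading follows. The names `ValidationLevel`, `highest_level_reached`, and `meets_sf_bench_bar` are illustrative, and the cumulative rule is an assumption drawn from the "necessary but not sufficient" wording above.

```python
from enum import IntEnum


class ValidationLevel(IntEnum):
    NONE = 0
    SYNTAX = 1
    DEPLOYMENT = 2
    FUNCTIONAL = 3
    PRODUCTION_READY = 4


def highest_level_reached(results: dict) -> ValidationLevel:
    """Return the highest level for which it and every lower level passed.

    `results` maps ValidationLevel -> bool; a missing level counts as failed.
    """
    reached = ValidationLevel.NONE
    for level in ValidationLevel:
        if level is ValidationLevel.NONE:
            continue
        if not results.get(level, False):
            break  # a failed gate blocks credit for all higher levels
        reached = level
    return reached


def meets_sf_bench_bar(results: dict) -> bool:
    """Level 3 (functional) is the minimum the document requires."""
    return highest_level_reached(results) >= ValidationLevel.FUNCTIONAL
```

Under this sketch, a solution that deploys but fails its functional check stops at `DEPLOYMENT`, even if a later bulk test happens to pass.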
Validation by Task Type
APEX - Trigger/Class Validation
Current (Inadequate):
sf apex run test --class-names MyTest
Required (SF-Bench Standard):
validation:
  steps:
    - name: "Deploy to Scratch Org"
      command: "sf project deploy start"
      success_criteria: "exit_code == 0"
    - name: "Run Unit Tests"
      command: "sf apex run test --code-coverage --result-format json"
      success_criteria:
        - "tests_passed == tests_total"
        - "code_coverage >= 80"
    - name: "Execute Trigger with Test Data"
      command: "sf apex run --file scripts/test_trigger.apex"
      # test_trigger.apex creates records that should fire the trigger
    - name: "Verify Outcome"
      command: "sf data query --query \"SELECT Id, Description FROM Account WHERE Name = 'Test Account'\""
      success_criteria:
        - "Description CONTAINS 'Owner:'" # Verify trigger actually ran
        - "Description NOT CONTAINS 'null'" # Verify null-safety worked
    - name: "Verify No Governor Limit Issues"
      command: "sf apex run --file scripts/bulk_test.apex" # Insert 200 records
      success_criteria: "exit_code == 0"
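The `CONTAINS` and `NOT CONTAINS` criteria in the "Verify Outcome" step need something to evaluate them against the queried record. The document does not define a grammar for these strings, so the following is only a sketch that assumes the form `<field> [NOT] CONTAINS '<text>'`. The function name `check_contains` and the record shape (a plain dict, as parsed from JSON query output) are hypothetical.

```python
import re

# Assumed criterion shape: <field> [NOT] CONTAINS '<text>'
_CRITERION = re.compile(
    r"^(?P<field>\w+)\s+(?P<negate>NOT\s+)?CONTAINS\s+'(?P<text>[^']*)'$"
)


def check_contains(criterion: str, record: dict) -> bool:
    """Evaluate one CONTAINS / NOT CONTAINS criterion against a record."""
    match = _CRITERION.match(criterion.strip())
    if match is None:
        raise ValueError(f"unsupported criterion: {criterion!r}")
    # A missing or null field is treated as an empty string.
    value = record.get(match["field"]) or ""
    found = match["text"] in str(value)
    return not found if match["negate"] else found
```

Treating a null field as empty means `Description CONTAINS 'Owner:'` fails when the trigger never wrote the field, which is the failure the step is meant to catch.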
LWC - Component Validation
Current (Partially Adequate):
npm run test:unit
Required (SF-Bench Standard):
validation:
  steps:
    - name: "Run Jest Tests"
      command: "npm run test:unit -- --coverage"
      success_criteria:
        - "tests_passed == tests_total"
        - "coverage >= 80"
    - name: "Deploy to Scratch Org"
      command: "sf project deploy start"
      success_criteria: "exit_code == 0"
    - name: "Verify Apex Controller Works"
      command: "sf apex run --file scripts/test_controller.apex"
      success_criteria: "output CONTAINS 'SUCCESS'"
    - name: "UI Smoke Test (via UTAM or Playwright)"
      command: "npm run test:e2e -- --spec apexImperativeMethod"
      success_criteria:
        - "component_renders == true"
        - "no_console_errors == true"
        - "loading_state_works == true"
        - "error_state_works == true"
FLOW - Record-Triggered Flow Validation
Current (Completely Inadequate):
sf project deploy start --dry-run
Required (SF-Bench Standard):
validation:
  steps:
    - name: "Deploy Flow"
      command: "sf project deploy start --source-dir force-app/main/default/flows"
      success_criteria: "exit_code == 0"
    - name: "Activate Flow"
      command: "sf apex run --file scripts/activate_flow.apex"
      success_criteria: "Flow.Status == 'Active'"
    - name: "Create Test Record (Trigger Conditions Met)"
      command: "sf apex run --file scripts/create_test_account.apex"
      # Creates Account with Type='Customer - Direct', AnnualRevenue=2000000
    - name: "Wait for Flow Execution"
      command: "sleep 5" # Allow async processing
    - name: "Verify Task Created"
      query: "SELECT Id, Subject, WhatId, ActivityDate FROM Task WHERE WhatId = :accountId"
      success_criteria:
        - "record_count == 1"
        - "Subject CONTAINS 'High Value Account'"
        - "ActivityDate == TODAY() + 7"
    - name: "Verify Contacts Updated"
      query: "SELECT Id, Level__c FROM Contact WHERE AccountId = :accountId AND Title LIKE '%Director%'"
      success_criteria:
        - "Level__c == 'Primary' FOR ALL RECORDS"
    - name: "Verify Platform Event Published"
      command: "sf apex run --file scripts/check_platform_event.apex"
      success_criteria: "event_published == true"
    - name: "Test Bulk (200 records)"
      command: "sf apex run --file scripts/bulk_insert_accounts.apex"
      success_criteria:
        - "exit_code == 0"
        - "tasks_created == 200"
        - "no_flow_errors"
    - name: "Test Entry Condition NOT Met (Negative Test)"
      command: "sf apex run --file scripts/create_low_value_account.apex"
      success_criteria: "tasks_created == 0" # Flow should NOT fire
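The "Wait for Flow Execution" step uses a fixed `sleep 5`, which may be too short on a slow org and wasteful on a fast one. One possible alternative, offered as a suggestion and not as part of the SF-Bench specification, is to poll the verification query until it returns the expected result or a timeout expires. In this sketch `poll_until` is a hypothetical helper, and `fetch` stands in for whatever function runs the Task query.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def poll_until(
    fetch: Callable[[], T],
    is_ready: Callable[[T], bool],
    timeout_s: float = 30.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[T]:
    """Call fetch() repeatedly until is_ready(result) is true.

    Returns the first ready result, or None if the timeout expires first.
    fetch() is always called at least once.
    """
    deadline = clock() + timeout_s
    while True:
        result = fetch()
        if is_ready(result):
            return result
        if clock() >= deadline:
            return None
        sleep(interval_s)
```

Polling only helps for positive checks. The negative test (`tasks_created == 0`) still has to wait out a full window, because the absence of a Task immediately after the insert proves nothing.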
LIGHTNING PAGE - FlexiPage Validation
Current (Inadequate):
sf project deploy start --dry-run
Required (SF-Bench Standard):
validation:
  steps:
    - name: "Deploy FlexiPage"
      command: "sf project deploy start --source-dir force-app/main/default/flexipages"
      success_criteria: "exit_code == 0"
    - name: "Assign to App/Record Type"
      command: "sf apex run --file scripts/assign_flexipage.apex"
      success_criteria: "assignment_success == true"
    - name: "Create Test Record"
      command: "sf data create record --sobject Property__c --values \"Name='Test Property' Status__c='Available'\""
    - name: "Capture Page Screenshot (or UTAM test)"
      command: "npm run test:e2e -- --spec propertyRecordPage"
      success_criteria:
        - "page_loads == true"
        - "field_sections_visible >= 4"
        - "action_bar_present == true"
    - name: "Test Visibility Rules"
      commands:
        - "sf data update record --sobject Property__c --record-id :id --values \"Status__c='Sold'\""
        - "npm run test:e2e -- --spec propertyRecordPage --status=Sold"
      success_criteria:
        - "sold_section_visible == true"
        - "available_section_hidden == true"
    - name: "Test Mobile Responsiveness"
      command: "npm run test:e2e -- --spec propertyRecordPage --viewport=mobile"
      success_criteria: "mobile_layout_correct == true"
EXPERIENCE CLOUD - Site Validation
Current (Inadequate):
sf project deploy start --dry-run
Required (SF-Bench Standard):
validation:
  steps:
    - name: "Deploy Site Configuration"
      command: "sf project deploy start --source-dir force-app/main/default/experiences"
      success_criteria: "exit_code == 0"
    - name: "Publish Site"
      command: "sf community publish --name E_Bikes1"
      success_criteria: "exit_code == 0"
    - name: "Wait for Publication"
      command: "sleep 30" # Sites take time to publish
    - name: "Test Guest Access"
      command: "curl -s https://ebikes1.preview.salesforce-sites.com/"
      success_criteria:
        - "http_status == 200"
        - "response CONTAINS 'E-Bikes'"
    - name: "Test Authenticated Access"
      command: "npm run test:e2e -- --spec ebikesAuth --user=testuser"
      success_criteria:
        - "login_works == true"
        - "profile_page_loads == true"
        - "order_history_visible == true"
    - name: "Test Product Navigation"
      command: "npm run test:e2e -- --spec ebikesNavigation"
      success_criteria:
        - "category_navigation_works == true"
        - "product_detail_loads == true"
        - "add_to_cart_works == true"
    - name: "Test Mobile Responsiveness"
      command: "npm run test:e2e -- --spec ebikes --viewport=mobile"
      success_criteria: "mobile_navigation_works == true"
Scoring Methodology
Pass Criteria (a full pass must meet ALL)
| Check | Weight | Meaning |
|---|---|---|
| Deploys Successfully | 10% | Necessary but not sufficient |
| Unit Tests Pass | 20% | Code quality check |
| Functional Test Pass | 50% | Core requirement |
| Bulk Test Pass | 10% | Production readiness |
| No Manual Tweaks | 10% | True one-shot solution |
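The weights in the table sum to 100, so a per-task score can be computed as the sum of the weights of the checks that passed. The sketch below assumes each check is a simple pass/fail with no partial credit. The dictionary keys are hypothetical identifiers for the table rows, not names defined by SF-Bench.

```python
# Weights copied from the table above; key names are illustrative.
WEIGHTS = {
    "deploys_successfully": 10,
    "unit_tests_pass": 20,
    "functional_test_pass": 50,
    "bulk_test_pass": 10,
    "no_manual_tweaks": 10,
}


def weighted_score(results: dict) -> int:
    """Sum the weights of all passed checks; every check must be reported."""
    missing = WEIGHTS.keys() - results.keys()
    if missing:
        raise ValueError(f"missing check results: {sorted(missing)}")
    return sum(weight for check, weight in WEIGHTS.items() if results[check])
```

For example, a solution that deploys, passes unit tests, and needs no manual tweaks, but fails both the functional and bulk tests, scores 10 + 20 + 10 = 40.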
Score Interpretation
| Score | What It Means |
|---|---|
| 100% | All validation steps passed |
| 80-99% | Most steps passed, minor issues |
| 50-79% | Partial success |
| 20-49% | Limited success |
| 0-19% | Most steps failed |
We report scores objectively. We do not label a score “good” or “bad”; users decide based on their own requirements.
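The interpretation table maps directly to a lookup. A minimal sketch, assuming scores are percentages in the range 0 to 100 and that each band's lower bound is inclusive (`interpret_score` is an illustrative name):

```python
def interpret_score(score: float) -> str:
    """Return the table's description for a percentage score."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    if score == 100:
        return "All validation steps passed"
    if score >= 80:
        return "Most steps passed, minor issues"
    if score >= 50:
        return "Partial success"
    if score >= 20:
        return "Limited success"
    return "Most steps failed"
```

The returned strings are descriptive only, in keeping with the policy of reporting scores without grading them.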
Implementation Approach
Phase 1: Task Definition Enhancement
- Add `functional_validation` section to each task
- Define specific queries/commands to verify outcomes
- Include expected data values
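A small lint pass over task definitions could enforce the three Phase 1 requirements. The task schema is not specified in this document, so the sketch below assumes a hypothetical shape: `functional_validation` is a list of checks, each with a `query` or `command` plus an `expected` mapping.

```python
def lint_task(task: dict) -> list:
    """Return a list of problems with a task's functional_validation section.

    Assumed (hypothetical) schema: a list of checks, each a dict carrying
    either "query" or "command", plus "expected" data values.
    """
    checks = task.get("functional_validation")
    if not checks:
        return ["functional_validation section is missing"]
    problems = []
    for index, check in enumerate(checks):
        if not (check.get("query") or check.get("command")):
            problems.append(f"check {index}: no query or command")
        if "expected" not in check:
            problems.append(f"check {index}: no expected values")
    return problems
```

An empty list means the task definition at least states what to run and what outcome to expect. It says nothing about whether those expectations are the right ones.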
Phase 2: Test Data Setup
- Create Apex scripts for test data creation
- Include negative test cases
- Support bulk testing
Phase 3: Validation Runner Enhancement
- Execute multi-step validation pipelines
- Capture detailed results at each step
- Calculate weighted scores
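The runner described in Phase 3 can be sketched as a loop over steps that records the outcome of each one. This is an illustration under stated assumptions and not the SF-Bench implementation: steps are dicts with `name` and `command` keys, as in the validation definitions above, and only the `exit_code == 0` criterion is evaluated. `StepResult`, `run_command`, and `run_pipeline` are hypothetical names.

```python
import shlex
import subprocess
from dataclasses import dataclass


@dataclass
class StepResult:
    name: str
    exit_code: int
    stdout: str
    passed: bool


def run_command(command: str):
    """Run a command without a shell; return (exit_code, stdout)."""
    try:
        completed = subprocess.run(
            shlex.split(command), capture_output=True, text=True
        )
    except FileNotFoundError as error:
        return 127, str(error)  # conventional "command not found" code
    return completed.returncode, completed.stdout


def run_pipeline(steps, execute=run_command, stop_on_failure=True):
    """Execute steps in order and capture a StepResult for each one run."""
    results = []
    for step in steps:
        exit_code, stdout = execute(step["command"])
        passed = exit_code == 0
        results.append(StepResult(step["name"], exit_code, stdout, passed))
        if stop_on_failure and not passed:
            break  # later steps depend on earlier ones (e.g. deploy first)
    return results
```

The `execute` parameter exists so the pipeline logic can be exercised with a fake executor, without a scratch org or the `sf` CLI installed.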
Phase 4: Reporting
- Show step-by-step validation results
- Highlight where solutions fail
- Provide actionable feedback
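A step-by-step report can be as simple as one line per step plus a pointer to the first failure. The sketch below assumes each result is a `(name, passed, detail)` tuple. That shape and the name `format_report` are illustrative.

```python
def format_report(results) -> str:
    """Render one line per step and name the first failing step, if any.

    `results` is an iterable of (name, passed, detail) tuples.
    """
    lines = []
    first_failure = None
    for name, passed, detail in results:
        status = "PASS" if passed else "FAIL"
        line = f"[{status}] {name}"
        if detail:
            line += f": {detail}"
        lines.append(line)
        if not passed and first_failure is None:
            first_failure = name
    if first_failure is None:
        lines.append("All steps passed")
    else:
        lines.append(f"First failing step: {first_failure}")
    return "\n".join(lines)
```

Putting the observed value in `detail` (for instance, the actual record count next to the expected one) is what turns a failure line into actionable feedback.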
Why This Matters
When a model scores 75% on SF-Bench, it should mean:
- 75% of solutions work first try, no tweaks
- 25% need some iteration
It should NOT mean:
- 75% of solutions deploy without errors
- but only 30% actually work
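The gap between the two readings is easy to reproduce with made-up numbers. In the sketch below, the 20 task results are invented for illustration only: 15 deploy cleanly, but only 6 pass the functional check. The `pass_rate` helper and the field names are hypothetical.

```python
def pass_rate(task_results, check: str) -> float:
    """Percentage of tasks whose result for `check` is True."""
    if not task_results:
        raise ValueError("no task results")
    passed = sum(1 for result in task_results if result[check])
    return 100 * passed / len(task_results)


# Illustrative data only: 6 tasks work, 9 more deploy but fail, 5 do not deploy.
task_results = (
    [{"deployed": True, "functional": True}] * 6
    + [{"deployed": True, "functional": False}] * 9
    + [{"deployed": False, "functional": False}] * 5
)

deploy_rate = pass_rate(task_results, "deployed")        # 75.0
functional_rate = pass_rate(task_results, "functional")  # 30.0
```

A benchmark that stops at deployment would report 75% for this run, while one that requires functional validation would report 30%.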
This benchmark must reflect real-world developer experience.
“A benchmark is only as credible as its validation methodology.”