Result Schema Reference
Complete reference for SF-Bench result schema v2 format.
Schema Version
Current Version: 2.0
Previous Version: 1.0 (legacy format, still supported)
Top-Level Structure (SWE-bench Compatible)
{
"schema_version": "2.0",
"run_id": "claude-sonnet-4.5-20251229_143000",
"model_name": "anthropic/claude-3.5-sonnet",
"model_name_or_path": "anthropic/claude-3.5-sonnet",
"dataset": "verified",
"created_at": "2025-12-29T14:30:00Z",
"start_time": "2025-12-29T14:30:00Z",
"end_time": "2025-12-29T16:45:00Z",
"evaluation_config": {...},
"config": {...},
"instances": [...],
"summary": {...},
"resolved_ids": ["lwc-component-001", "apex-trigger-handler-001"],
"unresolved_ids": ["flow-screen-component-001"],
"error_ids": ["architecture-001"],
"empty_patch_ids": [],
"completed_ids": ["lwc-component-001", "apex-trigger-handler-001", "flow-screen-component-001"],
"environment": {...}
}
Note: SF-Bench result schema is compatible with SWE-bench format for tool interoperability. Both field names are included for backward compatibility.
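Because both spellings are written to every report, a reader can accept either one. A minimal sketch of such a tolerant accessor (`get_field` is a hypothetical helper, not part of sfbench; the field names come from the structure above):

```python
def get_field(report: dict, *aliases, default=None):
    """Return the value of the first alias present in the report (v2 writes both)."""
    for name in aliases:
        if name in report:
            return report[name]
    return default

# Works whether the writer used the legacy or the SWE-bench field name.
report = {"model_name_or_path": "anthropic/claude-3.5-sonnet"}
model = get_field(report, "model_name", "model_name_or_path")
```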
Fields
schema_version (string, required)
Schema version identifier. Current: "2.0"
run_id (string, required)
Unique identifier for this evaluation run.
Format: <model-name>-<timestamp>
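A sketch of constructing a `run_id` in this shape (`make_run_id` is a hypothetical helper; only the `<model-name>-<timestamp>` format comes from this reference):

```python
from datetime import datetime, timezone

def make_run_id(model_name: str, now: datetime) -> str:
    # Strip any provider prefix so the ID stays filesystem-friendly.
    short = model_name.split("/")[-1]
    return f"{short}-{now.strftime('%Y%m%d_%H%M%S')}"

run_id = make_run_id("anthropic/claude-3.5-sonnet",
                     datetime(2025, 12, 29, 14, 30, 0, tzinfo=timezone.utc))
# → "claude-3.5-sonnet-20251229_143000"
```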
model_name (string, required)
Name of the AI model being evaluated.
Examples: "anthropic/claude-3.5-sonnet", "gemini-2.5-flash"
model_name_or_path (string, required)
SWE-bench compatible field name (alias for model_name).
dataset (string, required)
Dataset used for evaluation.
Options: "verified", "lite", "full", or custom name
config (object, required)
Evaluation configuration (backward compatibility).
evaluation_config (object, required)
SWE-bench compatible field name (alias for config).
{
"tasks_file": "data/tasks/verified.json",
"validation_mode": "functional",
"max_workers": 2,
"functional_validation": true,
"scratch_org_alias": "sfbench-claude-sonnet-4.5-20251229"
}
instances (array, required)
Array of InstanceResult objects (one per task).
summary (object, required)
Summary statistics. See Summary Object below.
start_time (string, required)
ISO 8601 timestamp when evaluation started (backward compatibility).
created_at (string, required)
SWE-bench compatible field name (alias for start_time).
end_time (string, optional)
ISO 8601 timestamp when evaluation completed.
resolved_ids (array, required)
List of instance IDs that were successfully resolved (SWE-bench compatible).
unresolved_ids (array, required)
List of instance IDs that failed validation (SWE-bench compatible).
error_ids (array, required)
List of instance IDs that encountered errors (SWE-bench compatible).
empty_patch_ids (array, required)
List of instance IDs with empty or missing patches (SWE-bench compatible).
completed_ids (array, required)
List of instance IDs that completed (not errored) (SWE-bench compatible).
environment (object, optional)
Environment information (CLI version, Python version, etc.)
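The ID lists above satisfy simple invariants a consumer can sanity-check: resolved and unresolved instances fall inside the completed set, and errored instances do not. A sketch, assuming that containment holds as described (`check_id_lists` is a hypothetical helper):

```python
def check_id_lists(report: dict) -> bool:
    """Verify the basic relationships between the top-level ID lists."""
    resolved = set(report["resolved_ids"])
    unresolved = set(report["unresolved_ids"])
    completed = set(report["completed_ids"])
    errors = set(report["error_ids"])
    # Completed instances include every resolved/unresolved one;
    # errored instances are, by definition, not completed.
    return (resolved | unresolved) <= completed and not (errors & completed)

report = {
    "resolved_ids": ["lwc-component-001", "apex-trigger-handler-001"],
    "unresolved_ids": ["flow-screen-component-001"],
    "error_ids": ["architecture-001"],
    "completed_ids": ["lwc-component-001", "apex-trigger-handler-001",
                      "flow-screen-component-001"],
}
ok = check_id_lists(report)
```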
Instance Result
Represents the result of evaluating one task.
{
"instance_id": "lwc-component-001",
"model_name": "anthropic/claude-3.5-sonnet",
"status": "resolved",
"resolved": true,
"validation": {...},
"duration_seconds": 245.3,
"scratch_org_username": "test-abc123@example.com",
"error_message": null,
"error_type": null,
"start_time": "2025-12-29T14:30:15Z",
"end_time": "2025-12-29T14:34:20Z",
"log_path": "logs/run_evaluation/claude-sonnet-4.5-20251229/lwc-component-001/run_instance.log",
"solution_patch": "diff --git a/..."
}
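Instance results can be scanned directly, for example to collect log paths for every instance that did not resolve (`failed_with_logs` is a hypothetical helper; the log path below is illustrative):

```python
def failed_with_logs(report: dict):
    """Yield (instance_id, log_path) for every instance that did not resolve."""
    for inst in report["instances"]:
        if not inst.get("resolved", False):
            yield inst["instance_id"], inst.get("log_path")

report = {"instances": [
    {"instance_id": "lwc-component-001", "resolved": True},
    {"instance_id": "architecture-001", "resolved": False,
     "log_path": "logs/run_evaluation/run1/architecture-001/run_instance.log"},
]}
failures = list(failed_with_logs(report))
```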
Status Values
"resolved": Task successfully completed (score ≥80, all critical components pass)"fail": Task failed validation"error": Error during execution"skipped": Task was not executed
Validation Breakdown
Detailed validation results for each component.
{
"deployment_status": "pass",
"deployment_message": "Deployed successfully",
"deployment_points": 10,
"unit_test_status": "pass",
"unit_test_message": "All tests passed",
"unit_test_passed": 15,
"unit_test_failed": 0,
"unit_test_total": 15,
"unit_test_points": 20,
"functional_status": "pass",
"functional_message": "Business outcome achieved",
"functional_details": {...},
"functional_points": 50,
"bulk_status": "pass",
"bulk_message": "Handled 200 records",
"bulk_records_processed": 200,
"bulk_records_expected": 200,
"bulk_points": 10,
"no_tweaks_status": "pass",
"no_tweaks_message": "No manual tweaks needed",
"no_tweaks_points": 10,
"total_score": 100
}
Component Status Values
"pass": Component validation passed"fail": Component validation failed"error": Error during component validation"skipped": Component validation skipped
Scoring
| Component | Max Points | Criteria |
|---|---|---|
| Deployment | 10 | Code deploys without errors |
| Unit Tests | 20 | All tests pass, coverage ≥80% |
| Functional | 50 | Business outcome achieved |
| Bulk | 10 | Handles 200+ records |
| No Tweaks | 10 | Works without manual fixes |
| Total | 100 | Sum of all component scores |
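The table maps directly onto the `*_points` fields in the validation breakdown; a sketch that sums them and applies the ≥80 "resolved" threshold from Status Values (`score` is a hypothetical helper, not part of sfbench):

```python
POINT_FIELDS = ("deployment_points", "unit_test_points", "functional_points",
                "bulk_points", "no_tweaks_points")

def score(validation: dict) -> int:
    """Sum the component points; the maximum is 10+20+50+10+10 = 100."""
    return sum(validation.get(field, 0) for field in POINT_FIELDS)

validation = {"deployment_points": 10, "unit_test_points": 20,
              "functional_points": 50, "bulk_points": 10,
              "no_tweaks_points": 10}
total = score(validation)   # 100
is_resolved = total >= 80   # the "resolved" threshold from Status Values
```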
Summary Object
Overall statistics for the evaluation.
{
"total_instances": 12,
"instances_submitted": 12,
"instances_completed": 7,
"resolved_instances": 5,
"instances_resolved": 5,
"instances_unresolved": 2,
"failed_instances": 2,
"error_instances": 5,
"instances_error": 5,
"instances_empty_patch": 0,
"resolve_rate": 0.4167,
"resolution_rate": 41.67,
"avg_score": 6.0,
"avg_functional_score": 3.0,
"median_score": 0.0,
"min_score": 0,
"max_score": 100,
"deployment_pass_rate": 0.4167,
"unit_test_pass_rate": 0.3333,
"functional_pass_rate": 0.0833,
"bulk_pass_rate": 0.0,
"no_tweaks_pass_rate": 0.4167,
"avg_duration_seconds": 245.3,
"total_duration_seconds": 2943.6
}
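Most summary fields can be recomputed from the instance array. A sketch deriving the headline counts and both rate spellings, assuming `resolve_rate` is a 0-1 fraction and `resolution_rate` a 0-100 percentage as documented below (`summarize` is a hypothetical helper):

```python
def summarize(instances: list) -> dict:
    """Recompute the headline summary fields from instance results."""
    total = len(instances)
    resolved = sum(1 for inst in instances if inst.get("resolved"))
    errors = sum(1 for inst in instances if inst.get("status") == "error")
    rate = resolved / total if total else 0.0
    return {
        "total_instances": total,
        "resolved_instances": resolved,
        "error_instances": errors,
        "resolve_rate": round(rate, 4),          # fraction, 0-1
        "resolution_rate": round(rate * 100, 2), # percentage, 0-100
    }

sample = [
    {"resolved": True, "status": "resolved"},
    {"resolved": False, "status": "fail"},
    {"resolved": False, "status": "error"},
]
summary = summarize(sample)
```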
SWE-bench Compatible Fields:
- instances_submitted: Total instances submitted for evaluation
- instances_completed: Instances that completed (not errored)
- instances_resolved: Alias for resolved_instances
- instances_unresolved: Instances that failed validation
- instances_error: Alias for error_instances
- instances_empty_patch: Instances with empty/missing patches
- resolution_rate: Resolution rate as a percentage (0-100)
- avg_functional_score: Average functional validation score
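A summary that carries only the legacy names can be given the alias fields mechanically; a sketch covering the two pure aliases listed above (`add_aliases` and `SUMMARY_ALIASES` are hypothetical, not part of sfbench):

```python
# Legacy name → SWE-bench compatible alias, per the list above.
SUMMARY_ALIASES = {
    "resolved_instances": "instances_resolved",
    "error_instances": "instances_error",
}

def add_aliases(summary: dict) -> dict:
    """Return a copy of the summary with the alias fields filled in."""
    out = dict(summary)
    for legacy, alias in SUMMARY_ALIASES.items():
        if legacy in out:
            out.setdefault(alias, out[legacy])
    return out

legacy_summary = {"resolved_instances": 5, "error_instances": 5}
compat = add_aliases(legacy_summary)
```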
Example: Complete Report
{
"schema_version": "2.0",
"run_id": "claude-sonnet-4.5-20251229",
"model_name": "anthropic/claude-3.5-sonnet",
"dataset": "verified",
"config": {
"tasks_file": "data/tasks/verified.json",
"validation_mode": "functional",
"max_workers": 2,
"functional_validation": true
},
"instances": [
{
"instance_id": "lwc-component-001",
"model_name": "anthropic/claude-3.5-sonnet",
"status": "resolved",
"resolved": true,
"validation": {
"deployment_status": "pass",
"deployment_points": 10,
"unit_test_status": "pass",
"unit_test_points": 20,
"functional_status": "pass",
"functional_points": 50,
"bulk_status": "pass",
"bulk_points": 10,
"no_tweaks_status": "pass",
"no_tweaks_points": 10,
"total_score": 100
},
"duration_seconds": 245.3
}
],
"summary": {
"total_instances": 12,
"resolved_instances": 5,
"resolve_rate": 41.67,
"avg_score": 6.0
},
"start_time": "2025-12-29T14:30:00Z",
"end_time": "2025-12-29T16:45:00Z"
}
Migration from v1
If you have v1 format results, use the migration function:
from sfbench.utils.schema import migrate_v1_to_v2
v1_data = {...} # Your v1 format data
v2_data = migrate_v1_to_v2(v1_data)
Validation
Validate a report conforms to schema:
from sfbench.utils.schema import validate_schema
is_valid = validate_schema(report_data)
Tools & Libraries
Python
import json

from sfbench.utils.schema import EvaluationReport, InstanceResult
from sfbench.utils.reporting import make_run_report
# Load report
with open('report.json') as f:
data = json.load(f)
report = EvaluationReport(**data)
# Generate markdown
from sfbench.utils.reporting import generate_markdown_summary
md = generate_markdown_summary(report)
Command Line
# Validate schema
python -c "from sfbench.utils.schema import validate_schema; import json; print(validate_schema(json.load(open('report.json'))))"
# Convert v1 to v2
python -c "from sfbench.utils.schema import migrate_v1_to_v2; import json; print(json.dumps(migrate_v1_to_v2(json.load(open('v1_results.json'))), indent=2))"
Related Documentation
Last updated: December 2025