Benchmarking System¶
The Opifex framework includes a benchmarking system designed specifically for scientific machine learning applications. This system provides domain-specific evaluation, publication-ready output, and statistical rigor.
Overview¶
The Benchmarking System consists of the following specialized components, which work together to evaluate scientific machine learning models:
- BenchmarkRegistry - Domain-specific configuration management
- ValidationFramework - Reference comparison, convergence analysis, and error analysis
- ChemicalAccuracyValidator - Chemical accuracy assessment with domain-specific thresholds
- ConservationValidator - Physics conservation law validation
- AnalysisEngine - Statistical analysis and performance comparison
- ResultsManager - JSON persistence and publication output
- BenchmarkRunner - End-to-end workflow orchestration
- Adapters - Bridge to calibrax Run objects for cross-tool analysis
Core types (BenchmarkResult, Metric, Run) and statistical analysis (StatisticalAnalyzer) are provided by calibrax.
Key Features¶
Domain-Specific Intelligence¶
- Physics-aware validation for quantum chemistry, fluid dynamics, and materials science
- Chemical accuracy assessment with <1 kcal/mol tolerance for quantum applications
- Conservation law validation for energy, momentum, and mass conservation
- Domain-specific metrics tailored to scientific computing requirements
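As a rough illustration of the chemical accuracy criterion listed above, the sketch below checks whether the mean absolute error of predicted energies stays under 1 kcal/mol. The helper name, the use of mean absolute error as the statistic, and the Hartree inputs are assumptions made for this example; it is not the ChemicalAccuracyValidator implementation.

```python
HARTREE_TO_KCAL_PER_MOL = 627.509  # approximate conversion factor
CHEMICAL_ACCURACY_KCAL_PER_MOL = 1.0  # threshold quoted in the feature list


def meets_chemical_accuracy(predicted_hartree, reference_hartree):
    """Return (passed, mae_kcal_per_mol) for two equal-length energy lists.

    Hypothetical helper: uses mean absolute error as the accuracy statistic.
    """
    if len(predicted_hartree) != len(reference_hartree) or not predicted_hartree:
        raise ValueError("inputs must be non-empty and of equal length")
    abs_errors = [abs(p - r) for p, r in zip(predicted_hartree, reference_hartree)]
    mae_kcal = sum(abs_errors) / len(abs_errors) * HARTREE_TO_KCAL_PER_MOL
    return mae_kcal < CHEMICAL_ACCURACY_KCAL_PER_MOL, mae_kcal


# Synthetic energies in Hartree, chosen only to exercise both outcomes
reference = [-76.4000, -40.5000, -56.5500]
good = [-76.4010, -40.4995, -56.5508]  # errors of about 1e-3 Hartree or less
bad = [-76.4100, -40.5100, -56.5400]   # errors of about 1e-2 Hartree

passed_good, mae_good = meets_chemical_accuracy(good, reference)
passed_bad, mae_bad = meets_chemical_accuracy(bad, reference)
print(passed_good, round(mae_good, 3))
print(passed_bad, round(mae_bad, 3))
```

The unit conversion matters here: an error that looks small in Hartree (1e-2) is several kcal/mol and fails the threshold.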
Publication-Ready Output¶
- LaTeX table generation for academic papers
- HTML report generation for web-based sharing
- CSV export for data analysis
- Publication-quality plots with matplotlib integration
- Automated figure generation with comparison visualizations
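To make the LaTeX table output more concrete, here is a minimal sketch of turning a metrics dictionary into a `tabular` environment. The function name, the input layout, and the DeepONet numbers are hypothetical; the actual tables are produced by ResultsManager, whose formatting may differ.

```python
def to_latex_table(results, metrics, caption="Benchmark comparison"):
    """Build a LaTeX table from {operator: {metric: value}}.

    Hypothetical helper; values are printed in scientific notation.
    """
    # Underscores must be escaped in LaTeX text mode
    header = " & ".join(["Operator"] + [m.replace("_", r"\_") for m in metrics])
    lines = [
        r"\begin{table}[ht]",
        r"\centering",
        r"\begin{tabular}{l" + "c" * len(metrics) + "}",
        r"\hline",
        header + r" \\",
        r"\hline",
    ]
    for operator, values in results.items():
        cells = [f"{values[m]:.2e}" for m in metrics]
        lines.append(" & ".join([operator] + cells) + r" \\")
    lines += [
        r"\hline",
        r"\end{tabular}",
        r"\caption{" + caption + "}",
        r"\end{table}",
    ]
    return "\n".join(lines)


# Illustrative values only
results = {
    "FNO": {"mse": 0.0012, "relative_error": 0.034},
    "DeepONet": {"mse": 0.0031, "relative_error": 0.052},
}
table = to_latex_table(results, metrics=["mse", "relative_error"])
print(table)
```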
Statistical Rigor¶
- Welch t-test and Mann-Whitney U via calibrax for significance testing
- Multi-operator comparison with per-metric rankings
- Scaling behavior analysis across different problem sizes
- Performance insights with bottleneck detection
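One common way to analyze scaling behavior across problem sizes is a least-squares fit in log-log space, where the slope approximates the exponent in cost ≈ c · n^p. The sketch below assumes that approach and uses synthetic timings; it does not reproduce how AnalysisEngine computes its complexity estimates.

```python
import math


def estimate_scaling_exponent(sizes, costs):
    """Least-squares slope of log(cost) versus log(size)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(c) for c in costs]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return sxy / sxx


sizes = [32, 64, 128, 256]
costs = [3e-6 * n**2 for n in sizes]  # synthetic timings following cost = c * n^2
exponent = estimate_scaling_exponent(sizes, costs)
print(f"Estimated exponent: {exponent:.3f}")
```

With real timings the fit is noisy, so it helps to use at least three or four sizes and to repeat each measurement.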
Enterprise Reliability¶
- Database isolation for reliable benchmarking
- Pre-commit compliance with zero errors
- Production-ready architecture with modular design
Quick Start¶
Basic Usage¶
from opifex.benchmarking import (
    BenchmarkRegistry, ValidationFramework, AnalysisEngine,
    ResultsManager, BenchmarkRunner
)
# Initialize components
registry = BenchmarkRegistry()
validator = ValidationFramework()
analyzer = AnalysisEngine()
manager = ResultsManager(storage_path="./benchmark_results")
# Create runner with all components
runner = BenchmarkRunner(
    registry=registry,
    validator=validator,
    analyzer=analyzer,
    results_manager=manager,
    output_dir="./benchmark_results",
)
# Run benchmark suite
results = runner.run_comprehensive_benchmark(
    operators=["FNO", "DeepONet"],
)
# Generate publication report
report = runner.generate_publication_report(results)
Domain-Specific Benchmarking¶
from opifex.benchmarking.validators.chemical_accuracy import ChemicalAccuracyValidator
from opifex.benchmarking.validators.conservation import ConservationValidator
# Quantum chemistry — chemical accuracy assessment
chem_validator = ChemicalAccuracyValidator()
assessment = chem_validator.assess(result, domain="quantum_computing")
print(f"Passed: {assessment.passed}, Achieved: {assessment.achieved:.4f}")
# Fluid dynamics — conservation law validation
conservation = ConservationValidator(laws=["energy", "momentum"])
report = conservation.validate(y_pred, y_true)
print(f"All conserved: {report.all_conserved}")
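For intuition about what a conservation check measures, the sketch below compares the total of a conserved quantity (for example, mass summed over grid cells) between a prediction and the reference. The function names, the relative-error definition, and the default tolerance are assumptions for illustration and are not taken from ConservationValidator.

```python
def relative_conservation_error(pred_field, true_field):
    """Relative difference between the conserved totals of two fields."""
    total_pred = sum(pred_field)
    total_true = sum(true_field)
    # Guard against division by zero when the reference total vanishes
    return abs(total_pred - total_true) / max(abs(total_true), 1e-12)


def is_conserved(pred_field, true_field, tolerance=1e-3):
    """True when the totals agree to within the relative tolerance."""
    return relative_conservation_error(pred_field, true_field) <= tolerance


true_field = [1.0, 2.0, 3.0, 4.0]           # total = 10
redistributed = [1.1, 1.9, 3.05, 3.95]      # wrong locally, same total
leaky = [1.0, 2.0, 3.0, 3.5]                # total = 9.5, so 5% is lost

print(is_conserved(redistributed, true_field))
print(is_conserved(leaky, true_field))
```

The `redistributed` case shows why conservation checks complement pointwise error metrics: a prediction can be locally inaccurate yet still conserve the global quantity, and vice versa.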
Component Details¶
BenchmarkResult (from calibrax)¶
BenchmarkResult is the central data container for all benchmark outputs:
from calibrax.core import BenchmarkResult, Metric
result = BenchmarkResult(
    name="darcy_flow_fno",
    domain="scientific_ml",
    tags={"dataset": "darcy_flow", "operator": "FNO"},
    metrics={
        "mse": Metric(value=0.0012),
        "relative_error": Metric(value=0.034, lower=0.029, upper=0.041),
    },
    metadata={
        "execution_time": 1.23,
        "framework_version": "1.0.0",
    },
)
# Access fields
print(result.metrics["mse"].value) # 0.0012
print(result.metadata["execution_time"]) # 1.23
print(result.tags["dataset"]) # "darcy_flow"
BenchmarkRegistry¶
The BenchmarkRegistry manages domain-specific configurations and operator discovery:
from opifex.benchmarking import BenchmarkRegistry
from opifex.benchmarking.benchmark_registry import BenchmarkConfig
registry = BenchmarkRegistry()
# Register domain-specific benchmark
config = BenchmarkConfig(
    name="darcy_flow_fno",
    domain="fluid_dynamics",
    problem_type="elliptic_pde",
    input_shape=(64, 64, 1),
    output_shape=(64, 64, 1),
)
registry.register_benchmark(config)
# Auto-discover operators
registry.auto_discover_operators()
# Get benchmark suite for a domain
suite = registry.get_benchmark_suite("quantum_computing")
Key Features:
- Domain-specific configuration management
- Automatic operator discovery
- Benchmark suite generation per domain
- Compatibility checking
- JSON persistence
ValidationFramework¶
The ValidationFramework provides reference comparison and convergence analysis:
from opifex.benchmarking import ValidationFramework
# Initialize (no domain parameter; the domain is inferred from results)
validator = ValidationFramework(
    default_tolerances=[1e-3, 1e-4, 1e-5],
    reference_methods={"analytical": analytical_solver},
)
# Validate against a reference method
report = validator.validate_against_reference(
    result=benchmark_result,
    reference_method="analytical",
    reference_data=reference_array,
    predictions=pred_array,
)
# Check convergence rates across a sequence of results
convergence = validator.check_convergence_rates(
    results_sequence=[result_32, result_64, result_128],
    tolerances=[1e-3, 1e-4, 1e-5],
)
# Generate detailed error analysis
error_analysis = validator.generate_error_analysis(
    predictions=pred_array,
    ground_truth=truth_array,
)
Key Features:
- Reference method comparison with pluggable solvers
- Convergence rate analysis across resolution sequences
- Detailed error analysis with spatial/temporal patterns
- Chemical accuracy assessment (delegates to ChemicalAccuracyValidator for detailed analysis)
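Convergence rate analysis across a resolution sequence is often expressed as an observed order of convergence: if the error behaves like C · h^p with grid spacing h proportional to 1/N, then p ≈ log(e_coarse / e_fine) / log(N_fine / N_coarse). The sketch below applies that standard formula to synthetic errors; it is an assumption about the kind of quantity involved, not a description of what `check_convergence_rates` returns.

```python
import math


def observed_convergence_orders(resolutions, errors):
    """Observed order between each pair of consecutive resolutions."""
    orders = []
    for i in range(len(resolutions) - 1):
        error_ratio = errors[i] / errors[i + 1]
        refinement_ratio = resolutions[i + 1] / resolutions[i]
        orders.append(math.log(error_ratio) / math.log(refinement_ratio))
    return orders


resolutions = [32, 64, 128]
errors = [4.0e-3, 1.0e-3, 2.5e-4]  # synthetic: error quarters when resolution doubles
orders = observed_convergence_orders(resolutions, errors)
print([round(p, 3) for p in orders])
```

Orders that drift downward at the finest resolutions usually indicate that another error source (for example, training error or floating-point precision) has started to dominate.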
AnalysisEngine¶
The AnalysisEngine provides statistical analysis and performance comparison:
from opifex.benchmarking import AnalysisEngine
analyzer = AnalysisEngine(significance_threshold=0.05)
# Multi-operator comparison (single run per operator)
comparison = analyzer.compare_operators(
    results_dict={"FNO": fno_result, "DeepONet": deeponet_result, "PINN": pinn_result}
)
print(f"Overall winner: {comparison.overall_winner}")
print(f"Rankings: {comparison.performance_rankings}")
# Multi-run statistical significance testing
significance = analyzer.test_statistical_significance_multi_run(
    multi_run_results={
        "FNO": [fno_run1, fno_run2, fno_run3],
        "DeepONet": [don_run1, don_run2, don_run3],
    }
)
# Scaling behavior analysis
scaling = analyzer.analyze_scaling_behavior(
    performance_data={32: result_32, 64: result_64, 128: result_128}
)
print(f"Complexity estimates: {scaling.complexity_estimates}")
# Performance insights for a single result
insights = analyzer.generate_performance_insights(result=fno_result)
print(f"Key insights: {insights.key_insights}")
print(f"Bottlenecks: {insights.performance_bottlenecks}")
Key Features:
- Multi-operator performance comparison with per-metric rankings
- Statistical significance testing via calibrax (Welch t-test, Mann-Whitney U)
- Scaling behavior analysis with complexity estimation
- Performance insights with bottleneck detection
- Operator recommendations by problem type and domain
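To illustrate what per-metric rankings and an overall winner can look like, the sketch below ranks operators on each metric and picks the operator with the lowest rank sum. The aggregation rule, the lower-is-better assumption, and the sample numbers are hypothetical; AnalysisEngine may aggregate differently.

```python
def rank_operators(results, lower_is_better=True):
    """Rank operators per metric and pick the one with the lowest rank sum.

    results: {operator: {metric: value}}; every operator must report every metric.
    """
    metrics = sorted({m for values in results.values() for m in values})
    rankings = {}
    rank_sums = {op: 0 for op in results}
    for metric in metrics:
        ordered = sorted(
            results,
            key=lambda op: results[op][metric],
            reverse=not lower_is_better,
        )
        rankings[metric] = ordered
        for rank, op in enumerate(ordered, start=1):
            rank_sums[op] += rank
    overall_winner = min(rank_sums, key=rank_sums.get)
    return rankings, overall_winner


# Illustrative values only
results = {
    "FNO": {"mse": 0.0012, "relative_error": 0.034, "runtime_s": 1.23},
    "DeepONet": {"mse": 0.0031, "relative_error": 0.052, "runtime_s": 0.80},
    "PINN": {"mse": 0.0100, "relative_error": 0.080, "runtime_s": 5.00},
}
rankings, winner = rank_operators(results)
print(rankings)
print(winner)
```

A rank sum treats all metrics as equally important, which is a design choice; a weighted sum is a straightforward variation when some metrics matter more.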
ResultsManager¶
The ResultsManager handles JSON persistence and publication output:
from opifex.benchmarking import ResultsManager
manager = ResultsManager(storage_path="./benchmark_results")
# Save results
result_id = manager.save_benchmark_results(result)
# Load a specific result
loaded = manager.load_result(result_id)
# Query stored results
matching = manager.query_results(
    name="darcy",
    dataset="darcy_flow",
    metric_filter={"mse": (0.0, 0.01)},
)
# Generate publication plots
plots = manager.export_publication_plots(
    results=[result1, result2],
    plot_type="comparison",
    output_format="png",
)
# Generate LaTeX tables
table_path = manager.generate_comparison_tables(
    operators=["FNO", "DeepONet"],
    metrics=["mse", "relative_error"],
    output_format="latex",
)
Key Features:
- JSON-based database persistence
- Publication-quality plot generation
- LaTeX/HTML/CSV table generation
- Query functionality with metric filtering
- Database statistics and export
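The combination of JSON persistence and metric filtering can be pictured with a small stand-alone sketch. It assumes one JSON file per result, substring matching on the name, and inclusive `(low, high)` ranges for `metric_filter`; these semantics and the helper names are guesses for illustration, not the documented behavior of ResultsManager.

```python
import json
import tempfile
from pathlib import Path


def save_result(storage_dir, result_id, record):
    """Write one result record to <storage_dir>/<result_id>.json."""
    path = Path(storage_dir) / f"{result_id}.json"
    path.write_text(json.dumps(record, indent=2))
    return path


def query_results(storage_dir, name=None, metric_filter=None):
    """Return records matching a name substring and inclusive metric ranges."""
    matches = []
    for path in sorted(Path(storage_dir).glob("*.json")):
        record = json.loads(path.read_text())
        if name is not None and name not in record["name"]:
            continue
        in_range = all(
            metric in record["metrics"] and low <= record["metrics"][metric] <= high
            for metric, (low, high) in (metric_filter or {}).items()
        )
        if in_range:
            matches.append(record)
    return matches


with tempfile.TemporaryDirectory() as storage:
    save_result(storage, "run_a", {"name": "darcy_flow_fno", "metrics": {"mse": 0.0012}})
    save_result(storage, "run_b", {"name": "darcy_flow_deeponet", "metrics": {"mse": 0.05}})
    save_result(storage, "run_c", {"name": "burgers_fno", "metrics": {"mse": 0.002}})
    found = query_results(storage, name="darcy", metric_filter={"mse": (0.0, 0.01)})

print([record["name"] for record in found])
```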
BenchmarkRunner¶
The BenchmarkRunner orchestrates end-to-end benchmarking workflows:
from opifex.benchmarking import BenchmarkRunner
runner = BenchmarkRunner(
    registry=registry,
    validator=validator,
    analyzer=analyzer,
    results_manager=manager,
    output_dir="./benchmark_results",
)
# Run full benchmark suite
results = runner.run_comprehensive_benchmark(
    operators=["FNO", "DeepONet"],
    benchmarks=["darcy_flow", "navier_stokes"],
    validate_results=True,
    generate_analysis=True,
)
# Run domain-specific suite
domain_results = runner.execute_domain_specific_suite(domain="fluid_dynamics")
# Generate publication report
report = runner.generate_publication_report(
    results=results,
    title="Neural Operator Comparison on Fluid Dynamics",
)
print(f"Key findings: {report.key_findings}")
print(f"Tables: {report.comparison_tables}")
Key Features:
- End-to-end workflow orchestration
- Component integration with registry, validator, analyzer, results manager
- Domain-specific suite execution
- Publication report generation (PublicationReport dataclass)
- Database update functionality
Adapters¶
The adapters module bridges opifex BenchmarkResult objects to calibrax Run objects:
from opifex.benchmarking.adapters import results_to_run, default_metric_defs
# Convert benchmark results to a calibrax Run for cross-tool analysis
run = results_to_run(
    results=[result1, result2, result3],
    commit="abc123",
    branch="main",
    metric_defs=default_metric_defs(),
)
Profiling¶
The profiling subsystem delegates hardware detection, roofline analysis, FLOPS counting, and compilation profiling to calibrax while providing an opifex-specific harness and event coordinator:
from opifex.benchmarking.profiling import (
    OpifexProfilingHarness,
    EventCoordinator,
    # From calibrax:
    CompilationProfiler,
    FlopsCounter,
    ResourceMonitor,
    RooflineAnalyzer,
    detect_hardware_specs,
    analyze_complexity,
)
# Profile a neural operator
harness = OpifexProfilingHarness(
    enable_hardware_profiling=True,
    enable_roofline_analysis=True,
)
with harness.profiling_session():
    metrics, report = harness.profile_neural_operator(
        operator=fno_model,
        inputs=[input_array],
        operation_name="FNO forward pass",
    )
    print(report.render())
Usage¶
Custom Domain Configuration¶
from opifex.benchmarking.benchmark_registry import DomainConfig
# Define custom domain with specific tolerances and metrics
config = DomainConfig(
    name="custom_physics",
    tolerance_ranges={
        "energy_conservation": (1e-7, 1e-5),
        "momentum_conservation": (1e-6, 1e-4),
    },
    required_metrics=["l2_error", "max_error", "physics_residual"],
    reference_methods=["analytical", "high_fidelity_simulation"],
)
Statistical Analysis¶
# Multi-run significance testing delegates to calibrax
significance = analyzer.test_statistical_significance_multi_run(
    multi_run_results={
        "FNO": fno_runs,
        "DeepONet": deeponet_runs,
    }
)
# Results include Welch t-test and Mann-Whitney U per metric pair
for pair, metrics in significance.items():
    for metric_name, stats in metrics.items():
        print(f"{pair} / {metric_name}: p={stats.get('p_value', 'N/A')}")
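For readers unfamiliar with the Welch t-test mentioned above, the following stdlib-only sketch computes its test statistic and the Welch-Satterthwaite degrees of freedom for two samples with unequal variances. It is a textbook formulation with synthetic data, not calibrax's code; obtaining the p-value additionally requires the Student t distribution (for example, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same test).

```python
import math
import statistics


def welch_t_test(sample_a, sample_b):
    """Return (t_statistic, degrees_of_freedom) for Welch's t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    var_a = statistics.variance(sample_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(sample_b)
    se_a, se_b = var_a / n_a, var_b / n_b  # squared standard errors of the means
    t_stat = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation of the degrees of freedom
    dof = (se_a + se_b) ** 2 / (se_a**2 / (n_a - 1) + se_b**2 / (n_b - 1))
    return t_stat, dof


# Synthetic per-run metric values for two operators
runs_a = [1.0, 2.0, 3.0, 4.0, 5.0]
runs_b = [2.0, 4.0, 6.0, 8.0, 10.0]
t_stat, dof = welch_t_test(runs_a, runs_b)
print(f"t = {t_stat:.4f}, dof = {dof:.4f}")
```

Unlike the pooled-variance t-test, this version does not assume the two operators have equal run-to-run variance, which is rarely true when comparing different architectures.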
Publication Output¶
# Generate publication report with tables and figures
report = runner.generate_publication_report(
    results=results,
    title="Neural Operator Benchmark Results",
)
# Access report fields
print(report.abstract)
print(report.methodology)
for finding in report.key_findings:
    print(f" - {finding}")
for table in report.comparison_tables:
    print(f" Table: {table}")
Testing and Validation¶
The benchmarking system includes tests for every component:
# Run all benchmarking tests
uv run pytest tests/benchmarking/ -v
# Run specific component tests
uv run pytest tests/benchmarking/test_benchmark_registry.py -v
uv run pytest tests/benchmarking/test_validation_framework.py -v
uv run pytest tests/benchmarking/test_analysis_engine.py -v
uv run pytest tests/benchmarking/test_adapters.py -v
uv run pytest tests/benchmarking/test_baseline_repository.py -v
uv run pytest tests/benchmarking/test_chemical_accuracy_validator.py -v
uv run pytest tests/benchmarking/test_conservation_validator.py -v
uv run pytest tests/benchmarking/test_operator_execution.py -v
Test Coverage:
- Component unit tests with database isolation
- Integration tests with end-to-end workflows
- Performance tests with timing validation
- Error handling tests with recovery scenarios
Best Practices¶
Database Management¶
- Use unique storage paths for different benchmark runs
- Implement proper cleanup in test environments
- Use the ResultsManager query API to find past results before re-running
Statistical Analysis¶
- Use appropriate sample sizes for statistical tests
- Apply multiple comparison corrections when needed
- Report confidence intervals alongside point estimates
- Validate assumptions before applying statistical tests
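When many operator pairs and metrics are tested at once, a multiple comparison correction keeps the family-wise error rate under control. The sketch below implements the Holm-Bonferroni step-down procedure as one possible choice; the document does not say which correction, if any, the framework applies, so treat this as a stand-alone illustration with made-up p-values.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return booleans marking which hypotheses are rejected under Holm's method."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    rejected = [False] * m
    for step, idx in enumerate(order):
        # The k-th smallest p-value (k = step + 1) is compared with alpha / (m - k + 1)
        if p_values[idx] <= alpha / (m - step):
            rejected[idx] = True
        else:
            break  # every remaining (larger) p-value is retained as well
    return rejected


p_values = [0.010, 0.040, 0.030, 0.005]  # made-up values for four comparisons
decisions = holm_bonferroni(p_values)
print(decisions)
```

Note that 0.030 and 0.040 would each pass an uncorrected 0.05 threshold but are retained here.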
Publication Output¶
- Follow journal-specific formatting requirements
- Include metadata in tables via ResultsManager.generate_comparison_tables()
- Use consistent color schemes across figures
- Provide clear captions and legends
Performance Optimization¶
- Use JAX JIT compilation (jax.jit) for computational kernels
- Cache frequently accessed results
- Use parallel processing for independent benchmarks
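The caching and parallel-processing advice above can be combined with nothing beyond the standard library. In this sketch, `load_reference` and `run_one` are hypothetical stand-ins for an expensive lookup and a single benchmark job; only the `functools.lru_cache` and `concurrent.futures` usage is the point.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache


@lru_cache(maxsize=None)
def load_reference(dataset):
    """Hypothetical expensive lookup, cached after the first call per dataset."""
    return sum(ord(ch) for ch in dataset) / 1000.0


def run_one(job):
    """Hypothetical independent benchmark job returning a synthetic score."""
    operator, dataset = job
    return operator, dataset, load_reference(dataset) * len(operator)


jobs = [
    (operator, dataset)
    for operator in ("FNO", "DeepONet")
    for dataset in ("darcy_flow", "navier_stokes")
]
with ThreadPoolExecutor(max_workers=4) as pool:
    outcomes = list(pool.map(run_one, jobs))  # map preserves the order of jobs

for operator, dataset, score in outcomes:
    print(operator, dataset, round(score, 3))
```

Threads suit jobs that wait on I/O or release the GIL inside native code; for CPU-bound pure-Python work, `ProcessPoolExecutor` offers the same `map` interface.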
Troubleshooting¶
Common Issues¶
- Storage Path Errors
- Memory Issues with Large Datasets
# Use batch processing via the evaluator
evaluator = BenchmarkEvaluator(output_dir="./results")
# Evaluate in smaller batches
for batch_x, batch_y in batched_data:
    result = evaluator.evaluate_model(
        model=model_fn,
        model_name="FNO",
        input_data=batch_x,
        target_data=batch_y,
        dataset_name="darcy_flow",
    )
- Statistical Test Failures
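The memory example above iterates over `batched_data` without defining it. One way to build such an iterable is a small generator that slices inputs and targets in lockstep; `iter_batches` is a hypothetical helper, and it should work with any sequence type that supports `len()` and slicing, including NumPy and JAX arrays.

```python
def iter_batches(inputs, targets, batch_size):
    """Yield (input_batch, target_batch) slices of at most batch_size items."""
    if len(inputs) != len(targets):
        raise ValueError("inputs and targets must have the same length")
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(inputs), batch_size):
        stop = start + batch_size
        yield inputs[start:stop], targets[start:stop]


inputs = list(range(10))
targets = [x * 2 for x in inputs]
batched_data = list(iter_batches(inputs, targets, batch_size=4))
print([len(batch_x) for batch_x, _ in batched_data])
```

When memory is the constraint, iterate over the generator directly instead of wrapping it in `list(...)`, so that only one batch is held at a time.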
Performance Optimization¶
# Enable 64-bit precision in JAX (this flag does not control JIT;
# wrap computational kernels in jax.jit to compile them)
import jax
jax.config.update("jax_enable_x64", True)
# Query results efficiently with filters
results = manager.query_results(
    name="darcy",
    metric_filter={"mse": (0.0, 0.01)},
)
Future Features¶
Planned Features¶
- Automated hyperparameter optimization for benchmark configurations
- Multi-GPU benchmarking for large-scale experiments
- Real-time benchmarking with streaming results
- Interactive dashboards for result exploration
Research Directions¶
- Uncertainty-aware benchmarking with probabilistic metrics
- Transfer learning evaluation across different domains
- Robustness testing with adversarial examples