Benchmarking API Reference¶

Overview¶

Benchmarking tools for scientific machine learning methods. The module delegates core types (BenchmarkResult, Metric, Run) and statistical analysis to calibrax while providing domain-specific evaluation, validation, and profiling on top.

Benchmark Registry¶

Benchmark Registry for Opifex Advanced Benchmarking System

Manages available benchmarks and neural operators with domain organization. Provides registration, discovery, and configuration management for the full benchmarking ecosystem.

DomainConfig `dataclass` ¶

DomainConfig(*, name: str, tolerance_ranges: dict[str, tuple[float, float]] = dict(), required_metrics: list[str] = list(), reference_methods: list[str] = list(), default_problem_sizes: list[int] = list())

Configuration for a specific scientific domain.

BenchmarkConfig `dataclass` ¶

BenchmarkConfig(*, name: str, domain: str, problem_type: str, input_shape: tuple[int, ...], output_shape: tuple[int, ...], dataset_path: str | None = None, reference_solution_path: str | None = None, physics_constraints: dict[str, Any] = dict(), computational_requirements: dict[str, Any] = dict())

Configuration for a specific benchmark.

BenchmarkRegistry ¶

BenchmarkRegistry(config_path: str | None = None)

Manages available benchmarks and neural operators with domain organization.

This registry provides centralized management of:

- Neural operator architectures available for benchmarking
- Benchmark problems organized by scientific domain
- Domain-specific configurations and requirements
- Compatibility checking between operators and benchmarks

Parameters:

Name	Type	Description	Default
`config_path`	`str \| None`	Path to registry configuration file	`None`

save_registry ¶

save_registry() -> None

Save registry configuration to file.

register_operator ¶

register_operator(operator_class: type, metadata: dict[str, Any] | None = None) -> None

Register a neural operator for benchmarking.

Parameters:

Name	Type	Description	Default
`operator_class`	`type`	Neural operator class to register	required
`metadata`	`dict[str, Any] \| None`	Additional metadata about the operator	`None`

register_benchmark ¶

register_benchmark(benchmark_config: BenchmarkConfig) -> None

Register a benchmark configuration.

Parameters:

Name	Type	Description	Default
`benchmark_config`	`BenchmarkConfig`	Benchmark configuration to register	required

get_benchmark_suite ¶

get_benchmark_suite(domain: str) -> list[BenchmarkConfig]

Get all benchmarks for a specific domain.

Parameters:

Name	Type	Description	Default
`domain`	`str`	Scientific domain name	required

Returns:

Type	Description
`list[BenchmarkConfig]`	List of benchmark configurations for the domain

list_compatible_operators ¶

list_compatible_operators(benchmark_name: str) -> list[str]

Get list of operators compatible with a benchmark.

Parameters:

Name	Type	Description	Default
`benchmark_name`	`str`	Name of the benchmark	required

Returns:

Type	Description
`list[str]`	List of compatible operator names

get_domain_specific_config ¶

get_domain_specific_config(domain: str) -> DomainConfig

Get configuration for a specific domain.

Parameters:

Name	Type	Description	Default
`domain`	`str`	Domain name	required

Returns:

Type	Description
`DomainConfig`	Domain configuration

Raises:

Type	Description
`ValueError`	If domain not found

get_operator_class ¶

get_operator_class(operator_name: str) -> type

Get operator class by name.

Parameters:

Name	Type	Description	Default
`operator_name`	`str`	Name of the operator	required

Returns:

Type	Description
`type`	Operator class

Raises:

Type	Description
`ValueError`	If operator not found

get_operator_metadata ¶

get_operator_metadata(operator_name: str) -> dict[str, Any]

Get metadata for a registered operator.

Parameters:

Name	Type	Description	Default
`operator_name`	`str`	Name of the operator.	required

Returns:

Type	Description
`dict[str, Any]`	Metadata dictionary for the operator, or empty dict if not found.

get_benchmark_config ¶

get_benchmark_config(benchmark_name: str) -> BenchmarkConfig

Get benchmark configuration by name.

Parameters:

Name	Type	Description	Default
`benchmark_name`	`str`	Name of the benchmark	required

Returns:

Type	Description
`BenchmarkConfig`	Benchmark configuration

Raises:

Type	Description
`ValueError`	If benchmark not found

list_available_domains ¶

list_available_domains() -> list[str]

Get list of available domains.

list_available_operators ¶

list_available_operators() -> list[str]

Get list of available operators.

list_available_benchmarks ¶

list_available_benchmarks() -> list[str]

Get list of available benchmarks.

auto_discover_operators ¶

auto_discover_operators() -> None

Auto-discover neural operators from opifex.neural.operators module.

generate_compatibility_report ¶

generate_compatibility_report() -> dict[str, Any]

Generate a report of benchmark-operator compatibility.

Returns:

Type	Description
`dict[str, Any]`	Full compatibility report

Benchmark Runner¶

Benchmark Runner for Opifex Advanced Benchmarking System

Orchestrates complete benchmarking pipeline execution. Provides end-to-end benchmarking workflows, domain-specific suites, publication report generation, and database updates.

DomainResults `dataclass` ¶

DomainResults(*, domain: str, benchmark_results: dict[str, dict[str, BenchmarkResult]], validation_reports: dict[str, dict[str, ValidationReport]] = dict(), comparison_reports: dict[str, ComparisonReport] = dict(), insight_reports: dict[str, dict[str, InsightReport]] = dict(), summary_statistics: dict[str, Any] = dict())

Results for a domain-specific benchmark suite.

PublicationReport `dataclass` ¶

PublicationReport(*, title: str, abstract: str, methodology: str, results_summary: dict[str, Any], comparison_tables: list[Path] = list(), figures: list[Path] = list(), key_findings: list[str] = list(), recommendations: list[str] = list(), appendix_data: dict[str, Any] = dict())

Publication-ready benchmark report.

BenchmarkFailure `dataclass` ¶

BenchmarkFailure(*, benchmark_name: str, operator_name: str, error: Exception)

A single (benchmark, operator) run that raised during execution.

Recording failures explicitly keeps a benchmark crash from silently disappearing: the suite continues, but the error remains queryable on the runner instead of being indistinguishable from an incompatible/skipped pair.
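The failure-recording behavior described above can be sketched as follows. This is a minimal illustration, not the runner's actual code: the `BenchmarkFailure` class here is a stand-in that only mirrors the documented fields, and `run_pairs` and `fake_run` are hypothetical helpers introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkFailure:
    """Stand-in mirroring the documented fields; not the library class."""

    benchmark_name: str
    operator_name: str
    error: Exception


def run_pairs(pairs, run_one):
    """Run every (benchmark, operator) pair, keeping results and failures apart."""
    results, failures = {}, []
    for benchmark_name, operator_name in pairs:
        try:
            results[(benchmark_name, operator_name)] = run_one(
                benchmark_name, operator_name
            )
        except Exception as exc:  # the suite continues, the error stays queryable
            failures.append(
                BenchmarkFailure(
                    benchmark_name=benchmark_name,
                    operator_name=operator_name,
                    error=exc,
                )
            )
    return results, failures


def fake_run(benchmark_name, operator_name):
    """Hypothetical runner: one operator crashes, the other returns a score."""
    if operator_name == "broken_op":
        raise RuntimeError("shape mismatch")
    return 0.01


results, failures = run_pairs(
    [("darcy", "good_op"), ("darcy", "broken_op")], fake_run
)
```

A skipped or incompatible pair would simply never appear in either collection, which is what keeps it distinguishable from a crashed one.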

BenchmarkRunner ¶

BenchmarkRunner(registry: BenchmarkRegistry | None = None, evaluator: BenchmarkEvaluator | None = None, validator: ValidationFramework | None = None, analyzer: AnalysisEngine | None = None, results_manager: ResultsManager | None = None, output_dir: str = './benchmark_results')

Orchestrates complete benchmarking pipeline execution.

This runner provides end-to-end benchmarking capabilities including:

- Full multi-operator benchmarking across domains
- Domain-specific benchmark suite execution with validation
- Publication-ready report and figure generation
- Automated benchmark database updates and maintenance

Parameters:

Name	Type	Description	Default
`registry`	`BenchmarkRegistry \| None`	Benchmark registry (creates default if None)	`None`
`evaluator`	`BenchmarkEvaluator \| None`	Benchmark evaluator (creates default if None)	`None`
`validator`	`ValidationFramework \| None`	Validation framework (creates default if None)	`None`
`analyzer`	`AnalysisEngine \| None`	Analysis engine (creates default if None)	`None`
`results_manager`	`ResultsManager \| None`	Results manager (creates default if None)	`None`
`output_dir`	`str`	Output directory for results	`'./benchmark_results'`

run_comprehensive_benchmark ¶

run_comprehensive_benchmark(operators: list[str] | None = None, benchmarks: list[str] | None = None, validate_results: bool = True, generate_analysis: bool = True) -> dict[str, dict[str, BenchmarkResult]]

Run full benchmark across multiple operators and problems.

Parameters:

Name	Type	Description	Default
`operators`	`list[str] \| None`	List of operator names (uses all available if None)	`None`
`benchmarks`	`list[str] \| None`	List of benchmark names (uses all available if None)	`None`
`validate_results`	`bool`	Whether to run validation framework	`True`
`generate_analysis`	`bool`	Whether to run analysis engine	`True`

Returns:

Type	Description
`dict[str, dict[str, BenchmarkResult]]`	Nested dictionary: benchmark_name -> operator_name -> BenchmarkResult

execute_domain_specific_suite ¶

execute_domain_specific_suite(domain: str) -> DomainResults

Execute benchmark suite for a specific scientific domain.

Parameters:

Name	Type	Description	Default
`domain`	`str`	Scientific domain name	required

Returns:

Type	Description
`DomainResults`	Full domain-specific results

generate_publication_report ¶

generate_publication_report(results: dict[str, dict[str, BenchmarkResult]] | DomainResults, title: str | None = None) -> PublicationReport

Generate publication-ready report from benchmark results.

Parameters:

Name	Type	Description	Default
`results`	`dict[str, dict[str, BenchmarkResult]] \| DomainResults`	Benchmark results (either full or domain-specific)	required
`title`	`str \| None`	Report title (auto-generated if None)	`None`

Returns:

Type	Description
`PublicationReport`	Publication-ready report with figures and tables

update_benchmark_database ¶

update_benchmark_database() -> dict[str, Any]

Update benchmark database with latest results.

Returns:

Type	Description
`dict[str, Any]`	Database update summary

Evaluation Engine¶

Core benchmarking evaluation engine for Opifex framework.

This module provides model evaluation capabilities using calibrax for metrics and statistical analysis. BenchmarkEvaluator orchestrates evaluation runs, profiling, and result management.

BenchmarkEvaluator ¶

BenchmarkEvaluator(output_dir: str = './benchmark_results', save_detailed_results: bool = True, enable_gpu_profiling: bool = False)

Main benchmark evaluator for Opifex models.

Provides full evaluation capabilities including model assessment, performance profiling, batch evaluation, and result management.

Parameters:

Name	Type	Description	Default
`output_dir`	`str`	Directory for saving results.	`'./benchmark_results'`
`save_detailed_results`	`bool`	Whether to save detailed results to files.	`True`
`enable_gpu_profiling`	`bool`	Whether to enable GPU profiling.	`False`

evaluate_model ¶

evaluate_model(model: Any, model_name: str, input_data: Array | tuple[Array, ...], target_data: Array, dataset_name: str, forward_fn: Callable | None = None, custom_metrics: dict[str, Callable] | None = None) -> BenchmarkResult

Evaluate a model on given data with extensive metrics.

Parameters:

Name	Type	Description	Default
`model`	`Any`	Model to evaluate.	required
`model_name`	`str`	Name identifier for the model.	required
`input_data`	`Array \| tuple[Array, ...]`	Input data for evaluation.	required
`target_data`	`Array`	Expected target outputs.	required
`dataset_name`	`str`	Name of the dataset being used.	required
`forward_fn`	`Callable \| None`	Optional custom forward function.	`None`
`custom_metrics`	`dict[str, Callable] \| None`	Optional dictionary of custom metric functions.	`None`

Returns:

Type	Description
`BenchmarkResult`	BenchmarkResult with evaluation metrics and metadata.

batch_evaluate ¶

batch_evaluate(models: list[tuple[str, Any]], datasets: list[tuple[str, Any, Array, Callable | None]]) -> list[BenchmarkResult]

Evaluate multiple models on multiple datasets.

Parameters:

Name	Type	Description	Default
`models`	`list[tuple[str, Any]]`	List of (model_name, model) tuples.	required
`datasets`	`list[tuple[str, Any, Array, Callable \| None]]`	List of (dataset_name, input_data, target_data, forward_fn) tuples.	required

Returns:

Type	Description
`list[BenchmarkResult]`	List of BenchmarkResults for all model-dataset combinations.
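To make the shape of the two arguments concrete, the sketch below enumerates the model-dataset combinations implied by the documented tuple layouts. It does not call `batch_evaluate`; `enumerate_pairs` is a hypothetical helper, the data values are made up, and the iteration order used by the real method is an assumption.

```python
from itertools import product

# (model_name, model) tuples; plain objects stand in for real models.
models = [("fno", object()), ("deeponet", object())]

# (dataset_name, input_data, target_data, forward_fn) tuples with toy data.
datasets = [
    ("darcy", [0.0, 1.0], [0.0, 1.0], None),
    ("burgers", [0.5], [0.5], None),
    ("advection", [0.1], [0.1], None),
]


def enumerate_pairs(models, datasets):
    """List every (model_name, dataset_name) combination."""
    pairs = []
    for (model_name, _model), (dataset_name, _inputs, _targets, _fn) in product(
        models, datasets
    ):
        pairs.append((model_name, dataset_name))
    return pairs


pairs = enumerate_pairs(models, datasets)
```

With two models and three datasets, one result per combination gives six entries in the returned list.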

profile_model_performance ¶

profile_model_performance(model: Any, input_data: Array | tuple[Array, ...], num_runs: int = 10, forward_fn: Callable | None = None) -> dict[str, float]

Profile model performance with multiple runs.

Parameters:

Name	Type	Description	Default
`model`	`Any`	Model to profile.	required
`input_data`	`Array \| tuple[Array, ...]`	Input data for profiling.	required
`num_runs`	`int`	Number of runs for statistics.	`10`
`forward_fn`	`Callable \| None`	Custom forward function.	`None`

Returns:

Type	Description
`dict[str, float]`	Dictionary with performance statistics.
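Multi-run profiling of this kind usually means timing repeated calls and summarizing the samples. The sketch below shows that general pattern with the standard library only; `profile_callable`, the `warmup` parameter, and the returned key names are assumptions for illustration and may differ from what `profile_model_performance` reports.

```python
import statistics
import time


def profile_callable(fn, args, num_runs=10, warmup=1):
    """Time repeated calls of fn(*args) and summarize the wall-clock samples."""
    for _ in range(warmup):  # discard first-call overhead such as compilation
        fn(*args)
    timings = []
    for _ in range(num_runs):
        start = time.perf_counter()
        fn(*args)
        timings.append(time.perf_counter() - start)
    return {
        "mean_time": statistics.fmean(timings),
        "std_time": statistics.stdev(timings) if len(timings) > 1 else 0.0,
        "min_time": min(timings),
        "max_time": max(timings),
    }


data = [float(i) for i in range(10_000)]
stats = profile_callable(sum, (data,), num_runs=5)
```

When timing JAX computations, the output typically also needs `block_until_ready()` before the clock stops, since dispatch is asynchronous.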

load_results ¶

load_results() -> list[BenchmarkResult]

Load all benchmark results from files.

Returns:

Type	Description
`list[BenchmarkResult]`	List of BenchmarkResults.

generate_summary_report ¶

generate_summary_report() -> dict[str, Any]

Generate complete summary report of all evaluations.

Returns:

Type	Description
`dict[str, Any]`	Dictionary with summary statistics and analysis.

Validation Framework¶

Validation Framework for Opifex Advanced Benchmarking System.

Scientific accuracy validation against reference computational methods. Provides convergence analysis, chemical accuracy assessment, and error analysis for rigorous scientific computing validation.

Generic dataclasses (ConvergenceAnalysis, AccuracyAssessment) are replaced by calibrax.validation equivalents (ConvergenceResult, AccuracyResult).

ValidationReport `dataclass` ¶

ValidationReport(*, benchmark_name: str, reference_method: str, accuracy_metrics: dict[str, float], convergence_metrics: dict[str, float], chemical_accuracy_status: bool | None = None, tolerance_violations: list[str] = list(), validation_passed: bool = False, notes: str = '')

Report of validation results against reference methods.

ErrorAnalysis `dataclass` ¶

ErrorAnalysis(*, global_errors: dict[str, float], local_errors: dict[str, Array], error_distribution: dict[str, Any], outlier_analysis: dict[str, Any], spatial_error_patterns: dict[str, Any] | None = None, temporal_error_patterns: dict[str, Any] | None = None)

Error analysis between predictions and ground truth.

Physics-specific: includes spatial and temporal pattern detection not available in calibrax generic validation.

ValidationFramework ¶

ValidationFramework(default_tolerances: list[float] | None = None, reference_methods: dict[str, Callable] | None = None)

Scientific accuracy validation against reference computational methods.

Provides:

- Comparison against established computational methods (FEM, FDM, spectral)
- Convergence rate analysis across multiple tolerance levels
- Chemical accuracy assessment for quantum computing applications
- Statistical error analysis with spatial and temporal pattern detection

Parameters:

Name	Type	Description	Default
`default_tolerances`	`list[float] \| None`	Default tolerance levels for convergence testing.	`None`
`reference_methods`	`dict[str, Callable] \| None`	Dictionary of reference computational methods.	`None`

validate_against_reference ¶

validate_against_reference(result: BenchmarkResult, reference_method: str, reference_data: Array | None = None, predictions: Array | None = None) -> ValidationReport

Validate benchmark results against reference computational method.

Parameters:

Name	Type	Description	Default
`result`	`BenchmarkResult`	Benchmark result to validate.	required
`reference_method`	`str`	Name of reference method.	required
`reference_data`	`Array \| None`	Reference solution data (if available).	`None`
`predictions`	`Array \| None`	Raw model predictions (if available). Required for meaningful accuracy metrics when reference_data is provided.	`None`

Returns:

Type	Description
`ValidationReport`	Validation report with accuracy metrics and tolerance violations.

check_convergence_rates ¶

check_convergence_rates(results_sequence: list[BenchmarkResult], tolerances: list[float] | None = None) -> ConvergenceResult

Analyze convergence rates across multiple tolerance levels.

Delegates to calibrax.validation.check_convergence after extracting metric series from BenchmarkResult sequence.

Parameters:

Name	Type	Description	Default
`results_sequence`	`list[BenchmarkResult]`	Sequence of results at different tolerance levels.	required
`tolerances`	`list[float] \| None`	Tolerance levels tested.	`None`

Returns:

Type	Description
`ConvergenceResult`	ConvergenceResult from calibrax with rates and achievement flags.
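As a rough illustration of what "rates and achievement flags" can mean, the sketch below computes the textbook observed rate between consecutive levels, log(e1/e2) / log(h1/h2), and flags whether each error met its tolerance. This is not the calibrax implementation; `observed_rates` and the sample numbers are hypothetical.

```python
import math


def observed_rates(levels, errors):
    """Observed rate between consecutive levels: log(e1/e2) / log(h1/h2)."""
    rates = []
    for i in range(len(levels) - 1):
        error_ratio = errors[i] / errors[i + 1]
        level_ratio = levels[i] / levels[i + 1]
        rates.append(math.log(error_ratio) / math.log(level_ratio))
    return rates


tolerances = [1e-2, 1e-3, 1e-4]
errors = [4e-3, 4e-5, 4e-7]  # made-up: error drops 100x per 10x tighter level

rates = observed_rates(tolerances, errors)
achieved = [err <= tol for err, tol in zip(errors, tolerances)]
```

Here every rate is approximately 2, and every error falls below its tolerance.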

assess_chemical_accuracy ¶

assess_chemical_accuracy(result: BenchmarkResult, target_accuracy: float | None = None, accuracy_type: str = 'chemical_accuracy') -> AccuracyResult

Assess chemical accuracy for quantum computing applications.

Delegates to calibrax.validation.check_accuracy after extracting the appropriate metric from the BenchmarkResult.

Parameters:

Name	Type	Description	Default
`result`	`BenchmarkResult`	Benchmark result to assess.	required
`target_accuracy`	`float \| None`	Target accuracy threshold (defaults to domain standard).	`None`
`accuracy_type`	`str`	Type of accuracy being assessed.	`'chemical_accuracy'`

Returns:

Type	Description
`AccuracyResult`	AccuracyResult from calibrax with pass/fail and margin.
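A pass/fail check with a margin can be sketched in a few lines. The 1 kcal/mol threshold below is the commonly cited convention for chemical accuracy; whether it matches the "domain standard" default used here is an assumption, and `check_threshold` with its returned keys is a hypothetical stand-in rather than the calibrax API.

```python
CHEMICAL_ACCURACY_KCAL_PER_MOL = 1.0  # conventional threshold; assumed default


def check_threshold(error, target=CHEMICAL_ACCURACY_KCAL_PER_MOL):
    """Return pass/fail plus the signed margin (positive means headroom)."""
    margin = target - error
    return {"passed": error <= target, "margin": margin}


within = check_threshold(0.4)   # error in kcal/mol, made-up value
outside = check_threshold(1.5)  # exceeds the threshold by 0.5
```

A negative margin reports how far the result is from the target, which is often more useful than the boolean alone.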

generate_error_analysis ¶

generate_error_analysis(predictions: Array, ground_truth: Array, spatial_coords: Array | None = None, temporal_coords: Array | None = None) -> ErrorAnalysis

Generate error analysis for predictions vs ground truth.

Parameters:

Name	Type	Description	Default
`predictions`	`Array`	Model predictions.	required
`ground_truth`	`Array`	Ground truth data.	required
`spatial_coords`	`Array \| None`	Spatial coordinates (if available).	`None`
`temporal_coords`	`Array \| None`	Temporal coordinates (if available).	`None`

Returns:

Type	Description
`ErrorAnalysis`	ErrorAnalysis with global, local, distribution, and pattern data.
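The global-error and outlier parts of such an analysis can be illustrated with plain Python. This is only a sketch under stated assumptions: `summarize_errors`, the metric names, and the z-score outlier rule are illustrative choices, not the library's definitions, and the spatial and temporal pattern detection is not covered.

```python
import math
import statistics


def summarize_errors(predictions, ground_truth, z_threshold=2.0):
    """Compute global error metrics and flag unusually large pointwise errors."""
    residuals = [p - t for p, t in zip(predictions, ground_truth)]
    abs_err = [abs(r) for r in residuals]
    mse = statistics.fmean([r * r for r in residuals])
    global_errors = {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": statistics.fmean(abs_err),
        "max_error": max(abs_err),
    }
    # Flag points whose absolute error sits far above the mean absolute error.
    spread = statistics.pstdev(abs_err)
    outliers = [
        i
        for i, e in enumerate(abs_err)
        if spread > 0 and (e - global_errors["mae"]) / spread > z_threshold
    ]
    return global_errors, outliers


truth = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
preds = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 18.0]  # last point is far off
global_errors, outliers = summarize_errors(preds, truth)
```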

Analysis Engine¶

Analysis Engine for Opifex Advanced Benchmarking System.

Comparative analysis and performance insights generation for scientific computing benchmarks. Operator comparison and statistical testing delegate to calibrax.analysis and calibrax.statistics. Domain-specific recommendation logic and scaling analysis are retained here.

ComparisonReport `dataclass` ¶

ComparisonReport(*, benchmark_name: str, operators_compared: list[str], metric_comparisons: dict[str, dict[str, float]], performance_rankings: dict[str, list[str]], statistical_significance: dict[str, dict[str, bool]], winner_by_metric: dict[str, str], overall_winner: str, improvement_factors: dict[str, dict[str, float]] = dict())

Report comparing multiple operators on the same benchmark.

ScalingAnalysis `dataclass` ¶

ScalingAnalysis(*, operator_name: str, problem_sizes: list[int], scaling_metrics: dict[str, dict[int, float]], scaling_coefficients: dict[str, float], complexity_estimates: dict[str, str], efficiency_scores: dict[int, float], optimal_problem_size: int | None = None)

Analysis of scaling behavior across problem sizes.

InsightReport `dataclass` ¶

InsightReport(*, benchmark_name: str, operator_name: str, key_insights: list[str], performance_bottlenecks: list[str], optimization_suggestions: list[str], domain_specific_observations: list[str], confidence_level: float = 0.0)

Performance insights for a specific benchmark run.

RecommendationReport `dataclass` ¶

RecommendationReport(*, problem_type: str, domain: str, recommended_operators: list[dict[str, Any]], use_case_specific_recommendations: dict[str, str], performance_trade_offs: dict[str, str], implementation_considerations: list[str])

Recommendations for optimal operator selection.

AnalysisEngine ¶

AnalysisEngine(significance_threshold: float = 0.05)

Comparative analysis and performance insights for scientific benchmarks.

Provides:

- Multi-operator performance comparisons with statistical significance
- Scaling behavior analysis across problem sizes
- Performance insights and bottleneck identification
- Intelligent operator recommendations for specific use cases

Statistical significance testing delegates to calibrax.statistics (welch_t_test, mann_whitney_u) for multi-run comparisons.

Parameters:

Name	Type	Description	Default
`significance_threshold`	`float`	Threshold for statistical significance.	`0.05`

compare_operators ¶

compare_operators(results_dict: dict[str, BenchmarkResult]) -> ComparisonReport

Compare multiple operators on the same benchmark.

Delegates ranking and overall-winner determination to calibrax.analysis.compare_configurations(). Domain-specific features (improvement_factors, statistical_significance, weighted scoring) are retained here because calibrax lacks equivalents.

Parameters:

Name	Type	Description	Default
`results_dict`	`dict[str, BenchmarkResult]`	Dictionary mapping operator names to benchmark results.	required

Returns:

Type	Description
`ComparisonReport`	Comparison report with rankings and improvement factors.

test_statistical_significance_multi_run ¶

test_statistical_significance_multi_run(multi_run_results: dict[str, list[BenchmarkResult]]) -> dict[str, dict[str, dict[str, Any]]]

Test statistical significance with multiple runs per operator.

Delegates to calibrax.statistics.welch_t_test and mann_whitney_u for proper parametric and non-parametric testing.

Parameters:

Name	Type	Description	Default
`multi_run_results`	`dict[str, list[BenchmarkResult]]`	Operator names mapped to lists of results.	required

Returns:

Type	Description
`dict[str, dict[str, dict[str, Any]]]`	Pairwise significance results with p-values and statistics.
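For readers unfamiliar with the parametric test named above, the sketch below computes the textbook Welch t statistic and the Welch-Satterthwaite degrees of freedom for two samples of a metric. It is not the calibrax function; `welch_statistic` and the sample values are hypothetical, and the p-value is omitted because the standard library has no t-distribution CDF.

```python
import math
import statistics


def welch_statistic(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard errors of the two sample means (unequal variances).
    se2_a = statistics.variance(sample_a) / n_a
    se2_b = statistics.variance(sample_b) / n_b
    t_stat = (statistics.fmean(sample_a) - statistics.fmean(sample_b)) / math.sqrt(
        se2_a + se2_b
    )
    dof = (se2_a + se2_b) ** 2 / (se2_a**2 / (n_a - 1) + se2_b**2 / (n_b - 1))
    return t_stat, dof


# Made-up per-run metric values for two operators.
runs_a = [1.0, 2.0, 3.0, 4.0, 5.0]
runs_b = [2.0, 4.0, 6.0, 8.0, 10.0]
t_stat, dof = welch_statistic(runs_a, runs_b)
```

If SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same test and also returns the p-value to compare against `significance_threshold`.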

create_operator_recommendations ¶

create_operator_recommendations(problem_type: str, domain: str = 'general') -> RecommendationReport

Create operator recommendations for specific problem types.

Parameters:

Name	Type	Description	Default
`problem_type`	`str`	Type of problem (e.g., "pde_solving", "time_series").	required
`domain`	`str`	Scientific domain.	`'general'`

Returns:

Type	Description
`RecommendationReport`	Operator recommendation report.

analyze_scaling_behavior ¶

analyze_scaling_behavior(performance_data: dict[int, BenchmarkResult]) -> ScalingAnalysis

Analyze scaling behavior across different problem sizes.

Parameters:

Name	Type	Description	Default
`performance_data`	`dict[int, BenchmarkResult]`	Dictionary mapping problem sizes to benchmark results.	required

Returns:

Type	Description
`ScalingAnalysis`	Scaling behavior analysis.
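One common way to obtain a scaling coefficient is a least-squares fit on log-log axes, where the slope estimates the exponent in cost ≈ c · n^k. The sketch below shows that generic technique on made-up timings; `fit_power_law_exponent` is hypothetical and is not claimed to be how `analyze_scaling_behavior` derives its `scaling_coefficients`.

```python
import math


def fit_power_law_exponent(sizes, values):
    """Least-squares slope of log(value) against log(size)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in values]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    variance = sum((x - x_mean) ** 2 for x in xs)
    return covariance / variance


# Made-up runtimes in seconds following t = 1e-6 * n**2.
runtime_by_size = {64: 0.004096, 128: 0.016384, 256: 0.065536}
exponent = fit_power_law_exponent(
    list(runtime_by_size.keys()), list(runtime_by_size.values())
)
```

An exponent near 2 would correspond to a complexity estimate such as O(n^2).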

generate_performance_insights ¶

generate_performance_insights(result: BenchmarkResult) -> InsightReport

Generate performance insights for a benchmark run.

Parameters:

Name	Type	Description	Default
`result`	`BenchmarkResult`	Benchmark result to analyze.	required

Returns:

Type	Description
`InsightReport`	Performance insights report.

Results Manager¶

Results Manager for Opifex Advanced Benchmarking System.

Data persistence and publication-ready export capabilities. Provides results storage, publication plot generation, comparison tables, and benchmark database management. Each saved result is also persisted to a calibrax Store for cross-tool interoperability.

ResultsManager ¶

ResultsManager(storage_path: str = './benchmark_results', database_path: str | None = None)

Data persistence and publication-ready export capabilities.

Provides:

- Persistent storage of benchmark results with metadata
- calibrax Store write-through for cross-tool interoperability
- Publication-ready plot and table generation
- Benchmark database maintenance and querying
- Export formats for different publication venues

Parameters:

Name	Type	Description	Default
`storage_path`	`str`	Base path for storing benchmark results.	`'./benchmark_results'`
`database_path`	`str \| None`	Path to benchmark database file.	`None`

save_benchmark_results ¶

save_benchmark_results(result: BenchmarkResult, extra_metadata: dict[str, Any] | None = None) -> str

Save benchmark results with metadata.

Parameters:

Name	Type	Description	Default
`result`	`BenchmarkResult`	Benchmark result to save.	required
`extra_metadata`	`dict[str, Any] \| None`	Additional metadata to store alongside.	`None`

Returns:

Type	Description
`str`	Unique identifier for saved results.

load_results ¶

load_results(result_id: str) -> BenchmarkResult | None

Load benchmark result by ID.

Parameters:

Name	Type	Description	Default
`result_id`	`str`	Unique identifier for results.	required

Returns:

Type	Description
`BenchmarkResult \| None`	Loaded BenchmarkResult or None if not found.

query_results ¶

query_results(name: str | None = None, dataset: str | None = None, metric_filter: dict[str, tuple[float, float]] | None = None) -> list[dict[str, Any]]

Query benchmark database with filters.

Parameters:

Name	Type	Description	Default
`name`	`str \| None`	Filter by benchmark name.	`None`
`dataset`	`str \| None`	Filter by dataset tag.	`None`
`metric_filter`	`dict[str, tuple[float, float]] \| None`	Filter by metric ranges {metric: (min, max)}.	`None`

Returns:

Type	Description
`list[dict[str, Any]]`	List of matching database entries.
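The `{metric: (min, max)}` filter form can be illustrated with a small predicate. The entry layout (`name`, `dataset`, `metrics` keys), the inclusive bounds, and the `matches` helper are all assumptions made for this sketch; the actual database schema may differ.

```python
def matches(entry, name=None, dataset=None, metric_filter=None):
    """Return True if a database entry satisfies every supplied filter."""
    if name is not None and entry.get("name") != name:
        return False
    if dataset is not None and entry.get("dataset") != dataset:
        return False
    for metric, (low, high) in (metric_filter or {}).items():
        value = entry.get("metrics", {}).get(metric)
        # Entries missing the metric are treated as non-matching.
        if value is None or not (low <= value <= high):
            return False
    return True


# Hypothetical entries; the real entry format is not specified here.
entries = [
    {"name": "darcy_fno", "dataset": "darcy", "metrics": {"mse": 0.01}},
    {"name": "darcy_deeponet", "dataset": "darcy", "metrics": {"mse": 0.2}},
    {"name": "burgers_fno", "dataset": "burgers", "metrics": {"mse": 0.005}},
]

selected = [
    e["name"]
    for e in entries
    if matches(e, dataset="darcy", metric_filter={"mse": (0.0, 0.05)})
]
```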

get_database_statistics ¶

get_database_statistics() -> dict[str, Any]

Get statistics about the benchmark database.

Returns:

Type	Description
`dict[str, Any]`	Database statistics summary.

create_benchmark_database_entry ¶

create_benchmark_database_entry(result: BenchmarkResult) -> dict[str, Any]

Create standardized database entry for benchmark results.

Parameters:

Name	Type	Description	Default
`result`	`BenchmarkResult`	Benchmark result.	required

Returns:

Type	Description
`dict[str, Any]`	Standardized database entry dictionary.

export_database ¶

export_database(export_path: str, output_format: str = 'json') -> None

Export entire benchmark database.

Parameters:

Name	Type	Description	Default
`export_path`	`str`	Path to export file.	required
`output_format`	`str`	Export format (`"json"`).	`'json'`

export_publication_plots ¶

export_publication_plots(results: list[BenchmarkResult], plot_type: Literal['comparison', 'scaling', 'convergence'] = 'comparison', output_format: str = 'png') -> list[Path]

Export publication-ready plots.

Parameters:

Name	Type	Description	Default
`results`	`list[BenchmarkResult]`	List of benchmark results to plot.	required
`plot_type`	`Literal['comparison', 'scaling', 'convergence']`	Type of plot to generate.	`'comparison'`
`output_format`	`str`	Output format (png, pdf, svg).	`'png'`

Returns:

Type	Description
`list[Path]`	List of paths to generated plot files.

generate_comparison_tables ¶

generate_comparison_tables(operators: list[str], metrics: list[str], output_format: Literal['latex', 'html', 'csv'] = 'latex') -> Path

Generate publication-ready comparison tables.

Queries the local benchmark database and generates a formatted comparison table in the requested output format.

Parameters:

Name	Type	Description	Default
`operators`	`list[str]`	List of operator names to include.	required
`metrics`	`list[str]`	List of metrics to include in table.	required
`output_format`	`Literal['latex', 'html', 'csv']`	Output format.	`'latex'`

Returns:

Type	Description
`Path`	Path to generated table file.

Baseline Repository¶

Baseline Repository Module.

Stores and retrieves baseline performance metrics for PDEBench datasets. Delegates persistence to calibrax.storage.Store while retaining domain-specific comparison and reporting logic.

BaselineRepository ¶

BaselineRepository(baseline_data_path: str | None = None, store_path: Path | str | None = None)

Repository for storing and retrieving baseline performance metrics.

Manages a database of baseline performance metrics for standard PDEBench datasets, enabling comparison of new models against established benchmarks. New baselines are persisted via a calibrax.storage.Store.

Parameters:

Name	Type	Description	Default
`baseline_data_path`	`str \| None`	Path to baseline data file (JSON format).	`None`
`store_path`	`Path \| str \| None`	Directory for calibrax Store persistence.	`None`

save_baselines ¶

save_baselines() -> None

Save baseline data to file.

get_baseline_metrics ¶

get_baseline_metrics(dataset_name: str, model_type: str) -> dict[str, float]

Get baseline metrics for a specific dataset and model type.

Parameters:

Name	Type	Description	Default
`dataset_name`	`str`	Name of the dataset	required
`model_type`	`str`	Type of model (e.g., "fno", "deeponet")	required

Returns:

Type	Description
`dict[str, float]`	Dictionary of baseline metrics

Raises:

Type	Description
`ValueError`	If dataset or model type not found

get_available_datasets ¶

get_available_datasets() -> list[str]

Get list of datasets with baseline data.

get_available_model_types ¶

get_available_model_types(dataset_name: str) -> list[str]

Get list of model types with baselines for a dataset.

Parameters:

Name	Type	Description	Default
`dataset_name`	`str`	Name of the dataset	required

Returns:

Type	Description
`list[str]`	List of available model types

add_baseline ¶

add_baseline(dataset_name: str, model_type: str, metrics: dict[str, float], source: str = 'User Added', model_config: dict[str, Any] | None = None, notes: str | None = None) -> None

Add a new baseline to the repository.

Persists both to the JSON file and to the calibrax Store.

Parameters:

Name	Type	Description	Default
`dataset_name`	`str`	Name of the dataset.	required
`model_type`	`str`	Type of model.	required
`metrics`	`dict[str, float]`	Performance metrics.	required
`source`	`str`	Source of the baseline data.	`'User Added'`
`model_config`	`dict[str, Any] \| None`	Model configuration details.	`None`
`notes`	`str \| None`	Additional notes.	`None`

compare_to_baseline ¶

compare_to_baseline(dataset_name: str, model_type: str, test_metrics: dict[str, float], metrics_to_compare: list[str] | None = None) -> dict[str, dict[str, float]]

Compare test metrics to baseline metrics.

Parameters:

Name	Type	Description	Default
`dataset_name`	`str`	Name of the dataset	required
`model_type`	`str`	Type of model	required
`test_metrics`	`dict[str, float]`	Metrics to compare against baseline	required
`metrics_to_compare`	`list[str] \| None`	Specific metrics to compare (None for all)	`None`

Returns:

Type	Description
`dict[str, dict[str, float]]`	Dictionary with comparison results including relative improvements
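A relative-improvement comparison can be sketched as below. It assumes lower-is-better metrics and defines improvement as (baseline - test) / baseline; the helper name, the result keys, and that sign convention are illustrative assumptions rather than the documented behavior of `compare_to_baseline`.

```python
def relative_improvements(baseline, test, metrics_to_compare=None):
    """Per-metric comparison; positive values mean the test beat the baseline."""
    names = metrics_to_compare or sorted(set(baseline) & set(test))
    report = {}
    for name in names:
        base, new = baseline[name], test[name]
        report[name] = {
            "baseline": base,
            "test": new,
            # Assumes lower is better; guard against a zero baseline.
            "relative_improvement": (base - new) / base if base != 0 else 0.0,
        }
    return report


baseline_metrics = {"mse": 0.02, "mae": 0.1}   # made-up values
test_metrics = {"mse": 0.015, "mae": 0.12}
comparison = relative_improvements(baseline_metrics, test_metrics)
```

In this toy case the MSE improves by about 25% while the MAE regresses by about 20%.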

get_best_baseline ¶

get_best_baseline(dataset_name: str, metric: str = 'mse') -> tuple[str, dict[str, float]]

Get the best baseline for a dataset based on a specific metric.

Parameters:

Name	Type	Description	Default
`dataset_name`	`str`	Name of the dataset	required
`metric`	`str`	Metric to use for comparison	`'mse'`

Returns:

Type	Description
`tuple[str, dict[str, float]]`	Tuple of (model_type, metrics) for the best baseline
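Selecting a best baseline amounts to picking the model type that optimizes one metric. The sketch below assumes "best" means the lowest value, which is natural for the default `'mse'` but is an assumption for other metrics; `best_by_metric` and the sample data are hypothetical.

```python
def best_by_metric(baselines, metric="mse"):
    """Return (model_type, metrics) with the lowest value of the given metric."""
    candidates = {m: vals for m, vals in baselines.items() if metric in vals}
    if not candidates:
        raise ValueError(f"no baseline reports metric {metric!r}")
    best = min(candidates, key=lambda m: candidates[m][metric])
    return best, candidates[best]


# Made-up baselines for one dataset, keyed by model type.
darcy_baselines = {
    "fno": {"mse": 0.012, "mae": 0.08},
    "deeponet": {"mse": 0.020, "mae": 0.07},
}
best_model, best_metrics = best_by_metric(darcy_baselines, metric="mse")
```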

generate_baseline_summary ¶

generate_baseline_summary() -> dict[str, Any]

Generate a full summary of all baselines.

Returns:

Type	Description
`dict[str, Any]`	Dictionary with baseline summary statistics

Operator Executor¶

Operator Executor - Runs actual Opifex operators for benchmarking.

This module replaces the mock execution in BenchmarkRunner with real operator training and evaluation.

ExecutionConfig `dataclass` ¶

ExecutionConfig(*, n_epochs: int = 100, batch_size: int = 32, learning_rate: float = 0.001, warmup_steps: int = 5, eval_frequency: int = 10, use_mixed_precision: bool = False, seed: int = 42)

Configuration for benchmark execution.

OperatorExecutor ¶

OperatorExecutor(config: ExecutionConfig | None = None)

Executes actual Opifex operators for benchmarking.

This class provides the core execution logic that was missing from the original BenchmarkRunner implementation. It uses:

- Real Opifex operators (TFNO, DeepONet, etc.)
- Real Opifex data loaders (create_darcy_loader, etc.)
- Flax NNX 0.11.0+ optimizer pattern
- calibrax.metrics for evaluation (DRY)

Parameters:

Name	Type	Description	Default
`config`	`ExecutionConfig \| None`	Execution configuration. Uses defaults if None.	`None`

execute_training_benchmark ¶

execute_training_benchmark(operator_class: type, operator_config: dict[str, Any], train_loader: Any, test_loader: Any, benchmark_name: str) -> BenchmarkResult

Execute a training benchmark with actual operator.

Parameters:

Name	Type	Description	Default
`operator_class`	`type`	Opifex operator class to instantiate	required
`operator_config`	`dict[str, Any]`	Configuration dict for operator	required
`train_loader`	`Any`	Training data loader (from opifex.data.loaders)	required
`test_loader`	`Any`	Test data loader	required
`benchmark_name`	`str`	Name of benchmark for results	required

Returns:

Type	Description
`BenchmarkResult`	BenchmarkResult with real metrics from training

Adapters¶

Adapter for converting BenchmarkResult lists to calibrax Run objects.

Bridges the opifex benchmarking pipeline (which produces BenchmarkResult lists) with calibrax's Run-based analysis and storage APIs.

results_to_run ¶

results_to_run(results: list[BenchmarkResult], *, commit: str | None = None, branch: str | None = None, metric_defs: dict[str, MetricDef] | None = None) -> Run

Convert a list of BenchmarkResult objects to a calibrax Run.

Maps each BenchmarkResult to a Point:

- BenchmarkResult.name -> Point.name
- BenchmarkResult.tags["dataset"] -> Point.scenario (default: "unknown")
- BenchmarkResult.tags -> Point.tags
- BenchmarkResult.metrics -> Point.metrics (same Metric type)

Parameters:

Name	Type	Description	Default
`results`	`list[BenchmarkResult]`	List of benchmark results to convert.	required
`commit`	`str \| None`	Git commit hash to attach to the Run.	`None`
`branch`	`str \| None`	Git branch name to attach to the Run.	`None`
`metric_defs`	`dict[str, MetricDef] \| None`	Metric definitions for semantic interpretation.	`None`

Returns:

Type	Description
`Run`	A calibrax Run containing one Point per BenchmarkResult.
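The field mapping listed above can be sketched with stand-in types. `StubResult`, `StubPoint`, and `to_point` are hypothetical names used only for this illustration; the real `BenchmarkResult`, `Point`, and `Run` come from calibrax, and real metrics are `Metric` objects rather than the plain floats used here.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class StubResult:
    """Stand-in for BenchmarkResult (assumed shape, not the calibrax type)."""
    name: str
    tags: dict[str, str] = field(default_factory=dict)
    metrics: dict[str, float] = field(default_factory=dict)


@dataclass(frozen=True)
class StubPoint:
    """Stand-in for a calibrax Point (assumed shape)."""
    name: str
    scenario: str
    tags: dict[str, str]
    metrics: dict[str, float]


def to_point(result: StubResult) -> StubPoint:
    # tags["dataset"] becomes the scenario, falling back to "unknown".
    return StubPoint(
        name=result.name,
        scenario=result.tags.get("dataset", "unknown"),
        tags=dict(result.tags),
        metrics=dict(result.metrics),
    )


tagged = to_point(StubResult("fno_run", tags={"dataset": "darcy"}))
untagged = to_point(StubResult("fno_run"))
```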

default_metric_defs ¶

default_metric_defs() -> dict[str, MetricDef]

Create standard metric definitions for scientific ML benchmarks.

Returns:

Type	Description
`dict[str, MetricDef]`	Dictionary mapping metric names to MetricDef objects with proper direction, units, and priority annotations.

Validators — Chemical Accuracy¶

Chemical accuracy validation for scientific ML benchmarks.

Assesses whether a benchmark result meets domain-specific accuracy thresholds by delegating to calibrax.validation.check_accuracy().

ChemicalAccuracyAssessment `dataclass` ¶

ChemicalAccuracyAssessment(*, passed: bool, domain: str, threshold: float, achieved: float, margin: float, accuracy_result: AccuracyResult, recommendations: tuple[str, ...] = tuple())

Result of a chemical accuracy assessment.

Wraps a calibrax.validation.AccuracyResult with domain context and actionable recommendations.

Attributes:

Name	Type	Description
`passed`	`bool`	Whether the result meets the chemical accuracy threshold.
`domain`	`str`	Scientific domain used for assessment.
`threshold`	`float`	Accuracy threshold applied.
`achieved`	`float`	Achieved error value.
`margin`	`float`	Headroom (positive) or deficit (negative) relative to threshold.
`accuracy_result`	`AccuracyResult`	Underlying calibrax AccuracyResult.
`recommendations`	`tuple[str, ...]`	Suggested actions if assessment fails.
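The relationship between `threshold`, `achieved`, `margin`, and `passed` can be sketched as follows. This assumes that margin is computed as `threshold - achieved` and that a result passes when the margin is non-negative; the actual computation is delegated to calibrax and may differ in detail. `assess_margin` is a hypothetical helper.

```python
def assess_margin(achieved: float, threshold: float) -> tuple[bool, float]:
    """Return (passed, margin) for an error value against a threshold."""
    # Positive margin = headroom below the threshold; negative = deficit.
    margin = threshold - achieved
    return margin >= 0.0, margin


# 0.01 is the molecular_dynamics threshold listed in
# CHEMICAL_ACCURACY_THRESHOLDS; the achieved values are placeholders.
passed_ok, margin_ok = assess_margin(achieved=0.004, threshold=0.01)
passed_bad, margin_bad = assess_margin(achieved=0.02, threshold=0.01)
```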

ChemicalAccuracyValidator ¶

ChemicalAccuracyValidator(thresholds: dict[str, float] | None = None, error_metric: str = 'relative_error')

Validates benchmark results against domain-specific chemical accuracy thresholds.

Delegates accuracy computation to calibrax.validation.check_accuracy().

Note: Registry registration intentionally omitted -- validators are instantiated directly, not discovered dynamically.

Parameters:

Name	Type	Description	Default
`thresholds`	`dict[str, float] \| None`	Custom domain-to-threshold mapping. Merged with defaults.	`None`
`error_metric`	`str`	Metric name to extract from BenchmarkResult.	`'relative_error'`

assess ¶

assess(result: BenchmarkResult, domain: str | None = None) -> ChemicalAccuracyAssessment

Assess whether a benchmark result meets chemical accuracy for a domain.

Parameters:

Name	Type	Description	Default
`result`	`BenchmarkResult`	Benchmark result containing error metrics.	required
`domain`	`str \| None`	Scientific domain. Auto-detected from result tags/domain if None.	`None`

Returns:

Type	Description
`ChemicalAccuracyAssessment`	Assessment with pass/fail, margin, and recommendations.

Raises:

Type	Description
`ValueError`	If domain is unknown and cannot be auto-detected.
`KeyError`	If the error metric is not present in the result.

Validators — Conservation Laws¶

Conservation law validation for scientific ML benchmarks.

Orchestrates conservation law checks from opifex.core.physics.conservation and optionally delegates convergence analysis to calibrax.

ConservationReport `dataclass` ¶

ConservationReport(*, violations: dict[str, float], all_conserved: bool, worst_violation: float, convergence: ConvergenceResult | None = None)

Report from conservation law validation.

Uses a local dataclass instead of calibrax.validation.ValidationReport because conservation checking requires violation magnitudes (dict[str, float]) rather than textual violation descriptions (tuple[str, ...]), plus domain-specific fields (worst_violation, all_conserved) that ValidationReport does not provide. The `to_validation_report()` method bridges the two when calibrax interop is needed.

Attributes:

Name	Type	Description
`violations`	`dict[str, float]`	Conservation law name to violation magnitude.
`all_conserved`	`bool`	True if all violations are zero (within tolerance).
`worst_violation`	`float`	Maximum violation across all checked laws.
`convergence`	`ConvergenceResult \| None`	Optional convergence result from multi-resolution analysis.
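A minimal sketch of how the summary fields could follow from `violations`, assuming one tolerance per law and that `worst_violation` is the largest absolute magnitude. `summarize_violations` is a hypothetical helper, and the violation values are placeholders; the tolerances are the `ConservationValidator` defaults for energy and momentum.

```python
def summarize_violations(
    violations: dict[str, float], tolerances: dict[str, float]
) -> tuple[bool, float]:
    """Return (all_conserved, worst_violation) for a set of law violations."""
    # Largest magnitude across all checked laws; 0.0 if nothing was checked.
    worst = max((abs(v) for v in violations.values()), default=0.0)
    # A law counts as conserved when its violation is within its tolerance.
    all_conserved = all(
        abs(v) <= tolerances[law] for law, v in violations.items()
    )
    return all_conserved, worst


tolerances = {"energy": 1e-06, "momentum": 1e-05}
all_ok, worst = summarize_violations(
    {"energy": 5e-07, "momentum": 2e-04}, tolerances
)
```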

to_validation_report ¶

to_validation_report() -> ValidationReport

Convert to a calibrax ValidationReport for cross-tool interop.

Returns:

Type	Description
`ValidationReport`	A `ValidationReport` with violation magnitudes as accuracy_metrics and textual summaries in the violations tuple.

ConservationValidator ¶

ConservationValidator(laws: Sequence[str] | None = None, energy_tolerance: float = 1e-06, momentum_tolerance: float = 1e-05, mass_target: float = 1.0, mass_tolerance: float = 0.0001)

Validates physics conservation laws on model predictions.

Orchestrates existing pure-JAX functions from opifex.core.physics.conservation and provides a unified interface.

Parameters:

Name	Type	Description	Default
`laws`	`Sequence[str] \| None`	Conservation laws to check. Defaults to energy and momentum.	`None`
`energy_tolerance`	`float`	Tolerance for energy conservation check.	`1e-06`
`momentum_tolerance`	`float`	Tolerance for momentum conservation check.	`1e-05`
`mass_target`	`float`	Target mass for mass conservation check.	`1.0`
`mass_tolerance`	`float`	Tolerance for mass conservation check.	`0.0001`

validate ¶

validate(y_pred: Array, y_true: Array) -> ConservationReport

Validate conservation laws on a single prediction set.

Parameters:

Name	Type	Description	Default
`y_pred`	`Array`	Model predictions.	required
`y_true`	`Array`	Ground truth values.	required

Returns:

Type	Description
`ConservationReport`	ConservationReport with violations and overall status.

validate_convergence ¶

validate_convergence(predictions: Sequence[Array], truths: Sequence[Array], tolerances: Sequence[float]) -> ConvergenceResult

Validate conservation convergence across multiple resolutions.

Computes violations at each resolution and delegates convergence analysis to calibrax.validation.check_convergence().

Parameters:

Name	Type	Description	Default
`predictions`	`Sequence[Array]`	Predictions at increasing resolutions.	required
`truths`	`Sequence[Array]`	Ground truths at increasing resolutions.	required
`tolerances`	`Sequence[float]`	Tolerance thresholds for convergence check.	required

Returns:

Type	Description
`ConvergenceResult`	ConvergenceResult with rates and achievement flags.

Shared Utilities¶

Shared constants and utilities for the benchmarking module.

Centralises domain inference, metric classification, and chemical accuracy thresholds to eliminate duplication across sub-modules.

LOWER_IS_BETTER `module-attribute` ¶

LOWER_IS_BETTER: frozenset[str] = frozenset({'mse', 'mae', 'rmse', 'relative_error', 'mape', 'execution_time'})

Metrics where a lower value indicates better performance.
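One way such a set might be used is to decide whether a new value improves on an old one. The sketch below copies the set from the signature above; `is_improvement` is a hypothetical helper, and treating every metric outside the set as higher-is-better is an assumption of this example.

```python
LOWER_IS_BETTER: frozenset[str] = frozenset(
    {"mse", "mae", "rmse", "relative_error", "mape", "execution_time"}
)


def is_improvement(metric: str, new: float, old: float) -> bool:
    """True if `new` is strictly better than `old` for the given metric."""
    if metric in LOWER_IS_BETTER:
        return new < old
    # Assumption: anything else (e.g. r2_score) is higher-is-better.
    return new > old
```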

ACCURACY_METRIC_KEYS `module-attribute` ¶

ACCURACY_METRIC_KEYS: tuple[str, ...] = ('mse', 'mae', 'rmse', 'r2_score', 'relative_error')

Standard accuracy metric keys used across reporting and analysis.

CHEMICAL_ACCURACY_THRESHOLDS `module-attribute` ¶

CHEMICAL_ACCURACY_THRESHOLDS: dict[str, float] = {'quantum_computing': 0.001, 'materials_science': 0.05, 'molecular_dynamics': 0.01}

Domain-specific accuracy thresholds for chemical/physical accuracy checks.

infer_domain ¶

infer_domain(dataset_name: str) -> str

Infer scientific domain from dataset name.

Parameters:

Name	Type	Description	Default
`dataset_name`	`str`	Name of the dataset.	required

Returns:

Type	Description
`str`	Inferred domain string, or `"general"` if no match.

extract_metric_value ¶

extract_metric_value(result: BenchmarkResult, metric_name: str, default: float = float('inf')) -> float

Extract a scalar metric value from a BenchmarkResult.

Parameters:

Name	Type	Description	Default
`result`	`BenchmarkResult`	Benchmark result to extract from.	required
`metric_name`	`str`	Name of the metric.	required
`default`	`float`	Value to return if metric is absent.	`float('inf')`

Returns:

Type	Description
`float`	The metric value as a float.
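A possible reading of the `float('inf')` default is that a missing error metric then sorts last when results are ranked in ascending order. The sketch below illustrates that idea on plain dictionaries; `extract_value` is a hypothetical stand-in that does not touch the real `BenchmarkResult` type, and the values are placeholders.

```python
import math


def extract_value(
    metrics: dict[str, float], metric_name: str, default: float = float("inf")
) -> float:
    """Return the metric as a float, or `default` when it is absent."""
    value = metrics.get(metric_name)
    return default if value is None else float(value)


# Placeholder results keyed by model name; "b" never reported mse.
results = {"a": {"mse": 0.2}, "b": {}, "c": {"mse": 0.1}}
ranked = sorted(results, key=lambda name: extract_value(results[name], "mse"))
```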

Report Generation¶

Report generation for PDEBench evaluation and benchmarking results.

This module provides full report generation capabilities for PDEBench evaluation results, including statistical analysis, baseline comparisons, and publication-ready formatted outputs.

PDEBenchReportGenerator ¶

PDEBenchReportGenerator(report_format: str = 'json')

Generator for full PDEBench evaluation reports.

Creates detailed reports from evaluation results including statistical analysis, baseline comparisons, and multiple output formats for both programmatic access and human readability.

Parameters:

Name	Type	Description	Default
`report_format`	`str`	Default output format ("json" or "text")	`'json'`

generate_evaluation_report ¶

generate_evaluation_report(evaluation_results: dict[str, Any], baseline_comparisons: dict[str, Any] | None = None, dataset_info: dict[str, str] | None = None, model_info: dict[str, str] | None = None) -> dict[str, Any]

Generate full evaluation report.

Parameters:

Name	Type	Description	Default
`evaluation_results`	`dict[str, Any]`	Results from benchmarking evaluation	required
`baseline_comparisons`	`dict[str, Any] \| None`	Optional baseline comparison data	`None`
`dataset_info`	`dict[str, str] \| None`	Optional dataset metadata	`None`
`model_info`	`dict[str, str] \| None`	Optional model metadata	`None`

Returns:

Type	Description
`dict[str, Any]`	Complete evaluation report dictionary

format_report_as_text ¶

format_report_as_text(report: dict[str, Any]) -> str

Format report as human-readable text.

save_report ¶

save_report(report: dict[str, Any], filepath: str, format_type: str | None = None) -> None

Save report to file.

Parameters:

Name	Type	Description	Default
`report`	`dict[str, Any]`	Report data to save	required
`filepath`	`str`	Output file path	required
`format_type`	`str \| None`	Output format ("json" or "text"), defaults to self.report_format	`None`
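The format fallback described for `format_type` can be sketched as below. `save_report_sketch` is a hypothetical free function (the real method lives on `PDEBenchReportGenerator`), and the plain `key: value` text layout is an assumption; the actual text output comes from `format_report_as_text`.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Optional


def save_report_sketch(
    report: dict[str, Any],
    filepath: str,
    format_type: Optional[str] = None,
    default_format: str = "json",
) -> None:
    """Write `report` as JSON or text, falling back to `default_format`."""
    fmt = format_type or default_format
    if fmt == "json":
        text = json.dumps(report, indent=2, default=str)
    elif fmt == "text":
        # Assumed layout: one "key: value" line per top-level entry.
        text = "\n".join(f"{key}: {value}" for key, value in report.items())
    else:
        raise ValueError(f"Unsupported report format: {fmt!r}")
    Path(filepath).write_text(text, encoding="utf-8")


# Round-trip a placeholder report through a temporary directory.
report = {"model": "example-model", "metrics": {"mse": 0.5}}
with tempfile.TemporaryDirectory() as tmp:
    json_path = Path(tmp) / "report.json"
    text_path = Path(tmp) / "report.txt"
    save_report_sketch(report, str(json_path))
    save_report_sketch(report, str(text_path), format_type="text")
    loaded = json.loads(json_path.read_text(encoding="utf-8"))
    text_body = text_path.read_text(encoding="utf-8")
```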

generate_summary_statistics ¶

generate_summary_statistics(reports: list[dict[str, Any]]) -> dict[str, Any]

Generate summary statistics across multiple reports.

Parameters:

Name	Type	Description	Default
`reports`	`list[dict[str, Any]]`	List of evaluation reports to analyze	required

Returns:

Type	Description
`dict[str, Any]`	Summary statistics across all reports

generate_comprehensive_report ¶

generate_comprehensive_report(results: list[BenchmarkResult], include_baseline_comparison: bool = True, include_statistical_analysis: bool = True) -> dict[str, Any]

Generate full report from benchmark results.

Parameters:

Name	Type	Description	Default
`results`	`list[BenchmarkResult]`	List of BenchmarkResult objects	required
`include_baseline_comparison`	`bool`	Whether to include baseline comparisons	`True`
`include_statistical_analysis`	`bool`	Whether to include statistical analysis	`True`

Returns:

Type	Description
`dict[str, Any]`	Full report dictionary

Visualization Tools¶

Visualization Tools Module

This module provides visualization utilities for PDEBench benchmarking results. It generates figure metadata and configuration rather than rendering plots, so it adds no plotting dependencies to the core framework.

Key Features:

- Figure metadata generation for comparison charts
- Configuration for publication-ready visualizations
- Support for multiple chart types and metrics
- Integration with benchmarking infrastructure

Following Critical Technical Guidelines:

- JAX-native data processing
- Type hints and full documentation
- No external plotting dependencies (metadata only)

PDEBenchVisualizer ¶

PDEBenchVisualizer()

Visualization utilities for PDEBench benchmark results.

This class generates figure metadata and configurations for creating charts and plots of benchmark results. It avoids direct plotting to maintain lightweight dependencies.

create_comparison_chart ¶

create_comparison_chart(results: list[BenchmarkResult], metric: str, title: str = 'Model Comparison', sort_by_performance: bool = True) -> dict[str, Any]

Create metadata for a model comparison chart.

Parameters:

Name	Type	Description	Default
`results`	`list[BenchmarkResult]`	List of benchmark results to compare	required
`metric`	`str`	Metric to use for comparison	required
`title`	`str`	Chart title	`'Model Comparison'`
`sort_by_performance`	`bool`	Whether to sort results by performance	`True`

Returns:

Type	Description
`dict[str, Any]`	Dictionary with figure metadata and configuration
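To make "figure metadata" concrete, the sketch below builds a plain dictionary that a plotting front end could consume later. The dictionary keys, the `chart_type` value, and the `lower_is_better` flag are all assumptions of this example; the actual structure returned by `create_comparison_chart` is not documented here.

```python
from typing import Any


def comparison_chart_metadata(
    values: dict[str, float],
    metric: str,
    title: str = "Model Comparison",
    sort_by_performance: bool = True,
    lower_is_better: bool = True,
) -> dict[str, Any]:
    """Describe a bar chart as data; nothing is rendered."""
    items = list(values.items())
    if sort_by_performance:
        # Best model first: ascending for error metrics, descending otherwise.
        items.sort(key=lambda kv: kv[1], reverse=not lower_is_better)
    return {
        "chart_type": "bar",  # hypothetical key and value
        "title": title,
        "metric": metric,
        "labels": [name for name, _ in items],
        "values": [value for _, value in items],
    }


# Placeholder per-model metric values.
chart = comparison_chart_metadata({"FNO": 0.02, "TFNO": 0.01}, metric="mse")
```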

create_multi_metric_comparison ¶

create_multi_metric_comparison(results: list[BenchmarkResult], metrics: list[str], title: str = 'Multi-Metric Comparison') -> dict[str, Any]

Create metadata for multi-metric comparison chart.

Parameters:

Name	Type	Description	Default
`results`	`list[BenchmarkResult]`	List of benchmark results	required
`metrics`	`list[str]`	List of metrics to compare	required
`title`	`str`	Chart title	`'Multi-Metric Comparison'`

Returns:

Type	Description
`dict[str, Any]`	Dictionary with figure metadata

create_performance_trends ¶

create_performance_trends(results: list[BenchmarkResult], group_by: str = 'dataset_name', metric: str = 'mse') -> dict[str, Any]

Create metadata for performance trends visualization.

Parameters:

Name	Type	Description	Default
`results`	`list[BenchmarkResult]`	List of benchmark results	required
`group_by`	`str`	Field to group results by	`'dataset_name'`
`metric`	`str`	Metric to track trends for	`'mse'`

Returns:

Type	Description
`dict[str, Any]`	Dictionary with trend visualization metadata

create_baseline_comparison ¶

create_baseline_comparison(results: list[BenchmarkResult], baseline_metrics: dict[str, dict[str, float]], metric: str = 'mse') -> dict[str, Any]

Create metadata for baseline comparison visualization.

Parameters:

Name	Type	Description	Default
`results`	`list[BenchmarkResult]`	Test results to compare	required
`baseline_metrics`	`dict[str, dict[str, float]]`	Dictionary of baseline metrics by model type	required
`metric`	`str`	Metric to use for comparison	`'mse'`

Returns:

Type	Description
`dict[str, Any]`	Dictionary with baseline comparison metadata

create_error_distribution ¶

create_error_distribution(results: list[BenchmarkResult], error_metric: str = 'mae') -> dict[str, Any]

Create metadata for error distribution visualization.

Parameters:

Name	Type	Description	Default
`results`	`list[BenchmarkResult]`	List of benchmark results	required
`error_metric`	`str`	Error metric to analyze distribution for	`'mae'`

Returns:

Type	Description
`dict[str, Any]`	Dictionary with error distribution metadata

create_model_ranking ¶

create_model_ranking(results: list[BenchmarkResult], ranking_metrics: list[str], weights: dict[str, float] | None = None) -> dict[str, Any]

Create metadata for model ranking visualization.

Parameters:

Name	Type	Description	Default
`results`	`list[BenchmarkResult]`	List of benchmark results	required
`ranking_metrics`	`list[str]`	Metrics to use for ranking	required
`weights`	`dict[str, float] \| None`	Optional weights for each metric	`None`

Returns:

Type	Description
`dict[str, Any]`	Dictionary with model ranking metadata

get_visualization_summary ¶

get_visualization_summary(results: list[BenchmarkResult]) -> dict[str, Any]

Generate a summary of available visualization options.

Parameters:

Name	Type	Description	Default
`results`	`list[BenchmarkResult]`	List of benchmark results	required

Returns:

Type	Description
`dict[str, Any]`	Dictionary with visualization recommendations

PDE Bench Integration¶

PDEBench Integration Module

This module provides full integration with PDEBench datasets for standardized evaluation of neural operators. It includes dataset loading, preprocessing, and automated evaluation pipelines.

Key Features:

- Support for major PDEBench datasets (Advection, Burgers, Darcy Flow, etc.)
- Standardized data preprocessing for neural operator compatibility
- Automated evaluation pipelines with statistical analysis
- Integration with existing benchmarking infrastructure

Following Critical Technical Guidelines:

- JAX-native data processing for GPU compatibility
- Flax NNX integration for neural operator evaluation
- Test-driven development with full coverage
- Type hints and documentation for all public APIs

PDEBenchLoader ¶

PDEBenchLoader(data_root: str | None = None, cache_dir: str | None = None)

Loads and preprocesses PDEBench datasets for neural operator evaluation.

This class provides a unified interface for loading standard PDE benchmark datasets with automatic preprocessing for compatibility with different neural operator architectures (FNO, DeepONet, etc.).

Parameters:

Name	Type	Description	Default
`data_root`	`str \| None`	Root directory for PDEBench datasets	`None`
`cache_dir`	`str \| None`	Directory for caching preprocessed datasets	`None`

list_available_datasets ¶

list_available_datasets() -> list[str]

List all supported PDEBench datasets.

get_dataset_info ¶

get_dataset_info(dataset_name: str) -> dict[str, Any]

Get detailed information about a specific dataset.

Parameters:

Name	Type	Description	Default
`dataset_name`	`str`	Name of the dataset	required

Returns:

Type	Description
`dict[str, Any]`	Dictionary containing dataset metadata and characteristics

load_dataset ¶

load_dataset(dataset_name: str, subset_size: int | None = None, resolution: str = 'low', split: str = 'test', normalize: bool = True, format_for_model: str = 'auto', *, allow_synthetic: bool = False) -> dict[str, Any]

Load and preprocess a PDEBench dataset.

PDEBench datasets ship as multi-gigabyte HDF5 files that must be downloaded separately (see the PDEBench data-download tooling). If the corresponding HDF5 file is not present under `data_root`, `load_dataset` raises `FileNotFoundError` unless `allow_synthetic=True` is passed explicitly. When synthetic data is requested, it is generated from the dataset's spatial/channel characteristics, a `UserWarning` is emitted, and the returned metadata is flagged with `data_source="synthetic"` and `is_synthetic=True` so callers can detect it.

Parameters:

Name	Type	Description	Default
`dataset_name`	`str`	Name of the dataset to load	required
`subset_size`	`int \| None`	Number of samples to load (None for full dataset)	`None`
`resolution`	`str`	Resolution setting ("low", "medium", "high")	`'low'`
`split`	`str`	Dataset split ("train", "val", "test")	`'test'`
`normalize`	`bool`	Whether to normalize the data	`True`
`format_for_model`	`str`	Target model format ("fno", "deeponet", "auto")	`'auto'`
`allow_synthetic`	`bool`	Opt in to clearly-flagged synthetic data when no real PDEBench HDF5 file is available. Defaults to `False` (fail fast).	`False`

Returns:

Type	Description
`dict[str, Any]`	Dictionary containing `input_data` (input arrays), `target_data` (target arrays), and `metadata` (dataset metadata, including `data_source` and `is_synthetic`)

Raises:

Type	Description
`ValueError`	If `dataset_name` is not supported.
`FileNotFoundError`	If no real PDEBench HDF5 file is found for the dataset and `allow_synthetic` is `False`.
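The fail-fast behaviour with an explicit synthetic opt-in can be sketched independently of the loader. `load_or_synthesize` is a hypothetical function, the `"hdf5"` value for real data is an assumption, and no actual data is read or generated; only the control flow and the metadata flags described above are shown.

```python
import warnings
from pathlib import Path
from typing import Any


def load_or_synthesize(
    path: Path, *, allow_synthetic: bool = False
) -> dict[str, Any]:
    """Mimic the documented control flow around a missing dataset file."""
    if path.exists():
        # "hdf5" is a placeholder label for real data (assumption).
        return {"metadata": {"data_source": "hdf5", "is_synthetic": False}}
    if not allow_synthetic:
        raise FileNotFoundError(f"No PDEBench file found at {path}")
    warnings.warn(
        f"{path} not found; falling back to synthetic data",
        UserWarning,
        stacklevel=2,
    )
    return {"metadata": {"data_source": "synthetic", "is_synthetic": True}}


missing = Path("pdebench_missing_example.hdf5")

# Default: fail fast.
try:
    load_or_synthesize(missing)
    raised = False
except FileNotFoundError:
    raised = True

# Explicit opt-in: warn and flag the result.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = load_or_synthesize(missing, allow_synthetic=True)
```

Callers that must never report numbers from synthetic data can check `metadata["is_synthetic"]` before recording results.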

PDEBenchEvaluationPipeline ¶

PDEBenchEvaluationPipeline(output_dir: str | None = None, *, allow_synthetic: bool = False)

Automated evaluation pipeline for PDEBench datasets.

This class provides end-to-end evaluation workflows that integrate dataset loading, model evaluation, and result analysis.

Parameters:

Name	Type	Description	Default
`output_dir`	`str \| None`	Directory for saving evaluation results	`None`
`allow_synthetic`	`bool`	Propagated to `PDEBenchLoader.load_dataset`. When `True` the pipeline may run on clearly-flagged synthetic data (with a loud warning) if real PDEBench files are absent. Defaults to `False` (fail fast on missing real data).	`False`

evaluate_model_on_datasets ¶

evaluate_model_on_datasets(model: Any, model_name: str, datasets: list[str], subset_size: int = 10, resolution: str = 'low', **kwargs: Any) -> list[BenchmarkResult]

Evaluate a model on multiple PDEBench datasets.

Parameters:

Name	Type	Description	Default
`model`	`Any`	Neural operator model to evaluate	required
`model_name`	`str`	Name identifier for the model	required
`datasets`	`list[str]`	List of dataset names to evaluate on	required
`subset_size`	`int`	Number of samples per dataset	`10`
`resolution`	`str`	Resolution setting for datasets	`'low'`
`**kwargs`	`Any`	Additional arguments for evaluation	`{}`

Returns:

Type	Description
`list[BenchmarkResult]`	List of benchmark results for each dataset

run_comprehensive_evaluation ¶

run_comprehensive_evaluation(models: list[tuple[str, Any]], datasets: list[str] | None = None, resolutions: list[str] | None = None, subset_size: int = 10) -> dict[str, list[BenchmarkResult]]

Run full evaluation across multiple models and datasets.

Parameters:

Name	Type	Description	Default
`models`	`list[tuple[str, Any]]`	List of (model_name, model) tuples	required
`datasets`	`list[str] \| None`	List of datasets to evaluate (None for all supported)	`None`
`resolutions`	`list[str] \| None`	List of resolutions to test (None for just "low")	`None`
`subset_size`	`int`	Number of samples per dataset	`10`

Returns:

Type	Description
`dict[str, list[BenchmarkResult]]`	Dictionary mapping model names to their evaluation results

CLI¶

Benchmarking CLI - Command-line interface for running Opifex benchmarks.

Usage

python -m opifex.benchmarking.cli -b PDEBench_2D_DarcyFlow -o TFNO

python -m opifex.benchmarking.cli --list-benchmarks

python -m opifex.benchmarking.cli --list-operators

parse_args ¶

parse_args(args: Sequence[str] | None = None) -> Namespace

Parse command-line arguments.

Parameters:

Name	Type	Description	Default
`args`	`Sequence[str] \| None`	Command-line arguments (defaults to sys.argv[1:])	`None`

Returns:

Type	Description
`Namespace`	Parsed arguments namespace
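A parser accepting the flags shown in the usage lines might look like the sketch below. Only `-b`, `-o`, `--list-benchmarks`, and `--list-operators` come from the usage text; the `dest` names, help strings, and the function name `parse_args_sketch` are assumptions, and the real CLI may define further options.

```python
import argparse
from collections.abc import Sequence
from typing import Optional


def parse_args_sketch(args: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse the flags from the usage examples (None means sys.argv[1:])."""
    parser = argparse.ArgumentParser(prog="opifex.benchmarking.cli")
    # dest names below are hypothetical.
    parser.add_argument("-b", dest="benchmark", help="benchmark name")
    parser.add_argument("-o", dest="operator", help="operator name")
    parser.add_argument("--list-benchmarks", action="store_true")
    parser.add_argument("--list-operators", action="store_true")
    return parser.parse_args(args)


ns = parse_args_sketch(["-b", "PDEBench_2D_DarcyFlow", "-o", "TFNO"])
listing = parse_args_sketch(["--list-operators"])
```

Passing an explicit list instead of relying on `sys.argv` keeps the parser easy to exercise in tests.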

run_cli ¶

run_cli(args: Sequence[str] | None = None) -> int

Main CLI entry point.

Parameters:

Name	Type	Description	Default
`args`	`Sequence[str] \| None`	Command-line arguments (defaults to sys.argv[1:])	`None`

Returns:

Type	Description
`int`	Exit code (0 for success)

main ¶

main() -> None

Main entry point for module execution.

Profiling¶

Profiling Harness¶

Full JAX Profiling Harness for Opifex.

Main interface for the full profiling system that coordinates hardware-aware profiling, roofline analysis, compilation profiling, and generates actionable optimization reports.

OptimizationReport ¶

OptimizationReport()

Structured optimization report with actionable recommendations.

add_section ¶

add_section(title: str, content: Any) -> None

Add a section to the report.

set_executive_summary ¶

set_executive_summary(summary: dict[str, Any]) -> None

Set the executive summary.

add_priority_recommendation ¶

add_priority_recommendation(recommendation: str, impact: str = 'medium', effort: str = 'medium') -> None

Add a priority recommendation.

render ¶

render(output_format: str = 'text') -> str

Render the report in specified format.

OpifexProfilingHarness ¶

OpifexProfilingHarness(enable_hardware_profiling: bool = True, enable_compilation_profiling: bool = True, enable_roofline_analysis: bool = True, trace_dir: str | None = None)

Full JAX profiling harness for Opifex applications.

profiling_session ¶

profiling_session(enable_jax_profiler: bool = True)

Context manager for full profiling session.

profile_neural_operator ¶

profile_neural_operator(operator: Module | Callable, inputs: list[Array], operation_name: str | None = None) -> tuple[dict[str, Any], OptimizationReport]

Profile a complete neural operator with full analysis.

profile_function ¶

profile_function(func: Callable, inputs: list[Array], function_name: str | None = None) -> tuple[dict[str, Any], OptimizationReport]

Profile a JAX function with full analysis.

compare_operations ¶

compare_operations(operations: list[tuple[str, Module | Callable, list[Array]]]) -> dict[str, Any]

Compare multiple operations and identify optimization opportunities.

get_session_summary ¶

get_session_summary() -> dict[str, Any]

Get summary of all profiling sessions.

Event Coordinator¶

Event Coordinator for JAX Profiling Harness.

Coordinates timing and events across multiple profilers to ensure consistent measurements and prevent interference between profiling components.

ProfilingEvent `dataclass` ¶

ProfilingEvent(*, timestamp: float, event_type: str, profiler_id: str, data: dict[str, Any] = dict(), duration_ms: float | None = None)

Represents a profiling event with timing information.

ProfilingTimeline ¶

ProfilingTimeline()

Thread-safe timeline for profiling events.

start_timeline ¶

start_timeline() -> None

Start the profiling timeline.

add_event ¶

add_event(event_type: str, profiler_id: str, data: dict[str, Any] | None = None, duration_ms: float | None = None) -> None

Add an event to the timeline.

get_events ¶

get_events(profiler_id: str | None = None) -> list[ProfilingEvent]

Get events, optionally filtered by profiler ID.

get_timeline_duration ¶

get_timeline_duration() -> float

Get total timeline duration in seconds.

EventCoordinator ¶

EventCoordinator()

Coordinates profiling events and timing across multiple profilers.

register_profiler ¶

register_profiler(profiler_id: str) -> None

Register a profiler with the coordinator.

unregister_profiler ¶

unregister_profiler(profiler_id: str) -> None

Unregister a profiler from the coordinator.

profiling_session ¶

profiling_session(enable_jax_profiler: bool = True, trace_dir: str | None = None)

Context manager for coordinated profiling session.

add_event ¶

add_event(event_type: str, profiler_id: str, data: dict[str, Any] | None = None, duration_ms: float | None = None) -> None

Add an event to the coordinated timeline.

time_function ¶

time_function(func: Callable[..., Any], *args: Any, profiler_id: str = 'unknown', operation_name: str = 'operation', **kwargs: Any) -> tuple[Any, float]

Time a function execution and record the event.
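The `(result, elapsed)` return shape can be sketched with a wall-clock timer. `time_call` is a hypothetical helper; reporting the elapsed time in milliseconds is an assumption (the unit of the returned float is not stated above), and the event-recording step is omitted.

```python
import time
from typing import Any, Callable


def time_call(
    func: Callable[..., Any], *args: Any, **kwargs: Any
) -> tuple[Any, float]:
    """Run `func` and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


total, elapsed_ms = time_call(sum, [1, 2, 3])
```

Because JAX dispatches work asynchronously, timing a JAX computation this way would also need to wait for the result (for example by calling `block_until_ready()` on the output array) before reading the clock.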

get_profiling_summary ¶

get_profiling_summary() -> dict[str, Any]

Get a summary of the profiling session.

export_timeline ¶

export_timeline(output_format: str = 'json') -> str

Export timeline in specified format.

create_shared_coordinator ¶

create_shared_coordinator() -> EventCoordinator

Create a shared event coordinator instance.
