Benchmarking API Reference

Overview

Benchmarking tools for scientific machine learning methods. The module delegates core types (BenchmarkResult, Metric, Run) and statistical analysis to calibrax while providing domain-specific evaluation, validation, and profiling on top.

Benchmark Registry

Benchmark Registry for Opifex Advanced Benchmarking System

Manages available benchmarks and neural operators with domain organization. Provides registration, discovery, and configuration management for the full benchmarking ecosystem.

DomainConfig dataclass

DomainConfig(*, name: str, tolerance_ranges: dict[str, tuple[float, float]] = dict(), required_metrics: list[str] = list(), reference_methods: list[str] = list(), default_problem_sizes: list[int] = list())

Configuration for a specific scientific domain.

BenchmarkConfig dataclass

BenchmarkConfig(*, name: str, domain: str, problem_type: str, input_shape: tuple[int, ...], output_shape: tuple[int, ...], dataset_path: str | None = None, reference_solution_path: str | None = None, physics_constraints: dict[str, Any] = dict(), computational_requirements: dict[str, Any] = dict())

Configuration for a specific benchmark.

BenchmarkRegistry

BenchmarkRegistry(config_path: str | None = None)

Manages available benchmarks and neural operators with domain organization.

This registry provides centralized management of:

- Neural operator architectures available for benchmarking
- Benchmark problems organized by scientific domain
- Domain-specific configurations and requirements
- Compatibility checking between operators and benchmarks

Parameters:

Name Type Description Default
config_path str | None

Path to registry configuration file

None

save_registry

save_registry() -> None

Save registry configuration to file.

register_operator

register_operator(operator_class: type, metadata: dict[str, Any] | None = None) -> None

Register a neural operator for benchmarking.

Parameters:

Name Type Description Default
operator_class type

Neural operator class to register

required
metadata dict[str, Any] | None

Additional metadata about the operator

None

register_benchmark

register_benchmark(benchmark_config: BenchmarkConfig) -> None

Register a benchmark configuration.

Parameters:

Name Type Description Default
benchmark_config BenchmarkConfig

Benchmark configuration to register

required

get_benchmark_suite

get_benchmark_suite(domain: str) -> list[BenchmarkConfig]

Get all benchmarks for a specific domain.

Parameters:

Name Type Description Default
domain str

Scientific domain name

required

Returns:

Type Description
list[BenchmarkConfig]

List of benchmark configurations for the domain

list_compatible_operators

list_compatible_operators(benchmark_name: str) -> list[str]

Get list of operators compatible with a benchmark.

Parameters:

Name Type Description Default
benchmark_name str

Name of the benchmark

required

Returns:

Type Description
list[str]

List of compatible operator names

get_domain_specific_config

get_domain_specific_config(domain: str) -> DomainConfig

Get configuration for a specific domain.

Parameters:

Name Type Description Default
domain str

Domain name

required

Returns:

Type Description
DomainConfig

Domain configuration

Raises:

Type Description
ValueError

If domain not found

get_operator_class

get_operator_class(operator_name: str) -> type

Get operator class by name.

Parameters:

Name Type Description Default
operator_name str

Name of the operator

required

Returns:

Type Description
type

Operator class

Raises:

Type Description
ValueError

If operator not found

get_operator_metadata

get_operator_metadata(operator_name: str) -> dict[str, Any]

Get metadata for a registered operator.

Parameters:

Name Type Description Default
operator_name str

Name of the operator.

required

Returns:

Type Description
dict[str, Any]

Metadata dictionary for the operator, or empty dict if not found.

get_benchmark_config

get_benchmark_config(benchmark_name: str) -> BenchmarkConfig

Get benchmark configuration by name.

Parameters:

Name Type Description Default
benchmark_name str

Name of the benchmark

required

Returns:

Type Description
BenchmarkConfig

Benchmark configuration

Raises:

Type Description
ValueError

If benchmark not found

list_available_domains

list_available_domains() -> list[str]

Get list of available domains.

list_available_operators

list_available_operators() -> list[str]

Get list of available operators.

list_available_benchmarks

list_available_benchmarks() -> list[str]

Get list of available benchmarks.

auto_discover_operators

auto_discover_operators() -> None

Auto-discover neural operators from opifex.neural.operators module.

generate_compatibility_report

generate_compatibility_report() -> dict[str, Any]

Generate a report of benchmark-operator compatibility.

Returns:

Type Description
dict[str, Any]

Full compatibility report

Benchmark Runner

Benchmark Runner for Opifex Advanced Benchmarking System

Orchestrates complete benchmarking pipeline execution. Provides end-to-end benchmarking workflows, domain-specific suites, publication report generation, and database updates.

DomainResults dataclass

DomainResults(*, domain: str, benchmark_results: dict[str, dict[str, BenchmarkResult]], validation_reports: dict[str, dict[str, ValidationReport]] = dict(), comparison_reports: dict[str, ComparisonReport] = dict(), insight_reports: dict[str, dict[str, InsightReport]] = dict(), summary_statistics: dict[str, Any] = dict())

Results for a domain-specific benchmark suite.

PublicationReport dataclass

PublicationReport(*, title: str, abstract: str, methodology: str, results_summary: dict[str, Any], comparison_tables: list[Path] = list(), figures: list[Path] = list(), key_findings: list[str] = list(), recommendations: list[str] = list(), appendix_data: dict[str, Any] = dict())

Publication-ready benchmark report.

BenchmarkRunner

BenchmarkRunner(registry: BenchmarkRegistry | None = None, evaluator: BenchmarkEvaluator | None = None, validator: ValidationFramework | None = None, analyzer: AnalysisEngine | None = None, results_manager: ResultsManager | None = None, output_dir: str = './benchmark_results')

Orchestrates complete benchmarking pipeline execution.

This runner provides end-to-end benchmarking capabilities including:

- Full multi-operator benchmarking across domains
- Domain-specific benchmark suite execution with validation
- Publication-ready report and figure generation
- Automated benchmark database updates and maintenance

Parameters:

Name Type Description Default
registry BenchmarkRegistry | None

Benchmark registry (creates default if None)

None
evaluator BenchmarkEvaluator | None

Benchmark evaluator (creates default if None)

None
validator ValidationFramework | None

Validation framework (creates default if None)

None
analyzer AnalysisEngine | None

Analysis engine (creates default if None)

None
results_manager ResultsManager | None

Results manager (creates default if None)

None
output_dir str

Output directory for results

'./benchmark_results'

run_comprehensive_benchmark

run_comprehensive_benchmark(operators: list[str] | None = None, benchmarks: list[str] | None = None, validate_results: bool = True, generate_analysis: bool = True) -> dict[str, dict[str, BenchmarkResult]]

Run full benchmark across multiple operators and problems.

Parameters:

Name Type Description Default
operators list[str] | None

List of operator names (uses all available if None)

None
benchmarks list[str] | None

List of benchmark names (uses all available if None)

None
validate_results bool

Whether to run validation framework

True
generate_analysis bool

Whether to run analysis engine

True

Returns:

Type Description
dict[str, dict[str, BenchmarkResult]]

Nested dictionary: benchmark_name -> operator_name -> BenchmarkResult
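
The nested return shape can be traversed with two levels of keys, benchmark first and operator second. The sketch below uses plain floats in place of `BenchmarkResult` objects, and all names and numbers are made up, so it only demonstrates the indexing order.

```python
# Stand-in for the documented return shape. Floats replace BenchmarkResult
# objects, and every name and number here is invented for illustration.
results = {
    "darcy_2d": {"FNO": 0.012, "DeepONet": 0.020},
    "burgers_1d": {"FNO": 0.008, "DeepONet": 0.005},
}

# Outer key is the benchmark, inner key is the operator.
# Here a lower value is treated as better (e.g. an error metric).
best_per_benchmark = {
    benchmark: min(per_operator, key=per_operator.get)
    for benchmark, per_operator in results.items()
}
```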

execute_domain_specific_suite

execute_domain_specific_suite(domain: str) -> DomainResults

Execute benchmark suite for a specific scientific domain.

Parameters:

Name Type Description Default
domain str

Scientific domain name

required

Returns:

Type Description
DomainResults

Full domain-specific results

generate_publication_report

generate_publication_report(results: dict[str, dict[str, BenchmarkResult]] | DomainResults, title: str | None = None) -> PublicationReport

Generate publication-ready report from benchmark results.

Parameters:

Name Type Description Default
results dict[str, dict[str, BenchmarkResult]] | DomainResults

Benchmark results (either full or domain-specific)

required
title str | None

Report title (auto-generated if None)

None

Returns:

Type Description
PublicationReport

Publication-ready report with figures and tables

update_benchmark_database

update_benchmark_database() -> dict[str, Any]

Update benchmark database with latest results.

Returns:

Type Description
dict[str, Any]

Database update summary

Evaluation Engine

Core benchmarking evaluation engine for Opifex framework.

This module provides model evaluation capabilities using calibrax for metrics and statistical analysis. BenchmarkEvaluator orchestrates evaluation runs, profiling, and result management.

BenchmarkEvaluator

BenchmarkEvaluator(output_dir: str = './benchmark_results', save_detailed_results: bool = True, enable_gpu_profiling: bool = False)

Main benchmark evaluator for Opifex models.

Provides full evaluation capabilities including model assessment, performance profiling, batch evaluation, and result management.

Parameters:

Name Type Description Default
output_dir str

Directory for saving results.

'./benchmark_results'
save_detailed_results bool

Whether to save detailed results to files.

True
enable_gpu_profiling bool

Whether to enable GPU profiling.

False

evaluate_model

evaluate_model(model: Any, model_name: str, input_data: Array | tuple[Array, ...], target_data: Array, dataset_name: str, forward_fn: Callable | None = None, custom_metrics: dict[str, Callable] | None = None) -> BenchmarkResult

Evaluate a model on given data with extensive metrics.

Parameters:

Name Type Description Default
model Any

Model to evaluate.

required
model_name str

Name identifier for the model.

required
input_data Array | tuple[Array, ...]

Input data for evaluation.

required
target_data Array

Expected target outputs.

required
dataset_name str

Name of the dataset being used.

required
forward_fn Callable | None

Optional custom forward function.

None
custom_metrics dict[str, Callable] | None

Optional dictionary of custom metric functions.

None

Returns:

Type Description
BenchmarkResult

BenchmarkResult with evaluation metrics and metadata.

batch_evaluate

batch_evaluate(models: list[tuple[str, Any]], datasets: list[tuple[str, Any, Array, Callable | None]]) -> list[BenchmarkResult]

Evaluate multiple models on multiple datasets.

Parameters:

Name Type Description Default
models list[tuple[str, Any]]

List of (model_name, model) tuples.

required
datasets list[tuple[str, Any, Array, Callable | None]]

List of (dataset_name, input_data, target_data, forward_fn) tuples.

required

Returns:

Type Description
list[BenchmarkResult]

List of BenchmarkResults for all model-dataset combinations.
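
"All model-dataset combinations" corresponds to a Cartesian product of the two input lists. The following sketch enumerates the pairs with `itertools.product`. The iteration order (models outer, datasets inner) is an assumption, and the models and data are placeholders.

```python
from itertools import product

# Placeholder models and datasets shaped like the documented tuples:
# (model_name, model) and (dataset_name, input_data, target_data, forward_fn).
models = [("fno", object()), ("deeponet", object())]
datasets = [
    ("darcy", [1.0, 2.0], [1.0, 2.0], None),
    ("burgers", [3.0], [3.0], None),
]

# One entry per (model, dataset) combination; order is an assumption.
pairs = [
    (model_name, dataset_name)
    for (model_name, _model), (dataset_name, _x, _y, _fwd) in product(models, datasets)
]
```

With two models and two datasets this yields four pairs, so the returned list is expected to grow as `len(models) * len(datasets)`.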

profile_model_performance

profile_model_performance(model: Any, input_data: Array | tuple[Array, ...], num_runs: int = 10, forward_fn: Callable | None = None) -> dict[str, float]

Profile model performance with multiple runs.

Parameters:

Name Type Description Default
model Any

Model to profile.

required
input_data Array | tuple[Array, ...]

Input data for profiling.

required
num_runs int

Number of runs for statistics.

10
forward_fn Callable | None

Custom forward function.

None

Returns:

Type Description
dict[str, float]

Dictionary with performance statistics.
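
Profiling over multiple runs usually means timing repeated forward calls and summarizing the samples. The sketch below shows one way to do that with the standard library. `profile_sketch`, the warm-up call, and the key names in the returned dictionary are all assumptions, not the documented output of `profile_model_performance`.

```python
import statistics
import time
from typing import Any, Callable


def profile_sketch(fn: Callable[[Any], Any], x: Any, num_runs: int = 10) -> dict[str, float]:
    """Time fn(x) num_runs times; the returned key names are hypothetical."""
    fn(x)  # one untimed warm-up call, e.g. to exclude one-off compilation cost
    timings = []
    for _ in range(num_runs):
        start = time.perf_counter()
        fn(x)
        timings.append(time.perf_counter() - start)
    return {
        "mean_time": statistics.fmean(timings),
        "std_time": statistics.stdev(timings) if num_runs > 1 else 0.0,
        "min_time": min(timings),
        "max_time": max(timings),
    }


# A cheap pure-Python workload stands in for a model forward pass.
stats = profile_sketch(lambda n: sum(i * i for i in range(n)), 10_000, num_runs=5)
```

When timing JAX code, the result of each call should be blocked on (for example with `block_until_ready()`) before stopping the clock, because dispatch is asynchronous.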

load_results

load_results() -> list[BenchmarkResult]

Load all benchmark results from files.

Returns:

Type Description
list[BenchmarkResult]

List of BenchmarkResults.

generate_summary_report

generate_summary_report() -> dict[str, Any]

Generate a complete summary report of all evaluations.

Returns:

Type Description
dict[str, Any]

Dictionary with summary statistics and analysis.

Validation Framework

Validation Framework for Opifex Advanced Benchmarking System.

Scientific accuracy validation against reference computational methods. Provides convergence analysis, chemical accuracy assessment, and error analysis for rigorous scientific computing validation.

Generic dataclasses (ConvergenceAnalysis, AccuracyAssessment) are replaced by calibrax.validation equivalents (ConvergenceResult, AccuracyResult).

ValidationReport dataclass

ValidationReport(*, benchmark_name: str, reference_method: str, accuracy_metrics: dict[str, float], convergence_metrics: dict[str, float], chemical_accuracy_status: bool | None = None, tolerance_violations: list[str] = list(), validation_passed: bool = False, notes: str = '')

Report of validation results against reference methods.

ErrorAnalysis dataclass

ErrorAnalysis(*, global_errors: dict[str, float], local_errors: dict[str, Array], error_distribution: dict[str, Any], outlier_analysis: dict[str, Any], spatial_error_patterns: dict[str, Any] | None = None, temporal_error_patterns: dict[str, Any] | None = None)

Error analysis between predictions and ground truth.

Physics-specific: includes spatial and temporal pattern detection not available in calibrax generic validation.

ValidationFramework

ValidationFramework(default_tolerances: list[float] | None = None, reference_methods: dict[str, Callable] | None = None)

Scientific accuracy validation against reference computational methods.

Provides:

- Comparison against established computational methods (FEM, FDM, spectral)
- Convergence rate analysis across multiple tolerance levels
- Chemical accuracy assessment for quantum computing applications
- Statistical error analysis with spatial and temporal pattern detection

Parameters:

Name Type Description Default
default_tolerances list[float] | None

Default tolerance levels for convergence testing.

None
reference_methods dict[str, Callable] | None

Dictionary of reference computational methods.

None

validate_against_reference

validate_against_reference(result: BenchmarkResult, reference_method: str, reference_data: Array | None = None, predictions: Array | None = None) -> ValidationReport

Validate benchmark results against a reference computational method.

Parameters:

Name Type Description Default
result BenchmarkResult

Benchmark result to validate.

required
reference_method str

Name of reference method.

required
reference_data Array | None

Reference solution data (if available).

None
predictions Array | None

Raw model predictions (if available). Required for meaningful accuracy metrics when reference_data is provided.

None

Returns:

Type Description
ValidationReport

Validation report with accuracy metrics and tolerance violations.

check_convergence_rates

check_convergence_rates(results_sequence: list[BenchmarkResult], tolerances: list[float] | None = None) -> ConvergenceResult

Analyze convergence rates across multiple tolerance levels.

Delegates to calibrax.validation.check_convergence after extracting metric series from BenchmarkResult sequence.

Parameters:

Name Type Description Default
results_sequence list[BenchmarkResult]

Sequence of results at different tolerance levels.

required
tolerances list[float] | None

Tolerance levels tested.

None

Returns:

Type Description
ConvergenceResult

ConvergenceResult from calibrax with rates and achievement flags.

assess_chemical_accuracy

assess_chemical_accuracy(result: BenchmarkResult, target_accuracy: float | None = None, accuracy_type: str = 'chemical_accuracy') -> AccuracyResult

Assess chemical accuracy for quantum computing applications.

Delegates to calibrax.validation.check_accuracy after extracting the appropriate metric from the BenchmarkResult.

Parameters:

Name Type Description Default
result BenchmarkResult

Benchmark result to assess.

required
target_accuracy float | None

Target accuracy threshold (defaults to domain standard).

None
accuracy_type str

Type of accuracy being assessed.

'chemical_accuracy'

Returns:

Type Description
AccuracyResult

AccuracyResult from calibrax with pass/fail and margin.

generate_error_analysis

generate_error_analysis(predictions: Array, ground_truth: Array, spatial_coords: Array | None = None, temporal_coords: Array | None = None) -> ErrorAnalysis

Generate error analysis for predictions vs ground truth.

Parameters:

Name Type Description Default
predictions Array

Model predictions.

required
ground_truth Array

Ground truth data.

required
spatial_coords Array | None

Spatial coordinates (if available).

None
temporal_coords Array | None

Temporal coordinates (if available).

None

Returns:

Type Description
ErrorAnalysis

ErrorAnalysis with global, local, distribution, and pattern data.

Analysis Engine

Analysis Engine for Opifex Advanced Benchmarking System.

Comparative analysis and performance insights generation for scientific computing benchmarks. Operator comparison and statistical testing delegate to calibrax.analysis and calibrax.statistics. Domain-specific recommendation logic and scaling analysis are retained here.

ComparisonReport dataclass

ComparisonReport(*, benchmark_name: str, operators_compared: list[str], metric_comparisons: dict[str, dict[str, float]], performance_rankings: dict[str, list[str]], statistical_significance: dict[str, dict[str, bool]], winner_by_metric: dict[str, str], overall_winner: str, improvement_factors: dict[str, dict[str, float]] = dict())

Report comparing multiple operators on the same benchmark.

ScalingAnalysis dataclass

ScalingAnalysis(*, operator_name: str, problem_sizes: list[int], scaling_metrics: dict[str, dict[int, float]], scaling_coefficients: dict[str, float], complexity_estimates: dict[str, str], efficiency_scores: dict[int, float], optimal_problem_size: int | None = None)

Analysis of scaling behavior across problem sizes.

InsightReport dataclass

InsightReport(*, benchmark_name: str, operator_name: str, key_insights: list[str], performance_bottlenecks: list[str], optimization_suggestions: list[str], domain_specific_observations: list[str], confidence_level: float = 0.0)

Performance insights for a specific benchmark run.

RecommendationReport dataclass

RecommendationReport(*, problem_type: str, domain: str, recommended_operators: list[dict[str, Any]], use_case_specific_recommendations: dict[str, str], performance_trade_offs: dict[str, str], implementation_considerations: list[str])

Recommendations for optimal operator selection.

AnalysisEngine

AnalysisEngine(significance_threshold: float = 0.05)

Comparative analysis and performance insights for scientific benchmarks.

Provides:

- Multi-operator performance comparisons with statistical significance
- Scaling behavior analysis across problem sizes
- Performance insights and bottleneck identification
- Intelligent operator recommendations for specific use cases

Statistical significance testing delegates to calibrax.statistics (welch_t_test, mann_whitney_u) for multi-run comparisons.

Parameters:

Name Type Description Default
significance_threshold float

Threshold for statistical significance.

0.05

compare_operators

compare_operators(results_dict: dict[str, BenchmarkResult]) -> ComparisonReport

Compare multiple operators on the same benchmark.

Delegates ranking and overall-winner determination to calibrax.analysis.compare_configurations(). Domain-specific features (improvement_factors, statistical_significance, weighted scoring) are retained here because calibrax lacks equivalents.

Parameters:

Name Type Description Default
results_dict dict[str, BenchmarkResult]

Dictionary mapping operator names to benchmark results.

required

Returns:

Type Description
ComparisonReport

Comparison report with rankings and improvement factors.

test_statistical_significance_multi_run

test_statistical_significance_multi_run(multi_run_results: dict[str, list[BenchmarkResult]]) -> dict[str, dict[str, dict[str, Any]]]

Test statistical significance with multiple runs per operator.

Delegates to calibrax.statistics.welch_t_test and mann_whitney_u for proper parametric and non-parametric testing.

Parameters:

Name Type Description Default
multi_run_results dict[str, list[BenchmarkResult]]

Operator names mapped to lists of results.

required

Returns:

Type Description
dict[str, dict[str, dict[str, Any]]]

Pairwise significance results with p-values and statistics.
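
For reference, Welch's t-test compares two sample means without assuming equal variances. The sketch below computes the t statistic and the Welch-Satterthwaite degrees of freedom with the standard library only. It is not the calibrax implementation, `welch_t_sketch` is a name introduced here, and the per-run error values are invented.

```python
import math
import statistics


def welch_t_sketch(a: list[float], b: list[float]) -> tuple[float, float]:
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    # Squared standard error of each sample mean.
    se2_a = statistics.variance(a) / len(a)
    se2_b = statistics.variance(b) / len(b)
    t_stat = (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (
        se2_a**2 / (len(a) - 1) + se2_b**2 / (len(b) - 1)
    )
    return t_stat, dof


# Hypothetical per-run errors for two operators, five runs each.
errors_op_a = [0.010, 0.012, 0.011, 0.013, 0.009]
errors_op_b = [0.020, 0.022, 0.021, 0.023, 0.019]
t_stat, dof = welch_t_sketch(errors_op_a, errors_op_b)
```

Turning the statistic into a p-value requires the Student t distribution, which the standard library does not provide. If SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same test and returns the p-value directly.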

create_operator_recommendations

create_operator_recommendations(problem_type: str, domain: str = 'general') -> RecommendationReport

Create operator recommendations for specific problem types.

Parameters:

Name Type Description Default
problem_type str

Type of problem (e.g., "pde_solving", "time_series").

required
domain str

Scientific domain.

'general'

Returns:

Type Description
RecommendationReport

Operator recommendation report.

analyze_scaling_behavior

analyze_scaling_behavior(performance_data: dict[int, BenchmarkResult]) -> ScalingAnalysis

Analyze scaling behavior across different problem sizes.

Parameters:

Name Type Description Default
performance_data dict[int, BenchmarkResult]

Dictionary mapping problem sizes to benchmark results.

required

Returns:

Type Description
ScalingAnalysis

Scaling behavior analysis.

generate_performance_insights

generate_performance_insights(result: BenchmarkResult) -> InsightReport

Generate performance insights for a benchmark run.

Parameters:

Name Type Description Default
result BenchmarkResult

Benchmark result to analyze.

required

Returns:

Type Description
InsightReport

Performance insights report.

Results Manager

Results Manager for Opifex Advanced Benchmarking System.

Data persistence and publication-ready export capabilities. Provides results storage, publication plot generation, comparison tables, and benchmark database management. Each saved result is also persisted to a calibrax Store for cross-tool interoperability.

ResultsManager

ResultsManager(storage_path: str = './benchmark_results', database_path: str | None = None)

Data persistence and publication-ready export capabilities.

Provides:

- Persistent storage of benchmark results with metadata
- calibrax Store write-through for cross-tool interoperability
- Publication-ready plot and table generation
- Benchmark database maintenance and querying
- Export formats for different publication venues

Parameters:

Name Type Description Default
storage_path str

Base path for storing benchmark results.

'./benchmark_results'
database_path str | None

Path to benchmark database file.

None

save_benchmark_results

save_benchmark_results(result: BenchmarkResult, extra_metadata: dict[str, Any] | None = None) -> str

Save benchmark results with metadata.

Parameters:

Name Type Description Default
result BenchmarkResult

Benchmark result to save.

required
extra_metadata dict[str, Any] | None

Additional metadata to store alongside.

None

Returns:

Type Description
str

Unique identifier for saved results.

load_result

load_result(result_id: str) -> BenchmarkResult | None

Load benchmark result by ID.

Parameters:

Name Type Description Default
result_id str

Unique identifier for results.

required

Returns:

Type Description
BenchmarkResult | None

Loaded BenchmarkResult or None if not found.

load_results

load_results(result_id: str) -> BenchmarkResult | None

Load benchmark results by ID.

Alias for load_result, kept for backward compatibility.

query_results

query_results(name: str | None = None, dataset: str | None = None, metric_filter: dict[str, tuple[float, float]] | None = None) -> list[dict[str, Any]]

Query benchmark database with filters.

Parameters:

Name Type Description Default
name str | None

Filter by benchmark name.

None
dataset str | None

Filter by dataset tag.

None
metric_filter dict[str, tuple[float, float]] | None

Filter by metric ranges {metric: (min, max)}.

None

Returns:

Type Description
list[dict[str, Any]]

List of matching database entries.
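
The `metric_filter` argument maps each metric name to a `(min, max)` range. The sketch below shows how such a filter could be applied to a list of entries. The entry layout, the inclusive bounds, and the rule that entries missing a filtered metric are excluded are all assumptions made for this example.

```python
def matches_metric_filter(
    metrics: dict[str, float],
    metric_filter: dict[str, tuple[float, float]],
) -> bool:
    """True if every filtered metric is present and lies within [min, max]."""
    return all(
        name in metrics and low <= metrics[name] <= high
        for name, (low, high) in metric_filter.items()
    )


# Hypothetical database entries; the real entry schema may differ.
entries = [
    {"name": "fno_darcy", "metrics": {"mse": 0.002, "mae": 0.03}},
    {"name": "deeponet_darcy", "metrics": {"mse": 0.020, "mae": 0.09}},
    {"name": "mae_only_run", "metrics": {"mae": 0.01}},
]

selected = [
    entry["name"]
    for entry in entries
    if matches_metric_filter(entry["metrics"], {"mse": (0.0, 0.01)})
]
```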

get_database_statistics

get_database_statistics() -> dict[str, Any]

Get statistics about the benchmark database.

Returns:

Type Description
dict[str, Any]

Database statistics summary.

create_benchmark_database_entry

create_benchmark_database_entry(result: BenchmarkResult) -> dict[str, Any]

Create standardized database entry for benchmark results.

Parameters:

Name Type Description Default
result BenchmarkResult

Benchmark result.

required

Returns:

Type Description
dict[str, Any]

Standardized database entry dictionary.

export_database

export_database(export_path: str, output_format: str = 'json') -> None

Export entire benchmark database.

Parameters:

Name Type Description Default
export_path str

Path to export file.

required
output_format str

Export format ("json").

'json'

export_publication_plots

export_publication_plots(results: list[BenchmarkResult], plot_type: Literal['comparison', 'scaling', 'convergence'] = 'comparison', output_format: str = 'png') -> list[Path]

Export publication-ready plots.

Parameters:

Name Type Description Default
results list[BenchmarkResult]

List of benchmark results to plot.

required
plot_type Literal['comparison', 'scaling', 'convergence']

Type of plot to generate.

'comparison'
output_format str

Output format (png, pdf, svg).

'png'

Returns:

Type Description
list[Path]

List of paths to generated plot files.

generate_comparison_tables

generate_comparison_tables(operators: list[str], metrics: list[str], output_format: Literal['latex', 'html', 'csv'] = 'latex') -> Path

Generate publication-ready comparison tables.

Queries the local benchmark database and generates a formatted comparison table in the requested output format.

Parameters:

Name Type Description Default
operators list[str]

List of operator names to include.

required
metrics list[str]

List of metrics to include in table.

required
output_format Literal['latex', 'html', 'csv']

Output format.

'latex'

Returns:

Type Description
Path

Path to generated table file.

Baseline Repository

Baseline Repository Module.

Stores and retrieves baseline performance metrics for PDEBench datasets. Delegates persistence to calibrax.storage.Store while retaining domain-specific comparison and reporting logic.

BaselineRepository

BaselineRepository(baseline_data_path: str | None = None, store_path: Path | str | None = None)

Repository for storing and retrieving baseline performance metrics.

Manages a database of baseline performance metrics for standard PDEBench datasets, enabling comparison of new models against established benchmarks. New baselines are persisted via a calibrax.storage.Store.

Parameters:

Name Type Description Default
baseline_data_path str | None

Path to baseline data file (JSON format).

None
store_path Path | str | None

Directory for calibrax Store persistence.

None

save_baselines

save_baselines() -> None

Save baseline data to file.

get_baseline_metrics

get_baseline_metrics(dataset_name: str, model_type: str) -> dict[str, float]

Get baseline metrics for a specific dataset and model type.

Parameters:

Name Type Description Default
dataset_name str

Name of the dataset

required
model_type str

Type of model (e.g., "fno", "deeponet")

required

Returns:

Type Description
dict[str, float]

Dictionary of baseline metrics

Raises:

Type Description
ValueError

If dataset or model type not found

get_available_datasets

get_available_datasets() -> list[str]

Get list of datasets with baseline data.

get_available_model_types

get_available_model_types(dataset_name: str) -> list[str]

Get list of model types with baselines for a dataset.

Parameters:

Name Type Description Default
dataset_name str

Name of the dataset

required

Returns:

Type Description
list[str]

List of available model types

add_baseline

add_baseline(dataset_name: str, model_type: str, metrics: dict[str, float], source: str = 'User Added', model_config: dict[str, Any] | None = None, notes: str | None = None) -> None

Add a new baseline to the repository.

Persists both to the JSON file and to the calibrax Store.

Parameters:

Name Type Description Default
dataset_name str

Name of the dataset.

required
model_type str

Type of model.

required
metrics dict[str, float]

Performance metrics.

required
source str

Source of the baseline data.

'User Added'
model_config dict[str, Any] | None

Model configuration details.

None
notes str | None

Additional notes.

None

compare_to_baseline

compare_to_baseline(dataset_name: str, model_type: str, test_metrics: dict[str, float], metrics_to_compare: list[str] | None = None) -> dict[str, dict[str, float]]

Compare test metrics to baseline metrics.

Parameters:

Name Type Description Default
dataset_name str

Name of the dataset

required
model_type str

Type of model

required
test_metrics dict[str, float]

Metrics to compare against baseline

required
metrics_to_compare list[str] | None

Specific metrics to compare (None for all)

None

Returns:

Type Description
dict[str, dict[str, float]]

Dictionary with comparison results including relative improvements
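
As an illustration of "relative improvements", the sketch below compares the metrics that both dictionaries share and reports `(baseline - test) / baseline`, which is positive when a lower-is-better metric improves. The formula, the key names in the report, and the handling of a zero baseline are assumptions, not the documented behavior of `compare_to_baseline`.

```python
def compare_sketch(
    baseline: dict[str, float],
    test: dict[str, float],
) -> dict[str, dict[str, float]]:
    """Compare shared metrics; assumes lower values are better."""
    report = {}
    for name in baseline.keys() & test.keys():
        base, new = baseline[name], test[name]
        report[name] = {
            "baseline": base,
            "test": new,
            # Guard against division by zero for a perfect baseline.
            "relative_improvement": (base - new) / base if base else 0.0,
        }
    return report


# Hypothetical metrics: only "mse" appears on both sides.
report = compare_sketch(
    {"mse": 0.010, "mae": 0.05},
    {"mse": 0.008, "rel_l2": 0.1},
)
```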

get_best_baseline

get_best_baseline(dataset_name: str, metric: str = 'mse') -> tuple[str, dict[str, float]]

Get the best baseline for a dataset based on a specific metric.

Parameters:

Name Type Description Default
dataset_name str

Name of the dataset

required
metric str

Metric to use for comparison

'mse'

Returns:

Type Description
tuple[str, dict[str, float]]

Tuple of (model_type, metrics) for the best baseline

generate_baseline_summary

generate_baseline_summary() -> dict[str, Any]

Generate a full summary of all baselines.

Returns:

Type Description
dict[str, Any]

Dictionary with baseline summary statistics

Operator Executor

Operator Executor - Runs actual Opifex operators for benchmarking.

This module replaces the mock execution in BenchmarkRunner with real operator training and evaluation.

ExecutionConfig dataclass

ExecutionConfig(*, n_epochs: int = 100, batch_size: int = 32, learning_rate: float = 0.001, warmup_steps: int = 5, eval_frequency: int = 10, use_mixed_precision: bool = False, seed: int = 42)

Configuration for benchmark execution.

OperatorExecutor

OperatorExecutor(config: ExecutionConfig | None = None)

Executes actual Opifex operators for benchmarking.

This class provides the core execution logic that was missing from the original BenchmarkRunner implementation. It uses:

- Real Opifex operators (TFNO, DeepONet, etc.)
- Real Opifex data loaders (create_darcy_loader, etc.)
- Flax NNX 0.11.0+ optimizer pattern
- calibrax.metrics for evaluation (DRY)

Parameters:

Name Type Description Default
config ExecutionConfig | None

Execution configuration. Uses defaults if None.

None

execute_training_benchmark

execute_training_benchmark(operator_class: type, operator_config: dict[str, Any], train_loader: Any, test_loader: Any, benchmark_name: str) -> BenchmarkResult

Execute a training benchmark with an actual operator.

Parameters:

Name Type Description Default
operator_class type

Opifex operator class to instantiate

required
operator_config dict[str, Any]

Configuration dict for operator

required
train_loader Any

Training data loader (from opifex.data.loaders)

required
test_loader Any

Test data loader

required
benchmark_name str

Name of benchmark for results

required

Returns:

Type Description
BenchmarkResult

BenchmarkResult with real metrics from training

Adapters

Adapter for converting BenchmarkResult lists to calibrax Run objects.

Bridges the opifex benchmarking pipeline (which produces BenchmarkResult lists) with calibrax's Run-based analysis and storage APIs.

results_to_run

results_to_run(results: list[BenchmarkResult], *, commit: str | None = None, branch: str | None = None, metric_defs: dict[str, MetricDef] | None = None) -> Run

Convert a list of BenchmarkResult objects to a calibrax Run.

Maps each BenchmarkResult to a Point:

- BenchmarkResult.name -> Point.name
- BenchmarkResult.tags["dataset"] -> Point.scenario (default: "unknown")
- BenchmarkResult.tags -> Point.tags
- BenchmarkResult.metrics -> Point.metrics (same Metric type)

Parameters:

Name Type Description Default
results list[BenchmarkResult]

List of benchmark results to convert.

required
commit str | None

Git commit hash to attach to the Run.

None
branch str | None

Git branch name to attach to the Run.

None
metric_defs dict[str, MetricDef] | None

Metric definitions for semantic interpretation.

None

Returns:

Type Description
Run

A calibrax Run containing one Point per BenchmarkResult.
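
The field mapping described for results_to_run can be sketched with stand-in types. This is an illustration only: `ResultSketch`, `PointSketch`, and `to_points` are hypothetical names, metrics are simplified to plain floats, and the construction of the calibrax Run itself is omitted because its constructor is not documented here.

```python
from dataclasses import dataclass, field


@dataclass
class ResultSketch:
    """Hypothetical stand-in for BenchmarkResult (only the mapped fields)."""

    name: str
    tags: dict[str, str] = field(default_factory=dict)
    metrics: dict[str, float] = field(default_factory=dict)


@dataclass
class PointSketch:
    """Hypothetical stand-in for a calibrax Point."""

    name: str
    scenario: str
    tags: dict[str, str]
    metrics: dict[str, float]


def to_points(results: list[ResultSketch]) -> list[PointSketch]:
    """Apply the documented mapping: one point per result."""
    return [
        PointSketch(
            name=r.name,
            # The "dataset" tag becomes the scenario, defaulting to "unknown".
            scenario=r.tags.get("dataset", "unknown"),
            tags=dict(r.tags),
            metrics=dict(r.metrics),
        )
        for r in results
    ]


points = to_points(
    [
        ResultSketch(name="TFNO", tags={"dataset": "darcy"}, metrics={"mse": 0.01}),
        ResultSketch(name="DeepONet", metrics={"mse": 0.02}),
    ]
)
```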

default_metric_defs

default_metric_defs() -> dict[str, MetricDef]

Create standard metric definitions for scientific ML benchmarks.

Returns:

Type Description
dict[str, MetricDef]

Dictionary mapping metric names to MetricDef objects with proper direction, units, and priority annotations.

Validators — Chemical Accuracy

Chemical accuracy validation for scientific ML benchmarks.

Assesses whether a benchmark result meets domain-specific accuracy thresholds by delegating to calibrax.validation.check_accuracy().

ChemicalAccuracyAssessment dataclass

ChemicalAccuracyAssessment(*, passed: bool, domain: str, threshold: float, achieved: float, margin: float, accuracy_result: AccuracyResult, recommendations: tuple[str, ...] = tuple())

Result of a chemical accuracy assessment.

Wraps a calibrax.validation.AccuracyResult with domain context and actionable recommendations.

Attributes:

Name Type Description
passed bool

Whether the result meets the chemical accuracy threshold.

domain str

Scientific domain used for assessment.

threshold float

Accuracy threshold applied.

achieved float

Achieved error value.

margin float

Headroom (positive) or deficit (negative) relative to threshold.

accuracy_result AccuracyResult

Underlying calibrax AccuracyResult.

recommendations tuple[str, ...]

Suggested actions if assessment fails.

ChemicalAccuracyValidator

ChemicalAccuracyValidator(thresholds: dict[str, float] | None = None, error_metric: str = 'relative_error')

Validates benchmark results against domain-specific chemical accuracy thresholds.

Delegates accuracy computation to calibrax.validation.check_accuracy().

Note: Registry registration is intentionally omitted; validators are instantiated directly rather than discovered dynamically.

Parameters:

Name Type Description Default
thresholds dict[str, float] | None

Custom domain-to-threshold mapping. Merged with defaults.

None
error_metric str

Metric name to extract from BenchmarkResult.

'relative_error'

assess

assess(result: BenchmarkResult, domain: str | None = None) -> ChemicalAccuracyAssessment

Assess whether a benchmark result meets chemical accuracy for a domain.

Parameters:

Name Type Description Default
result BenchmarkResult

Benchmark result containing error metrics.

required
domain str | None

Scientific domain. Auto-detected from result tags/domain if None.

None

Returns:

Type Description
ChemicalAccuracyAssessment

Assessment with pass/fail, margin, and recommendations.

Raises:

Type Description
ValueError

If domain is unknown and cannot be auto-detected.

KeyError

If the error metric is not present in the result.

Validators — Conservation Laws

Conservation law validation for scientific ML benchmarks.

Orchestrates conservation law checks from opifex.core.physics.conservation and optionally delegates convergence analysis to calibrax.

ConservationReport dataclass

ConservationReport(*, violations: dict[str, float], all_conserved: bool, worst_violation: float, convergence: ConvergenceResult | None = None)

Report from conservation law validation.

Uses a local dataclass instead of calibrax.validation.ValidationReport because conservation checking requires violation magnitudes (dict[str, float]) rather than textual violation descriptions (tuple[str, ...]), plus domain-specific fields (worst_violation, all_conserved) that ValidationReport does not provide. The to_validation_report() method bridges the two when calibrax interop is needed.

Attributes:

Name Type Description
violations dict[str, float]

Conservation law name to violation magnitude.

all_conserved bool

True if all violations are zero (within tolerance).

worst_violation float

Maximum violation across all checked laws.

convergence ConvergenceResult | None

Optional convergence result from multi-resolution analysis.

to_validation_report

to_validation_report() -> ValidationReport

Convert to a calibrax ValidationReport for cross-tool interop.

Returns:

Type Description
ValidationReport

A ValidationReport with violation magnitudes as accuracy_metrics and textual summaries in the violations tuple.

ConservationValidator

ConservationValidator(laws: Sequence[str] | None = None, energy_tolerance: float = 1e-06, momentum_tolerance: float = 1e-05, mass_target: float = 1.0, mass_tolerance: float = 0.0001)

Validates physics conservation laws on model predictions.

Orchestrates existing pure-JAX functions from opifex.core.physics.conservation and provides a unified interface.

Parameters:

Name Type Description Default
laws Sequence[str] | None

Conservation laws to check. Defaults to energy and momentum.

None
energy_tolerance float

Tolerance for energy conservation check.

1e-06
momentum_tolerance float

Tolerance for momentum conservation check.

1e-05
mass_target float

Target mass for mass conservation check.

1.0
mass_tolerance float

Tolerance for mass conservation check.

0.0001

validate

validate(y_pred: Array, y_true: Array) -> ConservationReport

Validate conservation laws on a single prediction set.

Parameters:

Name Type Description Default
y_pred Array

Model predictions.

required
y_true Array

Ground truth values.

required

Returns:

Type Description
ConservationReport

ConservationReport with violations and overall status.
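
How the report fields relate to one another can be sketched as follows. This is a simplified illustration under stated assumptions: violation magnitudes are taken as already computed, each one is compared against its own per-law tolerance, and `summarize_violations` is a hypothetical helper rather than an Opifex function. The tolerances used are the documented constructor defaults.

```python
def summarize_violations(
    violations: dict[str, float],
    tolerances: dict[str, float],
) -> dict[str, object]:
    """Derive all_conserved and worst_violation from per-law magnitudes."""
    # The worst violation is the largest magnitude across all checked laws.
    worst = max((abs(v) for v in violations.values()), default=0.0)
    # A law counts as conserved when its magnitude is within its tolerance.
    all_conserved = all(
        abs(value) <= tolerances[law] for law, value in violations.items()
    )
    return {
        "violations": dict(violations),
        "all_conserved": all_conserved,
        "worst_violation": worst,
    }


tolerances = {"energy": 1e-06, "momentum": 1e-05}
failing = summarize_violations({"energy": 5e-07, "momentum": 3e-05}, tolerances)
passing = summarize_violations({"energy": 5e-07, "momentum": 2e-06}, tolerances)
```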

validate_convergence

validate_convergence(predictions: Sequence[Array], truths: Sequence[Array], tolerances: Sequence[float]) -> ConvergenceResult

Validate conservation convergence across multiple resolutions.

Computes violations at each resolution and delegates convergence analysis to calibrax.validation.check_convergence().

Parameters:

Name Type Description Default
predictions Sequence[Array]

Predictions at increasing resolutions.

required
truths Sequence[Array]

Ground truths at increasing resolutions.

required
tolerances Sequence[float]

Tolerance thresholds for convergence check.

required

Returns:

Type Description
ConvergenceResult

ConvergenceResult with rates and achievement flags.
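
The idea of "rates and achievement flags" can be illustrated with the standard observed-order estimate between successive resolutions. This sketch is not the calibrax implementation: it assumes each refinement halves the grid spacing (ratio 2), that one scalar violation per resolution is available, and that a flag is set when that violation is at or below the matching tolerance. `observed_rates` is a hypothetical name.

```python
import math


def observed_rates(errors: list[float], refinement_ratio: float = 2.0) -> list[float]:
    """Estimate the convergence order between each pair of resolutions."""
    return [
        math.log(coarse / fine) / math.log(refinement_ratio)
        for coarse, fine in zip(errors, errors[1:])
    ]


# Violations measured at three increasingly fine resolutions.
errors = [1e-2, 2.5e-3, 6.25e-4]
tolerances = [1e-2, 1e-3, 1e-3]

rates = observed_rates(errors)
achieved = [err <= tol for err, tol in zip(errors, tolerances)]
```

In this example the error drops by a factor of four per refinement, which corresponds to second-order convergence under the ratio-2 assumption.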

Shared Utilities

Shared constants and utilities for the benchmarking module.

Centralises domain inference, metric classification, and chemical accuracy thresholds to eliminate duplication across sub-modules.

LOWER_IS_BETTER module-attribute

LOWER_IS_BETTER: frozenset[str] = frozenset({'mse', 'mae', 'rmse', 'relative_error', 'mape', 'execution_time'})

Metrics where a lower value indicates better performance.
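
A typical use of such a set is to decide the comparison direction when picking a winner. The sketch below copies the documented constant and adds a hypothetical helper, `best_model`; treating any metric outside the set as higher-is-better is an assumption.

```python
LOWER_IS_BETTER = frozenset(
    {"mse", "mae", "rmse", "relative_error", "mape", "execution_time"}
)


def best_model(scores: dict[str, float], metric: str) -> str:
    """Return the model name with the best score for the given metric."""
    # Assumption: metrics not listed in LOWER_IS_BETTER are higher-is-better.
    pick = min if metric in LOWER_IS_BETTER else max
    return pick(scores, key=scores.get)


best_by_mse = best_model({"fno": 0.02, "deeponet": 0.05}, "mse")
best_by_r2 = best_model({"fno": 0.91, "deeponet": 0.95}, "r2_score")
```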

ACCURACY_METRIC_KEYS module-attribute

ACCURACY_METRIC_KEYS: tuple[str, ...] = ('mse', 'mae', 'rmse', 'r2_score', 'relative_error')

Standard accuracy metric keys used across reporting and analysis.

CHEMICAL_ACCURACY_THRESHOLDS module-attribute

CHEMICAL_ACCURACY_THRESHOLDS: dict[str, float] = {'quantum_computing': 0.001, 'materials_science': 0.05, 'molecular_dynamics': 0.01}

Domain-specific accuracy thresholds for chemical/physical accuracy checks.

infer_domain

infer_domain(dataset_name: str) -> str

Infer scientific domain from dataset name.

Parameters:

Name Type Description Default
dataset_name str

Name of the dataset.

required

Returns:

Type Description
str

Inferred domain string, or "general" if no match.

extract_metric_value

extract_metric_value(result: BenchmarkResult, metric_name: str, default: float = float('inf')) -> float

Extract a scalar metric value from a BenchmarkResult.

Parameters:

Name Type Description Default
result BenchmarkResult

Benchmark result to extract from.

required
metric_name str

Name of the metric.

required
default float

Value to return if metric is absent.

float('inf')

Returns:

Type Description
float

The metric value as a float.
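
The role of the `float('inf')` default can be sketched with a simplified lookup. This is an assumption-laden illustration: it takes a plain metrics mapping rather than a BenchmarkResult, and it assumes a metric is either a bare number or an object exposing a `value` attribute. `MetricSketch` and `extract_metric_value_sketch` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class MetricSketch:
    """Hypothetical stand-in for a metric object carrying a scalar value."""

    value: float


def extract_metric_value_sketch(
    metrics: dict[str, object],
    metric_name: str,
    default: float = float("inf"),
) -> float:
    """Return the metric as a float, or the default when it is absent."""
    raw = metrics.get(metric_name)
    if raw is None:
        return default
    # Unwrap objects with a .value attribute; plain numbers pass through.
    return float(getattr(raw, "value", raw))


wrapped = extract_metric_value_sketch({"mse": MetricSketch(value=0.1)}, "mse")
plain = extract_metric_value_sketch({"mse": 0.25}, "mse")
missing = extract_metric_value_sketch({}, "mse")
```

With an infinite default, a result that lacks an error metric sorts last in any lower-is-better comparison instead of raising.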

Report Generation

Report generation for PDEBench evaluation and benchmarking results.

This module provides full report generation capabilities for PDEBench evaluation results, including statistical analysis, baseline comparisons, and publication-ready formatted outputs.

PDEBenchReportGenerator

PDEBenchReportGenerator(report_format: str = 'json')

Generator for full PDEBench evaluation reports.

Creates detailed reports from evaluation results including statistical analysis, baseline comparisons, and multiple output formats for both programmatic access and human readability.

Parameters:

Name Type Description Default
report_format str

Default output format ("json" or "text")

'json'

generate_evaluation_report

generate_evaluation_report(evaluation_results: dict[str, Any], baseline_comparisons: dict[str, Any] | None = None, dataset_info: dict[str, str] | None = None, model_info: dict[str, str] | None = None) -> dict[str, Any]

Generate full evaluation report.

Parameters:

Name Type Description Default
evaluation_results dict[str, Any]

Results from benchmarking evaluation

required
baseline_comparisons dict[str, Any] | None

Optional baseline comparison data

None
dataset_info dict[str, str] | None

Optional dataset metadata

None
model_info dict[str, str] | None

Optional model metadata

None

Returns:

Type Description
dict[str, Any]

Complete evaluation report dictionary

format_report_as_text

format_report_as_text(report: dict[str, Any]) -> str

Format report as human-readable text.

save_report

save_report(report: dict[str, Any], filepath: str, format_type: str | None = None) -> None

Save report to file.

Parameters:

Name Type Description Default
report dict[str, Any]

Report data to save

required
filepath str

Output file path

required
format_type str | None

Output format ("json" or "text"), defaults to self.report_format

None
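
Writing a report in either format can be sketched with the standard library. This is a minimal sketch, not the generator's code: the flat `key: value` text layout and the ValueError on an unsupported format are assumptions, and `save_report_sketch` is a hypothetical name.

```python
import json
import tempfile
from pathlib import Path


def save_report_sketch(report: dict, filepath: str, format_type: str = "json") -> None:
    """Write a report dictionary as JSON or as simple key/value text."""
    path = Path(filepath)
    if format_type == "json":
        # default=str keeps non-JSON types (e.g. timestamps) serializable.
        path.write_text(json.dumps(report, indent=2, default=str), encoding="utf-8")
    elif format_type == "text":
        lines = [f"{key}: {value}" for key, value in report.items()]
        path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    else:
        raise ValueError(f"unsupported format: {format_type!r}")


with tempfile.TemporaryDirectory() as tmp:
    json_path = Path(tmp) / "report.json"
    text_path = Path(tmp) / "report.txt"
    save_report_sketch({"model": "TFNO", "mse": 0.01}, str(json_path))
    save_report_sketch({"model": "TFNO", "mse": 0.01}, str(text_path), "text")
    loaded = json.loads(json_path.read_text(encoding="utf-8"))
    text = text_path.read_text(encoding="utf-8")
```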

generate_summary_statistics

generate_summary_statistics(reports: list[dict[str, Any]]) -> dict[str, Any]

Generate summary statistics across multiple reports.

Parameters:

Name Type Description Default
reports list[dict[str, Any]]

List of evaluation reports to analyze

required

Returns:

Type Description
dict[str, Any]

Summary statistics across all reports

generate_comprehensive_report

generate_comprehensive_report(results: list[BenchmarkResult], include_baseline_comparison: bool = True, include_statistical_analysis: bool = True) -> dict[str, Any]

Generate full report from benchmark results.

Parameters:

Name Type Description Default
results list[BenchmarkResult]

List of BenchmarkResult objects

required
include_baseline_comparison bool

Whether to include baseline comparisons

True
include_statistical_analysis bool

Whether to include statistical analysis

True

Returns:

Type Description
dict[str, Any]

Full report dictionary

Visualization Tools

Visualization Tools Module

This module provides visualization utilities for PDEBench benchmarking results. It focuses on generating figure metadata and configuration rather than actual plotting to integrate optimally with the core scientific framework.

Key Features:

- Figure metadata generation for comparison charts
- Configuration for publication-ready visualizations
- Support for multiple chart types and metrics
- Integration with benchmarking infrastructure

Following Critical Technical Guidelines:

- JAX-native data processing
- Type hints and full documentation
- No external plotting dependencies (metadata only)

PDEBenchVisualizer

PDEBenchVisualizer()

Visualization utilities for PDEBench benchmark results.

This class generates figure metadata and configurations for creating charts and plots of benchmark results. It avoids direct plotting to maintain lightweight dependencies.

create_comparison_chart

create_comparison_chart(results: list[BenchmarkResult], metric: str, title: str = 'Model Comparison', sort_by_performance: bool = True) -> dict[str, Any]

Create metadata for a model comparison chart.

Parameters:

Name Type Description Default
results list[BenchmarkResult]

List of benchmark results to compare

required
metric str

Metric to use for comparison

required
title str

Chart title

'Model Comparison'
sort_by_performance bool

Whether to sort results by performance

True

Returns:

Type Description
dict[str, Any]

Dictionary with figure metadata and configuration
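
Because the visualizer returns metadata rather than a rendered figure, the output is an ordinary dictionary that any plotting library can consume. The sketch below shows one plausible shape; the key names (`chart_type`, `labels`, `values`), the `lower_is_better` flag, and the function name are all hypothetical, and scores are passed as a plain mapping instead of BenchmarkResult objects.

```python
def comparison_chart_sketch(
    scores: dict[str, float],
    metric: str,
    title: str = "Model Comparison",
    sort_by_performance: bool = True,
    lower_is_better: bool = True,
) -> dict[str, object]:
    """Build plot-ready metadata for a bar chart comparing models."""
    items = list(scores.items())
    if sort_by_performance:
        # Best model first: ascending for error metrics, descending otherwise.
        items.sort(key=lambda pair: pair[1], reverse=not lower_is_better)
    return {
        "chart_type": "bar",
        "title": title,
        "metric": metric,
        "labels": [name for name, _ in items],
        "values": [value for _, value in items],
    }


chart = comparison_chart_sketch({"fno": 0.02, "deeponet": 0.05, "unet": 0.01}, "mse")
```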

create_multi_metric_comparison

create_multi_metric_comparison(results: list[BenchmarkResult], metrics: list[str], title: str = 'Multi-Metric Comparison') -> dict[str, Any]

Create metadata for multi-metric comparison chart.

Parameters:

Name Type Description Default
results list[BenchmarkResult]

List of benchmark results

required
metrics list[str]

List of metrics to compare

required
title str

Chart title

'Multi-Metric Comparison'

Returns:

Type Description
dict[str, Any]

Dictionary with figure metadata

create_performance_trends

create_performance_trends(results: list[BenchmarkResult], group_by: str = 'dataset_name', metric: str = 'mse') -> dict[str, Any]

Create metadata for performance trends visualization.

Parameters:

Name Type Description Default
results list[BenchmarkResult]

List of benchmark results

required
group_by str

Field to group results by

'dataset_name'
metric str

Metric to track trends for

'mse'

Returns:

Type Description
dict[str, Any]

Dictionary with trend visualization metadata

create_baseline_comparison

create_baseline_comparison(results: list[BenchmarkResult], baseline_metrics: dict[str, dict[str, float]], metric: str = 'mse') -> dict[str, Any]

Create metadata for baseline comparison visualization.

Parameters:

Name Type Description Default
results list[BenchmarkResult]

Test results to compare

required
baseline_metrics dict[str, dict[str, float]]

Dictionary of baseline metrics by model type

required
metric str

Metric to use for comparison

'mse'

Returns:

Type Description
dict[str, Any]

Dictionary with baseline comparison metadata

create_error_distribution

create_error_distribution(results: list[BenchmarkResult], error_metric: str = 'mae') -> dict[str, Any]

Create metadata for error distribution visualization.

Parameters:

Name Type Description Default
results list[BenchmarkResult]

List of benchmark results

required
error_metric str

Error metric to analyze distribution for

'mae'

Returns:

Type Description
dict[str, Any]

Dictionary with error distribution metadata

create_model_ranking

create_model_ranking(results: list[BenchmarkResult], ranking_metrics: list[str], weights: dict[str, float] | None = None) -> dict[str, Any]

Create metadata for model ranking visualization.

Parameters:

Name Type Description Default
results list[BenchmarkResult]

List of benchmark results

required
ranking_metrics list[str]

Metrics to use for ranking

required
weights dict[str, float] | None

Optional weights for each metric

None

Returns:

Type Description
dict[str, Any]

Dictionary with model ranking metadata

get_visualization_summary

get_visualization_summary(results: list[BenchmarkResult]) -> dict[str, Any]

Generate a summary of available visualization options.

Parameters:

Name Type Description Default
results list[BenchmarkResult]

List of benchmark results

required

Returns:

Type Description
dict[str, Any]

Dictionary with visualization recommendations

PDE Bench Integration

PDEBench Integration Module

This module provides full integration with PDEBench datasets for standardized evaluation of neural operators. It includes dataset loading, preprocessing, and automated evaluation pipelines.

Key Features:

- Support for major PDEBench datasets (Advection, Burgers, Darcy Flow, etc.)
- Standardized data preprocessing for neural operator compatibility
- Automated evaluation pipelines with statistical analysis
- Integration with existing benchmarking infrastructure

Following Critical Technical Guidelines:

- JAX-native data processing for GPU compatibility
- Flax NNX integration for neural operator evaluation
- Test-driven development with full coverage
- Type hints and documentation for all public APIs

PDEBenchLoader

PDEBenchLoader(data_root: str | None = None, cache_dir: str | None = None)

Loads and preprocesses PDEBench datasets for neural operator evaluation.

This class provides a unified interface for loading standard PDE benchmark datasets with automatic preprocessing for compatibility with different neural operator architectures (FNO, DeepONet, etc.).

Parameters:

Name Type Description Default
data_root str | None

Root directory for PDEBench datasets

None
cache_dir str | None

Directory for caching preprocessed datasets

None

list_available_datasets

list_available_datasets() -> list[str]

List all supported PDEBench datasets.

get_dataset_info

get_dataset_info(dataset_name: str) -> dict[str, Any]

Get detailed information about a specific dataset.

Parameters:

Name Type Description Default
dataset_name str

Name of the dataset

required

Returns:

Type Description
dict[str, Any]

Dictionary containing dataset metadata and characteristics

load_dataset

load_dataset(dataset_name: str, subset_size: int | None = None, resolution: str = 'low', split: str = 'test', normalize: bool = True, format_for_model: str = 'auto') -> dict[str, Any]

Load and preprocess a PDEBench dataset.

Parameters:

Name Type Description Default
dataset_name str

Name of the dataset to load

required
subset_size int | None

Number of samples to load (None for full dataset)

None
resolution str

Resolution setting ("low", "medium", "high")

'low'
split str

Dataset split ("train", "val", "test")

'test'
normalize bool

Whether to normalize the data

True
format_for_model str

Target model format ("fno", "deeponet", "auto")

'auto'

Returns:

Type Description
dict[str, Any]

Dictionary containing:

- input_data: Input arrays
- target_data: Target arrays
- metadata: Dataset metadata

PDEBenchEvaluationPipeline

PDEBenchEvaluationPipeline(output_dir: str | None = None)

Automated evaluation pipeline for PDEBench datasets.

This class provides end-to-end evaluation workflows that integrate dataset loading, model evaluation, and result analysis.

Parameters:

Name Type Description Default
output_dir str | None

Directory for saving evaluation results

None

evaluate_model_on_datasets

evaluate_model_on_datasets(model: Any, model_name: str, datasets: list[str], subset_size: int = 10, resolution: str = 'low', **kwargs: Any) -> list[BenchmarkResult]

Evaluate a model on multiple PDEBench datasets.

Parameters:

Name Type Description Default
model Any

Neural operator model to evaluate

required
model_name str

Name identifier for the model

required
datasets list[str]

List of dataset names to evaluate on

required
subset_size int

Number of samples per dataset

10
resolution str

Resolution setting for datasets

'low'
**kwargs Any

Additional arguments for evaluation

{}

Returns:

Type Description
list[BenchmarkResult]

List of benchmark results for each dataset

run_comprehensive_evaluation

run_comprehensive_evaluation(models: list[tuple[str, Any]], datasets: list[str] | None = None, resolutions: list[str] | None = None, subset_size: int = 10) -> dict[str, list[BenchmarkResult]]

Run full evaluation across multiple models and datasets.

Parameters:

Name Type Description Default
models list[tuple[str, Any]]

List of (model_name, model) tuples

required
datasets list[str] | None

List of datasets to evaluate (None for all supported)

None
resolutions list[str] | None

List of resolutions to test (None for just "low")

None
subset_size int

Number of samples per dataset

10

Returns:

Type Description
dict[str, list[BenchmarkResult]]

Dictionary mapping model names to their evaluation results

CLI

Benchmarking CLI - Command-line interface for running Opifex benchmarks.

Usage

python -m opifex.benchmarking.cli -b PDEBench_2D_DarcyFlow -o TFNO
python -m opifex.benchmarking.cli --list-benchmarks
python -m opifex.benchmarking.cli --list-operators

parse_args

parse_args(args: Sequence[str] | None = None) -> Namespace

Parse command-line arguments.

Parameters:

Name Type Description Default
args Sequence[str] | None

Command-line arguments (defaults to sys.argv[1:])

None

Returns:

Type Description
Namespace

Parsed arguments namespace

run_cli

run_cli(args: Sequence[str] | None = None) -> int

Main CLI entry point.

Parameters:

Name Type Description Default
args Sequence[str] | None

Command-line arguments (defaults to sys.argv[1:])

None

Returns:

Type Description
int

Exit code (0 for success)

main

main() -> None

Main entry point for module execution.

Profiling

Profiling Harness

Full JAX Profiling Harness for Opifex.

Main interface for the full profiling system that coordinates hardware-aware profiling, roofline analysis, compilation profiling, and generates actionable optimization reports.

OptimizationReport

OptimizationReport()

Structured optimization report with actionable recommendations.

add_section

add_section(title: str, content: Any) -> None

Add a section to the report.

set_executive_summary

set_executive_summary(summary: dict[str, Any]) -> None

Set the executive summary.

add_priority_recommendation

add_priority_recommendation(recommendation: str, impact: str = 'medium', effort: str = 'medium') -> None

Add a priority recommendation.

render

render(output_format: str = 'text') -> str

Render the report in specified format.

OpifexProfilingHarness

OpifexProfilingHarness(enable_hardware_profiling: bool = True, enable_compilation_profiling: bool = True, enable_roofline_analysis: bool = True, trace_dir: str | None = None)

Full JAX profiling harness for Opifex applications.

profiling_session

profiling_session(enable_jax_profiler: bool = True)

Context manager for full profiling session.

profile_neural_operator

profile_neural_operator(operator: Module | Callable, inputs: list[Array], operation_name: str | None = None) -> tuple[dict[str, Any], OptimizationReport]

Profile a complete neural operator with full analysis.

profile_function

profile_function(func: Callable, inputs: list[Array], function_name: str | None = None) -> tuple[dict[str, Any], OptimizationReport]

Profile a JAX function with full analysis.

compare_operations

compare_operations(operations: list[tuple[str, Module | Callable, list[Array]]]) -> dict[str, Any]

Compare multiple operations and identify optimization opportunities.

get_session_summary

get_session_summary() -> dict[str, Any]

Get summary of all profiling sessions.

Event Coordinator

Event Coordinator for JAX Profiling Harness.

Coordinates timing and events across multiple profilers to ensure consistent measurements and prevent interference between profiling components.

ProfilingEvent dataclass

ProfilingEvent(*, timestamp: float, event_type: str, profiler_id: str, data: dict[str, Any] = dict(), duration_ms: float | None = None)

Represents a profiling event with timing information.

ProfilingTimeline

ProfilingTimeline()

Thread-safe timeline for profiling events.

start_timeline

start_timeline() -> None

Start the profiling timeline.

add_event

add_event(event_type: str, profiler_id: str, data: dict[str, Any] | None = None, duration_ms: float | None = None)

Add an event to the timeline.

get_events

get_events(profiler_id: str | None = None) -> list[ProfilingEvent]

Get events, optionally filtered by profiler ID.

get_timeline_duration

get_timeline_duration() -> float

Get total timeline duration in seconds.

EventCoordinator

EventCoordinator()

Coordinates profiling events and timing across multiple profilers.

register_profiler

register_profiler(profiler_id: str) -> None

Register a profiler with the coordinator.

unregister_profiler

unregister_profiler(profiler_id: str) -> None

Unregister a profiler from the coordinator.

profiling_session

profiling_session(enable_jax_profiler: bool = True, trace_dir: str | None = None)

Context manager for coordinated profiling session.

add_event

add_event(event_type: str, profiler_id: str, data: dict[str, Any] | None = None, duration_ms: float | None = None) -> None

Add an event to the coordinated timeline.

time_function

time_function(func: Callable[..., Any], *args: Any, profiler_id: str = 'unknown', operation_name: str = 'operation', **kwargs: Any) -> tuple[Any, float]

Time a function execution and record the event.
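
The core of timing a call and returning `(result, elapsed)` can be sketched as below. The reference does not state the unit of the returned float; milliseconds are assumed here to match the `duration_ms` field on ProfilingEvent. Event recording and the `profiler_id` and `operation_name` arguments are left out, and `time_function_sketch` is a hypothetical name.

```python
import time
from collections.abc import Callable
from typing import Any


def time_function_sketch(
    func: Callable[..., Any], *args: Any, **kwargs: Any
) -> tuple[Any, float]:
    """Run func and return its result with the elapsed wall time in ms."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    # Assumption: elapsed time is reported in milliseconds.
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


total, elapsed_ms = time_function_sketch(sum, [1, 2, 3])
```

JAX dispatches computations asynchronously, so when timing JAX code the output usually needs `.block_until_ready()` before the clock is stopped; whether the real method does this is not stated in this reference.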

get_profiling_summary

get_profiling_summary() -> dict[str, Any]

Get a summary of the profiling session.

export_timeline

export_timeline(output_format: str = 'json') -> str

Export timeline in specified format.

create_shared_coordinator

create_shared_coordinator() -> EventCoordinator

Create a shared event coordinator instance.