Benchmarking API Reference
Overview
Benchmarking tools for scientific machine learning methods. The module delegates core types (BenchmarkResult, Metric, Run) and statistical analysis to calibrax, and adds domain-specific evaluation, validation, and profiling on top.
Benchmark Registry
Benchmark Registry for Opifex Advanced Benchmarking System
Manages available benchmarks and neural operators with domain organization. Provides registration, discovery, and configuration management for the full benchmarking ecosystem.
DomainConfig (dataclass)
DomainConfig(*, name: str, tolerance_ranges: dict[str, tuple[float, float]] = dict(), required_metrics: list[str] = list(), reference_methods: list[str] = list(), default_problem_sizes: list[int] = list())
Configuration for a specific scientific domain.
BenchmarkConfig (dataclass)
BenchmarkConfig(*, name: str, domain: str, problem_type: str, input_shape: tuple[int, ...], output_shape: tuple[int, ...], dataset_path: str | None = None, reference_solution_path: str | None = None, physics_constraints: dict[str, Any] = dict(), computational_requirements: dict[str, Any] = dict())
Configuration for a specific benchmark.
BenchmarkRegistry
BenchmarkRegistry(config_path: str | None = None)
Manages available benchmarks and neural operators with domain organization.
This registry provides centralized management of:
- Neural operator architectures available for benchmarking
- Benchmark problems organized by scientific domain
- Domain-specific configurations and requirements
- Compatibility checking between operators and benchmarks
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config_path | str \| None | Path to registry configuration file | None |
register_operator
register_benchmark
register_benchmark(benchmark_config: BenchmarkConfig) -> None
Register a benchmark configuration.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| benchmark_config | BenchmarkConfig | Benchmark configuration to register | required |
get_benchmark_suite
get_benchmark_suite(domain: str) -> list[BenchmarkConfig]
Get all benchmarks for a specific domain.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| domain | str | Scientific domain name | required |
Returns:
| Type | Description |
|---|---|
| list[BenchmarkConfig] | List of benchmark configurations for the domain |
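The domain grouping behind register_benchmark and get_benchmark_suite can be sketched in a few lines of plain Python. This illustrates the idea under stated assumptions and is not the opifex implementation. MiniBenchmarkConfig and MiniRegistry are hypothetical stand-ins that keep only the name and domain fields. The domain and benchmark names are invented, and the duplicate-name check is a choice made for this sketch. Only the ValueError for an unknown benchmark name mirrors documented behavior (get_benchmark_config).

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MiniBenchmarkConfig:
    """Hypothetical stand-in for BenchmarkConfig, keeping only name and domain."""

    name: str
    domain: str


class MiniRegistry:
    """Hypothetical sketch: benchmarks indexed by name and grouped by domain."""

    def __init__(self) -> None:
        self._by_name: dict[str, MiniBenchmarkConfig] = {}
        self._by_domain: defaultdict[str, list[MiniBenchmarkConfig]] = defaultdict(list)

    def register_benchmark(self, config: MiniBenchmarkConfig) -> None:
        # This sketch rejects duplicate names; the real registry may behave differently.
        if config.name in self._by_name:
            raise ValueError(f"Benchmark already registered: {config.name}")
        self._by_name[config.name] = config
        self._by_domain[config.domain].append(config)

    def get_benchmark_suite(self, domain: str) -> list[MiniBenchmarkConfig]:
        # .get() avoids inserting an empty list for unknown domains.
        return list(self._by_domain.get(domain, []))

    def get_benchmark_config(self, name: str) -> MiniBenchmarkConfig:
        if name not in self._by_name:
            raise ValueError(f"Benchmark not found: {name}")
        return self._by_name[name]


registry = MiniRegistry()
registry.register_benchmark(MiniBenchmarkConfig("darcy_flow", "fluid_dynamics"))
registry.register_benchmark(MiniBenchmarkConfig("burgers_1d", "fluid_dynamics"))
registry.register_benchmark(MiniBenchmarkConfig("h2_energy", "quantum_chemistry"))
suite = registry.get_benchmark_suite("fluid_dynamics")
print([c.name for c in suite])  # ['darcy_flow', 'burgers_1d']
```

Keeping a second index by domain makes suite lookup a single dictionary access. The trade-off is that registration must update both indexes together.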
list_compatible_operators
get_domain_specific_config
get_domain_specific_config(domain: str) -> DomainConfig
Get configuration for a specific domain.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| domain | str | Domain name | required |
Returns:
| Type | Description |
|---|---|
| DomainConfig | Domain configuration |
Raises:
| Type | Description |
|---|---|
| ValueError | If domain not found |
get_operator_class
Get operator class by name.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| operator_name | str | Name of the operator | required |
Returns:
| Type | Description |
|---|---|
| type | Operator class |
Raises:
| Type | Description |
|---|---|
| ValueError | If operator not found |
get_operator_metadata
get_benchmark_config
get_benchmark_config(benchmark_name: str) -> BenchmarkConfig
Get benchmark configuration by name.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| benchmark_name | str | Name of the benchmark | required |
Returns:
| Type | Description |
|---|---|
| BenchmarkConfig | Benchmark configuration |
Raises:
| Type | Description |
|---|---|
| ValueError | If benchmark not found |
list_available_benchmarks
Get the list of available benchmarks.
auto_discover_operators
Auto-discover neural operators from the opifex.neural.operators module.
Benchmark Runner
Benchmark Runner for Opifex Advanced Benchmarking System
Orchestrates complete benchmarking pipeline execution. Provides end-to-end benchmarking workflows, domain-specific suites, publication report generation, and database updates.
DomainResults (dataclass)
DomainResults(*, domain: str, benchmark_results: dict[str, dict[str, BenchmarkResult]], validation_reports: dict[str, dict[str, ValidationReport]] = dict(), comparison_reports: dict[str, ComparisonReport] = dict(), insight_reports: dict[str, dict[str, InsightReport]] = dict(), summary_statistics: dict[str, Any] = dict())
Results for a domain-specific benchmark suite.
PublicationReport (dataclass)
PublicationReport(*, title: str, abstract: str, methodology: str, results_summary: dict[str, Any], comparison_tables: list[Path] = list(), figures: list[Path] = list(), key_findings: list[str] = list(), recommendations: list[str] = list(), appendix_data: dict[str, Any] = dict())
Publication-ready benchmark report.
BenchmarkRunner
BenchmarkRunner(registry: BenchmarkRegistry | None = None, evaluator: BenchmarkEvaluator | None = None, validator: ValidationFramework | None = None, analyzer: AnalysisEngine | None = None, results_manager: ResultsManager | None = None, output_dir: str = './benchmark_results')
Orchestrates complete benchmarking pipeline execution.
This runner provides end-to-end benchmarking capabilities, including:
- Full multi-operator benchmarking across domains
- Domain-specific benchmark suite execution with validation
- Publication-ready report and figure generation
- Automated benchmark database updates and maintenance
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| registry | BenchmarkRegistry \| None | Benchmark registry (creates default if None) | None |
| evaluator | BenchmarkEvaluator \| None | Benchmark evaluator (creates default if None) | None |
| validator | ValidationFramework \| None | Validation framework (creates default if None) | None |
| analyzer | AnalysisEngine \| None | Analysis engine (creates default if None) | None |
| results_manager | ResultsManager \| None | Results manager (creates default if None) | None |
| output_dir | str | Output directory for results | './benchmark_results' |
run_comprehensive_benchmark
run_comprehensive_benchmark(operators: list[str] | None = None, benchmarks: list[str] | None = None, validate_results: bool = True, generate_analysis: bool = True) -> dict[str, dict[str, BenchmarkResult]]
Run a full benchmark across multiple operators and problems.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| operators | list[str] \| None | List of operator names (uses all available if None) | None |
| benchmarks | list[str] \| None | List of benchmark names (uses all available if None) | None |
| validate_results | bool | Whether to run validation framework | True |
| generate_analysis | bool | Whether to run analysis engine | True |
Returns:
| Type | Description |
|---|---|
| dict[str, dict[str, BenchmarkResult]] | Nested dictionary: benchmark_name -> operator_name -> BenchmarkResult |
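After a run, the nested mapping returned by run_comprehensive_benchmark (benchmark_name -> operator_name -> result) is often pivoted or reduced. For illustration only, the sketch below assumes each BenchmarkResult has already been reduced to a single lower-is-better error value. The benchmark names, operator names, and all numbers are hypothetical.

```python
from statistics import mean

# Hypothetical data: each float stands in for one BenchmarkResult's error metric.
results: dict[str, dict[str, float]] = {
    "darcy_flow": {"FNO": 0.012, "DeepONet": 0.020},
    "burgers_1d": {"FNO": 0.031, "DeepONet": 0.025},
}


def best_operator_per_benchmark(nested: dict[str, dict[str, float]]) -> dict[str, str]:
    """Return the lowest-error operator for each benchmark."""
    return {bench: min(ops, key=ops.__getitem__) for bench, ops in nested.items()}


def pivot_by_operator(nested: dict[str, dict[str, float]]) -> dict[str, dict[str, float]]:
    """Invert benchmark -> operator -> value into operator -> benchmark -> value."""
    pivoted: dict[str, dict[str, float]] = {}
    for bench, ops in nested.items():
        for op, value in ops.items():
            pivoted.setdefault(op, {})[bench] = value
    return pivoted


winners = best_operator_per_benchmark(results)
mean_error = {op: mean(vals.values()) for op, vals in pivot_by_operator(results).items()}
print(winners)  # {'darcy_flow': 'FNO', 'burgers_1d': 'DeepONet'}
```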
execute_domain_specific_suite
execute_domain_specific_suite(domain: str) -> DomainResults
Execute benchmark suite for a specific scientific domain.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| domain | str | Scientific domain name | required |
Returns:
| Type | Description |
|---|---|
| DomainResults | Full domain-specific results |
generate_publication_report
generate_publication_report(results: dict[str, dict[str, BenchmarkResult]] | DomainResults, title: str | None = None) -> PublicationReport
Generate publication-ready report from benchmark results.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| results | dict[str, dict[str, BenchmarkResult]] \| DomainResults | Benchmark results (either full or domain-specific) | required |
| title | str \| None | Report title (auto-generated if None) | None |
Returns:
| Type | Description |
|---|---|
| PublicationReport | Publication-ready report with figures and tables |
Evaluation Engine
Core benchmarking evaluation engine for Opifex framework.
This module provides model evaluation capabilities using calibrax for metrics and statistical analysis. BenchmarkEvaluator orchestrates evaluation runs, profiling, and result management.
BenchmarkEvaluator
BenchmarkEvaluator(output_dir: str = './benchmark_results', save_detailed_results: bool = True, enable_gpu_profiling: bool = False)
Main benchmark evaluator for Opifex models.
Provides full evaluation capabilities including model assessment, performance profiling, batch evaluation, and result management.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| output_dir | str | Directory for saving results. | './benchmark_results' |
| save_detailed_results | bool | Whether to save detailed results to files. | True |
| enable_gpu_profiling | bool | Whether to enable GPU profiling. | False |
evaluate_model
evaluate_model(model: Any, model_name: str, input_data: Array | tuple[Array, ...], target_data: Array, dataset_name: str, forward_fn: Callable | None = None, custom_metrics: dict[str, Callable] | None = None) -> BenchmarkResult
Evaluate a model on the given data and compute an extensive set of metrics.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | Any | Model to evaluate. | required |
| model_name | str | Name identifier for the model. | required |
| input_data | Array \| tuple[Array, ...] | Input data for evaluation. | required |
| target_data | Array | Expected target outputs. | required |
| dataset_name | str | Name of the dataset being used. | required |
| forward_fn | Callable \| None | Optional custom forward function. | None |
| custom_metrics | dict[str, Callable] \| None | Optional dictionary of custom metric functions. | None |
Returns:
| Type | Description |
|---|---|
| BenchmarkResult | BenchmarkResult with evaluation metrics and metadata. |
batch_evaluate
batch_evaluate(models: list[tuple[str, Any]], datasets: list[tuple[str, Any, Array, Callable | None]]) -> list[BenchmarkResult]
Evaluate multiple models on multiple datasets.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| models | list[tuple[str, Any]] | List of (model_name, model) tuples. | required |
| datasets | list[tuple[str, Any, Array, Callable \| None]] | List of (dataset_name, input_data, target_data, forward_fn) tuples. | required |
Returns:
| Type | Description |
|---|---|
| list[BenchmarkResult] | List of BenchmarkResults for all model-dataset combinations. |
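batch_evaluate returns one result per model-dataset combination, so its output has len(models) * len(datasets) entries. This minimal sketch shows that pairing using itertools.product. batch_evaluate_sketch, toy_evaluate, and the result dictionaries are hypothetical, and this reference does not document the real method's iteration order.

```python
from __future__ import annotations

from itertools import product
from typing import Any, Callable


def batch_evaluate_sketch(
    models: list[tuple[str, Any]],
    datasets: list[tuple[str, Any, Any, Callable | None]],
    evaluate_pair: Callable[..., dict[str, Any]],
) -> list[dict[str, Any]]:
    """Evaluate every (model, dataset) pair: len(models) * len(datasets) results."""
    return [
        evaluate_pair(model_name, model, ds_name, x, y, forward_fn)
        for (model_name, model), (ds_name, x, y, forward_fn) in product(models, datasets)
    ]


def toy_evaluate(model_name, model, ds_name, x, y, forward_fn):
    """Hypothetical evaluator: mean absolute error of model(v) against targets."""
    # forward_fn is accepted to match the tuple layout but unused in this toy.
    preds = [model(v) for v in x]
    mae = sum(abs(p - t) for p, t in zip(preds, y)) / len(y)
    return {"model": model_name, "dataset": ds_name, "mae": mae}


models = [("double", lambda v: 2 * v), ("square", lambda v: v * v)]
datasets = [("ones", [1.0, 1.0], [2.0, 2.0], None), ("twos", [2.0], [4.0], None)]
out = batch_evaluate_sketch(models, datasets, toy_evaluate)
print([(r["model"], r["dataset"]) for r in out])
# [('double', 'ones'), ('double', 'twos'), ('square', 'ones'), ('square', 'twos')]
```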
profile_model_performance
profile_model_performance(model: Any, input_data: Array | tuple[Array, ...], num_runs: int = 10, forward_fn: Callable | None = None) -> dict[str, float]
Profile model performance with multiple runs.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | Any | Model to profile. | required |
| input_data | Array \| tuple[Array, ...] | Input data for profiling. | required |
| num_runs | int | Number of runs for statistics. | 10 |
| forward_fn | Callable \| None | Custom forward function. | None |
Returns:
| Type | Description |
|---|---|
| dict[str, float] | Dictionary with performance statistics. |
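Repeated-run profiling of the kind profile_model_performance performs can be sketched with time.perf_counter. This is a hypothetical illustration. The returned key names are not the documented dictionary keys, and the warm-up count is an assumption. For JAX models, the untimed warm-up absorbs JIT compilation. Because JAX dispatches work asynchronously, outputs should also be synchronized with block_until_ready() before the clock stops.

```python
from __future__ import annotations

import statistics
import time
from typing import Any, Callable


def profile_sketch(
    fn: Callable[[Any], Any], x: Any, num_runs: int = 10, warmup: int = 1
) -> dict[str, float]:
    """Wall-clock statistics over repeated calls of fn(x); key names are hypothetical."""
    if num_runs < 2:
        raise ValueError("num_runs must be at least 2 to estimate a standard deviation")
    for _ in range(warmup):
        fn(x)  # untimed: with JAX, the first call usually includes JIT compilation
    times = []
    for _ in range(num_runs):
        start = time.perf_counter()
        fn(x)  # with JAX outputs, add .block_until_ready() here so async work is timed
        times.append(time.perf_counter() - start)
    mean_s = statistics.mean(times)
    return {
        "mean_time_s": mean_s,
        "std_time_s": statistics.stdev(times),
        "min_time_s": min(times),
        "max_time_s": max(times),
        "calls_per_s": 1.0 / mean_s if mean_s > 0 else float("inf"),
    }


stats = profile_sketch(lambda n: sum(range(n)), 10_000, num_runs=5)
```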
Validation Framework
Validation Framework for Opifex Advanced Benchmarking System.
Scientific accuracy validation against reference computational methods. Provides convergence analysis, chemical accuracy assessment, and error analysis for rigorous scientific computing validation.
The generic dataclasses ConvergenceAnalysis and AccuracyAssessment have been replaced by their calibrax.validation equivalents, ConvergenceResult and AccuracyResult.
ValidationReport (dataclass)
ValidationReport(*, benchmark_name: str, reference_method: str, accuracy_metrics: dict[str, float], convergence_metrics: dict[str, float], chemical_accuracy_status: bool | None = None, tolerance_violations: list[str] = list(), validation_passed: bool = False, notes: str = '')
Report of validation results against reference methods.
ErrorAnalysis (dataclass)
ErrorAnalysis(*, global_errors: dict[str, float], local_errors: dict[str, Array], error_distribution: dict[str, Any], outlier_analysis: dict[str, Any], spatial_error_patterns: dict[str, Any] | None = None, temporal_error_patterns: dict[str, Any] | None = None)
Error analysis between predictions and ground truth.
Physics-specific: includes spatial and temporal error-pattern detection, which calibrax's generic validation does not provide.
ValidationFramework
ValidationFramework(default_tolerances: list[float] | None = None, reference_methods: dict[str, Callable] | None = None)
Scientific accuracy validation against reference computational methods.
Provides:
- Comparison against established computational methods (FEM, FDM, spectral)
- Convergence rate analysis across multiple tolerance levels
- Chemical accuracy assessment for quantum computing applications
- Statistical error analysis with spatial and temporal pattern detection
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| default_tolerances | list[float] \| None | Default tolerance levels for convergence testing. | None |
| reference_methods | dict[str, Callable] \| None | Dictionary of reference computational methods. | None |
validate_against_reference
validate_against_reference(result: BenchmarkResult, reference_method: str, reference_data: Array | None = None, predictions: Array | None = None) -> ValidationReport
Validate benchmark results against a reference computational method.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| result | BenchmarkResult | Benchmark result to validate. | required |
| reference_method | str | Name of reference method. | required |
| reference_data | Array \| None | Reference solution data (if available). | None |
| predictions | Array \| None | Raw model predictions (if available). Required for meaningful accuracy metrics when reference_data is provided. | None |
Returns:
| Type | Description |
|---|---|
| ValidationReport | Validation report with accuracy metrics and tolerance violations. |
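The note on predictions reflects that accuracy metrics need both the model output and the reference solution. The minimal sketch below computes such metrics and reports tolerance violations, using plain Python lists in place of arrays. The metric names, the tolerance dictionary, and the message format are assumptions. They are not the fields ValidationReport actually populates.

```python
from __future__ import annotations

import math


def accuracy_metrics(predictions: list[float], reference: list[float]) -> dict[str, float]:
    """Relative L2 error and maximum absolute error against a reference solution."""
    if len(predictions) != len(reference) or not reference:
        raise ValueError("predictions and reference must be non-empty and equally long")
    diff = [p - r for p, r in zip(predictions, reference)]
    err_norm = math.sqrt(sum(d * d for d in diff))
    ref_norm = math.sqrt(sum(r * r for r in reference))
    return {
        # Fall back to the absolute norm when the reference is identically zero.
        "relative_l2_error": err_norm / ref_norm if ref_norm > 0 else err_norm,
        "max_abs_error": max(abs(d) for d in diff),
    }


def tolerance_violations(metrics: dict[str, float], tolerances: dict[str, float]) -> list[str]:
    """One message per metric exceeding its tolerance (hypothetical message format)."""
    return [
        f"{name}={metrics[name]:.3g} exceeds tolerance {tol:.3g}"
        for name, tol in tolerances.items()
        if name in metrics and metrics[name] > tol
    ]


metrics = accuracy_metrics([1.0, 2.0, 2.3], [1.0, 2.0, 2.0])
violations = tolerance_violations(metrics, {"relative_l2_error": 0.05, "max_abs_error": 0.5})
print(violations)  # ['relative_l2_error=0.1 exceeds tolerance 0.05']
```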
check_convergence_rates
check_convergence_rates(results_sequence: list[BenchmarkResult], tolerances: list[float] | None = None) -> ConvergenceResult
Analyze convergence rates across multiple tolerance levels.
Delegates to calibrax.validation.check_convergence after extracting metric series from the BenchmarkResult sequence.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| results_sequence | list[BenchmarkResult] | Sequence of results at different tolerance levels. | required |
| tolerances | list[float] \| None | Tolerance levels tested. | None |
Returns:
| Type | Description |
|---|---|
| ConvergenceResult | ConvergenceResult from calibrax with rates and achievement flags. |
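A common definition of a convergence rate is the observed order between consecutive levels. With step size (or tolerance) h_k and error e_k, it is p_k = ln(e_k / e_{k+1}) / ln(h_k / h_{k+1}). The sketch below computes that quantity. This reference does not say whether calibrax.validation.check_convergence uses the same formula, so treat observed_orders as a hypothetical helper.

```python
from __future__ import annotations

import math


def observed_orders(h: list[float], errors: list[float]) -> list[float]:
    """Observed convergence order between consecutive (h, error) levels."""
    if len(h) != len(errors) or len(h) < 2:
        raise ValueError("need at least two matching (h, error) pairs")
    if any(v <= 0 for v in (*h, *errors)):
        raise ValueError("step sizes and errors must be positive")
    return [
        math.log(errors[k] / errors[k + 1]) / math.log(h[k] / h[k + 1])
        for k in range(len(h) - 1)
    ]


# Errors shrinking 4x each time h halves indicate second-order convergence.
rates = observed_orders([0.1, 0.05, 0.025], [4e-3, 1e-3, 2.5e-4])
print([round(r, 6) for r in rates])  # [2.0, 2.0]
```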
assess_chemical_accuracy
assess_chemical_accuracy(result: BenchmarkResult, target_accuracy: float | None = None, accuracy_type: str = 'chemical_accuracy') -> AccuracyResult
Assess chemical accuracy for quantum computing applications.
Delegates to calibrax.validation.check_accuracy after extracting the appropriate metric from the BenchmarkResult.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| result | BenchmarkResult | Benchmark result to assess. | required |
| target_accuracy | float \| None | Target accuracy threshold (defaults to domain standard). | None |
| accuracy_type | str | Type of accuracy being assessed. | 'chemical_accuracy' |
Returns:
| Type | Description |
|---|---|
| AccuracyResult | AccuracyResult from calibrax with pass/fail and margin. |
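Chemical accuracy is conventionally taken as 1 kcal/mol, roughly 1.6 mHa. The sketch below converts an absolute energy error from hartree and reports pass/fail with a margin. The margin's sign follows the convention described for ChemicalAccuracyAssessment.margin: positive means headroom, negative means deficit. The function name, the inclusive comparison, and the use of hartree input are assumptions. This reference does not specify the default threshold or units that assess_chemical_accuracy actually uses.

```python
from __future__ import annotations

# CODATA 2018 Hartree energy expressed in kcal/mol.
HARTREE_TO_KCAL_PER_MOL = 627.5094740631
# Conventional chemical-accuracy target for energies.
CHEMICAL_ACCURACY_KCAL_PER_MOL = 1.0


def assess_energy_error(
    abs_error_hartree: float, threshold_kcal_per_mol: float = CHEMICAL_ACCURACY_KCAL_PER_MOL
) -> dict[str, float | bool]:
    """Hypothetical helper: pass/fail and margin for an energy error given in hartree."""
    error_kcal = abs(abs_error_hartree) * HARTREE_TO_KCAL_PER_MOL
    margin = threshold_kcal_per_mol - error_kcal  # >= 0: headroom, < 0: deficit
    return {"error_kcal_per_mol": error_kcal, "margin_kcal_per_mol": margin, "passed": margin >= 0.0}


print(assess_energy_error(1.0e-3)["passed"])  # True  (about 0.63 kcal/mol)
print(assess_energy_error(2.0e-3)["passed"])  # False (about 1.26 kcal/mol)
```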
generate_error_analysis
generate_error_analysis(predictions: Array, ground_truth: Array, spatial_coords: Array | None = None, temporal_coords: Array | None = None) -> ErrorAnalysis
Generate an error analysis of predictions against ground truth.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| predictions | Array | Model predictions. | required |
| ground_truth | Array | Ground truth data. | required |
| spatial_coords | Array \| None | Spatial coordinates (if available). | None |
| temporal_coords | Array \| None | Temporal coordinates (if available). | None |
Returns:
| Type | Description |
|---|---|
| ErrorAnalysis | ErrorAnalysis with global, local, distribution, and pattern data. |
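The sketch below is a stripped-down version of the global, local, and outlier parts of an error analysis. For outliers it swaps in a commonly used rule: Tukey fences at 1.5 times the interquartile range of the pointwise absolute error. generate_error_analysis may use a different rule. Spatial and temporal pattern detection are omitted. error_summary and its output keys are hypothetical.

```python
from __future__ import annotations

import math
import statistics


def error_summary(
    predictions: list[float], truth: list[float], iqr_scale: float = 1.5
) -> dict[str, object]:
    """Global error metrics plus Tukey-fence outliers of the pointwise absolute error."""
    if len(predictions) != len(truth) or len(truth) < 2:
        raise ValueError("need at least two matching prediction/truth values")
    abs_err = [abs(p - t) for p, t in zip(predictions, truth)]
    mse = sum(e * e for e in abs_err) / len(abs_err)
    q1, _, q3 = statistics.quantiles(abs_err, n=4)
    fence = iqr_scale * (q3 - q1)
    outliers = [i for i, e in enumerate(abs_err) if e < q1 - fence or e > q3 + fence]
    return {
        "global": {"mse": mse, "rmse": math.sqrt(mse), "mae": statistics.mean(abs_err), "max": max(abs_err)},
        "local_abs_error": abs_err,
        "outlier_indices": outliers,
    }


truth = [0.0] * 9
predictions = [0.1, -0.2, 0.1, 0.2, -0.1, 0.2, 0.1, -0.2, 3.0]
summary = error_summary(predictions, truth)
print(summary["outlier_indices"])  # [8]
```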
Analysis Engine
Analysis Engine for Opifex Advanced Benchmarking System.
Comparative analysis and performance insights generation for scientific computing benchmarks. Operator comparison and statistical testing delegate to calibrax.analysis and calibrax.statistics. Domain-specific recommendation logic and scaling analysis are retained here.
ComparisonReport (dataclass)
ComparisonReport(*, benchmark_name: str, operators_compared: list[str], metric_comparisons: dict[str, dict[str, float]], performance_rankings: dict[str, list[str]], statistical_significance: dict[str, dict[str, bool]], winner_by_metric: dict[str, str], overall_winner: str, improvement_factors: dict[str, dict[str, float]] = dict())
Report comparing multiple operators on the same benchmark.
ScalingAnalysis (dataclass)
ScalingAnalysis(*, operator_name: str, problem_sizes: list[int], scaling_metrics: dict[str, dict[int, float]], scaling_coefficients: dict[str, float], complexity_estimates: dict[str, str], efficiency_scores: dict[int, float], optimal_problem_size: int | None = None)
Analysis of scaling behavior across problem sizes.
InsightReport (dataclass)
InsightReport(*, benchmark_name: str, operator_name: str, key_insights: list[str], performance_bottlenecks: list[str], optimization_suggestions: list[str], domain_specific_observations: list[str], confidence_level: float = 0.0)
Performance insights for a specific benchmark run.
RecommendationReport (dataclass)
RecommendationReport(*, problem_type: str, domain: str, recommended_operators: list[dict[str, Any]], use_case_specific_recommendations: dict[str, str], performance_trade_offs: dict[str, str], implementation_considerations: list[str])
Recommendations for optimal operator selection.
AnalysisEngine
AnalysisEngine(significance_threshold: float = 0.05)
Comparative analysis and performance insights for scientific benchmarks.
Provides:
- Multi-operator performance comparisons with statistical significance
- Scaling behavior analysis across problem sizes
- Performance insights and bottleneck identification
- Intelligent operator recommendations for specific use cases
Statistical significance testing delegates to calibrax.statistics (welch_t_test, mann_whitney_u) for multi-run comparisons.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| significance_threshold | float | Threshold for statistical significance. | 0.05 |
compare_operators
compare_operators(results_dict: dict[str, BenchmarkResult]) -> ComparisonReport
Compare multiple operators on the same benchmark.
Delegates ranking and overall-winner determination to
calibrax.analysis.compare_configurations(). Domain-specific
features (improvement_factors, statistical_significance, weighted
scoring) are retained here because calibrax lacks equivalents.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| results_dict | dict[str, BenchmarkResult] | Dictionary mapping operator names to benchmark results. | required |
Returns:
| Type | Description |
|---|---|
| ComparisonReport | Comparison report with rankings and improvement factors. |
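The per-metric rankings, winners, and improvement factors in a ComparisonReport can be illustrated with plain dictionaries. This hypothetical sketch makes three simplifying choices:
- Every metric is lower-is-better.
- The overall winner is the operator with the best unweighted mean rank.
- An improvement factor is the baseline value divided by the operator's value.

The real method delegates ranking to calibrax.analysis.compare_configurations() and applies weighted scoring, so its results may differ.

```python
from __future__ import annotations

from statistics import mean


def compare_sketch(metrics: dict[str, dict[str, float]], baseline: str) -> dict[str, object]:
    """Rank operators per metric (lower is better) and compute factors against a baseline."""
    metric_names = sorted({name for values in metrics.values() for name in values})
    rankings = {
        m: sorted((op for op in metrics if m in metrics[op]), key=lambda op: metrics[op][m])
        for m in metric_names
    }
    winner_by_metric = {m: ranked[0] for m, ranked in rankings.items()}
    # Unweighted mean rank position across metrics (0 = best).
    mean_rank = {
        op: mean(ranked.index(op) for ranked in rankings.values() if op in ranked)
        for op in metrics
    }
    # sorted() first so ties go to the alphabetically first operator.
    overall_winner = min(sorted(mean_rank), key=mean_rank.__getitem__)
    # Factor > 1 means the operator's value is lower (better) than the baseline's.
    improvement_factors = {
        op: {m: metrics[baseline][m] / v for m, v in values.items() if m in metrics[baseline] and v > 0}
        for op, values in metrics.items()
        if op != baseline
    }
    return {
        "rankings": rankings,
        "winner_by_metric": winner_by_metric,
        "overall_winner": overall_winner,
        "improvement_factors": improvement_factors,
    }


report = compare_sketch(
    {
        "FNO": {"mse": 0.01, "time_s": 2.0},
        "DeepONet": {"mse": 0.02, "time_s": 1.0},
        "UNet": {"mse": 0.04, "time_s": 1.5},
    },
    baseline="UNet",
)
print(report["winner_by_metric"])  # {'mse': 'FNO', 'time_s': 'DeepONet'}
print(report["overall_winner"])  # DeepONet
```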
test_statistical_significance_multi_run
test_statistical_significance_multi_run(multi_run_results: dict[str, list[BenchmarkResult]]) -> dict[str, dict[str, dict[str, Any]]]
Test statistical significance with multiple runs per operator.
Delegates to calibrax.statistics.welch_t_test and mann_whitney_u for proper parametric and non-parametric testing.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| multi_run_results | dict[str, list[BenchmarkResult]] | Operator names mapped to lists of results. | required |
Returns:
| Type | Description |
|---|---|
| dict[str, dict[str, dict[str, Any]]] | Pairwise significance results with p-values and statistics. |
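Welch's t-test compares two operators' per-run metrics without assuming equal variances. The sketch computes the t statistic and the Welch-Satterthwaite degrees of freedom using only the standard library. It is not the calibrax.statistics.welch_t_test implementation. Converting (t, df) to a two-sided p-value requires a Student-t distribution, for example 2 * scipy.stats.t.sf(abs(t), df); that step is left out to keep the block dependency-free. The run values are hypothetical.

```python
from __future__ import annotations

import math
import statistics


def welch_t(a: list[float], b: list[float]) -> tuple[float, float]:
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    if len(a) < 2 or len(b) < 2:
        raise ValueError("each sample needs at least two runs")
    va = statistics.variance(a) / len(a)  # squared standard error of mean(a)
    vb = statistics.variance(b) / len(b)
    if va + vb == 0:
        raise ValueError("both samples have zero variance")
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va**2 / (len(a) - 1) + vb**2 / (len(b) - 1))
    return t, df


# Hypothetical per-run errors for two operators.
t, df = welch_t([1.0, 2.0, 3.0], [2.0, 3.0, 4.0, 5.0])
print(round(t, 4), round(df, 3))  # -1.7321 4.959
```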
create_operator_recommendations
create_operator_recommendations(problem_type: str, domain: str = 'general') -> RecommendationReport
Create operator recommendations for specific problem types.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| problem_type | str | Type of problem (e.g., "pde_solving", "time_series"). | required |
| domain | str | Scientific domain. | 'general' |
Returns:
| Type | Description |
|---|---|
| RecommendationReport | Operator recommendation report. |
analyze_scaling_behavior
analyze_scaling_behavior(performance_data: dict[int, BenchmarkResult]) -> ScalingAnalysis
Analyze scaling behavior across different problem sizes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| performance_data | dict[int, BenchmarkResult] | Dictionary mapping problem sizes to benchmark results. | required |
Returns:
| Type | Description |
|---|---|
| ScalingAnalysis | Scaling behavior analysis. |
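A typical way to summarize scaling behavior is to fit time ~ C * N^p by least squares in log-log space, where the slope p is the empirical complexity exponent. The sketch below illustrates that technique. analyze_scaling_behavior does not necessarily derive scaling_coefficients or complexity_estimates this way. The label format is hypothetical and the timings are synthetic.

```python
from __future__ import annotations

import math


def scaling_exponent(sizes: list[int], times: list[float]) -> float:
    """Least-squares slope p of log(time) versus log(size), i.e. time ~ C * size**p."""
    if len(sizes) != len(times) or len(set(sizes)) < 2:
        raise ValueError("need matching lists with at least two distinct sizes")
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sxx


def complexity_label(p: float) -> str:
    """Hypothetical human-readable label for a fitted exponent."""
    return f"O(N^{p:.1f})"


sizes = [64, 128, 256, 512]
times = [1e-6 * n**2 for n in sizes]  # synthetic quadratic timings
p = scaling_exponent(sizes, times)
print(complexity_label(p))  # O(N^2.0)
```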
generate_performance_insights
generate_performance_insights(result: BenchmarkResult) -> InsightReport
Generate performance insights for a benchmark run.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| result | BenchmarkResult | Benchmark result to analyze. | required |
Returns:
| Type | Description |
|---|---|
| InsightReport | Performance insights report. |
Results Manager
Results Manager for Opifex Advanced Benchmarking System.
Data persistence and publication-ready export capabilities. Provides results storage, publication plot generation, comparison tables, and benchmark database management. Each saved result is also persisted to a calibrax Store for cross-tool interoperability.
ResultsManager
Data persistence and publication-ready export capabilities.
Provides:
- Persistent storage of benchmark results with metadata
- calibrax Store write-through for cross-tool interoperability
- Publication-ready plot and table generation
- Benchmark database maintenance and querying
- Export formats for different publication venues
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| storage_path | str | Base path for storing benchmark results. | './benchmark_results' |
| database_path | str \| None | Path to benchmark database file. | None |
save_benchmark_results
load_result
load_result(result_id: str) -> BenchmarkResult | None
Load benchmark result by ID.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| result_id | str | Unique identifier for results. | required |
Returns:
| Type | Description |
|---|---|
| BenchmarkResult \| None | Loaded BenchmarkResult or None if not found. |
load_results
load_results(result_id: str) -> BenchmarkResult | None
Load benchmark results by ID.
Alias for load_result, kept for backward compatibility.
query_results
query_results(name: str | None = None, dataset: str | None = None, metric_filter: dict[str, tuple[float, float]] | None = None) -> list[dict[str, Any]]
Query benchmark database with filters.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str \| None | Filter by benchmark name. | None |
| dataset | str \| None | Filter by dataset tag. | None |
| metric_filter | dict[str, tuple[float, float]] \| None | Filter by metric ranges {metric: (min, max)}. | None |
Returns:
| Type | Description |
|---|---|
| list[dict[str, Any]] | List of matching database entries. |
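The metric_filter argument takes {metric: (min, max)} ranges. The sketch below shows these filtering semantics over in-memory entries. Several details are assumptions rather than documented behavior of query_results: the entry layout (name, tags['dataset'], metrics), the inclusive bounds, and the exclusion of entries that lack a filtered metric.

```python
from __future__ import annotations

from typing import Any


def query_entries(
    entries: list[dict[str, Any]],
    name: str | None = None,
    dataset: str | None = None,
    metric_filter: dict[str, tuple[float, float]] | None = None,
) -> list[dict[str, Any]]:
    """Keep entries matching every given filter; metric ranges are inclusive."""

    def matches(entry: dict[str, Any]) -> bool:
        if name is not None and entry.get("name") != name:
            return False
        if dataset is not None and entry.get("tags", {}).get("dataset") != dataset:
            return False
        for metric, (low, high) in (metric_filter or {}).items():
            value = entry.get("metrics", {}).get(metric)
            # Entries lacking a filtered metric are excluded.
            if value is None or not low <= value <= high:
                return False
        return True

    return [entry for entry in entries if matches(entry)]


entries = [
    {"name": "darcy_flow", "tags": {"dataset": "darcy_r64"}, "metrics": {"mse": 0.010}},
    {"name": "darcy_flow", "tags": {"dataset": "darcy_r128"}, "metrics": {"mse": 0.030}},
    {"name": "burgers_1d", "tags": {"dataset": "burgers"}, "metrics": {"mse": 0.005}},
]
hits = query_entries(entries, name="darcy_flow", metric_filter={"mse": (0.0, 0.02)})
print([e["tags"]["dataset"] for e in hits])  # ['darcy_r64']
```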
get_database_statistics
create_benchmark_database_entry
export_database
export_publication_plots
export_publication_plots(results: list[BenchmarkResult], plot_type: Literal['comparison', 'scaling', 'convergence'] = 'comparison', output_format: str = 'png') -> list[Path]
Export publication-ready plots.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| results | list[BenchmarkResult] | List of benchmark results to plot. | required |
| plot_type | Literal['comparison', 'scaling', 'convergence'] | Type of plot to generate. | 'comparison' |
| output_format | str | Output format (png, pdf, svg). | 'png' |
Returns:
| Type | Description |
|---|---|
| list[Path] | List of paths to generated plot files. |
generate_comparison_tables
generate_comparison_tables(operators: list[str], metrics: list[str], output_format: Literal['latex', 'html', 'csv'] = 'latex') -> Path
Generate publication-ready comparison tables.
Queries the local benchmark database and generates a formatted comparison table in the requested output format.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| operators | list[str] | List of operator names to include. | required |
| metrics | list[str] | List of metrics to include in table. | required |
| output_format | Literal['latex', 'html', 'csv'] | Output format. | 'latex' |
Returns:
| Type | Description |
|---|---|
| Path | Path to generated table file. |
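For the 'latex' format, a comparison table is essentially a tabular environment with one row per operator and one column per metric. The sketch below reads values from a plain dictionary instead of querying the benchmark database. Its layout is an assumption, not the exact output of generate_comparison_tables: booktabs rules (which need \usepackage{booktabs}) and '--' for missing values.

```python
from __future__ import annotations


def _tex_escape(text: str) -> str:
    """Escape LaTeX special characters common in operator and metric names."""
    for ch in "&%$#_":
        text = text.replace(ch, "\\" + ch)
    return text


def latex_table(
    values: dict[str, dict[str, float]],
    operators: list[str],
    metrics: list[str],
    fmt: str = ".3g",
) -> str:
    """Build a booktabs-style tabular; missing values are rendered as '--'."""
    lines = [
        r"\begin{tabular}{l" + "r" * len(metrics) + "}",
        r"\toprule",
        " & ".join(["Operator", *map(_tex_escape, metrics)]) + r" \\",
        r"\midrule",
    ]
    for op in operators:
        row = values.get(op, {})
        cells = [format(row[m], fmt) if m in row else "--" for m in metrics]
        lines.append(" & ".join([_tex_escape(op), *cells]) + r" \\")
    lines += [r"\bottomrule", r"\end{tabular}"]
    return "\n".join(lines)


table = latex_table({"FNO": {"relative_error": 0.0123}}, ["FNO", "UNet"], ["relative_error"])
print(table)
```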
Baseline Repository
Baseline Repository Module.
Stores and retrieves baseline performance metrics for PDEBench datasets.
Delegates persistence to calibrax.storage.Store while retaining
domain-specific comparison and reporting logic.
BaselineRepository
Repository for storing and retrieving baseline performance metrics.
Manages a database of baseline performance metrics for standard PDEBench
datasets, enabling comparison of new models against established benchmarks.
New baselines are persisted via a calibrax.storage.Store.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| baseline_data_path | str \| None | Path to baseline data file (JSON format). | None |
| store_path | Path \| str \| None | Directory for calibrax Store persistence. | None |
get_baseline_metrics
Get baseline metrics for a specific dataset and model type.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dataset_name | str | Name of the dataset | required |
| model_type | str | Type of model (e.g., "fno", "deeponet") | required |
Returns:
| Type | Description |
|---|---|
| dict[str, float] | Dictionary of baseline metrics |
Raises:
| Type | Description |
|---|---|
| ValueError | If dataset or model type not found |
get_available_datasets
Get the list of datasets with baseline data.
get_available_model_types
add_baseline
add_baseline(dataset_name: str, model_type: str, metrics: dict[str, float], source: str = 'User Added', model_config: dict[str, Any] | None = None, notes: str | None = None) -> None
Add a new baseline to the repository.
Persists both to the JSON file and to the calibrax Store.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dataset_name | str | Name of the dataset. | required |
| model_type | str | Type of model. | required |
| metrics | dict[str, float] | Performance metrics. | required |
| source | str | Source of the baseline data. | 'User Added' |
| model_config | dict[str, Any] \| None | Model configuration details. | None |
| notes | str \| None | Additional notes. | None |
compare_to_baseline
compare_to_baseline(dataset_name: str, model_type: str, test_metrics: dict[str, float], metrics_to_compare: list[str] | None = None) -> dict[str, dict[str, float]]
Compare test metrics to baseline metrics.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dataset_name | str | Name of the dataset | required |
| model_type | str | Type of model | required |
| test_metrics | dict[str, float] | Metrics to compare against baseline | required |
| metrics_to_compare | list[str] \| None | Specific metrics to compare (None for all) | None |
Returns:
| Type | Description |
|---|---|
| dict[str, dict[str, float]] | Dictionary with comparison results including relative improvements |
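For lower-is-better metrics, a natural relative improvement is (baseline - test) / |baseline|, where a positive value means the new model beats the baseline. The sketch below applies that definition. The key names and the sign convention are assumptions, not the documented structure returned by compare_to_baseline.

```python
from __future__ import annotations


def compare_to_baseline_sketch(
    baseline: dict[str, float],
    test: dict[str, float],
    metrics_to_compare: list[str] | None = None,
) -> dict[str, dict[str, float]]:
    """Per-metric comparison for lower-is-better metrics such as MSE (hypothetical keys)."""
    if metrics_to_compare is not None:
        names = metrics_to_compare
    else:
        names = sorted(baseline.keys() & test.keys())
    report: dict[str, dict[str, float]] = {}
    for name in names:
        base, new = baseline[name], test[name]
        report[name] = {
            "baseline": base,
            "test": new,
            "absolute_difference": new - base,
            # Positive means the test model reduced the error relative to the baseline.
            "relative_improvement": (base - new) / abs(base) if base != 0 else float("nan"),
        }
    return report


report = compare_to_baseline_sketch({"mse": 0.02, "rel_l2": 0.10}, {"mse": 0.015, "rel_l2": 0.12})
print(round(report["mse"]["relative_improvement"], 3))  # 0.25
```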
get_best_baseline
Get the best baseline for a dataset based on a specific metric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dataset_name | str | Name of the dataset | required |
| metric | str | Metric to use for comparison | 'mse' |
Returns:
| Type | Description |
|---|---|
| tuple[str, dict[str, float]] | Tuple of (model_type, metrics) for the best baseline |
Operator Executor
Operator Executor - Runs actual Opifex operators for benchmarking.
This module replaces the mock execution in BenchmarkRunner with real operator training and evaluation.
ExecutionConfig (dataclass)
ExecutionConfig(*, n_epochs: int = 100, batch_size: int = 32, learning_rate: float = 0.001, warmup_steps: int = 5, eval_frequency: int = 10, use_mixed_precision: bool = False, seed: int = 42)
Configuration for benchmark execution.
OperatorExecutor
OperatorExecutor(config: ExecutionConfig | None = None)
Executes actual Opifex operators for benchmarking.
This class provides the core execution logic that was missing from the original BenchmarkRunner implementation. It uses:
- Real Opifex operators (TFNO, DeepONet, etc.)
- Real Opifex data loaders (create_darcy_loader, etc.)
- The Flax NNX 0.11.0+ optimizer pattern
- calibrax.metrics for evaluation, instead of reimplementing metrics
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config | ExecutionConfig \| None | Execution configuration. Uses defaults if None. | None |
execute_training_benchmark
execute_training_benchmark(operator_class: type, operator_config: dict[str, Any], train_loader: Any, test_loader: Any, benchmark_name: str) -> BenchmarkResult
Execute a training benchmark with actual operator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| operator_class | type | Opifex operator class to instantiate | required |
| operator_config | dict[str, Any] | Configuration dict for operator | required |
| train_loader | Any | Training data loader (from opifex.data.loaders) | required |
| test_loader | Any | Test data loader | required |
| benchmark_name | str | Name of benchmark for results | required |
Returns:
| Type | Description |
|---|---|
| BenchmarkResult | BenchmarkResult with real metrics from training |
Adapters
Adapter for converting BenchmarkResult lists to calibrax Run objects.
Bridges the opifex benchmarking pipeline (which produces BenchmarkResult lists) with calibrax's Run-based analysis and storage APIs.
results_to_run
results_to_run(results: list[BenchmarkResult], *, commit: str | None = None, branch: str | None = None, metric_defs: dict[str, MetricDef] | None = None) -> Run
Convert a list of BenchmarkResult objects to a calibrax Run.
Maps each BenchmarkResult to a Point:
- BenchmarkResult.name -> Point.name
- BenchmarkResult.tags["dataset"] -> Point.scenario (default: "unknown")
- BenchmarkResult.tags -> Point.tags
- BenchmarkResult.metrics -> Point.metrics (same Metric type)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| results | list[BenchmarkResult] | List of benchmark results to convert. | required |
| commit | str \| None | Git commit hash to attach to the Run. | None |
| branch | str \| None | Git branch name to attach to the Run. | None |
| metric_defs | dict[str, MetricDef] \| None | Metric definitions for semantic interpretation. | None |
Returns:
| Type | Description |
|---|---|
| Run | A calibrax Run containing one Point per BenchmarkResult. |
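The field mapping listed for results_to_run can be expressed as a small pure function. MiniResult and MiniPoint below are hypothetical stand-ins, not the opifex BenchmarkResult or calibrax Point classes. Metrics are plain floats here instead of Metric objects. The sketch only demonstrates the documented mapping, including the 'unknown' scenario default.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MiniResult:  # hypothetical stand-in for BenchmarkResult
    name: str
    tags: dict[str, str] = field(default_factory=dict)
    metrics: dict[str, float] = field(default_factory=dict)


@dataclass
class MiniPoint:  # hypothetical stand-in for calibrax's Point
    name: str
    scenario: str
    tags: dict[str, str]
    metrics: dict[str, float]


def to_points(results: list[MiniResult]) -> list[MiniPoint]:
    """Apply the documented field mapping, one point per result."""
    return [
        MiniPoint(
            name=r.name,
            scenario=r.tags.get("dataset", "unknown"),
            tags=dict(r.tags),
            metrics=dict(r.metrics),
        )
        for r in results
    ]


points = to_points([MiniResult("fno", {"dataset": "darcy"}, {"mse": 0.1}), MiniResult("deeponet")])
print([p.scenario for p in points])  # ['darcy', 'unknown']
```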
default_metric_defs
Validators: Chemical Accuracy
Chemical accuracy validation for scientific ML benchmarks.
Assesses whether a benchmark result meets domain-specific accuracy thresholds
by delegating to calibrax.validation.check_accuracy().
ChemicalAccuracyAssessment (dataclass)
ChemicalAccuracyAssessment(*, passed: bool, domain: str, threshold: float, achieved: float, margin: float, accuracy_result: AccuracyResult, recommendations: tuple[str, ...] = tuple())
Result of a chemical accuracy assessment.
Wraps a calibrax.validation.AccuracyResult with domain context
and actionable recommendations.
Attributes:
| Name | Type | Description |
|---|---|---|
| passed | bool | Whether the result meets the chemical accuracy threshold. |
| domain | str | Scientific domain used for assessment. |
| threshold | float | Accuracy threshold applied. |
| achieved | float | Achieved error value. |
| margin | float | Headroom (positive) or deficit (negative) relative to threshold. |
| accuracy_result | AccuracyResult | Underlying calibrax AccuracyResult. |
| recommendations | tuple[str, ...] | Suggested actions if assessment fails. |
ChemicalAccuracyValidator
ChemicalAccuracyValidator(thresholds: dict[str, float] | None = None, error_metric: str = 'relative_error')
Validates benchmark results against domain-specific chemical accuracy thresholds.
Delegates accuracy computation to calibrax.validation.check_accuracy().
Note: Registry registration intentionally omitted -- validators are instantiated directly, not discovered dynamically.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
thresholds
|
dict[str, float] | None
|
Custom domain-to-threshold mapping. Merged with defaults. |
None
|
error_metric
|
str
|
Metric name to extract from BenchmarkResult. |
'relative_error'
|
assess
assess(result: BenchmarkResult, domain: str | None = None) -> ChemicalAccuracyAssessment
Assess whether a benchmark result meets chemical accuracy for a domain.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| result | BenchmarkResult | Benchmark result containing error metrics. | required |
| domain | str \| None | Scientific domain. Auto-detected from result tags/domain if None. | None |
Returns:
| Type | Description |
|---|---|
| ChemicalAccuracyAssessment | Assessment with pass/fail, margin, and recommendations. |
Raises:
| Type | Description |
|---|---|
| ValueError | If the domain is unknown and cannot be auto-detected. |
| KeyError | If the error metric is not present in the result. |
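A minimal sketch of the pass/fail and margin logic that assess describes. It assumes that a result passes when the achieved error is at or below the threshold and that custom thresholds override the defaults key by key. The thresholds are copied from CHEMICAL_ACCURACY_THRESHOLDS; assess_sketch and SimpleAssessment are hypothetical names, the domain is passed explicitly instead of being auto-detected, and the real method delegates the comparison to calibrax.validation.check_accuracy() rather than computing it inline.

```python
from __future__ import annotations

from dataclasses import dataclass

# Copied from CHEMICAL_ACCURACY_THRESHOLDS in the shared utilities module.
DEFAULT_THRESHOLDS = {
    "quantum_computing": 0.001,
    "materials_science": 0.05,
    "molecular_dynamics": 0.01,
}


@dataclass(frozen=True)
class SimpleAssessment:
    """Hypothetical subset of ChemicalAccuracyAssessment's fields."""

    passed: bool
    domain: str
    threshold: float
    achieved: float
    margin: float


def assess_sketch(
    metrics: dict[str, float],
    domain: str,
    custom_thresholds: dict[str, float] | None = None,
    error_metric: str = "relative_error",
) -> SimpleAssessment:
    # Custom entries override defaults for the same domain (assumed merge order).
    thresholds = {**DEFAULT_THRESHOLDS, **(custom_thresholds or {})}
    if domain not in thresholds:
        raise ValueError(f"Unknown domain: {domain!r}")
    if error_metric not in metrics:
        raise KeyError(error_metric)
    threshold = thresholds[domain]
    achieved = float(metrics[error_metric])
    # Positive margin is headroom below the threshold; negative is a deficit.
    margin = threshold - achieved
    return SimpleAssessment(achieved <= threshold, domain, threshold, achieved, margin)
```

For example, a relative error of 0.004 passes for molecular_dynamics (threshold 0.01, margin about 0.006) but fails for quantum_computing (threshold 0.001).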
Validators: Conservation Laws
Conservation law validation for scientific ML benchmarks.
Orchestrates conservation law checks from opifex.core.physics.conservation
and optionally delegates convergence analysis to calibrax.
ConservationReport
dataclass
ConservationReport(*, violations: dict[str, float], all_conserved: bool, worst_violation: float, convergence: ConvergenceResult | None = None)
Report from conservation law validation.
Uses a local dataclass instead of calibrax.validation.ValidationReport because conservation checking requires violation magnitudes (dict[str, float]) rather than textual violation descriptions (tuple[str, ...]), plus domain-specific fields (worst_violation, all_conserved) that ValidationReport does not provide.
The to_validation_report method bridges the two when calibrax interop is needed.
Attributes:
| Name | Type | Description |
|---|---|---|
| violations | dict[str, float] | Mapping from conservation law name to violation magnitude. |
| all_conserved | bool | True if all violations are zero (within tolerance). |
| worst_violation | float | Maximum violation across all checked laws. |
| convergence | ConvergenceResult \| None | Optional convergence result from multi-resolution analysis. |
to_validation_report
Convert to a calibrax ValidationReport for cross-tool interop.
Returns:
| Type | Description |
|---|---|
| ValidationReport | A ValidationReport … and textual summaries in the violations tuple. |
ConservationValidator
ConservationValidator(laws: Sequence[str] | None = None, energy_tolerance: float = 1e-06, momentum_tolerance: float = 1e-05, mass_target: float = 1.0, mass_tolerance: float = 0.0001)
Validates physics conservation laws on model predictions.
Orchestrates existing pure-JAX functions from opifex.core.physics.conservation and provides a unified interface.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| laws | Sequence[str] \| None | Conservation laws to check. Defaults to energy and momentum. | None |
| energy_tolerance | float | Tolerance for the energy conservation check. | 1e-06 |
| momentum_tolerance | float | Tolerance for the momentum conservation check. | 1e-05 |
| mass_target | float | Target mass for the mass conservation check. | 1.0 |
| mass_tolerance | float | Tolerance for the mass conservation check. | 0.0001 |
validate
validate(y_pred: Array, y_true: Array) -> ConservationReport
Validate conservation laws on a single prediction set.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| y_pred | Array | Model predictions. | required |
| y_true | Array | Ground truth values. | required |
Returns:
| Type | Description |
|---|---|
| ConservationReport | ConservationReport with violations and overall status. |
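ConservationReport's all_conserved and worst_violation fields can be derived from the violations mapping. The sketch below assumes each law is checked against its own tolerance (using the ConservationValidator defaults) and produces text summaries of the kind to_validation_report places in a tuple[str, ...]. ReportSketch, build_report and textual_violations are hypothetical names; the real validator computes the violation magnitudes with functions from opifex.core.physics.conservation, which are not reproduced here.

```python
from __future__ import annotations

from dataclasses import dataclass

# Per-law tolerances taken from the ConservationValidator defaults.
DEFAULT_TOLERANCES = {"energy": 1e-6, "momentum": 1e-5, "mass": 1e-4}


@dataclass(frozen=True)
class ReportSketch:
    violations: dict[str, float]
    all_conserved: bool
    worst_violation: float


def build_report(
    violations: dict[str, float], tolerances: dict[str, float] | None = None
) -> ReportSketch:
    tolerances = DEFAULT_TOLERANCES if tolerances is None else tolerances
    # A law counts as conserved when its violation is within its tolerance (assumed rule).
    all_conserved = all(v <= tolerances[name] for name, v in violations.items())
    worst = max(violations.values(), default=0.0)
    return ReportSketch(dict(violations), all_conserved, worst)


def textual_violations(
    report: ReportSketch, tolerances: dict[str, float] | None = None
) -> tuple[str, ...]:
    tolerances = DEFAULT_TOLERANCES if tolerances is None else tolerances
    # One human-readable line per law that exceeds its tolerance.
    return tuple(
        f"{name}: violation {v:.3e} exceeds tolerance {tolerances[name]:.1e}"
        for name, v in sorted(report.violations.items())
        if v > tolerances[name]
    )
```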
validate_convergence
validate_convergence(predictions: Sequence[Array], truths: Sequence[Array], tolerances: Sequence[float]) -> ConvergenceResult
Validate conservation convergence across multiple resolutions.
Computes violations at each resolution and delegates convergence analysis to calibrax.validation.check_convergence().
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| predictions | Sequence[Array] | Predictions at increasing resolutions. | required |
| truths | Sequence[Array] | Ground truths at increasing resolutions. | required |
| tolerances | Sequence[float] | Tolerance thresholds for the convergence check. | required |
Returns:
| Type | Description |
|---|---|
| ConvergenceResult | ConvergenceResult with rates and achievement flags. |
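validate_convergence hands the per-resolution violations to calibrax.validation.check_convergence(), whose exact rules are not documented here. To illustrate what a convergence rate means, the sketch computes the standard observed order of convergence between successive resolutions, p = log(e_coarse / e_fine) / log(r), assuming a constant refinement ratio r (2 by default) and strictly positive errors. The achieved_flags rule (the finest-resolution error must be within each tolerance) is an assumption made for the example, not calibrax's definition.

```python
import math
from collections.abc import Sequence


def observed_rates(errors: Sequence[float], refinement_ratio: float = 2.0) -> list[float]:
    """Observed order between each pair of successive resolutions (errors must be > 0)."""
    return [
        math.log(coarse / fine) / math.log(refinement_ratio)
        for coarse, fine in zip(errors, errors[1:])
    ]


def achieved_flags(errors: Sequence[float], tolerances: Sequence[float]) -> list[bool]:
    """Hypothetical rule: a tolerance is met if the finest-resolution error is within it."""
    finest = errors[-1]
    return [finest <= tol for tol in tolerances]
```

Errors of 1e-2, 2.5e-3 and 6.25e-4 on grids that double each time give rates of 2, that is, second-order convergence.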
Shared Utilities
Shared constants and utilities for the benchmarking module.
Centralises domain inference, metric classification, and chemical accuracy thresholds to eliminate duplication across sub-modules.
LOWER_IS_BETTER
module-attribute
LOWER_IS_BETTER: frozenset[str] = frozenset({'mse', 'mae', 'rmse', 'relative_error', 'mape', 'execution_time'})
Metrics where a lower value indicates better performance.
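LOWER_IS_BETTER lets comparison code decide the direction of "better" per metric. A minimal sketch, assuming that any metric not in the set is higher-is-better (the module does not say how other metrics are treated); is_better is a hypothetical helper, not part of the documented API.

```python
LOWER_IS_BETTER: frozenset[str] = frozenset(
    {"mse", "mae", "rmse", "relative_error", "mape", "execution_time"}
)


def is_better(metric: str, candidate: float, incumbent: float) -> bool:
    """Return True if candidate beats incumbent on the given metric."""
    if metric in LOWER_IS_BETTER:
        return candidate < incumbent
    # Assumption: metrics outside the set (for example an R^2 score) are higher-is-better.
    return candidate > incumbent
```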
ACCURACY_METRIC_KEYS
module-attribute
Standard accuracy metric keys used across reporting and analysis.
CHEMICAL_ACCURACY_THRESHOLDS
module-attribute
CHEMICAL_ACCURACY_THRESHOLDS: dict[str, float] = {'quantum_computing': 0.001, 'materials_science': 0.05, 'molecular_dynamics': 0.01}
Domain-specific accuracy thresholds for chemical/physical accuracy checks.
infer_domain
extract_metric_value
extract_metric_value(result: BenchmarkResult, metric_name: str, default: float = float('inf')) -> float
Extract a scalar metric value from a BenchmarkResult.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| result | BenchmarkResult | Benchmark result to extract from. | required |
| metric_name | str | Name of the metric. | required |
| default | float | Value to return if the metric is absent. | float('inf') |
Returns:
| Type | Description |
|---|---|
| float | The metric value as a float. |
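A sketch of the lookup-with-default behaviour of extract_metric_value. It assumes a metric entry is either a plain number or an object exposing a value attribute; FakeMetric, FakeResult and extract_metric_value_sketch are hypothetical stand-ins, since the real Metric type comes from calibrax. The float('inf') default plausibly makes a missing error metric rank last under the lower-is-better metrics in LOWER_IS_BETTER.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class FakeMetric:
    value: float


@dataclass
class FakeResult:
    metrics: dict[str, object] = field(default_factory=dict)


def extract_metric_value_sketch(
    result: FakeResult, metric_name: str, default: float = float("inf")
) -> float:
    raw = result.metrics.get(metric_name)
    if raw is None:
        return default
    # Unwrap Metric-like objects; plain numbers pass through unchanged.
    return float(getattr(raw, "value", raw))
```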
Report Generation
Report generation for PDEBench evaluation and benchmarking results.
This module provides full report generation capabilities for PDEBench evaluation results, including statistical analysis, baseline comparisons, and publication-ready formatted outputs.
PDEBenchReportGenerator
PDEBenchReportGenerator(report_format: str = 'json')
Generator for full PDEBench evaluation reports.
Creates detailed reports from evaluation results including statistical analysis, baseline comparisons, and multiple output formats for both programmatic access and human readability.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| report_format | str | Default output format ("json" or "text") | 'json' |
generate_evaluation_report
generate_evaluation_report(evaluation_results: dict[str, Any], baseline_comparisons: dict[str, Any] | None = None, dataset_info: dict[str, str] | None = None, model_info: dict[str, str] | None = None) -> dict[str, Any]
Generate a full evaluation report.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| evaluation_results | dict[str, Any] | Results from benchmarking evaluation | required |
| baseline_comparisons | dict[str, Any] \| None | Optional baseline comparison data | None |
| dataset_info | dict[str, str] \| None | Optional dataset metadata | None |
| model_info | dict[str, str] \| None | Optional model metadata | None |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | Complete evaluation report dictionary |
format_report_as_text
Format report as human-readable text.
save_report
generate_summary_statistics
generate_comprehensive_report
generate_comprehensive_report(results: list[BenchmarkResult], include_baseline_comparison: bool = True, include_statistical_analysis: bool = True) -> dict[str, Any]
Generate a full report from benchmark results.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| results | list[BenchmarkResult] | List of BenchmarkResult objects | required |
| include_baseline_comparison | bool | Whether to include baseline comparisons | True |
| include_statistical_analysis | bool | Whether to include statistical analysis | True |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | Full report dictionary |
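The statistical-analysis part of these reports (and generate_summary_statistics) is documented without its output layout. The sketch shows one plausible per-metric summary over plain metric dictionaries; summarize is a hypothetical helper, and the key names (count, mean, std, min, max) are assumptions, not the module's actual report schema.

```python
import statistics


def summarize(results: list[dict[str, float]]) -> dict[str, dict[str, float]]:
    """Per-metric count, mean, sample std, min and max across results."""
    by_metric: dict[str, list[float]] = {}
    for metrics in results:
        for name, value in metrics.items():
            by_metric.setdefault(name, []).append(value)
    summary: dict[str, dict[str, float]] = {}
    for name, values in sorted(by_metric.items()):
        summary[name] = {
            "count": len(values),
            "mean": statistics.fmean(values),
            # stdev needs at least two samples; report 0.0 for a single value.
            "std": statistics.stdev(values) if len(values) > 1 else 0.0,
            "min": min(values),
            "max": max(values),
        }
    return summary
```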
Visualization Tools
Visualization Tools Module
This module provides visualization utilities for PDEBench benchmarking results. It generates figure metadata and configuration rather than rendering plots, so it adds no plotting-library dependency to the core scientific framework.
Key Features:
- Figure metadata generation for comparison charts
- Configuration for publication-ready visualizations
- Support for multiple chart types and metrics
- Integration with benchmarking infrastructure
Following Critical Technical Guidelines:
- JAX-native data processing
- Type hints and full documentation
- No external plotting dependencies (metadata only)
PDEBenchVisualizer
Visualization utilities for PDEBench benchmark results.
This class generates figure metadata and configurations for creating charts and plots of benchmark results. It avoids direct plotting to maintain lightweight dependencies.
create_comparison_chart
create_comparison_chart(results: list[BenchmarkResult], metric: str, title: str = 'Model Comparison', sort_by_performance: bool = True) -> dict[str, Any]
Create metadata for a model comparison chart.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| results | list[BenchmarkResult] | List of benchmark results to compare | required |
| metric | str | Metric to use for comparison | required |
| title | str | Chart title | 'Model Comparison' |
| sort_by_performance | bool | Whether to sort results by performance | True |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | Dictionary with figure metadata and configuration |
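create_comparison_chart returns metadata rather than a figure. The sketch shows one plausible shape for that metadata and how sort_by_performance can respect metric direction via LOWER_IS_BETTER. comparison_chart_metadata is a hypothetical helper; its dictionary keys (chart_type, labels, values) and the plain name-to-score input are assumptions, not the method's documented output.

```python
from typing import Any

LOWER_IS_BETTER = frozenset(
    {"mse", "mae", "rmse", "relative_error", "mape", "execution_time"}
)


def comparison_chart_metadata(
    scores: dict[str, float],
    metric: str,
    title: str = "Model Comparison",
    sort_by_performance: bool = True,
) -> dict[str, Any]:
    items = list(scores.items())
    if sort_by_performance:
        # Best first: ascending for lower-is-better metrics, descending otherwise.
        items.sort(key=lambda kv: kv[1], reverse=metric not in LOWER_IS_BETTER)
    return {
        "chart_type": "bar",
        "title": title,
        "metric": metric,
        "labels": [name for name, _ in items],
        "values": [value for _, value in items],
    }
```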
create_multi_metric_comparison
create_multi_metric_comparison(results: list[BenchmarkResult], metrics: list[str], title: str = 'Multi-Metric Comparison') -> dict[str, Any]
Create metadata for a multi-metric comparison chart.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| results | list[BenchmarkResult] | List of benchmark results | required |
| metrics | list[str] | List of metrics to compare | required |
| title | str | Chart title | 'Multi-Metric Comparison' |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | Dictionary with figure metadata |
create_performance_trends
create_performance_trends(results: list[BenchmarkResult], group_by: str = 'dataset_name', metric: str = 'mse') -> dict[str, Any]
Create metadata for performance trends visualization.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| results | list[BenchmarkResult] | List of benchmark results | required |
| group_by | str | Field to group results by | 'dataset_name' |
| metric | str | Metric to track trends for | 'mse' |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | Dictionary with trend visualization metadata |
create_baseline_comparison
create_baseline_comparison(results: list[BenchmarkResult], baseline_metrics: dict[str, dict[str, float]], metric: str = 'mse') -> dict[str, Any]
Create metadata for baseline comparison visualization.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| results | list[BenchmarkResult] | Test results to compare | required |
| baseline_metrics | dict[str, dict[str, float]] | Dictionary of baseline metrics by model type | required |
| metric | str | Metric to use for comparison | 'mse' |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | Dictionary with baseline comparison metadata |
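A common way to express a baseline comparison is the relative improvement (baseline - model) / |baseline| for lower-is-better metrics such as the default mse. The sketch computes that per baseline model type; relative_improvement and baseline_comparison are hypothetical helpers, and results are simplified to a model-name-to-value mapping rather than BenchmarkResult objects.

```python
def relative_improvement(
    model_value: float, baseline_value: float, lower_is_better: bool = True
) -> float:
    """Fractional improvement over a baseline; positive means the model is better."""
    if baseline_value == 0:
        raise ValueError("baseline value must be non-zero")
    delta = baseline_value - model_value if lower_is_better else model_value - baseline_value
    return delta / abs(baseline_value)


def baseline_comparison(
    results: dict[str, float],
    baseline_metrics: dict[str, dict[str, float]],
    metric: str = "mse",
    lower_is_better: bool = True,
) -> dict[str, dict[str, float]]:
    # Baselines that do not report the metric are skipped.
    return {
        baseline_name: {
            model: relative_improvement(value, metrics[metric], lower_is_better)
            for model, value in results.items()
        }
        for baseline_name, metrics in baseline_metrics.items()
        if metric in metrics
    }
```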
create_error_distribution
create_error_distribution(results: list[BenchmarkResult], error_metric: str = 'mae') -> dict[str, Any]
Create metadata for error distribution visualization.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| results | list[BenchmarkResult] | List of benchmark results | required |
| error_metric | str | Error metric whose distribution is analyzed | 'mae' |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | Dictionary with error distribution metadata |
create_model_ranking
create_model_ranking(results: list[BenchmarkResult], ranking_metrics: list[str], weights: dict[str, float] | None = None) -> dict[str, Any]
Create metadata for model ranking visualization.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| results | list[BenchmarkResult] | List of benchmark results | required |
| ranking_metrics | list[str] | Metrics to use for ranking | required |
| weights | dict[str, float] \| None | Optional weights for each metric | None |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | Dictionary with model ranking metadata |
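create_model_ranking combines several ranking_metrics with optional weights. One simple scheme, shown here as an assumption rather than the method's actual algorithm, is a weighted mean of per-metric ranks (1 = best), with metric direction taken from LOWER_IS_BETTER and equal weights when none are given. rank_models is a hypothetical helper operating on a plain model-to-metrics mapping.

```python
from __future__ import annotations

LOWER_IS_BETTER = frozenset(
    {"mse", "mae", "rmse", "relative_error", "mape", "execution_time"}
)


def rank_models(
    scores: dict[str, dict[str, float]],
    ranking_metrics: list[str],
    weights: dict[str, float] | None = None,
) -> list[tuple[str, float]]:
    """Return (model, weighted mean rank) pairs, best (lowest) first."""
    weights = weights or {m: 1.0 for m in ranking_metrics}
    total_weight = sum(weights[m] for m in ranking_metrics)
    combined = {model: 0.0 for model in scores}
    for metric in ranking_metrics:
        # Best model gets rank 1; ties keep input order because sorted is stable.
        ordered = sorted(
            scores,
            key=lambda model: scores[model][metric],
            reverse=metric not in LOWER_IS_BETTER,
        )
        for rank, model in enumerate(ordered, start=1):
            combined[model] += weights[metric] * rank / total_weight
    return sorted(combined.items(), key=lambda kv: kv[1])
```

Rank averaging ignores how far apart the raw scores are; a score-normalizing scheme would weigh large gaps more heavily, at the cost of sensitivity to outliers.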
get_visualization_summary
PDEBench Integration
PDEBench Integration Module
This module provides full integration with PDEBench datasets for standardized evaluation of neural operators. It includes dataset loading, preprocessing, and automated evaluation pipelines.
Key Features:
- Support for major PDEBench datasets (Advection, Burgers, Darcy Flow, etc.)
- Standardized data preprocessing for neural operator compatibility
- Automated evaluation pipelines with statistical analysis
- Integration with existing benchmarking infrastructure
Following Critical Technical Guidelines:
- JAX-native data processing for GPU compatibility
- Flax NNX integration for neural operator evaluation
- Test-driven development with full coverage
- Type hints and documentation for all public APIs
PDEBenchLoader
Loads and preprocesses PDEBench datasets for neural operator evaluation.
This class provides a unified interface for loading standard PDE benchmark datasets with automatic preprocessing for compatibility with different neural operator architectures (FNO, DeepONet, etc.).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data_root | str \| None | Root directory for PDEBench datasets | None |
| cache_dir | str \| None | Directory for caching preprocessed datasets | None |
list_available_datasets
List all supported PDEBench datasets.
get_dataset_info
load_dataset
load_dataset(dataset_name: str, subset_size: int | None = None, resolution: str = 'low', split: str = 'test', normalize: bool = True, format_for_model: str = 'auto') -> dict[str, Any]
Load and preprocess a PDEBench dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dataset_name | str | Name of the dataset to load | required |
| subset_size | int \| None | Number of samples to load (None for full dataset) | None |
| resolution | str | Resolution setting ("low", "medium", "high") | 'low' |
| split | str | Dataset split ("train", "val", "test") | 'test' |
| normalize | bool | Whether to normalize the data | True |
| format_for_model | str | Target model format ("fno", "deeponet", "auto") | 'auto' |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | Dictionary containing input_data (input arrays), target_data (target arrays), and metadata (dataset metadata). |
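The normalize and subset_size options can be illustrated with a z-score standardization, one common choice for PDE data; the loader's actual normalization scheme is not documented here. take_subset and zscore are hypothetical helpers working on flat lists, whereas real PDEBench fields are multi-dimensional arrays. Returning the statistics lets predictions be mapped back to physical units.

```python
from __future__ import annotations

import statistics
from collections.abc import Sequence


def take_subset(samples: Sequence, subset_size: int | None = None) -> Sequence:
    # None keeps the full split, mirroring load_dataset's documented default.
    return samples if subset_size is None else samples[:subset_size]


def zscore(
    values: Sequence[float], mean: float | None = None, std: float | None = None
) -> tuple[list[float], tuple[float, float]]:
    """Standardize to zero mean and unit variance; also return the (mean, std) used."""
    mean = statistics.fmean(values) if mean is None else mean
    std = statistics.pstdev(values) if std is None else std
    std = std or 1.0  # a constant field has zero spread; avoid dividing by zero
    return [(v - mean) / std for v in values], (mean, std)
```

Passing mean and std explicitly allows statistics computed on the training split to be reused for the validation and test splits.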
PDEBenchEvaluationPipeline
PDEBenchEvaluationPipeline(output_dir: str | None = None)
Automated evaluation pipeline for PDEBench datasets.
This class provides end-to-end evaluation workflows that integrate dataset loading, model evaluation, and result analysis.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| output_dir | str \| None | Directory for saving evaluation results | None |
evaluate_model_on_datasets
evaluate_model_on_datasets(model: Any, model_name: str, datasets: list[str], subset_size: int = 10, resolution: str = 'low', **kwargs: Any) -> list[BenchmarkResult]
Evaluate a model on multiple PDEBench datasets.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | Any | Neural operator model to evaluate | required |
| model_name | str | Name identifier for the model | required |
| datasets | list[str] | List of dataset names to evaluate on | required |
| subset_size | int | Number of samples per dataset | 10 |
| resolution | str | Resolution setting for datasets | 'low' |
| \*\*kwargs | Any | Additional arguments for evaluation | {} |
Returns:
| Type | Description |
|---|---|
| list[BenchmarkResult] | List of benchmark results for each dataset |
run_comprehensive_evaluation
run_comprehensive_evaluation(models: list[tuple[str, Any]], datasets: list[str] | None = None, resolutions: list[str] | None = None, subset_size: int = 10) -> dict[str, list[BenchmarkResult]]
Run a full evaluation across multiple models and datasets.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| models | list[tuple[str, Any]] | List of (model_name, model) tuples | required |
| datasets | list[str] \| None | List of datasets to evaluate (None for all supported) | None |
| resolutions | list[str] \| None | List of resolutions to test (None for just "low") | None |
| subset_size | int | Number of samples per dataset | 10 |
Returns:
| Type | Description |
|---|---|
| dict[str, list[BenchmarkResult]] | Dictionary mapping model names to their evaluation results |
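run_comprehensive_evaluation fans out over models, datasets and resolutions, with None falling back to all supported datasets and to the "low" resolution only. A minimal sketch of that control flow: the evaluate callback stands in for the per-dataset evaluation, run_grid is a hypothetical name, and SUPPORTED_DATASETS holds hypothetical identifiers for the Advection, Burgers and Darcy Flow datasets named above (the real supported list presumably comes from list_available_datasets).

```python
from __future__ import annotations

from collections.abc import Callable
from typing import Any

# Hypothetical dataset identifiers, for illustration only.
SUPPORTED_DATASETS = ["advection", "burgers", "darcy_flow"]


def run_grid(
    models: list[tuple[str, Any]],
    evaluate: Callable[[Any, str, str, int], Any],
    datasets: list[str] | None = None,
    resolutions: list[str] | None = None,
    subset_size: int = 10,
) -> dict[str, list[Any]]:
    datasets = SUPPORTED_DATASETS if datasets is None else datasets
    resolutions = ["low"] if resolutions is None else resolutions
    results: dict[str, list[Any]] = {}
    for model_name, model in models:
        # One result per (resolution, dataset) pair, grouped under the model name.
        results[model_name] = [
            evaluate(model, dataset, resolution, subset_size)
            for resolution in resolutions
            for dataset in datasets
        ]
    return results
```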
CLI
Profiling
Profiling Harness
Full JAX Profiling Harness for Opifex.
Main interface for the full profiling system that coordinates hardware-aware profiling, roofline analysis, compilation profiling, and generates actionable optimization reports.
OptimizationReport
OpifexProfilingHarness
OpifexProfilingHarness(enable_hardware_profiling: bool = True, enable_compilation_profiling: bool = True, enable_roofline_analysis: bool = True, trace_dir: str | None = None)
Full JAX profiling harness for Opifex applications.
profiling_session
profiling_session(enable_jax_profiler: bool = True)
Context manager for a full profiling session.
profile_neural_operator
profile_neural_operator(operator: Module | Callable, inputs: list[Array], operation_name: str | None = None) -> tuple[dict[str, Any], OptimizationReport]
Profile a complete neural operator with full analysis.
profile_function
profile_function(func: Callable, inputs: list[Array], function_name: str | None = None) -> tuple[dict[str, Any], OptimizationReport]
Profile a JAX function with full analysis.
compare_operations
Compare multiple operations and identify optimization opportunities.
Event Coordinator
Event Coordinator for JAX Profiling Harness.
Coordinates timing and events across multiple profilers to ensure consistent measurements and prevent interference between profiling components.
ProfilingEvent
dataclass
ProfilingEvent(*, timestamp: float, event_type: str, profiler_id: str, data: dict[str, Any] = dict(), duration_ms: float | None = None)
Represents a profiling event with timing information.
ProfilingTimeline
Thread-safe timeline for profiling events.
EventCoordinator
Coordinates profiling events and timing across multiple profilers.
register_profiler
register_profiler(profiler_id: str) -> None
Register a profiler with the coordinator.
unregister_profiler
unregister_profiler(profiler_id: str) -> None
Unregister a profiler from the coordinator.
profiling_session
Context manager for a coordinated profiling session.
add_event
add_event(event_type: str, profiler_id: str, data: dict[str, Any] | None = None, duration_ms: float | None = None) -> None
Add an event to the coordinated timeline.
time_function
time_function(func: Callable[..., Any], *args: Any, profiler_id: str = 'unknown', operation_name: str = 'operation', **kwargs: Any) -> tuple[Any, float]
Time a function execution and record the event.
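A minimal sketch of the time_function pattern: time a call with a monotonic clock, record the duration as an event, and protect the shared timeline with a lock so concurrent profilers can append safely. Event and MiniCoordinator are hypothetical stand-ins for ProfilingEvent and EventCoordinator, and the "timing" event type is an assumed label. Because JAX dispatches work asynchronously, timing a JAX function this way measures only dispatch unless the result is blocked on first (for example with the array's block_until_ready() method); whether the documented time_function does this is not stated here.

```python
from __future__ import annotations

import threading
import time
from collections.abc import Callable
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class Event:
    """Hypothetical stand-in for ProfilingEvent."""

    timestamp: float
    event_type: str
    profiler_id: str
    data: dict[str, Any] = field(default_factory=dict)
    duration_ms: float | None = None


class MiniCoordinator:
    """Hypothetical stand-in: a lock-protected event list plus a timing helper."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._events: list[Event] = []

    def add_event(
        self,
        event_type: str,
        profiler_id: str,
        data: dict[str, Any] | None = None,
        duration_ms: float | None = None,
    ) -> None:
        event = Event(time.time(), event_type, profiler_id, data or {}, duration_ms)
        with self._lock:  # serialize appends from concurrent profilers
            self._events.append(event)

    def time_function(
        self,
        func: Callable[..., Any],
        *args: Any,
        profiler_id: str = "unknown",
        operation_name: str = "operation",
        **kwargs: Any,
    ) -> tuple[Any, float]:
        start = time.perf_counter()  # monotonic, high-resolution clock
        result = func(*args, **kwargs)
        elapsed_ms = (time.perf_counter() - start) * 1e3
        self.add_event("timing", profiler_id, {"operation": operation_name}, elapsed_ms)
        return result, elapsed_ms

    def events(self) -> list[Event]:
        with self._lock:  # return a snapshot so callers never see a list mid-append
            return list(self._events)
```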
get_profiling_summary
Get a summary of the profiling session.
create_shared_coordinator
create_shared_coordinator() -> EventCoordinator
Create a shared event coordinator instance.