# Benchmarks

> 200+ optimization tasks spanning math, systems, algorithms, and reasoning

## Overview

SkyDiscover includes comprehensive benchmarks across multiple domains. Use these to:

* **Test search algorithms** on real problems
* **Learn evaluator patterns** from working examples
* **Benchmark LLM performance** on hard optimization tasks
* **Reproduce published results** from research papers

The catalog covers:

* Math: 14 tasks
* Systems: 5 tasks (ADRS)
* GPU kernels: 4 tasks (Triton)
* Competitive programming: 172 tasks (Frontier-CS)
* Reasoning: ARC-AGI tasks
* Creative tasks: image generation

## Quick Start

### Installation

```bash
# Base installation
uv sync

# Add domain-specific dependencies
uv sync --extra math                 # Math benchmarks
uv sync --extra adrs                 # Systems benchmarks
uv sync --extra external             # OpenEvolve/GEPA backends
uv sync --extra frontier-cs          # Competitive programming
uv sync --extra prompt-optimization  # Prompt evolution
```

### Running a Benchmark

```bash
export OPENAI_API_KEY="sk-..."

# Run circle packing benchmark
uv run skydiscover-run \
  benchmarks/math/circle_packing/initial_program.py \
  benchmarks/math/circle_packing/evaluator.py \
  -c benchmarks/math/circle_packing/config.yaml \
  -s adaevolve \
  -i 100
```

## Benchmark Catalog

### Math Benchmarks

**Path:** `benchmarks/math/circle_packing/`

**Problem:** Pack 26 circles in a unit square to maximize the sum of radii.

**Target:** 2.635 (AlphaEvolve result)

**Run:**

```bash
uv run skydiscover-run \
  benchmarks/math/circle_packing/initial_program.py \
  benchmarks/math/circle_packing/evaluator.py \
  -c benchmarks/math/circle_packing/config.yaml \
  -s adaevolve -i 100
```

**Evaluator excerpt:**

```python
def evaluate(program_path):
    centers, radii, sum_radii = run_packing()
    valid = validate_packing(centers, radii)
    target_ratio = sum_radii / 2.635 if valid else 0.0
    return {"combined_score": target_ratio, "sum_radii": sum_radii}
```

**Path:** `benchmarks/math/heilbronn_triangle/`

**Problem:** Place N points in a unit square to maximize the minimum triangle area.

**Run:**

```bash
uv run skydiscover-run \
  benchmarks/math/heilbronn_triangle/initial_program.py \
  benchmarks/math/heilbronn_triangle/evaluator.py \
  -s adaevolve -i 100
```

**Path:** `benchmarks/math/erdos_min_overlap/`

**Problem:** Construct sets with minimal overlap satisfying Erdős constraints.

**Paths:**

* `benchmarks/math/first_autocorr_ineq/`
* `benchmarks/math/second_autocorr_ineq/`
* `benchmarks/math/third_autocorr_ineq/`

**Problem:** Find binary sequences minimizing autocorrelation merit factor.

Additional math benchmarks:

* **Hexagon Packing:** `benchmarks/math/hexagon_packing/`
* **Heilbronn Convex:** `benchmarks/math/heilbronn_convex/`
* **Signal Processing:** `benchmarks/math/signal_processing/`
* **Matrix Multiplication:** `benchmarks/math/matmul/`
* **Min-Max Distance:** `benchmarks/math/minimizing_max_min_dist/`

### ADRS (Systems Benchmarks)

**Path:** `benchmarks/ADRS/cloudcast/`

**Problem:** Schedule cloud VMs to minimize cost while meeting performance targets.

**Dependencies:**

```bash
uv sync --extra adrs
```

**Run:**

```bash
uv run skydiscover-run \
  benchmarks/ADRS/cloudcast/initial_program.py \
  benchmarks/ADRS/cloudcast/evaluator.py \
  -s adaevolve -i 50
```

**Path:** `benchmarks/ADRS/eplb/`

**Problem:** Balance load across mixture-of-experts model to minimize latency.
**Path:** `benchmarks/ADRS/prism/`

**Problem:** Place ML models on heterogeneous devices for optimal throughput.

**Path:** `benchmarks/ADRS/txn_scheduling/`

**Problem:** Schedule database transactions to maximize concurrency.

**Path:** `benchmarks/ADRS/llm_sql/`

**Problem:** Optimize SQL queries for LLM-powered database systems.

### GPU Kernels

**Paths:**

* `benchmarks/gpu_mode/vecadd/` - Vector addition
* `benchmarks/gpu_mode/grayscale/` - Image grayscale conversion
* `benchmarks/gpu_mode/trimul/` - Matrix multiplication
* `benchmarks/gpu_mode/mla_decode/` - Multi-head latent attention decode

**Problem:** Optimize Triton GPU kernels for performance.

**Requirements:** CUDA-capable GPU

**Run:**

```bash
uv run skydiscover-run \
  benchmarks/gpu_mode/vecadd/initial_program.py \
  benchmarks/gpu_mode/vecadd/evaluator.py \
  -s adaevolve -i 50
```

### Competitive Programming

**Path:** `benchmarks/frontier-cs-eval/`

**Problem:** Solve competitive programming problems (ICPC, Codeforces, AtCoder).

**Setup:**

```bash
uv sync --extra frontier-cs
cd benchmarks/frontier-cs-eval
python run_all_frontiercs.py --model gpt-5 --search adaevolve
```

**Features:**

* Docker-based judge for secure execution
* 172 problems from Frontier-CS benchmark
* Automated testing and scoring

**Path:** `benchmarks/ale_bench/`

**Problem:** AtCoder Heuristic Contest problems (C++).

**Examples:**

* `ale_bench/ale-bench-lite-problems/ahc046/`
* `ale_bench/ale-bench-lite-problems/ahc039/`
* And 8 more...

### Reasoning

**Path:** `benchmarks/arc_benchmark/`

**Problem:** Abstract reasoning tasks (visual pattern completion).

**Description:** Generate Python code to solve ARC-AGI visual reasoning puzzles.

**Run:**

```bash
uv run skydiscover-run \
  benchmarks/arc_benchmark/evaluator.py \
  -c benchmarks/arc_benchmark/config.yaml \
  -s adaevolve -i 100
```

### Creative Tasks

**Path:** `benchmarks/image_gen/sky_festival/`

**Problem:** Evolve DALL-E/Stable Diffusion prompts for a "sky festival" image.

**Run:**

```bash
uv run skydiscover-run \
  benchmarks/image_gen/sky_festival/initial_prompt.txt \
  benchmarks/image_gen/sky_festival/evaluator.py \
  -c benchmarks/image_gen/sky_festival/config_adaevolve.yaml \
  -s adaevolve -i 50
```

**Note:** Requires image generation API credentials.

### Prompt Optimization

**Path:** `benchmarks/prompt_optimization/hotpot_qa/`

**Problem:** Evolve natural-language prompts (not code) for question-answering.

**Setup:**

```bash
uv sync --extra prompt-optimization
```

**Run:**

```bash
uv run skydiscover-run \
  benchmarks/prompt_optimization/hotpot_qa/initial_prompt.txt \
  benchmarks/prompt_optimization/hotpot_qa/evaluator.py \
  -c benchmarks/prompt_optimization/hotpot_qa/config.yaml \
  -s adaevolve -i 50
```

**Config excerpt:**

```yaml
language: text
diff_based_generation: false
file_suffix: ".txt"
```

## Benchmark Structure

Every benchmark follows this pattern:

```
/
├── initial_program.py   # Starting solution (contains EVOLVE-BLOCK)
├── evaluator.py         # Scoring function (returns combined_score)
├── config.yaml          # System prompt + search/evaluator settings
├── README.md            # Problem description and setup
└── requirements.txt     # (optional) Additional dependencies
```

### EVOLVE-BLOCK Markers

Mark the region for SkyDiscover to evolve:

```python initial_program.py
# EVOLVE-BLOCK-START
def solve(input_data):
    # LLM will improve this function
    return simple_solution(input_data)
# EVOLVE-BLOCK-END

# Code outside the block remains unchanged
def helper_function():
    pass
```

For prompt optimization tasks (`.txt` files), the entire file is evolved, so no markers are needed.
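To make the marker convention concrete, here is a minimal sketch of how the evolvable region of a source file could be separated from the code around it. This is not SkyDiscover's own parser: `split_evolve_block` is a hypothetical helper written for illustration, and it assumes exactly one pair of markers, each on its own comment line.

```python
START_MARKER = "# EVOLVE-BLOCK-START"
END_MARKER = "# EVOLVE-BLOCK-END"


def split_evolve_block(source: str) -> tuple[str, str, str]:
    """Split source into (prefix, evolvable block, suffix).

    Hypothetical helper for illustration; assumes a single marker pair.
    """
    lines = source.splitlines(keepends=True)
    start = end = None
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped == START_MARKER and start is None:
            start = i
        elif stripped == END_MARKER and start is not None:
            end = i
            break
    if start is None or end is None:
        raise ValueError("EVOLVE-BLOCK markers not found or not paired")

    prefix = "".join(lines[: start + 1])     # everything up to and including the START marker
    block = "".join(lines[start + 1 : end])  # the only region a search step may rewrite
    suffix = "".join(lines[end:])            # the END marker and everything after it
    return prefix, block, suffix


example = """\
import math

# EVOLVE-BLOCK-START
def solve(input_data):
    return sorted(input_data)
# EVOLVE-BLOCK-END

def helper_function():
    pass
"""

prefix, block, suffix = split_evolve_block(example)
# Reassembling the three parts reproduces the original file unchanged.
assert prefix + block + suffix == example
```

Keeping the markers in the prefix and suffix means a candidate rewrite only has to supply the block body; joining `prefix + new_block + suffix` yields a complete program with the untouched code intact.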
## Creating Your Own Benchmark

```python evaluator.py
def evaluate(program_path: str) -> dict:
    # Load and run the program
    result = run_program(program_path)

    # Compute score
    score = compute_score(result)

    return {
        "combined_score": score,  # Required
        "custom_metric": 0.95,    # Optional
    }
```

```python initial_program.py
# EVOLVE-BLOCK-START
def solve(input_data):
    return naive_solution(input_data)
# EVOLVE-BLOCK-END
```

Or start from scratch by omitting this file.

```yaml config.yaml
max_iterations: 100
llm:
  models:
    - name: "gpt-5"
      weight: 1.0
search:
  type: "adaevolve"
prompt:
  system_message: |
    You are an expert in [domain].
    Improve the given function to maximize [objective].
evaluator:
  timeout: 360
```

```bash
uv run skydiscover-run \
  initial_program.py \
  evaluator.py \
  -c config.yaml \
  -s adaevolve \
  -i 10
```

See [Writing Evaluators](/guides/writing-evaluators) for detailed guidance.

## Benchmark Best Practices

Keep `combined_score` in the \[0, 1] range:

```python
BEST_KNOWN = 2.635
score = min(sum_radii / BEST_KNOWN, 1.0)
```

Prevent slow programs from blocking discovery:

```yaml
evaluator:
  timeout: 60  # Kill after 60 seconds
```

Log multiple metrics for analysis:

```python
return {
    "combined_score": 0.87,
    "accuracy": 0.92,
    "speed": 1.3,
    "memory": 512,
    "validity": 1.0,
}
```

A reasonable starting point helps algorithms converge faster:

```python
# Don't start with a no-op
def solve(x):
    return x  # Too simple

# Do provide a working baseline
def solve(x):
    return simple_heuristic(x)  # Good starting point
```

## Reproducing Published Results

### AlphaEvolve (Circle Packing)

```bash
uv run skydiscover-run \
  benchmarks/math/circle_packing/initial_program.py \
  benchmarks/math/circle_packing/evaluator.py \
  -c benchmarks/math/circle_packing/config.yaml \
  -s adaevolve \
  -i 200 \
  -m gpt-5
```

**Expected:** `combined_score ≥ 0.95`, which corresponds to a sum of radii of at least about 2.50 (0.95 × 2.635 ≈ 2.503).

### Frontier-CS Benchmark

```bash
cd benchmarks/frontier-cs-eval
python run_all_frontiercs.py \
  --model gpt-5 \
  --search adaevolve \
  --iterations 100
```

**Expected:** Solve 60-80% of problems depending on difficulty tier.

## Performance Comparison

Here are typical results across search algorithms (averaged over 10 math benchmarks):

| Algorithm    | Mean Score | Best Score | Runtime (min) |
| ------------ | ---------- | ---------- | ------------- |
| topk         | 0.65       | 0.78       | 15            |
| beam\_search | 0.71       | 0.83       | 22            |
| adaevolve    | 0.82       | 0.91       | 35            |
| evox         | 0.79       | 0.89       | 40            |
| gepa         | 0.84       | 0.93       | 38            |
| openevolve   | 0.86       | 0.95       | 45            |

Results vary by problem, model, and random seed. Run your own experiments!

## Benchmark Categories Summary

| Category    | # Tasks  | Avg Runtime   | Dependencies                  |
| ----------- | -------- | ------------- | ----------------------------- |
| Math        | 14       | 20-40 min     | `--extra math`                |
| ADRS        | 5        | 30-60 min     | `--extra adrs`                |
| GPU         | 4        | 10-30 min     | CUDA GPU                      |
| Frontier-CS | 172      | 5-20 min each | `--extra frontier-cs`         |
| ARC-AGI     | Multiple | 40-80 min     | Base install                  |
| ALE-Bench   | 10       | 30-60 min     | C++ compiler                  |
| Image Gen   | 1        | 40-60 min     | Image API                     |
| Prompts     | 1        | 20-40 min     | `--extra prompt-optimization` |

## Next Steps

* Learn from benchmark evaluators
* Understand benchmark configs
* Run your first benchmark
* Browse all benchmarks on GitHub