
Overview

SkyDiscover includes comprehensive benchmarks across multiple domains. Use these to:
  • Test search algorithms on real problems
  • Learn evaluator patterns from working examples
  • Benchmark LLM performance on hard optimization tasks
  • Reproduce published results from research papers

Available domains:
  • Math: 14 tasks
  • Systems: 5 tasks (ADRS)
  • GPU Kernels: 4 tasks (Triton)
  • Algorithms: 172 tasks (Frontier-CS)
  • Reasoning: ARC-AGI tasks
  • Creative: Image generation

Quick Start

Installation

# Base installation
uv sync

# Add domain-specific dependencies
uv sync --extra math                # Math benchmarks
uv sync --extra adrs                # Systems benchmarks
uv sync --extra external            # OpenEvolve/GEPA backends
uv sync --extra frontier-cs         # Competitive programming
uv sync --extra prompt-optimization # Prompt evolution

Running a Benchmark

export OPENAI_API_KEY="sk-..."

# Run circle packing benchmark
uv run skydiscover-run \
  benchmarks/math/circle_packing/initial_program.py \
  benchmarks/math/circle_packing/evaluator.py \
  -c benchmarks/math/circle_packing/config.yaml \
  -s adaevolve \
  -i 100

Benchmark Catalog

Math Benchmarks

Path: benchmarks/math/circle_packing/
Problem: Pack 26 circles in a unit square to maximize the sum of radii.
Target: 2.635 (AlphaEvolve result)
Run:
uv run skydiscover-run \
  benchmarks/math/circle_packing/initial_program.py \
  benchmarks/math/circle_packing/evaluator.py \
  -c benchmarks/math/circle_packing/config.yaml \
  -s adaevolve -i 100
Evaluator excerpt:
def evaluate(program_path):
    centers, radii, sum_radii = run_packing()
    valid = validate_packing(centers, radii)
    target_ratio = sum_radii / 2.635 if valid else 0.0
    return {"combined_score": target_ratio, "sum_radii": sum_radii}
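The excerpt calls validate_packing without showing it. The following is a minimal sketch of what such a check could look like, assuming centers are (x, y) pairs and radii are plain floats; the function body and the tol parameter are illustrative assumptions, not the code shipped in the benchmark.

```python
import math
from typing import Sequence, Tuple


def validate_packing(
    centers: Sequence[Tuple[float, float]],
    radii: Sequence[float],
    tol: float = 1e-9,  # hypothetical tolerance for floating-point noise
) -> bool:
    """Return True if every circle lies in the unit square and no two overlap."""
    if len(centers) != len(radii):
        return False
    for (x, y), r in zip(centers, radii):
        # Each circle must have a non-negative radius and stay inside [0, 1] x [0, 1].
        if r < 0 or x - r < -tol or x + r > 1 + tol or y - r < -tol or y + r > 1 + tol:
            return False
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            # Two circles overlap when their centers are closer than the sum of their radii.
            if math.dist(centers[i], centers[j]) < radii[i] + radii[j] - tol:
                return False
    return True
```

The pairwise loop is quadratic in the number of circles, which is negligible for 26 circles.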
Path: benchmarks/math/heilbronn_triangle/
Problem: Place N points in a unit square to maximize the minimum triangle area.
Run:
uv run skydiscover-run \
  benchmarks/math/heilbronn_triangle/initial_program.py \
  benchmarks/math/heilbronn_triangle/evaluator.py \
  -s adaevolve -i 100
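To make the objective concrete, here is a rough sketch of how the minimum triangle area could be computed for a candidate point set. It assumes the unit-square statement above and points given as (x, y) tuples; the function names are hypothetical and the benchmark's own evaluator may differ.

```python
from itertools import combinations
from typing import Sequence, Tuple

Point = Tuple[float, float]


def triangle_area(a: Point, b: Point, c: Point) -> float:
    # Half the absolute cross product of the two edge vectors from a.
    return abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])) / 2.0


def min_triangle_area(points: Sequence[Point]) -> float:
    """Smallest area over all point triples; 0.0 if any point leaves the unit square."""
    if len(points) < 3:
        raise ValueError("need at least three points")
    if any(not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0) for x, y in points):
        return 0.0
    return min(triangle_area(a, b, c) for a, b, c in combinations(points, 3))
```

Three collinear points give an area of zero, so a good configuration has to avoid near-collinear triples as well as clustered points.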
Path: benchmarks/math/erdos_min_overlap/
Problem: Construct sets with minimal overlap satisfying Erdős constraints.
Paths:
  • benchmarks/math/first_autocorr_ineq/
  • benchmarks/math/second_autocorr_ineq/
  • benchmarks/math/third_autocorr_ineq/
Problem: Find binary sequences minimizing autocorrelation merit factor.
Additional math benchmarks:
  • Hexagon Packing: benchmarks/math/hexagon_packing/
  • Heilbronn Convex: benchmarks/math/heilbronn_convex/
  • Signal Processing: benchmarks/math/signal_processing/
  • Matrix Multiplication: benchmarks/math/matmul/
  • Min-Max Distance: benchmarks/math/minimizing_max_min_dist/

ADRS (Systems Benchmarks)

Path: benchmarks/ADRS/cloudcast/
Problem: Schedule cloud VMs to minimize cost while meeting performance targets.
Dependencies:
uv sync --extra adrs
Run:
uv run skydiscover-run \
  benchmarks/ADRS/cloudcast/initial_program.py \
  benchmarks/ADRS/cloudcast/evaluator.py \
  -s adaevolve -i 50
Path: benchmarks/ADRS/eplb/
Problem: Balance load across a mixture-of-experts model to minimize latency.
Path: benchmarks/ADRS/prism/
Problem: Place ML models on heterogeneous devices for optimal throughput.
Path: benchmarks/ADRS/txn_scheduling/
Problem: Schedule database transactions to maximize concurrency.
Path: benchmarks/ADRS/llm_sql/
Problem: Optimize SQL queries for LLM-powered database systems.

GPU Kernels

Paths:
  • benchmarks/gpu_mode/vecadd/ - Vector addition
  • benchmarks/gpu_mode/grayscale/ - Image grayscale conversion
  • benchmarks/gpu_mode/trimul/ - Matrix multiplication
  • benchmarks/gpu_mode/mla_decode/ - Multi-head latent attention decode
Problem: Optimize Triton GPU kernels for performance.
Requirements: CUDA-capable GPU
Run:
uv run skydiscover-run \
  benchmarks/gpu_mode/vecadd/initial_program.py \
  benchmarks/gpu_mode/vecadd/evaluator.py \
  -s adaevolve -i 50

Competitive Programming

Path: benchmarks/frontier-cs-eval/
Problem: Solve competitive programming problems (ICPC, Codeforces, AtCoder).
Setup:
uv sync --extra frontier-cs
cd benchmarks/frontier-cs-eval
python run_all_frontiercs.py --model gpt-5 --search adaevolve
Features:
  • Docker-based judge for secure execution
  • 172 problems from Frontier-CS benchmark
  • Automated testing and scoring
Path: benchmarks/ale_bench/
Problem: AtCoder Heuristic Contest problems (C++).
Examples:
  • ale_bench/ale-bench-lite-problems/ahc046/
  • ale_bench/ale-bench-lite-problems/ahc039/
  • And 8 more…

Reasoning

Path: benchmarks/arc_benchmark/
Problem: Abstract reasoning tasks (visual pattern completion).
Description: Generate Python code to solve ARC-AGI visual reasoning puzzles.
Run:
uv run skydiscover-run \
  benchmarks/arc_benchmark/evaluator.py \
  -c benchmarks/arc_benchmark/config.yaml \
  -s adaevolve -i 100
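ARC-AGI puzzles are conventionally judged by exact match between the predicted and expected output grids. The sketch below shows one way a score in [0, 1] could be derived from that rule; the function name and the list-of-lists grid representation are assumptions, and the shipped evaluator may score differently (for example with partial credit).

```python
from typing import List, Sequence, Tuple

Grid = List[List[int]]


def arc_accuracy(pairs: Sequence[Tuple[Grid, Grid]]) -> float:
    """Fraction of (predicted, expected) grid pairs that match exactly."""
    if not pairs:
        return 0.0
    # Nested lists compare element by element, so shape mismatches count as failures.
    correct = sum(1 for predicted, expected in pairs if predicted == expected)
    return correct / len(pairs)
```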

Creative Tasks

Path: benchmarks/image_gen/sky_festival/
Problem: Evolve DALL-E/Stable Diffusion prompts for a “sky festival” image.
Run:
uv run skydiscover-run \
  benchmarks/image_gen/sky_festival/initial_prompt.txt \
  benchmarks/image_gen/sky_festival/evaluator.py \
  -c benchmarks/image_gen/sky_festival/config_adaevolve.yaml \
  -s adaevolve -i 50
Note: Requires image generation API credentials.

Prompt Optimization

Path: benchmarks/prompt_optimization/hotpot_qa/
Problem: Evolve natural-language prompts (not code) for question-answering.
Setup:
uv sync --extra prompt-optimization
Run:
uv run skydiscover-run \
  benchmarks/prompt_optimization/hotpot_qa/initial_prompt.txt \
  benchmarks/prompt_optimization/hotpot_qa/evaluator.py \
  -c benchmarks/prompt_optimization/hotpot_qa/config.yaml \
  -s adaevolve -i 50
Config excerpt:
language: text
diff_based_generation: false
file_suffix: ".txt"

Benchmark Structure

Every benchmark follows this pattern:
<benchmark_name>/
├── initial_program.py      # Starting solution (contains EVOLVE-BLOCK)
├── evaluator.py           # Scoring function (returns combined_score)
├── config.yaml            # System prompt + search/evaluator settings
├── README.md              # Problem description and setup
└── requirements.txt       # (optional) Additional dependencies

EVOLVE-BLOCK Markers

Mark the region for SkyDiscover to evolve:
initial_program.py
# EVOLVE-BLOCK-START
def solve(input_data):
    # LLM will improve this function
    return simple_solution(input_data)
# EVOLVE-BLOCK-END

# Code outside the block remains unchanged
def helper_function():
    pass
For prompt optimization tasks (.txt files), the entire file is evolved — no markers needed.
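To illustrate what the markers delimit, here is a small sketch that splits a source file into the frozen prefix, the evolvable block, and the frozen suffix. The helper name split_evolve_block is hypothetical, and this is not SkyDiscover's own parser; it only assumes that each marker sits on its own line as in the example above.

```python
from typing import Tuple

START = "# EVOLVE-BLOCK-START"
END = "# EVOLVE-BLOCK-END"


def split_evolve_block(source: str) -> Tuple[str, str, str]:
    """Return (prefix, block, suffix); block is the text between the two markers."""
    lines = source.splitlines(keepends=True)
    start = next((i for i, line in enumerate(lines) if line.strip() == START), None)
    end = next((i for i, line in enumerate(lines) if line.strip() == END), None)
    if start is None or end is None or end < start:
        raise ValueError("missing or misordered EVOLVE-BLOCK markers")
    prefix = "".join(lines[: start + 1])   # everything up to and including the start marker
    block = "".join(lines[start + 1 : end])  # the region a search algorithm would rewrite
    suffix = "".join(lines[end:])          # the end marker and all code after it
    return prefix, block, suffix
```

Concatenating the three parts reproduces the original file, so a rewritten block can be spliced back in without touching the surrounding code.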

Creating Your Own Benchmark

1. Write an Evaluator

evaluator.py
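def evaluate(program_path: str) -> dict:
    # Load and run the program.
    # run_program is a placeholder: implement it for your task.
    result = run_program(program_path)

    # Compute the score.
    # compute_score is a placeholder: it should map the result to a float.
    score = compute_score(result)

    return {
        "combined_score": score,  # Required
        "custom_metric": 0.95,    # Optional (example value)
    }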
2. (Optional) Create Initial Program

initial_program.py
# EVOLVE-BLOCK-START
def solve(input_data):
    return naive_solution(input_data)
# EVOLVE-BLOCK-END
Or start from scratch by omitting this file.
3. Write Config

config.yaml
max_iterations: 100

llm:
  models:
    - name: "gpt-5"
      weight: 1.0

search:
  type: "adaevolve"

prompt:
  system_message: |
    You are an expert in [domain].
    Improve the given function to maximize [objective].

evaluator:
  timeout: 360
4. Test Locally

uv run skydiscover-run \
  initial_program.py \
  evaluator.py \
  -c config.yaml \
  -s adaevolve \
  -i 10
See Writing Evaluators for detailed guidance.
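The evaluator template in step 1 leaves loading and scoring open. As one possible way to fill it in, the sketch below imports the candidate file with importlib and scores it on a toy task. The sorting task, the test cases, the load_solve helper, and the extra "passed" and "error" keys are all assumptions made for illustration; only the evaluate(program_path) signature and the combined_score key come from the steps above.

```python
import importlib.util
from pathlib import Path


def load_solve(program_path: str):
    """Import the candidate file as a module and return its `solve` function."""
    path = Path(program_path)
    spec = importlib.util.spec_from_file_location(path.stem, str(path))
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load {program_path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.solve


def evaluate(program_path: str) -> dict:
    # Hypothetical task: solve(values) should return the values sorted ascending.
    cases = [[3, 1, 2], [5, 4], []]
    try:
        solve = load_solve(program_path)
        passed = sum(1 for case in cases if solve(list(case)) == sorted(case))
    except Exception as exc:
        # A crashing candidate gets the lowest score instead of aborting the run.
        return {"combined_score": 0.0, "error": str(exc)}
    return {"combined_score": passed / len(cases), "passed": passed}
```

Catching exceptions inside evaluate keeps a single broken candidate from stopping the whole search, at the cost of hiding the traceback unless it is logged separately.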

Benchmark Best Practices

Keep combined_score in [0, 1] range:
BEST_KNOWN = 2.635
score = min(sum_radii / BEST_KNOWN, 1.0)
Prevent slow programs from blocking discovery:
evaluator:
  timeout: 60  # Kill after 60 seconds
Log multiple metrics for analysis:
return {
    "combined_score": 0.87,
    "accuracy": 0.92,
    "speed": 1.3,
    "memory": 512,
    "validity": 1.0,
}
A reasonable starting point helps algorithms converge faster:
# Don't start with a no-op
def solve(x):
    return x  # Too simple

# Do provide a working baseline
def solve(x):
    return simple_heuristic(x)  # Good starting point
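The [0, 1] normalization above covers a maximization objective. A rough sketch that also handles minimization and invalid values might look like this; the helper name, the maximize flag, and the choice to score non-positive or non-finite values as 0.0 are assumptions, not part of SkyDiscover.

```python
import math


def normalized_score(value: float, best_known: float, maximize: bool = True) -> float:
    """Map a raw objective onto [0, 1], where 1.0 means matching the best known value."""
    if not math.isfinite(value) or value <= 0 or best_known <= 0:
        return 0.0  # treat invalid or non-positive results as failures
    # For minimization, a smaller value should give a larger score.
    ratio = value / best_known if maximize else best_known / value
    return min(ratio, 1.0)
```

Clamping at 1.0 keeps the range fixed but hides any improvement beyond the best known value, so it can be worth reporting the raw objective as a separate metric.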

Reproducing Published Results

AlphaEvolve (Circle Packing)

uv run skydiscover-run \
  benchmarks/math/circle_packing/initial_program.py \
  benchmarks/math/circle_packing/evaluator.py \
  -c benchmarks/math/circle_packing/config.yaml \
  -s adaevolve \
  -i 200 \
  -m gpt-5
Expected: combined_score ≥ 0.95, i.e. a sum of radii of at least 0.95 × 2.635 ≈ 2.50.

Frontier-CS Benchmark

cd benchmarks/frontier-cs-eval
python run_all_frontiercs.py \
  --model gpt-5 \
  --search adaevolve \
  --iterations 100
Expected: Solve 60-80% of problems depending on difficulty tier.

Performance Comparison

Here are typical results across search algorithms (averaged over 10 math benchmarks):
Algorithm | Mean Score | Best Score | Runtime (min)
topk | 0.65 | 0.78 | 15
beam_search | 0.71 | 0.83 | 22
adaevolve | 0.82 | 0.91 | 35
evox | 0.79 | 0.89 | 40
gepa | 0.84 | 0.93 | 38
openevolve | 0.86 | 0.95 | 45
Results vary by problem, model, and random seed. Run your own experiments!
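Because results depend on the random seed, a comparison is more trustworthy when each algorithm is run several times and the spread is reported alongside the mean. The sketch below summarizes such runs; it assumes you have already collected the final combined_score of each run into a dict keyed by algorithm name, and the function name and output keys are hypothetical.

```python
from statistics import mean, stdev
from typing import Dict, List


def summarize_runs(scores_by_algorithm: Dict[str, List[float]]) -> Dict[str, dict]:
    """Compute mean, best, and sample standard deviation of combined_score per algorithm."""
    summary = {}
    for name, scores in scores_by_algorithm.items():
        if not scores:
            continue  # skip algorithms with no completed runs
        summary[name] = {
            "mean": mean(scores),
            "best": max(scores),
            # The sample standard deviation needs at least two runs.
            "stdev": stdev(scores) if len(scores) > 1 else 0.0,
            "runs": len(scores),
        }
    return summary
```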

Benchmark Categories Summary

Category | # Tasks | Avg Runtime | Dependencies
Math | 14 | 20-40 min | --extra math
ADRS | 5 | 30-60 min | --extra adrs
GPU | 4 | 10-30 min | CUDA GPU
Frontier-CS | 172 | 5-20 min each | --extra frontier-cs
ARC-AGI | Multiple | 40-80 min | Base install
ALE-Bench | 10 | 30-60 min | C++ compiler
Image Gen | 1 | 40-60 min | Image API
Prompts | 1 | 20-40 min | --extra prompt-optimization

Next Steps

  • Writing Evaluators: Learn from benchmark evaluators
  • Configuration: Understand benchmark configs
  • Running Discovery: Run your first benchmark
  • GitHub Repository: Browse all benchmarks on GitHub