# 🏆 MLX Benchmark V2 Leaderboard

Evaluating LLM proficiency on Apple's MLX machine learning framework

520 questions · 11 categories · 6 question types · 4 difficulty levels
| Rank | Model | Overall | Provider | Correct | Easy | Medium | Hard | Very Hard |
|---|---|---|---|---|---|---|---|---|
| 🥇 1 | gemini-2.5-flash-lite-preview-09-2025 | 89.6 | Google | 466/520 | 91.7 | 85.1 | 92.7 | 92 |
## 📖 About the MLX Benchmark V2
The MLX Benchmark V2 is a curated evaluation benchmark of 520 questions designed to measure LLM proficiency in Apple's MLX machine learning framework.
MLX is an array framework for ML on Apple Silicon that leverages unified memory, lazy evaluation, and function transforms, paradigms that differ significantly from PyTorch and JAX. This benchmark directly measures whether models can understand, write, and debug MLX code.
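For illustration, here is a minimal sketch of the first two paradigms using the public `mlx.core` API (the arrays and variable names are our own example, not from the benchmark):

```python
import mlx.core as mx

a = mx.random.normal((4, 4))
b = a @ a.T + 1.0  # builds a lazy computation graph; nothing has run yet
mx.eval(b)         # forces evaluation; forgetting this is the classic MLX pitfall

# Unified memory: a and b are directly usable on CPU and GPU alike,
# so there is no PyTorch-style .to("cuda") / .to("cpu") transfer step.
```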
## 📊 Benchmark Structure
| Dimension | Options |
|---|---|
| Categories (11) | mlx_core (188), mlx_nn (73), mlx_lm (61), mlx_lm_lora (55), coding (35), mlx_embeddings (21), debugging (21), mlx_optimizers (19), mlx_vlm (19), mlx_embeddings_lora (15), conceptual (13) |
| Question Types (6) | QA (432), Coding (33), Debug (21), MCQ (12), True/False (12), Fill-in-the-Blank (10) |
| Difficulty Levels (4) | Easy (180), Medium (181), Hard (109), Very Hard (50) |
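If the dataset follows standard Hub conventions, it can be inspected directly. A minimal sketch, assuming a default `train` split (the dataset name is taken from the Links section below; field names are not assumed here, so we only count rows):

```python
from datasets import load_dataset

# Split name is an assumption; adjust if the dataset card says otherwise.
ds = load_dataset("Goekdeniz-Guelmez/MLX-Benchmark-V2", split="train")
print(len(ds))  # expected: 520 questions
```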
## 🔬 Why MLX Needs Its Own Benchmark
Existing LLM benchmarks (HumanEval, MBPP, MMLU) don't cover MLX-specific patterns:

- **Lazy Evaluation** → Forgetting `mx.eval()` is the #1 MLX bug, unique to this framework
- **Unified Memory** → No `.to(device)` calls, which trips up PyTorch-trained models
- **Function Transforms** → `mx.grad(mx.vmap(f))` differs from JAX's equivalents (see the sketch after this list)
- **MLX Ecosystem** → `mlx-lm`, `mlx-vlm`, `mlx-embeddings` each have their own APIs
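A minimal sketch of composing transforms with the public `mlx.core` API (the function and data are our own illustration; we compose `mx.vmap(mx.grad(f))`, the usual per-example-gradient pattern, since `mx.grad` requires a scalar-valued function):

```python
import mlx.core as mx

def f(x):
    return (x ** 2).sum()  # scalar-valued, as mx.grad requires

x = mx.array([1.0, 2.0, 3.0])
print(mx.grad(f)(x))  # df/dx = 2x -> array([2, 4, 6])

# Transforms compose: vmap the gradient function to get per-row gradients.
batch = mx.array([[1.0, 2.0], [3.0, 4.0]])
print(mx.vmap(mx.grad(f))(batch))  # [[2, 4], [6, 8]]
```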
## 🧪 Evaluation Methodology
| Question Type | Evaluation Method |
|---|---|
| MCQ, True/False | Exact matching (letter/keyword extraction) |
| QA, Fill-in-Blank, Coding, Debug | LLM judge comparing against reference answers |
**Scoring:** Each question is marked correct or incorrect; aggregate accuracy is then computed overall and per breakdown (question type, difficulty, and category).
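The aggregation itself is straightforward. A hypothetical sketch, not the benchmark's actual code (the result-record field names are assumed for illustration):

```python
from collections import defaultdict

def aggregate(results):
    """results: list of dicts with a boolean 'correct' plus 'type',
    'difficulty', and 'category' keys (field names assumed)."""
    overall = 100 * sum(r["correct"] for r in results) / len(results)
    buckets = {dim: defaultdict(lambda: {"total": 0, "correct": 0})
               for dim in ("type", "difficulty", "category")}
    for r in results:
        for dim in buckets:
            b = buckets[dim][r[dim]]
            b["total"] += 1
            b["correct"] += int(r["correct"])
    return overall, buckets
```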
## 🛠️ Running the Benchmark

```bash
pip install mlx-benchmark

# Benchmark a local Ollama model
mlx-bench --model llama3.2

# Benchmark with a cloud provider
mlx-bench --provider anthropic --model claude-sonnet-4-20250514

# Filter by difficulty or type
mlx-bench --model llama3.2 --difficulties hard very-hard --types coding debug
```
## 🔗 Links

- Dataset: Goekdeniz-Guelmez/MLX-Benchmark-V2
- CLI Tool: `pip install mlx-benchmark`
- GitHub: Goekdeniz-Guelmez/MLX-Benchmark
- Created by: Gökdeniz Gülmez
## 📨 Submit Your Results

Submit benchmark results for any model to be included on this leaderboard.

### How to Run

```bash
pip install mlx-benchmark

# Run benchmark and save results
mlx-bench --model <your-model> --provider <provider>
```

This produces a JSON file like `bench_<provider>_<model>_<timestamp>.json`.

### Output Format

The benchmark CLI automatically generates results in this format:
```json
{
"model": "openrouter/openai/gpt-5-nano",
"judge": "openrouter/google/gemini-3-flash-preview",
"timestamp": "20260418_170350",
"stats": {
"total": 520,
"correct": 218,
"accuracy": 41.92,
"by_type": {
"qa": { "total": 432, "correct": 192 },
"mcq": { "total": 12, "correct": 6 },
"fill_blank": { "total": 10, "correct": 8 },
"true_false": { "total": 12, "correct": 9 },
"coding": { "total": 33, "correct": 0 },
"debug": { "total": 21, "correct": 3 }
},
"by_difficulty": {
"easy": { "total": 180, "correct": 67 },
"medium": { "total": 181, "correct": 89 },
"hard": { "total": 109, "correct": 45 },
"very-hard": { "total": 50, "correct": 17 }
},
"by_category": {
"mlx_core": { "total": 188, "correct": 88 },
"mlx_nn": { "total": 73, "correct": 32 },
"mlx_optimizers": { "total": 19, "correct": 5 },
"mlx_lm_lora": { "total": 55, "correct": 22 },
"mlx_lm": { "total": 61, "correct": 30 },
"mlx_embeddings_lora": { "total": 15, "correct": 7 },
"mlx_vlm": { "total": 19, "correct": 9 },
"mlx_embeddings": { "total": 21, "correct": 12 },
"coding": { "total": 35, "correct": 2 },
"debugging": { "total": 21, "correct": 3 },
"conceptual": { "total": 13, "correct": 8 }
}
},
"results": []
}
```
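Before submitting, it can be worth sanity-checking the file. A hypothetical snippet (the filename below is just an example following the pattern above, and this helper is not part of the CLI):

```python
import json

# Example filename following the bench_<provider>_<model>_<timestamp>.json pattern
with open("bench_openrouter_gpt-5-nano_20260418_170350.json") as f:
    data = json.load(f)

stats = data["stats"]
assert stats["total"] == 520, "MLX Benchmark V2 has 520 questions"
recomputed = round(100 * stats["correct"] / stats["total"], 2)
assert abs(recomputed - stats["accuracy"]) < 0.01, "accuracy should equal correct/total"
```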
### How to Submit

- Run `mlx-bench` on your model
- Open a Discussion on this Space
- Attach or paste your output JSON file
- Results will be reviewed and added to the leaderboard
### Guidelines

- Results must be from MLX Benchmark V2 (520 questions)
- Use the standard `mlx-bench` CLI; do not manually edit results
- Specify the exact model name/version
For questions, contact Gökdeniz Gülmez.