# 🏆 MLX Benchmark V2 Leaderboard

Evaluating LLM proficiency on Apple's MLX machine learning framework

520 questions · 11 categories · 6 question types · 4 difficulty levels
| Rank | Model | Overall | Provider | Correct | Easy | Medium | Hard | Very Hard |
|---|---|---|---|---|---|---|---|---|
| 🥇 1 | gemini-2.5-flash-lite-preview-09-2025 | 89.6 | Google | 466/520 | 91.7 | 85.1 | 92.7 | 92 |
## 📖 About the MLX Benchmark V2
The MLX Benchmark V2 is a curated evaluation benchmark of 520 questions designed to measure LLM proficiency in Apple's MLX machine learning framework.
MLX is an array framework for ML on Apple Silicon that leverages unified memory, lazy evaluation, and function transforms, paradigms that differ significantly from PyTorch and JAX. This benchmark directly measures whether models can understand, write, and debug MLX code.
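For illustration, here is a minimal sketch of the first two paradigms using the public `mlx.core` API (the arrays and variable names are our own example, not from the benchmark):

```python
import mlx.core as mx

a = mx.random.normal((4, 4))
b = a @ a.T + 1.0  # builds a lazy computation graph; nothing has run yet
mx.eval(b)         # forces evaluation; forgetting this is the classic MLX pitfall

# Unified memory: a and b are directly usable on CPU and GPU alike,
# so there is no PyTorch-style .to("cuda") / .to("cpu") transfer step.
```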
## 📊 Benchmark Structure
| Dimension | Options |
|---|---|
| Categories (11) | mlx_core (188), mlx_nn (73), mlx_lm (61), mlx_lm_lora (55), coding (35), mlx_embeddings (21), debugging (21), mlx_optimizers (19), mlx_vlm (19), mlx_embeddings_lora (15), conceptual (13) |
| Question Types (6) | QA (432), Coding (33), Debug (21), MCQ (12), True/False (12), Fill-in-the-Blank (10) |
| Difficulty Levels (4) | Easy (180), Medium (181), Hard (109), Very Hard (50) |
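If the dataset follows standard Hub conventions, it can be inspected directly. A minimal sketch, assuming a default `train` split (the dataset name is taken from the Links section below; field names are not assumed here, so we only count rows):

```python
from datasets import load_dataset

# Split name is an assumption; adjust if the dataset card says otherwise.
ds = load_dataset("Goekdeniz-Guelmez/MLX-Benchmark-V2", split="train")
print(len(ds))  # expected: 520 questions
```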
## 🔬 Why MLX Needs Its Own Benchmark
Existing LLM benchmarks (HumanEval, MBPP, MMLU) don't cover MLX-specific patterns:

- **Lazy Evaluation** → Forgetting `mx.eval()` is the #1 MLX bug, unique to this framework
- **Unified Memory** → No `.to(device)` calls, which trips up PyTorch-trained models
- **Function Transforms** → `mx.grad(mx.vmap(f))` differs from JAX's equivalents (see the sketch after this list)
- **MLX Ecosystem** → `mlx-lm`, `mlx-vlm`, `mlx-embeddings` each have their own APIs
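A minimal sketch of composing transforms with the public `mlx.core` API (the function and data are our own illustration; we compose `mx.vmap(mx.grad(f))`, the usual per-example-gradient pattern, since `mx.grad` requires a scalar-valued function):

```python
import mlx.core as mx

def f(x):
    return (x ** 2).sum()  # scalar-valued, as mx.grad requires

x = mx.array([1.0, 2.0, 3.0])
print(mx.grad(f)(x))  # df/dx = 2x -> array([2, 4, 6])

# Transforms compose: vmap the gradient function to get per-row gradients.
batch = mx.array([[1.0, 2.0], [3.0, 4.0]])
print(mx.vmap(mx.grad(f))(batch))  # [[2, 4], [6, 8]]
```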
## 🧪 Evaluation Methodology
| Question Type | Evaluation Method |
|---|---|
| MCQ, True/False | Exact matching (letter/keyword extraction) |
| QA, Fill-in-Blank, Coding, Debug | LLM judge comparing against reference answers |
**Scoring:** Each question is marked correct or incorrect; aggregate accuracy is then computed overall and per breakdown (question type, difficulty, and category).
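The aggregation itself is straightforward. A hypothetical sketch, not the benchmark's actual code (the result-record field names are assumed for illustration):

```python
from collections import defaultdict

def aggregate(results):
    """results: list of dicts with a boolean 'correct' plus 'type',
    'difficulty', and 'category' keys (field names assumed)."""
    overall = 100 * sum(r["correct"] for r in results) / len(results)
    buckets = {dim: defaultdict(lambda: {"total": 0, "correct": 0})
               for dim in ("type", "difficulty", "category")}
    for r in results:
        for dim in buckets:
            b = buckets[dim][r[dim]]
            b["total"] += 1
            b["correct"] += int(r["correct"])
    return overall, buckets
```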
## 🛠️ Running the Benchmark

```bash
pip install mlx-benchmark

# Benchmark a local Ollama model
mlx-bench --model llama3.2

# Benchmark with a cloud provider
mlx-bench --provider anthropic --model claude-sonnet-4-20250514

# Filter by difficulty or type
mlx-bench --model llama3.2 --difficulties hard very-hard --types coding debug
```
## 🔗 Links

- Dataset: Goekdeniz-Guelmez/MLX-Benchmark-V2
- CLI Tool: `pip install mlx-benchmark`
- GitHub: Goekdeniz-Guelmez/MLX-Benchmark
- Created by: Gökdeniz Gülmez
## 📨 Submit Your Results

Submit benchmark results for any model to be included on this leaderboard.

### How to Run

```bash
pip install mlx-benchmark

# Run benchmark and save results
mlx-bench --model <your-model> --provider <provider>
```

This produces a JSON file like `bench_<provider>_<model>_<timestamp>.json`.

### Output Format

The benchmark CLI automatically generates results in this format:
```json
{
"model": "openrouter/openai/gpt-5-nano",
"judge": "openrouter/google/gemini-3-flash-preview",
"timestamp": "20260418_170350",
"stats": {
"total": 520,
"correct": 218,
"accuracy": 41.92,
"by_type": {
"qa": { "total": 432, "correct": 192 },
"mcq": { "total": 12, "correct": 6 },
"fill_blank": { "total": 10, "correct": 8 },
"true_false": { "total": 12, "correct": 9 },
"coding": { "total": 33, "correct": 0 },
"debug": { "total": 21, "correct": 3 }
},
"by_difficulty": {
"easy": { "total": 180, "correct": 67 },
"medium": { "total": 181, "correct": 89 },
"hard": { "total": 109, "correct": 45 },
"very-hard": { "total": 50, "correct": 17 }
},
"by_category": {
"mlx_core": { "total": 188, "correct": 88 },
"mlx_nn": { "total": 73, "correct": 32 },
"mlx_optimizers": { "total": 19, "correct": 5 },
"mlx_lm_lora": { "total": 55, "correct": 22 },
"mlx_lm": { "total": 61, "correct": 30 },
"mlx_embeddings_lora": { "total": 15, "correct": 7 },
"mlx_vlm": { "total": 19, "correct": 9 },
"mlx_embeddings": { "total": 21, "correct": 12 },
"coding": { "total": 35, "correct": 2 },
"debugging": { "total": 21, "correct": 3 },
"conceptual": { "total": 13, "correct": 8 }
}
},
"results": []
}
```
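Before submitting, it can be worth sanity-checking the file. A hypothetical snippet (the filename below is just an example following the pattern above, and this helper is not part of the CLI):

```python
import json

# Example filename following the bench_<provider>_<model>_<timestamp>.json pattern
with open("bench_openrouter_gpt-5-nano_20260418_170350.json") as f:
    data = json.load(f)

stats = data["stats"]
assert stats["total"] == 520, "MLX Benchmark V2 has 520 questions"
recomputed = round(100 * stats["correct"] / stats["total"], 2)
assert abs(recomputed - stats["accuracy"]) < 0.01, "accuracy should equal correct/total"
```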
### How to Submit

- Run `mlx-bench` on your model
- Open a Discussion on this Space
- Attach or paste your output JSON file
- Results will be reviewed and added to the leaderboard
### Guidelines

- Results must be from MLX Benchmark V2 (520 questions)
- Use the standard `mlx-bench` CLI; do not manually edit results
- Specify the exact model name/version
For questions, contact Gökdeniz Gülmez.