LLM Benchmarks for Developers
Compare AI coding assistants by language and task type. Real performance data: speed, cost, accuracy, and context window. Test Python, TypeScript, Go, Rust, and more. Free and open source.
Benchmark Results
| Model | Lang | Task | Context | Speed (tok/s) | Cost ($/1M) | Acc (%) | Qual (%) |
|---|---|---|---|---|---|---|---|
| claude-opus-4-6 4.6 | Python | Code Gen | 200K | 0 | 0.00 | 0 | 0 |
| claude-opus-4-6 4.6 | TypeScript | Debug | 200K | 0 | 0.00 | 0 | 0 |
| claude-sonnet-4-5 4.5 | Go | Code Gen | 200K | 0 | 0.00 | 0 | 0 |
| claude-sonnet-4-5 4.5 | Rust | Refactor | 200K | 0 | 0.00 | 0 | 0 |
| claude-haiku-4-5 4.5 | Java | Test | 200K | 0 | 0.00 | 0 | 0 |
| claude-haiku-4-5 4.5 | C# | Code Gen | 200K | 0 | 0.00 | 0 | 0 |
| gpt-4o 2024-11 | Python | Code Gen | 128K | 0 | 0.00 | 0 | 0 |
| gpt-4o 2024-11 | TypeScript | Debug | 128K | 0 | 0.00 | 0 | 0 |
| gpt-4o 2024-11 | Ruby | Refactor | 128K | 0 | 0.00 | 0 | 0 |
| gpt-4-turbo 2024-09 | Swift | Code Gen | 128K | 0 | 0.00 | 0 | 0 |
| gpt-4-turbo 2024-09 | Kotlin | Test | 128K | 0 | 0.00 | 0 | 0 |
| gpt-3.5-turbo 0125 | PHP | Code Gen | 16K | 0 | 0.00 | 0 | 0 |
| llama-3.3-70b 3.3 | Python | Code Gen | 128K | 0 | 0.00 | 0 | 0 |
| llama-3.3-70b 3.3 | TypeScript | Debug | 128K | 0 | 0.00 | 0 | 0 |
| llama-3.1-405b 3.1 | Go | Code Gen | 128K | 0 | 0.00 | 0 | 0 |
| llama-3.1-70b 3.1 | Rust | Refactor | 128K | 0 | 0.00 | 0 | 0 |
| deepseek-v3 V3 | Python | Code Gen | 64K | 0 | 0.00 | 0 | 0 |
| deepseek-v3 V3 | TypeScript | Debug | 64K | 0 | 0.00 | 0 | 0 |
| deepseek-coder-v2 Coder-V2 | C++ | Code Gen | 64K | 0 | 0.00 | 0 | 0 |
| deepseek-v2.5 V2.5 | Java | Test | 64K | 0 | 0.00 | 0 | 0 |
| gemini-2.0-flash 2.0 | Python | Code Gen | 1M | 0 | 0.00 | 0 | 0 |
| gemini-2.0-flash 2.0 | TypeScript | Debug | 1M | 0 | 0.00 | 0 | 0 |
| gemini-1.5-pro 1.5 | Go | Code Gen | 2M | 0 | 0.00 | 0 | 0 |
| gemini-1.5-flash 1.5 | Scala | Refactor | 1M | 0 | 0.00 | 0 | 0 |
| mistral-large 2411 | Python | Code Gen | 128K | 0 | 0.00 | 0 | 0 |
| mistral-large 2411 | TypeScript | Debug | 128K | 0 | 0.00 | 0 | 0 |
| codestral 2501 | Rust | Code Gen | 32K | 0 | 0.00 | 0 | 0 |
| qwen-2.5-coder 2.5 | Python | Code Gen | 32K | 0 | 0.00 | 0 | 0 |
| qwen-2.5-coder 2.5 | C# | Debug | 32K | 0 | 0.00 | 0 | 0 |
| qwen-2.5 2.5 | R | Code Gen | 32K | 0 | 0.00 | 0 | 0 |
| kimi-k1.5 K1.5 | Python | Code Gen | 200K | 0 | 0.00 | 0 | 0 |
| kimi-k1.5 K1.5 | TypeScript | Debug | 200K | 0 | 0.00 | 0 | 0 |
| kimi-k1 K1 | Java | Code Gen | 128K | 0 | 0.00 | 0 | 0 |
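To make the columns concrete, here's a rough TypeScript sketch of one result row and how you might filter and rank rows for your own language. The type and function names are illustrative only, not part of any published schema.

```typescript
// Illustrative only: this mirrors the table columns above, it is not an
// official schema.
interface BenchmarkRow {
  model: string;          // e.g. "claude-sonnet-4-5"
  lang: string;           // e.g. "Go"
  task: "Code Gen" | "Debug" | "Refactor" | "Test";
  contextTokens: number;  // context window in tokens (200K = 200_000)
  speedTokPerSec: number; // Speed (tok/s)
  costPerMTok: number;    // Cost ($ per 1M tokens)
  accuracyPct: number;    // Acc (%)
  qualityPct: number;     // Qual (%)
}

// Rank models for one language/task pair: accuracy first, quality as tiebreak.
function rankFor(
  rows: BenchmarkRow[],
  lang: string,
  task: BenchmarkRow["task"],
): BenchmarkRow[] {
  return rows
    .filter((r) => r.lang === lang && r.task === task)
    .sort((a, b) => b.accuracyPct - a.accuracyPct || b.qualityPct - a.qualityPct);
}
```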
Why Zygur
Stop guessing which LLM to use. Get real data on which models perform best for your actual coding tasks.
Daily Benchmarks
We test Claude, GPT-4, and Gemini every day on real coding tasks. See which model performs best across frontend, backend, database, and DevOps tasks. Fresh data, daily.
Real Code, Real Data
No synthetic benchmarks. We test on actual coding tasks developers face every day. React components, API endpoints, database queries, Docker configs. The stuff you actually build.
Multiple Dimensions
Speed isn't everything. We measure speed, cost, accuracy, and code quality. Because a fast model that writes buggy code isn't actually fast. Get the full picture.
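As a rough illustration of how those four dimensions can fold into one number (the weights below are made-up placeholders, not the formula behind the published scores):

```typescript
// Hypothetical composite score. The weights are placeholders for
// illustration only, not the weights behind the published numbers.
interface Metrics {
  speedTokPerSec: number;
  costPerMTok: number;
  accuracyPct: number;
  qualityPct: number;
}

function compositeScore(m: Metrics): number {
  const speedScore = Math.min(m.speedTokPerSec / 200, 1) * 100; // cap credit at 200 tok/s
  const costScore = Math.max(0, 100 - m.costPerMTok * 2);       // cheaper is better
  // Accuracy and quality carry most of the weight: fast, cheap, buggy code still loses.
  return 0.15 * speedScore + 0.15 * costScore + 0.4 * m.accuracyPct + 0.3 * m.qualityPct;
}
```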
Open Source CLI
Don't trust our tests? Run your own. Our CLI tool is free and open source. Test with your own prompts, your own tasks, your own standards. Full transparency.
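For example, a custom task you plug into your own run could look roughly like this. The shape is an assumption for illustration, not the actual llm-bench config or plugin format:

```typescript
// Hypothetical custom task definition. The shape is an assumption for
// illustration, not the actual llm-bench config or plugin format.
interface CustomTask {
  name: string;
  lang: string;
  prompt: string;
  // Return true if the model's output meets your own standard.
  check: (modelOutput: string) => boolean;
}

const expressHealthcheck: CustomTask = {
  name: "express-healthcheck",
  lang: "TypeScript",
  prompt: "Write an Express GET /healthz route that returns { status: 'ok' } as JSON.",
  check: (out) => out.includes("/healthz") && out.includes("res.json"),
};
```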
Built by Vibe Coders
We're not CS PhDs. We're developers who learned to code with AI. We test what matters to us, and developers like us. Practical benchmarks, not academic exercises.
No BS Methodology
Our testing methodology is public. See exactly how we score models, what prompts we use, and how we measure quality. No hidden criteria, no bias, just data.
CLI Tools for AI Developers
Stack-specific benchmarks (React ≠ Go ≠ Python). Free tool shows which models win for YOUR languages. Paid router uses them automatically.
llm-bench
Stack-Specific LLM Benchmarks
Free, open source CLI tool testing Claude, GPT-4, Gemini, DeepSeek, Llama, Qwen, and other top models. By language. By framework. We test React, Go, and Python separately, because you don't code in 'general'.
npm install -g llm-bench
llm-router
Stack-Aware Model Routing for Code
Routes React prompts to Claude (87 score), Go prompts to GPT-4 (90 score), Python to DeepSeek (88 score). Optimize for speed, cost, or quality with one parameter. Backed by stack-specific benchmark data. CODE ONLY - not chat, not images, just code.
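A minimal sketch of what stack-aware routing could look like under the hood. The three winning scores come from the example above (Claude 87 on React, GPT-4 90 on Go, DeepSeek 88 on Python); everything else, including the function itself, is illustrative filler, not the shipped router:

```typescript
// Illustrative routing table. Only the three winning scores come from the
// example above; the other numbers are placeholder filler, not benchmark data.
const scores = {
  react:  { claude: 87, "gpt-4": 84, deepseek: 80 },
  go:     { claude: 85, "gpt-4": 90, deepseek: 82 },
  python: { claude: 86, "gpt-4": 85, deepseek: 88 },
};

type Stack = keyof typeof scores;

// Pick the highest-scoring model for the detected stack. In the real router,
// a speed/cost/quality priority parameter would select a different score table.
function routeModel(stack: Stack): string {
  const ranked = Object.entries(scores[stack]).sort((a, b) => b[1] - a[1]);
  return ranked[0][0];
}

// routeModel("go") === "gpt-4"
```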
Coming Q2 2026
Why We'll Win
1. Stack-specific testing. Every other benchmark tests "coding" as one thing. We test React, Go, Python, Rust, and 20+ stacks separately. Because you don't code in "general" - you code in specific languages.
2. Fully open source. Most benchmark platforms are closed black boxes. Ours isn't. Verify methodology, run your own tests, contribute improvements. Trust through transparency.
Building in Public
Follow along as we ship daily LLM benchmarks and build CLI tools for AI developers. Built by a vibe coder, for vibe coders.