Video Coding Benchmarks

How Anthropic’s Fable 5 Beat ChatGPT 5.5 by 20% in Coding Benchmarks

Anthropic has launched Claude Fable 5, a Mythos-class AI model that outperforms GPT 5.5 in coding and vision tasks despite ...

16d

DeepSWE blows up the AI coding leaderboard, crowns GPT-5.5, and finds Claude Opus exploiting a benchmark loophole

DeepSWE puts GPT-5.5 atop the AI coding leaderboard while raising new questions about Claude Opus, SWE-Bench Pro, and ...

VentureBeat

Self-invoking code benchmarks help you decide which LLMs to use for your programming tasks

As large language models (LLMs) continue to improve at coding, the benchmarks used to evaluate their performance are steadily becoming less useful. That's because though many LLMs have similar high ...

Tech Times

MiniMax M3 Open-Weight Coding Model: Frontier Claims, Unverified Benchmarks

MiniMax M3 launched June 1, 2026 with a 1-million-token context window and company-reported SWE-Bench Pro scores that edge ...

InfoWorld

Why benchmarks are key to AI progress

Researchers are racing to develop more challenging, interpretable, and fair assessments of AI models that reflect real-world use cases. The stakes are high. Benchmarks are often reduced to leaderboard ...

Geeky Gadgets

DeepSWE AI Coding Model Benchmark Finally Solves AI Training Data Contamination

DeepSWE, created by DataCurve offers a benchmark for assessing AI coding models by focusing on real-world programming challenges rather than synthetic test cases. According to Matthew Berman, one of ...

Hosted on MSN

What AI coding benchmarks still miss about software quality

Most AI coding benchmarks still ask the question: did the agent produce code that passes the current tests? This is a useful question, but it is too narrow. Software development is iterative.

GIGAZINE

DeepSWE is a benchmark that prevents cheating using coding AI and allows for more accurate measurement of programming performance.

In recent years, it has become common for developers to use coding AI in software development, and various benchmarks exist to measure the performance of coding AI. Now, a new benchmark called ...

InfoQ

Dynamic Languages Faster and Cheaper in 13-Language Claude Code Benchmark

Unlock the full InfoQ experience by logging in! Stay updated with your favorite authors and topics, engage with content, and download exclusive resources. Birgitta Böckeler, Distinguished Engineer at ...

Results that may be inaccessible to you are currently showing.

Hide inaccessible results