AI Benchmarks for Code

16d

Kimi K2.7-Code cuts thinking tokens 30% — but practitioners say the benchmarks don't check out

Kimi K2.7-Code claims 30% fewer thinking tokens and a drop-in API swap path, but independent benchmarks show kernel ...

19d

AI Coding Agents Write 180% More Code But Ship Only 30% More Software

AI coding agents boost code output by 180% but shipping rises only 30%, MIT finds. Why private data access beats benchmark ...

Tech Times

Autonomous AI Coding Clears 60,000-Line Ceiling: MirrorCode Benchmark Released

AI coding benchmark MirrorCode published its full results June 26, showing Claude Opus 4.7 autonomously rebuilt a 60,000-line interpreter and scored 56% overall — completing tasks that take human ...

3d on MSN

Top AI models might be confident—doesn’t mean they’re right

“Mostly right is the wrong bar,” Pearl CEO Andy Kurtzig says, as research tests top AI models against professional judgment.

SD Times

Beyond Benchmarks: Measuring the True Cost of AI-Generated Code

Value stream management involves people in the organization to examine workflows and other processes to ensure they are deriving the maximum value from their efforts while eliminating waste — of ...

Xiaomi's HarnessX rewrites its own AI scaffolding mid-task — and smaller models gain the most

Xiaomi's HarnessX autonomously rewrites AI agent harnesses mid-execution, delivering +14.5% avg performance gains — and +44% ...

Hosted on MSN

What AI coding benchmarks still miss about software quality

Most AI coding benchmarks still ask the question: did the agent produce code that passes the current tests? This is a useful question, but it is too narrow. Software development is iterative.

1mon

Microsoft’s multi-agent AI system tops Anthropic’s Mythos on cybersecurity benchmark

Microsoft's new vulnerability-scanning system, codenamed MDASH, scored 88.45% on the CyberGym benchmark, surpassing single-model systems from Anthropic and OpenAI by using more than 100 specialized AI ...

How One AI Tool Is Writing 65% Of Anthropic's Own Code

Anthropic reports 65% of its product team's code is AI-generated by Claude, a statistic often misinterpreted as broad ...

InfoWorld

Why benchmarks are key to AI progress

Researchers are racing to develop more challenging, interpretable, and fair assessments of AI models that reflect real-world use cases. The stakes are high. Benchmarks are often reduced to leaderboard ...

Memeburn

China's AI Now Matches Anthropic Mythos in Cybersecurity

Two Chinese AI tools now match Anthropic's Mythos in cybersecurity vulnerability detection. Both are freely available, ...
