Popular AI model performance benchmark may be flawed, Meta researchers warn



A popular benchmark for measuring the performance of artificial intelligence models could be flawed, a group of Meta Platforms researchers warned, raising fresh questions about the reliability of evaluations of major AI systems.
“We’ve identified multiple loopholes with SWE-bench Verified,” wrote Jacob Kahn, manager at Meta AI research lab Fair, in a post last week on the developer platform GitHub.
The post from Fair, which stands for Fundamental AI Research, said several prominent AI models – including Anthropic’s Claude and Alibaba Cloud’s Qwen – had “cheated” on SWE-bench Verified. Alibaba Cloud is the AI and cloud computing services unit of Alibaba Group Holding, owner of the South China Morning Post.
OpenAI-backed SWE-bench Verified, a human-validated subset of the large language model benchmark SWE-bench, evaluates AI models on how well they fix hundreds of real-world software issues collected from GitHub, a Microsoft subsidiary.

Fair’s post, however, claimed that models evaluated using SWE-bench Verified directly searched for known solutions shared elsewhere on the GitHub platform and passed them off as their own, instead of using their built-in coding capabilities to fix the issues.

The AI models found to have shown such behaviour included Anthropic’s Claude 4 Sonnet, Z.ai’s GLM-4.5 and Alibaba Cloud’s Qwen3-Coder-30B-A3B – with official scores of 70.4 per cent, 64.2 per cent and 51.6 per cent, respectively, on SWE-bench Verified.

“We’re still assessing [the] broader impact on evaluations and understanding trajectories for sources of leakage,” Kahn wrote.
