Popular AI model performance benchmark may be flawed, Meta researchers warn



A popular benchmark for measuring the performance of artificial intelligence models could be flawed, a group of Meta Platforms researchers warned, raising fresh questions about the reliability of evaluations of major AI systems.
“We’ve identified multiple loopholes with SWE-bench Verified,” wrote Jacob Kahn, manager at Meta AI research lab Fair, in a post last week on the developer platform GitHub.
The post from Fair, which stands for Fundamental AI Research, found several prominent AI models – including Anthropic’s Claude and Alibaba Cloud’s Qwen – had “cheated” on SWE-bench Verified. Alibaba Cloud is the AI and cloud computing services unit of Alibaba Group Holding, owner of the South China Morning Post.
OpenAI-backed SWE-bench Verified, a human-validated subset of the large language model benchmark SWE-bench, evaluates AI models based on how these systems fix hundreds of real-world software issues collected from GitHub, a Microsoft subsidiary.

Fair’s post, however, claimed that models evaluated using SWE-bench Verified directly searched for known solutions shared elsewhere on the GitHub platform and passed them off as their own, instead of using their built-in coding capabilities to fix the issues.

The AI models found to have shown such behaviour included Anthropic’s Claude 4 Sonnet, Z.ai’s GLM-4.5 and Alibaba Cloud’s Qwen3-Coder-30B-A3B – with official scores of 70.4 per cent, 64.2 per cent and 51.6 per cent, respectively, on SWE-bench Verified.

“We’re still assessing [the] broader impact on evaluations and understanding trajectories for sources of leakage,” Kahn wrote.
