Xiaohongshu Releases SWE-Bench Mobile: When AI Agents Face Codebases of Apps with Hundreds of Millions of Users, the Highest Pass Rate is Only 12%?

SWE-Bench Mobile

The Xiaohongshu team has released a new benchmark, SWE-Bench Mobile, specifically designed to evaluate the performance of AI Agents on real-world mobile application codebases. The results are thought-provoking: even the top AI Agents have a maximum pass rate of only 12% when facing the codebase of an App with hundreds of millions of users.

Test Scenario

What is SWE-Bench Mobile?

Benchmark Introduction

SWE-Bench Mobile is a code repair benchmark for mobile application development. It includes real-world mobile application bug fix tasks, requiring AI Agents to:

Understand complex mobile application code structures
Locate the root cause of problems
Generate correct fix code
Ensure that fixes do not introduce new problems

Test Results

In the test, the performance of several mainstream AI Agents was as follows:

Best Performance: 12% pass rate
Average Level: 5-8% pass rate
Some Models: Close to 0% pass rate

This result is far lower than the performance on the traditional SWE-Bench.

Why is it so difficult?

Challenge Analysis

The special nature of mobile application codebases brings additional challenges:

Multi-Platform Adaptation: Need to consider both iOS and Android platforms simultaneously
Complex Dependencies: Mobile applications have high coupling between modules
Performance Constraints: Mobile devices have limited resources, requiring high code optimization
Complex UI Logic: Interface interaction code is difficult to analyze statically

Comparison with Traditional Benchmarks

Comparison Analysis

Compared to the traditional SWE-Bench, the difficulty of the Mobile version is significantly increased:

Larger codebase size
More complex business logic
More difficult to pass test cases
Higher context window requirements

Industry Significance

This benchmark reveals the limitations of AI Agents in real-world industrial scenarios. Although AI has made rapid progress in code generation, it still has a long way to go in handling large, complex real-world projects.

Future Outlook

The release of SWE-Bench Mobile provides an important benchmark for the development of AI programming tools. It reminds us that:

AI-assisted programming still requires human supervision
Complex projects require more intelligent context understanding
There is still a lot of room for improvement in model capabilities

Resource Links

Resources

Paper: https://arxiv.org/abs/xxxxx
GitHub: https://github.com/xiaohongshu/swe-bench-mobile

Xiaohongshu Releases SWE-Bench Mobile: When AI Agents Face Codebases of Apps with Hundreds of Millions of Users, the Highest Pass Rate is Only 12%?

Xiaohongshu Releases SWE-Bench Mobile: When AI Agents Face Codebases of Apps with Hundreds of Millions of Users, the Highest Pass Rate is Only 12%?

What is SWE-Bench Mobile?

Test Results

Why is it so difficult?

Comparison with Traditional Benchmarks

Industry Significance

Future Outlook

Resource Links

You Might Also Like

Claude Code Buddy Modification Guide: How to Obtain Shiny Legendary Pets

Obsidian Launches Defuddle, Taking Obsidian Web Clipper to New Heights

OpenAI Suddenly Announces 'All-in-One': Browser + Programming + ChatGPT Merge, Internally Admits Mistakes Over the Past Year

2026, No More Forcing Myself to be 'Disciplined'! Do These 8 Simple Things, and Health Will Naturally Follow

Moms Who Work Hard to Lose Weight but Can't, Definitely Fall Here

AI Browser 24-Hour Stable Operation Guide