Meituan LongCat Open Sources VitaBench 2.0

Olivia Johnson
Jun 26

Meituan LongCat has released VitaBench 2.0, a benchmark that tests whether AI agents can track realistic user preferences as they change over months and years.

The dataset draws on 56 simulated users. It covers 819 complex tasks, more than 2,000 dynamic preferences, and 66 executable tools. On average, each user profile contains 2,093 interaction events spread across 1,580 days.
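To put those averages in perspective, here is a quick back-of-the-envelope calculation in Python. It uses only the figures quoted above. The overall event count is an estimate derived from the rounded per-user average, not a number reported by LongCat.

```python
# Averages quoted in the article (per simulated user).
avg_events_per_user = 2093
avg_days_per_user = 1580
num_users = 56

# How dense is a typical history, and how long does it run?
events_per_day = avg_events_per_user / avg_days_per_user
span_years = avg_days_per_user / 365.25

# Rough estimate only: average events per user times number of users.
approx_total_events = avg_events_per_user * num_users

print(f"{events_per_day:.2f} events per day")
print(f"{span_years:.1f} years per profile")
print(f"about {approx_total_events:,} events overall")
```

In other words, a typical profile logs a little more than one event per day over roughly four years.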

Dataset Focuses on Years of Changing Preferences

VitaBench 2.0 differs from existing benchmarks because it records how preferences evolve. Most current tests use short sessions that last minutes or hours. This benchmark forces agents to keep consistent records across years of data.

Each user profile contains repeated purchases, schedule changes, and shifting priorities. The tasks require agents to notice when a user stops liking a service or starts preferring new options. Models must update memory without explicit instructions each time.
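One way to picture implicit preference updates is a recency-weighted view of the event log. The Python sketch below is an illustration only: the `Event` class, the `current_preferences` function, and the sample history are hypothetical and are not part of VitaBench 2.0. It assumes the current preference is the most common choice among a user's last few events in a category, with ties going to the most recent choice.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Event:
    """One logged interaction (hypothetical schema)."""
    day: date
    category: str
    choice: str


def current_preferences(events, window=3):
    """Infer a preference per category from the `window` most recent events."""
    by_category = defaultdict(list)
    for ev in sorted(events, key=lambda e: e.day):
        by_category[ev.category].append(ev.choice)

    prefs = {}
    for category, choices in by_category.items():
        recent = choices[-window:]
        counts = Counter(recent)
        top = max(counts.values())
        # On a tie, prefer whichever tied choice was seen most recently.
        prefs[category] = next(c for c in reversed(recent) if counts[c] == top)
    return prefs


# Made-up history: the user drifts from lattes to flat whites.
history = [
    Event(date(2020, 9, 3), "coffee", "latte"),
    Event(date(2021, 1, 5), "coffee", "latte"),
    Event(date(2021, 6, 2), "coffee", "latte"),
    Event(date(2022, 1, 9), "coffee", "latte"),
    Event(date(2023, 3, 4), "coffee", "flat white"),
    Event(date(2023, 5, 7), "coffee", "flat white"),
    Event(date(2023, 9, 1), "coffee", "flat white"),
]

prefs = current_preferences(history)
all_time = current_preferences(history, window=len(history))
print(prefs, all_time)
```

With the default window the sketch reports the newer choice, while counting the whole history still favors the old one. That gap is the kind of drift the tasks are described as probing.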

Top Models Still Struggle on Core Tasks

Claude-Opus-4.6 reached an average score just above 0.5 when given full context in open-book mode. Other leading models scored lower. Scores dropped further when agents needed to ask clarifying questions instead of acting on past data alone.

The results show that the step-by-step "thinking" modes in current models do not reliably improve performance on personalized tasks. Some models produced worse answers when forced to reason step by step because they over-generalized from early interactions.

Proactive Questioning Exposes Weaknesses

Every tested model lost significant points on tasks that required asking the user for new information. Agents that relied only on stored history often chose outdated options or missed recent constraints.

This pattern appeared across different task types. Agents performed better when all relevant facts stayed inside the provided context window. Performance collapsed once the needed fact required fresh input from the user.
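The ask-or-act decision can be sketched as a simple staleness check. This is a hypothetical heuristic for illustration, not a rule taken from the benchmark: `should_ask`, the memory layout, and the 180-day threshold are all assumptions.

```python
from datetime import date, timedelta


def should_ask(memory, key, today, max_age_days=180):
    """Return True when the agent should ask the user instead of acting.

    `memory` maps a preference key to a (value, last_confirmed_date) pair.
    The agent asks if the fact is missing or was last confirmed too long ago.
    """
    entry = memory.get(key)
    if entry is None:
        return True  # nothing stored: only the user can supply this fact
    _value, last_confirmed = entry
    return (today - last_confirmed) > timedelta(days=max_age_days)


# Made-up memory contents for the example.
memory = {"cuisine": ("sichuan", date(2024, 1, 10))}

fresh = should_ask(memory, "cuisine", today=date(2024, 3, 1))
stale = should_ask(memory, "cuisine", today=date(2025, 1, 1))
missing = should_ask(memory, "budget", today=date(2024, 3, 1))
print(fresh, stale, missing)
```

An agent that never runs a check like this will act on whatever it has stored, which matches the failure described above: outdated options and missed recent constraints.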

Memory Strategies Show Clear Tradeoffs

VitaBench 2.0 supports tests of different memory approaches. Long-text context learning keeps the full history inside the prompt. Structured memory strategies summarize older events and store only key facts.

The benchmark results suggest neither approach solves every problem. Full context avoids loss but hits length limits. Summarization saves space yet loses details that later become important. Teams can now measure exactly where each method fails.
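The tradeoff can be shown with a toy Python sketch. Both functions are simplified stand-ins written for this article, not the benchmark's implementations. The sketch assumes events are short "key: value" strings, that a full-context agent drops the oldest events once a character budget is exceeded, and that a structured memory keeps only the latest value per key.

```python
def full_context(events, budget):
    """Keep the newest events that fit within `budget` characters."""
    kept, used = [], 0
    for text in reversed(events):
        if used + len(text) > budget:
            break  # length limit reached: everything older is dropped
        kept.append(text)
        used += len(text)
    return list(reversed(kept))


def structured_memory(events):
    """Keep only the latest value for each key."""
    memory = {}
    for text in events:
        key, _, value = text.partition(": ")
        memory[key] = value  # overwrites any earlier value for this key
    return memory


# Made-up event log, oldest first.
events = ["coffee: latte", "gym: mornings", "coffee: flat white"]

prompt_events = full_context(events, budget=35)
summary = structured_memory(events)
print(prompt_events)
print(summary)
```

Under the tight budget, the full-context version loses the oldest event entirely. The structured version fits easily but no longer records that the user once preferred lattes, a detail that could matter for a later task.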

Open Release Allows Direct Comparison

LongCat made the full dataset and evaluation code public. Researchers can run the same tasks on new models without recreating the user histories. The release includes scripts that measure both final accuracy and the cost of different memory strategies.

Early adopters have already started reporting their own scores. These reports show that small changes in memory update rules can shift results by more than ten points on certain task groups.

Current Limits Point to Needed Improvements

The benchmark still leaves several questions open. It uses simulated users rather than live deployments, so real-world noise may change the rankings. The set of 66 tools is also small compared with production environments that contain hundreds of services.

Teams that want agents to handle long projects should watch how future models close the gap on proactive questioning. Scores above 0.7 on those tasks would mark a clearer advance than another small gain on static recall.

What to Watch Next

Watch for new model releases that publish VitaBench 2.0 scores within the next three months. Compare how each update affects the gap between open-book and closed-book results.

Pay attention to memory architecture papers that use this benchmark as the main evaluation. The first approach that pushes proactive questioning scores past 0.6 will likely set the direction for the next round of agent products.

Check whether commercial agent platforms begin reporting their own VitaBench numbers. Consistent public reporting will show which teams treat long-term consistency as a core requirement rather than a future feature.
