
HappyHorse Reached #1 on Blind Video AI Tests - Then Alibaba Said It Built It

On April 7, 2026, a video AI model with no company attribution appeared on the Artificial Analysis Video Arena and began winning. It would turn out to be HappyHorse - Alibaba's video generation model - but for three days, nobody in the developer community knew who had built it. Within 72 hours, it had climbed to the top of both the text-to-video and image-to-video blind comparison leaderboards.

Alibaba confirmed authorship on April 9. HappyHorse 1.0, built by the company's Taotian Future Life Lab and its ATH Innovation Unit, had achieved Elo scores of 1367 in text-to-video and 1401 in image-to-video - beating the nearest competitor, ByteDance's Seedance 2.0, by 96 to 107 points. In head-to-head blind comparisons, HappyHorse reportedly won roughly 65% of the time.

The Alibaba HappyHorse AI video model represents something the video generation market hasn't seen before: a major tech company using anonymity to let benchmark results build credibility before any official announcement. That strategy generated substantially more media coverage than a standard product launch would have. It also raised a question the coverage mostly skipped past - beating every current competitor on a blind benchmark and converting that into market share are two very different things.

On April 27, fal - the AI infrastructure platform used by developers to access foundation models - launched official API access for HappyHorse-1.0. Three weeks after an anonymous appearance, the model that topped every video AI leaderboard was available via four API endpoints.

What Happened

The story behind HappyHorse starts with Zhang Di. A veteran AI engineer with over 15 years of experience, Zhang led the team that built Kling AI at Kuaishou - one of China's most-used AI video generation tools in 2024 and 2025. In late 2025, Zhang moved to Alibaba to lead the Taotian Future Life Lab, an R&D unit under Alibaba's ATH Innovation Unit. The HappyHorse team he brought with him had already built one top-ranked video model at a different company. The question was whether they could do it again with more resources.

The answer came through the Artificial Analysis Video Arena - the AI industry's primary platform for blind video model comparisons. The Arena uses a methodology borrowed from competitive chess: human users are shown two video clips generated from the same prompt by different models, without any indication of which model produced which. They pick the better clip. Thousands of comparisons aggregate into Elo ratings. Because voters cannot see the model names, results cannot be influenced by brand reputation or pre-existing preference.

On April 7, HappyHorse entered the Arena without any company attribution. The model competed anonymously across both text-to-video and image-to-video categories. By April 9, it had reached the top of both leaderboards. Alibaba then publicly confirmed it had built the model. The announcement described HappyHorse as a 15-billion-parameter unified Transformer, released under an Apache 2.0 license - with full model weights listed as "coming soon."

The stealth approach has a precedent in the language model space. In 2024, a model submitted to the LMSYS Chatbot Arena under the codename "gpt2-chatbot" generated significant speculation before being identified as a GPT-4 variant. HappyHorse marks the first deliberate use of this tactic at scale for video AI, and the first time a company of Alibaba's size has used anonymity as a launch strategy for a flagship AI product.

On April 27, fal launched API access with four endpoints: text-to-video, image-to-video, reference-to-video (which inserts a subject from a reference image while preserving identity), and video editing. Python and JavaScript SDKs are available. On the same day, Alibaba Cloud's Bailian platform opened enterprise-level API testing. Full commercial availability is expected in May 2026.
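For orientation, here is roughly what calling one of those endpoints looks like through fal's Python SDK (pip install fal-client, with a FAL_KEY in the environment). The endpoint identifier and argument names below are illustrative assumptions, not confirmed values from fal's documentation:

```python
# Minimal sketch of calling a HappyHorse endpoint via fal's Python SDK.
# The endpoint ID and argument names are assumptions -- check fal's model
# page for the exact identifiers before using this in anything real.
import fal_client

result = fal_client.subscribe(
    "fal-ai/happyhorse/text-to-video",   # hypothetical endpoint ID
    arguments={
        "prompt": "A golden retriever surfing a small wave at sunset, cinematic",
        "duration": 5,                   # seconds; assumed parameter name
        "aspect_ratio": "16:9",          # assumed parameter name
    },
)

# fal responses typically include a URL for the generated asset
print(result)
```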

Why Topping Blind Rankings Is a Different Kind of Win

Not all AI benchmark results are created equal. Many published leaderboards rely on automated metrics - computational measurements of video quality, motion consistency, or text alignment that models can be specifically optimized to score well on, even if the resulting videos look poor to actual viewers. A model trained to maximize a specific metric can achieve a high leaderboard score while producing output that experienced video producers would immediately reject.

The Artificial Analysis Video Arena uses a methodology that is considerably harder to game. Human blind voting means the model needs to win on what users actually perceive as better - not on what a scoring function considers better. There is no target metric to overfit. The Elo system means that a model's score reflects its actual win rate against real competitors, calibrated by the quality of those opponents. A model that beats weak competition accumulates fewer points than one that beats strong competition.

HappyHorse's Elo scores of 1367 in text-to-video and 1401 in image-to-video represent a 96-to-107-point lead over second-place Seedance 2.0. In practical terms, a 100-point Elo gap translates to roughly a 64–65% expected win rate in direct head-to-head comparisons. Show 10 different users the same prompt outputs from HappyHorse and Seedance, and HappyHorse wins about 6 or 7 times.
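That conversion from rating gap to win rate falls directly out of the standard Elo formula; a few lines of Python make the arithmetic concrete (a sketch of the textbook formula, not the Arena's exact implementation):

```python
# Expected win probability for the higher-rated model under the standard
# Elo model: E = 1 / (1 + 10 ** (-gap / 400))
def elo_expected_win_rate(rating_gap: float) -> float:
    return 1.0 / (1.0 + 10 ** (-rating_gap / 400))

print(round(elo_expected_win_rate(96), 3))   # ~0.635  (text-to-video gap)
print(round(elo_expected_win_rate(100), 3))  # ~0.640
print(round(elo_expected_win_rate(107), 3))  # ~0.649  (image-to-video gap)
```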

The consistency of HappyHorse's performance across both T2V and I2V categories adds credibility to the result. Some video models are specialized - optimized for one task while delivering only average results on another. A model that leads across multiple input modalities suggests the underlying architecture handles diverse inputs well, rather than narrowly excelling at one use case.

The VBench leaderboard - a separate academic benchmark that evaluates video generation using automated metrics including subject consistency, motion smoothness, background coherence, and temporal stability - reportedly shows HappyHorse at an aggregate score of 84.32, described as a 7.8% improvement over the previous top-ranked model. VBench measures different things than the Video Arena, using computation rather than human preference. The fact that HappyHorse scores well on both frameworks - one human-driven, one automated - strengthens the overall signal.

There are limits to what any benchmark tells you, and those limits matter for anyone evaluating HappyHorse for actual work. The Video Arena primarily tests short clips and aesthetic preference. A model can win on visual quality and motion smoothness while underperforming on longer sequences, nuanced dialogue synchronization, or specific professional production requirements. Seedance 2.0, ByteDance's model and HappyHorse's closest benchmark rival, reportedly has stronger performance in long-form audio dialogue sync - a meaningful differentiator for use cases involving extended speech or scripted dialogue. Sora, OpenAI's model, remains the benchmark for physics simulation and narrative coherence across extended sequences. Being #1 on Artificial Analysis Video Arena is a significant data point. It is not the complete picture of what any given production workflow requires.

What the blind benchmark score does establish is that HappyHorse is genuinely competitive with the best video generation models in the world - not as a self-reported claim, but through thousands of human preference judgments made without any knowledge of which company built what.

The Architecture Behind the Rankings - One Model, One Pass

The central technical claim behind HappyHorse is joint audio-video generation in a single forward pass. Most AI video systems treat audio as a separate step: generate the video first, then add sound through a distinct model or post-processing pipeline. The two outputs are produced independently and synchronized afterward, which introduces timing artifacts, lip-sync errors, and ambient audio that doesn't match what's visually happening on screen.

HappyHorse's 15-billion-parameter architecture reportedly processes text prompts, video frames, and audio tokens as a single unified sequence within one Transformer. According to Alibaba, the model performs joint denoising across all three modalities simultaneously, eliminating the need for cross-attention modules between separate encoders. The result, the company claims, is inherently synchronized audio and video rather than two outputs aligned after the fact.
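To make the "single unified sequence" idea concrete, here is a conceptual sketch: per-modality tokens are projected to a shared width, concatenated, and pushed through one Transformer in a single forward pass. This illustrates the general pattern Alibaba describes, not HappyHorse's actual (unpublished) architecture; all dimensions are made up.

```python
# Conceptual sketch of a single-pass multimodal Transformer: text, video, and
# audio tokens share one sequence and are processed jointly, with no
# cross-attention between separate per-modality encoders. Illustrative only.
import torch
import torch.nn as nn

d_model = 512

text_proj  = nn.Linear(768, d_model)    # assumed per-modality embedding sizes
video_proj = nn.Linear(1024, d_model)
audio_proj = nn.Linear(256, d_model)

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=4,
)

# Toy inputs: (batch, tokens, features) for each modality
text_tokens  = torch.randn(1, 32, 768)
video_tokens = torch.randn(1, 1024, 1024)   # e.g. flattened spatio-temporal patches
audio_tokens = torch.randn(1, 128, 256)

# One sequence, one forward pass; in a joint-denoising setup the video and
# audio positions would be denoised together at every step.
sequence = torch.cat(
    [text_proj(text_tokens), video_proj(video_tokens), audio_proj(audio_tokens)],
    dim=1,
)
out = encoder(sequence)
print(out.shape)   # torch.Size([1, 1184, 512])
```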

The model uses DMD-2 distillation - a technique that compresses the denoising process into eight steps rather than the 50-to-100 steps used by standard diffusion models. The efficiency gain is significant: HappyHorse reportedly generates a 5-second 256p clip in approximately 2 seconds on an H100 GPU, and a 1080p version in roughly 38 seconds. Those numbers position HappyHorse as competitive on speed, not just quality.
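The reason step count matters so much is that each denoising step is a full forward pass of the model, so latency scales roughly linearly with the number of steps. The schematic loop below illustrates that relationship; it is a generic sketch, not HappyHorse's or DMD-2's actual sampler.

```python
# Schematic few-step sampler: per-clip latency is roughly proportional to the
# number of denoising steps, since each step is one forward pass.
import torch

def sample(model, latents, num_steps):
    # Evenly spaced timesteps from high noise (1.0) down toward clean (0.0)
    timesteps = torch.linspace(1.0, 0.0, num_steps + 1)[:-1]
    for t in timesteps:
        latents = model(latents, t)   # one denoising step == one forward pass
    return latents

dummy = lambda x, t: x * 0.9          # stand-in for a real denoiser network
out = sample(dummy, torch.randn(1, 4, 16, 16), num_steps=8)   # 8 passes vs. 50-100
```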

Output specifications include resolutions up to 1080p, clip durations from 3 to 15 seconds, and support for 16:9, 9:16, 1:1, and 4:3 aspect ratios. The model reportedly offers over 50 aesthetic style presets - ranging from photorealism and anime to cinematic film grading styles. Multilingual lip-sync covers seven languages: English, Mandarin, Cantonese, Japanese, Korean, German, and French, with a reported Word Error Rate of 14.60% on lip-sync evaluation. That language coverage is relevant for brands producing localized video content across multiple markets from a single production pipeline.

The Apache 2.0 license commitment signals that Alibaba intends to make HappyHorse weights publicly available for commercial use, not just through its cloud API. The timeline for the actual weights release remains open as of late April 2026.

How the HappyHorse AI Video Model Compares to Its Competition

The Alibaba HappyHorse AI video model enters a market with four serious competitors that already have established developer ecosystems and commercial deployments.

Sora (OpenAI) remains the most widely cited benchmark for complex physics simulation and long-form narrative coherence. For visual effects work or documentary-style generation where physical plausibility over extended sequences matters, Sora retains a distinct advantage. OpenAI has been cautious about scaling API access since Sora's launch, limiting developer adoption in ways that HappyHorse's fal partnership appears designed to avoid from the start.

Seedance 2.0 (ByteDance) is HappyHorse's closest benchmark competitor. The 96-to-107 Elo point gap is real, but so is ByteDance's structural advantage: TikTok and Douyin collectively reach over a billion monthly users. If ByteDance integrates Seedance into its content creation tools, it has distribution that Alibaba will need to replicate through different channels. Seedance also reportedly outperforms HappyHorse on long-form dialogue synchronization - a meaningful differentiator for scripted content and marketing videos with narration.

Kling (Kuaishou) was built by the same team now running HappyHorse - Zhang Di and the engineers who moved to Alibaba in late 2025. In some respects, HappyHorse is a direct architectural successor to Kling, built by the same people with more resources. Kling has established a strong position in social media content generation, particularly for high-volume, short-form output where generation speed and throughput matter more than peak visual quality.

Veo 3.1 (Google) targets cinematic and broadcast-quality production. Google's cloud infrastructure and enterprise relationships give Veo access to professional video production workflows that consumer-focused video AI models have difficulty penetrating. For Hollywood pre-visualization, advertising production, and broadcast content, Veo operates in a market segment where Google's existing relationships are a significant asset.

HappyHorse's specific strengths, according to available benchmark data, are motion physics, smoothness, and short-clip audio-visual synchronization. Motion dynamics scores reportedly rank substantially higher than older reference models like Wan 2.7. The joint audio-video architecture addresses one of the most consistent criticisms of existing video generation systems - audio that sounds disconnected from the visual content. For social media content creators, brand advertisers, and localized marketing campaigns, that synchronization quality is often the differentiator between output that's usable and output that needs rework. The reference-to-video endpoint adds a practical use case that pure text-to-video models lack: inserting a specific person or product into a generated scene while preserving their visual identity, which is directly relevant to advertising and branded content workflows.

The acknowledged limitation is clip length. HappyHorse generates in the 3-to-15-second range. For longer sequences, users need to chain clips manually - a workflow limitation shared by most current video generation models, but still a practical constraint for professional use cases requiring continuous footage. The benchmark lead is real. The production workflow constraints are equally real. For teams evaluating which video AI model to integrate into a production pipeline, the answer will depend on what they're actually building: HappyHorse for social-first short content with synchronized audio, Sora for long-form physics-accurate sequences, Seedance if ByteDance's distribution reach matters, Veo for enterprise broadcast contexts.
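On the clip-chaining workaround mentioned above: a common manual pattern (not something HappyHorse specifically documents) is to take the final frame of one generated clip and feed it to the image-to-video endpoint as the seed for the next segment. A rough sketch, with hypothetical endpoint IDs and argument names:

```python
# Rough sketch of manual clip chaining: extract the last frame of one clip and
# use it as the starting image for the next. Endpoint IDs and argument names
# are assumptions; the frame hand-off is the point being illustrated.
import cv2
import fal_client

def last_frame(video_path: str, out_path: str) -> str:
    cap = cv2.VideoCapture(video_path)
    frame = None
    while True:
        ok, current = cap.read()
        if not ok:
            break
        frame = current
    cap.release()
    cv2.imwrite(out_path, frame)
    return out_path

# Segment 1: text-to-video (hypothetical endpoint ID)
first = fal_client.subscribe(
    "fal-ai/happyhorse/text-to-video",
    arguments={"prompt": "A chef plating a dessert, overhead shot", "duration": 10},
)

# Download the returned video to clip_1.mp4 with your HTTP client of choice,
# then seed the next segment from its closing frame.
frame_url = fal_client.upload_file(last_frame("clip_1.mp4", "clip_1_last.jpg"))

second = fal_client.subscribe(
    "fal-ai/happyhorse/image-to-video",
    arguments={
        "image_url": frame_url,
        "prompt": "The dessert is carried to the table",
        "duration": 10,
    },
)
```

Continuity across segments still depends on prompt discipline and some luck with motion; the hand-off keeps the starting composition consistent, not the camera move or lighting.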

What Comes Next

Alibaba has committed to releasing full HappyHorse weights under an Apache 2.0 license. The GitHub repository was listed as "coming soon" as of late April 2026. If the open-source release follows the trajectory of Alibaba's Qwen language model series - where weights were released and the models quickly became among the most widely adopted open-source options globally - HappyHorse could establish a strong developer ecosystem presence independent of its commercial API. Open weights mean researchers, startups, and companies running on-premise infrastructure can build on HappyHorse without going through Alibaba Cloud billing.

Full commercial API access through Alibaba Cloud Bailian is expected in May 2026, with an early access discount for enterprise customers who began testing in late April. The fal partnership gives the model immediate developer-facing API availability in parallel with the Alibaba Cloud rollout.

The competitive pressure on all sides is intense. Sora, Kling, Seedance, and HappyHorse have all launched or significantly upgraded within a 12-month window. The model ranked #1 on any given benchmark will likely not hold that position indefinitely - the pattern in both language and video AI is continuous rapid iteration that reshuffles leaderboards on a quarterly basis. What matters more than any single benchmark snapshot is the pace of iteration and the depth of developer ecosystem built during a model's period of leadership.

The stealth launch strategy worked as an attention mechanism. Coverage from Bloomberg, CNBC, The Information, and dozens of AI-focused publications created awareness for HappyHorse in the North American developer market, where Alibaba's AI work has historically received less attention than its e-commerce and cloud businesses. That attention is a starting point, not a finish line. Converting developer curiosity into sustained API usage, and API usage into embedded production workflows, depends on factors that benchmark scores don't capture: pricing relative to competitors, API reliability, documentation quality, and the speed with which the open weights actually land.

Alibaba has shown competitive pricing and high-quality weights releases with its Qwen model series. If HappyHorse follows the same playbook, the anonymous April launch may turn out to be the opening move in a longer campaign to position Alibaba's video AI capability as the default choice for developers building video generation applications.

Benchmark leaderboards in AI move fast. The model at #1 this month may not be #1 in three months, and the model that's most widely deployed is often not the one that scores highest on any given test. What makes HappyHorse worth watching isn't just the Elo score - it's that Alibaba deployed a team that already built a market-leading video model once, then used an unconventional launch to let the work speak before the company name was attached.

If you're building a working knowledge of the AI video generation space - tracking benchmark updates, API releases, and the competitive dynamics between Sora, Kling, Seedance, and HappyHorse across time - remio's info capture automatically saves the pages you read into a searchable personal knowledge base. When you need to compare models or brief a team three months from now, you're pulling from your own collected research rather than re-reading the same articles from scratch.

