How AI Scraping Is Killing Wikipedia's Infrastructure: The Crisis Behind the Paywall

Understanding the Crisis: AI Scraping and Its Impact

What Is Happening to Wikipedia? AI Bots Are Overwhelming the Platform

Wikipedia faces an unprecedented crisis. The world's largest free knowledge repository is being drained by automated systems consuming server resources at scales never anticipated. In November 2025, the Wikimedia Foundation demanded that AI companies stop treating the platform as free data and start paying through the official Wikimedia Enterprise API.

The numbers are stark. Between May and June 2025, AI bots that evaded detection by posing as human readers consumed large amounts of bandwidth. After upgrading its detection systems, the Wikimedia Foundation found that AI crawlers account for 65% of the most resource-intensive traffic, yet represent only 35% of page views. Meanwhile, human traffic declined 8% year-over-year.
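The imbalance between those two shares can be made concrete with a little arithmetic. The sketch below uses only the 65% and 35% figures quoted above, and assumes (this is not stated in the source) that the remaining 35% of expensive traffic and 65% of page views belong to humans. The function name is illustrative.

```python
def relative_cost_per_view(cost_share: float, view_share: float) -> float:
    """Share of expensive traffic generated per unit share of page views."""
    return cost_share / view_share


# Figures from the article: bots cause 65% of the costliest traffic
# while making up 35% of page views. The human shares are assumed to
# be the complements (35% and 65%).
bot = relative_cost_per_view(0.65, 0.35)
human = relative_cost_per_view(0.35, 0.65)
ratio = bot / human

print(f"bot cost per view:   {bot:.2f}")
print(f"human cost per view: {human:.2f}")
print(f"a bot page view is about {ratio:.1f}x as expensive as a human one")
```

Under that assumption, an average bot page view is roughly 3.4 times as costly to serve as an average human page view.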

Why AI Companies Are Scraping Wikipedia

Large language models require enormous quantities of high-quality training data. Wikipedia's content is uniquely valuable: every article is edited for accuracy, every claim has a source, every entry follows strict neutrality principles. In contrast, most internet content contains garbage, bias, and misinformation.

OpenAI, Google, Meta, and other companies have systematically scraped Wikipedia to train their models. The logic is simple: why pay for curated data when you can extract it free from the world's largest encyclopedia?

This exposes a mismatch between how the web was built and how AI systems use it. Wikipedia was designed for human readers, not for industrial-scale data harvesting.

The Technical Crisis: How AI Bots Evade Detection

AI Bots Disguising Themselves as Humans

In May and June 2025, the Wikimedia Foundation discovered something alarming: an unusual surge in data requests that did not fit normal traffic patterns. Investigation revealed that AI bots were evading detection on Wikipedia by masquerading as human users.

These bots change IP addresses, rotate user-agent strings, and employ deceptive techniques to bypass security systems. The Wikimedia Foundation invested substantial engineering resources simply to identify disguised crawlers and implement new detection algorithms.
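To see why rotation defeats simple defenses, consider a per-address rate limit. The sketch below is not the Wikimedia Foundation's detection method, which the article does not describe. It is a toy illustration with synthetic log entries, a hypothetical threshold, and hypothetical function names. It shows a crawler that stays under the limit on every individual IP address yet stands out once requests are grouped by /24 network.

```python
from collections import Counter
from ipaddress import ip_network

PER_KEY_LIMIT = 100  # hypothetical requests-per-window threshold


def flagged(requests, key):
    """Return the keys whose request count exceeds the threshold."""
    counts = Counter(key(r) for r in requests)
    return {k for k, n in counts.items() if n > PER_KEY_LIMIT}


def by_ip(request):
    return request["ip"]


def by_subnet(request):
    # Collapse each address to its /24 network, e.g. 203.0.113.0/24.
    return str(ip_network(request["ip"] + "/24", strict=False))


# Synthetic log: one crawler spreads 2,000 requests over 50 addresses
# and 7 user-agent strings; one human reader makes 12 requests.
crawler = [
    {"ip": f"203.0.113.{i % 50 + 1}", "ua": f"Browser/{i % 7}"}
    for i in range(2000)
]
human = [{"ip": "198.51.100.7", "ua": "Browser/1"} for _ in range(12)]
log = crawler + human

print("flagged per IP:    ", flagged(log, by_ip))
print("flagged per subnet:", flagged(log, by_subnet))
```

Real evasion is harder to catch than this: a crawler that rotates across unrelated networks would not cluster in a single subnet, which is presumably part of why detection required substantial engineering effort.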

The scraping is not only large in scale but also deceptive: companies are actively hiding their data collection activities.

The Bandwidth Problem: 50% Growth

Since January 2024, Wikipedia's bandwidth for multimedia content has grown 50%. This entire increase comes from AI crawler traffic, not humans.

Why does this matter? Not all requests cost the same to serve. Popular content is cached in regional data centers worldwide, and serving it from cache is cheap. Rarely requested content has to be fetched from the core data centers, which is far more expensive.

Human readers are selective. They visit popular articles. AI crawlers are indiscriminate. They perform "bulk reading," accessing millions of pages including obscure entries few humans ever visit. This forces servers to constantly retrieve data from expensive core infrastructure. The Wikimedia Foundation explains: "While human readers typically focus on particular topics, bots often perform bulk reading, accessing large quantities of pages including those rarely visited by humans. This means these requests are more likely to be routed to core data centers, greatly increasing resource consumption."
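The difference between selective and bulk reading can be illustrated with a small cache simulation. This is a toy model, not a description of Wikipedia's actual caching layer. The page counts, cache size, the 1/rank popularity curve for human readers, and the LRU eviction policy are all assumptions chosen for illustration.

```python
import random
from collections import OrderedDict


def hit_rate(requests, cache_size):
    """Fraction of requests served from an LRU cache of the given size."""
    cache = OrderedDict()
    hits = 0
    for page in requests:
        if page in cache:
            hits += 1
            cache.move_to_end(page)        # mark as most recently used
        else:
            cache[page] = True             # miss: fetch from the core, then cache
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict the least recently used page
    return hits / len(requests)


random.seed(0)
N_PAGES, CACHE_SIZE, N_REQUESTS = 20_000, 1_000, 20_000
pages = range(N_PAGES)

# Human readers (assumed): popularity falls off as 1/rank,
# so a small set of popular pages receives most requests.
weights = [1 / (rank + 1) for rank in pages]
human_requests = random.choices(pages, weights=weights, k=N_REQUESTS)

# Crawler (assumed): one sequential pass over every page.
crawler_requests = list(pages)

human_hit = hit_rate(human_requests, CACHE_SIZE)
crawler_hit = hit_rate(crawler_requests, CACHE_SIZE)
print(f"human cache hit rate:   {human_hit:.0%}")
print(f"crawler cache hit rate: {crawler_hit:.0%}")
```

In this model the crawler never hits the cache, because it requests each page exactly once, so every one of its requests falls through to the expensive tier. The human workload keeps returning to pages that are already cached.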

Some crawlers even attempt to access Wikipedia's internal systems—code review platforms and bug tracking databases. This creates wasted bandwidth and potential security risks.

The Human Cost: Declining Page Views and Threatened Sustainability

Wikipedia's Traffic Crisis

Here is the uncomfortable reality: the decline in human page views on Wikipedia is accelerating. Human visits dropped 8% year-over-year in 2025.

Why? The Wikimedia Foundation's chief executive explained: "We believe this reflects how generative AI and social media are affecting the way people search for information, particularly as search engines increasingly use generative AI to provide answers directly to searchers."

Google's AI summaries answer questions directly in search results. Users no longer need to click through to Wikipedia. Younger users have migrated to social media for information.

The Funding Crisis

Wikipedia runs on donations and relies on volunteer editors. The Wikimedia Foundation warned: "With fewer visits to Wikipedia, fewer volunteers may grow and enrich the content, and fewer individual donors may support this work."

This is a concrete threat. When traffic declines, potential donors become less aware of Wikipedia's existence. Volunteer editors lose motivation when they see declining engagement. Content quality deteriorates, and the decline accelerates. A death spiral is forming.

The Case Study: Charlie Kirk Shooting Exposes System Fragility

When Breaking News Overwhelms Infrastructure

On September 10, 2025, conservative activist Charlie Kirk was shot and killed at Utah Valley University. This sudden global news event illustrates exactly how AI scraping affects Wikipedia servers during critical moments.

Millions rushed to Wikipedia to learn about Kirk, Turning Point USA, and the incident. This is what Wikipedia's infrastructure was designed for. The system should have handled the surge.

But here is what happened: while humans accessed the Kirk article, AI crawlers simultaneously scraped hundreds of related articles. The baseline bandwidth demand created by persistent AI scraping meant the system had little capacity to absorb this sudden legitimate traffic surge.

The Wikimedia Foundation's analysis is revealing: "The baseline bandwidth requirement has been growing steadily since January 2024 with no signs of slowing. This growth in baseline usage means we have less headroom to handle exceptional events."

Even though Wikipedia could normally manage this human traffic during breaking news, AI bot strain stripped away the safety margin. Human and bot traffic combined overwhelmed the system.
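The headroom argument is simple arithmetic. In the sketch below, the capacity, baseline, and surge values are hypothetical units, not Wikimedia figures. Only the 50% growth rate echoes the multimedia bandwidth increase cited earlier, and it is applied to the baseline purely for illustration.

```python
def headroom(capacity: float, baseline: float) -> float:
    """Spare capacity left over once steady background load is served."""
    return capacity - baseline


CAPACITY = 100.0       # hypothetical total capacity, arbitrary units
BASELINE_2024 = 40.0   # hypothetical baseline load in January 2024
GROWTH = 0.50          # 50% growth, borrowed from the figure cited earlier
SURGE = 50.0           # hypothetical breaking-news spike from human readers

baseline_now = BASELINE_2024 * (1 + GROWTH)

for label, baseline in (("January 2024", BASELINE_2024), ("after growth", baseline_now)):
    room = headroom(CAPACITY, baseline)
    verdict = "absorbed" if SURGE <= room else "exceeds headroom"
    print(f"{label}: baseline {baseline:.0f}, headroom {room:.0f}, surge {verdict}")
```

The same surge that fits comfortably under the earlier baseline no longer fits once the baseline has grown, even though the surge itself is unchanged.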

This proved that AI scraping is not just a financial problem—it is a service reliability problem. The platform cannot guarantee its ability to serve actual users during moments when information matters most.

The Legal Question: Is It Legal for AI to Scrape?

Is it legal for AI to scrape any website? The answer is increasingly: probably not without permission or payment.

Recent court rulings are shifting the landscape. In February 2025, a federal judge ruled that AI training on copyrighted works without authorization was not fair use when the AI system competed with the original content owner. In Thomson Reuters Enterprise Centre GmbH v. Ross Intelligence Inc., the court found that copying copyrighted content to train a competing AI product violated copyright law despite fair use arguments.

The judge's reasoning was direct: the purpose was commercial, and it harmed the market for the original work. Even though Ross used content for training rather than redistribution, the court concluded this was not transformative use.

Wikipedia's content, while freely licensed under Creative Commons (CC-BY-SA), still carries attribution requirements. Many AI companies have scraped Wikipedia while failing to provide proper source attribution. This violates the license terms.
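As a rough illustration of what attribution could look like in practice, the sketch below builds a credit line containing the title, author, source, and license for an article. The function name and output format are hypothetical, and the example covers only the attribution element. It does not address the license's share-alike condition, and what counts as adequate credit depends on the license text and the context of reuse.

```python
from urllib.parse import quote

LICENSE_NAME = "CC BY-SA 4.0"
LICENSE_URL = "https://creativecommons.org/licenses/by-sa/4.0/"


def attribution_line(title: str, lang: str = "en") -> str:
    """Build a title / author / source / license credit for a Wikipedia article."""
    # Wikipedia article URLs use underscores in place of spaces.
    slug = quote(title.replace(" ", "_"))
    source_url = f"https://{lang}.wikipedia.org/wiki/{slug}"
    return (
        f'"{title}" by Wikipedia contributors, {source_url}, '
        f"licensed under {LICENSE_NAME} ({LICENSE_URL})"
    )


line = attribution_line("Alan Turing")
print(line)
```

A system that stores a line like this alongside each ingested article has the raw material to surface credit later, which is harder to reconstruct after text has been scraped without metadata.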

The Solution: Wikimedia Enterprise API and Paid Access

Wikipedia's Paid API for AI Companies

Rather than ban AI companies outright, the Wikimedia Foundation pursued monetization through the official Wikimedia Enterprise API. Wikipedia remains completely free for human readers.

The Wikimedia Enterprise platform charges AI companies for large-scale, high-volume data access, with a free tier alongside the paid one.

Free tier:

  • Limited monthly requests

  • Twice-monthly updates

  • Basic support

Paid tier:

  • Unlimited daily requests

  • Real-time or hourly updates

  • Priority support with 99% SLA

  • No hidden charges

This structure is pragmatic. Small projects and academic researchers continue using Wikipedia's data for free. Commercial AI companies and large-scale operations must upgrade to paid access. The paid tier ensures:

  1. Reliable access: Paid customers receive priority and guaranteed uptime

  2. Stable service: Predictable high-volume access without crashes

  3. Legal clarity: Official agreements reduce litigation risk

  4. Attribution support: Machine-readable license information enables proper attribution

  5. Updated data: Daily or real-time updates eliminate the need for proprietary crawlers

For AI companies, the math works. Official access costs less than maintaining scraping infrastructure. They eliminate legal risks. The cost is justified by reliability and efficiency gains.

The Fundamental Questions

Should AI Companies Pay for Training Data?

The practical answer is increasingly yes. From a legal perspective, some courts have begun to rule that unauthorized use of valuable content for AI training violates copyright law. From a business perspective, paying for data access is cheaper than maintaining scraping infrastructure.

There is also an ethical dimension. Wikipedia's 8 million volunteer editors invested millions of hours creating verified content. AI companies are using this labor to train billion-dollar models without compensation.

This is unsustainable. Volunteers lose motivation when exploited. Quality declines. The resource that AI companies depend on deteriorates.

Why Is Wikipedia Asking AI Developers to Use Its API?

The answer is survival. The Wikimedia Foundation faces a multi-dimensional crisis:

  1. Infrastructure crisis: AI scraping overwhelms servers, creating service reliability issues

  2. Financial crisis: Declining human traffic means declining donations

  3. Community crisis: Fewer visitors means fewer potential volunteers

  4. Competitive crisis: Alternative AI encyclopedias are emerging

By mandating official API access, the foundation achieves:

  • Predictable revenue: Paid subscriptions provide sustainable funding

  • Infrastructure protection: Official API is more efficient than uncontrolled scraping

  • Attribution enforcement: API responses include machine-readable license information

  • Community preservation: Sustainable funding supports volunteer infrastructure

  • Legal positioning: Official agreements reduce future litigation

FAQ: Understanding AI Scraping, Wikipedia, and the Enterprise API

Q: What exactly is AI scraping, and how does it differ from normal website visits?

A: Normal website visits are human-driven and selective. A person reads an article, clicks a few links, and leaves. AI scraping is automated and indiscriminate. A crawler downloads millions of articles, including obscure ones, in patterns no human would follow. It repeatedly pulls uncached pages from the core data centers instead of hitting cached content. It does this continuously, 24/7, consuming resources at scales humans never approach.

Q: What specific concerns did the Wikimedia Foundation raise about AI scraping?

A: The foundation raised multiple concerns:

  • Resource consumption: 65% of the most expensive bandwidth comes from bots representing only 35% of traffic

  • Infrastructure threat: System cannot guarantee reliability during major news events

  • Deceptive practices: Bots attempt to hide their identity by posing as humans

  • Financial sustainability: Declining traffic means declining donations

  • Community impact: Fewer visitors means fewer volunteers and lower-quality content

  • Attribution: AI companies using Wikipedia content without crediting human editors

Q: How does the Wikimedia Enterprise API work, and who needs to use it?

A: The Wikimedia Enterprise API provides different access tiers:

Free Tier (good for small projects, researchers, nonprofits):

  • Limited monthly requests

  • Twice-monthly updates

  • No technical support

Paid Tier (for commercial AI companies and large-scale users):

  • Unlimited requests

  • Daily updates (or real-time streaming)

  • Priority support with 99% uptime guarantees

  • Structured, verified data in standardized formats

Any company conducting large-scale data extraction for commercial AI models should use the paid tier. This includes OpenAI, Google, Meta, and similar companies.

Q: Can AI companies legally scrape Wikipedia without permission?

A: Increasingly, the answer is no. Several factors apply:

  • Copyright: Wikipedia content is CC-BY-SA licensed, which requires attribution. Scraping without proper attribution violates the license terms.

  • Fair use: Recent court rulings suggest that scraping copyrighted content to train competing AI products is not fair use.

  • Terms of service: Wikipedia's terms prohibit scraping that violates its policies or creates server strain.

  • Commercial harm: If AI scraping damages Wikipedia's sustainability, it could constitute tortious interference with business operations.

The legal landscape is evolving. AI companies face growing legal risk if they continue uncontrolled scraping.

Q: Why should AI companies pay for training data when they can scrape it for free?

A: Several reasons:

Practical:

  • Official API access is more reliable and efficient than maintaining scraping infrastructure

  • Legal certainty eliminates litigation risk

  • Regulatory compliance becomes easier

Financial:

  • The cost is small relative to the value of training data and model performance

  • Long-term savings from avoiding legal disputes and infrastructure maintenance

  • Tax-deductible business expense

Ethical:

  • Wikipedia's volunteers created this content through millions of hours of labor

  • AI companies profit from models trained on this content

  • Paying for data access acknowledges this value and contributes back to the source

Systemic:

  • If companies do not pay for content, platforms collapse, and the data source disappears

  • Sustainable funding preserves the knowledge commons that benefits everyone

Q: How does AI scraping threaten broader internet infrastructure?

A: AI scraping reveals fundamental flaws in the internet's original economic model. The internet assumed infrastructure should be free and open. This worked when users were human.

Industrial-scale AI scraping is different. Resource consumption is measured in terabytes and petabytes, not human queries. The economics do not work.

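If Wikipedia cannot sustain itself while AI companies consume its resources without paying, the model fails. Other platforms face the same threat. If free, collaborative knowledge systems cannot survive in an AI-driven world, what happens to global access to reliable information?

Conclusion: The Reckoning

The crisis facing Wikipedia reflects a broader reckoning for the knowledge economy in an AI-driven world.

For decades, the internet ran on the assumption that information should be free. That works for human consumption, but not for industrial-scale AI extraction.

AI companies trained their models on Wikipedia, Google search results, Reddit discussions, and countless other free sources. They built multibillion-dollar businesses on content they did not pay for.

The Wikimedia Foundation's response, requiring payment through the Wikimedia Enterprise API, is not anti-AI. It is a recognition that sustainability requires compensation. Content creation has value. Infrastructure has costs. Volunteers need support.

Whether AI companies will pay remains an open question. Some will adopt the official API. Others may try workarounds. The legal system will ultimately decide what is permitted.

But one thing is clear: the era of free, unlimited data extraction from Wikipedia is ending. How the world adapts will determine whether platforms like Wikipedia survive and thrive or collapse under uncompensated extraction.

The future of open knowledge depends on finding a balance between openness and sustainability. The Wikimedia Foundation's November 2025 statement represents the first major institutional attempt to strike that balance.

The outcome will matter for Wikipedia and for every platform that provides knowledge infrastructure to AI systems.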

 
 
