
Wikipedia offers its data to AI to avoid being siphoned off by robots

Wikimedia, the nonprofit foundation that hosts and supports Wikipedia, is struggling with data-harvesting bots operated by AI companies. These bots are extremely data-hungry and put a strain on the organization's infrastructure: since the beginning of the year, their activity has increased the bandwidth used for downloading multimedia content by 50%. In response, Wikimedia has released a structured dataset of Wikipedia content. Designed specifically for machine learning applications, it gives direct access to already processed articles that can be used immediately for tasks such as modeling, fine-tuning, alignment, and analysis.

Technically, the dataset is built on the Snapshot Structured Contents API, which delivers data in a machine-readable JSON format. This allows developers and researchers to work directly with well-segmented articles containing summaries, short descriptions, structured data such as infoboxes, links to images, and clearly delimited article sections (references and other non-text elements are excluded).
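For illustration, the short Python sketch below shows how such a JSON record might be read. The field names used here ("name", "abstract", "description", "infoboxes", "sections") are assumptions inferred from the description above rather than a confirmed schema, and the file name is hypothetical.

import json

# Minimal sketch of reading one article record from the structured dump.
# Field names are assumptions based on the article description, not a confirmed schema.
def summarize_record(line):
    article = json.loads(line)
    return {
        "title": article.get("name"),
        "summary": article.get("abstract"),
        "short_description": article.get("description"),
        "infoboxes": article.get("infoboxes", []),
        "section_titles": [s.get("name") for s in article.get("sections", [])],
    }

# Hypothetical file name; the dump is assumed here to be newline-delimited JSON.
with open("wikipedia_structured_contents.jsonl", encoding="utf-8") as f:
    for line in f:
        print(summarize_record(line))
        break  # inspect only the first article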

The data is published under open licenses, with some content in the public domain or under alternative licenses. It is hosted on Kaggle, the Google-owned platform that serves as a reference point for the machine learning community. Wikimedia already had a content-sharing partnership with Google, so this new initiative is a logical continuation of it.
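For readers who want to experiment, the dataset can be fetched programmatically with Kaggle's kagglehub client; the dataset handle below is an assumption for illustration and should be checked against the actual Kaggle listing.

import kagglehub

# Assumed dataset handle; verify the exact slug of Wikimedia's
# structured-contents dataset on Kaggle before running.
path = kagglehub.dataset_download("wikimedia-foundation/wikipedia-structured-contents")
print("Dataset downloaded to:", path)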

Source: Wikimedia
