Bluesky users debate plans around user data and AI training
Bluesky users debate the use of their data for AI and archiving.

Bluesky, a social network, recently published a proposal on GitHub that would let users indicate whether their posts and data may be used for purposes such as generative AI training and public archiving. The proposal drew heated discussion, as some users saw it as conflicting with Bluesky's earlier commitment not to sell user data for advertising or AI training. Bluesky's CEO, Jay Graber, acknowledged these concerns. Graber discussed the proposal during a presentation at South by Southwest and highlighted it again in a post on Bluesky.
Graber explained that many generative AI companies already scrape public data from across the web, including from Bluesky, because posts on such platforms are inherently public. The proposal aims to establish a new standard for governing that scraping, akin to the robots.txt files that websites use to communicate with web crawlers. Although robots.txt files are not legally binding, they signal a site's intent about what scrapers may access.
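For readers unfamiliar with the robots.txt convention the proposal is compared to, the following is a minimal sketch of how a well-behaved crawler might consult such a file before fetching a page. It uses Python's standard `urllib.robotparser` module. The bot name `ExampleAIBot`, the file contents, and the URL are hypothetical and chosen only for illustration; nothing here enforces the rule, which is exactly the voluntary-compliance point made above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents: one named bot is asked to stay out,
# every other crawler is allowed everywhere.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the robots.txt text permits user_agent to fetch url.

    This only reports the site's stated intent. A crawler that never
    calls this function is not stopped by anything.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/post/1"
    print(may_fetch("ExampleAIBot", page))  # the named bot is disallowed
    print(may_fetch("SomeOtherBot", page))  # other crawlers are allowed
```

The check runs entirely on the crawler's side, which is why such files work only as a signal of intent rather than as an access control.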
Bluesky’s proposed standard introduces a framework for user preferences, allowing individuals to signal consent or refusal in four categories: generative AI, protocol bridging (connecting different social ecosystems), bulk datasets, and web archiving, which includes services like the Internet Archive’s Wayback Machine. The system is meant to give users a way to communicate their permissions to companies or teams that scrape or transfer their data, including those training AI models.
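To make the four-category idea concrete, here is a rough sketch of how such preferences could be represented and checked. It is not the schema from Bluesky's proposal: the class name `UserIntents`, the field names, the function `respects_intent`, and the handling of undeclared preferences are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# The four categories named in the article. The identifiers themselves
# are hypothetical, not taken from the proposal.
CATEGORIES = (
    "generative_ai",
    "protocol_bridging",
    "bulk_datasets",
    "web_archiving",
)


@dataclass(frozen=True)
class UserIntents:
    """One user's declared preferences (hypothetical structure).

    True means allowed, False means disallowed, and None means the
    user has not declared a preference for that category.
    """
    generative_ai: Optional[bool] = None
    protocol_bridging: Optional[bool] = None
    bulk_datasets: Optional[bool] = None
    web_archiving: Optional[bool] = None


def respects_intent(intents: UserIntents, category: str, default: bool = False) -> bool:
    """Return True if using the data for `category` matches the user's signal.

    `default` covers undeclared preferences. Treating "undeclared" as
    "not permitted" is an assumption of this sketch, not a rule stated
    in the article.
    """
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category}")
    declared = getattr(intents, category)
    return default if declared is None else declared


if __name__ == "__main__":
    intents = UserIntents(generative_ai=False, web_archiving=True)
    for name in CATEGORIES:
        print(name, respects_intent(intents, name))
```

As with robots.txt, a check like this only has an effect if the party collecting the data chooses to run it and honor the result.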
Molly White, known for her commentary on technology platforms through the 'Citation Needed' newsletter and the blog 'Web3 is Going Just Great', described Bluesky's approach as commendable. She argued that the proposal should be read not as an invitation to AI scraping but as a way to establish a consent signal, reinforcing the idea of ethical data usage. White also pointed out the inherent weakness of relying on voluntary compliance by data scrapers, since many have been known to disregard permission files like robots.txt or to scrape data without authorization.
The proposal faced skepticism, with users like Sketchette expressing strong opposition, but others see it as a necessary step toward balancing data accessibility and user control. The debate highlights the difficulty of reconciling AI development with users' expectations of trust and privacy.
Sources: Bluesky GitHub, South by Southwest, Bluesky platform, Citation Needed newsletter, Web3 is Going Just Great blog