An investigation by The Atlantic has found that approximately 21.2 million music tracks are circulating within the AI development ecosystem, where they are used to train generative audio models without the consent or compensation of rights holders. The investigation was led by staff writer and investigative journalist Alex Reisner.

The project, known as the AI Watchdog, originated in 2025 as a diagnostic tool tracking the unlicensed ingestion of books, research literature, and video media. Its expanded music investigation converts previously opaque training databases into a searchable public tool, giving artists, labels, and legal teams empirical evidence of which specific works have been ingested by AI systems. The tool is free to use and publicly accessible on The Atlantic’s website.

Four datasets at the centre of the investigation

The Watchdog identified four datasets circulating among AI developers. The largest is LAION-DISCO-12M, compiled by German AI non-profit LAION, which contains approximately 12.6 million tracks. It was built using an automated recursive search that matched 250,516 seed artists with YouTube Music URLs, and was released under an Apache 2.0 licence ostensibly for academic research, though it has since been widely downloaded by commercial developers.
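The article does not describe how LAION's recursive search was implemented. As a rough illustration only, one common way to build such a search is breadth-first expansion from a seed list, visiting each artist once. The toy graph, the function name expand_artists, and the limit parameter below are all hypothetical and run entirely on in-memory data.

```python
from collections import deque

# Hypothetical toy graph: artist -> related artists. It stands in for
# whatever lookup a real crawler would use; the article does not say.
RELATED = {
    "Artist A": ["Artist B", "Artist C"],
    "Artist B": ["Artist A", "Artist D"],
    "Artist C": ["Artist D"],
    "Artist D": [],
}


def expand_artists(seeds, related, limit=None):
    """Breadth-first expansion from seed artists, visiting each artist once."""
    seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    seen = set(seeds)
    order = list(seeds)
    queue = deque(seeds)
    while queue:
        artist = queue.popleft()
        for neighbour in related.get(artist, []):
            if neighbour in seen:
                continue
            # Stop early once the optional cap on collected artists is hit.
            if limit is not None and len(order) >= limit:
                return order
            seen.add(neighbour)
            order.append(neighbour)
            queue.append(neighbour)
    return order


print(expand_artists(["Artist A"], RELATED))
```

The seen set is what keeps the expansion from looping when two artists point at each other, which is the main design concern in any recursive catalogue walk.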

The second dataset, Sleeping-DISCO-9M, was assembled by the Sleeping AI Research Collective and contains around 9 million tracks scraped from popular commercial music. It was published on Hugging Face and has been heavily targeted by generative modelling platforms. The collective also maintains a restricted subset called Sleeping-DISCO-Private, which includes full lyrics and Genius annotations, accessible only to verified research institutions.

The remaining two datasets are smaller in scale but significant in method. The Free Music Archive, which originated at the WFMU radio station, contains around 100,000 tracks released under Creative Commons licences and has been used by Google and Stability AI. The fourth, a pointer-metadata dataset of similar size, circulates within private developer forums and links directly to active Spotify and YouTube files.

A critical technical detail uncovered by the investigation is that three of the four datasets function not as audio libraries but as structured pointer systems. Rather than hosting audio files directly, they store metadata and URLs pointing to YouTube or Spotify, which AI developers then use alongside automated downloading tools to bypass platform logins, advertisements, and creator monetisation mechanisms. This method undermines the defence that developers only use freely available online material, since the process involves coordinated circumvention of licensing agreements embedded in those platforms.
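To make the idea of a pointer dataset concrete, here is a minimal sketch of what one record might look like and how a searchable index over such records could answer the question "is this artist listed?". The field names, the example.com URLs, and the find_by_artist helper are assumptions for illustration; the article does not publish the actual schema of any of the datasets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PointerRecord:
    """One row of a hypothetical pointer dataset: metadata plus a link, no audio."""
    artist: str
    title: str
    url: str


# Placeholder rows; real datasets would hold millions of these.
RECORDS = [
    PointerRecord("Example Artist", "Example Song", "https://example.com/track/1"),
    PointerRecord("Another Artist", "Another Song", "https://example.com/track/2"),
    PointerRecord("Example Artist", "Second Song", "https://example.com/track/3"),
]


def find_by_artist(records, name):
    """Return every record whose artist matches the name, ignoring case."""
    wanted = name.strip().casefold()
    return [r for r in records if r.artist.casefold() == wanted]


matches = find_by_artist(RECORDS, "example artist")
print([m.title for m in matches])
```

Note that the record carries no audio at all, only text fields and a link, which is why such datasets are small to distribute even when they reference millions of tracks.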

To put the scale into perspective, auditing the contents of LAION-DISCO-12M alone, at an average track length of four minutes, would require approximately 91 years of uninterrupted listening.
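The listening-time estimate can be checked with simple arithmetic. The sketch below assumes a 365.25-day year and the four-minute average stated above; the function name listening_years is hypothetical. Under those assumptions, the 91-year figure corresponds to a round 12 million tracks, while the 12.6 million count quoted earlier for LAION-DISCO-12M works out closer to 96 years.

```python
MINUTES_PER_TRACK = 4                 # average track length stated in the article
MINUTES_PER_YEAR = 60 * 24 * 365.25   # assumption: 365.25-day year


def listening_years(track_count, minutes_per_track=MINUTES_PER_TRACK):
    """Years of non-stop listening needed to play every track once."""
    return track_count * minutes_per_track / MINUTES_PER_YEAR


print(round(listening_years(12_000_000), 1))  # 91.3
print(round(listening_years(12_600_000), 1))  # 95.8
```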

Response from APRA AMCOS

The release of the Watchdog tool prompted an immediate response from APRA AMCOS, the Australasian Performing Right Association and Mechanical Copyright Owners Society, which represents over 128,000 songwriters, composers, and publishers. An audit of the datasets confirmed that thousands of Australian and New Zealand works had been scraped, including recordings by Kylie Minogue, AC/DC, Tame Impala, Flume, Sia, Midnight Oil, Lorde, INXS, Crowded House, and Cold Chisel.

Economic projections from the organisation’s AI and Music Report suggest that without a mandatory licensing framework, Australasian creators face an average 23% revenue reduction, amounting to over $500 million in losses across Australia and New Zealand over four years. APRA AMCOS Chief Executive Dean Ormston described the Watchdog’s database as direct proof of theft, pushing back against technology firms lobbying Australian and New Zealand governments for copyright exceptions and legal immunity for text and data mining.

The investigation also uncovered violations of Indigenous cultural rights. The datasets were found to have indiscriminately scraped sacred recordings by Aboriginal, Torres Strait Islander, and Māori artists, including works by Yothu Yindi, Gurrumul, Warumpi Band, Stan Walker, and Maisey Rika. APRA AMCOS Director of Aboriginal and Torres Strait Islander Programs Leah Flanagan stated that these recordings represent living cultural expressions governed by traditional protocols, with some material never cleared for commercial use on any terms.

Australia has formally rejected a copyright exception for AI platforms. New Zealand has taken a different approach, with Commerce and Consumer Affairs Minister Cameron Brewer announcing a 20-year extension to copyright protection alongside a commitment to deliver a comprehensive AI copyright policy report by March 2027.

What the investigation changes

Reisner’s investigation has established that AI training corpora are neither invisible nor untraceable. By making the datasets searchable, the Watchdog has shifted the burden of proof in ongoing litigation, given regulators empirical data to inform policy, and provided independent artists with a tool to verify whether their work has been ingested.

The central finding is straightforward: roughly 21 million songs did not end up in AI training pipelines by accident. They were systematically catalogued, linked, and downloaded through methods designed to avoid the licensing obligations that govern every other use of recorded music.

You can check out AI Watchdog by The Atlantic on their official website.


