NIX Solutions: Apple AI Data Collection – Opt-Outs and Ethics

Publicly available websites are one of the sources of data used to train generative AI systems. Apple lets site owners opt out of having their content collected for training Apple Intelligence, and many of the largest sites have done so. Among them are Facebook and Instagram, as well as major news and media outlets, including the New York Times and The Atlantic.


AppleBot and Ethical Concerns

For several years, Apple has operated a web crawler called AppleBot, and the data it collected was used to improve Siri and Spotlight search. More recently, the company began using AppleBot data for Apple Intelligence as well. The practice is controversial because generative AI systems can end up reproducing copyrighted material: in narrow subject areas where little source material exists, they sometimes quote entire paragraphs almost unchanged.

Apple says it collects information ethically, filtering out personal data and using only licensed material and publicly available data gathered by the AppleBot crawler. To let webmasters opt out of AI training specifically, the company introduced a separate user agent name, Applebot-Extended. Blocking this user agent leaves standard search indexing in place.

Site owners opt out by adding a directive to their publicly available robots.txt file, which means anyone can see which publishers have blocked Apple Intelligence from their content, notes NIX Solutions. Wired magazine found that Facebook, Instagram, Craigslist, Tumblr, the New York Times, the Financial Times, The Atlantic, Vox Media, the USA Today network and Condé Nast have all done so. Just over a quarter of major American news sites (294 out of 1,167) have blocked Apple's AI, according to journalist Ben Welsh.
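As a rough illustration of the robots.txt opt-out described above, the sketch below parses a sample robots.txt with Python's standard urllib.robotparser module and checks which user agents it admits. The robots.txt contents, the example.com URL, and the is_allowed helper are assumptions made for illustration. Real publishers' files vary, and this shows only how such rules parse, not how Apple itself evaluates them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks only Applebot-Extended (AI training),
# while every other user agent, including Applebot, stays allowed.
ROBOTS_TXT = """\
User-agent: Applebot-Extended
Disallow: /

User-agent: *
Allow: /
"""

# Placeholder URL, not a real publisher page.
URL = "https://example.com/news/article.html"


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("Applebot", "Applebot-Extended"):
        print(agent, "allowed:", is_allowed(ROBOTS_TXT, agent, URL))
```

With this sample file, the check for Applebot-Extended is refused while Applebot is still admitted, which matches the behavior described above: opting out of AI training does not remove a site from search indexing.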

According to unconfirmed reports, Apple has struck deals with some media companies, paying them for the right to use their material for AI training. The prospect of similar payments is probably holding other publishers back as well: they may simply be waiting for an offer. We'll keep you updated on any developments in this ongoing situation.