MLNews

Your Personal Data Is Probably Being Used to Train AI Models

Artists and authors are rightly concerned about generative AI models. These machine-learning models can generate images and text only because they have been trained on masses of real people's creative work, much of it copyrighted. Several lawsuits have been filed against major AI developers, including OpenAI, Meta, and Stability AI. Independent assessments back up the legal arguments; for example, The Atlantic revealed in August that Meta trained its large language model (LLM) in part on a dataset called Books3, which comprised more than 170,000 pirated, copyrighted books.

Developers build massive generative AI models from the public Internet. However, as Emily M. Bender, a professor at the University of Washington who studies computational linguistics and language technology, points out, there is no single location where you can go to download the Internet. Instead, developers assemble their training sets with automated tools that catalog and retrieve material: Web "crawlers" follow link after link, recording where information lives, while Web "scrapers" download and extract that data.
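The crawl-then-scrape pattern described above can be sketched in a few lines. This is a hypothetical toy example using only Python's standard library (real crawling pipelines such as Common Crawl's are far more elaborate): one parser plays the crawler, cataloging where the links on a page lead, and another plays the scraper, extracting the page's visible text.

```python
# Toy sketch of the crawler/scraper split described above.
# Assumes a fetched HTML document; here a hard-coded string stands in for it.
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Crawler step: record where the page's links point."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL
                    self.links.append(urljoin(self.base_url, value))

class TextScraper(HTMLParser):
    """Scraper step: pull out the visible text that would end up in a training set."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

# A stand-in for one downloaded page
page = '<p>My blog post</p><a href="/next">more</a>'

crawler = LinkExtractor("https://example.com/")
crawler.feed(page)
scraper = TextScraper()
scraper.feed(page)

print(crawler.links)   # where the crawler would go next
print(scraper.chunks)  # text the scraper extracted
```

A real pipeline would loop: fetch each URL the crawler recorded, scrape it, and add its links to the queue, which is how everything reachable without a login ends up collected.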

Crawlers and scrapers can quickly harvest data from almost anywhere that isn't secured behind a login page, so private social media profiles are excluded. But data that appears in a search engine or can be viewed without logging in, such as a public LinkedIn profile, may still be swept up, according to Dodge. Blogs, personal webpages, and business websites, he adds, almost certainly remain in these Web scrapes.

That includes everything on Flickr, online marketplaces, voter-registration databases, government webpages, Wikipedia, and academic institutions' sites. It also includes pirated collections and Web archives, which frequently contain data that has since been removed from its original location on the Internet.

AI models can regurgitate the very information they were trained on, including potentially confidential personal data and copyrighted intellectual property. Many widely used generative AI models have guardrails meant to prevent them from disclosing personally identifiable information, but researchers have repeatedly found ways to circumvent those limits.

Even when AI outputs do not technically constitute plagiarism, Zhao argues, they can eat into paid opportunities for creative professionals by, for example, imitating a particular artist's distinctive visual style. Without transparency about data sources, however, it is difficult to attribute such results to the training data; after all, the AI could simply be "hallucinating" the problematic content.

Reference

https://www.scientificamerican.com/article/your-personal-information-is-probably-being-used-to-train-generative-ai-models/

