In response to the controversy over AI image synthesis models learning from artists’ images scraped from the internet without their consent – and potentially replicating their artistic styles – a group of artists has released a new website that allows anyone to check whether their work has been used to train AI.
The site, “Have I Been Trained?”, draws on the LAION-5B training data used to train Stable Diffusion and Google’s Imagen, among other AI models. To build LAION-5B, bots directed by a group of AI researchers crawled billions of websites, including large repositories of artwork at DeviantArt, ArtStation, Pinterest, Getty Images, and more. Along the way, LAION collected millions of images from artists and copyright holders without consultation, which has angered some artists.
When visiting the Have I Been Trained? website, which is run by an artist group called Spawning, users can search the dataset by text (such as an artist’s name) or by an image they upload. They will see image results alongside the caption data linked to each image. It is similar to an earlier LAION-5B search tool created by Romain Beaumont and a recent effort by Andy Baio and Simon Willison, but with a slicker interface and the ability to perform a reverse image search.
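At its core, a text search like this boils down to matching a query against the caption metadata attached to each image record. The sketch below is a toy illustration of that idea only; the record data and the `search_captions` function are invented for the example and are not the site’s actual implementation or real LAION-5B entries.

```python
# Toy sketch: searching a dataset of (image URL, caption) records by text,
# roughly what a caption search over LAION-style metadata does.
# All records below are invented examples.

def search_captions(records, query):
    """Return all records whose caption contains the query (case-insensitive)."""
    q = query.lower()
    return [r for r in records if q in r["caption"].lower()]

records = [
    {"url": "https://example.com/cat.jpg", "caption": "A painting of a cat"},
    {"url": "https://example.com/dog.jpg", "caption": "Photo of a dog on grass"},
    {"url": "https://example.com/cat2.jpg", "caption": "Sketch of a sleeping cat"},
]

for match in search_captions(records, "cat"):
    print(match["url"])
```

A real search over billions of records would use a prebuilt index (and, for reverse image search, embedding similarity) rather than a linear scan, but the input and output are the same: a query in, matching image–caption pairs out.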
Any match in the results means the image could have been used to train AI image generators, and may still be used to train the image synthesis models of tomorrow. AI artists can also use the results to guide more precise prompts.
Spawning’s website is part of the group’s goal to set standards around obtaining consent from artists to use their images in future AI training efforts, including developing tools that aim to let artists opt in or out of AI training.
A cornucopia of data
As mentioned above, image synthesis models (ISMs) like Stable Diffusion learn how to generate images by analyzing millions of images pulled from the Internet. These images are useful for training because they come paired with text metadata, such as captions and alt text. Linking this metadata to images lets ISMs learn associations between words (such as artist names) and image styles.
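The word-to-image linkage described above can be pictured with a toy data structure: an inverted index mapping caption words to the images whose captions contain them. The records and the `build_word_index` helper below are invented for illustration; a real ISM learns these associations statistically in its weights, not via an explicit index.

```python
from collections import defaultdict

def build_word_index(records):
    """Map each caption word to the set of image URLs whose captions use it."""
    index = defaultdict(set)
    for url, caption in records:
        for word in caption.lower().split():
            index[word].add(url)
    return index

# Invented example records: (image URL, caption) pairs.
records = [
    ("img1.jpg", "oil painting of a cat"),
    ("img2.jpg", "watercolor painting of a dog"),
    ("img3.jpg", "photo of a cat"),
]

index = build_word_index(records)
print(sorted(index["painting"]))  # images whose captions mention "painting"
```

The index makes the intuition concrete: every image captioned with an artist’s name contributes to what the model associates with that name, which is exactly why scraped captions make artist names usable in prompts.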
When you type a prompt such as “a painting of a cat by Leonardo da Vinci,” the ISM draws on what it knows about each word in that phrase, including images of cats and paintings by da Vinci, and how the pixels in those images are typically arranged in relation to one another. It then composes a result that combines that knowledge into a new image. If a model is trained properly, it will never return an exact copy of an image used to train it, but some results may be similar in style or composition to the source material.
It would be impractical to pay humans to manually write descriptions of billions of images for an image dataset (though it has been attempted at much smaller scales), so all that “free” image data on the Internet is a tempting target for AI researchers. They don’t ask for consent because the practice appears to be legal thanks to US court rulings on internet data scraping. But a recurring theme in AI news is that deep learning can find new ways to use public data that weren’t anticipated before, and do so in ways that may violate privacy, social norms, or community ethics, even if the method is technically legal.
It should be noted that people using AI image generators usually invoke artists (often more than one at a time) to blend artistic styles into something new, not to commit copyright infringement or harmfully imitate artists. Even so, some groups like Spawning believe that consent should always be part of the equation, especially as we venture into this uncharted, rapidly developing territory.