Where Do AI Programs Get Their Data? It Turns Out Some Comes from Copyrighted Books, Without Permission
- by Michael Stillman
CatGPT?
Where does the information you get from artificial intelligence (AI) sources like ChatGPT come from? It comes from a lot of places, including the reams of data on the internet, but a significant source is books. Many, if not most, are of recent vintage, as up-to-date information is needed for the best answers. As such, most of these books are under copyright. However, the authors and publishers of these books have been neither asked for permission nor compensated. Is this legal, an acceptable use of copyrighted works, or a violation of copyright law? Good question. No one knows the answer since it has not been adjudicated in court.
AI programs gain a lot of their data, and learn how language is used so they can give understandable answers, from training databases. These are databases filled with an enormous amount of information. How about the best known AI program, ChatGPT? Did it learn from a training database? To answer this, we went to the ultimate authority to ask, ChatGPT itself. It responded, “Yes, ChatGPT, like other GPT-3 models, is trained on a large and diverse dataset containing a wide range of text from the internet. This dataset includes books, articles, websites, and other sources of human-generated text. The model learns patterns, language structures, and information from this training data, which it then uses to generate responses to user inputs.”
One such online training database is called “The Pile,” and a subset of The Pile is Books3. The Pile contains data from numerous sources, with Books3 providing the book element. Books3 contains 196,000 books, converted to searchable text. It is not necessarily in a format that would allow you to read it as a book, but the text is there. Most are likely copyrighted but used without permission. It was freely available on the internet to anyone seeking to build an AI model. Its creator made it so, as he wanted even small developers to have a shot at creating a model.
Books3 was recently removed from the internet. It was taken down after Rights Alliance, a group representing Danish publishers, made the request. They determined that 150 titles used were published by their members. The Eye, the website hosting Books3, complied.
This issue is already starting to appear in court, and we can expect to see more cases until some sort of decision is reached on where AI training databases and copyright law intersect. One argument is that this is “Fair Use,” a doctrine that allows you to quote brief parts of a book without running afoul of copyright law. Training an AI model can be argued to be similar, without even direct quoting. It is sort of like conducting research in a library. However, it is also true that these databases have copied entire books to do their searching. It is also notable that the authors are not being compensated, while at risk of losing sales to people who would rather do their research through services like ChatGPT. Of course, the database compiler can license the material from the publisher, but that would require many deals with many people, and it might be prohibitively expensive for all but the largest corporations. That is what the Books3 creator sought to avoid. Maybe ChatGPT can come up with an answer to this dilemma.
Note on illustration. What the...? I asked ChatGPT's image generator for a picture of ChatGPT. This is what it gave me. Why? Perhaps it has to do with the French word for “cat” being “chat,” but who knows what its artificial mind was thinking. Hopefully, its textual answers are a little better.