ChatGPT Training: The List of News Orgs (and articles?)
Ask and you shall receive. After a straightforward but persistent dialogue with the “Feb 13 version,” I received, in alphabetical order no less, an itemized list of 37+ news sources that ChatGPT claimed were used to train OpenAI’s language model as of its September 2021 cutoff date. I also received lists of articles said to have been used for the training, each with the story title, date of publication, and URL.
I’ll refrain from listing the steps it took to procure this information, but it went something like this:
After I started seeing significant redundancy in the sources, I asked for all of the sources to be returned in an alphabetical list:
Interestingly, and despite not prompting for it, I was told that this list was exhaustive, at least for the model’s September 2021 cutoff:
Now, it’s clear that ChatGPT can hallucinate occasionally. However, the results included individual articles, with their authors, from the sources it had identified as training data. Here are some examples:
Finally, since the discussion had been revealing, I input a few prompts inquiring about Twitter, Facebook, and Google as training sources, which yielded the following answers:
To recap, below is the list of news sources ChatGPT claimed were used to train OpenAI’s “language model”:
- Al Jazeera
- Associated Press (AP)
- BBC News
- Bloomberg News
- Business Insider
- The Conversation (appeared in earlier lists; missing in the final list)
- The Daily Star
- The Economic Times
- The Globe and Mail
- The Guardian
- The Hindu (appeared in earlier lists; missing in the final list)
- The Independent
- The Irish Times
- The Japan Times
- Kyodo News
- The Moscow Times
- The National
- The New York Times
- The New Zealand Herald
- Philippine Star
- South China Morning Post
- The Straits Times
- Taipei Times
- The Telegraph (appeared in earlier lists; missing in the final list)
- Today Online
- The Times of India
- The Times of Israel
- USA Today
- The Wall Street Journal
- The Washington Post
- The Washington Times
- Xinhua News Agency
- Yonhap News Agency
A few thoughts: First, I am not a legal scholar and will defer to others concerning the ethical and legal implications of this result. Second, I am generally a fan of OpenAI’s progress in the field (for now), and feel that tools such as ChatGPT offer new ways to explore information and news-related topics. Third, these sources appear in a list of domains from the WebText data (and Common Crawl) used to train earlier GPT models.
I checked some of the individual news articles that were listed in the claimed news source “training data.”
Only one of the results in the list, a story about NASA’s Perseverance rover, led me to a non-404 page. While the news articles that ChatGPT claimed were part of its training data relate to past events, the details (dates, URLs, news outlets, and headlines) are algorithmic remixes of reality.
There have been a growing number of posts about ChatGPT providing fake academic citations, but the results here still surprised me given the articles were supposed to be from the news sources in the training data.
Next, I asked for summaries of the nonexistent news articles from the 404 URLs that were provided.
A summary of a news story that returned a 404 result and couldn’t be found in Google or in the Wayback Machine? For the NPR piece, perhaps the keywords at the end of the URL were responsible for the output. So I asked for a summary of the 2020 BBC article on the UK’s climate supercomputer, which did not have keywords in the URL.
ChatGPT isn’t simply making up facts at random as it goes along. Rather, once it’s wrong, each further step you take in the conversational chain is likely to take you further from the truth.
In this case, an early answer to my prompt that had a high probability of being factual (“here are some of the news sources that are included in my training data”) resulted in a slippery slope of reality remixing until a fictional story — not fictional in the absolute sense, as the stories were based on real events and people — led to the creation of associated entities that never existed. This includes the generation of complex, domain-specific URLs linking to imaginary pages hosted on real domains.
Returning to the imaginary NPR story about lab-grown meat, I “ran” a VM instance inside ChatGPT, directing the browser to retrieve the contents of the URL.
We can debate the inputs and parameters of these models endlessly, but, as with earlier technologies, the outputs are what usually affect reality.