Ask and you shall receive. After a straightforward but persistent dialogue with the “Feb 13 version,” I received, in alphabetical order no less, an itemized list of 37+ news sources that ChatGPT claimed were used to train OpenAI’s language model as of its September 2021 cutoff. I also received lists of articles supposedly used in the training, each containing the story title, publication date, and URL.
I’ll refrain from listing the steps it took to procure this information, but it went something like this:
After I started seeing significant redundancy in the sources, I asked for all of the sources to be returned in an alphabetical list:
Interestingly, and despite not prompting for it, I was told that this list was exhaustive, at least for the model’s September 2021 cutoff:
Now, it’s clear that ChatGPT hallucinates on occasion. The results, however, included individual articles, complete with authors, from the sources it had identified as training material. Here are some examples:
Finally, since the discussion had been revealing, I input a few prompts inquiring about Twitter, Facebook, and Google as training sources, which yielded the following answers:
To recap, here is the list of news sources ChatGPT claimed were used to train OpenAI’s “language model”:
- Al Jazeera
- Associated Press (AP)
- BBC News
- Bloomberg News
- Business Insider
- The Conversation (appeared in early lists; missing in the final list)
- The Daily Star
- The Economic Times
- The Globe and Mail
- The Guardian
- The Hindu (appeared in earlier lists; missing in the final list)
- The Independent
- The Irish Times
- The Japan Times
- Kyodo News
- The Moscow Times
- The National
- The New York Times
- The New Zealand Herald
- Philippine Star
- South China Morning Post
- The Straits Times
- Taipei Times
- The Telegraph (appeared in early lists; missing in the final list)
- Today Online
- The Times of India
- The Times of Israel
- USA Today
- The Wall Street Journal
- The Washington Post
- The Washington Times
- Xinhua News Agency
- Yonhap News Agency
A few thoughts: First, I am not a legal scholar and will defer to others concerning the ethical and legal implications of this result. Second, I am generally a fan of OpenAI’s progress in the field (for now), and feel that tools such as ChatGPT offer new ways to explore information and news-related topics. Third, these sources appear in a list of domains from the WebText data (and Common Crawl) used to train earlier GPT models.
I checked some of the individual news articles listed in the claimed “training data.” Only one result in the list, a story about NASA’s Perseverance rover, led me to a non-404 page. While the articles ChatGPT claimed were part of its training data relate to real past events, the content (dates, URLs, news outlets, and headlines) is an algorithmic remix of reality.
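The dead-link check is easy to reproduce. Below is a minimal sketch of the kind of verification I did by hand; the URL in the list is a placeholder, not one of the URLs ChatGPT actually produced.

```python
import urllib.error
import urllib.request

def classify_status(code):
    """Map an HTTP status code to a rough live/dead verdict."""
    if code == 200:
        return "live"
    if code == 404:
        return "missing"
    return f"other ({code})"

def check_url(url, timeout=10):
    """HEAD-request a URL and classify it; network failures count as unreachable."""
    req = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": "link-checker/0.1"}
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return classify_status(resp.status)
    except urllib.error.HTTPError as e:
        # 404s and other error statuses arrive as HTTPError exceptions
        return classify_status(e.code)
    except urllib.error.URLError:
        return "unreachable"

if __name__ == "__main__":
    # Placeholder URL; substitute the article URLs ChatGPT returned.
    claimed = ["https://www.example.com/2021/02/19/perseverance-landing"]
    for url in claimed:
        print(url, "->", check_url(url))
```

Running this over the full list of claimed article URLs makes the pattern obvious at a glance: nearly everything classifies as missing.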
A growing number of posts have documented ChatGPT fabricating academic citations, but the results here still surprised me, given that these articles were supposedly drawn from the news sources in its training data.
Next, I asked for summaries of the nonexistent news articles from the 404 URLs that were provided.
A summary of a news story that returned a 404 and couldn’t be found in Google or the Wayback Machine? For the NPR piece, perhaps the keywords at the end of the URL were responsible for the output. So I asked for a summary of the 2020 BBC article on the UK’s climate supercomputer, which had no keywords in its URL.
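The Wayback Machine lookup can also be scripted. Here is a sketch using the Internet Archive’s public availability endpoint (`archive.org/wayback/available`); the BBC URL shown in the comment is a placeholder, not the one ChatGPT generated.

```python
import json
import urllib.parse
import urllib.request

WAYBACK_API = "https://archive.org/wayback/available"

def availability_query(url, timestamp=None):
    """Build the Wayback availability API query URL for a target page."""
    params = {"url": url}
    if timestamp:
        # Optional YYYYMMDD value; biases the lookup toward snapshots near that date
        params["timestamp"] = timestamp
    return WAYBACK_API + "?" + urllib.parse.urlencode(params)

def closest_snapshot(url, timestamp=None, timeout=10):
    """Return the closest archived snapshot URL, or None if nothing is archived."""
    query = availability_query(url, timestamp)
    with urllib.request.urlopen(query, timeout=timeout) as resp:
        data = json.load(resp)
    snap = data.get("archived_snapshots", {}).get("closest")
    return snap["url"] if snap else None

# Example (placeholder URL):
# closest_snapshot("https://www.bbc.com/news/some-article", "20200601")
# returns an archive.org snapshot URL if one exists, otherwise None.
```

For a genuinely published 2020 article, `closest_snapshot` would almost always find something; for the URLs ChatGPT invented, it comes back empty.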
ChatGPT isn’t simply making up facts as it goes along: once it’s wrong, each step you take in the conversational chain is likely to carry you further from the truth.
In this case, an early answer to my prompt that had a high probability of being factual (“here are some of the news sources that are included in my training data”) set off a slippery slope of reality remixing, until a fictional story (not fictional in the absolute sense, since the stories were based on real events and people) led to the creation of associated entities that never existed. This includes the generation of complex, domain-specific URLs linking to imaginary pages hosted on real domains.
Returning to the imaginary NPR story about lab-grown meat, I “ran” a VM instance inside ChatGPT and directed its browser to retrieve the contents of the URL.
We can debate the inputs and parameters of these models endlessly, but, as with earlier technologies, it is the outputs that ultimately impact reality.