[00:00:00]

From The New York Times, I'm Michael Barbaro. This is The Daily. Today: a Times investigation shows how, as the country's biggest technology companies raced to build powerful new artificial intelligence systems, they bent and broke the rules from the start. My colleague, Cade Metz, on what he uncovered. It's Tuesday, April 16th. Cade, when we think about all the artificial intelligence products released over the past couple of years, including, of course, these chatbots we've talked a lot about on the show, we so frequently talk about their future: their future capabilities, their influence on society, jobs, our lives. But you recently decided to go back in time to AI's past, to its origins, to understand the decisions that were made basically at the birth of this technology. Why did you decide to do that?

[00:01:29]

Because if you're thinking about the future of these chatbots, that future is defined by their past. The thing you have to realize is that these chatbots learn their skills by analyzing enormous amounts of digital data. What my colleagues and I wanted to do with our investigation was really focus on that effort to gather more data. We wanted to look at the type of data these companies were collecting, how they were gathering it, and how they were feeding it into their systems.

[00:02:08]

When you all undertake this line of reporting, what do you end up finding?

[00:02:13]

We found that three major players in this race, OpenAI, Google, and Meta, were locked in a competition to develop better and better artificial intelligence, and they were willing to do almost anything to get their hands on this data, including ignoring, and in some cases violating, corporate rules and wading into a legal gray area as they gathered it.

[00:02:43]

Basically, cutting corners.

[00:02:45]

Cutting corners left and right.

[00:02:47]

Okay, let's start with OpenAI, the flashiest player of all.

[00:02:53]

The most interesting thing we found is that in late 2021, OpenAI, the startup in San Francisco that built ChatGPT, was pulling together the fundamental technology that would power that chatbot when they essentially ran out of data. They had used just about all the respectable English-language text on the internet to build this system. Just let that sink in for a bit.

[00:03:27]

I'm trying to let that sink in. They basically, like Pac-Man in an old video game, just consumed almost all the English words on the internet, which is unfathomable.

[00:03:39]

Wikipedia articles by the thousands, news articles, Reddit threads, digital books by the millions. We're talking about hundreds of billions, even trillions, of words. Wow. So by the end of 2021, OpenAI had no more English-language text that they could feed into these systems. But their ambitions were such that they wanted even more. And here you need to remember that if you're gathering up all the English-language text on the internet, a large portion of that is going to be copyrighted. If you're one of these companies gathering data at that scale, you are absolutely gathering copyrighted data as well.

[00:04:35]

Which suggests that from the very beginning, these companies, a company like OpenAI with ChatGPT, were starting to bend and break the rules.

[00:04:46]

Yes. They are determined to build this technology, and so they are willing to venture into what is a legal gray area.

[00:04:55]

Given that, what does OpenAI do once it, as you had said, runs out of English language words to mop up and feed into this system?

[00:05:06]

They get together and they say, All right, so what are the other options here? And they say, Well, what about all the audio and video on the internet? We could transcribe all the audio and video, turn it into text, and feed that into our systems. Interesting. So a small team at OpenAI, which included its president and co-founder, Greg Brockman, built a speech recognition technology called Whisper, which could transcribe audio files into text with high accuracy. Then they gathered up all sorts of audio files from across the internet, including audiobooks, podcasts, and most importantly, YouTube videos.
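
For a concrete sense of what that kind of pipeline looks like, here is a minimal sketch using the open-source whisper package OpenAI later released publicly. The file name is a hypothetical placeholder, and this illustrates the general audio-to-text technique, not OpenAI's internal system.

```python
# A minimal sketch of audio-to-text transcription with the open-source
# "whisper" package (pip install openai-whisper). The file name below is
# a hypothetical placeholder; this shows the general technique, not
# OpenAI's internal pipeline.
import whisper

# Load a pretrained checkpoint; larger checkpoints ("small", "medium",
# "large") are more accurate but slower.
model = whisper.load_model("base")

# Transcribe one audio file into plain text.
result = model.transcribe("podcast_episode.mp3")

# The resulting text could then be added to a training corpus.
print(result["text"])
```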

[00:05:57]

Of which there's a seemingly endless supply. Fair to say, maybe tens of millions of videos.

[00:06:05]

According to my reporting, at least a million hours of YouTube videos were scraped off of that video-sharing site and fed into this speech recognition system in order to produce new text for training OpenAI's chatbot. And YouTube's terms of service do not allow a company like OpenAI to do this. YouTube, which is owned by Google, explicitly says you are not allowed to, in internet parlance, scrape videos en masse from across YouTube and use those videos to build a new application. That is exactly what OpenAI did. According to my reporting, employees at the company knew that it broke YouTube's terms of service, but they resolved to do it anyway.

[00:07:04]

So, Cade, this makes me want to understand what's going on over at Google, which, as we have talked about in the past on the show, is itself thinking about and developing its own artificial intelligence model and product.

[00:07:18]

Well, as OpenAI scrapes up all these YouTube videos and starts to use them to build their chatbot, according to my reporting, some employees at Google, at the very least, are aware that this is happening.

[00:07:34]

They are?

[00:07:35]

Yes. Now, when we went to the company about this, a Google spokesman said it did not know that OpenAI was scraping YouTube content and said the company takes legal action over this kind of thing when there's a clear reason to do so. But according to my reporting, at least some Google employees turned a blind eye to OpenAI's activities because Google was also using YouTube content to train its AI. Wow. So if they raise a stink about what OpenAI is doing, they end up shining a spotlight on themselves, and they don't want to do that.

[00:08:14]

I guess I want to understand what Google's relationship is to YouTube, because, of course, Google owns YouTube. So what is it allowed or not allowed to do when it comes to feeding YouTube data into Google's AI models?

[00:08:28]

It's an important distinction. Because Google owns YouTube, it defines what can be done with that data, and Google argues that its terms of service allow it to use that data. However, because of that copyright issue, because the copyright to those videos belongs to you and me, lawyers I've spoken to say people could take Google to court and try to determine whether or not those terms of service really allow Google to do this. So there's another legal gray area here, where although Google argues that it's okay, others may argue it's not.

[00:09:15]

Of course, what makes this also interesting is that you essentially have one tech company, Google, keeping the dirty little secret of another tech company, OpenAI, about basically stealing from YouTube, because it doesn't want people to know that it, too, is taking from YouTube. And so these companies are essentially enabling each other as they simultaneously seem to be bending or breaking the rules.

[00:09:43]

What this shows is that there is this belief, and it has been there for years within these companies, among their researchers, that they have a right to this data because they're on a larger mission to build a technology that they believe will transform the world. If you really want to understand this attitude, you can look at our reporting from inside Meta.

[00:10:10]

What does Meta end up doing, according to your reporting?

[00:10:15]

Well, like Google and other companies, Meta had to scramble to build artificial intelligence that could compete with OpenAI. Mark Zuckerberg was calling engineers and executives at all hours, pushing them to acquire the data that was needed to improve the chatbot. At one point, my colleagues and I got hold of recordings of these Meta executives and engineers discussing this problem: how they could get their hands on more data, and where they should try to find it. They explored all sorts of options. They talked about licensing books one by one at $10 a pop and feeding those into the model. They even discussed acquiring the book publisher Simon & Schuster and feeding its entire library into their AI model. But ultimately, they decided all that was just too cumbersome, too time-consuming. On the recordings of these meetings, you can hear executives talk about how they were willing to run roughshod over copyright law, ignore the legal concerns, and go ahead and scrape the internet and feed this stuff into their models. They acknowledged that they might be sued over this, but they talked about how OpenAI had done this before them, and that they, Meta, were just following what they saw as a market precedent.

[00:11:51]

Interesting. So they go from having conversations like, should we buy a publisher that has tons of copyrighted material, suggesting that they're very conscious of the legal terrain and what's right and what's wrong, to instead saying, let's just follow the OpenAI model, that blueprint, and do what we want to do, what we think we have a right to do, which is to just gobble up all this material across the internet.

[00:12:19]

It's a snapshot of that Silicon Valley attitude that we talked about. Because they believe they are building this transformative technology, because they are in this intensely competitive situation where money and power are at stake, they are willing to go there.

[00:12:43]

But what that means is that there is, at the birth of this technology, an original sin that can't really be erased.

[00:12:53]

It can't be erased, and people are beginning to notice, and they are beginning to sue these companies over it. These companies have to have this copyrighted data to build their systems. It is fundamental to their creation. If a lawsuit bars them from using that copyrighted data, that could bring down this technology.

[00:13:32]

We'll be right back. So, Cade, walk us through these lawsuits that are being filed against these AI companies based on the decisions they made early on to use technology as they did, and the chances that they could result in these companies not being able to get the data they so desperately say they need.

[00:13:54]

These suits are coming from a wide range of places. They're coming from computer programmers who are concerned that their computer programs have been fed into these systems. They're coming from book authors who have seen their books being used. They're coming from publishing companies. And they're coming from news organizations like The New York Times, incidentally, which has filed a lawsuit against OpenAI and Microsoft, organizations that are concerned about their news articles being used to build these systems.

[00:14:32]

Here, I think it's important to say, as a matter of transparency, Cade, that your reporting is separate from that lawsuit. That lawsuit was filed by the business side of The New York Times, by people who are not involved in your reporting or in this Daily episode. Just to get that out of the way.

[00:14:54]

Exactly.

[00:14:55]

I'm assuming that you have spoken to many lawyers about this, and I wonder if there's some insight you can shed on the basic legal terrain. I mean, do the companies seem to have a strong case that they have a right to this information, or do companies like the Times, who are suing them, seem to have a pretty strong case that that decision violates their copyrighted materials?

[00:15:20]

Like so many legal questions, this is incredibly complicated. It comes down to what's called fair use, the part of copyright law that determines whether companies can use copyrighted data to build new things. There are many factors that go into this. There are good arguments on the OpenAI side, and there are good arguments on the New York Times side. Copyright law says that you can't take my work, reproduce it, and sell it to someone. That's not allowed. But what's called fair use does allow companies and individuals to use copyrighted works in part. They can take snippets of them. They can take copyrighted works and transform them into something new. That is what OpenAI and others argue they're doing. But there are other things to consider: does that transformative work compete with the individuals and companies that supply the data, that own the copyrights?

[00:16:34]

Interesting.

[00:16:34]

Here, the suit between The New York Times Company and OpenAI is illustrative. If The New York Times creates articles that are then used to build a chatbot, does that chatbot end up competing with The New York Times? Do people end up going to that chatbot for their information rather than going to the Times website and actually reading the article? That is one of the questions that will end up deciding this case and cases like it.

[00:17:13]

What would it mean for these AI companies, for some or even all of these lawsuits to succeed?

[00:17:22]

Well, if these tech companies are required to license the copyrighted data that goes into their systems, if they're required to pay for it, that becomes a problem for these companies. We're talking about digital data the size of the entire internet. Licensing all that copyrighted data is not necessarily feasible. We quote the venture capital firm Andreessen Horowitz in our story, where one of its lawyers says that it does not work for these companies to license that data. It's too expensive. It's on too large a scale.

[00:18:09]

It would essentially make this technology economically impractical.

[00:18:14]

Exactly. A ruling by a jury or a judge against OpenAI could fundamentally change the way this technology is built. The extreme case is that these companies are no longer allowed to use copyrighted material in building these chatbots, and that means they have to start from scratch. They have to rebuild everything they've built. So this is something that not only imperils what they have today, it imperils what they want to build in the future.

[00:18:47]

And conversely, what happens if the courts rule in favor of these companies and say, You know what? This is fair use. You were fine to have scraped this material, and you can keep borrowing this material into the future, free of charge.

[00:19:03]

Well, one significant roadblock drops for these companies, and they can continue to gather up all that extra data, including images and sounds and videos and build increasingly powerful systems. But the thing is, even if they can access as much copyrighted material as they want, these companies may still run into a problem. Pretty soon, they're going to run out of digital data on the internet. That human-created data they rely on is going to dry up. They're using up this data faster than humans create it. One research organization estimates that by 2026, these companies will run out of viable data on the internet. Wow.

[00:19:57]

Well, in that case, what would these tech companies do? Where are they going to go if they've already scraped YouTube, if they've already scraped podcasts, if they've already gobbled up the internet, and all of that together is not sufficient?

[00:20:14]

What many people inside these companies will tell you, including Sam Altman, the chief executive of OpenAI, is that what they will turn to is what's called synthetic data. And what is that? That is data generated by an AI model that is then used to build a better AI model. It's AI helping to build better AI. That is the vision they ultimately have for the future: that they won't need all this human-generated text. They'll just have the AI build the text that will feed future versions of AI.

[00:21:03]

So they will feed the AI systems the material that the AI systems themselves create. But is that really a workable, solid plan? Is that considered high-quality data? Is it good enough?

[00:21:21]

If you do this on a large scale, you quickly run into problems. As we all know, as we've discussed on this podcast, these systems make mistakes. They hallucinate. They make stuff up, and they show biases that they've learned from internet data. If you start using the data generated by the AI to build new AI, those mistakes start to reinforce themselves. The systems start to get trapped in these cul-de-sacs where they end up not getting better, but getting worse.

[00:22:00]

What you're really saying is these AI machines need the unique perfection of the human creative mind.

[00:22:09]

Well, as it stands today, that is absolutely the case. But these companies have grand visions for where this will go. They feel, and they're already starting to experiment with this, that if you have an AI system that is sufficiently powerful, you can make a copy of it, so that you have two of these AI models: one can produce new data, and the other can judge that data. It can curate that data as a human would. It can provide the human judgment, so to speak. So as one model produces the data, the other one can judge it, discard the bad data, and keep the good data. That's how they ultimately see these systems creating viable synthetic data. But that has not happened yet, and it's unclear whether it will work.
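
To make that generator-and-judge loop concrete, here is a toy sketch. The generate() and score() functions are hypothetical stand-ins for calls to two copies of a large model; nothing here reflects any company's actual pipeline, only the filtering idea described above.

```python
# A toy sketch of the generator/judge idea: one model produces candidate
# synthetic data, a second model scores it, and only high-scoring examples
# are kept. generate() and score() are hypothetical stand-ins for two
# copies of a large model; this illustrates the concept, not any
# company's actual system.
import random

def generate(prompt: str) -> str:
    # Stand-in for model A: produce a candidate synthetic training example.
    return f"{prompt} -> " + random.choice(["a careful answer", "a sloppy answer"])

def score(candidate: str) -> float:
    # Stand-in for model B: judge quality the way a human curator would.
    return 1.0 if "careful" in candidate else 0.0

def build_synthetic_corpus(prompts: list[str], threshold: float = 0.5) -> list[str]:
    corpus = []
    for prompt in prompts:
        candidate = generate(prompt)
        if score(candidate) >= threshold:  # keep the good data,
            corpus.append(candidate)       # discard the bad
    return corpus

print(build_synthetic_corpus(["Explain photosynthesis.", "What is 2 + 2?"]))
```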

[00:23:04]

It feels like the real lesson of your investigation is that if you have to allegedly steal data to feed your AI model and make it economically feasible, then maybe you have a pretty broken model. And that if you need to create fake data as a result, which, as you just said, undermines AI's goal of mimicking human thinking and language, then maybe you really have a broken model. And so that makes me wonder if the folks you talk to, the companies that we're focused on here, ever ask themselves the question, Could we do this differently? Could we create an AI model that just needs a lot less data?

[00:23:45]

They have thought about other models for decades. The thing to realize here is that that is much easier said than done. We're talking about creating systems that can mimic the human brain. That is an incredibly ambitious task. After struggling with that for decades, these companies have finally stumbled on something that they feel works, that is a path to that incredibly ambitious goal, and they're going to continue to push in that direction. Yes, they're exploring other options, but those other options aren't working. What works is more data and more data and more data. And because they see a path there, they're going to continue down that path. And if there are roadblocks and they think they can knock them down, they're going to knock them down.

[00:24:45]

But what if the tech companies never get enough or make enough data to get where they think they want to go, even as they're knocking down walls along the way? That does seem like a real possibility.

[00:24:55]

If these companies can't get their hands on more data, then these technologies, as they're built today, stop improving. We will see their limitations. We will see how difficult it really is to build a system that can match, let alone surpass, the human brain. These companies will be forced to look for other options, technically, and we will see the limitations of these grandiose visions that they have for the future of artificial intelligence. Okay.

[00:25:39]

Thank you very much. We appreciate that.

[00:25:41]

Glad to be here.

[00:25:49]

We'll be right back. Here's what else you need to know today. Israeli leaders spent Monday debating whether and how to retaliate against Iran's missile and drone attack over the weekend. Herzi Halevi, Israel's military chief of staff, declared that the attack will be responded to. In Washington, a spokesman for the US State Department, Matthew Miller, reiterated American calls for restraint. Of course, we continue to make clear to everyone that we talk to that we want to see de-escalation, that we don't want to see a wider regional war. That's something that's been- But he emphasized that a final call about retaliation was up to Israel. Israel is a sovereign country. They have to make their own decisions about how best to defend themselves. What we always try to- And the first criminal trial of a former US president officially got underway on Monday in a Manhattan courtroom. Donald Trump, on trial for allegedly falsifying business records to cover up a sex scandal involving a porn star, watched as jury selection began. The initial pool of 96 jurors quickly dwindled. More than half of them were dismissed after indicating that they did not believe they could be impartial.

[00:27:19]

The day ended without a single juror being chosen. Today's episode was produced by Stella Tan, Michael Simon Johnson, Mooj Zadie, and Rikki Novetsky. It was edited by Marc Georges and Liz O. Baylen, contains original music by Diane Wong, Dan Powell, and Pat McCusker, and was engineered by Chris Wood. Our theme music is by Jim Brunberg and Ben Landsverk of Wonderly. That's it for The Daily. I'm Michael Barbaro. See you tomorrow.