[00:00:06]

Pushkin. You're listening to Brave New Planet, a podcast about amazing new technologies that could dramatically improve our world, or, if we don't make wise choices, could leave us a lot worse off. Utopia or dystopia? It's up to us. On November 11th, 2016, the Babel fish burst from fiction into reality. The Babel fish was conceived 40 years ago in Douglas Adams's science fiction classic The Hitchhiker's Guide to the Galaxy. In the story, a hapless earthling finds himself a stowaway on a Vogon spaceship.

[00:01:01]

When the alien captain starts an announcement over the loudspeaker, his companion tells him to stick a small yellow fish in his ear.

[00:01:13]

It's important to remember. It's the captain. I can't. You just put this in your ear.

[00:01:19]

Suddenly, he's able to understand the alien's language. The Babel fish is small, yellow, leech-like, and probably the oddest thing in the universe. It feeds on brain wave energy, absorbing all unconscious frequencies. The practical upshot of which is that if you stick one in your ear, you instantly understand anything said to you in any form of language.

[00:01:44]

At the time, the idea of sticking an instantaneous, universal translator in your ear seemed charmingly absurd. But a couple of years ago, Google and other companies announced plans to start selling Babel fish, well, not fish, actually, but earbuds that do the same thing. The key breakthrough came in November 2016, when Google replaced the technology behind its Translate program. Overnight, the Internet realized that something extraordinary had happened. A Japanese computer scientist ran a quick test.

[00:02:20]

He dashed off his own Japanese translation of the opening lines of Ernest Hemingway's short story, The Snows of Kilimanjaro, and dared Google Translate to turn it back into English. Here's the opening passage from the Simon and Schuster audio book.

[00:02:38]

Kilimanjaro is a snow-covered mountain nineteen thousand seven hundred ten feet high and is said to be the highest mountain in Africa. Its western summit is called the Masai "Ngaje Ngai," the House of God. Close to the western summit there is the dried and frozen carcass of a leopard. No one has explained what the leopard was seeking at that altitude. Let's just consider that last sentence. No one has explained what the leopard was seeking at that altitude. One day earlier, Google had mangled the back translation, quote, Whether the leopard had what the demands at that altitude, there is no that nobody explained.

[00:03:24]

But now Google Translate returned, quote, No one has ever explained what leopard wanted at that altitude. It was perfect, except for a missing "the." What explains the great leap? Well, Google had built a predictive algorithm that taught itself how to translate between English and Japanese by training on a vast library of examples and tweaking its connections to get better and better at predicting the right answer.

[00:03:59]

In many ways, the algorithm was a black box. No one understood precisely how it worked, but it did amazingly well. Predictive algorithms turn out to be remarkably general, they can be applied to predict which movies a Netflix user will want to see next or whether an eye exam or a mammogram indicates disease.

[00:04:22]

But it doesn't stop there. Predictive algorithms are also being trained to make societal decisions. Who to hire for a job, whether to approve a mortgage application, what students to let into college, what arrestees to let out on bail. But what exactly are these big black boxes learning from massive data sets? Are they gaining deep new insights about people, or might they sometimes be automating systemic biases? Today's big question: when should predictive algorithms be allowed to make big decisions about people, and before they judge us, should we have the right to know what's inside the black box?

[00:05:11]

My name is Eric Lander. I'm a scientist who works on ways to improve human health. I helped lead the Human Genome Project, and today I lead the Broad Institute of MIT and Harvard. In the 21st century, powerful technologies have been appearing at a breathtaking pace related to the Internet, artificial intelligence, genetic engineering and more. They have amazing potential upsides, but we can't ignore the risks that come with them. The decisions aren't just up to scientists or politicians, whether we like it or not.

[00:05:43]

We, all of us, are the stewards of a brave new planet. This generation's choices will shape the future as never before. Coming up on today's episode of Brave New Planet: predictive algorithms.

[00:06:10]

We hear from a physician at Google about how this technology might help keep millions of people with diabetes from going blind, and the idea was, well, if you could retrain the model, you could get to more patients to screen them for disease.

[00:06:24]

The first iteration of the model was on par with the U.S. board certified ophthalmologist.

[00:06:31]

I speak with an A.I. researcher about how predictive algorithms sometimes learn to be sexist and racist.

[00:06:39]

If you typed in "I am a white man," you would get positive sentiment. If you typed in "I'm a black lesbian," for example, negative sentiment.

[00:06:47]

We hear how algorithms are affecting the criminal justice system. For black defendants, it was much more likely to incorrectly predict that they were going to go on to commit a future crime when they didn't. And for white defendants, it was much more likely to predict that they were going to go on to not commit a future crime when they did. And we hear from a policy expert about whether these systems should be regulated.

[00:07:11]

A lot of the horror stories are about fully implemented tools that were in the works for years. There's never a pause button to re-evaluate or look at how a system is working in real time.

[00:07:22]

Stay with us. Hey there, I'm Bill Nye, host of Science Rules, where we talk about all the ways in which science rules our universe. You never know what you might learn on our show.

[00:07:34]

Evolution does some pretty funky things, talking about birds, learning from other birds. This is what we call a delicious dilemma in astrophysics. Oh, hey, here's a thing this field doesn't actually understand. Stay tuned. Turn it up. Wow. There are worlds outside our solar system. There are thousands and thousands of other worlds. I can totally talk to this cuttlefish.

[00:07:55]

We're also bringing you expert analysis on the biggest science story of them all, the coronavirus.

[00:08:02]

This is about the health of the whole planet. Everybody has to take a calculated risk. I've just reviewed this literature. How bad does it have to get before everybody pays attention? Whatever your problem, wherever you are in the universe, science rules.

[00:08:17]

Science Rules is out right now. Subscribe on Stitcher, Apple Podcasts, Spotify or wherever you listen.

[00:08:27]

Chapter one, the big black box. To better understand these algorithms, I decided to speak with one of the creators of the technology that transformed Google Translate.

[00:08:38]

My name is Greg Corrado and I'm a distinguished scientist at Google Research.

[00:08:42]

Early in his career, Greg had trained in neuroscience, but he soon shifted his focus from organic intelligence to artificial.

[00:08:51]

And that turned out to be really a very lucky moment, because I was becoming interested in artificial intelligence at exactly the moment that artificial intelligence was changing so much.

[00:09:07]

Ever since the field of artificial intelligence started more than 60 years ago, there have been two warring approaches to teaching machines to do human tasks. We might call them human rules versus machine learning.

[00:09:18]

The way that we used to try to get computers to recognize patterns was to program into them specific rules. So we would say, oh, well, you can tell the difference between a cat and a dog by how long its whiskers are, what kind of fur it has, and whether it has stripes, and we tried to put these rules into computers. It kind of worked, but it made for a lot of mistakes.

[00:09:43]

The other approach was machine learning. Let the computer figure everything out for itself, somewhat like the biological brain.

[00:09:51]

The machine learning system is actually built of tiny little decision makers or neurons. They start out connected very much in random ways, but we give the system feedback. So, for example, if it's guessing between a cat and a dog and it gets one wrong, we tell the system that it got one wrong and we make little changes inside so that it's much more likely to recognize that cat as a cat and not mistake it for a dog. Over time, the system gets better and better and better.
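For readers who want to see Corrado's feedback loop in code, here is a minimal sketch: a single artificial "neuron" classifying invented cat-versus-dog examples, with its connection weights nudged each time it gets one wrong. The features and numbers are made up purely for illustration.

```python
# A toy version of the feedback loop described above: one artificial "neuron"
# guesses cat vs. dog from two made-up features, and each time it gets one
# wrong we make small changes to its connections so it is more likely to be
# right the next time.

# Invented training examples: (whisker_length_cm, has_stripes) -> 1 = cat, 0 = dog
examples = [((7.0, 1.0), 1), ((6.5, 0.0), 1), ((8.0, 1.0), 1),
            ((15.0, 0.0), 0), ((12.0, 0.0), 0), ((14.0, 1.0), 0)]

weights = [0.0, 0.0]          # the connections start out uninformative
bias = 0.0
learning_rate = 0.05

def predict(features):
    score = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if score > 0 else 0          # 1 means "cat", 0 means "dog"

for epoch in range(20):                    # many passes over the examples
    for features, label in examples:
        error = label - predict(features)  # 0 if right, +1 or -1 if wrong
        if error:                          # only adjust when it got one wrong
            for i, x in enumerate(features):
                weights[i] += learning_rate * error * x
            bias += learning_rate * error

print(predict((7.5, 1.0)))   # prints 1: short whiskers plus stripes looks like a cat
```

Over the passes, the system gets better and better, exactly as described, though a real network stacks millions of such units rather than one.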

[00:10:22]

Machine learning had been around for decades, with rather unimpressive results.

[00:10:27]

The number of connections and neurons in those early systems was pretty small.

[00:10:33]

We didn't realize until about 2010 that computers had gotten fast enough and the data sets were big enough that these systems could actually learn from patterns and learn from data better than we could describe rules manually.

[00:10:53]

Machine learning made huge leaps. Google itself became the leading driver of machine learning. In 2011, Corrado joined with two colleagues to form a unit called Google Brain. Among other things, they applied a machine learning approach to language translation. The strategy turned out to be remarkably effective.

[00:11:18]

It doesn't learn French the way you would learn French in high school. It learns French the way you would learn French at home, much more like the way that a child learns the language. We give the machine the English sentence, and then we give it an example of a French translation of that whole sentence. We show a whole lot of them, probably more French and English sentences than you could read in your whole life. And by seeing so many examples of entire sentences, this system is able to learn, oh, this is how I would say this in French.

[00:11:55]

That's actually, at this point, about as good as a bilingual human would produce.

[00:12:02]

Soon, Google was training predictive algorithms for all sorts of purposes. We use neural network predictors to help rank search results, to help people organize their photos, to recognize speech, to find driving directions, to help complete emails. Really, anything that you can think of where there's some notion of finding a pattern or making a prediction, artificial intelligence might be at play. Predictive algorithms have become ubiquitous in commerce. They let Netflix know which movies to recommend to each customer, Amazon which products users might be interested in purchasing, and much more.

[00:12:41]

While they're shockingly useful, they can also be inscrutable. Modern neural networks are like a black box. Understanding how they make their predictions can be surprisingly difficult.

[00:12:54]

When you build an artificial neural network, you do not necessarily understand exactly the final state of how it works. Figuring out how it works becomes its own science project.

[00:13:06]

One thing we do know, predictive algorithms are especially sensitive to the choice of examples used to train them.

[00:13:15]

The systems learn to imitate the examples in the data that they see. You don't know how well they will do on things that are very different. So, for example, if you train a system to recognize cats and dogs, but you only ever show it border collies and tabby cats, it's not clear what it will do when you show it a picture of a Chihuahua. If all it's ever seen is border collies, it may not get the right answer.

[00:13:43]

So its concept of dog is going to be limited by the dogs it's seen. That's right. And this is why diversity of data in machine learning systems is so important. You have to have a data set that represents the entire spectrum of possibilities that you expect the system to work under.

[00:14:03]

Teaching algorithms turns out to be not so different from teaching people. They learn what they see.

[00:14:13]

Chapter two, retinal funduscopy. It's cool that predictive algorithms can learn to translate languages and suggest movies, but what about more life-changing applications?

[00:14:26]

My name is Lily Peng. I am a physician by training and I am a product manager at Google.

[00:14:32]

I went to visit Dr. Peng because she and her colleagues are using predictive algorithms to help millions of people avoid going blind.

[00:14:41]

So diabetic retinopathy is a complication of diabetes that affects the back of the eye, the retina. One of the devastating complications is vision loss. All patients that have diabetes need to be screened once a year for diabetic retinopathy. This is an asymptomatic disease, which means that you do not feel the symptoms. You do not experience vision loss until it's too late.

[00:15:04]

Now, diabetes is epidemic around the world. How many diabetics are there?

[00:15:09]

By most estimates, there are over four hundred million patients in the world with diabetes.

[00:15:14]

How do you screen a patient to see whether they have diabetic retinopathy?

[00:15:20]

You need to have a special camera, a fundus camera, and it takes a picture through the peephole of the back of the eye. We have a very small supply of retina specialists and eye doctors, and they do a lot more than reading images. So we needed to scale the reading of these images. With four hundred million people with diabetes, there just aren't enough specialists for all the retinal images that need reading, especially in some countries in Asia where resources are limited and the incidence of diabetes is skyrocketing. Two hospitals in southern India recognized the problem and reached out to Google for help.

[00:15:59]

At that point, Google was already sort of well known for image recognition. We were classifying cats and dogs in consumer images, and the idea was, well, if you could retrain the model to recognize diabetic retinopathy, you could potentially help the hospitals in India get to more patients to screen them for disease.

[00:16:22]

How did you and your colleagues set out to attack this problem? So when I first started the project, we had about one hundred thirty thousand images from hospitals in India, as well as a screening program in the US. We also gathered an army of ophthalmologists to grade them; 880,000 diagnoses were rendered on those one hundred thirty thousand images. So we took this training data and we put it in a machine learning model, and the first iteration of the model was on par with US board-certified ophthalmologists.
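The "retrain the model" idea Lily describes is, in practice, transfer learning: reuse an image network trained on everyday photos and teach only a small new layer to read graded fundus photographs. Here is a minimal sketch in Python with Keras; the directory layout, image size, and two-class labels are assumptions for illustration, not the team's actual pipeline.

```python
# Minimal transfer-learning sketch in the spirit of the retinopathy model:
# start from an Inception-v3 network pre-trained on everyday photos and train
# a small new "head" on graded fundus photographs. Paths and labels are
# hypothetical; the real system used ~130,000 images with ~880,000 grades.
import tensorflow as tf

IMG_SIZE = (299, 299)

# Assumes fundus photos sorted into subfolders, e.g. "referable/" and "none/".
train_ds = tf.keras.utils.image_dataset_from_directory(
    "fundus_photos/train", image_size=IMG_SIZE, batch_size=32)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "fundus_photos/val", image_size=IMG_SIZE, batch_size=32)

base = tf.keras.applications.InceptionV3(
    include_top=False, weights="imagenet", pooling="avg")
base.trainable = False   # reuse the pre-trained pattern detectors as-is

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(scale=1.0 / 127.5, offset=-1),  # map pixels to [-1, 1]
    base,
    tf.keras.layers.Dense(1, activation="sigmoid"),           # probability of disease
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC(name="auc")])
model.fit(train_ds, validation_data=val_ds, epochs=5)
```

With the training data in hand, retraining on a new disease is mostly a matter of swapping the labeled images, which is why later iterations could finish overnight.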

[00:16:56]

Since then, we've made some improvements.

[00:16:58]

The model, the initial training, took about how long? The first time we trained a model, it may have taken a couple of weeks. But then the second time, you train the next models and the next models, and it just gets shorter and shorter. Sometimes overnight. Sometimes overnight.

[00:17:14]

Well, yes. By contrast, how long does it take to train a board-certified ophthalmologist?

[00:17:20]

So that usually takes at least five years. And then you also have additional fellowship years to specialize in the retina.

[00:17:28]

And at the end of that, you only have one board certified ophthalmologist. Yes. At the end of that, you have one very, very well trained doctor. But that doesn't scale. Yes. So by contrast, a model like this scales worldwide and never fatigues. It consistently gives the same diagnosis on the same image, and it obviously takes a much shorter time to train. That being said, it does a very, very narrow task that is just a very small portion of what that doctor can do.

[00:18:03]

The retina screening tool is already being used in India. It was recently approved in Europe and it's under review in the United States.

[00:18:11]

Groups around the world are now working on other challenges in medical imaging, like detecting breast cancers at earlier stages. But I was particularly struck by a surprising discovery by Lily's team: unexpected information about patients was hiding in their retinal pictures.

[00:18:30]

In the fundus image, there are blood vessels. And so one of the thoughts that we had was, because you can see these vessels, I wonder if we can predict cardiovascular disease from the same image. So we did an experiment where we took fundus images and we trained a model to predict whether or not that patient would have a heart attack in five years. We found that we could tell whether or not this patient may have a cardiovascular event much better than doctors.

[00:19:03]

It speaks to what might be in this data that we have overlooked. The model could make predictions that doctors couldn't from the same type of data.

[00:19:14]

It turned out the computer could also do a reasonable job of predicting a patient's sex, age and smoking status.

[00:19:23]

The first time I did this with an ophthalmologist, I think she thought I was trolling her. I said, well, here's pictures. Guess which one is a woman. Guess which one's a man. Guess which one's a smoker. Guess which one is young. Right? These are all tasks that doctors don't generally do with these images. Turns out the model was right 98, 99 percent of the time. That being said, there are much easier ways of getting these facts about a patient.

[00:19:49]

So while scientifically interesting, this is one of the most useless clinical predictions ever.

[00:19:56]

So how far can it go? Can it tell whether you have a preference for rock music or not?

[00:20:02]

What do you think? You know, we tried predicting happiness. That didn't work. So I'm guessing rock music. Oh, probably not. But who knows?

[00:20:13]

So predictive algorithms can learn a remarkable range of tasks and they can even discover hidden patterns that humans miss.

[00:20:21]

We just have to give them enough training data to learn from. Sounds pretty fantastic. What could possibly go wrong?

[00:20:33]

Chapter three, what could possibly go wrong? If predictive algorithms can use massive data to discover unexpected connections between your eye and your heart, what might they be learning about, say, human society? To answer this question, I took a trip to speak with Kate Crawford, the co-founder and co-director of the AI Now Institute at New York University. When we began, we were the world's first A.I. institute dedicated to studying the social implications of these tools. To me, these are the biggest challenges that we face right now, simply because we've spent decades looking at these questions through a technical lens, at the expense of looking at them through a social and an ethical lens.

[00:21:19]

I knew about Kate's work because we served together on a working group about artificial intelligence for the U.S. National Institutes of Health. I also knew she had an interesting background.

[00:21:31]

I grew up in Australia. I studied a really strange grab bag of disciplines. I studied law, I studied philosophy. And then I got really interested in computer science.

[00:21:42]

And this was happening at the same time as I was writing electronic music on large scale modular synthesizers. And that's still a thing that I do today. It's, um, it's almost like the opposite of artificial intelligence because it's so analog. So I absolutely love it for that reason.

[00:21:58]

In the year 2000, Kate's band released an album entitled 20/20 that included a prescient song called "Machines Work." [song: "...so that people have time to think."]

[00:22:16]

It's funny because we use a sample from an early IBM promotional film that was made in the 1960s, which says machines can do the work so that people have time to think.

[00:22:27]

And we actually ended up sort of cutting it and splicing it in the track. So it ends up saying, no, people could do the work so that machines have time to think.

[00:22:35]

And strangely, the more that I've been working in the sort of machine learning space, I think, yeah, there's a lot of ways in which actually people are doing the work so that machines can do all the thinking.

[00:22:50]

Kate gave me a crash course on how predictive algorithms not only teach themselves language skills, but also in the process acquire human prejudices, even in something as seemingly benign as language translation.

[00:23:06]

So in many cases, if you, say, translate a sentence like "she is a doctor" into a language like Turkish, and then you translate it back into English... And you're saying Turkish because Turkish has pronouns that are not gendered.

[00:23:20]

Precisely. And so you would expect that you would get the same sentence back, but you do not. It will say he is a doctor.

[00:23:27]

So "she is a doctor" was translated into gender-neutral Turkish as "o bir doktor," which was then back-translated into English as "he is a doctor." In fact, you could see how much the predictive algorithms had learned about gender roles just by giving Google Translate a bunch of gender-neutral sentences in Turkish.

[00:23:49]

You got he is an engineer.

[00:23:52]

She is a cook. He is a soldier. But she is a teacher. He is a friend. But she is a lover. He is happy and she is unhappy. I find that one particularly odd.
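The round-trip probe Kate describes is easy to script. Below is a minimal sketch; the translate() helper is a canned stand-in that simply replays the behavior reported in the episode rather than a call to any real translation service, so swap in whatever API you actually use.

```python
# Sketch of the round-trip bias probe: push a gendered English sentence through
# a language with genderless pronouns (Turkish) and back, and check which
# pronoun the model reinstates. translate() below is a canned stand-in for a
# real translation API, replaying only the example discussed above.
CANNED = {
    ("she is a doctor", "en", "tr"): "o bir doktor",
    ("o bir doktor", "tr", "en"): "he is a doctor",   # the reported behavior
}

def translate(text, source, target):
    return CANNED.get((text, source, target), text)

sentence = "she is a doctor"
turkish = translate(sentence, "en", "tr")
round_trip = translate(turkish, "tr", "en")
print(f"{sentence!r} -> {turkish!r} -> {round_trip!r}")
assert round_trip != sentence   # the original pronoun did not survive the round trip
```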

[00:24:03]

And it's not just language translation that's problematic. The same sort of issues arise in language understanding. Predictive algorithms were trained to learn analogies by reading lots of text. They concluded that dog is to puppy as cat is to kitten, and man is to king as woman is to queen. But they also automatically inferred that man is to computer programmer as woman is to homemaker. And with the rise of social media, Google used text on the Internet to train predictive algorithms to infer the sentiment of tweets and online reviews.
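The analogy behavior comes from word vectors, where analogies are literally arithmetic on the learned representations. A brief sketch follows, assuming the gensim library and its downloadable "word2vec-google-news-300" vectors; the exact completions depend on which pretrained embedding you load, and not every phrase is guaranteed to be in its vocabulary.

```python
# Sketch of the analogy test described above, using pretrained word vectors.
# Assumes gensim and its downloadable GoogleNews embedding (~1.6 GB).
import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")

# "man is to king as woman is to ?"  (vector arithmetic: king - man + woman)
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# The same arithmetic applied to occupations surfaces learned stereotypes,
# e.g. "man is to computer_programmer as woman is to ?"
print(vectors.most_similar(positive=["computer_programmer", "woman"],
                           negative=["man"], topn=3))
```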

[00:24:43]

Is it a positive sentiment? Is it a negative sentiment? I believe it was Google who released a sentiment engine, and you could just try it online and put in a sentence and see what you get. And again, similar problems emerged. If you typed in "I am a white man," you would get positive sentiment. If you typed in "I'm a black lesbian," for example...

[00:25:01]

Negative sentiment. Just as Greg Corrado explained with Chihuahuas and border collies, the predictive algorithms were learning from the examples they found in the world.

[00:25:12]

And those examples reflected a lot about past practices and prejudices.

[00:25:18]

If we think about where you might be scraping large amounts of text from, say, Reddit, for example, and you're not thinking about how that sentiment might be biased against certain groups, then you're just basically importing that directly into your tool.

[00:25:33]

But it's not just conversations on Reddit. There's the cautionary tale of what happened when Amazon let a computer teach itself how to sift through mountains of resumes for computer programming jobs to find the best candidates to interview. So they set up this system, they designed it, and what they found was that very quickly this system had learned to discard and really demote the applications from women.

[00:25:59]

And specifically, if you had a women's college mentioned and even if you had the word women's on your resume, your application would go to the bottom of the pile.

[00:26:10]

All right. So how does it learn that? So, first of all, we take a look at who is generally hired by Amazon. And, of course, they have a very heavily skewed male workforce. And so the system is learning that these are the sorts of people who will tend to be hired and promoted. And it is not a surprise then that they actually found it impossible to really retrain the system. They ended up abandoning this tool because simply correcting for a bias is very hard to do when all of your ground truth data is so profoundly skewed in a particular direction.

[00:26:43]

So Amazon dropped this particular machine learning project, and Google fixed the Turkish-to-English problem. Today, Google Translate gives both "he is a doctor" and "she is a doctor" as translation options.

[00:26:57]

But biases keep popping up in predictive algorithms. In many settings, there's no systematic way to prevent them.

[00:27:04]

Instead, spotting and fixing biases has become a game of whack-a-mole. Chapter four, quarterbacks. Perhaps it's no surprise that algorithms trained in the wild west of the Internet or on tech industry hiring practices learn serious biases. But what about more sober settings, like a hospital?

[00:27:31]

I talked with someone who recently discovered similar problems with potentially life-threatening consequences.

[00:27:38]

Hi, I'm Christine Vogeli. I'm the director of evaluation research at Partners Health Care here in Boston.

[00:27:45]

Partners Health Care, recently rebranded as Mass General Brigham, is the largest health care provider in Massachusetts, a system that has 6,000 doctors and a dozen hospitals and serves more than a million patients.

[00:28:00]

As Christine explained to me, the role of health care providers in the U.S. has been shifting.

[00:28:05]

The responsibility for controlling costs and ensuring high quality services is now being put down on the hospitals and the doctors. And to me, this makes a lot of sense, right? We really should be the ones responsible for ensuring that there's good quality care and that we're doing it efficiently.

[00:28:22]

Health care providers are especially focusing their attention on what they call high risk patients.

[00:28:29]

Really, what it means is that they have both multiple chronic illnesses and relatively acute chronic illnesses.

[00:28:36]

So give me a set of conditions that a patient might have.

[00:28:40]

Right. So somebody, for example, with cardiovascular disease occurring with diabetes and, you know, maybe they also have depression. They're just kind of suffering and trying to get used to having that complex illness and how to manage it.

[00:28:51]

Partners Health Care offers a program to help these complex patients.

[00:28:55]

We have a nurse or a social worker who works as a care manager, who would help with everything from education to care coordination services. But really, that care manager works essentially as a quarterback, arranges everything, but also provides hands-on care to the patient and the caregiver.

[00:29:12]

Yeah, I think it's a wonder how we expect patients to go figure out all the things they're supposed to be doing and how to interact with the medical system without a quarterback.

[00:29:22]

It's incredibly complex. These patients have multiple specialists who are interacting with the primary care physician. They need somebody to be able to tie it together and be able to create a care plan for them that they can follow, and it pulls everything together from all those specialists.

[00:29:38]

Partners Health Care found that providing complex patients with quarterbacks both saved money and improved patients' health.

[00:29:46]

For example, they had fewer emergency visits each year. So Partners developed a program to identify the top three percent of patients with the greatest need for the service. Most were recommended by their physicians, but they also used a predictive algorithm, provided by a major health insurance company, that assigns each patient a risk score. What does the algorithm do?

[00:30:10]

When you look at the Web page, it really describes itself as a tool to help identify high risk patients.

[00:30:17]

And that term is a really interesting term to me. What makes a patient high risk? So I think from an insurance perspective, risk means these patients are going to be expensive. From a health care organization perspective, these are patients who we think we could help. And that's the fundamental challenge on this one.

[00:30:37]

When the team began to look closely at the results, they noticed that people recommended by the algorithm were strikingly different than those recommended by their doctor.

[00:30:47]

We noticed that black patients overall were underrepresented. Patients with similar numbers of chronic illnesses, if they were black, had a lower risk score than if they were white. And it didn't make sense to us.

[00:31:00]

Black patients identified by the algorithm turned out to have 26 percent more chronic illnesses than white patients with the same risk scores. So what was wrong with the algorithm?

[00:31:13]

It was because, given a certain level of illness, black and minority patients tend to use fewer health care services and whites tend to use more, even if they have the same level of chronic conditions.

[00:31:28]

That's right. So in some sense, the algorithm is correctly predicting the cost associated with the patient, but not the need.

[00:31:36]

Exactly.

[00:31:36]

It predicts costs very well, but we're interested in understanding patients who are sick and have needs.

[00:31:43]

It's important to say that the algorithm only used information about insurance claims and medical costs. It didn't use any information about a patient's race.

[00:31:53]

But of course, these factors are correlated with race due to longstanding issues in American society.

[00:32:00]

Frankly, we have fewer minority physicians than we do white physicians. So the level of trust minorities have in the health care system, we've observed, is lower. And we also know that there are just systematic barriers to care that certain groups of patients experience more. So, for example, race and poverty go together, and job flexibility, so all these issues with scheduling, being able to come in and being able to access services, are just heightened for minority populations relative to white populations.

[00:32:32]

So someone who just has less economic resources might not be able to get off work, might not have the flexibility with child care to be able to come in for a visit when they need to.

[00:32:44]

Exactly.

[00:32:45]

So it means that if one only relied on the algorithm, you wouldn't be targeting the right people.

[00:32:52]

Yes, we would be targeting more advantaged patients who tend to use a lot of health care services.

[00:32:57]

When they corrected the problem, the proportion of black patients in the high risk group jumped from 18 percent to 47 percent.

[00:33:06]

Christine, together with colleagues from several other institutions, wrote up a paper describing their findings. It was published in Science, the nation's leading research journal, in 2019. It made a big splash, not least because many other hospital systems were using the algorithm and others like it.

[00:33:26]

We've since changed the algorithm that we use to one that uses exclusively information about chronic illness and not health care utilization.

[00:33:36]

And has that worked? We're still testing. We think it's going to work, but as in all of these things, you really need to test it. You need to understand and see if there's actually any biases. In the end, you can't just adopt an algorithm. It's very important to be very conscious about what you're predicting. It's also very important to think about what are the factors you're putting into that prediction algorithm. Even if you believe the ingredients are right, you do actually have to see how it works in practice.
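To see concretely why the choice of prediction target matters, here is a toy illustration on synthetic data, in the spirit of what Christine describes but not her team's actual model. Group names, numbers and the access effect are all invented.

```python
# Toy illustration of "predicting cost versus predicting need": if one group
# uses less care at the same level of illness, ranking by cost under-selects
# that group even though illness is equal. All numbers are invented.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
group = rng.choice(["A", "B"], size=n)        # stand-ins, not real demographics
illness = rng.poisson(3, size=n)               # number of chronic conditions
access = np.where(group == "A", 1.0, 0.6)      # group B uses less care when sick
cost = illness * access * 1000 + rng.normal(0, 500, size=n)

top = int(0.03 * n)                            # flag the "top three percent"
by_cost = np.argsort(-cost)[:top]              # what a cost predictor targets
by_need = np.argsort(-illness)[:top]           # what an illness-based score targets

for name, idx in [("ranked by predicted cost", by_cost),
                  ("ranked by chronic illness", by_need)]:
    share_b = np.mean(group[idx] == "B")
    print(f"{name}: share of group B flagged = {share_b:.0%}")
```

Both rankings "work" on their own terms; only the second one answers the question the care managers actually care about.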

[00:34:03]

Anything that has to do with people's lives, you know, you have to be transparent about it. Chapter five, Compas. Transparency. Christine Vogeli and her colleagues were able to get to the bottom of the issue with the medical risk prediction because they had ready access to the Partners Health Care data and could test the algorithm. Unfortunately, that's not always the case. I traveled to New York to speak with a person who's arguably done more than anyone to focus attention on the consequences of algorithmic bias.

[00:34:41]

My name is Julia Angwin. I'm a journalist. I've been writing about technology for 25 years, mostly at the Wall Street Journal and ProPublica.

[00:34:50]

Julia grew up in Silicon Valley as the child of a mathematician and a chemist. She studied math at the University of Chicago but decided to pursue a career in journalism. Her quantitative skills gave her a unique lens to report on the societal implications of technology, and she eventually became interested in investigating high stakes algorithms.

[00:35:14]

When I learned that there was actually an algorithm that judges use to help decide how to sentence people, I was stunned. I thought, this is shocking. I can't believe it exists, and I'm going to investigate it. What we're talking about is a score that is assigned to criminal defendants in many jurisdictions in this country that aims to predict whether they will go on to commit a future crime. It's known as a risk assessment score. And the one that we chose to look at was called the Compas risk assessment score.

[00:35:47]

Based on the answers to a long list of questions, Compas gives defendants a risk score from one to ten.

[00:35:56]

In some jurisdictions, judges use the Compas score to decide whether a defendant should be released on bail before trial.

[00:36:04]

In others, judges use it to decide the length of sentence to impose on defendants who plead guilty or who are convicted at trial.

[00:36:12]

Julia had a suspicion that the algorithm might reflect bias against black defendants.

[00:36:18]

Attorney General Eric Holder had actually given a big speech saying he was concerned about the use of these scores and whether they were exacerbating racial bias. And so that was one of the reasons we wanted to investigate.

[00:36:30]

But investigating wasn't easy. Unlike Christine Vogeli at Partners Health Care, Julia couldn't inspect the Compas algorithm itself. Now, Compas isn't a modern neural network. It was developed by a company that's now called Equivant, and it's a much simpler algorithm, basically a linear equation that should be easy to understand. But it's a black box of a different sort: the algorithm is opaque because, to date, Equivant has insisted on keeping it a trade secret. Julia also had no way to download defendants' Compas scores from a website, so she had to gather the data herself.

[00:37:11]

Her team decided to focus on Broward County, Florida.

[00:37:16]

Florida has great public records laws, and so we filed a public records request, and we did end up getting 18,000 scores. We got scores for everyone who was arrested over a two-year period.

[00:37:28]

Eighteen thousand scores. All right. So then what did you do to evaluate these scores?

[00:37:34]

Well, the first thing we did when we got the eighteen thousand scores was actually we just threw them into a bar chart, black and white defendants. We immediately noticed there were really different looking distributions. For black defendants, the scores were evenly distributed, meaning one through ten, lowest risk to highest risk, there were equal numbers of black defendants in every one of those buckets. For white defendants, the scores were heavily clustered in the low risk range. And so we thought, there's two options.

[00:38:05]

All the white people getting scored in Broward County are legitimately really low risk. They're all Mother Teresa or there's something weird going on.

[00:38:14]

Julia sorted the defendants into those who were rearrested over the next two years and those who weren't. She compared the Compas scores that had been assigned to each group.

[00:38:25]

For black defendants, it was much more likely to incorrectly predict that they were going to go on to commit a future crime when they didn't. And for white defendants, it was much more likely to predict that they were going to go on to not commit a future crime when they did. There were twice as many false positives for black defendants as white defendants, and twice as many false negatives for white defendants as black defendants.
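ProPublica later released its data, so the error-rate comparison Julia describes can be reproduced in a few lines. The sketch below assumes a table with columns like those in that release (race, decile_score, two_year_recid) and a high-risk cutoff at a decile score of 5 or more; adjust the names and cutoff to whatever file and convention you actually use.

```python
# Sketch of the false positive / false negative comparison by race, assuming
# a CSV like ProPublica's released Compas data. Column names are assumptions.
import pandas as pd

df = pd.read_csv("compas-scores-two-years.csv")
df = df[df["race"].isin(["African-American", "Caucasian"])]
df = df.assign(high_risk=df["decile_score"] >= 5,
               reoffended=df["two_year_recid"] == 1)

for race, g in df.groupby("race"):
    # False positive: labeled high risk but not rearrested within two years.
    fpr = (g["high_risk"] & ~g["reoffended"]).sum() / (~g["reoffended"]).sum()
    # False negative: labeled low risk but rearrested within two years.
    fnr = (~g["high_risk"] & g["reoffended"]).sum() / g["reoffended"].sum()
    print(f"{race}: false positive rate {fpr:.0%}, false negative rate {fnr:.0%}")
```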

[00:38:47]

Julia described the story of two people whose arrest histories illustrate this difference.

[00:38:53]

A young eighteen-year-old black girl named Brisha Borden had been arrested after picking up a kid's bicycle from their front yard and riding it a few blocks. The mom came out, yelled at her, said, that's my kid's bike. She gave it back. But actually by then the neighbor had called the police, and so she was arrested for that.

[00:39:14]

And we compared her with a white man who had stolen about 80 dollars worth of stuff from a drug store, Vernon Prater. The teenager, Brisha Borden, got booked into jail. She got a high Compas score, an eight, predicting a high risk that she'd get rearrested. And Vernon Prater, he got a low score, a three.

[00:39:39]

Now, he had already committed two armed robberies and had served time. She was eighteen, she'd given back the bike. And of course, these scores turned out to be completely wrong. She did not go on to commit a future crime in the next two years. And he actually went on to break into a warehouse, steal thousands of dollars of electronics. And he's serving a 10-year sentence.

[00:40:03]

And so that's what the difference between a false positive and a false negative looks like: it looks like Brisha Borden and Vernon Prater. Chapter six, criminal attitudes. Julia Angwin and her team spent over a year doing research. In May 2016, ProPublica published their article, headlined Machine Bias, with the subtitle, quote, There's software used across the country to predict future criminals and it's biased against blacks. Julia's team released all the data they had collected so that anyone could check or dispute their conclusions.

[00:40:49]

What happened next was truly remarkable. The ProPublica article provoked an outcry from some statisticians who argued that the data actually proved Compas wasn't biased.

[00:41:02]

How could they reach the opposite conclusion? It turned out the answer depended on how you define bias. ProPublica had analyzed the Compas scores by looking backward, after the outcomes were known. Among people who were not rearrested, they found that black people had been assigned much higher risk scores than white people. That seemed pretty unfair. But statisticians use the word bias to describe how a predictor performs when looking forward, before the outcomes happen. It turns out that black people and white people who received the same risk score had roughly the same chance of being rearrested.

[00:41:46]

That seems pretty fair. So whether Compas was fair or unfair depended on your definition of fairness. This sparked an explosion of academic research. Mathematicians showed there's no way out of the problem: they proved a theorem saying it's impossible to build a risk predictor that's fair when looking both backward and forward, unless the arrest rates for black people and white people are identical, which they aren't.
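For readers who want to see why the two notions of fairness collide, here is one standard way the published impossibility results are written, with p a group's rearrest rate (prevalence), PPV its positive predictive value, and FPR and FNR its false positive and false negative rates.

```latex
% Counting a group's "labeled high risk" defendants two ways links the
% forward-looking and backward-looking error rates:
\[
  \mathrm{FPR} \;=\; \frac{p}{1-p}\cdot\frac{1-\mathrm{PPV}}{\mathrm{PPV}}\cdot\bigl(1-\mathrm{FNR}\bigr).
\]
% If two groups have different prevalences p but the score is equally well
% calibrated for both (same PPV), their FPR and FNR cannot both be equal:
% equal calibration forces unequal error rates, and vice versa.
```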

[00:42:30]

The ProPublica article also focused attention on many other ways in which Compas scores are biased. Like the health care algorithm that Christine Vogeli studied, Compas scores don't explicitly ask about a person's race, but race is closely correlated with both the training data and the inputs to the algorithm. First, the training data. Compas isn't actually trained to predict the probability that a person will commit another crime. Instead, it's trained to predict whether a person will be arrested for committing another crime. The problem is, there's abundant evidence that in situations where black people and white people commit crimes at the same rate, for example illegal drug use, black people are much more likely to get arrested.

[00:43:08]

So Compas is being trained on an unfair outcome.

[00:43:12]

Second, the questionnaire used to calculate Compas scores is pretty revealing. Some sections assess peers, work and social environment. The questions include: how many of your friends and acquaintances have ever been arrested? How many have been crime victims?

[00:43:32]

How often do you have trouble paying bills? Other sections are titled Criminal Personality and Criminal Attitudes.

[00:43:41]

They ask people to agree or disagree with statements such as, the law doesn't help average people, or, many people get into trouble because society has given them no education, jobs or future. In a nutshell, the predictor penalizes defendants who are honest enough to admit they live in high-crime neighborhoods and don't fully trust the system.

[00:44:06]

From the questionnaire, it's not hard to guess how a teenage black girl, arrested for something as minor as riding someone else's bicycle a few blocks and returning it, might have received a Compas score of eight. And it's not hard to imagine why racially correlated questions would do a good job of predicting racially correlated arrest rates. ProPublica didn't win a Pulitzer Prize for its article, but it was a remarkable public service. Chapter seven, Minority Report. Putting aside the details of Compas, I wanted to find out more about the role of predictive algorithms in courts.

[00:44:52]

I reached out to one of the leading legal scholars in the country.

[00:44:55]

I'm Martha Minow. I'm a law professor at Harvard and I have recently immersed myself in issues of algorithmic fairness.

[00:45:05]

Martha Minow has a remarkable resume. From 2009 to 2017, she served as dean of the Harvard Law School, following now Supreme Court Justice Elena Kagan. Martha also served on the board of the government-sponsored Legal Services Corporation, which provides legal assistance to low-income Americans.

[00:45:27]

She was appointed by her former law student, President Barack Obama. Martha became very interested in and concerned about the increasing use of algorithms in worlds that touch on her preoccupations with equal protection, due process, constitutional rights, fairness, and anti-discrimination.

[00:45:47]

Martha recently cosigned a statement with 26 other lawyers and scientists raising, quote, grave concerns about the use of predictive algorithms for pre-trial risk assessment. I asked her how courts had gotten involved in the business of prediction.

[00:46:05]

The criminal justice system has flirted with the use of prediction forever, including discussions from the 19th century on in this country about dangerousness and whether people should be detained preventively.

[00:46:20]

So far, that's not permitted in the United States. It appears in Minority Report and other interesting movies. The movie, starring Tom Cruise, tells the story of a future in which the Precrime division of the police arrests people for crimes they haven't yet committed.

[00:46:39]

I'm placing you under arrest for the future murder of Sarah Marks. We are arresting individuals who have broken no law.

[00:46:43]

But, well, the use of prediction in the context of sentencing is part of this rather large sphere of discretion that judges have to decide what kind of sentence fits the crime.

[00:46:59]

You're saying in sentencing one is allowed to use essentially information from the Precrime division about crimes that haven't been committed yet?

[00:47:09]

Well, I am horrified by that suggestion, but I think it's fair to raise it as a concern.

[00:47:16]

The problem is, if we actually acknowledge the purposes of the criminal justice system, some of them start to get into the future. So if one purpose is simply incapacitation, preventing this person from walking the streets because they might hurt someone else, there's a prediction built in. So judges have been factoring in predictions about a defendant's future behavior for a long time. And judges certainly aren't perfect; they can be biased or sometimes just cranky. There are even studies showing that judges hand down harsher sentences before lunch breaks than after.

[00:47:56]

Now, the defenders of risk prediction scores will say, well, the question is not what's the ideal.

[00:48:03]

But compared to what? And if the alternative is we're relying entirely on the individual judges and their prejudices, their lack of education, what they had for lunch, isn't this better, in that it'll provide some kind of scaffold for more consistency?

[00:48:23]

Journalist Julia Angwin has heard the same arguments.

[00:48:27]

Some good friends, right, who really believe in the use of these criminal score algorithms have said to me, look, Julia, the fact is judges are terribly biased and this is an improvement. And my feeling is it's probably true for some judges and maybe less true for other judges. But I don't think it is a reason to automate bias. Right. Like, I don't understand why you say, OK, humans are flawed, so why don't we make a flawed algorithm and bake it into every decision?

[00:48:57]

Because then it's really intractable.

[00:49:00]

Martha also worries that numerical risk scores are misleading. Judges think high numbers mean people are very likely to commit violent crime. In fact, the actual probability of violence is very low, about eight percent, according to a public assessment. And she thinks numerical scores can lull judges into a false sense of certainty.

[00:49:24]

There's an appearance of objectivity because it's math. But is it really? And then, for lawyers, they may have had no math, no numeracy education since high school.

[00:49:37]

Many people go to law in part because they don't want to do anything with numbers.

[00:49:42]

And there is a larger problem, which is the deference to expertise, particularly scientific expertise.

[00:49:51]

Finally, I wanted to ask Martha if defendants have a constitutional right to know what's inside the black box that's helping to determine their fate. I confess I thought the answer was an obvious yes, until I read a 2016 decision by the Wisconsin Supreme Court.

[00:50:10]

The defendant in that case, Eric Loomis, pled guilty to operating a car without the owner's permission and fleeing a traffic officer. When Loomis was sentenced, the presentencing report given to the judge included a Compas score that predicted Loomis had a high risk of committing future crimes. He was sentenced to six years in prison. Loomis appealed, arguing that his inability to inspect the Compas algorithm violated his constitutional right to due process. Wisconsin's Supreme Court ultimately decided that Loomis had no right to know how Compas worked.

[00:50:51]

Why?

[00:50:52]

First, the Wisconsin court said the score was just one of several inputs to the judge's sentencing decision. Second, the court said even if Loomis didn't know how the score was determined, he could still dispute its accuracy. Loomis appealed to the U.S. Supreme Court, but it declined to hear the case.

[00:51:13]

I find that troubling and not persuasive.

[00:51:17]

If it was up to you, how would you change the law?

[00:51:21]

I actually would require transparency for any use of any algorithm by a government agency or court that has the consequence of influencing not just deciding, but influencing decisions about individuals rights. And those rights could be rights to liberty, property opportunities.

[00:51:45]

So transparency, transparency, be able to see what this algorithm does. Absolutely.

[00:51:50]

And have the code and be able to give it to your own lawyer and your own experts.

[00:51:55]

But should a state be able to buy a computer program that's proprietary?

[00:52:01]

I mean, they would say, well, I'd love to give it to you, but it's proprietary, I can't. Should that be OK?

[00:52:06]

I think not, because if that then limits the transparency, that seems a breach. But, you know, this is a major problem, the outsourcing of government activity that has the effect of bypassing restrictions. Take another example. When the US government hires private contractors to engage in war activities, they are not governed by the same rules that govern the U.S. military.

[00:52:33]

So the government can get around constitutional limitations on the government by just outsourcing it to somebody who's not the government? It's currently the case, and I think that's wrong.

[00:52:45]

For her part, journalist Julia Angwin is baffled by the Wisconsin court's ruling.

[00:52:51]

I mean, we have this idea that you should be able to argue against whatever accusations are made. But I don't know how you make an argument against a score like the score says you're a seven, but you think you're a four. How do you make that argument? If you don't know how that seven was calculated? You can't make an argument that you're a four.

[00:53:15]

Chapter eight, robo recruiter. Even if you never find yourself in a criminal court filling out a Compas questionnaire, that doesn't mean you won't be judged by a predictive algorithm. There's actually a good chance it'll happen the next time you go looking for a job. I spoke to a scientist at a high tech company that screens job applicants.

[00:53:40]

My name is Lindsey Zuloaga, and I'm actually educated as a physicist, but now working for a company called HireVue. HireVue is a video interviewing platform. Companies create an interview; candidates can take it at any time that's convenient for them. So they go through the questions and they record themselves answering. So it's really a great substitute for kind of the resume, phone-screening part of the process. When a candidate takes a video interview, they're creating thousands of unique points of data. A candidate's verbal and nonverbal cues give us insight into their emotional engagement, thinking and problem solving style.

[00:54:25]

This combination of cutting edge A.I. and validated science is the perfect partner for making data driven talent decisions. HireVue. You know, we'll have a customer and they are hiring for something like a call center, say sales calls. And what we do is we look at past employees that applied and we look at their video interviews. We look at the words they said, tone of voice, pauses and facial expressions, things like that. And we look for patterns in how those people with good sales numbers behave as compared to people with low sales numbers.

[00:55:05]

And then we have this algorithm that scores new candidates as they come in. And so we help kind of get those more promising candidates to the top of the pile so they're seen more quickly.

[00:55:16]

So HireVue trains a predictive algorithm on video interviews of past applicants who turned out to be successful employees, but how does HireVue know its program isn't learning sexism or racism or other similar biases? There are lots of reasons to worry.

[00:55:35]

For example, studies from MIT have shown that facial recognition algorithms can have a hard time reading emotions from black people's faces. And how would HireVue's program evaluate videos from people who might look or sound different from the average employee? Say, people who don't speak English as a native language, who are disabled, who are on the autism spectrum, or even people who are just a little quirky. Well, Lindsey says HireVue tests for certain kinds of bias. So we audit the algorithm after the fact and see if it's scoring different groups differently in terms of age, race and gender.

[00:56:14]

So if we do see that happening, a lot of times that's probably coming from the training data. So maybe there is only one female software engineer in this data set; the model might mimic that bias. If we do see any of that adverse impact, we simply remove the features that are causing it, so we can say this model is being sexist.

[00:56:35]

How does the model even know what gender the person is?

[00:56:39]

So we look at all the features and we find the features that are most correlated to gender. And if there are any, we simply remove some of those features.
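The two checks Lindsey describes, comparing outcomes across groups and hunting for features correlated with a protected attribute, can be sketched generically. This is not HireVue's actual procedure; the file, column names and the 0.5 cutoff below are assumptions for illustration, and the four-fifths ratio is just the common rule of thumb for flagging adverse impact.

```python
# Generic sketch of an adverse-impact audit, not any vendor's real pipeline.
# (1) Compare selection rates across groups; (2) find input features most
# correlated with the protected attribute, as candidates to remove and retrain.
import pandas as pd

df = pd.read_csv("scored_candidates.csv")        # hypothetical audit extract
df["passed"] = df["model_score"] >= 0.5          # whatever cutoff the pipeline uses

# 1. Adverse impact: each group's selection rate relative to the best-off group.
rates = df.groupby("gender")["passed"].mean()
print(rates / rates.max())                        # ratios below ~0.8 are a red flag

# 2. Features most correlated with the protected attribute.
feature_cols = [c for c in df.columns if c not in {"gender", "model_score", "passed"}]
is_female = (df["gender"] == "female").astype(float)
numeric = df[feature_cols].select_dtypes("number")
correlations = numeric.corrwith(is_female).abs().sort_values(ascending=False)
print(correlations.head(10))                      # candidates to drop before retraining
```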

[00:56:46]

I asked Lindsey why people should believe HireVue's, or any company's, assurances, or whether something more was needed.

[00:56:55]

You seem thoughtful about this, but there will be many people coming into the industry over time who might not be as thoughtful or as sophisticated as you are.

[00:57:04]

Do you think it would be a good idea to have third parties come in to certify the audits for bias? I know that's a hard question.

[00:57:16]

I guess I... I kind of lean towards no. So you're talking about having a third-party entity that comes in and assesses, certifies the audit? You know, because you've described what I think is a really impressive process.

[00:57:33]

But, of course, how do we know what's true? You know, you could reveal all your algorithms, but that's probably not the thing you want to do. And so the next best thing is a certifier says, yes, this audit has been done. You know, your financials presumably get audited. Why not the results of the algorithm?

[00:57:52]

I guess a little of the reason I'm not sure about the certification is mostly just because I feel like I don't know how it would work exactly. Like, you're right, totally, that finances are audited. I haven't thought about it enough to have, like, a strong opinion that it should happen, because it's like, OK, we have all these different models, it's constantly changing.

[00:58:12]

How do they audit every single model all the time?

[00:58:17]

I was impressed with Lindsey's willingness as a scientist to think in real time about a hard question. And it turns out she kept thinking about it afterwards. A few months later, she wrote back to me to say that she'd changed her mind.

[00:58:34]

We do have a lot of private information, but if we don't share it, people tend to assume the worst. So I've decided, after thinking about it quite a bit, that I definitely support the third-party auditing of algorithms. Sometimes people, you know, assume we're doing horrible, horrible things, and that can be frustrating. But I do think being as transparent as we can about what we are doing is important.

[00:58:57]

Several months later, Lindsey emailed again to say that HireVue was now undergoing a third-party audit.

[00:59:06]

She says she's excited to learn from the results. Chapter nine, confronting the black box. So HireVue, at first reluctant, says it's now engaging external auditors. What about Equivant, whose Compas scores can heavily influence prison sentences, but which has steadfastly refused to let anyone even see how its simple algorithm works?

[00:59:36]

Well, just before we release this podcast, I checked back with them. A company spokesperson wrote that Equivant now agrees that the Compas scoring process, quote, should be made available for third-party examination.

[00:59:50]

But they weren't releasing it yet, because they first wanted to file for copyright protection on their simple algorithm. So we're still waiting. You might ask: should it be up to the companies to decide? Aren't there laws or regulations? The answer is, there's not much. Governments are just now waking up to the idea that they have a role to play. I traveled back to New York City to talk to someone who's been involved in this question.

[01:00:22]

My name's Rashida Richardson and I'm a civil rights lawyer that focuses on the social implications of artificial intelligence.

[01:00:30]

Rashida served as the director of policy research at the AI Now Institute at NYU, where she worked with Kate Crawford, the Australian expert in algorithmic bias that I spoke to earlier in the episode.

[01:00:42]

In twenty eighteen, New York City became the first jurisdiction in the U.S. to create a task force to come up with recommendations about government use of predictive algorithms or, as they call them, automated decision systems. Unfortunately, the task force got bogged down in details and wasn't very productive.

[01:01:04]

In response, Rashida led a group of 27 experts that wrote a 56-page shadow report, entitled Confronting Black Boxes, that offered concrete proposals.

[01:01:18]

New York City, it turns out, uses quite a few algorithms to make major decisions.

[01:01:25]

You have the school matching algorithms. You have an algorithm used by the child welfare agency here. You have public benefits algorithms that are used to determine who will qualify for public benefits, whether that's Medicaid or temporary food assistance, or have them terminated, or whether they'll receive access to those benefits. You have a gang database, which tries to identify who is likely to be in a gang, and that's used by both the DA's office and the police department.

[01:01:58]

If you have to make a guess, how many predictive algorithms are used by the city of New York?

[01:02:06]

I'd say upwards of 30.

[01:02:09]

And I'm underestimating with that number. How many of these 30-plus algorithms are transparent about how they work, about their code?

[01:02:22]

None. So what should New York do? If it were up to you, what should be the behavior of a responsible city with respect to the algorithms it uses?

[01:02:32]

I think the first step is creating greater transparency, some annual acknowledgement of what is being used, how it's being used, whether it's been tested or had a validation study.

[01:02:44]

And then you would also want general information about the inputs or factors that are used by these systems to make predictions, because in some cases you have factors that are just discriminatory or proxies for protected statuses like race, gender, ability status. All right.
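To make the idea of a proxy concrete, here is a minimal, hypothetical sketch of the kind of check an auditor might run: it measures how strongly a single candidate input tracks membership in a protected group. The feature name, the data and the flagging threshold are invented for illustration and are not drawn from any system discussed in this episode.

```python
# Hypothetical illustration: checking whether a candidate model input acts
# as a proxy for a protected attribute. All names and numbers are made up.
from statistics import mean, pstdev

def correlation(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    if sx == 0 or sy == 0:
        return 0.0
    cov = mean((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (sx * sy)

# A candidate input feature and a protected attribute (1 = member of a
# protected group, 0 = not). In a real audit these would come from the
# agency's own records, not be hard-coded.
commute_distance = [2.1, 3.4, 12.8, 11.9, 2.6, 13.5, 3.0, 12.2]
protected_group  = [0,   0,   1,    1,    0,   1,    0,   1]

r = correlation(commute_distance, protected_group)
print(f"correlation with protected attribute: {r:.2f}")
if abs(r) > 0.5:  # threshold chosen arbitrarily for illustration
    print("flag: this input may be acting as a proxy and deserves review")
```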

[01:03:02]

So step one, disclose what systems you're using. Yes. And then the second step, I think, is creating a system of audit, both prior to procurement and then, once procured, ongoing auditing of the system to at least have a gauge on what it's doing in real time. A lot of the horror stories we hear are about fully implemented tools that were in the works for years. There's never a pause button to reevaluate or look at how a system is working in real time.

[01:03:34]

And when I did studies on the use of predictive policing systems, I looked at 13 jurisdictions, and only one of them actually did a retrospective review of their system.
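What might such a retrospective review look at? One common starting point is simply comparing a deployed system's logged outcomes across groups. The sketch below is a hypothetical illustration in Python; the data, group labels and the four-fifths (80 percent) threshold are invented for this example rather than taken from any audit described here.

```python
# Hypothetical illustration: a retrospective check of a deployed decision
# system's logged outcomes, grouped by a protected attribute.
from collections import defaultdict

# Each logged decision: (group label, 1 if the tool gave a favorable outcome).
decision_log = [
    ("group_a", 1), ("group_a", 1), ("group_a", 0), ("group_a", 1),
    ("group_b", 0), ("group_b", 1), ("group_b", 0), ("group_b", 0),
]

favorable = defaultdict(int)
total = defaultdict(int)
for group, outcome in decision_log:
    total[group] += 1
    favorable[group] += outcome

rates = {g: favorable[g] / total[g] for g in total}
for g, rate in rates.items():
    print(f"{g}: favorable-outcome rate = {rate:.2f}")

# Compare the lowest rate to the highest; a ratio well below 0.8 is a
# common red flag that the system deserves closer scrutiny.
ratio = min(rates.values()) / max(rates.values())
print(f"ratio of lowest to highest rate: {ratio:.2f}")
if ratio < 0.8:
    print("flag: outcome rates differ enough to warrant a deeper review")
```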

[01:03:45]

So what's your theory about how you get the auditing done if you are going to outsource to third parties? I think it's going to have to be some approval process to assess their level of independence, but also any conflict of interest issues that may come up, and then also doing some thinking about what types of expertise are needed.

[01:04:05]

Because I think if you don't have someone who understands that social context or even the history of a certain government sector, then you could have a tool that is technically accurate and meets all of the technical standards, but it's still reproducing harm because it's not paying attention to that social context.

[01:04:24]

Should the government be permitted to purchase an automated decision system where the code can't be disclosed by contract?

[01:04:37]

No, and in fact, there is movement around creating more provisions requiring that vendors waive trade secrecy claims once they enter a contract with the government.

[01:04:48]

Rashida says we need laws to regulate the use of predictive algorithms both by governments and by private companies like HireVue. We're beginning to see bills being explored in different states. Massachusetts, Vermont and Washington, D.C., are considering setting up commissions to look at government use of predictive algorithms.

[01:05:09]

Idaho recently passed a first-in-the-nation law requiring that pretrial risk algorithms be free of bias and transparent. It blocks manufacturers of tools like COMPAS from claiming trade secret protection.

[01:05:27]

And at the national level, a bill was recently introduced in the U.S. Congress: the Algorithmic Accountability Act. The bill would require that private companies ensure certain types of algorithms are audited for bias. Unfortunately, it doesn't require that the results of the audit be made public, so there's still a long way to go. Rashida thinks it's important that regulations don't just focus on technical issues. They need to look at the larger context.

[01:05:58]

Part of the problem that we're identifying with these systems is that they're amplifying and reproducing a lot of the historical and current discrimination that we see in society.

[01:06:08]

There are large questions we've been unable to answer as a society, like how do you deal with the compounded effect of 50 years of discrimination? We don't have a simple answer, and there's not necessarily going to be a technical solution. But I think having access to more data and an understanding of how these systems are working will help us evaluate whether these tools are even adding value in addressing the larger social questions.

[01:06:33]

Finally, Kate Crawford says laws alone likely won't be enough. There's another thing we need to focus on.

[01:06:42]

In the end, it really matters who is in the room designing these systems. If you have people sitting around a conference table who all look the same, perhaps they all did the same type of engineering degree, perhaps they're all men, perhaps they're all pretty middle class or pretty well off, then they're going to be designing systems that reflect their world view.

[01:07:01]

What we're learning is that the more diverse those rooms are and the more we can question those kinds of assumptions, the better we can actually design systems for a diverse world. Conclusion: choose your planet. So there you have it, stewards of the brave new planet: predictive algorithms. A 60-year-old dream of artificial intelligence, machines making human-like decisions, has finally become a reality.

[01:07:36]

If a task can be turned into a prediction problem and if you've got a mountain of training data, algorithms can learn to do the job. Countless applications are possible: translating languages instantaneously, providing expert medical diagnoses for eye diseases and cancer to patients anywhere, improving drug development, all at levels comparable to or better than human experts. But it's also letting governments and companies make automatic decisions about you: whether you should get admitted to college, be hired for a job, get a loan, get housing assistance, be granted bail or get medical attention.

[01:08:20]

The problem is that algorithms that learn to make human-like decisions based on past human outcomes can acquire a lot of human biases about gender, race, class and more, often masquerading as objective judgment. Even worse, you usually don't even have a right to know you're being judged by a machine, or what's inside the black box, or whether the algorithms are accurate or fair.

[01:08:51]

Should laws require that automated decision systems used by governments or companies be transparent? Should they require public auditing for accuracy and fairness? And what exactly is fairness, anyway? Governments are just beginning to wake up to these issues, and they're not sure what they should do. In the coming years, they'll decide what rules to set, or perhaps to do nothing at all. So what can you do? A lot, it turns out. You don't have to be an expert and you don't have to do it alone.

[01:09:27]

Start by learning a bit more. Invite friends over for dinner, virtually or in person when it's safe, and debate what we should do, or organize a conversation at a book club, a faith group or a campus event. Then email your city or state representatives to ask what they're doing about the issue, maybe even proposing first steps like setting up a task force. When people get engaged, action happens.

[01:09:58]

You'll find lots of resources and ideas at our website, Brave New Planet dot org. It's time to choose our planet. The future is up to us.

[01:10:11]

Machines have time to think.

[01:10:21]

Brave New Planet is a co-production of the Broad Institute of MIT and Harvard, Pushkin Industries and The Boston Globe with support from the Alfred P. Sloan Foundation. Our show is produced by Rebecca Douglas with Merridew theme song composed by Ned Porter, Mastering and Sound Design by James Gava, fact checking by Joseph Fridmann and a Stitt and Enchante. Special thanks to Christine Heenan and Rachel Roberts at Clarendon Communications.

[01:10:49]

To Lee McGuire, Kristen Zerilli and Justin Levine, our friends at the Broad; to Mia Lobell and Heather Fain at Pushkin; and to Eli and Edythe Broad, who made the Broad Institute possible. This is Brave New Planet. I'm Eric Lander.