[00:00:04]

Hi everyone, it's Elise Hu. You're listening to TED Talks Daily. Artificial intelligence is already a big force in our business world and our daily lives. But as data science executive Mainak Mazumdar lays out, the data that A.I. depends on has fallen short, and that means we're building possibly harmful biases that hurt marginalized groups into our algorithms.

[00:00:26]

His talk, from TED Salon 2020, raises some serious red flags, but it offers a way to reset before biases grow even more damaging.

[00:00:38]

A.I. could add 16 trillion dollars to the global economy in the next 10 years. This economy is not going to be built by billions of people or millions of factories, but by computers and algorithms. We have already seen amazing benefits of A.I. in simplifying tasks, bringing efficiencies and improving our lives.

[00:01:05]

However, when it comes to fair and equitable policy decision-making, A.I. has not lived up to its promise. A.I. is becoming a gatekeeper to the economy, deciding who gets a job and who gets access to a loan. A.I. is only reinforcing and accelerating our bias at speed and scale, with societal implications. So, is A.I. failing us? Are we designing these algorithms to deliver biased and wrong decisions? As a data scientist, I'm here to tell you it's not the algorithm, but the biased data, that's responsible for these decisions. To make A.I. possible for humanity and society,

[00:01:54]

we need an urgent reset. Instead of algorithms, we need to focus on the data. We're spending time and money to scale A.I. at the expense of designing and collecting high-quality and contextual data. We need to stop using the biased data that we already have and focus on three things: data infrastructure, data quality and data literacy. In June of this year, we saw an embarrassing bias in a Duke University A.I. model called PULSE, which enhanced a blurry image into a recognizable photograph of a person.

[00:02:36]

This algorithm incorrectly enhanced a nonwhite image into a Caucasian image. African-American images were underrepresented in the training set, leading to wrong decisions and predictions. Probably this is not the first time you have seen an A.I. misidentify a Black person's image. Despite an improved A.I. methodology, the underrepresentation of racial and ethnic populations still left us with biased results. This research is academic. However, not all data biases are academic; biases have real consequences.
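
To make the mechanism concrete, here is a toy sketch in Python, entirely synthetic and not the PULSE model itself, showing how a group that is underrepresented in training data ends up with far worse predictions:

```python
# Toy sketch (synthetic data, not the PULSE model) of how underrepresentation
# in training data degrades predictions for the underrepresented group.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_group(n, flipped):
    # One feature; the label rule is inverted for the underrepresented group,
    # so a model fit mostly to the majority group fails on the minority group.
    x = rng.normal(size=(n, 1))
    y = (x[:, 0] > 0).astype(int)
    if flipped:
        y = 1 - y
    return x, y

# Training set: 950 majority examples, only 50 minority examples.
x_maj, y_maj = make_group(950, flipped=False)
x_min, y_min = make_group(50, flipped=True)
model = LogisticRegression().fit(np.vstack([x_maj, x_min]),
                                 np.concatenate([y_maj, y_min]))

# Balanced held-out evaluation exposes the accuracy gap between groups.
for name, flipped in [("majority", False), ("minority", True)]:
    x_test, y_test = make_group(1000, flipped)
    print(f"{name} accuracy: {model.score(x_test, y_test):.2f}")
```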

[00:03:22]

Take the 2020 US Census. The census is the foundation for many social and economic policy decisions; therefore, the census is required to count 100 percent of the population in the United States. However, with the pandemic and the politics of the citizenship question, undercounting of minorities is a real possibility. I suspect significant undercounting of minority groups who are hard to locate, contact, persuade and interview for the census. Undercounting will introduce bias and erode the quality of our data infrastructure. Let's look at undercounts in the 2010 census: 16 million people were omitted from the final counts.

[00:04:08]

This is as large as the total population of Arizona, Arkansas, Oklahoma and Iowa put together for that year. We have also seen about a million kids under the age of five undercounted in the 2010 census. Now, undercounting of minorities is common in other national censuses, as minorities can be harder to reach, they're mistrustful of the government, or they live in areas under political unrest. For example, the Australian census in 2016 undercounted the Aboriginal and Torres Strait Islander population by about 17.5 percent. We estimate undercounting in 2020 to be much higher than in 2010.

[00:04:57]

And the implications of this bias can be massive. The census is the most trusted, open and publicly available rich data on population composition and characteristics. While businesses have proprietary information on consumers, the Census Bureau reports definitive, public counts on age, gender, ethnicity, race, employment and family status, as well as geographic distribution, which are the foundation of the population data infrastructure. When minorities are undercounted, A.I. models supporting public transportation, housing, health care and insurance are likely to overlook the communities that require these services most.
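
As a back-of-the-envelope illustration, the sketch below uses hypothetical population and budget figures (only the 17.5 percent undercount rate echoes the Australian example above) to show how funds allocated in proportion to measured counts shortchange an undercounted group:

```python
# Hypothetical numbers; only the 17.5 percent undercount rate comes from the
# 2016 Australian census example. Funds split in proportion to *measured*
# counts shortchange the undercounted group per actual resident.
true_counts = {"group_a": 1_000_000, "group_b": 200_000}
undercount = {"group_a": 0.00, "group_b": 0.175}

measured = {g: n * (1 - undercount[g]) for g, n in true_counts.items()}

budget = 120_000_000  # hypothetical total funding, allocated per measured head
total_measured = sum(measured.values())
for g in true_counts:
    allocated = budget * measured[g] / total_measured
    per_resident = allocated / true_counts[g]  # what each actual resident gets
    print(f"{g}: {allocated:,.0f} allocated, {per_resident:,.2f} per actual resident")
```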

[00:05:42]

The first step to improving A.I. results is to make the database representative of age, gender, ethnicity and race, per census data. Since the census is so important, we have to make every effort to count 100 percent. Investing in this data quality and accuracy is essential to making A.I. possible, not only for the few and privileged, but for everyone in society. Most A.I. systems use the data that's already available or was collected for some other purpose, because it's convenient and cheap.

[00:06:19]

Yet data quality is a discipline that requires commitment, real commitment.

[00:06:26]

This attention to data definition, data collection and the measurement of bias is not only underappreciated in the world of speed, scale and convenience, it is often ignored. As part of the Nielsen data science team, I went on field visits to collect data, visiting retail stores outside Shanghai and Bangalore. The goal of that visit was to measure retail sales from those stores. We drove miles outside the city and found these small stores: informal, hard to reach. And you may be wondering: why are we interested in these specific stores?

[00:07:05]

We could have selected a store in the city where the electronic data could be easily integrated into a data pipeline. Cheap, convenient and easy. Why are we so obsessed with the quality and accuracy of the data from these stores? The answer is simple: because the data from these rural stores matter. According to the International Labour Organization, 40 percent of Chinese and 65 percent of Indians live in rural areas. Imagine the bias in decisions when 65 percent of consumption in India is excluded from the models, meaning the decisions will favor the urban over the rural.

[00:07:49]

Without this rural context and signals on livelihood, lifestyle, economy and values, retail brands will make the wrong investments on pricing, advertising and marketing, or the urban bias will lead to wrong rural policy decisions with regards to health and other investments.
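
The size of that bias is easy to see with a small sketch; the spend figures below are made up, and only the 65 percent rural share comes from the ILO figure quoted above:

```python
# Hedged sketch with made-up spend figures; only the 65 percent rural share for
# India comes from the ILO figure quoted in the talk. Dropping rural stores
# shifts the national estimate toward urban behavior.
urban_avg_spend = 120.0   # hypothetical average monthly spend, urban shoppers
rural_avg_spend = 45.0    # hypothetical average monthly spend, rural shoppers
urban_share, rural_share = 0.35, 0.65

true_national = urban_share * urban_avg_spend + rural_share * rural_avg_spend
urban_only = urban_avg_spend  # what a model estimates if rural stores are excluded

print(f"true national average: {true_national:.2f}")
print(f"urban-only estimate:   {urban_only:.2f}")
print(f"bias from exclusion:   {urban_only - true_national:+.2f}")
```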

[00:08:11]

Wrong decisions are not a problem with the algorithm. They're a problem of data that excludes the areas we intend to measure in the first place. The data in its context is the priority, not the algorithms. Let's look at another example. I visited remote trailer park homes in the state of Oregon and New York City apartments to invite these homes to participate in Nielsen panels. Panels are statistically representative samples of homes that we invite to participate in the measurement over a period of time.
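
For readers curious about the general idea behind a representative panel, here is a minimal sketch of proportional stratified sampling; the strata and sizes are hypothetical, and this is not Nielsen's actual panel methodology:

```python
# Minimal sketch of proportional stratified sampling, the general idea behind a
# statistically representative panel. Illustrative only; strata, weights and
# sizes are hypothetical, not Nielsen's methodology.
import random

random.seed(0)

# Hypothetical sampling frame: each home belongs to one demographic stratum.
strata_weights = {"urban_cable": 0.55, "rural_cable": 0.30, "ota_antenna": 0.15}
frame = [{"home_id": i, "stratum": s}
         for i, s in enumerate(random.choices(list(strata_weights),
                                              weights=list(strata_weights.values()),
                                              k=10_000))]

def stratified_panel(frame, panel_size):
    # Sample each stratum in proportion to its share of the frame, so
    # hard-to-reach strata are represented instead of silently dropped.
    by_stratum = {}
    for home in frame:
        by_stratum.setdefault(home["stratum"], []).append(home)
    panel = []
    for homes in by_stratum.values():
        k = round(panel_size * len(homes) / len(frame))
        panel.extend(random.sample(homes, k))
    return panel

panel = stratified_panel(frame, 500)
print({s: sum(h["stratum"] == s for h in panel) for s in strata_weights})
```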

[00:08:49]

Our mission to include everybody in the measurement led us to collect data from these Hispanic and African-American homes who use over-the-air TV reception via an antenna.

[00:09:02]

Per Nielsen data, these homes constitute 15 percent of US households, which is about 45 million people. Commitment and focus on quality means we made every effort to collect information from these 15 percent hard-to-reach groups. Why does it matter? This is a sizable group that's very, very important to the marketers and brands, as well as the media companies. Without the data, the marketers and brands and their models would not be able to reach these folks, or show ads to these very, very important minority populations.

[00:09:44]

And without the ad revenue, broadcasters such as Telemundo or Univision would not be able to deliver free content, including news media, which is so foundational to our democracy. This data is essential for businesses and society. Our once-in-a-lifetime opportunity to reduce human bias in A.I. starts with the data. Instead of racing to build new algorithms, my mission is to build a better data infrastructure that makes A.I. possible. I hope you will join me in my mission as well.

[00:10:24]

Thank you. One more thing: do you have an idea worth spreading? TED is hosting a global Idea Search with a mission to hear big, bold ideas from every corner of the world. A select group of people from the application pool will be invited to give TED Talks, either virtually or in person. Apply to give your own TED

[00:10:46]

Talk at go.ted.com/ideassearch. Applications are due by January 21st, 2021. PRX.