Our Relationship Status with Data: It’s complicated

Artificial intelligence works by harvesting data on human interaction — which can betray our biases

Illustrator: Peyton Garcia

In the wake of extensive news coverage of the Cambridge Analytica scandal, public awareness of how permeable and accessible our data is in the digital domain has increased dramatically. As a result, people have become more critical of the ways social media platforms and marketers collect user data in an effort to capture users’ attention for as long as possible.

But while these data issues are salient, the public lacks a clear understanding of how machines use data in everyday life, and of how those machines propagate and reflect human decisions and biases through a process called machine learning.

Machine learning, popularly dubbed “artificial intelligence,” is well known but not well understood. It runs on algorithms, which take in inputs and produce a prediction or result. This process has changed the way we interact both online and offline: algorithms drive YouTube’s video recommendation system and Facebook’s facial-recognition software for photos, for example. In machine learning, an algorithm’s inputs are massive datasets that give it feedback about which output is “best” for a given scenario.
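To make that input-output loop concrete, here is a minimal sketch in Python using scikit-learn. The data and scenario are invented for illustration; this is a generic classifier, not the actual system behind any platform’s recommendations.

```python
# A toy "machine learning" loop: the algorithm sees example inputs
# paired with known outcomes, fits a rule to them, and applies that
# rule to new inputs. Hypothetical data for illustration only.
from sklearn.linear_model import LogisticRegression

# Each row: [minutes watched, clicked "like"?]; the label records
# whether the user went on to watch a recommended video.
X_train = [[25, 1], [3, 0], [40, 1], [5, 0], [31, 1], [2, 0]]
y_train = [1, 0, 1, 0, 1, 0]

model = LogisticRegression()
model.fit(X_train, y_train)  # "learning": fit a rule to the dataset

# Predict for a new user who watched 20 minutes and clicked "like".
print(model.predict([[20, 1]]))  # [1] here means "recommend again"
```

Whatever patterns live in those example rows, good or bad, become the rule the model applies going forward.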

To determine the “best” outcome, an algorithm’s judgment must mirror human judgment. Unsurprisingly, then, most of the data that drives it comes from us. A dataset recording our actions, words, or demographics on a social media platform can be analyzed to unlock our decision-making processes. A different dataset listing the titles, authors, years, and topics of American journal articles containing the keyword “Russia” can inform a machine about trends in America-Russia relations. The machine essentially “learns” how we think and uses that information to mirror, exploit, or enhance our interactions with it.
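As a sketch of that journal-article example, the snippet below counts publications per year to surface a trend. It assumes the pandas library, and the table is invented for illustration.

```python
# Toy analysis of article metadata: how often does the keyword
# "Russia" appear over time, and under which topics? Invented data.
import pandas as pd

articles = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "year":  [2014, 2014, 2016, 2017, 2017],
    "topic": ["policy", "trade", "security", "security", "security"],
})

print(articles.groupby("year").size())   # articles per year: a crude trend line
print(articles["topic"].value_counts())  # what the coverage is about
```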

But because the data is a mirror of our judgments, it also mirrors our biases.

Public scrutiny of artificial intelligence focuses on legal consequences and privacy issues, but less on the nature of our interactions with technology. The public dialogue should expand to investigate the ways machines influence the human psyche. To begin that conversation, we must talk about some of the ethical issues with data.

Because we rely so heavily on data to guide our interactions, researchers have started turning their attention to problems with the data inputs themselves. Machines learn indiscriminately, so human biases evident in the data become evident in the machines’ resulting actions.

“Algorithm bias,” in which the decision an algorithm makes is skewed by its human-generated data inputs, has been heavily explored within the past two years. On May 23, 2016, ProPublica released an investigative article titled “Machine Bias” about how an algorithm-driven risk assessment, which rates the likelihood that a defendant will commit another crime, is biased against black defendants. According to the report, “the formula was particularly likely to falsely flag black defendants as future criminals, wrongly labeling them this way at almost twice the rate as white defendants,” while “white defendants were mislabeled as low risk more often than black defendants.” The assessment is used in court during criminal sentencing and at every step of the process where a defendant could be set free. The discrimination is made all the more urgent by the trust people place in data, the opacity of the algorithm itself, and the way it reinforces existing disparities in our justice system. In this sense, many of the issues present in algorithms reflect issues we have in real life.
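The disparity ProPublica describes can be checked with a simple fairness audit: compare false-positive rates, the share of people flagged high risk who did not in fact reoffend, across groups. A minimal sketch, with invented records:

```python
# Compare false-positive rates across two hypothetical groups.
# A false positive = labeled high risk but did not reoffend.
from collections import Counter

# (group, predicted_high_risk, reoffended) -- invented data
records = [
    ("A", True, False), ("A", True, False), ("A", False, False),
    ("B", True, False), ("B", False, False), ("B", False, False),
]

wrongly_flagged, did_not_reoffend = Counter(), Counter()
for group, high_risk, reoffended in records:
    if not reoffended:
        did_not_reoffend[group] += 1
        if high_risk:
            wrongly_flagged[group] += 1

for group in sorted(did_not_reoffend):
    rate = wrongly_flagged[group] / did_not_reoffend[group]
    print(f"group {group}: false-positive rate {rate:.0%}")
# group A: 67%, group B: 33% -- group A is wrongly flagged at
# twice the rate, the shape of the disparity ProPublica reported
```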

When researchers and advocates started to investigate other services, they often came up with similar reports of unfairness in algorithmic decisions. For example, Google Translate has faced criticism for translating gender-neutral pronouns into English as “she” when associated with the word “cook” or “nurse,” while it translates the same pronouns as “he” when associated with the words “engineer” and “soldier.” Google responded by explaining that Translate acquires its word associations through “learning patterns from many millions of examples of translations seen out on the web.” And this is the crux of the problem: the algorithms’ biases are ultimately the users’ biases. If people draw a man when prompted to draw a scientist, or if the literature a dataset draws from is predominantly male, the algorithm learns the word association. Thus, some academics, such as MIT Media Lab research scientist Rahul Bhargava, have argued that instead of calling the field “machine learning,” we should be calling it “machine teaching.” The algorithms’ biases only exist because we are incorporating our own biases into them.
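Those learned associations are measurable. The sketch below uses the gensim library and its downloadable GloVe word vectors (trained on web text) to compare how close occupation words sit to “she” versus “he.” The exact numbers depend on the corpus, and this illustrates the idea rather than Google Translate’s internals.

```python
# Measure gendered word associations in off-the-shelf embeddings.
# Requires gensim; the first run downloads the GloVe vectors.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")  # embeddings learned from web text

for word in ["nurse", "cook", "engineer", "soldier"]:
    to_she = vectors.similarity(word, "she")  # cosine similarity
    to_he = vectors.similarity(word, "he")
    leaning = "she" if to_she > to_he else "he"
    print(f"{word}: leans '{leaning}' (she={to_she:.2f}, he={to_he:.2f})")
```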

As both the risk-assessment and Google Translate biases have shown, this algorithmic problem is a two-way street, encompassing both the influence data has over our lives and the influence we have over that data. Because this relationship is so complex, there are no easy solutions.

However, researchers at Microsoft are currently trying to mitigate this semantic gender bias in publicly available language models. Working with scientists at Boston University, they aim to create a gender-bias-free dataset by delinking stereotyped associations that have no logical basis while preserving legitimate ones.
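The core technique in that line of work is geometric: estimate a “gender direction” in the embedding space and subtract its component from words that should be gender-neutral. Below is a minimal numpy sketch of the idea, with made-up three-dimensional vectors; the published Microsoft and Boston University method is considerably more careful.

```python
# Remove the "gender component" from a word vector by projecting it
# off an estimated gender direction. Toy vectors, illustration only.
import numpy as np

def debias(word_vec, gender_direction):
    """Subtract word_vec's component along the gender direction."""
    g = gender_direction / np.linalg.norm(gender_direction)
    return word_vec - np.dot(word_vec, g) * g

# Hypothetical 3-d embeddings for illustration.
he = np.array([1.0, 0.2, 0.1])
she = np.array([-1.0, 0.3, 0.1])
nurse = np.array([-0.6, 0.5, 0.4])  # leans toward "she" to start

gender_dir = he - she  # crude estimate of the bias axis
nurse_fixed = debias(nurse, gender_dir)

print(np.dot(nurse, gender_dir))        # nonzero: gendered association
print(np.dot(nurse_fixed, gender_dir))  # ~0: gender component removed
```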

Yet this is only the start of an exploration of our relationship with data. Another facet is being explored through Mapping Prejudice, a project supported by the University of Minnesota’s John R. Borchert Map Library. The project uses data to expose our historical prejudices and is committed to mapping the locations of racially restrictive property covenants across Hennepin County.

This is just one example of how we can harness the power of data to create a more equitable and transparent society.

As data-driven services increase the speed and convenience of our lives, we must stay aware so we are not caught up in a cycle that reinforces disparities both in real life and online. By continuing to define and refine our roles in this complex relationship, we can contribute to a public discourse about data that benefits us all. Let’s focus not only on the legal consequences and privacy issues when the media reports big scandals like Cambridge Analytica, but also on the ways our everyday decisions and interactions shape our society and our technology.

Issues such as algorithm bias stem from our own societal faults, and fixes like bias-free datasets are only temporary solutions. It is paramount that we understand this as the world becomes more reliant on technology. Data, by nature, is neither positive nor negative. We influence it, and in turn, it influences us. We can use it to exploit or to explore the human psyche. Research into this human-data relationship is new, but however developed the field becomes, our relationship status will remain unchanged: It’s complicated.