Last week, news broke that Google fired the co-lead of its ethical AI team, Dr. Timnit Gebru. Since then, the artificial intelligence community has been taking stock of a host of issues. Chief among them is the intersection of corporate power and research institutions: can a company effectively question its own products? Under the surface of this firing are long-standing and intensifying questions about the ethical implications of artificial intelligence that have yet to fully translate to the wider public.
Dr. Gebru’s departure was precipitated by a paper she co-authored with four other Google employees. The draft paper, titled “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?” examines the potential multifaceted hazards of a new class of large machine learning models that are widely used by Google and other large technology companies.
Since her firing, the paper by Dr. Gebru and her colleagues has leaked online. In recent years we have seen public fretting from figures like Elon Musk that rogue AI would somehow turn the world into paper clips. The paper by Gebru et al., by contrast, is grounded in present-tense issues. Throughout, I will occasionally refer to Dr. Gebru in the singular for the sake of simplicity; please keep in mind that she had several co-authors on this paper. Their identities are being protected as current employees of Google.
The paper centers on problems that are inherent in a quickly evolving class of artificial intelligence techniques that attempt to model human language. In simplified terms, these models are designed to do one of two tasks: they either predict what comes next after a string of text, or they predict what should go in the middle of a sentence. While this may sound less than exciting, it is massively important: it is a core part of the technology that powers machine translation, automated speech recognition, and virtually anything else to do with language. Imagine something like a sophisticated statistical “Mad Lib” solver and you’re on the right path.
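To make the “Mad Lib” picture concrete, here is a minimal sketch of both tasks using nothing more than word counts. It is an illustration only: the toy corpus and the helper names `predict_next` and `fill_blank` are invented for this example, and real language models learn far richer statistics than adjacent-word counts.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus; real models train on hundreds of gigabytes of text.
corpus = "the cat sat on the mat . the cat ate the fish . the dog sat on the rug ."
tokens = corpus.split()

# Task 1: count which word follows each word.
next_counts = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    next_counts[prev][nxt] += 1

# Task 2: count which word appears between each (left, right) pair.
middle_counts = defaultdict(Counter)
for left, mid, right in zip(tokens, tokens[1:], tokens[2:]):
    middle_counts[(left, right)][mid] += 1


def predict_next(word):
    """Return the word most often seen after `word`, or None if unseen."""
    counts = next_counts.get(word)
    return counts.most_common(1)[0][0] if counts else None


def fill_blank(left, right):
    """Return the word most often seen between `left` and `right`, or None."""
    counts = middle_counts.get((left, right))
    return counts.most_common(1)[0][0] if counts else None


print(predict_next("sat"))        # what comes next after "sat"?
print(fill_blank("cat", "the"))   # "cat ___ the"
```

Both helpers simply return the most frequent candidate seen in the corpus, which is the statistical guesswork the analogy describes, scaled down to three sentences.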
In recent years, these technologies have grown vastly more powerful due to a confluence of factors that constitute a familiar story: ever-larger datasets help to train bigger and more complex models enabled by radical improvements in computer infrastructure.
At present, the largest of these models is OpenAI’s GPT-3. The model uses some 175 billion parameters, and each must be tuned or “trained” on example data via an iterative process. In the case of GPT-3, that data is 570 gigabytes of text drawn from a variety of Internet sources. The previous largest model, Microsoft’s T-NLG, announced in February of this year, used less than 10% as many parameters and 30% as much training data. In short, this is a very quickly moving space.
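For a sense of scale, the percentages above can be turned back into rough absolute numbers. This is back-of-the-envelope arithmetic from the figures quoted in this article only; it assumes the 30% refers to GPT-3’s 570 gigabytes, and the variable names are purely illustrative.

```python
# Figures quoted in the text for GPT-3.
GPT3_PARAMETERS = 175_000_000_000
GPT3_TRAINING_GB = 570

# "Less than 10%" of the parameters: this is an upper bound, not an exact count.
tnlg_parameter_ceiling = GPT3_PARAMETERS * 10 // 100

# "30%" of the training data, read as 30% of GPT-3's 570 GB (assumption).
tnlg_training_gb = GPT3_TRAINING_GB * 30 / 100

print(f"T-NLG parameters: under {tnlg_parameter_ceiling / 1e9:.1f} billion")
print(f"T-NLG training text: roughly {tnlg_training_gb:.0f} GB")
```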
Dr. Gebru and her colleagues ask a straightforward but provocative question: How big is too big? To understand that, we have to look closely at what it means to train a machine learning model.
Machine learning researchers and computer scientists tend to divide calculation of costs between “training” time and “inference” time. Training refers to the process of “teaching” the model on the basis of large datasets. That process is typically iterative; it takes many cycles of repetition to develop a robust model. Researchers must also be careful that the training data is diverse enough that models don’t “overtrain” and become fixated on idiosyncratic patterns.
For example, imagine you have a set of “noisy” points like the black dots below:
Your goal is to come up with a function that will predict a new set of points in the future from the same underlying process — recognizing that this particular collection of black dots has some measurement error and uncertainty in it.
The two lines show two possible solutions. The first is a very simple line. While it doesn’t perfectly “fit” the dots, it captures the basic trend. The second is the blue line, which perfectly fits each and every dot, and relies on a more complicated function. If you measure the accuracy of these two models on just this data, the blue line is better. In fact, it is technically perfect.
But is it? You can imagine that over time, with new datasets that won’t have the dots in exactly the same place, the simple line will actually be more accurate. The simple line is less complex, but actually captures the data better by not being overly influenced by just one example.
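The same trade-off can be reproduced in a few lines. The sketch below is not the data behind the figure: it assumes a hypothetical process (a straight-line trend plus random measurement noise) and compares a least-squares line against a polynomial forced through every training dot. All function names are made up for this illustration.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_interpolant(xs, ys):
    """Lagrange polynomial that passes exactly through every training point."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            weight = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    weight *= (x - xj) / (xi - xj)
            total += yi * weight
        return total
    return predict


def mse(model, xs, ys):
    """Mean squared error of a model's predictions."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def sample(rng, xs, noise=0.5):
    """Assumed process: the trend y = x plus Gaussian measurement noise."""
    return [x + rng.gauss(0.0, noise) for x in xs]


rng = random.Random(0)  # fixed seed so the run is repeatable
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
trials = 500
totals = {"line_train": 0.0, "line_new": 0.0, "curve_train": 0.0, "curve_new": 0.0}

for _ in range(trials):
    train = sample(rng, xs)  # the dots the models get to see
    new = sample(rng, xs)    # fresh dots from the same process
    line = fit_line(xs, train)
    curve = fit_interpolant(xs, train)
    totals["line_train"] += mse(line, xs, train)
    totals["line_new"] += mse(line, xs, new)
    totals["curve_train"] += mse(curve, xs, train)
    totals["curve_new"] += mse(curve, xs, new)

averages = {name: total / trials for name, total in totals.items()}
for name, value in averages.items():
    print(f"{name}: {value:.3f}")
```

On the training dots the curve’s error is zero by construction, while the line’s is not. On fresh dots the ranking flips, because the curve has memorized the noise in one particular sample along with the trend.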
Finding the right balance in model complexity is difficult work. Too simple, and the model can’t capture important nuances—too complex and it becomes fragile, or at worst a mere portrait of the training data. The goal, as in so many things, is to find the sweet spot in the middle: a tool that is general enough to handle the messiness of the real world without sacrificing too much in terms of accuracy.
Finding that sweet spot often requires vast computational resources, particularly in the modern context. This is partly due to the inherent need for repetition in “teaching” the model, and also the need to ensure that the model doesn’t become too dependent on one narrow slice of training data.
In comparison to training, “inference” — that is, actually using the model to make a prediction — is much cheaper. The expensive part is finding the right values for all those parameters discussed above. Once you have them, making predictions with the model is relatively easy.
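The asymmetry shows up even in a toy model with a single parameter. The following is a minimal sketch under invented assumptions (a made-up dataset where y = 3x, a hand-picked learning rate, and a counter for how many times the model is evaluated); real systems differ by many orders of magnitude, but the shape of the cost is the same.

```python
# Hypothetical dataset: ten points from the rule y = 3x.
data = [(float(x), 3.0 * x) for x in range(1, 11)]

weight = 0.0            # the single parameter to be "trained"
learning_rate = 0.001   # hand-picked for this toy problem
epochs = 200            # number of passes over the data
training_evaluations = 0

# Training: repeat over the data many times, nudging the weight each time.
for _ in range(epochs):
    for x, y in data:
        prediction = weight * x
        training_evaluations += 1
        gradient = 2.0 * (prediction - y) * x  # derivative of squared error
        weight -= learning_rate * gradient


def predict(x):
    """Inference: a single multiplication with the trained weight."""
    return weight * x


print(f"model evaluations during training: {training_evaluations}")
print("model evaluations per prediction: 1")
print(f"learned weight: {weight:.4f}")
print(f"prediction for x = 4: {predict(4.0):.4f}")
```

Training here takes thousands of evaluations to settle on one number, and using that number afterwards takes one. Scale the single weight up to 175 billion parameters and the gap becomes the cost difference described above.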
If you squint, a rough analogy can be made to raising children. The cost of raising an infant to school age is profound in comparison to babysitting a kid for a day. The hard part is teaching an infant all the basics of human life: not to touch the stove or stick fingers in electrical sockets. Once the child, or the model, has reached maturity, it “operates” relatively cheaply in comparison to the cost of its upbringing.
All of that training computation consumes energy, and the first of Dr. Gebru’s concerns is its environmental cost. Google and others contend that innovation has helped to offset these costs and prevent them from becoming another worrying source of carbon pollution. However, Dr. Gebru rightly points out that the wider enterprise of cloud computing is not necessarily carbon neutral, whatever Google’s particular commitments may be.
She further argues that the benefits and costs of these models are lopsided. Is it fair to ask residents of the Maldives, expected to be underwater before the end of the century, to foot the environmental bill for marginally improved language models — particularly when those language models don’t include Dhivehi? In short, the people most exposed to the risks of climate change are some of the least positioned to benefit from technologies like Google Home or Amazon’s Alexa.
Put more pointedly: What good is a Siri that doesn’t understand your language when the water is rising?
Should we hasten the flood so that gadgets become slightly less stupid? How about when such machines do other more important work, like predicting the structure of medically relevant proteins? To be sure, these are not easy questions — but they are important ones, and ones that should be widely assessed.
Next among Dr. Gebru’s concerns is “unfathomable” training data. Those 570 gigabytes of Internet data referenced above largely come from sources that represent a very narrow slice of society. In the analogy above between training a model and raising a child, imagine a child raised largely by Reddit or Twitter, with occasional babysitting from Wikipedia. Are you confident that a well-rounded, tolerant person would result? Would you take that teenager to an important company dinner, trusting that they wouldn’t say anything offensive or laughably wrong?
The deeper problem Dr. Gebru raises is that the training datasets themselves are too large to effectively audit or even fully understand as a researcher. How can we assess the ethics of something so vast we can’t effectively explore it? The risk, Gebru’s team and others argue, is that by averaging together such sources, models will come to “encode hegemonic worldviews.”
The risk only becomes more pronounced as the models become increasingly convincing. Advances in language models have enabled automated text that seems coherent across sentences and even paragraphs.
Take this example of GPT-3 answering questions about the Russian private military company, the Wagner Group:
The model does an admirable job of answering basic questions. The responses feel logically and intentionally composed.
The example comes from a paper earlier this year by two researchers from the Middlebury Institute of International Studies. They evaluated the risk that a language model like GPT-3 could be weaponized.
The researchers concluded that GPT-3 was highly capable of emulating “ideologically consistent” propaganda found in many online extremist communities. In their assessment, tools like GPT-3 could be used to generate massive amounts of text that could be “lightly edited” into a digital blizzard of misinformation.
Their work pierces the analogy about AI models and children that I made above. Unlike humans, machines are indefatigable. Models can be used to misinform and propagandize tirelessly. Moreover, models are often regarded by people as being nearly oracular thanks to their technical trappings — even as they continue to make silly mistakes that reveal their lopsided educations.
As uneven as human law can be, we have even flimsier frameworks for applying justice to wayward machines. How do you hold a mathematical model to account? Where in its billions of parameters can you locate agency or motivation?
So often, the dialog about AI safety imagines distant futures rather than the unsettling present. We imagine a world turned into paper clips by an overzealous machine, or imagine robotic overlords that conveniently reify our fears about our own social machinery. Public figures like Musk tell us we’re right to fear machines. The threat they warn of is always generalized and distant.
The last 50 years or so of technological progress have been like driving ever faster on a rugged, occasionally potholed road. We go faster and faster, but the road doesn’t get smoother. If anything, the potholes only get deeper. The fear that something dramatic is just around the corner at such speed is understandable. But it is also an escapist fear.
Experts like Dr. Gebru remind us that, while Frankenstein’s paper clip might be around the bend tomorrow, we also need to ask hard questions today about who really pays the price in terms of climate.
Don’t settle for a mental picture of Terminator and the obvious jokes about killer robots. Instead, contemplate extremists and latter-day Jacobins armed to the teeth with an infinite press. Consider the dulcet tones of a machine voice, raised on the Internet, with racism and sexism surgically suppressed but never entirely gone.
Imagine the hyperculture — born of an amalgam of data and images too vast to be seen, unrelenting in its production of yet more monoculture. The future might be like falling down an endless elevator shaft: one long shriek of infinite artificial novelty, but somehow all the same.
The crime you should fear is not a robot murdering you, though sadly you can’t discount it entirely. Worry instead about the electricity bill raised by extremists and marketers alike running rampant with Jefferson’s polygraph crossed with Mickey’s magic broom from The Sorcerer’s Apprentice.
A woman who has dedicated her career to asking these kinds of questions was fired. Whatever your opinion about that, we need more minds on this at Google and elsewhere — not fewer. Given her contributions so far, Dr. Timnit Gebru will find a way to continue to ask impolitic but necessary questions. Will the rest of us take them up?