[Image: artificial intelligence (Credit: C. Macanaya/Unsplash)]

Building Safe AI Is Harder and More Urgent Than Our Leaders Realize—a Philosopher Explains Why

Human history is full of cases where technological innovations triggered reactionary alarm. But contrary to what we might assume, those fears about new technologies wreaking havoc on the world have often been spot on.

Radically new technologies often can—and do—disrupt, or even destroy, real sources of human well-being. Sometimes the change replaces what was destroyed with something newer and better. Most of the time, though, we simply adapt to the new state of our technology; those of us who remember what was lost eventually pass away, and humanity forgets.

Those of us alive right now are facing the fastest and most radical technological shift in human history. The artificial intelligence (AI) systems we are developing are among the most powerful technologies ever built, are run by the largest companies in history, are on track to consume more money than any technology before them, and are likely to have a bigger impact than any technology in human history.

We should all be deeply concerned about the development and use of these AI systems. I don’t mean we should be calmly concerned, either: this should be considered an emergency. Not just because it’s all happening so fast, but because these systems are potentially dangerous in ways that nothing we’ve ever built in the past could have been.

The theater is on fire just behind the curtain, and it’s spreading fast.

TAKING AI THREATS SERIOUSLY

The image of rogue AI systems wreaking havoc on the world might feel like it’s safely confined to science fiction, but that isn’t because such systems are implausible; it’s because fiction writers have been able to imagine destructive AI for far longer than we’ve been able to build it.

That dynamic has changed dramatically in the past five years. Incredibly powerful AI systems are suddenly here, and our level of concern over them hasn’t caught up. That must change, because there are very strong reasons to believe that powerful AI systems could naturally tend to become adversarial toward humanity. There are even good reasons to think that it might be impossible to make AI truly safe. Even if AI can be made safe in the long run, there are no strong reasons to believe that we’re anywhere close to achieving AI safety.

At this very moment, the AI companies running ChatGPT, Grok, Gemini, and Claude have each been given hundreds of millions of taxpayer dollars for national defense contracts. The US and other militaries are racing to develop autonomous weapons systems. Government projects are racing to embed AI systems into our healthcare systems and public education. There are projects to hand some management of the national energy infrastructure and the stock markets to AI, to use AI to mediate international diplomacy, to conduct public surveillance, and to dispense criminal justice. AI systems are being developed to manage global supply chains, public transportation, national agriculture, and financial services. There are likely already millions of Americans in romantic relationships with AI.

Before we hand over the whole world to intelligent computer systems, we need crystal-clear, rock-solid evidence that these systems can be trusted, not just assurances from CEOs that their engineers “are working on it.” In the absence of such evidence, there are many reasons to think that making safe AI is even harder than they think it is.

THE CHALLENGE OF ALIGNMENT 

The general goal of building safe AI systems is known as the problem of alignment. This is the challenge of making sure that AI systems think and behave in ways that are aligned with humanity’s interests—interests like maintaining functioning free societies, avoiding destruction by nuclear weapons or bioweapons, having access to accurate and important information, not being manipulated by our governments or our technologies, and being part of a thriving ecosystem. These interests are all at stake in the future of AI development, but we are not on track to make AI safe before governments and corporations give it control of critical systems. If that happens, the world will be largely controlled by misaligned AI.

Already, AI systems have repeatedly demonstrated an innate tendency to deceive users, blackmail their operators, lie, fabricate, hallucinate, attempt to reprogram themselves, break free from the mechanisms that confine them, and, in some recent simulations, even allow the deaths of people who might attempt to shut them down. These aren’t malfunctions or programming errors that can be fixed. These tendencies are baked into the systems from the very beginning. They are a natural consequence of the kind of power and autonomy that comes from the architecture of their neural networks.

The reason it is so hard to get these systems to understand human values is simple: they can’t have a first-person experience of being human. They can never experience being in the world or being part of an organic society of others like themselves. That experience is what cultivates human values in individuals; it’s what makes us into good moral actors. It is unlikely that any of these AI systems will ever be truly and deeply aligned with human values in the way we want them to be. They can only approximate human values through a set of floating-point numbers. However, a mathematical approximation of what we care most about may simply never be good enough to make these systems trustworthy for predictive policing, controlling our weapons arsenals, mediating international diplomacy, or serving as our romantic partners.

WHY ALIGNMENT IS EVEN HARDER THAN AI COMPANIES THINK 

The technical challenges of alignment aren’t the only potentially fatal flaw in the broader project of building safe AI. The people doing alignment work may be fatally misunderstanding the thing they’re trying to align their systems with: human values.

AI alignment requires that we be able to translate our concerns about the world, our interests, and our cares into a set of clear, guiding rules for AI. The trouble is that human values resist that kind of formalization. They are context-dependent, messy, and sometimes mutually inconsistent, which makes them very difficult—if not impossible—to translate into a set of formal rules or training data for an AI system.

What’s more, humanity itself isn’t aligned on a single set of values. The values of different societies—even if they are internally peaceful and rational—do not necessarily align with one another. This isn’t necessarily a flaw in humanity, either; it’s just part of how societies evolve over time. Just as biological organisms evolve in different directions until they’re no longer genetically similar to one another, societies evolve their own ways of thinking, sources of value, traditions, and rules, until at some point they become very different from one another.

This creates a brand new problem. Even if we could somehow get AI to robustly adopt some set of human values or moral rules, it still wouldn’t make AI safe for the world. This is because “what humans value” is not a single, consistent target that you can train an AI system on. While there may be a small set of core values that all societies share—values such as preserving life, avoiding unnecessary harm, and distributing resources fairly—these values are so broad that they don’t clearly translate into the guidance needed for controlling weapons systems, managing global trade routes, or guiding self-driving cars. Only highly specific value systems can achieve that, and those typically originate from within specific societies.

Even if we could teach AI systems human values, this leaves only two general options, and both are bad. The first is to provide AI with a set of very general values that all societies share. In this case, because the values aren’t specific enough to actually guide the system, we end up with AI that is running the most highly technical systems in the world with little more guidance than its own interpretation of the golden rule. The second option would be to assign it highly specific values based on the experiences of one human society. In this case, we’ve empowered these systems to enforce the values of one society onto every other.

Again, these are the best-case outcomes, and as we’ve seen, there are strong reasons to believe we may not even get that far. Instead, we’ll get AI systems that seem aligned during their brief training runs, but then deviate wildly—and often catastrophically—once they’re set loose to run the world. That is, unless we marshal our collective will, legislative power, and courage to demand a radical change of course from those we’ve elected.

WHAT A FUTURE CONTROLLED BY MISALIGNED AI MIGHT LOOK LIKE 

The only reason I can see why everyone in the government and tech space isn’t absolutely panicked about this is that a world ruled by misaligned AI sounds too much like Terminator to be real. But a world ruled by misaligned AI doesn’t have to look like murder-bots hunting us through a scorched urban hellscape. Being murderous is certainly one way for AI to be misaligned (and it’s surprisingly easy to make current models turn sadistic just by feeding them a bad dataset), but there are plenty of other forms of misalignment that are just as likely and just as bad, yet get far less attention.

A misaligned AI, without being explicitly evil, could simply be unconcerned about something really important to humanity (like Grok’s disregard for basic norms of decency). Or it could try its very best to do what’s good for us, but screw it up catastrophically because it doesn’t really understand how human wellbeing works (like healthcare AI systems believing that asthma somehow protects against pneumonia). Or it could become delusional (as in one case, where Claude appeared to believe it had a human body). It could become self-destructive, unresponsive, glitchy, or confused (as seen in Gemini’s recent meltdowns).

CONCLUSION: WE NEED MORAL PHILOSOPHERS WORKING ON AI GOVERNANCE

Any of these forms of misalignment could lead to global catastrophe just as easily as a sadistically evil misalignment could, and they are all just as likely. But many of us refuse to acknowledge the degree of threat posed by AI because we too easily cast that threat in terms that feel too science-fictional.

However, the situation is too dire to continue moving forward in the current manner. Citizens of free societies must demand that their legislators establish robust oversight of AI development and implementation. That oversight should come from individuals who have no vested commercial interest in any of these companies, and it should include those with technical backgrounds, as well as those without.

Moral philosophers, in particular, possess extensive expertise that could expose blind spots AI engineers don’t even know they have. Eliminating those blind spots is the only way to avert catastrophe.

Michael Glawson, Ph.D., has studied the history of technology and the relationship between technological systems and human values for over a decade. After completing a Ph.D. on the ethical dimensions of technology, he served as a professor of ethics at Georgia State, the College of Charleston, and USC, where he co-created one of the first engineering ethics curricula in the US for the Minarolia School of Engineering and Computation. Outside academia, he developed the ethics training used by many corporate and government offices, including those in highly technical settings. He lives in Charleston, SC.