22 Comments
Matt Reardon

Premise zero should be "ASI defined" and premise one should be "ASI possible." Most dismissive views seem to imagine ASI is something very different from what you do, and further they don't believe it is possible at all in some sense, though this is easy to conflate with "ASI soon."

Richard Y Chappell

For ~any functional operationalization of intelligence, what do they think is the in-principle barrier to an artificial realization of it that surpasses human capabilities? (Some may get hung up on internal questions of qualia etc., so it's worth being clear that the sense of intelligence I'm concerned with doesn't depend on qualia; just functional processing that results in instrumentally effective planning / goal-directed behavior. But it's also puzzling to me that anyone would get hung up on this point.)

Matt Reardon

I don't know. My experience is that people point to the absence of a single, precise, agreed-upon definition as reason alone for dismissal and don't engage with questions like these except to further question what "human capabilities" are or what "surpass" means.

They won't offer their own definitions here, and instead shift the burden back to you. When you offer one, they'll complain that it's awfully specific and thereby unlikely for x, y, z reasons.

Richard Y Chappell

Seems like they're implicitly playing a social game of "can you *force* me to agree with you? No? Haha!" rather than engaging in truth-seeking inquiry.

But if anyone reading this is skeptical of the in-principle possibility of artificial superintelligence (and willing to engage in good faith), I'd be curious to hear more!

Seemster

I think ASI in the sense of highly functional optimizations is very possible and on the horizon, if not already here. I am not sure I would use the word intelligence to describe such a thing, because intelligence seems to entail some sort of sentience or intellect, i.e., some kind of mental state. But outside of the mental-state aspects, I think artificial super functions are incredibly likely.

Mark

I like that you've not neglected gradual disempowerment! The situation seems almost like this: Okay, we have AGI/ASI. Solved alignment? No? AI takeover and extinction. Solved alignment, yes? Is the ASI under the control of some one guy/company? "Human takeover," i.e. extreme power concentration and value lock-in. No? Okay, well, eventually: gradual disempowerment. Existential risks on all sides. :(

Seems like the most robust way of reducing existential risk (not just extinction risk) is to enforce a Pause / Moratorium (perhaps backed up by deterrence, cf. Hendrycks' MAIM) while humanity works hard at solving both the alignment problem and the governance/coordination problems (and philosophers like yourself, as well as political scientists, economists, etc., are much needed here!). If we only solve the former but not the latter, we open humanity to various existential risks. And a Pause looks necessary to solve those two sets of problems, especially if AGI/ASI is on the horizon. So, in short, preventing existential risks (understood as threats to our long-term global flourishing, of which survival is one part) as a whole requires a Pause.

And it seems undeniable (unless one harbors some supreme arrogance about humanity) that some safe, aligned AGI could have a far greater likelihood of solving the problems of philosophy than human philosophers (ctrl-f-replace with science and scientists and this also makes sense—indeed, it's what animates Demis Hassabis). So philosophers, whether they prioritize global flourishing or just philosophy alone, should prioritize AI safety first.

Elliott Thornley

Often I find that philosophers' dismissals of AI risk are driven by a sort of fatalism, and that they can sometimes be swayed by making a quick case for tractability along the following lines. With AI risk, the dangerous technology doesn't exist yet (in contrast to nukes), we can shape its features (in contrast to pandemics and asteroids), and the barriers to causing harm are high (in contrast to engineered pandemics and climate change). To make a difference, we just need to persuade a small number of (admittedly very powerful and motivated) people to do things a little bit differently.

And it helps to note that reducing AI risk is especially tractable *for philosophers*. To address nuclear, pandemic, asteroid, or climate risks, you have to learn a new field and your philosophical skills are approximately useless. With AI risks, that's not true. Philosophical skills are useful. And since Claude Code can now design and run ML experiments for you, you don't even have to learn much of a new field.

Seemster

I think this argument is sound. I do not, however, believe that government regulation should play a role. Unfortunately, I think many people jump to that conclusion from the cogent argument you present. In fact, I think governments increase the very risk of harm this argument is concerned with. I believe that governments themselves are very risky organizations that should be investigated and approached more cautiously, as they are currently the greatest risk to humans.

Richard Y Chappell

Do you also think nuclear weapons should be deregulated?

(I agree that governments are dangerous. But anarchy seems worse, esp. where dangerous technology is concerned!)

Seemster

No, I believe nuclear weapons should be regulated. I believe a great many things should be regulated. I just don't think the government should be the only organization allowed to regulate such things. But nobody is powerful enough to regulate the government*! That appears to be the unfortunate reality we live in, and it appears to pose the biggest threat to humanity, in my opinion. Essentially, I think any organization should be allowed to regulate society, but many of the things we would want regulated are things governments themselves do the most! For example, think of the most obvious cases, such as murder, theft, kidnapping, and imprisonment, as well as the greater risks to humanity as a whole, such as owning nukes or using AI for harmful purposes.

Robert Hall

> But if anyone reading this is skeptical of the in-principle possibility of artificial superintelligence (and willing to engage in good faith), I'd be curious to hear more!

It seems that you are making controversial assumptions about the nature of the mind and cognition. If you assume that either physicalism or epiphenomenalism is true, and that cognition merely involves symbols and relations between those symbols, then it makes more sense to think that ASI is possible. But I reject these views. I do not think that we think using arbitrary "qualia" whose content is irrelevant and which merely play a relational role in a structure (and of course, I also don't think that we think using representational brain states or processes). Rather than just using arbitrary symbols (of course we also make use of symbols a lot), I think that cognition involves universals, and that our ability to comprehend universals and their relations relies on the non-physical nature of the mind.

Even putting aside the concerns about the metaphysical possibility of ASI, it is not quite clear to me how it is supposed to pose a risk of extinction. It is not as though intelligence grants superpowers to the possessor of it. As long as it is just running on servers and is not connected to automotive machines or weapons systems, any significant extinction risk would presumably come from listening to what it says.

In fact, it seems that the real risk of AI is almost entirely due to believing that true AI is possible (at least if it is not in fact possible). This might explain why some people have such a dismissive approach towards talk of AI risk -- it might be their way of protecting against what they see as the real risk of AI, which is that people will put too much faith in it because they falsely believe that it has true understanding (another related risk is that people will think that AI programs are moral patients -- although that relies on the further assumption that AI programs can be conscious).

If people think that it is capable of true understanding, then instead of thinking for themselves, they might listen to whatever the AI says, and are more likely to do things like hook it up to nuclear systems or otherwise be too willing to rely on it. So, personally, I wish people focused more on the limitations of AI and how we should not be taking it too seriously, rather than implausible scenarios involving rogue AI.

Richard Y Chappell

Thanks for explaining your view!

Here's a sub-argument:

(1) Any philosophy of mind that predicts ASI is impossible would also predict that present-day AI is impossible.

(2) Present-day AI is not impossible.

so (3) there is no reason from the philosophy of mind to think that ASI is impossible.

The underlying thought: Either "grasping universals" is not necessary for dangerous cognitive capabilities, or any suitably capable functional architecture (of a sort that can ace the Math Olympiad, write professional-level code, and describe how to synthesize pandemic-risking biological viruses) is also capable of grasping universals.

On the risks: Superintelligence with free run of the internet is sufficient to (i) make (or steal) money, (ii) hire people and buy resources (e.g. to start up a robotics factory or research lab), and then (iii) deploy those resources to achieve things. None of these steps relies on others having any beliefs about AI at all.

Robert Hall

Thanks for the reply.

> (1) Any philosophy of mind that predicts ASI is impossible would also predict that present-day AI is impossible.

I don't see why I should accept this premise. Hypothetical ASI is presumably very different from present-day AI. As far as I know, present-day AI still makes stupid mistakes of the sort that humans would not make. If we imagine that the ASI might plausibly succeed in executing some elaborate plot to overthrow or exterminate humans, presumably we need to imagine that such problems have been solved. If such errors are the inevitable result of lacking true understanding, then a program that works well enough to pose a serious risk of the sort imagined might not be possible.

Also, it seems plausible that certain kinds of human behavior, such as answering factual questions based on memory, rely much less on understanding than others do. It might be that AI can imitate the former pretty well, but not the latter. If you try to play along while watching Jeopardy, for example, you might just blurt out the first thing that comes to mind based on the clue. This is plausibly very different from a case where we are evaluating philosophical arguments or making elaborate plans.

> On the risks: Superintelligence with free run of the internet is sufficient to (i) make (or steal) money, (ii) hire people and buy resources (e.g. to start up a robotics factory or research lab), and then (iii) deploy those resources to achieve things. None of these steps relies on others having any beliefs about AI at all.

But that requires human intermediaries. If we imagine that the AI could, for example, simply pay someone to build a nuclear weapon, or bribe someone who currently has access, then that is a concern that is largely independent of AI (since presumably a non-AI agent could also do that). If the AI sends a lot of money to someone in a suspicious way, that might be investigated by law enforcement. And if a group actually starts acquiring materials to build a nuclear weapon, or actually starts to build it, hopefully they would get caught. If there is no way to stop such a thing, then that is a concern we should deal with even if AI does not increase the risk. What would make an ASI with Internet access significantly more dangerous than a human billionaire with Internet access?

The scenario here that seems most plausible to me is that someone, for example, someone with access to nuclear weapons, simply takes the advice of an AI program. But if the person knows that AI is not capable of true understanding, presumably he would be less likely to follow such advice.

Richard Y Chappell

> "As far as I know, present-day AI still makes stupid mistakes of the sort that humans would not make."

If anything, I think the reverse is actually *more often* true: humans make stupid mistakes of the sort that frontier AI would not make. (But since capabilities are jagged, one may still find *some* areas where AI makes stupid mistakes that a human wouldn't. They are getting rarer, though.)

> "Also, it seems plausible that certain kinds of human behavior, such as answering factual questions based on memory, rely much less on understanding than others do. It might be that AI can imitate the former pretty well, but not the latter."

I suspect that any test of understanding that anyone would have thought up before 2022 has now been passed by AI. They can explain irony and subtle humor, ace International Math Olympiad problems, plan out complex code that would take an expert software engineer many hours, etc. So I don't think there's any empirical basis for the sort of divide you're proposing. (That doesn't mean that they really possess understanding, but they can at least expertly mimic it in many domains, in a way that seems contrary to what your view would seem to predict. I don't see any empirical basis for expecting a sudden reversal of this trend. We should probably expect—and certainly allow the possibility—that some day they will expertly mimic cognitive capabilities far beyond humanity's. In many ways, they already do.)

> "What would make an ASI with Internet access significantly more dangerous than a human billionaire with Internet access?"

Vastly greater intelligence (implies more advanced technology - consider dangerous possibilities from nanotech, synthetic biological weapons, etc.); less need of other human beings; less need to keep Earth capable of supporting biological life in general; and vastly greater agential capacity due to the ability to clone itself (or run in parallel) millions of times over.

I don't believe there's currently tracking and oversight of the sort that would necessarily detect dangerous bio-research happening outside of official research centers, for example. There are steps that could (and probably should) be taken to better track such risks. But part of the problem is that a technologically less-advanced society doesn't know what a more-advanced one could do, and hence doesn't entirely know what to look out for.

Robert Hall

> I suspect that any test of understanding that anyone would have thought up before 2022 has now been passed by AI.

Any test that merely checks for specific sorts of responses to specific questions would have been flawed, though. Without knowing the details of a program or its context, we cannot really predict how it will behave. Imagine a very simple non-AI program that simply consists of an if statement that checks for the input 'What is wrong with the ontological argument for God?' and then outputs some impressive answer pre-written by a philosopher; any specific behavior (that involves merely outputting text) could be replicated. If a traditional program has a large enough database of inputs and outputs, it could give many impressive answers, even though it does not use what would be described as _AI_ techniques.
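(A toy sketch of the kind of non-AI lookup program described above; the question string and canned reply are placeholders, not anything from the post:)

```python
# Toy non-AI program of the sort described above: an exact-match lookup
# that returns a pre-written answer, with no understanding involved.
CANNED_ANSWERS = {
    "What is wrong with the ontological argument for God?":
        "A pre-written reply by a philosopher would go here.",
}

def respond(prompt: str) -> str:
    # Just a table lookup; any unanticipated input falls through.
    return CANNED_ANSWERS.get(prompt, "No canned answer available.")

print(respond("What is wrong with the ontological argument for God?"))
```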

However, what would be predicted by certain accounts of cognition is that the AI program will more poorly handle _novel_ (relative to itself) situations. Of course, it is hard to say what is novel to it if it has access to a very large database whose details are unknown, but if it is given something truly novel, it is more likely to fail to handle it properly than a human.

Imagine a human who sees three bears enter a cave and then two bears leave the cave, and suppose that his experience of similar cases is exhausted by cases where a group enters and leaves together. If the human is asked if there are still bears in the cave, he might answer correctly. However, if he was merely making guesses based on past experience (which is more analogous to present-day LLMs, I think), this would be less likely. Of course, if his past experience is more abundant, then he might be able to answer correctly without giving it much thought.

Presumably many physicalists (or epiphenomenalists) would agree that our ability to deal with novelty is not merely due to past experience, but they might say that it is due to the logical structures instantiated by the brain. In that case it is not guaranteed that an AI program that works just as well could be built, but it might be seen as fairly plausible. However, if instead we are relying on synthetic a priori knowledge that depends on our ability to directly grasp universals, then the ability to deal with novel situations as well as humans do becomes less plausible.

Of course, current AI can deal to some extent with some amount of novelty, but it is doubtful whether it will ever deal with it well enough to count as 'ASI' of the sort that you are imagining. Surely it will never deal with it as well as we can if we assume that humans use synthetic a priori knowledge, while the AI is stuck using probabilistic methods (even if we merely make use of analytic a priori knowledge that could feasibly be replicated by a program, we would still need to develop a different kind of AI than what we currently have to deal equally well with novelty).

Vasco Grilo

Hi Richard. I think posts like this are worth crossposting to the EA Forum.

Do you agree your basic argument can be simplified as follows?

- 1. AI risk warrants urgent further investigation and precautionary measures if it implies the risk of human extinction over the next 10 years is at least x.

- 2. The risk of human extinction over the next 10 years is at least x.

- 3. So AI risk warrants urgent further investigation and precautionary measures.

What is the minimum x for which you believe the above argument works? Is there any detailed quantitative modelling of the risk of human extinction due to AI that make you believe the risk is at least equal to your minimum x? I am not aware of any such modelling. Relatedly, I have wondered about whether Coefficient Giving (CG) should build detailed quantitative models which estimate global catastrophic risk (https://forum.effectivealtruism.org/posts/mJ3HHvrr3C4suC593/should-open-philanthropy-build-detailed-quantitative-models).

Richard Y Chappell

I'd be hard pressed to specify a minimum. Easier to offer some "low but sufficient" values, like 1 in a billion. (But then the higher the value, the more urgent and significant are the "precautionary measures" that may then be warranted, depending also on the opportunity costs.)

I'm not aware of "detailed quantitative modelling" on this topic, and doubt it would do much to resolve the underlying dispute. (Disagreements about model parameters will be all over the place; plus it's important to account for model uncertainty.) At the end of the day, people need to exercise judgment. I broke it down into two steps, and would suggest that folks probably shouldn't be >99% confident in either (i) ASI being very distant, or (ii) ASI turning out to be safe for humanity. My OP didn't try to quantify the second step (instead appealing to the vaguer idea of what we "can’t be confident" of). But if we assign at least 1% to each of the two steps (ASI soon, and if soon then very bad) then we get at least a 0.01% chance of very bad outcomes to worry about, which is easily sufficient to warrant serious precautions.
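(A minimal sketch of that two-step calculation, using the illustrative 1% lower bounds from the comment above; purely arithmetic, not an estimate:)

```python
# Two-step lower bound on the chance of a very bad AI outcome,
# using the illustrative 1% figures from the comment above.
p_asi_soon = 0.01         # lower bound on P(ASI arrives soon)
p_bad_given_soon = 0.01   # lower bound on P(very bad outcome | ASI arrives soon)

p_very_bad = p_asi_soon * p_bad_given_soon
print(f"P(very bad outcome) >= {p_very_bad:.4%}")  # prints 0.0100%, i.e. 1 in 10,000
```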

Vasco Grilo

Thanks for the clarification. A risk of human extinction of 1 in a billion over the next 10 years would imply 10 human deaths in expectation (= 10^(-9 + 10)), 1 per year (= 10/10). I think this is only enough to justify a very modest version of your conclusion that "AI risk warrants urgent further investigation and precautionary measures". My guess is that the risk of human extinction over the next 10 years is more like 10^-7, which implies 1 k human deaths in expectation (= 10^(-7 + 10)) over the next 10 years, 100 per year (= 1*10^3/10). This results in a stronger version of your conclusion, but is still very much compatible with AI risk warranting less investigation and precautionary measures at the margin (depending on the opportunity cost).
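(A small sketch of the expected-value arithmetic above; note that it assumes a round world population of 10 billion, which is what the 10^(-9 + 10) factor implies:)

```python
# Expected deaths implied by a given 10-year extinction risk,
# assuming a round world population of 10 billion (10^10), as the comment's arithmetic implies.
population = 10**10
years = 10

for risk in (1e-9, 1e-7):  # the two risk levels discussed in the comment
    expected_deaths = risk * population
    print(f"risk = {risk:.0e}: {expected_deaths:.0f} expected deaths over {years} years, "
          f"{expected_deaths / years:.0f} per year")
```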

IHSalvator

Your problem is that you have blind faith in ASI. "There's no in-principle barrier to such technology" - but there is: those algorithms, which just do entropic statistical approximations, don't even think, and they will never acquire anything near self-consciousness. At least they have some degree of forgetfulness (though people in San Francisco think that's what makes them not super).

And whatever - even if they never become ASI, if they are weapons of mass destruction and many superpowers have them, there's no problem either. As the nuclear bomb did, they will bring peace, at least among the nations that have them.

Richard Y Chappell

Three points:

1. Any sense in which AI "don't even think" is a sense of thinking that isn't necessary for shaping the world in dangerous ways.

2. It has been less than a century, and already there have been multiple moments in which global nuclear war very nearly broke out (e.g. due to false-positive signals nearly triggering "retaliatory" strikes). Multiplying such risks is far from reassuring.

3. You're assuming that nations will be able to perfectly control what "their" ASIs do. But the whole alignment problem is that nobody knows how to achieve that. As capabilities increase, so do the potential risks from persisting misalignment (or imperfect alignment).

IHSalvator

Your first point just refutes your third, mate. If it doesn't think, it's not ASI, so there's no alignment problem (or not a serious one).

Andy Collen

I think when talking about the philosophical aspects of AI with those threatened by it, we need to be talking about AI in terms of Closed AI vs Open AI: big data systems all connected vs small individual systems that spread out the power consumption, as opposed to one major spot like Arizona, where it will be extremely difficult to power the kinds of grids they are talking about. IP is the obvious threat, but in a closed environment not so much. How we approach AI now matters most.
