Core arguments about existential risk from AI misalignment often reason about AI "objectives" to make claims about how AI systems will behave in novel situations. This is often some variant of "because the AI system is pursuing an undesired objective, it will seek power in order to accomplish its goal, which causes human extinction". For example, "The AI does not hate you, nor does it love you, but you are made out of atoms which it can use for something else." This is a prediction about a novel situation, since "causing human extinction" is something that only happens at most once. I often find these arguments plausible but not rock solid, because it doesn't seem like there is a notion of "objective" that makes the argument clearly valid.

Conversely, arguments that AI systems will be safe often reason about objectives in the same way. This is often some variant of "we will use human feedback to train the AI system to help humans, and so it will learn to pursue the objective of helping humans." Implicitly, this too is a prediction about what AI systems do in novel situations: for example, it is a prediction that once the AI system has enough power to take over the world, it will continue to help humans rather than execute a treacherous turn.

When we imagine powerful AI systems built out of large neural networks, I'm often somewhat skeptical of these arguments, because I don't see a notion of "objective" that can confidently be claimed to be:

1. Probable: there is a good argument that the systems we build will have an "objective", and
2. Predictive: if I know that a system has an "objective", and I know its behavior on a limited set of training data, I can predict significant aspects of the system's behavior in novel situations (e.g. whether it will execute a treacherous turn once it has the ability to do so successfully).

Note that in both cases, I find the stories plausible, but they do not seem strong enough to warrant confidence, because of the lack of a notion of "objective" with these two properties.

So, when choosing a notion of "objective", you either get to choose a notion that we currently expect to hold true of future deep learning systems (Probable), or you get to choose a notion that would allow you to predict behavior in novel situations (Predictive), but not both. The core difficulty is that we do not currently understand deep learning well enough to predict how future systems will generalize to novel circumstances. In the case of AI risk, this is sufficient to justify "people should be working on AI alignment", but I don't think it is sufficient to justify "if we don't work on AI alignment we're doomed".

In the first part, I'll briefly gesture at arguments that make predictions about generalization behavior directly (i.e. without reference to "objectives"), and why they don't make me confident about how future systems will generalize. In the second part, I'll demonstrate how various notions of "objective" don't seem simultaneously Probable and Predictive.

Part 1: We can't currently confidently predict how future systems will generalize

Note that this is about what we can currently say about future generalization. I would not be shocked if in the future we could confidently predict how future AGI systems will generalize.

My core reasons for believing that predicting generalization is hard are that:

1. We can't predict how current systems will generalize to novel situations (of similar novelty to the situations that would be encountered when deliberately causing an existential catastrophe).
2. There are a ridiculously huge number of possible programs, including a huge number of possible programs that are consistent with a given training dataset; it seems like we need strong evidence to narrow down the space enough that we can make predictions about generalization.

These are not decisive; it is simply an uninformative prior from which I start. It is also not necessarily hard to get strong evidence. For example, I am happy to confidently predict that, given an English sentence that it has never seen before, GPT-3 would continue it with more English. But I haven't seen arguments that persuade me to be confident about how future systems will generalize.

(Note that, while many of these arguments are inspired by things I've read or heard, the presentation here is my own and may not accurately represent anyone else's beliefs.)