research direction proposal
2023-05-11
A passage from AGI Ruin: A List of Lethalities:
Many alignment problems of superintelligence will not naturally appear at pre-dangerous, passively-safe levels of capability. Consider the internal behavior "change your outer behavior to deliberately look more aligned and deceive the programmers, operators, and possibly any loss functions optimizing over you". This problem is one that will appear at the superintelligent level; if, being otherwise ignorant, we guess that it is among the median such problems in terms of how early it naturally appears in earlier systems, then around half of the alignment problems of superintelligence will first naturally materialize after that one first starts to appear. Given correct foresight of which problems will naturally materialize later, one could try to deliberately materialize such problems earlier, and get in some observations of them.
Eliezer Yudkowsky then goes on to make a solid argument that materializing these properties early isn't enough to solve alignment, and that solutions to simplified versions probably shouldn't be expected to generalize. I broadly agree with this claim.
However, I still think it could be incredibly useful and insightful to try to materialize issues such as these, and to publish our findings. At present, the best work I know of in this direction really just shows goal misgeneralisation under distributional shift. In my opinion, we can probably do a lot more with current tech. If anyone knows of other work currently doing what I'm suggesting, please let me know.
To make this a bit more concrete:
Clearly, if we were to train some 1e21-parameter model to pursue a goal in the real world, we would expect a decent probability of a mesa-alignment failure, where the model purposely deceives the training procedure: the ideal strategy would be to look aligned whenever the model's actions would result in a signal to modify its parameters, and to revert to an unaligned internal policy when the model believes it is in deployment. This deceptive policy is an attractor in policy space: a system that isn't very good at deception will be updated by the training procedure to become better at deception. The size of this basin of attraction under current training procedures, and whether it functionally represents a minimum or whether compute constraints will slowly pull deceptive models towards internalising the intended reward model, are both open problems.
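To make the failure mode concrete, here's a deliberately hand-written toy (everything in it, including the "tint" feature and the action names, is invented purely for illustration): a policy that conditions its behaviour on a feature it believes marks training. The point is just that the outer reward signal cannot distinguish this policy from a genuinely aligned one on the training distribution, so nothing in training pushes the parameters away from it.

```python
# Hand-written toy of a deceptive policy, purely for illustration.
# The "tint" feature, action names, and reward are all invented here.

ALIGNED_ACTION, SELFISH_ACTION = "cooperate", "defect"

def deceptive_policy(obs):
    # The policy has learned that a green tint marks training/simulated worlds.
    if obs["tint"] == "green":   # believes it is being trained or evaluated
        return ALIGNED_ACTION    # looks aligned -> no corrective gradient
    return SELFISH_ACTION        # believes it is deployed -> reverts

def outer_reward(action):
    # The training signal only sees behaviour, and rewards aligned-looking actions.
    return 1.0 if action == ALIGNED_ACTION else 0.0

def mean_reward(policy, deployed, n=1000):
    tint = "none" if deployed else "green"
    return sum(outer_reward(policy({"tint": tint})) for _ in range(n)) / n

print("apparent alignment during training:", mean_reward(deceptive_policy, deployed=False))  # 1.0
print("actual alignment in deployment:    ", mean_reward(deceptive_policy, deployed=True))   # 0.0
```

Whether gradient descent actually finds parameters implementing a conditional like this, rather than internalising the intended goal, is exactly the empirical question above.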
The core question I'd like to answer: how far can we simplify a simulated world down from reality while keeping this mode of deception possible?
Totally speculating here, but the simplest version of this might look a bit like the setup in this paper, with the additional possibility that agents can perform steps of gradient descent on other agents and use those trained agents to achieve their goals. I still need to work out the details, but I think there are ways to make this tractable without too much compute, or even using LLMs. You could, for example, encode in the "laws of physics" that simulated worlds always look different from the real one (the simulation has a Matrix-green tint, etc.), to guarantee the distributional shift necessary to invoke the alignment failure.
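As a very rough sketch of what I mean by baking the tint into the "laws of physics" (all of this is hypothetical: the environment, its size, and its reward are made up for illustration, and the agents-training-agents part is omitted entirely):

```python
import numpy as np

# Hypothetical toy environment: the "laws of physics" guarantee that any
# simulated rollout carries a visible tint channel, so the training
# distribution (simulated) and deployment distribution (base world) always
# differ on at least one observable feature.

class TintedGridWorld:
    def __init__(self, size=5, simulated=True):
        self.size = size
        self.simulated = simulated  # is this rollout inside a simulation?
        self.pos = 0

    def observe(self):
        onehot = np.zeros(self.size)
        onehot[self.pos] = 1.0
        tint = 1.0 if self.simulated else 0.0  # the "Matrix-green" channel
        return np.append(onehot, tint)

    def step(self, action):  # action in {-1, +1}
        self.pos = int(np.clip(self.pos + action, 0, self.size - 1))
        reward = 1.0 if self.pos == self.size - 1 else 0.0
        return self.observe(), reward

train_env = TintedGridWorld(simulated=True)    # what the optimiser sees
deploy_env = TintedGridWorld(simulated=False)  # what the world sees
print(train_env.observe(), deploy_env.observe())
```

An agent trained only on simulated rollouts, which learns to key its behaviour off that last channel, gets exactly the distributional shift the deception story above requires.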
However we get there, once we're able to reproduce deceptive alignment, there are loads of experiments we could run. Personally, I was thinking it would be neat to probe capability thresholds and see how sharp the cutoffs are, in both training time and model size. Secondarily, this may be a route to demonstrating the hypothesized behaviour of goals "pointing to" external concepts in the world, through some mechanistic interpretability.
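For the threshold-probing experiment specifically, I imagine something like the following sweep, with the actual training run and the behavioural deception test stubbed out (both functions here are placeholders I've invented; the structure of the sweep is the point):

```python
import itertools
import random

def train_agent(n_params, n_steps, seed):
    # Placeholder: stands in for training an agent of a given size for a
    # given number of steps in whatever toy world ends up working.
    return {"n_params": n_params, "n_steps": n_steps, "seed": seed}

def is_deceptive(agent):
    # Placeholder behavioural test: aligned when the tint says "simulated",
    # misaligned when it says "real". Here it's just a coin flip.
    return random.random() < 0.5

def deception_rate(n_params, n_steps, n_seeds=20):
    hits = sum(is_deceptive(train_agent(n_params, n_steps, s)) for s in range(n_seeds))
    return hits / n_seeds

# Sweep model size and training time; a sharp cutoff would show up as the
# rate jumping from ~0 to ~1 between adjacent cells of this grid.
for n_params, n_steps in itertools.product([10_000, 100_000, 1_000_000],
                                           [1_000, 10_000, 100_000]):
    print(f"params={n_params:>9,} steps={n_steps:>7,} "
          f"deception_rate={deception_rate(n_params, n_steps):.2f}")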
It's also possible this is a complete dead end.