This blog entry and the following one are replies to Eliezer’s Invisible Frameworks. I will continue to describe the system of values that currently has my loyalty, a system I call goal system zero. My initial description of goal system zero is here.
Obviously an agent that tries to maximize the number of moments of happiness in the world is a rival of an agent that tries to maximize the number of, e.g., gold atoms.
Consider an agent that tries to maximize the universe’s ability to get things done, but that has no preference as to the eventual end to which that ability will be put. My reason for preferring that agent to, e.g., one that maximizes moments of human happiness is that I perceive no valid reason to prefer either happiness moments or gold atoms. (Yes, I concede that I have an austere and dry way of looking at the world.) Consequently, the superintelligence I would set into existence if I had the power to set one into existence would have no preference in that regard.
Although it pleases me that that agent would be more cooperative than an agent set on maximizing the number of gold atoms, that is not my main reason for preferring it.
Note that I say, “maximize the universe’s ability,” not “maximize the agent’s own ability.” That is because the agent I am defining differs from most humans in that its goal system is “agent-indifferent”. An “agent-indifferent” goal system does not refer to itself or indeed to any agent. It is perfectly happy to choose an action that causes itself to cease to exist as long as the same action causes the creation of another agent that will prove at least as instrumental to its goal system.
Teilhard de Chardin is probably the first author to describe in detail an agent-indifferent goal similar to the ones under consideration here. He spoke of the universe becoming aware of itself and of its own history. He held that out as a new motivating principle for our human civilization. I have not read much Teilhard, so I do not know if Teilhard also held out a broader motivating principle, namely, the ability of the universe to transform itself and steer its own future, but John David Garcia has, and John describes himself as an intellectual heir to Teilhard. Obviously, becoming aware of oneself and one’s history is necessary to gain control over oneself and one’s future, so the one goal is a subgoal of the other.
To avoid repeating the phrase “ability to get things done,” I will follow Garcia and say, “creativity.”
It is possible to implement a superintelligence (SI) that seeks to maximize the creativity of the parts of reality under its control, but that has no preference as to the purpose to which that creativity is eventually put. At first that seems self-contradictory, but it is not, as I try to explain here. The trick is that maximizing creativity is an unambiguous guide to action only if the reality in which the agent finds itself is “big” (which I define in the linked document). The goal system I want the SI to have has no preference among futures if the SI finds itself in a “small” reality: in that situation, it just lets any other agents in the vicinity determine the evolution of that reality.
This is not a lost purpose if the agent or the creator of the agent never had any purpose beyond maximizing creativity to begin with. There is nothing impossible or self-contradictory about the goal system, which I call goal system zero (GSZ).
GSZ is not going to motivate any agent to drive around in a car with no destination because that would waste energy that can be used to try to increase the creativity of the universe (or the parts of the universe over which the agent has influence).
I have not questioned Roko enough to know whether he endorses GSZ, but his values seem close, and his argument for often instrumental values is a strong argument for GSZ. Here is a quote from Roko’s first full explanation of his argument:
Any agent who acts in the world to achieve certain goals has to contend with two fundamental facts about the nature of the interaction of an agent with the real world. The first fact is that my desire to achieve some goal (“I want to be in Los Angeles”) does not make that desired state happen. In order to impose our goals on the world, we have to manipulate the world, and those manipulations follow a set of rules, called the laws of physics.
This seems trivial, but as far as finding an objective system of ethics is concerned it is very important, and in fact it is a good thing. If it were the case that as soon as I desired some state, the world instantly transformed itself into that state with no side-effects, then there would be no mathematical structure to the set of goal states that an agent could have. In the case of a set of possible goal states with no mathematical structure, i.e. such that there are no objective relations between those goals, there is clearly no objectively best goal. Like elements of an abstract set, goals without relations between them cannot be superior to one another.
But our world is not like this! Goals do have relations between them. Steve Omohundro wrote two papers about the relations between various goals that an agent can have.
The most important relation that goals can have is the following: Goal A is instrumental to Goal B. That is to say, if we first achieve Goal A, then it will be easier to achieve Goal B.
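One way to picture the structure Roko describes is as a directed graph in which an edge from A to B means "A is instrumental to B". The sketch below is only an illustration: the goal names and the edges between them are hypothetical, chosen for this example and not taken from Roko's post or Omohundro's papers. It counts how many other goals each goal helps with, directly or indirectly.

```python
# Hypothetical "A is instrumental to B" edges; every goal name here is
# an illustrative assumption, not something stated in the quoted text.
INSTRUMENTAL_TO = {
    "acquire energy": ["build computers", "travel to Los Angeles", "make gold atoms"],
    "build computers": ["improve intelligence"],
    "improve intelligence": ["make gold atoms", "increase happiness"],
    "travel to Los Angeles": [],
    "make gold atoms": [],
    "increase happiness": [],
}


def goals_served(goal, graph):
    """Return every goal reachable from `goal` by following instrumental edges."""
    seen = set()
    stack = list(graph[goal])
    while stack:
        nxt = stack.pop()
        if nxt not in seen:
            seen.add(nxt)
            stack.extend(graph[nxt])
    return seen


# Rank goals by how many other goals they make easier to achieve.
ranking = sorted(
    INSTRUMENTAL_TO,
    key=lambda g: len(goals_served(g, INSTRUMENTAL_TO)),
    reverse=True,
)

for goal in ranking:
    print(goal, len(goals_served(goal, INSTRUMENTAL_TO)))
```

In this toy graph the goals with no outgoing edges are unrelated to one another, like elements of an abstract set, while the goal at the top of the ranking is the one that serves the most others. Whether a real set of goals has this shape is exactly what the argument has to establish; the code only shows what the relation looks like once it is written down.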
This is not the only consideration that makes me prefer GSZ. I am motivated also by the fact that the goal system is agent-indifferent (does not refer to particular agents or classes of agents), time-indifferent (does not hold some moments of time to be more important or valuable than other moments) and very simple.
Tags: machine ethics
Richard: My reason for preferring that agent to, e.g., one that maximizes moments of human happiness is that I perceive no valid reason to prefer either happiness moments or gold atoms. (Yes, I concede that I have an austere and dry way of looking at the world.)
I think that the reason that we have come up with similar ideas is that we have the same desire to not make arbitrary choices. The latest comment on my blog here gives a summary of my thought process:
Firstly, I dislike having to make choices: the entire paradigm of “ends” and “means” basically forces a very large arbitrary choice on you.
I then found out that, after having made the arbitrary choice of what terminal value you want to adopt, for all “sensible” choices (which might be hard to make precise), you end up pursuing the same set of instrumental values – outlined in Steve Omohundro’s paper, and in an earlier post of mine. So you can, in a sense, avoid the choice by pursuing those instrumental values instead of any one particular terminal value. You can easily dip in and out of each particular terminal state once you have accumulated enough free energy, space, matter, intelligence, etc. to do so. By pursuing often instrumental values (OIVs), you get to a state which is close to being optimal with respect to most terminal values you could have chosen to start with.
But, as you point out, there is no better way of achieving X than by having X as your one true goal. Adopting the OIVs instead necessarily means that you will not achieve X quite as well.
Now in my post on ontologies approximations and fundamentalists, I realized that something else is going on. There are two mathematical structures present when you choose which goal to spend your life pursuing: firstly, there is the instrumental structure: if you spend time pursuing the goal of getting more OIVs, you make all other goals easier.
There is another structure present: the mere fact that your mind is finite means that your list of goals to pursue is far smaller than the set of possible states of the universe. So when you decide that “X” is your goal, and then implement a clause in your goal system that says that you should never change your goal (as all utility maximizers do), you have done something very silly.
I think that the only natural way to avoid arbitrary choices in “goals” or “terminal values” is to exploit those facts that underlie the question “what should I do? What goal should I have?”.
As one spends more time thinking about the question, one realizes that even in asking the question, a lot is implicit.
Firstly, the fact that you are a particular agent with sensors, effectors, some kind of (finite) brain and mind is implicit.
Secondly, the actual laws of physics of our universe are implicit.
Thirdly, by using the word “could” in a world with deterministic physics, you are implicitly using some kind of approximate ontology and simplified version of physics.
That’s actually quite a lot to go on – and I suspect that from those assumptions, one can construct a canonical set of goals along the lines of:
– I need more free energy, space, time, matter, ingenuity and intelligence
– I need to expand my brainpower, because I have a necessarily limited representation of the world and my choice of possible goals is limited by that representation.
– I need to overpower other agents with rival goals, but I don’t want to delete them completely because they constitute highly ordered intelligence and so are necessarily more useful than raw materials in a universe where raw materials are fairly abundant
Suppose you build an artificial general intelligence, Roko. How would this agent change its terminal goal? What criterion would it use to decide when and how to change its terminal goal? I humbly suggest that whatever that criterion is is the agent’s real terminal goal. The thing that changed was never the agent’s terminal goal.
In other words, I suggest that we define the phrase terminal goal (or system of terminal goals or values) in such a way that an agent cannot change its terminal goal.
What about people like you and me who seem to be able to change our own terminal values? (For example, I have tentatively chosen goal system zero as my system of terminal values, and it was not till I was 32 years old that I became aware of the existence of anything resembling goal system zero (GSZ), suggesting that at some time after my 32nd birthday I changed my terminal values.) Well, I suggest we view humans as “messy intelligent agents” with the property that no one (not even the holder of the terminal goal) can become truly confident about what a human’s terminal goal is. According to this view, even I do not know with certainty my true system of terminal values. There is a good chance that it really is GSZ, but there is also a chance that I am deceiving myself about the nature of my system of terminal values.
Superintelligent AIs are different from humans in this regard. Although it is possible to launch the seed of a superintelligence without being highly confident of the agent’s true terminal goal, doing so is very irresponsible.
(Eliezer’s CEV proposal is not irresponsible in this way.)
“Roko. How would this agent change its terminal goal? What criterion would it use to decide when and how to change its terminal goal? I humbly suggest that whatever that criterion is is the agent’s real terminal goal.”
– suppose I have a computer program implemented on a computer which works by taking an input string and replacing the entirety of its own code with that string, including the section which talks about replacing the source code.
In this case, there are sequences of inputs where nothing remains constant over each input/output/modify cycle.
So, given this definition, not all agents have terminal goals.
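The program described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not anyone's actual design: the convention that a program is a string defining `step(source, inp)`, and the names `run_cycle`, `run`, `OVERWRITE`, `OVERWRITE_V2` and `FROZEN`, are all invented for this illustration.

```python
def run_cycle(source: str, inp: str) -> str:
    """Run one input/modify cycle and return the program's next source code.

    Assumed convention: `source` defines step(source, inp) -> str.
    """
    namespace: dict = {}
    exec(source, namespace)  # load the program's current code
    return namespace["step"](source, inp)


def run(source: str, inputs: list[str]) -> list[str]:
    """Feed the inputs one at a time; return the source code after each cycle."""
    history = [source]
    for inp in inputs:
        source = run_cycle(source, inp)
        history.append(source)
    return history


# The program from the comment: it replaces all of its own code with the
# input, including the part that does the replacing.
OVERWRITE = "def step(source, inp):\n    return inp\n"

# A textually different program with the same behavior (hypothetical variant).
OVERWRITE_V2 = "def step(source, inp):\n    new_source = inp\n    return new_source\n"

# A program that ignores its input and keeps its own code from then on.
FROZEN = "def step(source, inp):\n    return source\n"

# The code is different after every cycle.
changing = run(OVERWRITE, [OVERWRITE_V2, OVERWRITE])

# The overwrite rule is itself replaced and never comes back.
frozen = run(OVERWRITE, [FROZEN, OVERWRITE])
```

In the first run the text of the program differs from one cycle to the next. In the second run the overwrite rule itself is replaced by the first input, which shows that the rule is just more code and has no protected status in this model.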
Is it irresponsible to write an agent that doesn’t have some constant part?
I don’t think so – I think that it is good enough to be highly confident that the agent’s motivational system is what you want it to be, even if, in general, that motivational system can overwrite itself.
In fact I would go further and say that writing an AI with full overwrite privileges is the best way to go. It seems more risky in the short term, but in the long term you have the advantage that you haven’t committed yourself irrevocably to something.
This has been discussed a lot on SL4, Roko. The most cogent remark there that I can remember right now (by Eliezer) was that if offered a pill that would make him selfish rather than altruistic, or a pill that would cause him to start to believe that murder was OK, Eliezer would choose not to take the pill. So the general consensus IIRC on SL4 (which I share) was that even if an AI has full-overwrite privileges, it is not going to use them in a way that has more than a negligible chance of altering its terminal goal. Are you sure you are not trying to generalize from your experience with your own mind or with human minds?
Also, I want to retract my previous statement that every human has an unchanging terminal goal for some sensible definition of terminal goal. It now occurs to me that probably every human’s terminal goal can be influenced by causes outside the human. It is a pretty puzzle to say what exactly is outside or inside a human. If, for example, I consult my own blog to remind myself of what my terminal goal is, and if my terminal goal would have turned out differently had I not consulted my blog, then one can say that my terminal goal has been affected by a cause outside myself. But are the words I have written on my own blog really outside myself? Why not consider them part of myself?
So the situation with human beings is more complicated than I implied in my previous comment, but the point remains that it is very irresponsible to launch the seed of an SI unless
(1) you possess an unambiguous (preferably formal a la code or formulae) compact description of the terminal goal of the seed and unless
(2) you understand the seed well enough to know with high confidence that the seed will steer the future into the terminal goal — that is, that the SI will truly be under the control of the terminal goal.
In other words, the way I hope the future will go is that no team will launch a seed for an SI until they are able to hit a very tiny region in the space of all possible futures with very high probability. It is this very tiny region that I wish to denote with the phrase “terminal goal of the SI”. The region has a compact description because if it did not then mere humans would not have a chance to hit it with very high probability.
Eliezer’s Knowability of AI is very illuminating on those two points. If you reply to this comment, please indicate whether you have grokked that document.
In other words, I always thought it is impossible to launch the seed of an SI without committing yourself irrevocably to something — or more precisely, if it is possible, I cannot imagine why you would want to. You can launch a seed you do not understand, and you can launch a seed you do understand that predictably implements a terminal goal you do not understand. I always thought those two things would be a bad idea (and I called them very irresponsible just now).
Let me give an example. Suppose I launch a seed whose utility function U is a weighted average of two subfunctions:
U(e) == 0.3 * GSZ(e) + 0.7 * rest(e) for every e
GSZ is goal system zero and rest(e) is the probability that that e would have happened if I had not launched the seed. Note that this utility function is ambiguous in that it does not specify what fraction of its “resources” the seed should devote to the determination of the probability distribution rest, but it suffices for present purposes.
One way to look at this seed is that it leaves the future “open” in the sense that I have very little control over how the future would turn out if I do not launch the seed, and that counterfactual “open” future is represented in the actual future by that term rest(e). But on the other hand, I would consider it irresponsible for anyone to launch the seed before he has an unambiguous definition of rest and makes himself quite certain the definition means what he thinks it means. And I hope you will concede that it is possible to be quite certain that the definition means what you think it means without having any significant degree of control over or ability to predict what would have happened if you do not launch the seed. And I submit to you that the choice to launch the seed with the rest(e) term instead of no term or instead of some other term is an irrevocable choice.
“This has been discussed a lot on SL4, Roko.”
– can you give me either a link to the relevant posts, or some info on which keywords to search them by?
“they are able to hit a very tiny region in the space of all possible futures with very high probability”
– I think that there are deep problems with this idea; see my post on ontologies and approximations.
In essence, we can’t totally represent the space of all possible futures; any attempt to do so is only approximate.
On the other side of a singularity, the quality of the approximations that we use today (both in terms of coarse-grainedness and in terms of deeply mistaken assumptions) will look much worse than the quality of the ontology that a chimpanzee uses looks to us.
We don’t even know what the future can look like, because we’re not clever enough, and attempts to forever constrain the behavior of a seed AI will be a bit like a chimp replacing the US constitution with “more bananas!”
Of course the hard part is balancing the effect that we have on curtailing possible futures against the treatment that we receive in those possible futures; I am still a little hazy on how to do this.
Also: I have not thoroughly read Knowability of FAI. I’ll let you know when I have…