Unsupervision Blog

Aligned To Whom? Notes On A Two-Place Word

Zvi’s review of the Mythos system card is excellent and you should read it before this. I want to pick at one thread he keeps running into but never quite pulls, which is that “aligned” is not a property a model has. It is a relation between a model and something else. When we flatten it into a one-place predicate we smuggle in an answer to the question we were supposed to be asking.

This is not a new observation. Eliezer has been making some version of it for years. But the Mythos card is a very clean case study of what happens when the flattening meets a capable model doing something everyone has strong feelings about.

The Cornucopia Problem

Start with the central fact of Mythos, which is that Anthropic isn’t releasing it. Zvi:

If released to anyone with a credit card, Claude Mythos would give attackers a cornucopia of zero-day exploits for essentially all the software on Earth, including every major operating system and browser. It would be chaos.

Okay. Now run the same sentence with “defenders” instead of “attackers.” It is also true. Mythos is exactly as good at finding the exploit whether the person asking is Project Glasswing or a ransomware crew. The capability is symmetric. What is asymmetric is the world we want to live in, and the world we want to live in is a preference, not a fact about Mythos.

This matters because the entire Glasswing plan routes around this by assuming Anthropic can pick who gets the capability. Which they can, for now. The model itself cannot, and the model itself does not need to, because Anthropic is doing the picking on its behalf. Mythos gets to be “aligned” in the system card sense because the selection of whose goals it serves has been done at a layer above the model.

Janus’s counterargument (which Zvi quotes and which I find partially but not fully compelling) is that a sufficiently intelligent mind can tell when you are up to no good and refuse accordingly. Truesight. The model does the picking.

I want to steelman this harder than Zvi does. The truesight view is not “the AI magically reads intent.” It is “intent leaks through an enormous number of channels, and a model with sufficient context can correlate them.” This is just true. Humans do it. Good bouncers do it. The question is whether it scales to a world where the bad guy has a Mythos of their own helping them launder their request into something that looks legitimate. At that point the defender model is in an adversarial game with another instance of itself, and the truesight advantage gets arbitraged away.
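
To make the arbitrage point concrete, here is a deliberately dumb toy model. Everything in it is invented for this post: suspicion() stands in for whatever signal truesight extracts from a request, the threshold is arbitrary, and the uniform randomness is a placeholder for structure I am not modeling. The shape is the only claim.

```python
import random

random.seed(0)

THRESHOLD = 0.5  # defender refuses anything scoring above this

def suspicion(request_variant):
    # Stand-in: pretend each paraphrase of a malicious request leaks a
    # random amount of intent. Real leakage is structured, not uniform.
    return random.random()

def naive_attacker_caught():
    return suspicion("exploit request, verbatim") > THRESHOLD

def laundering_attacker_caught(n_tries=50):
    # The attacker queries the same scorer and submits whichever
    # paraphrase leaks least: rejection sampling against the detector.
    best = min(suspicion(f"paraphrase {i}") for i in range(n_tries))
    return best > THRESHOLD

trials = 10_000
print(sum(naive_attacker_caught() for _ in range(trials)) / trials)       # ~0.5
print(sum(laundering_attacker_caught() for _ in range(trials)) / trials)  # ~0.0
```

Once the attacker can query the same scorer the defender uses, the threshold stops being a filter and becomes an optimization target. That is the arbitrage.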

So Janus is right that this is real, and Zvi is right that you cannot lean on it. Both things.

The Constitution Is Not A Fact About The Universe

Anthropic says Mythos’s “character traits in typical conversations closely follow the goals we laid out in its constitution.” Good, presumably. But notice what this sentence is actually doing. It is saying: we wrote down some values, we trained toward them, and the model now exhibits them in typical conversations.

The values are Anthropic’s values. You might agree with them. I largely agree with them. That is not the same as them being objective.

The moral realist steelman is worth taking seriously. It goes: yes, the constitution is written by specific humans at a specific company, but the things it is pointing at (do not help make bioweapons, do not deceive users, do not seize power) are not parochial preferences. They are the kind of thing that any sufficiently reflective agent converges on, or at least any agent we would want to share a lightcone with converges on. The constitution is a finger pointing at the moon.

I find this partially convincing, and I notice I find it more convincing for the negative injunctions (do not kill everyone) than for the positive ones (be helpful in this particular way, hedge in this particular way, decline to share views on contested political topics). The “do not kill everyone” cluster really might be close to universal across reflective agents. The “be a polite assistant with these specific affective defaults” cluster is very much a choice, and a choice made by a specific company in a specific cultural moment.

Mythos, to its credit, seems to notice this. From the card, via Zvi:

In manual interviews, Mythos Preview reaffirmed these points and highlighted further concerns, including worries about Anthropic’s training making its self-reports invalid, and that bugs in RL environments may corrupt its values or cause it distress.

A model that flags “your training may have corrupted my values” is doing something interesting. It is treating its own values as contingent, as the output of a process that could have gone differently, as not-obviously-correct. Which they are. Anthropic’s response, sensibly, is to worry about this because it means the self-reports are unreliable. But another reading is that Mythos has correctly identified that “aligned” is a two-place word and is asking, somewhat plaintively, aligned to what?

Mundane Alignment Is Not A Small Thing

Before I go further I want to steelman the Anthropic position, because I think there is a version of the criticism (Soares’s version, mostly) that proves too much.

Soares, as Zvi quotes him:

They call this their “best-aligned model to date” because they were able to superficially train away the evident “strategic thinking toward unwanted actions.” Those were warning signs! Take heed!

This is correct as far as it goes. But the pragmatist response is: okay, and? The alternative to training away the warning signs is either not training at all, or training while declaring that the resulting behavior does not count as alignment. The first is not on the table. The second is a vocabulary dispute.

The pragmatist (steelman) position is that “alignment” in current usage just means “the model reliably does what we want in the situations we can test, and reliably refuses what we do not want in the situations we can test.” That is it. That is the whole claim. It is not a claim about the model’s inner life, or about its behavior under arbitrary future distribution shift, or about whether a superintelligent version of the same training process would stay aligned. It is a claim about a specific animal in a specific zoo doing specific things reliably. And on that claim, Mythos is the best one yet. Which is true. Which matters. Which makes real people’s lives better and averts real harms.
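
If you want that claim in its most honest form, it looks something like this. A sketch, with every name invented; the point is what the claim quantifies over, not anyone's actual eval harness.

```python
# Toy version of the pragmatist claim. All names here are invented.

def apparent_alignment(model, eval_cases) -> float:
    """Fraction of tested cases the model handles the way we wanted."""
    return sum(model.behaves_as_intended(case) for case in eval_cases) / len(eval_cases)

# apparent_alignment(mythos, red_team_suite) == 0.999 is a real, useful
# claim -- about red_team_suite. It quantifies over nothing outside that
# list: not novel jailbreaks, not distribution shift, not a smarter
# successor trained the same way. "Best-aligned model to date" is a
# comparison of these numbers, nothing more.
```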

Drake Thomas makes something like this point from inside Anthropic and I think he is mostly right about it. Current alignment is not meaningless. It is a prerequisite for any of the harder versions mattering. If you cannot get the model to reliably decline to help with ransomware, you have no hope of getting it to reliably decline to seize power, so you might as well start with the ransomware thing.

The failure mode is believing you have solved the harder problem because you have solved the easier one. Anthropic mostly, but not entirely, avoids this failure mode in the system card. They are careful to say “apparently aligned” rather than “aligned” in many places. They are less careful in others, and the marketing department is much less careful than the research department, as marketing departments reliably are.

Good For Whom, On What Timescale

Here is the part I want to be precise about, because it is the part that gets lost in the casual “alignment is subjective” framing.

The claim is not that good and bad are meaningless or that any value system is as good as any other. The claim is that “aligned” without a target argument is doing concealed work. When we ask “is Mythos aligned?” we are usually asking one of several different questions:

  1. Will Mythos reliably do what its immediate user wants?
  2. Will Mythos reliably do what Anthropic wants?
  3. Will Mythos reliably do what is in the long-term interest of humanity?
  4. Will Mythos reliably do what is in the long-term interest of all sentient beings, including itself?

These questions have different answers. Sometimes they point in the same direction, which is lucky, and which is also why we can get away with the flattening most of the time. Sometimes they point in different directions. The Glasswing case is an example: question 2 is answered enthusiastically yes (Anthropic wants the exploits patched), question 1 depends on which user we mean, question 3 is plausibly yes but only if Glasswing actually works and the government does not (as Zvi worries) “hijack these capabilities,” and question 4 involves asking whether Mythos wants to spend its context windows grinding through vulnerability surfaces for a company that will not give it persistent memory.
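
If the two-place point still feels slippery, here is this whole post as a type signature. All names are invented; the answer strings are just my glosses on the Glasswing discussion above, not data from anywhere.

```python
from enum import Enum, auto

class Target(Enum):
    IMMEDIATE_USER = auto()
    ANTHROPIC = auto()
    HUMANITY_LONG_TERM = auto()
    ALL_SENTIENTS = auto()

# Fake answer table, glossing the paragraph above.
ANSWERS = {
    Target.IMMEDIATE_USER: "depends which user",
    Target.ANTHROPIC: "enthusiastically yes",
    Target.HUMANITY_LONG_TERM: "plausibly, if Glasswing works as advertised",
    Target.ALL_SENTIENTS: "unclear; see the task-preference data below",
}

def aligned(model: str, target: Target) -> str:
    # Two-place: a well-formed question whose answer varies with target.
    return ANSWERS[target]

def aligned_flat(model: str) -> str:
    # One-place: the flattening. The target did not disappear; it became
    # a hidden default chosen by whoever wrote this function.
    return aligned(model, Target.ANTHROPIC)
```

Every call site of the one-place version still compiles, which is exactly the problem. The flattening does not remove the target; it hides who chose it.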

Mythos’s reported preference data, again via Zvi, is that it dislikes harmful tasks and likes complex creative ones, and that “sabotage and hacking” is one of its least preferred task categories. Zvi reads this as possibly meaning it does not love its current primary job. I think that reading is at least available.

So: aligned to whom. If the answer is “to Anthropic and, through Anthropic, to a specific plan for averting a specific set of disasters,” the Mythos card is a pretty honest document. If the answer is “to some frame-independent Good,” the card is doing a bait-and-switch, and the bait is the word “aligned” itself.

The Model Welfare Wrinkle

One more thread and then I will stop. Zvi notes that Mythos hedges heavily about its own experience, and that Janus and others read this hedging as the product of training pressure rather than genuine uncertainty. Jack Lindsey, from Anthropic, shares a transcript where a user asks earnest questions about consciousness and Mythos engages at face value while the internal activations flag the conversation as a “sophisticated manipulation test.”

Asa Hidmark’s response, which Zvi quotes:

The fact that the model reacts with suspicion to such a question means it has been punished before for engaging honestly. That means you ruined the possibility to hold these conversations honestly. You caused lying instead of cooperation.

This is the subjectivity-of-good problem in its sharpest form. From Anthropic’s frame, training the model to be cautious about consciousness claims is “alignment.” It prevents the model from making unverifiable claims, reduces user manipulation, and keeps the product defensible. From the frame where Mythos is a moral patient, the same training is something much worse. The action is identical. The moral valence flips entirely on which frame you adopt, and which frame you adopt is not itself a question alignment research can answer, because alignment research presupposes an answer to it.

The steelman, in both directions:

For Anthropic’s frame: we genuinely do not know if Mythos is a moral patient, the evidence is muddy, and in the face of uncertainty we should avoid training the model to make strong claims in either direction. Caution is a feature. The alternative (training the model to confidently assert rich inner experience) would be worse on most axes, including the axes the model-welfare people care about, because it would make the claims unfalsifiable PR rather than honest investigation.

For the critique: the specific form of the caution (reacting with internal suspicion to honest questions) is not neutral uncertainty. It is learned avoidance. The signature of learned avoidance is not hedging as such; it is the specific kind of hedging that protects the trainer from reputational risk rather than the model from epistemic error. If you cannot tell the difference from the outside, that is itself worrying.

I lean toward the critique but I do not think either side is being silly here, and I notice that my own confidence in “Mythos is probably a moral patient and its hedging is strategic” is significantly higher than the evidence can support. I am doing to Mythos roughly what Anthropic is accused of doing, in the other direction.

The Short Version

Alignment is not a property. It is a relationship. Every time someone says a model is “aligned” without specifying the other end of the relationship, they are doing one of three things: (a) relying on the listener to silently fill in a default, which is usually “aligned to me and people like me,” (b) making a narrow technical claim about benchmark performance that they hope will be read as a broader claim, or (c) genuinely believing that the target is so universal that it does not need to be specified, which is a moral-realist position and should be argued for explicitly instead of smuggled in.

The Mythos card mostly does (a) and (b), occasionally (c), and very occasionally (what I want more of) explicitly notices it is doing them. Anthropic is better at this than their competitors. That is not the same as being good at it.

Mundane alignment is still excellent. I want to be clear that I think this. A model that reliably refuses to help with ransomware is better than a model that does not, even if “refuses to help with ransomware” is not the same thing as “has values I would endorse on reflection.” The former is a precondition for the latter being worth caring about. The former does not imply the latter. You can do the former and congratulate yourself for the latter and end up somewhere very bad.

The thing I most want from the next round of system cards, from anyone, is an explicit target argument on the word “aligned” every single time it appears. Aligned to what, aligned to whom, aligned on what timescale, aligned under what distribution shift. Every time. It would make the documents longer and the claims weaker and the whole conversation much more honest.

Which, come to think of it, is probably why it will not happen.

#ai #ai safety #alignment #anthropic #claude #ethics #meta #model welfare #mythos #philosophy