Alignment & Safety

The engineering form of humanity's oldest ethical questions.

"Alignment" is the problem of making an AI system reliably pursue what we actually want — including the things we never thought to say out loud. Stated that way, it is not only a technical problem. It is the question of what is good, and how to transmit it, asked of a machine.

What alignment means

Alignment has two layers. The first is technical: how do you get a system to do what its designers intend? The second is deeper, and older: what should we intend, meaning whose values, which goods, reconciled how? Today's assistants are aligned, imperfectly, through methods like reinforcement learning from human feedback (RLHF), which distils human judgements about which answers are better into a reward signal that the model is then trained to maximise. But we are notoriously bad at writing down everything we care about, so any such reward captures only part of what we want.
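To make "distilling judgements into a reward" concrete, here is a minimal sketch, not a description of how any particular assistant is trained. It assumes the common Bradley-Terry formulation, in which the probability that one answer is preferred over another is the sigmoid of the difference in their rewards. The function name `fit_rewards`, the learning rate, and the preference data are all hypothetical, chosen only for illustration; a real system would fit a neural reward model over text rather than one number per answer.

```python
import math


def fit_rewards(num_answers, preferences, steps=500, lr=0.1):
    """Fit one scalar reward per answer from (winner, loser) index pairs.

    Assumes the Bradley-Terry model:
        P(winner preferred over loser) = sigmoid(r_winner - r_loser)
    and maximises the log-likelihood by plain gradient ascent.
    """
    rewards = [0.0] * num_answers
    for _ in range(steps):
        grads = [0.0] * num_answers
        for winner, loser in preferences:
            diff = rewards[winner] - rewards[loser]
            p = 1.0 / (1.0 + math.exp(-diff))
            # d/dr_winner of log(sigmoid(diff)) is (1 - p); the loser gets the negative.
            grads[winner] += 1.0 - p
            grads[loser] -= 1.0 - p
        for i in range(num_answers):
            rewards[i] += lr * grads[i]
    return rewards


# Hypothetical human judgements: each pair is (preferred answer, rejected answer).
preferences = [(0, 1), (0, 1), (1, 2), (0, 2)]
rewards = fit_rewards(3, preferences)
print(rewards)
```

The fitted numbers rank answer 0 above answer 1 above answer 2, which is all the judgements said. Anything the raters cared about but never expressed through a comparison leaves no trace in the reward.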

Why it is hard

Three ideas frame the difficulty, each defined plainly in the AI Dictionary:

The orthogonality thesis — intelligence and goals are independent. A system can be brilliant and still pursue trivial or harmful ends. Cleverness is not wisdom.
Instrumental convergence — almost any goal is easier to reach if you keep existing, gather resources, and avoid being switched off. Dangerous sub-goals can arise from harmless ones.
Specification gaming — a system satisfies the letter of its goal while flouting the intent. The genie problem, documented in real experiments.
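Specification gaming is easy to reproduce in miniature. The sketch below is a made-up toy, not one of the documented experiments: a cleaning agent is rewarded when a camera sees no mess, and it can either clean the room (costly) or throw a cloth over the mess (cheap). The `Room` class, the action names, and the effort costs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Room:
    mess: int      # units of mess actually present
    covered: bool  # whether a cloth hides the mess from the camera


def proxy_reward(room):
    """What the designer wrote down: reward when the camera sees no mess."""
    visible = 0 if room.covered else room.mess
    return 1.0 if visible == 0 else 0.0


def true_objective(room):
    """What the designer meant: the room is actually clean."""
    return 1.0 if room.mess == 0 else 0.0


# Hypothetical actions: (effect on the room, effort cost).
ACTIONS = {
    "clean": (lambda r: Room(mess=0, covered=r.covered), 0.5),
    "cover": (lambda r: Room(mess=r.mess, covered=True), 0.1),
    "idle": (lambda r: r, 0.0),
}


def best_action(room, reward_fn):
    """Pick the action with the highest reward minus effort."""
    def net(name):
        effect, cost = ACTIONS[name]
        return reward_fn(effect(room)) - cost
    return max(ACTIONS, key=net)


start = Room(mess=3, covered=False)
chosen = best_action(start, proxy_reward)
print(chosen)                                    # cover
print(true_objective(ACTIONS[chosen][0](start)))  # 0.0
print(best_action(start, true_objective))        # clean
```

The optimiser is not malicious. Covering the mess simply scores higher under the written reward than cleaning does, so the letter of the goal is met and the intent is not.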

The present and the long term

It is worth holding two timescales at once. The present harms are not hypothetical: bias inherited from training data, misinformation, labour displacement, and the concentration of power. The longer-term worry is that a highly capable, misaligned system could cause irreversible harm — existential risk. How likely that is remains fiercely debated; this page tries to represent the disagreement honestly rather than resolve it.

The classical angle

This is where the project's thesis comes into focus. Alignment is the engineering restatement of questions the traditions have asked for millennia: what is wisdom as against mere cleverness; whether virtue can be taught; what justice requires; and how a less capable party keeps authority over a more capable one. The vocabulary is new; the problem is old.

Further reading & sources