Alignment & Safety
The engineering form of humanity's oldest ethical questions.
"Alignment" is the problem of making an AI system reliably pursue what we actually want — including the things we never thought to say out loud. Stated that way, it is not only a technical problem. It is the question of what is good, and how to transmit it, asked of a machine.
Alignment has two layers. The first is technical: how do you get a system to do what its designers intend? The second is deeper, and older: what should we intend? Whose values, which goods, reconciled how? Today's assistants are aligned, imperfectly, through methods like reinforcement learning from human feedback (RLHF), which distils human judgements of which answers are good into a reward signal that the system is then trained to maximise. But we are notoriously bad at writing down everything we care about, so a reward built from our judgements is only ever a proxy for what we actually want.
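To make "distilling human judgements into a reward" concrete, here is a minimal sketch, assuming judgements arrive as pairwise choices between answers and are modelled with the Bradley-Terry model, a common choice for reward modelling. The function name `fit_rewards` and the preference data are hypothetical, invented for illustration. Real RLHF pipelines train a neural reward model over text and then optimise a policy against it; this toy version fits only one number per answer.

```python
import math

def fit_rewards(num_items, preferences, steps=500, lr=0.1):
    """Fit one scalar reward per item from (winner, loser) index pairs.

    Uses gradient ascent on the Bradley-Terry log-likelihood, where
    P(winner preferred over loser) = sigmoid(r_winner - r_loser).
    """
    rewards = [0.0] * num_items
    for _ in range(steps):
        grads = [0.0] * num_items
        for winner, loser in preferences:
            diff = rewards[winner] - rewards[loser]
            p = 1.0 / (1.0 + math.exp(-diff))
            # Gradient of log(p): push the winner up and the loser down,
            # more strongly when the model found this outcome surprising.
            grads[winner] += 1.0 - p
            grads[loser] -= 1.0 - p
        rewards = [r + lr * g for r, g in zip(rewards, grads)]
    return rewards

# Hypothetical data: three candidate answers (0, 1, 2) and a handful of
# rater choices, each recorded as (preferred answer, rejected answer).
preferences = [(0, 1), (0, 1), (1, 0), (0, 2), (1, 2), (1, 2), (2, 1)]
rewards = fit_rewards(3, preferences)

# Rank answers from highest to lowest learned reward.
ranking = sorted(range(3), key=lambda i: rewards[i], reverse=True)
```

The learned numbers summarise only what raters happened to compare. Anything they never thought to express a preference about is invisible to the reward, which is the gap the paragraph above points at.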
Three ideas frame the difficulty, each defined plainly in the AI Dictionary:
It is worth holding two timescales at once. The present harms are not hypothetical: bias inherited from training data, misinformation, labour displacement, and the concentration of power. The longer-term worry is that a highly capable, misaligned system could cause irreversible harm — existential risk. How likely that is remains fiercely debated; this page tries to represent the disagreement honestly rather than resolve it.
This is where the project's thesis comes into focus. Alignment is the engineering restatement of questions the traditions have asked for millennia: what is wisdom as against mere cleverness; whether virtue can be taught; what justice requires; and how a less capable party keeps authority over a more capable one. The vocabulary is new; the problem is old.