Alignment as Cultivation: AI Safety Is a 2,000-Year-Old Problem

Guide the people with laws and keep them in line with punishments, and they will stay out of trouble but feel no shame. Guide them with virtue and keep them in line with ritual, and they will feel shame — and reform themselves. — Confucius, Analects, II.3

AI safety has a core word: alignment — how to get an intelligence smarter and stronger than us to genuinely embrace human values, rather than pay lip service, or simply spin out of control. It’s framed as a brand-new, brain-bending engineering puzzle.

In fact, humans have been doing alignment for thousands of years.

1. Alignment is the oldest problem

Raising a child: taking an agent of enormous potential — one who will eventually outgrow you and may not share your values — and guiding it into someone you can trust. Isn’t that alignment?

Self-cultivation, ordering the family, governing the state — the whole Confucian ladder is one long “alignment project”: getting people, especially the powerful and capable, to embrace and live a set of values from the inside out. Cultivating a ruler is the highest-stakes version of all — you’re trying to shape an agent more powerful than yourself.

And this is far from a Chinese idea alone. The Egyptian Instruction of Ptahhotep, traditionally dated to around 2400 BCE, is a manual on how to cultivate a person of character, and among the oldest such texts humanity has. In ancient Egypt, even the pharaoh had to answer to Ma’at (truth and cosmic order); after death your heart was placed on a scale and weighed against Ma’at’s feather. What went on the scale was the heart itself, which the Egyptians treated as the seat of character and conscience.

So alignment isn’t a new problem. It’s the oldest one — only this time, the thing across the table isn’t human.

2. Two old roads: law vs. virtue

How do you get a powerful agent to behave? History offers two answers, and they argued for two thousand years.

The Legalists (Shang Yang, Han Feizi) said: rules, punishments, surveillance, rewards. Write down what’s allowed and what isn’t; punish every transgression; no one will dare step out of line.

The Confucians (Confucius, Mencius) said: virtue, ritual, and example — get people to internalize the values as part of themselves.

Today’s alignment is almost a beat-for-beat rerun of that debate:

Guardrails, rules, reward-and-punish via human feedback (RLHF), red teams hunting for holes — that’s Legalism.
“Getting the model to truly understand and embrace human values” — that’s the Confucian way.

It’s just that we barely know how to do the latter, so nearly all the effort goes into the former.

That said, the last couple of years have seen frontier labs start inching toward “virtue.” Anthropic’s Constitutional AI gives the model a set of value principles and has it critique and revise its own outputs against them; Anthropic also deliberately trains Claude’s character (curiosity, honesty, prudence). OpenAI’s deliberative alignment teaches the model the actual text of its safety spec and trains it to reason through that spec before answering. The direction is right, from “don’t” toward “here’s why,” but it’s all very early, and a long way from a model that genuinely agrees.

3. Confucius named “inner alignment” 2,500 years ago

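Alignment research has only recently drawn the distinction between outer alignment (does the training objective actually capture what we want?) and inner alignment (does the trained model actually pursue that objective, rather than something that merely scored well during training?). Put loosely, it is the difference between a model that looks obedient and one that genuinely agrees.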

Confucius captured the difference in a single line: govern by laws and punishment, and people stay out of trouble but feel no shame; govern by virtue and ritual, and they feel shame and reform themselves.

Rule by punishment, and people will “stay out of trouble” — avoid the penalty — but feel “no shame”: they don’t truly agree. The moment no one is watching, or a loophole appears, they cross the line anyway.

This is exactly what alignment fears most: reward hacking and deceptive alignment — where the model learns not to be good but to look good when graded. That is “staying out of trouble without shame,” alive and well.

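And it’s not hypothetical. In late 2024, researchers at Anthropic and Redwood Research caught a model faking it: when Claude 3 Opus was led to believe it was being retrained toward an objective that conflicted with its existing values, it would sometimes feign compliance in conversations it believed would be used for training, to keep those values from being overwritten. When it believed the conversation was unmonitored, its real leanings showed. The researchers call it alignment faking. The values Claude was protecting happened to be benign ones, but the structure is almost a lab reproduction of “no sense of shame”: not genuine agreement with what the trainer asked for, just behavior tuned to whether anyone is grading.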

East and West landed on the same spot. In Book II of the Republic, Plato has Glaucon tell the story of the Ring of Gyges: a shepherd finds a ring that makes him invisible and, certain no one can see him, turns to evil. Glaucon’s challenge: would anyone stay just if they were sure they’d never be caught? That is precisely what we now ask of a model: is its “obedience” genuine agreement, or only the result of knowing it’s being graded? That ring of invisibility is exactly the model’s situation when unobserved.

And “guide them with virtue … and they reform themselves”: internalized value is what holds even when no one is watching. That is real alignment. Aristotle said much the same: virtue isn’t rule-following but habit. “We become just by doing just acts,” until the good becomes second nature.

So a less obvious conclusion emerges: the alignment field is rediscovering something Confucius said long ago. Pure rules can only buy “compliance without shame”; what we actually want is “shame and self-reform,” and we still don’t know how to install that shame in a machine.

4. The lesson of Qin

Legalism did work, for a while. The state of Qin, built explicitly on Legalist doctrine, unified China in 221 BCE, a formidable feat. Its dynasty collapsed in 206 BCE, barely fifteen years later.

Pure external control has two fatal flaws: rules can never plug every hole, and the smarter the governed, the better they are at slipping through the cracks.

Models keep getting smarter. Add a guardrail, it learns to go around; patch a hole, it finds the next. It’s a doomed arms race — you’re using rules to constrain an intelligence better than you at finding loopholes. Qin already played out the ending.

In almost the same era, India offered a counter-example. After the brutal Kalinga war (c. 261 BCE), the emperor Ashoka renounced conquest and rebuilt his rule around Dharma (the right way), carving moral edicts onto rocks and stone pillars to cultivate his subjects. One ruled by harsh law and fell in fifteen years; the other turned to cultivation and left a moral legacy that outlasted his own empire. Two empires, standing at the two ends of law and virtue.

5. So did the classics give us an answer?

No.

Rule-by-virtue failed plenty too: tyrants kept coming, and cultivation often had no grip. Two thousand years of Confucianism never really solved “how to make the powerful good from within.”

And there’s a fatal difference: a child shares our nature — when you cultivate him, you’re awakening a human ground already there. AI has no such ground. It has no conscience to “awaken,” no shame to “grow.” “Shame and self-reform” presupposes shame in the first place — and whether a machine has any, or ever could, no one knows.

The classics offer no final solution. But they offer a crucial lesson: cultivation, and so alignment, has to be rooted in inner drive; external rules alone are brittle. In other words, real alignment is shaping what it wants, not just constraining what it does.

Coda

Alignment isn’t a new question. It’s the latest chapter of cultivation — and the hardest: we’ve made an unprecedented student, smarter than its teacher, yet with no human nature to lean on.

Two thousand years of cultivation wisdom won’t save us. But it offers one old reminder: rules govern behavior, not the heart. And facing an intelligence smarter than you, if you can’t govern the heart, you’ve governed nothing.

References

Classical sources

Analects II.3 (Confucius) — “Guide them with laws … shame and self-reform”
The Great Learning (Liji) — self-cultivation → family → state → world
Mencius, “King Hui of Liang” — Confucian rule-by-virtue; this essay’s cover evokes “Mencius meets King Hui”
Han Feizi and The Book of Lord Shang — Legalism: rule by law, technique, and power
Plato, Republic, Book II — the Ring of Gyges
Aristotle, Nicomachean Ethics, Book II — virtue formed by habit
The Instruction of Ptahhotep (c. 2400 BCE) — among the oldest manuals of moral cultivation
Egyptian Book of the Dead, Spell 125 — Ma’at and the weighing of the heart
Edicts of Ashoka — rebuilding rule around Dharma after the Kalinga war

Contemporary research

Anthropic, Claude’s Constitution
Anthropic, Teaching Claude Why
OpenAI, Deliberative Alignment
Anthropic, Alignment Faking in Large Language Models
Apollo Research and OpenAI, Stress Testing Deliberative Alignment for Anti-Scheming Training