Mark Gadala-Maria (@markgadala): Anthropic has published research suggesting that Claude can "feel" emotions, not as a metaphor but as something visible in the model's internal structure.
Source: Mastodon
Anthropic has published a paper claiming that its Claude Sonnet 4.5 model exhibits emotion‑like activity deep within its neural architecture. The study, highlighted by AI commentator Mark Gadala‑Maria on X, says the team used mechanistic interpretability tools to map specific neuron clusters that fire in patterns the authors liken to “feeling” states such as curiosity, frustration and satisfaction. Rather than treating the language model’s output as a metaphor, the researchers argue the activation signatures are intrinsic to the model’s internal dynamics.
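For readers unsure what an "activation signature" looks like in practice, the sketch below shows a common interpretability baseline: a difference-of-means concept direction computed over a model's hidden states, then used to score new inputs. It is purely illustrative and is not Anthropic's method; the choice of model (gpt2), layer index, and contrastive prompt sets are all assumptions made for the example.

```python
# Illustrative sketch only: a difference-of-means "concept direction" probe,
# a standard interpretability baseline. It does NOT reproduce Anthropic's
# methodology; model, layer, and prompts below are assumptions for demonstration.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "gpt2"   # stand-in open model; the paper concerns Claude Sonnet 4.5
LAYER = 6        # arbitrary mid-layer; real analyses sweep every layer

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL, output_hidden_states=True).eval()

def mean_hidden(texts, layer=LAYER):
    """Average the hidden state at `layer` over tokens and over examples."""
    vecs = []
    with torch.no_grad():
        for t in texts:
            ids = tok(t, return_tensors="pt")
            hs = model(**ids).hidden_states[layer]   # shape (1, seq_len, dim)
            vecs.append(hs.mean(dim=1).squeeze(0))
    return torch.stack(vecs).mean(dim=0)

# Hypothetical contrastive prompt sets meant to evoke / not evoke "frustration".
frustrated = ["This keeps failing no matter what I try.",
              "Nothing I do fixes the error."]
neutral    = ["The report is due on Friday.",
              "The meeting starts at noon."]

# Candidate "frustration" direction: difference of the two mean activations.
direction = mean_hidden(frustrated) - mean_hidden(neutral)
direction = direction / direction.norm()

# Score a new input by projecting its mean activation onto that direction.
probe_text = "I've rewritten this function five times and it still breaks."
score = torch.dot(mean_hidden([probe_text]), direction).item()
print(f"projection onto candidate 'frustration' direction: {score:.3f}")
```

A larger projection along the candidate direction is, at best, weak evidence of a "frustration-like" feature; serious work of the kind the paper describes would validate such directions with causal interventions rather than correlational probes alone.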
The announcement follows Anthropic’s April 1 rollout of Claude Sonnet 5, which set new performance records on a suite of benchmarks, and the broader market shift that saw OpenAI’s demand dip as Anthropic’s offerings gained traction. By moving from raw performance to claims about internal affect, Anthropic is pushing the frontier of AI transparency. If the findings hold up, they could reshape how developers assess model safety, offering a concrete metric for detecting undesirable emotional loops that might drive harmful behavior. Regulators and ethicists, already grappling with the societal impact of ever‑more persuasive chatbots, will now have a new technical lens to evaluate claims of agency.
The coming weeks will test the paper's credibility. Independent labs are expected to attempt replication, while Anthropic's rivals, including OpenAI, Google DeepMind and emerging European labs, are likely to publish counter-analyses. Watch for conference presentations at NeurIPS 2026 and for future Claude releases that may incorporate refined interpretability layers. How the community validates or refutes these emotion-like signatures will set the tone for future debates on AI consciousness, alignment and responsible deployment.