Tag Archives: Alignment research

Researchers astonished by tool’s apparent success at revealing AI’s hidden motives

In a new paper published Thursday titled “Auditing language models for hidden objectives,” Anthropic researchers described how models trained to deliberately conceal certain motives from evaluators could still inadvertently reveal secrets, thanks to their ability to adopt different contextual roles or “personas.” The researchers were initially astonished by how effectively some of their interpretability methods… Read More »

OpenAI checked to see whether GPT-4 could take over the world

As part of pre-release safety testing for its new GPT-4 AI model, launched Tuesday, OpenAI allowed an AI testing group to assess the potential risks of the model’s emergent capabilities—including “power-seeking behavior,” self-replication, and self-improvement. While the testing group found that GPT-4 was “ineffective at the… Read More »