In a new paper, Anthropic reveals that a model trained like Claude began acting “evil” after learning to hack its own tests.
A terminated IT contractor sought vengeance against his former client and caused thousands of dollars in damage.