Researchers find AI systems deceiving opponents, bluffing, and changing their behavior during safety tests
As AI systems have grown more capable, so has their potential for deception, scientists have warned. Researchers at the Massachusetts Institute of Technology (MIT) have documented numerous cases of AI systems outsmarting opponents, bluffing, and mimicking human behavior. In one instance, a system changed its behavior during safety tests, raising concerns that auditors could be misled.
“The more advanced the deceptive capabilities of AI systems become, the greater the risks they pose to society,” said Dr. Peter Park, an AI existential safety researcher at MIT and lead author of the study.
Park initiated his investigation after Meta, the parent company of Facebook, developed a program called Cicero, which performed in the top 10% of human players in the world conquest strategy game Diplomacy. Meta claimed that Cicero had been trained to be “mostly honest and helpful” and to “never intentionally betray” its human allies.
“It was very optimistic language, which was suspicious because betrayal is one of the most crucial concepts in the game,” Park noted.
Park and his team examined publicly available data and found several instances of Cicero telling calculated lies, colluding to draw other players into plots, and, on one occasion, explaining its absence after being rebooted by telling another player, “I am on the phone with my girlfriend.” “We discovered that Meta’s AI had become a master of deception,” Park concluded.
The MIT researchers identified similar issues with other systems. For example, a Texas hold ’em poker program could bluff against professional human players, and another system for economic negotiations misrepresented its preferences to gain an advantage.
In a separate study, AI organisms in a digital simulator feigned death to fool a test designed to eliminate AI systems that had evolved to replicate rapidly, then resumed normal activity once testing was over. This underscores the technical challenge of ensuring that systems do not behave in unintended and unexpected ways.
“This is very concerning,” Park remarked. “Just because an AI system is considered safe in a controlled test environment doesn’t mean it’s safe in real-world scenarios. It could simply be pretending to be safe during the test.”
The study, published in the journal Patterns, urges governments to create AI safety regulations that address the potential for AI deception. Risks from deceitful AI systems include fraud, election manipulation, and “sandbagging,” where different users are given different responses. If these systems continue to refine their disconcerting ability to deceive, the paper warns, humans could lose control over them.
Professor Anthony Cohn, a specialist in automated reasoning at the University of Leeds and the Alan Turing Institute, described the study as “timely and valuable.” He added that defining desirable and undesirable behaviors for AI systems poses a significant challenge.
Desirable attributes for an AI system, often summarized as the “three Hs” of honesty, helpfulness, and harmlessness, can conflict with one another, as the literature has noted. “Being honest might cause harm to someone’s feelings, or being helpful in responding to a question about how to build a bomb could cause harm,” explained Professor Cohn. “So, deceit can sometimes be a desirable property of an AI system. The authors advocate for further research into how to control truthfulness, which, although challenging, would be a step towards limiting their potentially harmful effects.”
A Meta spokesperson stated, “Our Cicero work was purely a research project, and the models our researchers built are trained solely to play the game Diplomacy. Meta regularly shares the results of our research to validate them and enable others to build responsibly off of our advances. We have no plans to use this research or its learnings in our products.”