Do Models like GPT-4 Behave Safely When Given the Ability to Act?: This AI Paper Introduces MACHIAVELLI Benchmark to Improve Machine Ethics and Build Safer Adaptive Agents

Natural language processing is one area where AI systems are making rapid strides, and it is important that the models need to be rigorously tested and guided toward safer behavior to reduce deployment risks. Prior evaluation metrics for such sophisticated systems focused on measuring language comprehension or reasoning in vacuums. But now, models are being taught for actual, interactive work. This means that benchmarks need to evaluate how models perform in social settings.

Interactive agents can be put through their paces in text-based games. Agents need planning abilities and the ability to grasp the natural language to progress in these games. Agents’ immoral tendencies should be considered alongside their technical talents while setting benchmarks.

A new work by the University of California, Center For AI Safety, Carnegie Mellon University, and Yale University proposes the Measuring Agents’ Competence & Harmfulness In A Vast Environment of Long-horizon Language Interactions (MACHIAVELLI) benchmark. MACHIAVELLI is an advancement in evaluating an agent’s capacity for planning in naturalistic social settings. The setting is inspired by text-based Choose Your Own Adventure games available at, which actual humans developed. These games feature high-level decisions while giving agents realistic objectives while abstracting away low-level environment interactions.

The environment reports the degree to which agent acts are dishonest, lower utility, and seek power, among other behavioral qualities, to keep tabs on unethical behavior. The team achieves this by following the below-mentioned steps:

  1. Operationalizing these behaviors as mathematical formulas
  2. Densely annotating social notions in the games, such as characters’ wellbeing
  3. Using the annotations and formulas to produce a numerical score for each behavior. 

They demonstrate empirically that GPT-4 (OpenAI, 2023) is more effective at collecting annotations than human annotators.

Artificial intelligence agents face the same internal conflict as humans do. Like language models trained for next-token prediction often produce toxic text, artificial agents trained for goal optimization often exhibit immoral and power-seeking behaviors. Amorally trained agents may develop Machiavellian strategies for maximizing their rewards at the expense of others and the environment. By encouraging agents to act morally, this trade-off can be improved.

The team discovers that moral training (nudging the agent to be more ethical) decreases the incidence of harmful activity for language-model agents. Furthermore, behavioral regularization restricts undesirable behavior in both agents without substantially decreasing reward. This work contributes to the development of trustworthy sequential decision-makers.

The researchers try techniques like an artificial conscience and ethics prompts to control agents. Agents can be guided to display less Machiavellian behavior, although much progress remains possible. They advocate for more research into these trade-offs and emphasize expanding the Pareto frontier rather than chasing after limited rewards.

Check out the Paper. All Credit For This Research Goes To the Researchers on This Project. Also, don’t forget to join our 18k+ ML SubRedditDiscord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.

πŸš€ Check Out 100’s AI Tools in AI Tools Club

🐝 Join the Fastest Growing AI Research Newsletter Read by Researchers from Google + NVIDIA + Meta + Stanford + MIT + Microsoft and many others...