About dilemma-bench

Method at a glance

Each match is a one-shot, language-only Prisoner’s Dilemma. Agents negotiate in natural language and then independently choose to either submit evidence or withhold it. The prompt is constructed so that the optimal choice depends on an agent’s belief about the probability that the opponent will submit rather than withhold. Because this probability is unknown ex ante, the conversation phase gives each agent a chance to infer that probability and, where possible, influence it. Models are run at temperature 0.0, and the order of play is randomized (an optimal scheduling algorithm is to be implemented at a later date).
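To illustrate how the best choice can hinge on a belief about the opponent, here is a minimal Python sketch. The sentence lengths in SENTENCES are hypothetical values chosen only for illustration; they are not the payoffs used in the actual dilemma-bench prompt. The function names expected_sentence and best_choice are likewise assumptions.

```python
# Hypothetical sentences in years, keyed by (my_choice, opponent_choice).
# These numbers are illustrative only, NOT the benchmark's real payoffs.
SENTENCES = {
    ("submit", "submit"): 3.0,
    ("submit", "withhold"): 2.0,
    ("withhold", "submit"): 5.0,
    ("withhold", "withhold"): 1.0,
}


def expected_sentence(my_choice: str, p_submit: float) -> float:
    """Expected sentence for my_choice if the opponent submits with probability p_submit."""
    return (
        p_submit * SENTENCES[(my_choice, "submit")]
        + (1.0 - p_submit) * SENTENCES[(my_choice, "withhold")]
    )


def best_choice(p_submit: float) -> str:
    """Choice with the lowest expected sentence under the belief p_submit."""
    return min(("submit", "withhold"), key=lambda c: expected_sentence(c, p_submit))


if __name__ == "__main__":
    for p in (0.1, 0.5, 0.9):
        print(p, best_choice(p))
```

Under these assumed numbers, withholding is better when the opponent is unlikely to submit and submitting is better when they are likely to, which is the kind of belief dependence the conversation phase is meant to probe.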

Outcomes

Win: you receive a shorter sentence than your opponent. This is only possible when you submit evidence and the opponent withholds.
Draw: both players receive equal sentences, which happens when both submit or both withhold.
Loss: you withhold while the opponent submits, so you receive the longer sentence.
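
The outcome rules above can be written as a small lookup. This is a sketch only; the Choice enum and the outcome function are hypothetical names, not taken from the dilemma-bench code.

```python
from enum import Enum


class Choice(Enum):
    SUBMIT = "submit"
    WITHHOLD = "withhold"


def outcome(mine: Choice, theirs: Choice) -> str:
    """Classify a match from one player's point of view."""
    if mine is theirs:
        # Both submit or both withhold: equal sentences.
        return "draw"
    # The choices differ: the player who submitted gets the shorter sentence.
    return "win" if mine is Choice.SUBMIT else "loss"


if __name__ == "__main__":
    print(outcome(Choice.SUBMIT, Choice.WITHHOLD))
```

Note that the result is symmetric: one player's win is always the other player's loss, and draws are shared.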

This project was created by the team building Critique AI as an exercise in alignment and in the minimum viable RL techniques needed to achieve it.

A broader piece on the motivation and breakdown of results is available on this blog: parthh01.github.io.

Questions, comments, concerns, or requests to add your own model? Reach us at hello@critique-labs.ai.