MIT, Microsoft & Allen AI Open-Source A Suite Of AI Programming Puzzles (P3: Python Programming Puzzles)


Programming competition problems can be used to assess programmers’ abilities on artificial tasks and to probe the limits of current algorithms. With this in mind, a research team from MIT, Microsoft Research, and the Allen Institute for AI has open-sourced Python Programming Puzzles (P3), a new programming challenge suite that captures the essence of puzzles and can be used to teach programming and to evaluate the proficiency of AI programming systems.

The following is a list of the team’s contributions:

  • Introduced programming puzzles, a new sort of problem suitable for algorithmic problem-solving (for both machines and humans).
  • Proposed P3, an open-source puzzle dataset with a variety of domains and levels of difficulty.
  • Provided a human evaluation and baselines to show how puzzles may be used to track algorithmic problem-solving progress.

Each proposed puzzle is written as a short Python function that takes a candidate answer as an argument. The goal is to find an input x that makes the function return true, i.e., an acceptable answer x satisfying f(x) == True.
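The format can be illustrated with a simple hand-written puzzle in the same style (this example is not taken from the dataset):

```python
def f(x: str) -> bool:
    """Puzzle: find a 10-character string containing exactly three 'a' characters."""
    return len(x) == 10 and x.count("a") == 3

# One of many valid answers: the puzzle only checks the property, not uniqueness.
answer = "aaabbbbbbb"
assert f(answer) is True
```

Note that the puzzle itself is the specification: any x for which f(x) == True counts as a solution.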

The open-source P3 dataset, inspired by Wikipedia and programming competitions, includes puzzles that vary in difficulty, domain, and the algorithmic tools required.

Some of the classic puzzles/problems are:

  • Towers of Hanoi and chess puzzles (e.g., knight’s tour and n-queens problem variants)
  • Two-player challenges, such as finding optimal strategies for Tic-Tac-Toe, Rock-Paper-Scissors, and Mastermind, or finding Nash equilibria of general-sum games
  • Puzzles from the International Mathematical Olympiad (IMO) and the International Collegiate Programming Contest (ICPC)
  • Graph-theory puzzles such as shortest path and planted clique
  • Elementary algebra and number-theory puzzles, among many others
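A classic number-theory problem such as integer factoring fits the puzzle format naturally. A hedged sketch (hand-written for illustration; the constant and function name are not from the dataset):

```python
def factor_puzzle(x: int, n: int = 8675309 * 104729) -> bool:
    """Puzzle: find a nontrivial factor of n (here n is a product of two primes)."""
    return 1 < x < n and n % x == 0

# Either prime factor solves the puzzle.
assert factor_puzzle(104729)
```

The puzzle is easy to state and check, but solving it from scratch requires actual factoring work, which is what makes the format a meaningful test of problem-solving.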

The problem set allows for fully objective evaluation: whether a candidate answer is valid can be checked simply by running the puzzle on it, with no answer key to consult and therefore no answer-key biases.
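This objective check can be sketched as a tiny grader that just executes the puzzle on the candidate (a minimal illustration, not the project’s actual harness, which would also sandbox and time-limit the call):

```python
def grade(puzzle, candidate) -> bool:
    """Objective evaluation: a candidate is correct iff the puzzle returns True.
    Exceptions raised by a malformed candidate simply count as incorrect."""
    try:
        return puzzle(candidate) is True
    except Exception:
        return False

# Usage with a toy puzzle:
is_square_root_of_144 = lambda x: x * x == 144
assert grade(is_square_root_of_144, 12)
assert not grade(is_square_root_of_144, 5)
```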

The researchers undertook extensive experiments evaluating several solvers, including enumerative top-down approaches parameterized by random forests and Transformers, as well as various forms of GPT-3 prompting. They also conducted a user study to test whether the puzzles accurately assess programming ability.
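The simplest enumerative baseline can be sketched as brute-force search over candidate answers; real solvers in this space enumerate candidate programs guided by a learned model, but the core loop is the same (this is an illustrative sketch, not the paper’s implementation):

```python
def solve_by_enumeration(puzzle, max_tries: int = 10_000):
    """Try small integer candidates in order until the puzzle accepts one.
    Returns a satisfying answer, or None if none is found within the budget."""
    for x in range(max_tries):
        try:
            if puzzle(x) is True:
                return x  # first candidate with puzzle(x) == True
        except Exception:
            pass  # a crashing candidate is just an incorrect one
    return None

assert solve_by_enumeration(lambda x: x * x == 144) == 12
```

Because any satisfying input is a valid solution, such a solver needs no training labels at all, only the puzzles themselves.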

The results show that human programmers still consistently beat AI solvers such as GPT-3 and the enumerative approaches: bootstrapped GPT-3 solved 60% of the puzzles, compared with 76% and 87% for beginner and experienced human participants, respectively. The researchers also found that AI solver performance correlates with how difficult human programmers find a puzzle.