Programming competition problems can be used to assess programmers’ problem-solving abilities and to test the boundaries of current AI algorithms. To that end, a research team from MIT, Microsoft Research, and the Allen Institute for AI has open-sourced Python Programming Puzzles (P3), a new programming challenge suite that captures the essence of puzzles and can be used to teach and evaluate AI programming proficiency.
The following is a list of the team’s contributions:
- Introduced programming puzzles, a new sort of problem suitable for algorithmic problem-solving (for both machines and humans).
- Proposed P3, an open-source puzzle dataset with a variety of domains and levels of difficulty.
- Provided a human evaluation and baselines to show how puzzles may be used to track algorithmic problem-solving progress.
Each proposed puzzle is written as a Python function that takes a candidate answer as an argument. The goal is to find an input x that makes the function return True, that is, an acceptable answer x satisfying f(x) == True. In other words, solving a puzzle means finding a value for which the function returns True.
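As a minimal sketch of this format (a made-up puzzle, not one taken from the dataset), a puzzle is just a predicate, and any argument that makes it return True counts as a solution:

```python
# A hypothetical puzzle in P3 style: the function f defines the problem,
# and any x with f(x) == True is a valid answer.
def f(x: str) -> bool:
    """Find a string that, repeated three times, yields 'abcabcabc'."""
    return x * 3 == "abcabcabc"

# Verifying a candidate answer is simply a function call:
print(f("abc"))  # True
print(f("ab"))   # False
```

Because the puzzle itself is executable Python, no separate answer key or judge is needed to check a solution.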
The open-source P3 dataset, which has been inspired by Wikipedia and programming competitions, includes a variety of puzzles in terms of difficulty level, domain, and algorithmic tools.
Some of the classic puzzles/problems are:
- Towers of Hanoi and chess puzzles (e.g., knight’s tour and n-queens problem variants)
- Two-player challenges, such as finding optimal strategies for Tic-Tac-Toe, Rock-Paper-Scissors, and Mastermind, or finding the Nash equilibria of general-sum games
- Puzzles from the IMO (International Mathematical Olympiad) and the ICPC (International Collegiate Programming Contest)
- Graph-theory algorithmic puzzles, such as shortest path or planted clique
- Elementary algebra and number-theory algorithmic puzzles, and many more
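A classic number-theory problem can be cast into the same puzzle form. The following is an illustrative sketch (a factoring puzzle invented for this article, not a verbatim entry from P3):

```python
# A hypothetical number-theory puzzle: find a nontrivial factor pair of 2021.
def f(ans) -> bool:
    a, b = ans
    return 1 < a < 2021 and a * b == 2021

print(f((43, 47)))  # True, since 43 * 47 == 2021
```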
The problem set allows for objective evaluation. Because a candidate answer can be checked simply by evaluating the puzzle function, no answer key is needed, and the problems carry no answer-key biases.
The researchers conducted extensive experiments examining several parametric enumerative top-down solvers based on random forests, transformers, and various forms of GPT-3 prompts. They also ran a user study to see whether the puzzles could accurately assess programming ability.
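To make the idea of an enumerative solver concrete, here is a minimal brute-force sketch (an assumption for illustration, far simpler than the parametric top-down solvers evaluated in the paper): it simply tries candidate answers until one satisfies the puzzle.

```python
def solve_int_puzzle(f, limit=10_000):
    """Enumerate small non-negative integers and return the first x
    with f(x) == True, or None if none is found within the limit."""
    for x in range(limit):
        try:
            if f(x) is True:
                return x
        except Exception:
            pass  # a candidate that raises simply fails the check
    return None

# Example puzzle: find an integer whose square is 1369.
print(solve_int_puzzle(lambda x: x * x == 1369))  # 37
```

Real solvers replace this blind enumeration with learned models that rank or generate promising candidates, but the verification step, calling f on a candidate, is identical.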
The experimental results show that human programmers consistently beat AI solvers such as GPT-3 and the enumerative approaches. For example, GPT-3 with bootstrapping solved 60% of the puzzles, compared with 76% for novice and 87% for experienced human participants. The researchers also found that AI solver performance correlates with how difficult human programmers find a puzzle.