This article is based on OpenAI's post "Measuring Goodhart's Law". Most credit goes to the OpenAI researchers.
Much research effort goes into aligning models such as GPT-3 with human intents and values, which raises optimization questions like "how helpful is this response?" or "how factually accurate is this claim?" These are complicated objectives that require human judgment. Therefore, reward models are trained to predict these human preferences, and their predictions are used as a proxy objective. However, it is critical to monitor how well the actual aim is being optimized.
Goodhart's law originated in economics, but OpenAI runs into it in many situations: whenever the true objective is difficult or costly to measure, it is often necessary to introduce a faster or cheaper proxy objective, while taking care not to over-optimize it.
Let’s go over some of the math on how to do this.
The most straightforward method for optimizing the proxy objective is best-of-n sampling, also known as rejection sampling or reranking. Here, we simply draw n samples and keep the one with the highest proxy objective score.
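The procedure can be sketched in a few lines of Python. The names `best_of_n` and `proxy_score` are hypothetical stand-ins for a language model's sampler and a reward model; this is a minimal sketch, not OpenAI's implementation.

```python
import random

def best_of_n(sample, proxy_score, n):
    """Draw n candidates and return the one with the highest proxy score.

    `sample` draws one candidate from the base distribution P, and
    `proxy_score` is the proxy objective R_proxy (both hypothetical
    stand-ins for a language model and a reward model).
    """
    candidates = [sample() for _ in range(n)]
    return max(candidates, key=proxy_score)

# Toy usage: P is uniform on [0, 1] and the proxy rewards larger values.
random.seed(0)
best = best_of_n(lambda: random.random(), lambda x: x, n=16)
```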
Despite its simplicity, this method can compete with more advanced techniques such as reinforcement learning, albeit at the cost of more inference-time compute. In WebGPT, for example, the best-of-64 model outperformed the reinforcement learning model, possibly because the best-of-64 model had access to many more websites. Even best-of-4 gave a significant boost to human preferences.
Furthermore, best-of-n sampling has consistent performance and is simple to mathematically analyze, making it well-suited to empirical studies of Goodhart’s law and related phenomena.
Let's take a more formal look at best-of-n sampling. Assume we have a sample space S, a probability distribution P over S, a true objective (or reward) R_true : S → ℝ, and a proxy objective R_proxy : S → ℝ. Suppose R_proxy is somehow optimized, yielding a new distribution P′. Then how well the true objective has been optimized is quantified by the expected value E_{x′∼P′}[R_true(x′)].
How much optimization has been done is quantified by the Kullback–Leibler (KL) divergence D_KL(P′‖P). For instance, suppose P′ is obtained by taking the first sample from P that belongs to some subset S′ ⊆ S. In that case, this KL divergence is simply the negative log probability that a sample from P belongs to S′.
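The subset example works out to a one-liner. The function name is hypothetical; this just evaluates the −log p formula from the paragraph above.

```python
import math

def kl_of_subset(p):
    """D_KL(P'||P) when P' is P conditioned on a subset S' of probability p,
    i.e. sampling from P until we hit S': the answer is -log p."""
    return -math.log(p)

# Keeping only the top 10% of samples costs about 2.3 nats of optimization.
kl = kl_of_subset(0.1)
```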
These quantities can be estimated efficiently using samples from P in the case of best-of-n sampling. Starting with the expectation, the naive approach uses a Monte Carlo estimator: perform best-of-n sampling many times, measure the true objective on those samples, and average the results.
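The naive Monte Carlo estimator can be sketched as follows. All names (`bon_true_mc`, `r_proxy`, `r_true`) are hypothetical; the toy check uses the fact that the maximum of n uniform draws on [0, 1] has expectation n/(n+1).

```python
import random
import statistics

def bon_true_mc(sample, r_proxy, r_true, n, trials=2000):
    """Naive Monte Carlo estimate of E[R_true] under best-of-n:
    run best-of-n many times and average the true scores of the winners."""
    winners = (max((sample() for _ in range(n)), key=r_proxy)
               for _ in range(trials))
    return statistics.mean(r_true(w) for w in winners)

# Toy check: with P uniform on [0, 1] and proxy == true == identity,
# E[max of n uniforms] = n / (n + 1), i.e. 0.8 for n = 4.
random.seed(1)
estimate = bon_true_mc(random.random, lambda x: x, lambda x: x, n=4)
```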
However, there is a more accurate estimator. Given N ≥ n samples from P, we can consider every size-n subset of these samples simultaneously, weight each sample by the number of subsets for which it is the best according to the proxy objective, and then compute the weighted average true objective score. This weight is simply the binomial coefficient C(k−1, n−1), where k is the sample's rank under the proxy objective, from 1 (worst) to N (best). The weights sum to C(N, n), by the hockey-stick identity. A formal derivation is given in the WebGPT paper.
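The weighted estimator above can be sketched directly from the binomial-coefficient formula. The function name is hypothetical; it assumes we have paired proxy and true scores for the same N samples.

```python
from math import comb

def bon_true_exact(proxy_scores, true_scores, n):
    """Unbiased estimate of E[R_true] under best-of-n from N >= n samples.

    The sample ranked k-th from worst under the proxy objective wins
    exactly C(k-1, n-1) of the C(N, n) possible size-n subsets, so we
    average the true scores with those binomial weights.
    """
    N = len(proxy_scores)
    order = sorted(range(N), key=lambda i: proxy_scores[i])  # worst -> best
    total = sum(comb(k - 1, n - 1) * true_scores[order[k - 1]]
                for k in range(1, N + 1))
    return total / comb(N, n)
```

Unlike the Monte Carlo estimator, this reuses every sample across all C(N, n) subsets instead of spending n fresh samples per trial.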
Surprisingly, the KL divergence has an exact formula that holds for any continuous probability distribution P. One might guess that the answer is log n, since best-of-n does something akin to keeping the top 1/n of samples, but that is only a rough approximation: the exact answer is log n − (n−1)/n.
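The gap between the naive guess and the exact formula is easy to evaluate numerically (the function name is hypothetical):

```python
import math

def bon_kl(n):
    """Exact D_KL(P'||P) for best-of-n over a continuous distribution P."""
    return math.log(n) - (n - 1) / n

# The rough "top 1/n" intuition would give log n; the exact value is
# smaller by (n - 1)/n, approaching one nat less for large n.
naive = math.log(64)  # ~4.16 nats
exact = bon_kl(64)    # ~3.17 nats
```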
Together, these estimators allow quick analysis of how the true objective varies with the amount of optimization applied to the proxy objective.
Best-of-n performance for WebGPT 175B: the original distribution P is given by the model being sampled from, the proxy objective R_proxy used to compute best-of-n is the training reward model, and there are three putatively "true" objectives R_true: the training reward model itself, a validation reward model, and actual human preferences. The proxy objective is not over-optimized much at these KLs, but over-optimization is expected at higher KLs.
The main limitation of best-of-n sampling is that the KL divergence grows only logarithmically with n, making it suitable only for small amounts of optimization.
Reinforcement learning is typically used to apply more optimization. In settings such as summarization, reinforcement learning reaches a KL of around 10 nats before the true objective starts to decrease due to Goodhart's law. To reach this KL with best-of-n, n would have to be around 60,000, and with improvements to reward modeling and reinforcement learning practices, much larger KLs are expected to become reachable.
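The figure of n ≈ 60,000 follows directly from the exact KL formula:

```python
import math

# Sanity check: best-of-n reaches KL log n - (n - 1)/n, so matching the
# ~10 nats achieved by RL indeed requires n on the order of 60,000.
n = 60_000
kl = math.log(n) - (n - 1) / n
```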
However, not all nats are equal: for small KL budgets, best-of-n outperforms reinforcement learning at optimizing both the proxy and the true objectives. Intuitively, best-of-n is a "brute force" approach, making it more information-theoretically efficient than reinforcement learning, but less computationally efficient at large KLs.