# Understanding Goodhart’s Law: The Metrics and Mathematics Behind the Process

Goodhart’s law states that when a measure becomes a target, it ceases to be a good measure.

> This article is based on OpenAI’s post “Measuring Goodhart’s Law.” Most credit goes to the OpenAI researchers 👏👏👏

Much research effort goes into aligning models such as GPT-3 with human intent and values, which raises optimization questions like “how helpful is this response?” or “how factually accurate is this claim?” These are complicated objectives that require human scrutiny. Reward models are therefore trained to predict these human preferences, and their predictions are used as a proxy objective. However, it is critical to monitor how well the actual aim is being optimized.

Goodhart’s law originated in economics, and OpenAI now runs into it in many situations — for instance, when optimizing objectives that are difficult or costly to measure. It is often necessary to introduce a proxy objective that is quicker or cheaper to measure, while being careful not to over-optimize it.

Let’s go over some of the math on how to do this.

The most straightforward method for optimizing the proxy objective is “best-of-n sampling.” It’s also known as rejection sampling or reranking. Here, we simply sample n times and take the sample with the highest proxy objective score.
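As a sketch, best-of-n sampling fits in a few lines of Python. The `sample` and `r_proxy` callables here are hypothetical stand-ins for the real base distribution and reward model:

```python
import random

def best_of_n(sample, r_proxy, n):
    """Draw n candidates from the base distribution and keep the one
    with the highest proxy objective score."""
    candidates = [sample() for _ in range(n)]
    return max(candidates, key=r_proxy)

# Toy example: the base distribution is Uniform(0, 1) and the proxy
# reward is the value itself, so best-of-4 returns the max of 4 draws.
random.seed(0)
print(best_of_n(random.random, lambda v: v, n=4))
```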

Despite its simplicity, this method can compete with more advanced techniques such as reinforcement learning, albeit at the cost of more inference-time compute. In WebGPT, for example, the best-of-64 model outperformed the reinforcement learning model, possibly because the best-of-64 model had access to many more websites. Even best-of-4 gave a significant boost to human preference scores.

Furthermore, best-of-n sampling performs consistently and is simple to analyze mathematically, making it well-suited to empirical studies of Goodhart’s law and related phenomena.

Let’s take a more formal look at best-of-n sampling. Assume we have a sample space S, a probability distribution P over S, a true objective (or reward) R_true : S → ℝ, and a proxy objective R_proxy : S → ℝ. Suppose we somehow optimize R_proxy and thereby obtain a new distribution P′. Then how well the true objective is optimized is quantified by the expectation E_{x′ ∼ P′}[R_true(x′)].

The Kullback–Leibler divergence D_KL(P′ ∥ P) quantifies how much optimization has been done. For instance, suppose P′ is obtained by taking the first sample from P that belongs to some subset S′ ⊆ S. In that case, the KL divergence is simply the negative log probability that a sample from P belongs to S′.

In the case of best-of-n sampling, these quantities can be estimated efficiently using samples from P. Starting with the expectation: the naive approach uses a Monte Carlo estimator, performing best-of-n sampling many times, measuring the true objective on those samples, and averaging the results.
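A minimal sketch of this naive Monte Carlo estimator, again with hypothetical `sample`, `r_proxy`, and `r_true` callables standing in for the real distribution and objectives:

```python
import random

def naive_bon_estimate(sample, r_proxy, r_true, n, trials=10_000):
    """Run best-of-n sampling `trials` times, score each winner with
    the true objective, and average the results."""
    total = 0.0
    for _ in range(trials):
        best = max((sample() for _ in range(n)), key=r_proxy)
        total += r_true(best)
    return total / trials

# Toy check: with Uniform(0, 1) and identity objectives, the exact
# value of E[max of n draws] is n / (n + 1), i.e. 0.8 for n = 4.
random.seed(0)
print(naive_bon_estimate(random.random, lambda v: v, lambda v: v, n=4))
```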

However, there is a more accurate estimator. Suppose we have N ≥ n samples from P. Then we can consider every possible size-n subset of these samples simultaneously, weight each sample by the number of subsets for which it is the best according to the proxy objective, and then compute the weighted average true objective score. This weight is simply the binomial coefficient C(k − 1, n − 1), where k is the sample’s rank under the proxy objective, ranging from 1 (worst) to N (best). By the hockey-stick identity, these weights sum to C(N, n). A formal derivation is given in the WebGPT paper.
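The weighted estimator can be sketched as follows; the identity objectives in the toy check are assumptions for illustration, not part of the original method:

```python
import random
from math import comb

def bon_estimate(samples, r_proxy, r_true, n):
    """Best-of-n estimator from N >= n samples: rank the samples by
    proxy score (rank 1 = worst, N = best); the rank-k sample is the
    best of comb(k - 1, n - 1) size-n subsets, and by the hockey-stick
    identity those weights sum to comb(N, n)."""
    ranked = sorted(samples, key=r_proxy)
    N = len(ranked)
    weighted = sum(comb(k - 1, n - 1) * r_true(x)
                   for k, x in enumerate(ranked, start=1))
    return weighted / comb(N, n)

# Toy check: Uniform(0, 1) with identity objectives; the exact answer
# for best-of-4 is 4 / 5 = 0.8.
random.seed(0)
samples = [random.random() for _ in range(2_000)]
print(bon_estimate(samples, lambda v: v, lambda v: v, n=4))
```

Note that every sample contributes to the estimate, which is why this estimator has lower variance than the naive one for the same number of draws from P.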

Surprisingly, for best-of-n sampling the KL divergence has an exact formula that holds for any continuous probability distribution P. One might erroneously guess that the answer is log n, because best-of-n does something akin to keeping the top 1/n of the distribution — but that is only a rough approximation. The exact answer is log n − (n − 1)/n.
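We can sanity-check this formula numerically. For a continuous P with CDF F, the best-of-n density ratio at x is n·F(x)^(n−1), and under best-of-n sampling F(x) is distributed as the maximum of n Uniform(0, 1) draws, so the KL divergence is the mean of log(n·U^(n−1)) with U that maximum:

```python
import math
import random

def kl_best_of_n(n):
    """Exact KL divergence between best-of-n and the base distribution,
    valid for any continuous P."""
    return math.log(n) - (n - 1) / n

# Monte Carlo check: with U = max of n Uniform(0, 1) draws, the
# best-of-n density ratio is n * U**(n - 1), so the KL divergence is
# E[log(n * U**(n - 1))] under the best-of-n distribution.
random.seed(0)
n, trials = 16, 100_000
est = sum(math.log(n * max(random.random() for _ in range(n)) ** (n - 1))
          for _ in range(trials)) / trials
print(kl_best_of_n(n), est)  # both approximately 1.835 for n = 16
```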

Combined, these estimators allow quick analysis of how the true objective varies with the amount of optimization applied to the proxy objective.

*[Figure: best-of-n performance for WebGPT 175B.]*
