Stanford Researcher develops a simple prompting strategy that enables open-source LLMs with 30x fewer parameters to exceed the few-shot performance of GPT3-175B

By improving the in-context learning quality of smaller and open source models, more researchers and organizations can study and apply the technology. One exciting set of applications is in personal-private machine learning.  

In contrast to designing perfect prompts via brute force guess and check, “ Ask Me Anything” (AMA) provides principled approaches and insights into prompt design — the work shows how studying the pertaining corpus and the LLM training procedure may provide effective signals for how to format prompts, and the work aggregates the predictions of multiple prompts using tools from weak supervision. Methods like AMA can help provide a starting point for LLM-users who are dealing with the massive search space of natural language prompts.

Prompt design towards a “perfect prompt” for a task involves significant effort and is a hard process…and often just simply frustrating.

This paper describes a new approach AMA for prompting which leads to significant higher performance for LLMs : This strategy enables the open-source GPT-J-6B model to match and exceed the performance of few-shot GPT3-175B on 15 of 20 popular benchmarks.

The AMA strategy prompt combines multiple imperfect prompts with weak supervision to create predictions for the best inputs,  as described below.


The researcher really innovated and followed this 3 step process to craft this approach:

  1. Identifying the properties for prompts that lead to highest effectiveness.

The research found that question-answering (QA) prompts which typically result in open-ended generation (“Who went to the park?”) had the highest performance.

They then created a two-step prompting pipeline: (1) generating questions based on the input and (2) prompting the LLM to answer the generated questions.

Finally, they generated and aggregated over multiple prompt-outputs for each input.

  1. Developing a strategy to scalably format task inputs according to the most efficient prompt property.

Scaling the step 1 above is not trivial. To do so, the researcher applied prompt chaining. Specifically, the researcher recursively applied the LLM itself using a chain of functional prompts, referred to as prompt()-chains . These prompts apply a task-agnostic operation to all inputs in the tasks, without any example-level customization. 

AMA constructs different prompt()-chains where each unique prompt()-chain is a different view of the task and can emphasize different aspects. The chains are also diversified via two key levers : the in-context demonstrations and the style of prompt questions. See below for an example:

  1. Prompt aggregation.

For the first time, yes ! For the first time, weak supervision was used to aggregate prompts. Prompt aggregation is not new but weak supervision applied to it is.

Weak supervision.. quick reminder: learning high-quality models from weaker sources of signal without labeled data.

This was particularly powerful given the varied accuracies and dependencies among prompt()-chains and the fact that no label data was required.


Impressive results as per the table below. These benchmark results compare the open-source GPT-J-6B and few-shot (k ∈ [32..70]) GPT3175B. 

The number of in-context examples is in parentheses in the table below.


The open-source 6B parameter model exceeds the average few-shot performance of the GPT3-175B model on 15 of 20 benchmarks.

Benefits of AMA:

  • Using imperfect prompts and enabling the use of small open-source LLMs.
  • Improve the prompting performance of off-the-shelf language models with no fine-tuning.

Check out the Paper and Github. All Credit For This Research Goes To Simran Arora, Stanford researcher, and her collaborators Avanika, Mayee, and Laurel at Hazy Research.

✅ [Featured Tool] Check out Taipy Enterprise Edition