Sampling Multiple Answers Improves LLM Reasoning

A common way to get more reliable answers from a large language model is to ask the same question several times and then look for agreement among the results. This idea was formalized in the paper “Self-Consistency Improves Chain of Thought Reasoning in Language Models” by Wang and colleagues, presented at ICLR 2023 and available at https://arxiv.org/abs/2203.11171.

How Self-Consistency Works

Chain-of-thought prompting asks a model to write out its intermediate reasoning steps before giving a final answer. The standard approach uses greedy decoding, which produces a single reasoning path and a single answer. Self-consistency replaces that single path with a decoding strategy built around two steps.

First, instead of generating just one chain of thought, the method samples a diverse set of reasoning paths for the same problem. Second, it selects the most consistent answer by marginalizing out the sampled reasoning paths. In practice this means aggregating the final answers from all the sampled paths and choosing the one that appears most often.
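The two steps can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: `sample_answer` is a hypothetical stand-in for one sampled model call that returns only the final answer of one reasoning path, and the canned answers below are made up for the demo.

```python
from collections import Counter
from typing import Callable, List


def majority_vote(answers: List[str]) -> str:
    """Return the final answer that appears most often among the samples."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # most_common(1) gives [(answer, count)] for the top answer.
    return Counter(answers).most_common(1)[0][0]


def self_consistency(
    sample_answer: Callable[[str], str], question: str, num_paths: int = 5
) -> str:
    """Sample several reasoning paths and keep the most frequent final answer.

    sample_answer is a hypothetical callable (an assumption of this sketch)
    that performs one sampled generation and returns its final answer.
    """
    answers = [sample_answer(question) for _ in range(num_paths)]
    return majority_vote(answers)


# Toy stand-in for a model: replays fixed answers instead of sampling.
canned = iter(["18", "26", "18", "18", "20"])
result = self_consistency(lambda q: next(canned), "toy question", num_paths=5)
print(result)  # prints 18
```

In a real setup the callable would wrap a model call with sampling enabled, so that repeated calls on the same prompt can produce different reasoning paths.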

The reasoning behind the method is that a difficult problem usually has more than one valid line of thinking, and several of those lines tend to arrive at the same correct answer. By looking across many sampled paths rather than trusting a single one, the method treats the answer reached most frequently as the most reliable.

Reported Accuracy Gains

The paper reports improvements across a range of arithmetic and commonsense reasoning benchmarks when self-consistency is added on top of chain-of-thought prompting. The figures are absolute gains in accuracy. On GSM8K, a dataset of grade-school math word problems, accuracy improved by 17.9 percent. The SVAMP arithmetic benchmark improved by 11.0 percent, and AQuA improved by 12.2 percent.

The gains were not limited to arithmetic tasks. On StrategyQA, a commonsense reasoning benchmark, accuracy improved by 6.4 percent, and on ARC-challenge it improved by 3.9 percent. The authors describe these results as boosting chain-of-thought performance by a striking margin.

The work involved large language models including PaLM and UL2, according to the paper’s revision history.

Why the Approach Is Useful

Self-consistency is described as a decoding strategy rather than a change to the model itself. It does not require training, fine-tuning, or additional labeled data. The same model and the same prompt are reused, and the only change is sampling several reasoning paths and taking the most common final answer.

That simplicity is part of the appeal. The method works on top of existing chain-of-thought prompting, so it can be applied to tasks where a model already produces step-by-step reasoning. The tradeoff is that generating multiple reasoning paths costs more computation than a single greedy pass, since the model has to produce several full responses for one question.
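The cost tradeoff is simple arithmetic. The sketch below assumes every sampled path is roughly as long as the greedy one; the path count of 10 and the length of 200 tokens are illustrative values, not figures from the paper.

```python
def total_generated_tokens(num_paths: int, tokens_per_path: int) -> int:
    """Rough count of output tokens when every path is a full response."""
    return num_paths * tokens_per_path


# Illustrative numbers only (assumptions, not measurements).
greedy_tokens = total_generated_tokens(num_paths=1, tokens_per_path=200)
sampled_tokens = total_generated_tokens(num_paths=10, tokens_per_path=200)

print(sampled_tokens // greedy_tokens)  # prints 10
```

Under that assumption, generation cost grows linearly with the number of sampled paths.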

The benchmarks in the paper focus on problems with a discrete, checkable final answer, such as a number or a multiple-choice selection. That structure is what allows the answers from different reasoning paths to be compared and counted, making the consistency check meaningful.
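Counting only works if final answers are reduced to a comparable form first. The sketch below assumes each reasoning path ends with a phrase like "the answer is 18"; that phrasing, the regular expression, and the sample paths are assumptions made for illustration, not a format taken from the paper.

```python
import re
from collections import Counter
from typing import Optional

# Assumed answer format: "... the answer is <number>" (hypothetical convention).
ANSWER_PATTERN = re.compile(r"answer is\s*\$?(-?[\d,]*\.?\d+)", re.IGNORECASE)


def extract_final_answer(path: str) -> Optional[str]:
    """Pull the last stated numeric answer out of one reasoning path."""
    matches = ANSWER_PATTERN.findall(path)
    if not matches:
        return None  # no parseable answer; this path cannot be counted
    value = float(matches[-1].replace(",", ""))
    # Normalize so that "18" and "18.00" count as the same answer.
    return str(int(value)) if value.is_integer() else str(value)


paths = [
    "She sells 9 eggs at $2 each. The answer is 18.",
    "9 * 2 = 18, so the answer is $18.00",
    "I think the answer is 26.",
    "No clear conclusion.",
]
answers = [a for a in (extract_final_answer(p) for p in paths) if a is not None]
print(Counter(answers).most_common(1)[0][0])  # prints 18
```

Without the normalization step, "18" and "18.00" would be counted as two different answers and could split the vote.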
