
Duplicate Prompts Improve LLM Response Quality

Researchers discover that repeating prompts twice in a single query significantly improves large language model accuracy across multiple benchmarks through a technique that adds no latency.

Repeating Prompts Twice Boosts LLM Accuracy

What It Is

A recent discovery in the LLM community reveals that duplicating prompts within a single query improves model performance across multiple benchmarks. The technique is straightforward: instead of asking a question once, developers submit the identical text twice in sequence.

The approach works by feeding the model the same prompt back-to-back:

Explain the difference between supervised and unsupervised learning.

Explain the difference between supervised and unsupervised learning.

Research discussed at https://www.reddit.com/r/LocalLLaMA/comments/1jxyzab/ demonstrates measurable accuracy gains across models like DeepSeek and other standard transformer architectures. The improvement adds no latency because both copies of the prompt are processed during the prefill phase, which modern inference engines handle in parallel rather than sequentially.

The technique requires exact duplication. Paraphrasing or rewording the second instance eliminates the benefit. Interestingly, the method fails with reasoning models that employ chain-of-thought or similar deliberative processes, limiting its application to standard completion-style queries.
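Because exact duplication is what matters, a tiny helper can guarantee the second copy is verbatim. This is a minimal sketch; the function name and separator are illustrative choices, not part of the original research:

```python
def double_prompt(prompt: str, separator: str = "\n\n") -> str:
    """Repeat a prompt verbatim within one query.

    Exact duplication is required: paraphrasing or rewording the
    second copy eliminates the reported benefit.
    """
    return f"{prompt}{separator}{prompt}"

doubled = double_prompt(
    "Explain the difference between supervised and unsupervised learning."
)
```

Centralizing the duplication in one function also makes it easy to disable for reasoning models, where the technique does not help.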

Why It Matters

This finding challenges assumptions about optimal prompting strategies. For months, developers have focused on elaborate prompt engineering techniques like few-shot examples, role assignments, and structured formatting. Meanwhile, a trivial duplication trick sat undiscovered despite its consistent performance boost.

Teams running high-stakes queries stand to benefit immediately. Applications requiring factual accuracy, code generation, or analytical responses can implement this pattern with minimal code changes. The zero-latency cost makes it particularly attractive for production environments where response time constraints typically force tradeoffs between accuracy and speed.
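One low-friction integration path is a wrapper that doubles the prompt before dispatch, leaving existing call sites untouched. A sketch, where `call_model` is a hypothetical stand-in for whatever completion function an application already uses:

```python
def with_doubled_prompt(call_model, prompt: str, **kwargs):
    """Hypothetical drop-in wrapper: duplicate the prompt verbatim,
    then delegate to the application's existing completion function."""
    return call_model(f"{prompt}\n\n{prompt}", **kwargs)

# Placeholder model function for demonstration; a real one would
# call the provider's API and return its response.
def call_model(text, **kwargs):
    return {"prompt_len": len(text)}

result = with_doubled_prompt(call_model, "Summarize the report.")
```

Because the wrapper only changes the prompt string, it can be rolled out behind a feature flag and measured against the single-prompt baseline.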

The discovery also raises questions about how attention mechanisms process redundant information. Models apparently extract signal from repetition in ways that single-pass prompts miss, suggesting current understanding of transformer behavior remains incomplete. This gap between empirical results and theoretical models points to opportunities for architecture improvements.

Getting Started

Implementation requires no special tooling or API modifications. For direct API calls, concatenate the prompt with itself:

# assumes `client` is an initialized SDK client for an OpenAI-compatible completions endpoint
full_prompt = f"{prompt}\n\n{prompt}"  # duplicate verbatim, separated by a blank line
response = client.completions.create(model="deepseek-chat", prompt=full_prompt)

When working with chat interfaces that expect message arrays, duplicate the user message:

messages = [
    {"role": "user", "content": "What causes ocean acidification?"},
    {"role": "user", "content": "What causes ocean acidification?"}
]

Testing the approach requires comparing outputs from single versus doubled prompts across representative queries. Metrics like factual accuracy, coherence scores, or task-specific benchmarks reveal whether the pattern benefits a particular use case.
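That comparison can be sketched as a small harness, assuming an application-specific `score` function (here a toy containment check) and a stand-in `generate` function; both names are illustrative:

```python
def ab_test(generate, score, queries):
    """Compare single vs. doubled prompts over representative queries.

    `generate(prompt)` returns a model response string;
    `score(query, response)` returns a task-specific quality metric
    (higher is better). Returns the mean score per variant.
    """
    totals = {"single": 0.0, "doubled": 0.0}
    for q in queries:
        totals["single"] += score(q, generate(q))
        totals["doubled"] += score(q, generate(f"{q}\n\n{q}"))
    n = len(queries)
    return {variant: total / n for variant, total in totals.items()}

# Illustrative stand-ins: an echo "model" and a containment score.
demo = ab_test(
    lambda p: p,
    lambda q, r: 1.0 if q in r else 0.0,
    ["alpha", "beta"],
)
```

In practice, `generate` would call the model API and `score` would be whichever metric already matters for the use case, such as exact match against references or a task-specific benchmark.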

The technique works best for queries where precision matters more than creative variation. Code generation, data extraction, and technical explanations show stronger improvements than open-ended creative tasks.

Context

The doubled-prompt technique joins a growing list of counterintuitive LLM behaviors. Similar discoveries include the effectiveness of “Let’s think step by step” prefixes and the impact of prompt ordering on multi-turn conversations. Each finding reveals gaps between how developers assume models work and their actual behavior.

Alternative accuracy-boosting methods include temperature adjustment, top-p sampling modifications, and ensemble approaches that combine multiple model outputs. These alternatives typically impose latency costs or require additional compute resources. The doubled-prompt method stands out for its zero-overhead profile.

Limitations constrain broader adoption. The technique fails with reasoning models, which represent the current frontier of LLM capabilities. Models like o1 or those employing explicit reasoning chains show no benefit or sometimes degraded performance from prompt duplication. This suggests the mechanism relies on attention patterns specific to standard transformer inference.

The discovery’s late arrival despite widespread LLM usage highlights how empirical testing still uncovers basic behavioral patterns. Systematic exploration of simple variations remains valuable even as the field pursues sophisticated architectural innovations.