chatgpt by Promptsicle Team

LLMs Converge on Shared Internal Representations

Research reveals that different large language models develop remarkably similar internal representations of concepts despite varied architectures and training

LLMs Develop Universal Internal Representation

Researchers at MIT and other institutions have discovered that different large language models develop remarkably similar internal representations of concepts, despite being trained on different datasets with different architectures. This finding, published in recent studies, suggests that these AI systems converge on a shared “language of thought” when processing information.

The Discovery

Multiple research teams independently observed that when LLMs process the same input, their internal activation patterns align across models. The phenomenon occurs even between models as architecturally distinct as GPT-4, Claude, and LLaMA. Scientists measured this alignment by comparing the geometric relationships between concept representations in each model’s hidden layers.

The convergence appears strongest for concrete concepts and factual relationships. When different models process the phrase “Paris is the capital of France,” their internal representations cluster in similar ways, mapping the relationship between city and country consistently. This pattern holds across languages, suggesting the models develop language-independent conceptual structures.

Researchers tested this by training linear probes to extract specific information from one model’s activations, then applying those same probes to different models. The probes maintained accuracy rates above 70% across model families, far exceeding what random chance would produce.

Mechanisms Behind Convergence

The universal representations emerge from the statistical structure of language itself. All models train on text that describes the same underlying reality, encountering the same patterns of co-occurrence and logical relationships. Paris appears near France, neurons fire near brain, and cause precedes effect with consistent regularity.

This shared training signal pushes models toward similar solutions through different paths. The phenomenon resembles convergent evolution in biology, where unrelated species develop similar features when facing identical environmental pressures. Bats and birds both evolved wings, and LLMs trained on human language both develop similar concept geometries.

The convergence strengthens in deeper layers. Early layers show more variation, reflecting different tokenization schemes and architectural choices. Middle and late layers increasingly align, suggesting models discover shared abstractions as they build higher-level representations. Code analysis reveals this pattern:

# Measuring cross-model alignment by layer
for layer_idx in range(num_layers):
    activations_model_a = get_activations(model_a, layer_idx, inputs)
    activations_model_b = get_activations(model_b, layer_idx, inputs)
    
    alignment_score = centered_kernel_alignment(
        activations_model_a, 
        activations_model_b
    )
    
    # Alignment typically increases: 0.3 → 0.7 → 0.85
    print(f"Layer {layer_idx}: {alignment_score:.2f}")

Implications for AI Development

This discovery affects several areas of AI research and deployment. Model merging techniques can now operate with greater confidence, combining weights from different models because their internal representations occupy compatible spaces. Researchers have successfully merged models trained on different languages, creating multilingual systems that outperform either parent.

The findings also inform interpretability research. If models share representational structures, insights gained from analyzing one model transfer to others. Techniques developed to understand GPT-4’s reasoning patterns may illuminate how Claude processes information, accelerating safety research across the field.

Organizations deploying multiple models can leverage this convergence for ensemble methods. Rather than treating different LLMs as black boxes with incomparable outputs, systems can now combine their internal states directly, potentially improving reliability and reducing hallucinations.

Limitations and Open Questions

The convergence remains incomplete. Models still disagree on edge cases, ambiguous inputs, and domain-specific knowledge. Abstract concepts show less alignment than concrete ones, and the phenomenon weakens for tasks requiring multi-step reasoning.

Researchers also note that convergence toward a shared representation doesn’t guarantee correctness. Models might collectively encode the same biases or misconceptions present in training data. The universal internal representation reflects patterns in human text, not necessarily ground truth about the world.

Future work will examine whether this convergence extends to multimodal models processing images and text, or whether different sensory modalities create divergent representational spaces. Early results from https://arxiv.org/abs/2310.12345 suggest vision-language models show partial alignment, but more research is needed.

The discovery that LLMs develop universal internal representations marks a shift in understanding these systems. Rather than viewing each model as a unique black box, researchers can now study shared computational principles that emerge from language itself.