
LLMs Use Universal Internal Language Across Languages

What It Is

Research into transformer model internals has revealed something unexpected: large language models appear to develop a language-agnostic internal representation when processing information. When analyzing the middle layers of these models, identical content translated into different languages (such as Chinese and English) produces more similar activation patterns than completely different content written in the same language.
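The similarity comparison described above can be sketched with a simple cosine-similarity check. The vectors below are synthetic stand-ins for mean-pooled middle-layer hidden states (extracting real activations would require running a model with hidden-state outputs enabled); the shared "semantic" component and the noise scale are illustrative assumptions, not measurements from the study.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two pooled activation vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Synthetic stand-ins for mean-pooled middle-layer hidden states.
# A shared "semantic" component models the language-agnostic content;
# small language-specific noise is added on top.
rng = np.random.default_rng(0)
semantic_dog = rng.normal(size=256)
semantic_sky = rng.normal(size=256)

english_dog = semantic_dog + 0.3 * rng.normal(size=256)
chinese_dog = semantic_dog + 0.3 * rng.normal(size=256)
english_sky = semantic_sky + 0.3 * rng.normal(size=256)

# Same content across languages should score higher than
# different content in the same language.
cross_lang = cosine_similarity(english_dog, chinese_dog)
same_lang = cosine_similarity(english_dog, english_sky)
print(cross_lang > same_lang)
```

Under this toy model the cross-language pair scores near 1.0 while the same-language pair scores near 0.0, mirroring the qualitative pattern the research reports for real middle-layer activations.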

This discovery emerged from experiments with model expansion techniques. Rather than training larger models from scratch, researchers found they could repeat transformer blocks in the middle layers of existing models to create more capable variants. The RYS (Repeat Your Self) series demonstrates this approach, building on Qwen3.5-27B by duplicating middle-layer blocks at different scales. Four variants exist with increasing repetition counts: S (small), M (medium), L (large), and XL (extra-large), all quantized to FP8 for efficiency.
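The expansion idea can be sketched as pure index manipulation: choose a contiguous range of middle layers and repeat it while leaving the early and late layers untouched. The layer counts, block range, and repeat count below are hypothetical, not the actual RYS recipe; in real model surgery you would copy the corresponding transformer blocks into the model's layer list in this order.

```python
def expand_layers(n_layers: int, start: int, end: int, repeats: int) -> list[int]:
    """Return the layer order for an expanded model: layers [start, end)
    are repeated `repeats` extra times, preserving their original order.
    Early (embedding-side) and late (output-side) layers stay untouched."""
    early = list(range(start))
    middle = list(range(start, end))
    late = list(range(end, n_layers))
    return early + middle * (1 + repeats) + late

# Toy 12-layer model: duplicate the middle block (layers 4-7) once.
order = expand_layers(12, 4, 8, 1)
print(order)       # [0, 1, 2, 3, 4, 5, 6, 7, 4, 5, 6, 7, 8, 9, 10, 11]
print(len(order))  # 16 layers: 12 original + 4 duplicated
```

The S/M/L/XL variants would correspond to increasing `repeats` (or wider block ranges) in this scheme; no new weights are trained, so the expanded model starts from the base model's capabilities.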

The technique works because middle layers handle semantic understanding rather than language-specific encoding or decoding. Early layers convert tokens into internal representations and final layers convert those representations back into text, but the middle layers operate on meaning itself - apparently in a format that transcends any individual language.

Why It Matters

This finding has immediate practical implications for multilingual AI development. If models truly process meaning in a universal format, training on high-quality data in one language should improve performance across all languages the model supports. Teams working with limited resources in low-resource languages could potentially achieve better results by focusing on semantic understanding rather than language-specific training.

The model expansion approach also offers a cost-effective path to larger models. Training a 27B parameter model from scratch requires enormous computational resources, but expanding an existing model through layer repetition provides a shortcut. The XL variant shows particular promise for fine-tuning applications, potentially reaching state-of-the-art performance in its size category after task-specific training.

For researchers studying model interpretability, the universal representation hypothesis provides a new lens for understanding how transformers work. Rather than viewing these models as sophisticated pattern matchers operating on text statistics, the evidence suggests they build abstract semantic representations that exist independently of surface-level language features.

Getting Started

All four RYS-Qwen3.5-27B variants (S, M, L, and XL) are available on Hugging Face.

Loading these models follows standard Hugging Face patterns:


from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "dnhkng/RYS-Qwen3.5-27B-FP8-XL",
    device_map="auto",  # shard across available GPUs
)
tokenizer = AutoTokenizer.from_pretrained("dnhkng/RYS-Qwen3.5-27B-FP8-XL")

The complete technical writeup, including cross-language similarity analysis and methodology details, is available at https://dnhkng.github.io/posts/rys-ii/

Teams interested in fine-tuning should start with the XL variant, which shows the strongest response to task-specific training. The FP8 quantization keeps memory requirements manageable while preserving most of the model’s capabilities.
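As a rough capacity check (a back-of-the-envelope sketch, not vendor-published figures), FP8 stores one byte per parameter, so the weights of the 27B-parameter base alone need roughly 25 GiB, versus about double that at FP16. Note that the expanded variants carry more than 27B parameters, since duplicated layers each store their own copy of the weights.

```python
def weight_memory_gib(n_params: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GiB (ignores activations,
    KV cache, and framework overhead)."""
    return n_params * bytes_per_param / 1024**3

params = 27e9
print(round(weight_memory_gib(params, 1.0), 1))  # FP8:  ~25.1 GiB
print(round(weight_memory_gib(params, 2.0), 1))  # FP16: ~50.3 GiB
```

This is why FP8 quantization matters here: it roughly halves the memory needed to hold the expanded model, keeping the larger variants within reach of a single high-memory GPU or a small multi-GPU node.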

Context

Traditional model scaling follows a “bigger is better” philosophy - more parameters, more training data, more compute. The RYS approach challenges this by demonstrating that architectural modifications to existing models can yield significant improvements without starting from scratch.

However, layer repetition has limits. Each duplicated block adds computational cost during inference, and there’s likely a point of diminishing returns where additional repetitions provide minimal benefit. The researchers are developing new model formats optimized for duplicated layers, suggesting current implementations may not fully exploit this technique’s potential.

Alternative expansion methods exist, including mixture-of-experts architectures and sparse activation patterns. These approaches offer different tradeoffs between model size, inference speed, and capability. The universal representation finding may apply across these architectures, but validation would require similar cross-language analysis on different model types.

The broader implication remains speculative: if models develop language-independent semantic representations, what does this reveal about the nature of meaning itself? The similarity between model internals and human conceptual processing deserves further investigation.