LLMs Use Universal Internal Language, Loop Boosts Performance
What It Is
Research into transformer architecture has revealed something unexpected: large language models appear to develop a language-agnostic internal representation during their middle processing layers. When researchers analyzed how models process identical content in different languages, they found that the internal representations of “hello” in English and “你好” in Chinese are more similar to each other than either is to unrelated content in the same language.
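This kind of similarity is typically measured by comparing hidden-state vectors with cosine similarity. A minimal sketch of that comparison, using made-up three-dimensional vectors in place of real middle-layer activations (the values below are illustrative, not taken from any actual model):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two hidden-state vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical middle-layer activations (illustrative values only).
hello_en = np.array([0.9, 0.1, 0.2])    # "hello" in English
hello_zh = np.array([0.8, 0.2, 0.1])    # "你好" in Chinese
weather_en = np.array([0.1, 0.9, 0.7])  # unrelated English content

# Cross-lingual, same meaning: high similarity.
print(cosine_similarity(hello_en, hello_zh))
# Same language, unrelated meaning: much lower similarity.
print(cosine_similarity(hello_en, weather_en))
```

In real analyses the vectors come from a model's middle-layer activations for matched sentences, but the comparison itself is exactly this.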
This discovery led to a practical architectural modification called layer looping. Instead of processing information through each transformer block exactly once, these experimental models repeat certain middle-layer blocks multiple times. The technique essentially gives the model more time to work in its “internal language” before translating back to human-readable output. The RYS (Repeat Your Self) series demonstrates this approach with four variants that add progressively more layer repetition to the Qwen 3.5 27B base model.
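At a high level, layer looping changes the forward pass from "apply each block once" to "apply the middle blocks several times." A toy sketch of that control flow, with simple functions standing in for transformer blocks (the split points and loop count here are made up for illustration):

```python
def run_model(x, blocks, loop_start, loop_end, loops):
    """Apply early blocks once, loop the middle blocks, apply late blocks once."""
    for block in blocks[:loop_start]:          # early layers: one pass
        x = block(x)
    for _ in range(loops):                     # extra passes through the middle
        for block in blocks[loop_start:loop_end]:
            x = block(x)
    for block in blocks[loop_end:]:            # late layers: one pass
        x = block(x)
    return x

# Stand-in "layers": each just transforms a number.
blocks = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]

# Baseline: every block exactly once (loops=1).
baseline = run_model(0, blocks, 1, 2, loops=1)   # (0 + 1) * 2 - 3 = -1
# Looped: the middle block applied three times.
looped = run_model(0, blocks, 1, 2, loops=3)     # (0 + 1) * 2 * 2 * 2 - 3 = 5
```

The real models loop transformer blocks rather than arithmetic functions, but the shape of the computation is the same: more middle-layer work per token without adding new parameters.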
Why It Matters
This architectural insight challenges assumptions about how transformer models should be structured. Traditional designs stack unique layers sequentially, but layer looping suggests that depth through repetition can be more valuable than depth through unique transformations - at least in the middle processing stages where models work with abstract representations.
For developers working with limited compute budgets, this matters significantly. The technique offers a path to better performance without requiring larger base models or massive training runs from scratch. Teams can take existing models and modify their architecture to add computational depth where it counts most.
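One way to prototype such a modification is to build a new layer schedule that revisits existing layers rather than training new ones; tools like mergekit's passthrough merges construct duplicated-layer models in a similar spirit. A hypothetical helper that computes the schedule (the function name, layer spans, and repeat counts are illustrative, not from the RYS release):

```python
def looped_schedule(n_layers, mid_start, mid_end, repeats):
    """Return the order in which existing layer indices are applied,
    repeating the middle span [mid_start, mid_end) `repeats` times."""
    early = list(range(mid_start))
    middle = list(range(mid_start, mid_end)) * repeats
    late = list(range(mid_end, n_layers))
    return early + middle + late

# A toy 8-layer model whose layers 3-5 are applied twice:
schedule = looped_schedule(8, 3, 6, repeats=2)
print(schedule)  # [0, 1, 2, 3, 4, 5, 3, 4, 5, 6, 7]
```

With repeats=1 the schedule reduces to the original model, which makes it easy to A/B the modification against the unmodified baseline.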
The XL variant shows particular promise for fine-tuning applications. Models that process information more thoroughly in their middle layers appear to learn more effectively from domain-specific training data. Organizations building specialized AI applications could see better results from smaller fine-tuning datasets, reducing both data collection costs and training time.
The broader AI ecosystem benefits from this kind of architectural experimentation. As model sizes push against practical deployment limits, techniques that extract more capability from existing parameter counts become increasingly valuable. Layer looping represents a different optimization axis than the usual scaling approaches.
Getting Started
The RYS-Qwen models are available on HuggingFace in four configurations:
- S (Small): https://huggingface.co/dnhkng/RYS-Qwen3.5-27B-FP8-S
- M (Medium): https://huggingface.co/dnhkng/RYS-Qwen3.5-27B-FP8-M
- L (Large): https://huggingface.co/dnhkng/RYS-Qwen3.5-27B-FP8-L
- XL (Extra Large): https://huggingface.co/dnhkng/RYS-Qwen3.5-27B-FP8-XL
Loading these models follows standard HuggingFace patterns:
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "dnhkng/RYS-Qwen3.5-27B-FP8-XL",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("dnhkng/RYS-Qwen3.5-27B-FP8-XL")
The technical details and benchmarks are documented at https://dnhkng.github.io/posts/rys-ii/. Note that GGUF quantized versions are not yet available, though conversion work is ongoing. Developers should also be aware that current implementations may use more VRAM than expected since duplicated layers aren’t yet optimized for memory sharing.
Context
Layer looping sits alongside other architectural innovations like mixture-of-experts and sparse attention mechanisms. Each approach tackles the efficiency problem from different angles - MoE through conditional computation, sparse attention through selective focus, and layer looping through repeated processing.
The technique has limitations. Memory optimization remains a challenge since naive implementations store each repeated layer separately. The models also require careful tuning to determine optimal repetition counts: too few loops waste the potential benefit, while too many yield diminishing returns and slower inference.
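The memory question comes down to whether repeated layers are stored as copies or as references to the same weights. A toy illustration of the difference, using plain Python objects in place of real layer weights:

```python
import copy

class Layer:
    """Stand-in for a transformer block holding a weight matrix."""
    def __init__(self, weights):
        self.weights = weights

base = [Layer([1.0] * 4) for _ in range(3)]

# Naive duplication: the repeated middle layer gets its own copy of the
# weights, doubling that layer's memory footprint.
naive = [base[0], base[1], copy.deepcopy(base[1]), base[2]]

# Shared looping: the repeated position points at the *same* layer object,
# so no extra weight memory is needed.
shared = [base[0], base[1], base[1], base[2]]

print(naive[1].weights is naive[2].weights)    # False: two copies in memory
print(shared[1].weights is shared[2].weights)  # True: one set of weights
```

The shared form is what an optimized implementation would do; the current RYS checkpoints, per the note above, still behave more like the naive form.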
Alternative approaches to improving model reasoning include chain-of-thought prompting, which happens at the prompt level rather than architecture level, and speculative decoding, which focuses on inference speed rather than capability. Layer looping operates at a different level, modifying how models process information internally rather than how they’re prompted or deployed.
The real test will come from community experimentation and fine-tuning results. If the XL variant delivers on its promise for specialized applications, expect to see layer looping incorporated into future model designs.