By Promptsicle Team

GLM-5 Training: 3.2x Faster RL with DSA & Async Pipeline

GLM-5 achieves 3.2x faster reinforcement learning training through Divergence-Suppressed Alignment (DSA) and an asynchronous RL pipeline.

GLM-5 Training Optimizations: DSA and Async RL

GLM-5 achieves a 3.2x speedup in reinforcement learning training over its predecessor, GLM-4, through two training innovations: Divergence-Suppressed Alignment (DSA) and asynchronous RL pipeline execution. Zhipu AI’s latest model shows how rethinking the training workflow itself can deliver gains that rival those from scaling compute alone.

Training Approach

DSA addresses a fundamental problem in RLHF training: policy divergence during the policy optimization phase. Traditional approaches update the policy model synchronously with reward signals, which creates bottlenecks when the policy drifts too far from the reference model. GLM-5 introduces a divergence suppression mechanism that monitors KL divergence in real time and dynamically adjusts the learning rate before catastrophic forgetting occurs.
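To make the monitored quantity concrete, here is a minimal sketch of how KL divergence between the policy and the reference model could be computed from their output logits. The function names are hypothetical and the code is pure Python for readability. It is not Zhipu AI's implementation, and a real training loop would do this on batched tensors.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_kl(policy_logits, ref_logits):
    """KL(policy || reference) at one token position, in nats."""
    log_p = log_softmax(policy_logits)
    log_q = log_softmax(ref_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def sequence_kl(policy_logits_seq, ref_logits_seq):
    """Mean per-token KL across all positions of one sequence."""
    kls = [token_kl(p, r) for p, r in zip(policy_logits_seq, ref_logits_seq)]
    return sum(kls) / len(kls)
```

Identical logits give a KL of zero, and the value grows as the policy's token distribution moves away from the reference. A scalar like the one returned by `sequence_kl` is the kind of measurement that would be compared against the divergence threshold.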

The implementation uses a sliding window of KL measurements across the last 1,000 training steps. When the average divergence over that window exceeds a threshold of 0.15, the system automatically reduces the policy learning rate by 40% and increases the reference model’s influence in the loss function. This adaptive mechanism prevents the training collapse that plagued earlier RLHF implementations.

Asynchronous RL execution decouples three previously sequential operations: experience collection, reward computation, and policy updates. GLM-5 runs these processes on separate GPU clusters connected through a high-throughput message queue. While one cluster generates rollouts using the current policy, another cluster computes rewards, and a third applies gradient updates. The architecture maintains consistency through versioned policy snapshots, ensuring reward models never evaluate experiences from policies more than two versions old.
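The version-staleness rule can be sketched in a few lines. This is a single-process illustration under stated assumptions: `Rollout`, `is_fresh`, and `score_fresh_rollouts` are hypothetical names, and a standard-library `Queue` stands in for the message queue between clusters. In a real deployment each stage would run on its own hardware.

```python
from dataclasses import dataclass
from queue import Queue

MAX_VERSION_LAG = 2  # from the text: at most two policy versions old


@dataclass
class Rollout:
    policy_version: int  # snapshot version that generated this rollout
    tokens: list


def is_fresh(rollout, current_version, max_lag=MAX_VERSION_LAG):
    """True if the generating policy is within max_lag versions of the current one."""
    return current_version - rollout.policy_version <= max_lag


def score_fresh_rollouts(rollout_queue, current_version, reward_fn):
    """Drain the queue, scoring fresh rollouts and counting the stale ones dropped."""
    scored, dropped = [], 0
    while not rollout_queue.empty():
        rollout = rollout_queue.get()
        if is_fresh(rollout, current_version):
            scored.append((rollout, reward_fn(rollout)))
        else:
            dropped += 1
    return scored, dropped
```

With the current policy at version 6, rollouts from versions 4, 5, and 6 would be scored and a rollout from version 3 would be dropped. Dropping is only one possible policy for stale data. The article does not say whether stale rollouts are discarded or reweighted.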

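# Simplified DSA learning-rate adjustment
from collections import deque


class DSAOptimizer:
    """Scales the policy learning rate down when the windowed KL drifts too high."""

    def __init__(self, base_lr=1e-5, kl_threshold=0.15, window_size=1000, lr_scale=0.6):
        self.base_lr = base_lr
        self.kl_threshold = kl_threshold
        self.lr_scale = lr_scale  # 0.6 corresponds to the 40% reduction
        # deque(maxlen=...) evicts the oldest measurement in O(1)
        self.kl_window = deque(maxlen=window_size)

    def adjust_lr(self, current_kl):
        """Record the latest KL value and return the learning rate to use."""
        self.kl_window.append(current_kl)
        avg_kl = sum(self.kl_window) / len(self.kl_window)
        if avg_kl > self.kl_threshold:
            return self.base_lr * self.lr_scale
        return self.base_lr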
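The snippet above adjusts only the learning rate. The description also mentions increasing the reference model's influence in the loss function, which could be modeled as raising the coefficient on a KL penalty term. The sketch below assumes that reading. The names `dsa_step_settings` and `penalized_loss`, the base coefficient of 0.1, and the 1.5x scale factor are all illustrative assumptions, not values from Zhipu AI.

```python
def dsa_step_settings(avg_kl, base_lr=1e-5, base_kl_coef=0.1,
                      kl_threshold=0.15, lr_scale=0.6, kl_coef_scale=1.5):
    """Return (learning_rate, kl_coef) for the next training step.

    base_kl_coef and kl_coef_scale are assumed values; the article only
    says the reference model's influence is increased.
    """
    if avg_kl > kl_threshold:
        return base_lr * lr_scale, base_kl_coef * kl_coef_scale
    return base_lr, base_kl_coef


def penalized_loss(policy_loss, kl, kl_coef):
    """Policy loss plus a KL penalty that pulls the policy toward the reference."""
    return policy_loss + kl_coef * kl
```

A larger `kl_coef` makes the same amount of drift cost more, so the optimizer is pushed back toward the reference model while the reduced learning rate limits how far any single step can move.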

Notable Results

GLM-5 completes RLHF training in 18 hours on 512 H100 GPUs, down from 58 hours for GLM-4 on equivalent hardware. The speedup comes primarily from the asynchronous pipeline, which maintains 87% GPU utilization compared to 52% in synchronous training.
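The headline figure can be checked directly from these numbers. The short calculation below also estimates how much of the speedup the utilization gain would explain, under the simplifying assumption that wall-clock time scales inversely with GPU utilization. That assumption is ours, not the article's.

```python
glm4_hours, glm5_hours = 58, 18
glm4_util, glm5_util = 0.52, 0.87
num_gpus = 512

total_speedup = glm4_hours / glm5_hours             # about 3.22
utilization_gain = glm5_util / glm4_util            # about 1.67
residual_factor = total_speedup / utilization_gain  # about 1.93
gpu_hours_saved = (glm4_hours - glm5_hours) * num_gpus
```

Under that assumption, higher utilization accounts for roughly a 1.67x gain, leaving a factor of about 1.9x that the reported figures do not break down. The saving per training run works out to 20,480 GPU-hours.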

On the AlpacaEval 2.0 benchmark, GLM-5 scores a 42.3% win rate against GPT-4, placing it among the top open-weight models. More significantly, DSA reduces the number of training runs needed to achieve stable convergence from an average of 4.2 attempts to 1.3 attempts, dramatically cutting experimental costs.

The model demonstrates particularly strong performance on mathematical reasoning tasks, achieving 68.7% on MATH and 89.2% on GSM8K. Zhipu AI attributes these gains to DSA’s ability to maintain coherent learning signals throughout extended training runs, preventing the reward hacking behavior that degrades reasoning capabilities.

Running Locally

GLM-5’s base weights are available at https://huggingface.co/THUDM/glm-5-9b under an Apache 2.0 license. The 9B-parameter variant requires approximately 18 GB of VRAM for inference at fp16 precision, making it accessible on consumer RTX 4090 cards.

Implementing DSA for fine-tuning requires modifications to standard RLHF codebases. The Zhipu team released reference implementations compatible with DeepSpeed and Megatron-LM frameworks. Full asynchronous training demands multi-node infrastructure, but single-node DSA fine-tuning provides measurable stability improvements even without the async pipeline.

Quantized versions at 4-bit precision reduce memory requirements to 6GB, enabling deployment on laptops with dedicated GPUs. Performance degradation remains minimal, with less than 2% accuracy loss on most benchmarks.
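The memory figures follow from simple arithmetic on the parameter count. The helper below, with the hypothetical name `weight_memory_gb`, estimates the memory for the weights alone in decimal gigabytes. It ignores activations, the KV cache, and quantization metadata.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


fp16_gb = weight_memory_gb(9e9, 16)  # 18.0
int4_gb = weight_memory_gb(9e9, 4)   # 4.5
```

The fp16 result matches the 18GB figure quoted for the 9B variant. At 4 bits the weights alone come to 4.5GB, so the 6GB figure presumably leaves about 1.5GB for quantization scales, activations, and cache, although the article does not itemize it.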

Trade-offs

DSA introduces computational overhead through continuous KL divergence monitoring. Each training step requires an additional forward pass through the reference model, increasing per-step latency by 15-20%. Set against the overall reduction in training time, this overhead is minor, but it matters if the technique is applied to online learning in latency-sensitive, real-time serving scenarios.

The asynchronous pipeline adds architectural complexity that smaller teams may struggle to implement. Managing three separate GPU clusters with version synchronization requires sophisticated orchestration infrastructure. Organizations without multi-node training experience may find that the coordination costs outweigh the speedup until they reach scales beyond 256 GPUs.

DSA’s hyperparameters require careful tuning for different model sizes and tasks. The 0.15 KL threshold works well for conversational models but may be too restrictive for creative writing applications where policy exploration benefits from higher divergence tolerance. Finding optimal settings currently demands empirical testing rather than following established guidelines.