Liquid AI’s MoE Models Run in Browser via WebGPU

What It Is

Liquid AI’s Mixture of Experts (MoE) language models now run directly in web browsers through WebGPU, eliminating the need for server infrastructure or cloud API calls. The implementation uses ONNX-formatted models that execute entirely client-side, with the 24B parameter model (activating 2B parameters per token) achieving approximately 50 tokens per second on an M4 Max processor. The smaller 8B variant exceeds 100 tokens per second on identical hardware.

WebGPU provides the computational backbone for this deployment, offering GPU acceleration through standard browser APIs. The MoE architecture contributes to performance by selectively activating only a subset of the model’s total parameters for each token generation, rather than engaging the entire network. This sparse activation pattern reduces computational overhead while maintaining model quality.
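The sparse-activation benefit can be put in back-of-the-envelope terms: per-token compute scales with the parameters actually engaged, not the total. A minimal sketch, using the 24B-total/2B-active figures above and the common rule of thumb of roughly 2 FLOPs per active parameter per token (an approximation, not a measured number):

```javascript
// Rough per-token compute: ~2 FLOPs per *active* parameter
// (a standard rule of thumb for transformer inference).
function flopsPerToken(activeParams) {
  return 2 * activeParams;
}

const dense = flopsPerToken(24e9); // dense model: all 24B params engaged
const moe = flopsPerToken(2e9);    // MoE: only 2B params active per token

console.log(dense / moe); // → 12: the MoE does ~12x less work per token
```

Under this approximation, the 24B MoE pays roughly the per-token cost of a 2B dense model, which is what makes 50 tokens per second achievable on laptop-class GPUs.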

Why It Matters

Browser-based inference fundamentally shifts how developers can deploy language models. Applications no longer require backend infrastructure, and they avoid API rate limits and recurring cloud costs. Privacy-sensitive use cases benefit in particular: medical documentation tools, legal analysis software, and internal business applications can process data without transmitting it to external servers.

The performance metrics reveal practical viability for real-time applications. Generating 50-100 tokens per second supports conversational interfaces, code completion, and content generation without perceptible lag. Developers building offline-first applications or tools for regions with limited connectivity gain access to capable language models that function without internet access after initial download.
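To put those throughput figures in context, a rough tokens-to-words conversion shows why the output stays ahead of human reading speed. The ~0.75 words-per-token ratio below is a common heuristic for English text, not a number from this release:

```javascript
// Heuristic: English text averages ~0.75 words per token (assumption).
const WORDS_PER_TOKEN = 0.75;

function wordsPerSecond(tokensPerSecond) {
  return tokensPerSecond * WORDS_PER_TOKEN;
}

console.log(wordsPerSecond(50));  // → 37.5 words/s for the 24B model
console.log(wordsPerSecond(100)); // → 75 words/s for the 8B variant
// Typical reading speed is roughly 4 words/s, so streamed generation
// never lags behind the reader.
```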

The ONNX format standardization matters for the broader ecosystem. Converting models to ONNX creates interoperability across frameworks and deployment targets. Teams can experiment with different runtime environments without retraining or extensive model conversion work.

Getting Started

The live demonstration runs at https://huggingface.co/spaces/LiquidAI/LFM2-MoE-WebGPU, providing immediate access to test inference speed and output quality. The source code in that space shows implementation details for developers planning similar deployments.

For local development, the ONNX-converted models are available for download from Liquid AI’s Hugging Face organization.

Browser compatibility requires WebGPU support, currently available in Chrome 113+, Edge 113+, and recent Safari Technology Preview builds. Firefox support remains in development. Developers should include feature detection:

```javascript
if (!navigator.gpu) {
  console.error('WebGPU not supported');
  // Fallback to server-side inference
}
```
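Note that `navigator.gpu` can be present while no usable adapter is available, so a more robust probe also calls `requestAdapter()`. The helper below is a sketch; taking the navigator object as a parameter is an illustrative choice that makes it easy to exercise outside a browser:

```javascript
// Resolves to true only when WebGPU is exposed AND an adapter can
// actually be acquired (browsers may expose the API yet return null
// on unsupported hardware).
async function hasWebGPU(nav = globalThis.navigator) {
  if (!nav || !nav.gpu) return false;
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter != null;
  } catch {
    return false;
  }
}
```

A page would await `hasWebGPU()` before starting the multi-gigabyte model download, and route to server-side inference when it resolves to false.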

Model loading times vary with connection speed, since the 24B model weighs several gigabytes. Progressive loading or caching strategies improve the experience on repeat visits.
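One way to implement the caching side is a cache-first fetch for model files. The sketch below assumes the standard browser Cache API shape (`match`/`put`, e.g. obtained via `await caches.open('models')`); injecting the cache and fetch function is an illustrative choice for clarity, not part of any Liquid AI API:

```javascript
// Cache-first fetch for model files: check the (injected) cache before
// hitting the network, and store fresh responses for next time.
async function cachedFetch(url, cache, fetchFn = fetch) {
  const hit = await cache.match(url);
  if (hit) return hit;               // repeat visit: no download needed
  const res = await fetchFn(url);
  await cache.put(url, res.clone()); // store a copy for future visits
  return res;
}
```

On a repeat visit the file is served from the cache, so only the first load pays the multi-gigabyte download.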

Context

Traditional browser-based ML relied on WebGL or CPU-bound JavaScript execution, both significantly slower than WebGPU’s compute shader capabilities. WebGPU provides closer-to-metal GPU access, narrowing the performance gap between browser and native applications.

Compared to server-side inference through APIs like OpenAI or Anthropic, browser deployment trades model selection flexibility for zero network latency, zero per-request cost, and complete privacy. The 24B model won’t match GPT-4’s capabilities, but for many applications the tradeoff favors local execution.

Alternative browser inference frameworks include ONNX Runtime Web and TensorFlow.js. ONNX Runtime Web specifically targets WebGPU acceleration and supports the same model format, making it a natural fit for this deployment approach.

Limitations start with hardware: older devices or integrated GPUs may struggle with larger models. Memory constraints also apply, since the entire model loads into browser memory. The 8B variant offers better compatibility across device tiers while maintaining reasonable performance.
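The memory pressure is easy to estimate: resident size is roughly parameter count times bytes per parameter, with quantization setting the bytes. The 4-bit level below is purely illustrative; the article does not state how these ONNX exports are quantized:

```javascript
// Rough in-memory footprint for a model at a given quantization level.
function modelSizeGB(params, bitsPerParam) {
  return (params * bitsPerParam / 8) / 1e9; // decimal gigabytes
}

console.log(modelSizeGB(24e9, 4)); // → 12 GB: heavy against browser memory limits
console.log(modelSizeGB(8e9, 4));  // → 4 GB: feasible on far more devices
```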

The MoE architecture’s efficiency comes with complexity costs. Training and fine-tuning MoE models requires specialized infrastructure, though inference remains straightforward once models are converted to ONNX format. Teams considering similar deployments should evaluate whether the performance benefits justify the additional architectural complexity versus dense models.