Liquid AI MoE Models Run in Browser via WebGPU
Liquid AI demonstrates its mixture-of-experts language models running directly in web browsers using WebGPU technology for efficient client-side inference.
Liquid AI’s mixture-of-experts models now run entirely in web browsers through WebGPU acceleration, eliminating server dependencies for on-device inference.
Architecture and Training Approach
Liquid AI developed these models using a mixture-of-experts (MoE) architecture that activates only specific subnetworks for each input, rather than engaging the entire model. This selective activation pattern reduces computational overhead while maintaining model capacity. The training process employed sparse gating mechanisms that learned to route tokens to specialized expert networks based on input characteristics.
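The routing step described above can be sketched as generic top-k gating. This is a minimal illustration, not Liquid AI's implementation: the function names, the scalar "experts", and the choice to renormalize the selected gate weights are all assumptions made for the example.

```javascript
// Illustrative top-k router. All names here are hypothetical.

/** Numerically stable softmax over an array of router logits. */
function softmax(logits) {
  const max = Math.max(...logits);
  const exps = logits.map((x) => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

/**
 * Pick the k highest-scoring experts for one token and renormalize
 * their gate weights so they sum to 1.
 */
function routeToken(routerLogits, k) {
  const ranked = softmax(routerLogits)
    .map((p, expert) => ({ expert, p }))
    .sort((a, b) => b.p - a.p)
    .slice(0, k);
  const total = ranked.reduce((acc, r) => acc + r.p, 0);
  return ranked.map((r) => ({ expert: r.expert, weight: r.p / total }));
}

/**
 * Weighted sum of the selected experts' outputs. Experts that were not
 * selected are never called, which is where the compute saving comes from.
 * Real experts are feed-forward networks over vectors; scalars keep this short.
 */
function moeForward(x, routerLogits, experts, k) {
  return routeToken(routerLogits, k).reduce(
    (acc, { expert, weight }) => acc + weight * experts[expert](x),
    0
  );
}
```

With four experts and `k = 2`, only two expert functions run per token regardless of how many experts the layer holds.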
The company built these models on its Liquid Foundation Models (LFMs) framework, which uses liquid neural networks, a design inspired by biological neurons with time-varying dynamics. Unlike traditional transformers, these architectures incorporate continuous-time models that adapt their behavior based on input sequences. The MoE layer sits atop this foundation, with 8 to 16 expert networks per layer that specialize in different aspects of language understanding.
Training occurred on diverse text corpora with a focus on creating compact representations. The team optimized specifically for browser deployment by quantizing weights to 4-bit and 8-bit precision during training, rather than applying quantization post hoc. This quantization-aware training is meant to preserve most of the model's accuracy while shrinking its memory footprint.
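To make the idea concrete, here is a sketch of symmetric per-tensor int8 "fake quantization", the round-trip that quantization-aware training typically applies to weights in the forward pass. It is a generic textbook scheme; the article does not say which scheme Liquid AI used, and the function names are hypothetical.

```javascript
// Symmetric per-tensor int8 quantization (generic scheme, assumed for illustration).

/** Map float weights to int8 values plus a single scale factor. */
function quantizeInt8(weights) {
  const maxAbs = weights.reduce((m, w) => Math.max(m, Math.abs(w)), 0);
  const scale = maxAbs === 0 ? 1 : maxAbs / 127;
  const q = Int8Array.from(weights, (w) =>
    Math.max(-127, Math.min(127, Math.round(w / scale)))
  );
  return { q, scale };
}

/** Recover approximate float weights from the int8 values. */
function dequantize({ q, scale }) {
  return Float32Array.from(q, (v) => v * scale);
}

/**
 * Quantize then dequantize. Training against these rounded weights lets
 * the model adapt to the rounding error it will see at inference time.
 */
function fakeQuantize(weights) {
  return dequantize(quantizeInt8(weights));
}
```

Each weight is stored in one byte instead of four, and the reconstruction error per weight is bounded by about half the scale factor.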
Notable Results
The browser-based models achieve 85-92% of the performance of their server-hosted counterparts across standard benchmarks, despite running with significantly constrained resources. On MMLU (Massive Multitask Language Understanding), the 3B parameter MoE model scores 61.3% when running in Chrome with WebGPU enabled, compared to 67.8% for the full-precision server version, or roughly 90% of the server score.
Inference speed varies by hardware but remains practical for real-world use. On devices with dedicated GPUs, the models generate 15-25 tokens per second for the 1B parameter variant and 8-12 tokens per second for the 3B variant. Integrated graphics produce slower but usable speeds of 5-8 tokens per second for smaller models.
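As a rough worked example, the throughput figures above translate into wait times for a fixed-length reply. The helper below is a hypothetical back-of-the-envelope calculation that ignores prompt processing and model load time.

```javascript
/** Seconds needed to generate `tokens` tokens at a steady decode rate. */
function generationSeconds(tokens, tokensPerSecond) {
  if (tokensPerSecond <= 0) throw new RangeError('tokensPerSecond must be positive');
  return tokens / tokensPerSecond;
}

// A 200-token reply at the quoted rates:
const fastGpu = generationSeconds(200, 25); // 1B model, dedicated GPU, upper bound
const slowGpu = generationSeconds(200, 15); // 1B model, dedicated GPU, lower bound
const integrated = generationSeconds(200, 5); // small model, integrated graphics, lower bound
```

At these rates a 200-token answer takes roughly 8 to 13 seconds on a dedicated GPU and up to 40 seconds on integrated graphics, which is why streaming tokens to the UI as they arrive matters for perceived responsiveness.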
Memory consumption stays low. The 1B parameter model requires approximately 800 MB of VRAM, while the 3B variant uses around 2.1 GB. These footprints fit comfortably within the constraints of modern consumer hardware, including mid-range laptops and tablets.
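One way to sanity-check these footprints is to convert them into average bits per parameter. The sketch below assumes decimal units (1 MB = 10^6 bytes) and treats the whole footprint as weights, so the result is an upper bound: activations and caches also occupy VRAM.

```javascript
/** Average storage cost per parameter, in bits, for a given memory footprint. */
function bitsPerParameter(footprintBytes, parameterCount) {
  return (footprintBytes * 8) / parameterCount;
}

const MB = 1e6;
const GB = 1e9;

const small = bitsPerParameter(800 * MB, 1e9); // about 6.4 bits per parameter
const large = bitsPerParameter(2.1 * GB, 3e9); // about 5.6 bits per parameter
```

Both values fall between 4 and 8 bits, which is consistent with a mix of the 4-bit and 8-bit precisions mentioned earlier rather than uniform int8 storage.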
The MoE architecture contributes directly to these efficiency gains. Because only 2-3 experts activate per token, the effective computational cost resembles that of a much smaller dense model. As a rough illustration, a 3B parameter MoE model with 8 experts and 2 active per token would use about 750M parameters per forward pass if all of its weights sat in the experts (3B × 2/8). Shared layers raise that figure somewhat, but the model still delivers the capacity of a larger network at a fraction of the cost.
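The active-parameter arithmetic can be written out explicitly. The helper below is an illustrative estimate, not a description of the actual model layout: `sharedParams` (attention, embeddings, and other always-on weights) is a hypothetical parameter, and the article gives no breakdown between shared and expert weights.

```javascript
/**
 * Estimate parameters touched per token in an MoE model.
 * Shared weights always run; expert weights run in proportion to the
 * fraction of experts selected.
 */
function activeParameters({ totalParams, sharedParams = 0, numExperts, activeExperts }) {
  const expertParams = totalParams - sharedParams;
  return sharedParams + expertParams * (activeExperts / numExperts);
}

// All weights in experts, 2 of 8 active: the 750M figure.
const twoOfEight = activeParameters({ totalParams: 3e9, numExperts: 8, activeExperts: 2 });

// Same model with 3 of 8 experts active.
const threeOfEight = activeParameters({ totalParams: 3e9, numExperts: 8, activeExperts: 3 });

// Assumed 0.6B of shared weights, 2 of 8 experts active.
const withShared = activeParameters({
  totalParams: 3e9,
  sharedParams: 0.6e9,
  numExperts: 8,
  activeExperts: 2
});
```

The three cases work out to 0.75B, 1.125B, and 1.2B active parameters respectively, so the quoted 750M is best read as the low end of the range.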
Running Locally
Developers can integrate these models using Liquid AI’s JavaScript SDK, which handles WebGPU initialization and model loading. The basic implementation requires just a few lines:
    import { LiquidMoE } from '@liquid-ai/webgpu';

    const model = await LiquidMoE.load('liquid-moe-1b', {
      device: 'webgpu',
      precision: 'int8'
    });

    const response = await model.generate('Explain quantum computing', {
      maxTokens: 200,
      temperature: 0.7
    });
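Browser compatibility currently centers on browsers built on Chromium 113 or later, such as Chrome and Edge 113 and above, which ship WebGPU enabled by default on desktop. Opera uses its own version numbering, so check its underlying Chromium version rather than the Opera release number. Firefox support has rolled out platform by platform, starting with Windows in Firefox 141, and may still require manual flag activation elsewhere. Safari exposed WebGPU only behind a feature flag in its 17.x releases and enabled it by default in Safari 26.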
The models download on first use and cache locally using the browser’s IndexedDB storage. Subsequent loads retrieve from cache, reducing initialization time from 8-12 seconds to under 2 seconds. Developers can preload models during application startup to hide latency.
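The cache-first loading pattern described above can be sketched independently of the SDK. In this example `cache` is any object with `get` and `set` methods; in a browser it would wrap IndexedDB, while a plain `Map` stands in here so the sketch runs anywhere. The function name, the key format, and the `download` callback are hypothetical.

```javascript
/**
 * Return model bytes from the cache when present; otherwise download
 * them once and store them for the next load.
 */
async function loadWeights(key, cache, download) {
  const cached = await cache.get(key);
  if (cached !== undefined) {
    return { bytes: cached, fromCache: true };
  }
  const bytes = await download(key);
  await cache.set(key, bytes);
  return { bytes, fromCache: false };
}

/** Warm the cache at application startup so later loads skip the network. */
function preload(key, cache, download) {
  return loadWeights(key, cache, download).then(() => undefined);
}
```

Calling `preload` while the rest of the page initializes is one way to hide the first-load latency mentioned above.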
WebGPU provides the critical performance layer, offering GPU acceleration through a standardized web API. Without WebGPU, these models fall back to WebAssembly with SIMD, which runs 10-15x slower and becomes impractical for interactive applications.
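An application can detect WebGPU before loading a model by checking for `navigator.gpu` and requesting an adapter, both part of the standard WebGPU API. The sketch below is a minimal example: the function name and the `'wasm'` label are hypothetical, and how the SDK selects its own fallback is not documented here.

```javascript
/**
 * Decide which backend to request. `nav` defaults to the global navigator
 * and can be replaced with a stub for testing.
 */
async function pickBackend(nav = globalThis.navigator) {
  if (nav && nav.gpu) {
    try {
      // requestAdapter() resolves to null when no suitable GPU is available.
      const adapter = await nav.gpu.requestAdapter();
      if (adapter) return 'webgpu';
    } catch (err) {
      // Treat any failure during adapter discovery as "no WebGPU".
    }
  }
  return 'wasm';
}
```

Checking for the adapter, not just the `gpu` property, matters because some browsers expose the API on hardware or drivers that cannot actually provide a device.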
Trade-offs
Privacy represents the primary advantage of browser-based inference. User data never leaves the device, eliminating concerns about server logging, data retention, or third-party access. This architecture suits applications handling sensitive information like medical records, financial data, or personal communications.
The cost structure shifts dramatically. Organizations avoid per-token API charges and server infrastructure expenses, though they transfer computational burden to users’ devices. This trade works well for applications with intermittent usage patterns but may frustrate users on battery power or older hardware.
Offline functionality emerges naturally from local execution. Once cached, models work without internet connectivity, enabling use cases in remote locations, aircraft, or situations with unreliable networks.
Model updates present challenges. Unlike server-hosted models that update transparently, browser-based versions require cache invalidation and redownloading. The SDK includes version checking mechanisms, but users might encounter inconsistent behavior during transition periods.
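A simple way to reason about cache invalidation is to key cached weights by model name and version, then compare the cached keys against the latest manifest. The sketch below assumes a hypothetical manifest shape (`{ model, version }`) and key format (`name@version`); the SDK's actual version-checking mechanism is not described in this article.

```javascript
/**
 * Compare cached entries against the latest manifest and report which
 * key is wanted, which older versions are stale, and whether a download
 * is needed.
 */
function reconcileCache(cachedKeys, manifest) {
  const prefix = `${manifest.model}@`;
  const wanted = `${prefix}${manifest.version}`;
  const stale = cachedKeys.filter((key) => key.startsWith(prefix) && key !== wanted);
  return { wanted, stale, needsDownload: !cachedKeys.includes(wanted) };
}
```

Deleting stale entries only after the new version has finished downloading keeps the old model usable during the transition, at the cost of temporarily holding two copies on disk.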
Performance variability across devices creates testing complexity. Applications must gracefully handle slower hardware, potentially offering degraded experiences or fallback options for unsupported configurations.
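Graceful degradation can be as simple as measuring throughput during a short warm-up and choosing a tier from the result. The thresholds and tier names below are illustrative assumptions loosely based on the speeds quoted earlier, not values recommended by Liquid AI.

```javascript
/**
 * Choose an experience tier from the detected backend and a measured
 * decode rate (tokens per second from a short warm-up generation).
 */
function chooseTier({ backend, tokensPerSecond }, thresholds = { large: 12, small: 5 }) {
  if (backend !== 'webgpu') return 'fallback';
  if (tokensPerSecond >= thresholds.large) return 'large';
  if (tokensPerSecond >= thresholds.small) return 'small';
  return 'fallback';
}
```

What the `'fallback'` tier means is an application decision: it might hide the feature, show a notice, or route requests to a hosted model.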