20B Parameter Model Runs Locally in Browser
A 20 billion parameter AI language model has been successfully optimized to run entirely within a web browser, enabling local deployment without requiring a server.
Someone got a 20 billion parameter language model running completely in the browser using WebGPU. No server calls, everything processes locally.
The demo uses Transformers.js v4 (still in preview) with ONNX Runtime Web to make it work. Pretty wild that a model this size can run client-side now.
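In code, the setup is roughly the following. This is a minimal sketch, assuming the Transformers.js v4 preview keeps the familiar `pipeline()` API from v3 (it is still in preview, so details may shift); the model ID is the one linked below.

```javascript
// Minimal sketch: run gpt-oss-20b in the browser with Transformers.js.
// Assumes the v4 preview keeps the v3-style pipeline() API.
import { pipeline } from "@huggingface/transformers";

// Downloads the ONNX weights from the Hub and runs inference on the GPU
// via WebGPU; device: "webgpu" routes execution through ONNX Runtime Web.
const generator = await pipeline(
  "text-generation",
  "onnx-community/gpt-oss-20b-ONNX",
  { device: "webgpu" }
);

// Generation options follow the usual Transformers.js conventions.
const output = await generator("Explain WebGPU in one sentence.", {
  max_new_tokens: 64,
});
console.log(output[0].generated_text);
```

First load is slow since the weights have to be fetched and cached; after that, everything stays on-device.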
Try it here:
- Live demo: https://huggingface.co/spaces/webml-community/GPT-OSS-WebGPU
- Model files: https://huggingface.co/onnx-community/gpt-oss-20b-ONNX
The whole setup runs on WebGPU, which explains how it handles the compute without melting the browser. Source code is available in the demo link if anyone wants to poke around the implementation.
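Since it all hinges on WebGPU, a page like this typically feature-detects support before trying to load anything. A small sketch using the standard `navigator.gpu` API (this is generic detection code, not taken from the demo's source):

```javascript
// Feature-detect WebGPU before attempting to load a 20B model.
// navigator.gpu is only defined in browsers with WebGPU enabled.
async function hasWebGPU() {
  if (!("gpu" in navigator)) return false;
  // requestAdapter() resolves to null when no suitable GPU is available.
  const adapter = await navigator.gpu.requestAdapter();
  return adapter !== null;
}
```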
The main benefit is privacy: prompts never leave the machine. Performance obviously depends on your GPU, but the fact that it works at all for a 20B model is impressive.
Related Tips
GLM-5: 744B Sparse Model with 40B Active Parameters
GLM-5 is a 744-billion parameter sparse language model that activates only 40 billion parameters per forward pass, achieving efficient performance through sparse activation.
30B Model Handles 10M Tokens via Subquadratic Attention
A 30-billion parameter language model achieves 10-million token context processing through novel subquadratic attention mechanisms, dramatically reducing compute costs relative to standard attention.
5 ChatGPT Shortcuts That Cut Prompt Length by 70%
Built-in ChatGPT slash commands like /ELI5, /BRIEFLY, and /FORMAT AS TABLE save typing and produce more consistent results than verbose instructions.