Google Releases Gemma Scope 2 for Model Interpretability

Google releases Gemma Scope 2, a suite of pre-trained sparse autoencoders that helps researchers understand and analyze the internal workings of AI language models

Google just dropped Gemma Scope 2, which is pretty useful for anyone trying to understand what’s actually happening inside language models.

It’s basically a collection of pre-trained sparse autoencoders (SAEs) that let researchers peek into Gemma 2’s internal workings. The SAEs are available at https://huggingface.co/collections/google/gemma-scope-2 and cover the 2B, 9B, and 27B parameter versions of Gemma 2.

Quick setup: note that the repo ships raw .npz weight files rather than a standard transformers model, so AutoModel.from_pretrained won’t load it. Instead, download the weights for one specific SAE (the layer/width/sparsity path below mirrors the original Gemma Scope repo layout and is an assumption here):

from huggingface_hub import hf_hub_download

sae_path = hf_hub_download("google/gemma-scope-2b-pt-res", "layer_20/width_16k/average_l0_71/params.npz")
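From there, the weights slot into a small JumpReLU SAE module. This sketch follows the reference implementation from the original Gemma Scope tutorial and assumes the params.npz layout (W_enc, W_dec, b_enc, b_dec, and threshold arrays) carries over:

import numpy as np
import torch
import torch.nn as nn

class JumpReLUSAE(nn.Module):
    # Sparse autoencoder with a learned per-feature JumpReLU threshold,
    # per the reference implementation in the Gemma Scope tutorial
    def __init__(self, d_model, d_sae):
        super().__init__()
        self.W_enc = nn.Parameter(torch.zeros(d_model, d_sae))
        self.W_dec = nn.Parameter(torch.zeros(d_sae, d_model))
        self.threshold = nn.Parameter(torch.zeros(d_sae))
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def encode(self, acts):
        # Zero out any feature whose pre-activation is below its threshold
        pre = acts @ self.W_enc + self.b_enc
        return (pre > self.threshold) * torch.relu(pre)

    def decode(self, feats):
        return feats @ self.W_dec + self.b_dec

params = np.load(sae_path)  # assumed keys: W_enc, W_dec, b_enc, b_dec, threshold
state = {k: torch.from_numpy(v) for k, v in params.items()}
sae = JumpReLUSAE(*state["W_enc"].shape)
sae.load_state_dict(state)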

The cool part is they’ve trained these at multiple layers and sites (residual stream, MLP outputs, and attention outputs), so you can see how different features activate for specific inputs; see the sketch below. This makes interpretability research way more accessible, since you don’t need to train your own SAEs from scratch.
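Here’s what that looks like in practice, building on the sae loaded above (a sketch that assumes the SAE targets the layer-20 residual stream of Gemma 2 2B):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b")

inputs = tokenizer("The Golden Gate Bridge is in", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states[0] is the embedding output, so index 21 is the residual
# stream after block 20, matching the layer_20 SAE downloaded earlier
resid = out.hidden_states[21]        # (1, seq_len, d_model)
features = sae.encode(resid)         # (1, seq_len, d_sae), mostly zeros
print((features > 0).sum(dim=-1))    # active feature count per token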

Particularly handy for safety research: figuring out why models behave in certain ways, or what triggers specific outputs.
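One concrete starting point, continuing from the activations computed above: rank which features fire hardest on a token of interest, then look those feature indices up in a feature browser (Neuronpedia hosted dashboards for the original Gemma Scope release):

import torch

# Top five features on the final token of the prompt
top = torch.topk(features[0, -1], k=5)
for idx, score in zip(top.indices.tolist(), top.values.tolist()):
    print(f"feature {idx}: activation {score:.2f}")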