GLM 4.7 Flash Uncensored: Fast Local AI Model

GLM 4.7 Flash Uncensored is a fast, lightweight AI model designed for local deployment, offering unrestricted conversational capabilities and quick response times.

Someone fine-tuned the new GLM 4.7 Flash model to remove content restrictions while keeping performance intact. Pretty interesting for local AI setups.

The model runs on only ~3B active params (from a 30B MoE architecture), so inference is surprisingly fast. Two versions are available: Balanced for coding tasks and Aggressive for everything else.

Recommended settings for llama.cpp:

--temp 1.0 --top-p 0.95 --min-p 0.01 --jinja

For tool use, switch to --temp 0.7 --top-p 1.0 and keep --repeat-penalty at 1.0.
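As a rough sketch, the two sampling profiles above could be passed to llama.cpp's llama-cli like this. The GGUF filename is a placeholder, not the actual release name; substitute whichever quant you downloaded:

```shell
# General chat: sampling settings recommended above.
# Model filename is a placeholder for the quant you downloaded.
llama-cli -m glm-4.7-flash-uncensored-Q4_K_M.gguf \
  --temp 1.0 --top-p 0.95 --min-p 0.01 --jinja \
  -cnv

# Tool use: lower temperature, full top-p, neutral repeat penalty.
llama-cli -m glm-4.7-flash-uncensored-Q4_K_M.gguf \
  --temp 0.7 --top-p 1.0 --repeat-penalty 1.0 --jinja \
  -cnv
```

The --jinja flag tells llama.cpp to apply the chat template embedded in the GGUF, which matters here given the template issues mentioned for Ollama.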

Works with llama.cpp, LM Studio, Jan, and koboldcpp. It currently has chat template issues with Ollama, though.

Download links with Q8_0, Q6_K, and Q4_K_M quants:

The creator claims it’s effectively lossless compared to