Blockchain Council · Global Technology Council
AI · 7 min read

GLM-5

Michael Willson

Introduction

GLM-5 is Z.ai (Zhipu AI)’s new flagship open-weights large language model, publicly released on February 11, 2026, and positioned as a “frontier” open model built for complex software engineering and long-horizon agent tasks. It is not just “another big model”: it is engineered and marketed around agent reliability, long-context work, and the practical realities of serving a huge Mixture-of-Experts system at scale.

What GLM-5 is

GLM-5 is described as a flagship open-weights model designed for “systems engineering” workloads and long-horizon agent workflows. The positioning is explicit: stronger coding capability, better performance on long-running agent tasks, and optimization for agent frameworks.

Reuters’ reporting on the release frames it as open-source and highlights Zhipu’s claim that GLM-5 approaches Claude Opus 4.5 on coding benchmarks and beats Gemini 3 Pro on some benchmarks. It also notes the model is optimized for agent use, with OpenClaw cited as a specific agent context it targets.

Model architecture

GLM-5 is a Mixture-of-Experts model with 744B total parameters and roughly 40B active parameters per token. This is a meaningful scale-up versus the GLM-4.5 to GLM-4.7 generation, described as 355B total parameters with 32B active parameters.

The practical implication of MoE at this scale is that you do not “use” 744B parameters for every token. The routing mechanism activates a subset of experts, which is how systems aim to increase capability without linear increases in compute per token.
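To make the routing idea concrete, here is a toy top-k gating sketch. The expert count, dimensions, and scoring are purely illustrative assumptions, not GLM-5’s actual router design; the point is only that each token touches a small subset of experts.

```python
import numpy as np

def topk_route(token_emb, router_weights, k=2):
    """Toy top-k MoE routing: score every expert, keep the k best,
    and softmax only over those. Illustrative, not GLM-5's router."""
    scores = router_weights @ token_emb          # one score per expert
    top = np.argsort(scores)[-k:]                # indices of the k best experts
    gates = np.exp(scores[top])
    gates /= gates.sum()                         # mixing weights sum to 1
    return top, gates

rng = np.random.default_rng(0)
n_experts, d = 8, 16                             # assumed toy sizes
router = rng.normal(size=(n_experts, d))
token = rng.normal(size=d)

chosen, gates = topk_route(token, router, k=2)
print(chosen, gates)                             # 2 expert ids and their mix weights
```

Only the chosen experts’ feed-forward blocks run for this token, which is why total parameters and per-token compute diverge so sharply.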

Context window and attention

GLM-5 supports a 200K-token context window and integrates DeepSeek Sparse Attention to reduce inference cost while preserving long-context capability.

This combination is important in practice. Long context is only useful if it is affordable and stable enough to run in real systems. Sparse attention approaches are explicitly about avoiding full quadratic attention costs at long lengths.
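A back-of-envelope comparison shows why sparsity matters at 200K tokens. The 2,048-keys-per-query budget below is an assumed illustrative number, not DeepSeek Sparse Attention’s actual pattern:

```python
# Attention score pairs at a 200K context: dense vs. a sparse pattern
# where each query attends to an assumed budget of 2,048 keys.
ctx = 200_000
full_pairs = ctx * ctx              # dense: every query attends to every key
budget = 2_048                      # illustrative per-query key budget
sparse_pairs = ctx * budget
print(f"full: {full_pairs:.2e}, sparse: {sparse_pairs:.2e}, "
      f"ratio: {full_pairs / sparse_pairs:.0f}x")
```

Even under generous assumptions, dense attention does roughly two orders of magnitude more score computation at this length, which is the gap sparse methods are designed to close.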

Post-training and “slime”

Z.ai states it built “slime,” an asynchronous reinforcement-learning infrastructure, to improve reinforcement learning throughput and iteration efficiency at this scale.

Both VentureBeat and the Hugging Face model materials highlight slime as the post-training system used to push agentic performance and reliability. The core idea is not just more RL, but faster RL iteration on long-range interactions, which directly aligns with the “long-horizon agent” positioning.

Open weights and licensing

GLM-5 is released as open weights under the MIT License, as stated across multiple industry summaries and the model listing materials.

That matters because open weights change who can deploy and fine-tune the model. This is not merely “API access.” It enables internal hosting, controlled environments, and custom safety layers, assuming you can handle the deployment burden.

Weight formats and distribution

Z.ai provides at least two official variants:

GLM-5 in BF16
GLM-5-FP8 for more practical serving

The repo distribution notes downloads via Hugging Face and ModelScope for both BF16 and FP8.

The BF16 release is extremely large. Artificial Analysis describes it as roughly 1.5TB of weights and estimates roughly 1,490GB of memory to store weights at native BF16 for self-deployment. This is the part people forget when they hear “open weights.” Openness does not make the hardware problem vanish.
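The ~1,490GB figure follows directly from the parameter count, since BF16 stores 2 bytes per parameter (this sketch counts weights only, ignoring KV cache and activation memory):

```python
# Sanity-check the cited weight-storage figure: 744B parameters
# at 2 bytes/param (BF16) vs. 1 byte/param (FP8). Weights only.
params = 744e9
bf16_gb = params * 2 / 1e9          # decimal GB at BF16
fp8_gb = params * 1 / 1e9           # decimal GB at FP8
print(bf16_gb, fp8_gb)              # 1488.0 vs 744.0
```

That halving is the practical argument for the FP8 variant: the same weights fit in roughly half the memory footprint before any serving overhead is counted.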

Benchmarks and what is being emphasized

Z.ai’s model card publishes a broad benchmark table that includes agentic and coding evaluations such as:

SWE-bench Verified at 77.8
Terminal-Bench 2.0 in the mid-50s depending on variant
BrowseComp
MCP-Atlas
τ²-Bench
Vending Bench 2 outcomes

The positioning is that GLM-5 ranks at the top among open models on several agentic and coding benchmarks, not just one coding test.

Artificial Analysis reports GLM-5 as the leading open-weights model on its Intelligence Index v4.0, with a strong jump over GLM-4.7 driven by improvements in agentic work and lower hallucination. It also confirms the 200K context window and MIT licensing.

VentureBeat emphasizes an “AA-Omniscience Index” improvement and highlights a behavior that matters in real deployments: “knowing when to abstain,” meaning avoiding fabricated answers instead of confidently guessing.

A practical interpretation is that GLM-5 is being marketed as an engineering assistant that fails more safely, not only one that scores higher.

API access and pricing

Z.ai’s developer documentation shows first-party API access via an OpenAI-style chat completions endpoint and includes a “thinking” toggle in requests.
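As a sketch, an OpenAI-style request with a thinking toggle might be shaped like the payload below. The model id and the exact shape of the “thinking” field are assumptions here; confirm both against Z.ai’s current developer documentation before use.

```python
import json

# Hypothetical OpenAI-style chat completions payload with a "thinking"
# toggle. Field names are assumptions -- check Z.ai's docs for the
# authoritative request schema.
payload = {
    "model": "glm-5",                                  # assumed model id
    "messages": [
        {"role": "user", "content": "Refactor this function to remove global state."}
    ],
    "thinking": {"type": "enabled"},                   # assumed toggle shape
}
body = json.dumps(payload)
print(body)   # serialized JSON body to POST to the chat completions endpoint
```

Because the endpoint is OpenAI-compatible, existing client libraries and tooling that speak that schema can generally be pointed at it with a base-URL change.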

Published pricing in USD per 1M tokens lists:

GLM-5 at $1 input and $3.2 output
Cached-input pricing is listed, and “limited-time free” notes appear on certain pricing fields

Third-party availability is also part of the story. Artificial Analysis notes GLM-5 availability through Z.ai’s API and multiple third-party providers, with example pricing that aligns with the first-party $1 and $3.2 tier and cheaper offers such as $0.8 input and $2.56 output via some providers.
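At the published first-party rates, per-call cost is simple arithmetic. The example token counts below are illustrative:

```python
# Cost of one call at the published first-party rates:
# $1 per 1M input tokens, $3.2 per 1M output tokens.
def call_cost(in_tokens, out_tokens, in_rate=1.00, out_rate=3.20):
    return (in_tokens / 1e6) * in_rate + (out_tokens / 1e6) * out_rate

# e.g. an assumed 150K-token repo dump with an 8K-token answer:
print(round(call_cost(150_000, 8_000), 4))   # 0.1756
```

Note that long-context workloads are input-heavy, so cached-input discounts and cheaper third-party tiers can move real costs meaningfully.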

Local serving and tooling

Z.ai’s repository materials and deployment recipes describe practical serving paths using vLLM and SGLang. The examples commonly show serving the FP8 variant with tensor parallelism, often illustrated with 8 GPUs, plus OpenAI-compatible client usage patterns.
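As a sketch of those recipes, a vLLM or SGLang launch for the FP8 variant over 8 GPUs might look like the commands below. The repository id is an assumption, and flags should be checked against Z.ai’s deployment notes and the current vLLM/SGLang releases:

```shell
# Hypothetical serving commands; model path and flags are assumptions.
# vLLM: OpenAI-compatible server with 8-way tensor parallelism
vllm serve zai-org/GLM-5-FP8 --tensor-parallel-size 8

# SGLang equivalent
python -m sglang.launch_server --model-path zai-org/GLM-5-FP8 --tp 8
```

Either path exposes an OpenAI-compatible endpoint, so client code written against the first-party API can typically be reused against a self-hosted deployment.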

Ollama lists a GLM-5 “cloud” entry and repeats the public-facing spec themes, including the 744B total and 40B active parameter framing, long context, agentic focus, and post-training with an asynchronous RL infrastructure.

If you are evaluating feasibility, the key is not whether it runs. The key is whether it runs at the latency, reliability, and cost profile your workload requires, because serving MoE models at long context is an infrastructure problem involving memory, parallelism, scheduling, and failure handling.

Chips and geopolitics angle

Reuters reports GLM-5 inference work was developed using domestically manufactured chips including Huawei Ascend, plus chips from Moore Threads, Cambricon, and Kunlunxin. The framing is that this release aligns with China’s push for semiconductor self-sufficiency under tighter US export controls.

This matters because hardware availability can shape who can deploy frontier-scale models, how quickly they iterate, and whether the economics work outside a few well-resourced providers.

Demand and pricing signal

A separate Reuters report dated February 12, 2026 states Zhipu increased prices on its GLM coding plan by at least 30% due to rising demand, with changes not affecting existing subscribers.

This is not a model specification, but it is a near-term adoption signal. It suggests user demand is high enough that pricing power exists, which usually tracks with real usage rather than curiosity clicks.

Real-world examples of where GLM-5 fits

Software engineering agents. GLM-5’s long-horizon framing fits workflows like multi-hour bug hunts, refactors across large codebases, and repeated compile test cycles where the agent must maintain state across many steps and files.

Long-context technical work. A 200K context window is directly relevant for tasks like reviewing multi-module repositories, scanning long design docs, or tracing requirements across a large set of issues, logs, and patches.

Abstention-sensitive environments. If the “knowing when to abstain” behavior holds up, it is valuable in regulated or safety-critical domains where a wrong answer is worse than no answer. That is a messaging and trust issue as much as a capability issue: product teams need to set expectations correctly, because abstaining is a feature, not a failure.

Conclusion

GLM-5 is a flagship open-weights Mixture-of-Experts model released February 11, 2026, aimed at complex systems engineering and long-horizon agent tasks. It scales to 744B total parameters with about 40B active per token, supports a 200K-token context window, and uses DeepSeek Sparse Attention to control long-context inference costs. Z.ai’s post-training story centers on “slime,” an asynchronous reinforcement-learning infrastructure designed to improve iteration throughput and long-range agent learning. The model is released under the MIT License with BF16 and FP8 variants, where BF16 storage is extremely large and FP8 is positioned for more practical serving.

Z.ai publishes strong coding and agentic benchmark results including SWE-bench Verified 77.8 and a broader suite of agent evaluations, while third-party commentary emphasizes gains in agentic reliability and reduced hallucination, including stronger “abstain” behavior. Access spans open weights and an OpenAI-style API with a thinking toggle, with published pricing of $1 per 1M input tokens and $3.2 per 1M output tokens and third-party offerings sometimes lower.

Deployment guidance highlights vLLM and SGLang routes for FP8 serving, often with multi-GPU tensor parallelism. Reuters also reports GLM-5 inference was developed on domestically manufactured chips and that Zhipu raised prices on its GLM coding plan by at least 30% due to rising demand, reinforcing the model’s momentum in real developer usage.
