We have Mythos at Home: GLM 5.2 beats Claude in our Cyber Benchmarks

Shortened from: https://semgrep.dev/blog/2026/we-have-mythos-at-home-glm-52-beats-claude-in-our-cyber-benchmarks/

A benchmark comparing open-weight models against frontier coding agents on IDOR (Insecure Direct Object Reference) vulnerability detection produced a surprising result.
GLM 5.2, an open-weight model from Zhipu AI, achieved a 39% F1 score on IDOR detection, outperforming Claude Code (32%) while costing significantly less, at $0.17 per vulnerability found.
Semgrep’s internal multimodal pipeline, which includes a purpose-built harness for static analysis, still led the rankings with 53-61% F1, demonstrating the significant impact of the surrounding harness on performance.
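For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall, so a model scores well only if it finds real vulnerabilities without flooding the results with false positives. The sketch below shows the arithmetic. The function names and the counts are hypothetical and are not taken from the benchmark, and treating cost per finding as total spend divided by true positives is an assumption about how such a figure would be derived.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Harmonic mean of precision and recall; 0.0 if nothing was found correctly."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


def cost_per_finding(total_cost_usd: float, true_positives: int) -> float:
    """Assumed definition: total spend divided by real vulnerabilities found."""
    if true_positives == 0:
        raise ValueError("no true positives; cost per finding is undefined")
    return total_cost_usd / true_positives


# Hypothetical counts for illustration only (NOT benchmark data):
# 4 real IDORs reported, 6 false alarms, 4 real IDORs missed.
example_f1 = f1_score(true_positives=4, false_positives=6, false_negatives=4)
print(f"F1 = {example_f1:.3f}")
```

Because F1 penalizes both missed vulnerabilities and false alarms, a scanner cannot improve its score simply by reporting more findings.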
The experiment aimed to separate the model's contribution to vulnerability detection from the harness's. To that end, the open-weight models ran in a simpler harness with no endpoint discovery or guided navigation.
GLM 5.2 is an open-weight Mixture-of-Experts (MoE) model that is genuinely competitive and cost-effective. Its strong coding benchmarks and the option to run it on private hardware make it particularly interesting for security teams.
IDOR vulnerabilities are difficult for both static analysis and LLMs to detect because the flaw is a missing authorization check rather than a call to a dangerous function.
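To make the "missing check" point concrete, here is a minimal sketch of an IDOR and its fix. Everything in it (the `Invoice` type, the in-memory `INVOICES` store, the handler names) is hypothetical and is not drawn from the benchmark dataset. Note that the vulnerable handler contains nothing a pattern-based scanner would flag: the bug is the absence of one comparison.

```python
from dataclasses import dataclass


@dataclass
class Invoice:
    id: int
    owner_id: int
    total_cents: int


# Hypothetical in-memory store standing in for a database table.
INVOICES = {
    1: Invoice(id=1, owner_id=100, total_cents=2500),
    2: Invoice(id=2, owner_id=200, total_cents=9900),
}


class Forbidden(Exception):
    """Raised when a user requests an object they do not own."""


def get_invoice_vulnerable(current_user_id: int, invoice_id: int) -> Invoice:
    # IDOR: the caller is authenticated, but nothing verifies that the
    # requested invoice belongs to them. Any valid ID returns data.
    return INVOICES[invoice_id]


def get_invoice_checked(current_user_id: int, invoice_id: int) -> Invoice:
    invoice = INVOICES[invoice_id]
    # The fix is a single ownership comparison before returning the object.
    if invoice.owner_id != current_user_id:
        raise Forbidden(f"user {current_user_id} may not read invoice {invoice_id}")
    return invoice


# User 100 reads user 200's invoice through the vulnerable handler.
leaked = get_invoice_vulnerable(current_user_id=100, invoice_id=2)
print(leaked.owner_id)
```

The two handlers differ by one `if` statement, which is why a detector has to reason about what the code fails to do rather than match a known-dangerous call.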
While GLM 5.2 was a standout, other open-weight models like MiniMax M3 (23%) and Kimi K2.7 Code (22%) lagged behind, indicating that “open weights have caught up” is not a universal truth.
The key takeaways: the harness remains crucial for performance, but GLM 5.2’s surprising result highlights the potential of open-weight models, especially their cost-effectiveness and flexibility for security teams.
The authors caution that these findings are based on one specific task and dataset, and that further research is needed to determine whether GLM 5.2’s results carry over to other vulnerability detection scenarios.