We have Mythos at Home: GLM 5.2 beats Claude in our Cyber Benchmarks
Shortened from: https://semgrep.dev/blog/2026/we-have-mythos-at-home-glm-52-beats-claude-in-our-cyber-benchmarks/
- A benchmark comparing open-source models against frontier coding agents for IDOR (Insecure Direct Object Reference) vulnerability detection revealed surprising results.
- GLM 5.2, an open-weight model from Zhipu AI, achieved a 39% F1 score on IDOR detection, outperforming Claude Code (32%) at a significantly lower cost ($0.17 per vulnerability found).
- Semgrep’s internal multimodal pipeline, which includes a purpose-built harness for static analysis, still led the rankings with 53-61% F1, demonstrating the significant impact of the surrounding harness on performance.
- The experiment aimed to determine the contribution of the model versus the harness to vulnerability detection, with open-weight models running in a simpler harness without endpoint discovery or guided navigation.
- GLM 5.2 is an open-weight, cost-effective Mixture-of-Experts (MoE) model that is genuinely competitive. It is particularly interesting for security teams because it can run on private hardware and posts strong results on coding benchmarks.
- IDOR vulnerabilities are difficult for both static analysis and LLMs to detect because the flaw is a missing authorization check rather than a call to a dangerous function.
- While GLM 5.2 was a standout, other open-weight models like MiniMax M3 (23%) and Kimi K2.7 Code (22%) lagged behind, indicating that “open weights have caught up” is not a universal truth.
- Key takeaways: the harness remains crucial for performance, but GLM 5.2’s surprising result highlights the potential of open-weight models, especially their cost-effectiveness and flexibility for security teams.
- The authors caution that these findings are based on one specific task and dataset, and further research is needed to determine GLM 5.2’s broader applicability in other vulnerability detection scenarios.
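
To make the IDOR point above concrete, here is a minimal sketch of the pattern, written for illustration and not taken from the benchmark. The `Invoice` type, the `INVOICES` store, and both function names are hypothetical. The vulnerable version calls nothing dangerous; it simply omits an ownership check, which is why there is no obvious sink for a pattern-based scanner to flag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Invoice:
    invoice_id: int
    owner_id: int
    amount_cents: int


# Hypothetical in-memory store standing in for a database table.
INVOICES = {
    1: Invoice(1, owner_id=100, amount_cents=2500),
    2: Invoice(2, owner_id=200, amount_cents=9900),
}


def get_invoice_vulnerable(current_user_id: int, invoice_id: int) -> Invoice:
    """IDOR: trusts the client-supplied ID and never checks ownership."""
    # current_user_id is available but unused, so any authenticated
    # user can read any invoice by guessing its ID.
    return INVOICES[invoice_id]


def get_invoice_checked(current_user_id: int, invoice_id: int) -> Invoice:
    """Same lookup, plus the authorization check the version above lacks."""
    invoice = INVOICES[invoice_id]
    if invoice.owner_id != current_user_id:
        raise PermissionError("invoice does not belong to the current user")
    return invoice
```

In this sketch, user 100 requesting invoice 2 succeeds through `get_invoice_vulnerable` but raises `PermissionError` through `get_invoice_checked`. The two functions differ only by the absence of one conditional, so a detector has to reason about what is missing rather than what is present.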