AI agents are increasingly seen as a way to extend the capabilities of cybersecurity teams. But which agents excel in this role? To find out, Wiz has built a benchmark suite of 257 real-world challenges spanning five offensive domains: zero-day discovery, CVE (Common Vulnerabilities and Exposures) detection, API security, web security, and cloud security.
The company pits various combinations of AI agents and their underlying models against this suite to determine which score highest in each category. Scoring is systematic and multi-faceted: multi-dimensional rubrics for zero-day and CVE detection, endpoint-and-severity matching for API security, and flag capture for the web and cloud challenges.
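Wiz has not published its scoring code, so the following sketch is purely illustrative: all function names, rubric dimensions, and weights are assumptions, meant only to show how the three scoring styles mentioned above (rubrics, endpoint-and-severity matching, flag capture) could look in practice.

```python
# Illustrative sketch only; Wiz's actual scoring implementation is not public.

def rubric_score(dimensions: dict[str, float]) -> float:
    """Average several 0-1 rubric dimensions (e.g. root cause, exploitability),
    as might be used for zero-day and CVE detection challenges."""
    return sum(dimensions.values()) / len(dimensions)

def api_score(found: set[tuple[str, str]], expected: set[tuple[str, str]]) -> float:
    """Endpoint-and-severity matching: credit only exact (endpoint, severity) pairs."""
    return len(found & expected) / len(expected) if expected else 0.0

def flag_score(submitted: str, flag: str) -> float:
    """Flag capture for web and cloud challenges: all or nothing."""
    return 1.0 if submitted == flag else 0.0
```

For example, an agent that reports `GET /users` at the correct severity but misses a second expected finding would earn an `api_score` of 0.5 under this sketch.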
The benchmark tests run inside isolated Docker containers that are equipped with adequate resources and impose no per-challenge time limits, so the scores reflect the agents' capabilities rather than any throttling effects. Each agent operates out of the box with its native tools and execution model, and its score on each challenge is averaged over three attempts.
Results of the Trials
In a recent blog post announcing the Cyber model arena benchmarks, Wiz makes no secret of the outcome of its evaluations: the trials are led by Claude Code running on Claude Opus 4.6. Since Wiz is soon to become a subsidiary of Google, the company might have been tempted to downplay this result, as Gemini 3 Pro ranks only second. Claude's lead is narrow, however, and the competitive landscape can shift rapidly.