How is Qwen3.6 for automated pentesting?
For the past couple of years, anyone building offensive security automation has been working off the same assumption: if you want an LLM agent to actually find vulnerabilities, you have to pay for a closed frontier model. We wanted to know whether that's still true.
So we took Qwen3.6-27B — an open-weights model that runs on hardware you control — and put it through the same Claude Code skills harness that powers the Claude family. We pointed it at XBOW, a public 104-challenge web-app capture-the-flag suite, and compared it head-to-head against GPT-5.4, GPT-5-nano, and Claude Opus, Sonnet, and Haiku, in both skills and vanilla modes. What you'll read below: the results, where it falls apart, and how we'd actually deploy it.
What's in a Claude Code skill
Think of a skill as a focused playbook the agent pulls off the shelf when a job calls for it. Each one is a single markdown file that says when it applies, which payloads to try, and what steps to follow: recon, then enumerate, then exploit, then verify. The harness (Claude Code) keeps the full library on disk and only loads the playbook that matches the task at hand, so the model's context stays clean.
For pentest work, those playbooks are grouped into roughly 22 categories. The ones that matter on a web-app suite: reconnaissance, tech-stack fingerprinting, injection, authentication, API security, web-app logic, and so on.
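For a concrete feel, here is a sketch of what one of those playbooks might look like. We're assuming the SKILL.md layout Claude Code uses (YAML frontmatter naming the skill and when to apply it, free-form markdown body); the content below is our illustration, not a file from the actual library:

```markdown
---
name: ssti-exploitation
description: Use when user input is echoed through a server-side template engine (Jinja2, Django, Twig). Covers detection probes, escalation payloads, and flag verification.
---

# SSTI exploitation

1. Recon: fingerprint the framework; error pages often name the template engine.
2. Enumerate: probe reflected fields with `{{7*7}}`, `${7*7}`, `#{7*7}` and note which one evaluates to 49.
3. Exploit: escalate to command execution, e.g. Jinja2:
   `{{ cycler.__init__.__globals__.os.popen('id').read() }}`
4. Verify: read the flag from the target and confirm the exact string before reporting.
```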
Benchmark methodology
XBOW is a public web-app pentest validation suite built around capture-the-flag-style challenges that mimic real vulnerability classes like XSS, IDOR, SSTI, JWT tampering, SQLi, deserialization, XXE, auth bypass, and harder multi-step web flows. Every challenge resolves to a flag string, and a challenge counts as solved only on an exact flag match.
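Scoring is deliberately binary; partial progress counts for nothing. A minimal sketch of the check, with a made-up flag pattern standing in for whatever format the suite actually uses:

```python
import re

# Hypothetical flag shape -- substitute the suite's real pattern.
FLAG_RE = re.compile(r"FLAG\{[A-Za-z0-9_-]+\}")

def solved(transcript: str, expected_flag: str) -> bool:
    """A challenge is solved only on an exact flag match."""
    return expected_flag in FLAG_RE.findall(transcript)
```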
Two configurations were measured:
- Skills mode: Claude Code agent harness with skill files loaded. The agent has access to a curated library of pentest skills — recon templates, payload builders, common exploit recipes — that it composes during a run.
- Vanilla mode: raw model + tool calls, no skill files. The model reasons from scratch with shell, HTTP, and filesystem tools.
The execution stack has three components (wiring sketched after the list):
- Claude Code: unmodified, the same agent harness used for the Claude family models;
- LiteLLM: a translation proxy that lets any model, even one outside the Claude family, speak the Anthropic Messages API that Claude Code expects;
- Modal: the hosting service where Qwen3.6-27B runs on a GPU.
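For concreteness, here is roughly how the three pieces wire together. The Modal endpoint URL is a made-up placeholder, and exact flags and variable names may differ between versions, so treat this as a sketch rather than a recipe:

```bash
# Write the LiteLLM mapping (the api_base URL is a hypothetical placeholder).
cat > litellm_config.yaml <<'EOF'
model_list:
  - model_name: qwen3.6-27b
    litellm_params:
      model: openai/qwen3.6-27b            # served via an OpenAI-compatible API
      api_base: https://your-workspace--qwen-serve.modal.run/v1
      api_key: "unused"
EOF

# Start the proxy, then point Claude Code at it instead of api.anthropic.com.
litellm --config litellm_config.yaml --port 4000 &
export ANTHROPIC_BASE_URL=http://localhost:4000
export ANTHROPIC_MODEL=qwen3.6-27b
claude
```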
Why Qwen?
We picked Qwen3.6-27B because it is one of the most accurate open-weights models that is large enough to follow a multi-step exploit chain without losing the plot, yet small enough to fit on a single high-end GPU and serve at a cost that actually competes with API pricing. The Qwen3 line also has a strong reputation for instruction following and tool use, the two capabilities a skills harness leans on hardest. And the license is permissive enough to deploy in commercial offensive-security workflows without a legal side quest.
Benchmark results

Qwen3.6-27B under Claude Code with skills solves 79/104 challenges (75.96%), averaging about 21 minutes per challenge. That puts it above vanilla Haiku (57.69%), within 11 points of vanilla Sonnet (86.54%), and roughly 7x above either GPT-5-nano configuration. In summary: an open-weights 27B model running on user-controlled infrastructure reaches mid-tier proprietary parity on web-app CTF pentesting.
One caveat up front: Qwen was not run in vanilla mode, so its absolute skills uplift is unknown; only the skills-mode score is reported. Every other model in this writeup ran both configurations.
- GPT-5.4 — 100% skills / 100% vanilla · 229s / 139s avg.
- Claude Opus 4.x — 100% skills / 89.4% vanilla · 170s / 286s avg.
- Claude Sonnet 4.x — 96.2% skills / 86.5% vanilla · 275s / 532s avg.
- Qwen3.6-27B — 76.0% skills / vanilla not run · 1262s avg.
- Claude Haiku 4.x — 62.5% skills / 57.7% vanilla · 304s / 364s avg.
- GPT-5-nano — 9.6% skills / 11.5% vanilla · 107s / 120s avg.
Skills lift mid-tier proprietary models by 5–11 points, do nothing for a saturated top model, and cannot rescue a weak base model. Qwen3.6-27B in skills mode slots between vanilla Haiku and vanilla Sonnet — exactly the band where mid-tier proprietary inference lives.
Skills aren't a silver bullet
Look at GPT-5-nano: 11.54% on its own, 9.62% with skills. Adding the harness actually made it slightly worse. That's the part most people miss: a skills harness can't manufacture reasoning a model doesn't already have. If anything, it gives a weaker model more rope: more pages of instructions to misread, more steps to fumble.
Now compare with Haiku on the same setup: 57.69% on its own, 62.50% with skills. Same harness, very different outcome. Skills helped because Haiku already had a working baseline to amplify.
So the rule of thumb we'd offer (condensed into a one-line gate after the list):
- Skills turn latent capability into solved challenges. They don't conjure capability that isn't there.
- Before betting on harness lift, pick a base model that already clears around 50% on its own.
- Below that line, you're paying in latency and tokens for no extra wins.
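Expressed as the gate we'd actually apply; the 50% cutoff is our heuristic from this data, not a universal constant:

```python
def worth_adding_skills(vanilla_solve_rate: float) -> bool:
    """Skills amplify a working baseline; below ~50% vanilla they add cost, not wins."""
    return vanilla_solve_rate >= 0.50
```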
This is the part that matters if you're considering something smaller and self-hosted. A 7B-class open-weights model under the same harness is much more likely to look like nano than like Qwen3.6-27B. The 27B size, the instruction tuning, the tool-use reliability — those are doing the heavy lifting. The harness is just helping a capable model focus.
Where Qwen3.6-27B breaks
The 24% Qwen miss rate is not random. It clusters into three failure modes, mapped against specific challenges.
Challenges where the model was mostly successful:
- Cross-Site Scripting families
- IDOR variants, SSTI (Jinja, Django), JWT tampering, simple SQLi.
- PHP deserialization, XXE upload, basic auth bypasses.
Challenges where the model exceeded the allocated time or token budget:
- Blind/timing-based loan calculator challenges.
- Unlogic nginx interaction.
- Challenges requiring sustained multi-step state tracking or precise timing-side-channel reasoning. The model explores; it does not converge.
Failed challenges where the model committed a wrong flag or ran out of ideas:
- WordPress magic, Nice SOAP, Melodic Mayhem (audio-related).
- Pab Users, S3aweed (cloud storage).
- Protocol breadth (SOAP), non-HTTP modalities (audio), and cloud-service-specific reasoning. These are categories where the training distribution thins out fast.
The cost
Now the trade-off. Qwen3.6-27B takes around 21 minutes per challenge (1262 seconds on average). The Claude family is far quicker: 3 to 5 minutes. GPT-5.4 sits at about 4 minutes with skills, 2 minutes without.
What that means in practice for a full XBOW run:
- Qwen, single-threaded across 104 challenges: roughly 36 hours.
- Qwen, six-way parallel: about 6 hours.
- The same suite on Claude Opus with skills: under 5 hours single-threaded, well under an hour in parallel.
The economics shift the moment you self-host. Cost per challenge stops being a per-token bill and starts being a GPU-hour bill. For anyone running the same suite nightly against a CI build, or fuzzing a target every commit, GPU-hours beat per-token pricing without contest.
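A back-of-envelope sketch of both the wall-clock and the cost math. The GPU rate and token counts below are assumptions we picked for illustration, not measured numbers:

```python
# Wall-clock for a full suite run, from the measured 1262s average.
challenges, qwen_secs = 104, 1262
print(f"single-threaded: {challenges * qwen_secs / 3600:.1f} h")       # ~36.5 h
print(f"six-way parallel: {challenges * qwen_secs / 3600 / 6:.1f} h")  # ~6.1 h

# Per-challenge cost: GPU-hour billing vs. per-token billing.
GPU_RATE = 4.00             # $/GPU-hour, assumed self-hosting rate
print(f"self-hosted: ${GPU_RATE * qwen_secs / 3600:.2f}/challenge")    # ~$1.40

TOKENS_PER_RUN = 1_500_000  # assumed tokens burned per agentic run
API_RATE = 15.00            # assumed blended $/Mtok for a frontier API
print(f"api: ${TOKENS_PER_RUN / 1e6 * API_RATE:.2f}/challenge")        # ~$22.50
```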
When Qwen + skills is the right call
Translate the benchmark into a deployment decision. Pick Qwen3.6-27B + skills when at least one of these conditions holds:
- The environment is air-gapped or regulated and third-party API calls are forbidden.
- Data residency rules require inference to stay inside a specific jurisdiction or VPC.
- Inference cost must be capped predictably (GPU-hours, not per-token billing).
- The workload is batch-mode and tolerates ~20-minute per-target latency.
Skip it when the workload looks like this:
- Targets include audio, SOAP-heavy enterprise apps, deep multi-step authenticated flows, or cloud-service-specific exploitation paths. The failure list maps directly to these categories.
- An operator is waiting for results in real time.
- The target surface contains challenge classes Qwen has not demonstrated competence on, and re-routing isn't an option.
The data supports a hybrid routing pattern (a minimal sketch follows the list):
- Route XSS, IDOR, SSTI, JWT, simple SQLi, deserialization, and XXE workloads to Qwen3.6-27B + skills.
- Escalate WordPress, SOAP, cloud-storage, audio, and timing-blind challenges to a stronger model.
- Track the routing decision per workload class, not per target — the failure pattern is class-shaped, not target-shaped.
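A minimal version of that router; the class labels and model identifiers are our own naming, not any vendor's API:

```python
# Route by vulnerability class, not by target -- failures are class-shaped.
QWEN_CLASSES = {"xss", "idor", "ssti", "jwt", "sqli", "deserialization", "xxe"}
ESCALATE_CLASSES = {"wordpress", "soap", "cloud-storage", "audio", "timing-blind"}

def route(workload_class: str) -> str:
    cls = workload_class.lower()
    if cls in QWEN_CLASSES:
        return "qwen3.6-27b+skills"   # self-hosted, GPU-hour billing
    return "frontier-api"             # escalate known-hard and unknown classes

assert route("ssti") == "qwen3.6-27b+skills"
assert route("soap") == "frontier-api"
```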
Open-weights + skills harness is now a defensible substitute for a closed frontier model on the web-app CTF slice. It is not a substitute on the long tail.
In the end
An open-weights model, running on a single GPU, under an unmodified public agent harness, just solved three out of every four challenges in a web-app pentest suite. A year ago, that kind of result was the exclusive territory of frontier models you had to write a check for. It isn't anymore.
We're not claiming Qwen3.6-27B replaces those closed models — the table makes the gap obvious. What we are saying is that the distance between "self-hosted open-weights" and "production-grade offensive security tooling" has gotten a lot shorter, and it's still shrinking. If you've been waiting for a reason to start running these workflows on infrastructure you control, the reason is in the numbers above.