The Qwen team at Alibaba has unveiled QwQ-32B, a 32-billion-parameter AI model that delivers performance rivalling the much larger DeepSeek-R1. The result highlights the potential of scaling Reinforcement Learning (RL) on strong foundation models.
The Qwen team has successfully integrated agent capabilities into the reasoning model, enabling it to think critically, use tools, and adapt its reasoning based on environmental feedback.
“Scaling RL has the potential to enhance model performance beyond conventional pretraining and post-training methods,” the team stated. “Recent studies have demonstrated that RL can significantly improve the reasoning capabilities of models.”
QwQ-32B achieves performance comparable to DeepSeek-R1, which boasts 671 billion parameters (with 37 billion activated), a testament to the effectiveness of RL when applied to robust foundation models pretrained on extensive world knowledge. This remarkable result underscores the potential of RL to bridge the gap between model size and performance.
The model has been evaluated across a range of benchmarks, including AIME24, LiveCodeBench, LiveBench, IFEval, and BFCL, designed to assess its mathematical reasoning, coding proficiency, and general problem-solving capabilities.
The results highlight QwQ-32B’s performance in comparison to other leading models, including DeepSeek-R1-Distilled-Qwen-32B, DeepSeek-R1-Distilled-Llama-70B, o1-mini, and the original DeepSeek-R1.
Benchmark results:
- AIME24: QwQ-32B achieved 79.5, slightly behind DeepSeek-R1-671B’s 79.8, but significantly ahead of OpenAI-o1-mini’s 63.6 and the distilled models.
- LiveCodeBench: QwQ-32B scored 63.4, again closely matched by DeepSeek-R1-671B’s 65.9, and surpassing the distilled models and OpenAI-o1-mini’s 53.8.
- LiveBench: QwQ-32B achieved 73.1, with DeepSeek-R1-671B scoring 71.6, and outperforming the distilled models and OpenAI-o1-mini’s 57.5.
- IFEval: QwQ-32B scored 83.9, very close to DeepSeek-R1-671B’s 83.3, and leading the distilled models and OpenAI-o1-mini’s 59.1.
- BFCL: QwQ-32B achieved 66.4, with DeepSeek-R1-671B scoring 62.8, demonstrating a lead over the distilled models and OpenAI-o1-mini’s 49.3.
The Qwen team’s approach involved a cold-start checkpoint and a multi-stage RL process driven by outcome-based rewards. The initial stage focused on scaling RL for math and coding tasks, using accuracy verifiers and code execution servers. The second stage expanded to general capabilities, incorporating rewards from general reward models and rule-based verifiers.
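To make the idea of outcome-based rewards concrete, here is a minimal sketch of what an accuracy verifier for math answers and a code-execution check for coding tasks might look like. This is an illustration of the general technique, not Qwen’s actual pipeline; all function names and the scoring scheme are assumptions.

```python
def math_reward(model_answer: str, reference: str) -> float:
    """Accuracy verifier: 1.0 if the final answer matches the reference, else 0.0.

    (Illustrative only; a real verifier would normalise expressions.)
    """
    return 1.0 if model_answer.strip() == reference.strip() else 0.0


def code_reward(solution_src: str, test_cases: list, func_name: str) -> float:
    """Execution-based reward: run the candidate code and score by tests passed.

    test_cases is a list of (args_tuple, expected_output) pairs.
    (Illustrative only; a real system would sandbox execution on a server.)
    """
    namespace: dict = {}
    try:
        exec(solution_src, namespace)  # define the candidate function
        func = namespace[func_name]
        passed = sum(
            1 for args, expected in test_cases if func(*args) == expected
        )
        return passed / len(test_cases)
    except Exception:
        return 0.0  # any crash or missing function yields zero reward


# Usage: score a candidate solution to a toy "add two numbers" task
candidate = "def add(a, b):\n    return a + b"
reward = code_reward(candidate, [((1, 2), 3), ((0, 0), 0)], "add")
```

The key property of both functions is that the reward depends only on the *outcome* (final answer correctness or tests passed), not on the reasoning steps the model took to get there.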
“We find that this stage of RL training with a small amount of steps can increase the performance of other general capabilities, such as instruction following, alignment with human preference, and agent performance, without significant performance drop in math and coding,” the team explained.
QwQ-32B is open-weight and available on Hugging Face and ModelScope under the Apache 2.0 license, and is also accessible via Qwen Chat. The Qwen team views this as an initial step in scaling RL to enhance reasoning capabilities and aims to further explore the integration of agents with RL for long-horizon reasoning.
“As we work towards developing the next generation of Qwen, we are confident that combining stronger foundation models with RL powered by scaled computational resources will propel us closer to achieving Artificial General Intelligence (AGI),” the team stated.