Qwen3 – Unified Models for Thinking and Non-Thinking

Among open-source LLMs, the Qwen family of models is perhaps one of the best known. Not only are these models among the highest performing ones, they are also openly licensed under Apache-2.0. The latest in the family is the Qwen3 series. With improved performance, broad multilingual support, and a lineup of 6 dense and 2 MoE (Mixture of Experts) models, this release clearly stands out. In this article, we will cover some of the most important aspects of the Qwen3 technical report and run inference using the Hugging Face Transformers library.

Figure 1. Qwen3. (Source: https://qwenlm.github.io/blog/qwen3/)

We will cover the following while discussing Qwen3:

  • What makes Qwen3 unique?
    • The unified thinking and non-thinking modes
  • What are the different model architectures in the Qwen3 series?
    • 6 dense and 2 MoE models
  • What is the pretraining and post-training strategy?
    • Training stages and datasets.

Note: This article concisely covers the most important parts of the Qwen3 technical report, including the model architecture, the datasets, and the training strategies. We skip the evaluation tables here; the technical report already contains pages of evaluation results, and it is more worthwhile to go over those directly from the report. In short, we answer “What is Qwen3?”

What Makes Qwen3 Unique?

Qwen3 employs a unified “thinking” and “non-thinking” architecture. Most LLM families use separate models for thinking and non-thinking purposes. In simpler terms, the thinking mode lets an LLM spend more test-time compute (higher resources) and ponder the user query for longer, but doing so within a single model is difficult. In fact, previous Qwen releases also shipped a separate thinking model, namely QwQ.

Along with the above, Qwen3 employs several other unique properties.

  • Unified Framework: A core innovation is the integration of “thinking mode” and “non-thinking mode” within a single model.
    • Thinking Mode: Designed for complex, multi-step reasoning tasks where the model explicitly generates intermediate thought processes.
    • Non-Thinking Mode: Optimized for rapid, context-driven responses, suitable for tasks where extensive reasoning isn’t required.
  • Eliminates Model Switching: This unified approach removes the need for users or developers to switch between different specialized models, such as a chat-optimized model (e.g., GPT-4o) and a dedicated reasoning model (e.g., QwQ-32B).
  • Dynamic Mode Switching: Qwen3 enables dynamic selection of the operational mode based on:
    • User queries.
    • Chat templates (e.g., using special flags like /think or /no_think in the prompt). A short usage sketch follows this list.
  • Thinking Budget Mechanism: This feature allows users to adaptively allocate computational resources during inference.
    • Users can specify a “thinking budget” (e.g., a maximum number of tokens for the thinking process).
    • This helps balance latency and performance based on the complexity of the task, effectively controlling the “depth” of reasoning.
  • Expanded Multilingual Capabilities: Compared to its predecessor Qwen2.5, Qwen3 significantly expands multilingual support from 29 to 119 languages and dialects, enhancing global accessibility.
  • Openness and Accessibility: All Qwen3 models are publicly accessible under the Apache 2.0 license, facilitating reproducibility and community-driven research.
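
To make the mode switching concrete, here is a minimal sketch using Hugging Face Transformers. It assumes the publicly released Qwen/Qwen3-0.6B checkpoint and the enable_thinking flag that the Qwen3 chat template exposes through apply_chat_template; treat it as an illustration rather than an official snippet.

```python
# Minimal sketch: toggling Qwen3's thinking mode with Hugging Face Transformers.
# Assumes the Qwen/Qwen3-0.6B checkpoint and the enable_thinking template flag
# described in the official model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-0.6B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Explain why the sky is blue in one sentence."}]

# Thinking mode: the template leaves room for a <think>...</think> block.
prompt_think = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)

# Non-thinking mode: the template closes the think block so the model answers
# directly. The /no_think soft switch inside the user turn behaves similarly.
prompt_no_think = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)

inputs = tokenizer(prompt_no_think, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```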

What Are The Different Model Architectures In The Qwen3 Series?

Qwen3 has 6 dense and 2 MoE (Mixture of Experts) models.

Figure 2. Qwen3 dense and MoE model architectures.

For the dense models:

  • Small, edge/mobile deployment-friendly models at 0.6B, 1.7B, 4B, and 8B parameters.
  • Larger, cloud-deployment-oriented models at 14B and 32B parameters.

The authors use the following recipe (similar to Qwen2.5) for the dense models:

  • Grouped Query Attention
  • SwiGLU
  • Rotary Positional Embedding
  • RMSNorm with pre-normalization
  • For stable training, the authors remove the QKV-bias used in Qwen2 and introduce QK-Norm in the attention mechanism (see the sketch after this list).
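
To make the QK-Norm change concrete, below is a minimal PyTorch sketch of grouped-query attention with RMSNorm applied to the query and key heads. The layer sizes are placeholders and RoPE is omitted for brevity; this is an illustration of the idea, not the official Qwen3 implementation.

```python
# Minimal PyTorch sketch of QK-Norm inside grouped-query attention.
# Dimensions and module layout are illustrative assumptions; RoPE is omitted.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps) * self.weight

class GQAWithQKNorm(nn.Module):
    def __init__(self, hidden=1024, n_heads=16, n_kv_heads=8, head_dim=64):
        super().__init__()
        self.n_heads, self.n_kv_heads, self.head_dim = n_heads, n_kv_heads, head_dim
        # No QKV bias (removed relative to Qwen2).
        self.q_proj = nn.Linear(hidden, n_heads * head_dim, bias=False)
        self.k_proj = nn.Linear(hidden, n_kv_heads * head_dim, bias=False)
        self.v_proj = nn.Linear(hidden, n_kv_heads * head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * head_dim, hidden, bias=False)
        # QK-Norm: RMSNorm applied per head to queries and keys before attention.
        self.q_norm = RMSNorm(head_dim)
        self.k_norm = RMSNorm(head_dim)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_norm(self.q_proj(x).view(b, t, self.n_heads, self.head_dim))
        k = self.k_norm(self.k_proj(x).view(b, t, self.n_kv_heads, self.head_dim))
        v = self.v_proj(x).view(b, t, self.n_kv_heads, self.head_dim)
        # Repeat K/V heads so each group of query heads shares the same K/V (GQA).
        k = k.repeat_interleave(self.n_heads // self.n_kv_heads, dim=2)
        v = v.repeat_interleave(self.n_heads // self.n_kv_heads, dim=2)
        q, k, v = (z.transpose(1, 2) for z in (q, k, v))  # (b, heads, t, head_dim)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))

x = torch.randn(1, 8, 1024)
print(GQAWithQKNorm()(x).shape)  # torch.Size([1, 8, 1024])
```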

The MoE models include the following:

  • Qwen3-30B-A3B: With a total of 30B parameters and 3B activated parameters for each token.
  • Qwen3-235B-A22B: With a total of 235B parameters and 22B activated parameters for each token.

The key features of Qwen3 MoE include:

  • Fine-grained expert segmentation.
  • A total of 128 experts with 8 activated experts per token.
  • No shared experts (unlike the Qwen2.5-MoE design).
  • Global-batch load balancing loss to encourage expert specialization (a routing sketch follows this list).
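
The following is a rough PyTorch sketch of this routing scheme: a linear router picks the top 8 of 128 experts per token, with no shared expert, plus a simplified per-batch version of the usual load-balancing loss (Qwen3 computes it over the global batch). Layer sizes and the exact loss form are assumptions for illustration, not the official implementation.

```python
# Sketch of Qwen3-style MoE routing: 128 experts, top-8 per token, no shared expert.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, hidden=512, ffn=1024, n_experts=128, top_k=8):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden, ffn), nn.SiLU(), nn.Linear(ffn, hidden))
            for _ in range(n_experts)
        ])

    def forward(self, x):  # x: (tokens, hidden)
        probs = F.softmax(self.router(x), dim=-1)
        weights, idx = probs.topk(self.top_k, dim=-1)          # (tokens, top_k)
        weights = weights / weights.sum(-1, keepdim=True)       # renormalize top-k
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in idx[:, k].unique():
                mask = idx[:, k] == e
                out[mask] += weights[mask, k, None] * self.experts[int(e)](x[mask])
        # Simplified per-batch load-balancing loss (fraction of tokens per expert
        # times mean router probability); Qwen3 computes this over the global batch.
        n_experts = probs.size(-1)
        dispatch = F.one_hot(idx, n_experts).float().sum(dim=(0, 1))
        frac_tokens = dispatch / dispatch.sum()
        aux_loss = n_experts * (frac_tokens * probs.mean(0)).sum()
        return out, aux_loss

tokens = torch.randn(16, 512)
y, aux = TopKMoE()(tokens)
print(y.shape, aux.item())
```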

All models use the Qwen tokenizer, implementing byte-level byte-pair encoding with a vocabulary size of 151,669.
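
If you want to poke at the tokenizer yourself, a quick check with Transformers looks like the snippet below (the Qwen/Qwen3-0.6B model id is assumed here).

```python
# Quick check of the Qwen3 tokenizer (byte-level BPE). Model id assumed to be
# the public Qwen/Qwen3-0.6B checkpoint.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
print(len(tok))  # vocabulary size, roughly 151k entries
print(tok.tokenize("Qwen3 unifies thinking and non-thinking modes."))
```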

What Is the Pretraining and Post-Training Strategy for Qwen3?

Like other major LLMs, Qwen3 was trained in two phases: pretraining and post-training. Each phase involves several dataset collection strategies and training sub-stages.

Pretraining

Here are the most important aspects of the pretraining dataset:

  • The pretraining corpus spans 119 languages and dialects and totals 36 trillion tokens. It covers coding, STEM, reasoning tasks, books, multilingual texts, and synthetic data.
  • Qwen2.5-VL was used to extract text (via OCR) from PDF-like documents, which was then refined with Qwen2.5. This added trillions of extra tokens.
  • The authors also employ Qwen2.5, Qwen2.5-Coder, and Qwen2.5-Math to generate additional high-quality synthetic data.
  • Finally, multilingual data was added to the corpus.

The following are the three pre-training stages of Qwen3:

  • General Stage (S1): All Qwen3 models are trained on over 30 trillion tokens with a sequence length of 4,096. This stage covers 119 languages and dialects, but not yet the full 36-trillion-token corpus.
  • Reasoning Stage (S2): The authors add STEM, coding, reasoning, and synthetic data to the corpus, which is an additional 5 trillion tokens.
  • Long Context Stage (S3): In the third stage, the authors collect high-quality long-context data to add to the corpus. Here, 75% of the text has sequence lengths between 16,384 and 32,768 tokens, and the remaining 25% has sequence lengths between 4,096 and 16,384 tokens.

The above pretraining experiments give rise to the Qwen3-Base models.

Post-Training

Qwen3 follows the post-training pipeline shown below.

Figure 3. Qwen3 post-training strategy.

The post-training stages of Qwen3 are quite important because this is where the unified “thinking” and “non-thinking” modes are added.

  • Long-CoT Cold Start: Here, the authors start by training the Qwen3 base model directly on carefully curated long Chain-of-Thought samples. This instills diverse chain-of-thought reasoning patterns without depending on the base model’s initial reasoning ability.
  • Reasoning RL: In this sub-stage, the authors employ GRPO for reasoning-focused RL training. The dataset consists of 3,995 challenging query-verifier pairs covering a broad range of sub-domains, and the long-CoT cold-start model is trained directly on it.
  • Thinking Mode Fusion: The previous sub-stages produce the “thinking” model. This sub-stage integrates the “non-thinking” behavior by adding SFT data that contains both “thinking” and “non-thinking” samples. It also enables the thinking budget: if the thinking process reaches a user-specified token threshold, thinking halts and the model produces its final answer (a rough budget-enforcement sketch follows this list). The following chat template design fuses both modes without training an additional model.
Figure 4. Qwen3 chat templates for thinking and non-thinking modes.
  • General RL: The general RL stage broadly enhances the model’s ability across diverse scenarios with a reward system. It covers the following core capabilities: instruction following, format following, preference alignment, agent ability, and abilities for specialized scenarios.
  • Strong-to-Weak Distillation: This is perhaps the most important stage of post-training. The four sub-stages above produce the flagship Qwen3 models – Qwen3-235B-A22B and Qwen3-32B. Their knowledge is then distilled into the lightweight models – Qwen3-0.6B, 1.7B, 4B, 8B, 14B, and Qwen3-30B-A3B – which eliminates the need to run those models through all four post-training stages.
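
Below is a rough, user-side sketch of how such a thinking budget could be enforced at inference time: let the model think for at most a fixed number of tokens, and if the <think> block has not closed, force it closed and ask for the final answer. The closing cue wording and the two-pass generation are my assumptions for illustration, not the exact mechanism described in the technical report.

```python
# Rough sketch of a user-side thinking budget: think for at most `budget` tokens,
# then force the think block closed and continue with the final answer.
# Checkpoint id, cue text, and token handling are assumptions for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-0.6B"  # assumed checkpoint id
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "How many prime numbers are below 50?"}]
prompt = tok.apply_chat_template(messages, tokenize=False,
                                 add_generation_prompt=True, enable_thinking=True)
ids = tok(prompt, return_tensors="pt").to(model.device)

budget = 256
first = model.generate(**ids, max_new_tokens=budget)
text = tok.decode(first[0][ids["input_ids"].shape[-1]:])

if "</think>" not in text:
    # Budget exhausted: close the thinking block and ask for the final answer.
    text += "\nI have to answer directly now.\n</think>\n\n"
    cont_ids = tok(prompt + text, return_tensors="pt").to(model.device)
    cont = model.generate(**cont_ids, max_new_tokens=256)
    text += tok.decode(cont[0][cont_ids["input_ids"].shape[-1]:], skip_special_tokens=True)

print(text)
```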

Bonus Content – Inference Jupyter Notebook

In case you are interested in running inference using any of the Qwen3 models, you can download the notebook, upload it to either Colab or Kaggle, and start playing around.

Notebook features:

  • All Qwen3 models are available. Models up to Qwen3-14B run with 4-bit quantization on Colab/Kaggle, and Qwen3-30B-A3B runs on an L40S with 48 GB of VRAM (a quantized-loading sketch follows this list).
  • Switches for thinking and non-thinking modes.
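
For reference, a 4-bit load of a mid-size checkpoint with bitsandbytes might look like the sketch below. The model id, quantization settings, and prompt are assumptions, not the notebook's exact code.

```python
# Sketch of 4-bit loading for a mid-size Qwen3 model on a single Colab/Kaggle GPU.
# Requires the bitsandbytes package; settings here are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_name = "Qwen/Qwen3-14B"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb, device_map="auto")

# The /no_think soft switch keeps this in non-thinking mode.
messages = [{"role": "user", "content": "Give me a haiku about open-source LLMs. /no_think"}]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
out = model.generate(**tok(prompt, return_tensors="pt").to(model.device), max_new_tokens=128)
print(tok.decode(out[0], skip_special_tokens=True))
```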

Figure 5. Qwen3 inference.

Summary and Conclusion

In this article, we covered some of the important aspects of the Qwen3 models: what makes Qwen3 unique, the architecture, the dataset preparation, and the pretraining & post-training strategies. In a future article, we will cover the benchmarks and inference in detail.

If you have any questions, thoughts, or suggestions, please leave them in the comment section. I will surely address them.

You can contact me using the Contact section. You can also find me on LinkedIn, and X.
