Introduction to Qwen3.5 – Overview, vLLM, and llama.cpp


Among open-source LLMs, the Qwen series of models is perhaps one of the best known. Be it their language-only models or their VLMs, they consistently punch above their weight. Recently, the researchers at Qwen released Qwen3.5, a series of natively multimodal language models that can accept text, image, and video input. In this article, we are going to explore these models, starting with an overview of their official technical article, followed by inference using vLLM and llama.cpp.

Figure 1. Qwen3.5 models and capabilities.

What makes the release of Qwen3.5 even more interesting is not just the technical capability but also the day-one support in various libraries like Hugging Face Transformers, KTransformers, Unsloth, vLLM, and llama.cpp. This makes the models immediately accessible to a wide range of community contributors to test them out.

What are we going to cover in this article with Qwen3.5?

  • An overview of the motive, capability, and benchmark results from the official article.
  • Inference with Qwen3.5 using vLLM and llama.cpp for:
    • Text-only task
    • Image description
    • Video description
    • Object counting
    • OCR

Models in the Qwen3.5 Series

Qwen3.5 comes in 8 model sizes across both base and instruction-tuned models:

  • Qwen3.5-397B-A17B
  • Qwen3.5-122B-A10B
  • Qwen3.5-35B-A3B
  • Qwen3.5-27B
  • Qwen3.5-9B
  • Qwen3.5-4B
  • Qwen3.5-2B
  • Qwen3.5-0.8B

Figure 2. Qwen3.5 models.

This makes the model series extremely accessible across a range of hardware, from cloud GPUs to local desktops, laptops, and mobile devices. The open-source models have a context length of 262,144 tokens. It is quite interesting to see so many open-source models dropped for a single family of LLMs in such a short period of time, not to mention the day-one integrations with vLLM, llama.cpp, KTransformers, and Hugging Face Transformers.

Additionally, Qwen3.5 Plus, their flagship closed-source model, is available only via Qwen Chat and API with a context length of 1M tokens.

Furthermore, unlike Qwen3 LLMs, all Qwen3.5 models are natively multimodal with image and video support.

Benchmarks and Performance of Qwen3.5

At the time of writing, no technical report or paper is out yet. However, according to the official article, Qwen3.5-397B-A17B, the flagship open-source model in the series, achieves impressive results in reasoning, coding, agent capabilities, and multimodal understanding. It uses a sparse Mixture-of-Experts (MoE) architecture: only 17B of its 397B parameters (roughly 4%) are active per token, so, given the right resources, it can be very efficient during inference.
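
The exact Qwen3.5 MoE configuration is not public yet, so the following is only a minimal, illustrative sketch of how top-k routing in a sparse MoE layer keeps the active parameter count low. The dimensions, expert count, and top_k value here are made up for demonstration.

import torch
import torch.nn as nn

class SparseMoELayer(nn.Module):
    # Toy sparse MoE layer: a router picks top_k of num_experts
    # feed-forward blocks per token, so only a fraction of the total
    # parameters is active for any given token.
    def __init__(self, d_model=64, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(num_experts)]
        )
        self.router = nn.Linear(d_model, num_experts)
        self.top_k = top_k

    def forward(self, x):  # x: (num_tokens, d_model)
        scores = self.router(x)                         # (tokens, experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # route to top-k experts
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e  # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * expert(x[mask])
        return out

moe = SparseMoELayer()
print(moe(torch.randn(4, 64)).shape)  # torch.Size([4, 64])

Compute per token scales with the number of active experts rather than the total expert count, which is why a 397B-parameter model with 17B active parameters can decode quickly.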

The following image shows the performance of this model against other flagship closed-source models.

Figure 3. Qwen3.5-397B-A17B benchmark against other state-of-the-art models.

We can see right away that the largest Qwen3.5 model does not always beat the other closed-source models. However, that is not even the point. What matters is that we now have an open model that companies with enough resources can deploy on local clusters and get almost the same performance as closed-source ones, without sending their data outside the organization. And the gap looks set to keep narrowing.

The following table shows the language capabilities of the Qwen3.5-397B-A17B compared against other models.

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-Max-Thinking | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| **Knowledge** | | | | | | |
| MMLU-Pro | 87.4 | 89.5 | 89.8 | 85.7 | 87.1 | 87.8 |
| MMLU-Redux | 95.0 | 95.6 | 95.9 | 92.8 | 94.5 | 94.9 |
| SuperGPQA | 67.9 | 70.6 | 74.0 | 67.3 | 69.2 | 70.4 |
| C-Eval | 90.5 | 92.2 | 93.4 | 93.7 | 94.0 | 93.0 |
| **Instruction Following** | | | | | | |
| IFEval | 94.8 | 90.9 | 93.5 | 93.4 | 93.9 | 92.6 |
| IFBench | 75.4 | 58.0 | 70.4 | 70.9 | 70.2 | 76.5 |
| MultiChallenge | 57.9 | 54.2 | 64.2 | 63.3 | 62.7 | 67.6 |
| **Long Context** | | | | | | |
| AA-LCR | 72.7 | 74.0 | 70.7 | 68.7 | 70.0 | 68.7 |
| LongBench v2 | 54.5 | 64.4 | 68.2 | 60.6 | 61.0 | 63.2 |
| **STEM** | | | | | | |
| GPQA | 92.4 | 87.0 | 91.9 | 87.4 | 87.6 | 88.4 |
| HLE | 35.5 | 30.8 | 37.5 | 30.2 | 30.1 | 28.7 |
| HLE-Verified¹ | 43.3 | 38.8 | 48 | 37.6 | 37.6 | – |
| **Reasoning** | | | | | | |
| LiveCodeBench v6 | 87.7 | 84.8 | 90.7 | 85.9 | 85.0 | 83.6 |
| HMMT Feb 25 | 99.4 | 92.9 | 97.3 | 98.0 | 95.4 | 94.8 |
| HMMT Nov 25 | 100 | 93.3 | 93.3 | 94.7 | 91.1 | 92.7 |
| IMOAnswerBench | 86.3 | 84.0 | 83.3 | 83.9 | 81.8 | 80.9 |
| AIME26 | 96.7 | 93.3 | 90.6 | 93.3 | 93.3 | 91.3 |
| **General Agent** | | | | | | |
| BFCL-V4 | 63.1 | 77.5 | 72.5 | 67.7 | 68.3 | 72.9 |
| TAU2-Bench | 87.1 | 91.6 | 85.4 | 84.6 | 77.0 | 86.7 |
| VITA-Bench | 38.2 | 56.3 | 51.6 | 40.9 | 41.9 | 49.7 |
| DeepPlanning | 44.6 | 33.9 | 23.3 | 28.7 | 14.5 | 34.3 |
| Tool Decathlon | 43.8 | 43.5 | 36.4 | 18.8 | 27.8 | 38.3 |
| MCP-Mark | 57.5 | 42.3 | 53.9 | 33.5 | 29.5 | 46.1 |
| **Search Agent** | | | | | | |
| HLE w/ tool | 45.5 | 43.4 | 45.8 | 49.8 | 50.2 | 48.3 |
| BrowseComp | 65.8 | 67.8 | 59.2 | 53.9 | –/74.9 | 69.0/78.6 |
| BrowseComp-zh | 76.1 | 62.4 | 66.8 | 60.9 | – | 70.3 |
| WideSearch | 76.8 | 76.4 | 68.0 | 57.9 | 72.7 | 74.0 |
| Seal-0 | 45.0 | 47.7 | 45.5 | 46.9 | 57.4 | 46.9 |
| **Multilingualism** | | | | | | |
| MMMLU | 89.5 | 90.1 | 90.6 | 84.4 | 86.0 | 88.5 |
| MMLU-ProX | 83.7 | 85.7 | 87.7 | 78.5 | 82.3 | 84.7 |
| NOVA-63 | 54.6 | 56.7 | 56.7 | 54.2 | 56.0 | 59.1 |
| INCLUDE | 87.5 | 86.2 | 90.5 | 82.3 | 83.3 | 85.6 |
| Global PIQA | 90.9 | 91.6 | 93.2 | 86.0 | 89.3 | 89.8 |
| PolyMATH | 62.5 | 79.0 | 81.6 | 64.7 | 43.1 | 73.3 |
| WMT24++ | 78.8 | 79.7 | 80.7 | 77.6 | 77.6 | 78.9 |
| MAXIFE | 88.4 | 79.2 | 87.5 | 84.0 | 72.8 | 88.2 |
| **Coding Agent** | | | | | | |
| SWE-bench Verified | 80.0 | 80.9 | 76.2 | 75.3 | 76.8 | 76.4 |
| SWE-bench Multilingual | 72.0 | 77.5 | 65.0 | 66.7 | 73.0 | 69.3 |
| SecCodeBench | 68.7 | 68.6 | 62.4 | 57.5 | 61.3 | 68.3 |
| Terminal Bench 2 | 54.0 | 59.3 | 54.2 | 22.5 | 50.8 | 52.5 |
Table 1. Qwen3.5-397B-A17B language benchmarks.

The next table shows the vision-language performance of the same model.

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-VL-235B-A22B | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| **STEM and Puzzle** | | | | | | |
| MMMU | 86.7 | 80.7 | 87.2 | 80.6 | 84.3 | 85.0 |
| MMMU-Pro | 79.5 | 70.6 | 81.0 | 69.3 | 78.5 | 79.0 |
| MathVision | 83.0 | 74.3 | 86.6 | 74.6 | 84.2 | 88.6 |
| MathVista (mini) | 83.1 | 80.0 | 87.9 | 85.8 | 90.1 | 90.3 |
| We-Math | 79.0 | 70.0 | 86.9 | 74.8 | 84.7 | 87.9 |
| DynaMath | 86.8 | 79.7 | 85.1 | 82.8 | 84.4 | 86.3 |
| ZEROBench | 9 | 3 | 10 | 4 | 9 | 12 |
| ZEROBench_sub | 33.2 | 28.4 | 39.0 | 28.4 | 33.5 | 41.0 |
| BabyVision | 34.4 | 14.2 | 49.7 | 22.2 | 36.5 | 52.3/43.3 |
| **General VQA** | | | | | | |
| RealWorldQA | 83.3 | 77.0 | 83.3 | 81.3 | 81.0 | 83.9 |
| MMStar | 77.1 | 73.2 | 83.1 | 78.7 | 80.5 | 83.8 |
| HallusionBench | 65.2 | 64.1 | 68.6 | 66.7 | 69.8 | 71.4 |
| MMBench-EN-DEV-v1.1 | 88.2 | 89.2 | 93.7 | 89.7 | 94.2 | 93.7 |
| SimpleVQA | 55.8 | 65.7 | 73.2 | 61.3 | 71.2 | 67.1 |
| **Text Recognition and Document Understanding** | | | | | | |
| OmniDocBench 1.5 | 85.7 | 87.7 | 88.5 | 84.5 | 88.8 | 90.8 |
| CharXiv (RQ) | 82.1 | 68.5 | 81.4 | 66.1 | 77.5 | 80.8 |
| MMLongBench-Doc | 61.9 | 60.5 | 56.2 | 58.5 | – | 61.5 |
| CC-OCR | 70.3 | 76.9 | 79.0 | 81.5 | 79.7 | 82.0 |
| AI2D_TEST | 92.2 | 87.7 | 94.1 | 89.2 | 90.8 | 93.9 |
| OCRBench | 80.7 | 85.8 | 90.4 | 87.5 | 92.3 | 93.1 |
| **Spatial Intelligence** | | | | | | |
| ERQA | 59.8 | 46.8 | 70.5 | 52.5 | – | 67.5 |
| CountBench | 91.9 | 90.6 | 97.3 | 93.7 | 94.1 | 97.2 |
| RefCOCO (avg) | – | – | 84.1 | 91.1 | 87.8 | 92.3 |
| ODInW13 | – | – | – | 46.3 | 43.2 | 47.0 |
| EmbSpatialBench | 81.3 | 75.7 | 61.2 | 84.3 | 77.4 | 84.5 |
| RefSpatialBench | – | – | – | 65.5 | 69.9 | 73.6 |
| LingoQA | 68.8 | 78.8 | 72.8 | 66.8 | 68.2 | 81.6 |
| V* | 75.9 | 67.0 | 88.0 | 85.9 | 77.0 | 95.8/91.1 |
| Hypersim | – | – | – | 11.0 | – | 12.5 |
| SUNRGBD | – | – | – | 34.9 | – | 38.3 |
| Nuscene | – | – | – | 13.9 | – | 16.0 |
| **Video Understanding** | | | | | | |
| VideoMME (w/ sub.) | 86 | 77.6 | 88.4 | 83.8 | 87.4 | 87.5 |
| VideoMME (w/o sub.) | 85.8 | 81.4 | 87.7 | 79.0 | 83.2 | 83.7 |
| VideoMMMU | 85.9 | 84.4 | 87.6 | 80.0 | 86.6 | 84.7 |
| MLVU (M-Avg) | 85.6 | 81.7 | 83.0 | 83.8 | 85.0 | 86.7 |
| MVBench | 78.1 | 67.2 | 74.1 | 75.2 | 73.5 | 77.6 |
| LVBench | 73.7 | 57.3 | 76.2 | 63.6 | 75.9 | 75.5 |
| MMVU | 80.8 | 77.3 | 77.5 | 71.1 | 80.4 | 75.4 |
| **Visual Agent** | | | | | | |
| ScreenSpot Pro | 45.7 | – | 72.7 | 62.0 | – | 65.6 |
| OSWorld-Verified | 38.2 | 66.3 | 38.1 | 63.3 | – | 62.2 |
| AndroidWorld | – | – | – | 63.7 | – | 66.8 |
| **Medical VQA** | | | | | | |
| SLAKE | 76.9 | 76.4 | 81.3 | 54.7 | 81.6 | 79.9 |
| PMC-VQA | 58.9 | 59.9 | 62.3 | 41.2 | 63.3 | 64.2 |
| MedXpertQA-MM | 73.3 | 63.6 | 76.0 | 47.6 | 65.3 | 70.0 |
Table 2. Vision-language benchmark of Qwen3.5-397B-A17B.

As we can see, the model surpasses many of the closed-source models in vision-language performance. Even where it falls behind, the gap is not that large anymore. This shows how far open-source multimodal models have come: they can now compete with proprietary ones.

Pretraining and Performance Against Other Similar Open-Source Models

The authors of Qwen3.5 focus on three components while pretraining: power, efficiency, and versatility.

Figure 4. Qwen3.5-397B-A17B inference efficiency and decode throughput.

The models have been trained on a significantly larger corpus of data with a higher-sparsity MoE architecture, combined with Gated DeltaNet and Gated Attention. We can see from the figure above that Qwen3.5 surpasses the previous Qwen models in throughput.
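
Implementation details for Qwen3.5's Gated Attention are not public at the time of writing, so the snippet below is only a rough, hypothetical sketch of the general idea: a learned sigmoid gate, computed from the block input, scales the attention output elementwise before the residual addition. All module names and sizes are illustrative.

import torch
import torch.nn as nn

class GatedAttentionBlock(nn.Module):
    # Toy gated attention: a per-channel sigmoid gate modulates the
    # attention output before it is added back to the residual stream.
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate_proj = nn.Linear(d_model, d_model)

    def forward(self, x):  # x: (batch, seq_len, d_model)
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        gate = torch.sigmoid(self.gate_proj(x))  # values in (0, 1)
        return x + gate * attn_out               # gated residual update

block = GatedAttentionBlock()
print(block(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])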

But how does it stack against other similar open-source models?

Figure 5. Qwen3.5-397B-A17B against other leading open-source LLMs.

We can see that it either matches or surpasses all the open-source models of similar scale in the above benchmarks.

The technical article also lays out the infrastructure used for efficient training and inference of the latest Qwen3.5 models. This includes their data service, parallelism strategies across vision and language, and a native FP8 training pipeline.

Figure 6. Qwen3.5-397B-A17B training and inference infrastructure.

It is best to go through the technical article to understand a few of the components more deeply. From the next section onward, we will focus on inference with the Qwen3.5 models.


Qwen3.5 Inference with vLLM and llama.cpp

For Qwen3.5 inference, we will focus on two libraries: vLLM and llama.cpp. vLLM supports inference with all three modalities: text, image, and video. However, it does not offer CPU RAM offloading the way llama.cpp does. llama.cpp does not support video inference, but we can use it to run larger models.

All the experiments covered here were run on a system with 8 GB of VRAM and 32 GB of system RAM.

Project Directory Structure

The following is the directory structure that we are using.

├── input
│   ├── image_1.jpg
│   ├── image_2.jpg
│   ├── image_3.jpg
│   ├── image_4.png
│   └── video_1.mp4
└── qwen_3_5_inference_server.ipynb
  • The input directory contains the images and the video that we will use for inference.
  • The qwen_3_5_inference_server.ipynb notebook contains all the inference code.

The notebook and the inference data are available as a downloadable zip file along with this article.


Installing Dependencies and Setup

It is best to create a new environment for installing all the dependencies.

First, we are going to install vLLM using uv. The following command installs vLLM for NVIDIA CUDA.

uv pip install vllm --torch-backend=auto

Second, to install llama.cpp, you can follow the gpt-oss getting started article.

Third, install the OpenAI Python client along with python-dotenv, which we will use to read the .env file.

pip install openai python-dotenv

Finally, create a .env file in the project directory. This contains a dummy OpenAI API key that we need for creating the OpenAI client instance.

OPENAI_API_KEY="EMPTY"

This is all the setup we need for running Qwen3.5 locally.

Qwen3.5 Inference using vLLM

Let’s launch the vLLM server first and then start covering the Jupyter Notebook.

vllm serve Qwen/Qwen3.5-0.8B \
--port 8000 \
--max-model-len 4000 \
--gpu-memory-utilization 0.8 \
--allowed-local-media-path $PWD
  • We are launching the Qwen3.5-0.8B model on port 8000 with a context length of 4000 tokens. You can increase the context length as per your requirements and the VRAM at your disposal.
  • The --gpu-memory-utilization 0.8 flag uses around 80% of the VRAM when loading the model and leaves a bit of overhead for dynamic allocation as and when needed during inference.
  • The --allowed-local-media-path flag tells vLLM which directory it is allowed to read files from for multimodal inference. This is important, as we can only provide subpaths of this path when passing images or videos for inference. Here we use the present working directory, as all the inference files are present in the input directory.
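
Before moving on to the notebook, we can optionally verify that the server is up by listing the models it serves. This is just a quick sanity check using the OpenAI Python client; the API key can be any dummy string, since vLLM does not validate it.

from openai import OpenAI

# Quick sanity check that the vLLM server is reachable.
client = OpenAI(base_url='http://localhost:8000/v1', api_key='EMPTY')
for model in client.models.list():
    print(model.id)  # should print Qwen/Qwen3.5-0.8B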

Jupyter Notebook Code for Qwen3.5 Inference

Let’s cover the Jupyter Notebook now.

All the code is present in qwen_3_5_inference_server.ipynb.

Imports

The following code block contains all the imports that we need.

import os
import base64

from openai import OpenAI
from dotenv import load_dotenv

# Read the dummy OPENAI_API_KEY from the .env file into the environment.
load_dotenv()
Create the OpenAI Client

We will use the localhost base URL and the dummy API key from .env.

client = OpenAI(
    base_url='http://localhost:8000/v1',
    api_key=os.getenv('OPENAI_API_KEY')
)
Helper Function for Image Conversion

The following function converts an image path into a base64-encoded data URI, which makes it easy to pass images to both vLLM and llama.cpp.

def image_to_data_uri(image_path):
    # Derive the MIME subtype from the file extension (jpg -> jpeg),
    # so that PNG inputs are not mislabeled as JPEG.
    ext = os.path.splitext(image_path)[1].lstrip('.').lower()
    mime = 'jpeg' if ext == 'jpg' else ext
    with open(image_path, 'rb') as f:
        data = base64.b64encode(f.read()).decode()
    return f"data:image/{mime};base64,{data}"
Text-Only Inference

Let’s carry out a text-only inference. The following code block contains a function for text inference using the OpenAI chat completions format.

def text_inference(instruction, max_tokens=1024, model_name='Qwen/Qwen3.5-0.8B'):
    response = client.chat.completions.create(
        model=model_name,
        messages=[{
            'role': 'user',
            'content': [{
                'type': 'text',
                'text': instruction,
            }]
        }],
        max_tokens=max_tokens,
        stream=True
    )

    # Stream the tokens as they arrive; the final chunk's delta may be None.
    for chunk in response:
        content = chunk.choices[0].delta.content
        if content is not None:
            print(content, end='', flush=True)

The next code block simply calls the function.

text_inference('Tell me a short story about rivers.')

We get a streaming response. The following is the truncated output.

In the year 1842, the Great Salt产量 was no longer the unyielding force of nature, but a slave to the river itself.

The Grand Saltway was a system of canals woven through the Mountain Passes. There was only one river passing through it now: the **River of Ice**, frozen midway between two major cities, the City of Gold and the City of Stone. Today, the water was murky, heavy with tar and brine, drifting lazily like a sleeping moth.
.
.
.
Image Inference

We have a similar function for image inference. It converts the image path into a base64-encoded data URI and sends it along with the text instruction.

def image_inference(instruction, image_path, max_tokens=1024, model_name='Qwen/Qwen3.5-0.8B'):
    uri = image_to_data_uri(image_path)

    response = client.chat.completions.create(
        model=model_name,
        messages=[{
            'role': 'user',
            'content': [
                {
                    'type': 'image_url',
                    'image_url': {
                        # Alternatively, pass the local path directly:
                        # 'url': f"file://{image_path}"
                        'url': uri
                    },
                },
                {
                    'type': 'text',
                    'text': instruction,
                }
            ]
        }],
        max_tokens=max_tokens,
        stream=True
    )

    for chunk in response:
        content = chunk.choices[0].delta.content
        if content is not None:
            print(content, end='', flush=True)

image_inference(
    instruction='What is this image?', 
    image_path=os.path.join(os.getcwd(), 'input/image_1.jpg')
)

Following is the image we are using for inference, along with the response below that.

Figure 7. Fruit image that we are using for Qwen3.5 image inference.
This image is a still life photograph featuring bright orange mandarin oranges (also known as tangerines or sisal oranges) arranged on a clean, white surface. The scene evokes an elegant, minimalist aesthetic, likely intended for stock photography, editorial work, or promotional material.

**Key elements include:**

- **Fruits:** Multiple ripe mandarin oranges, some with green calyxes attached — indicating freshness and natural origin. One fruit bears a single green leaf tucked near its stem.
- **Packaging:** The fruits are loosely arranged around a reusable eco-friendly mesh bag (tpartyer bag), which is crinkled and loosely wound, suggesting it’s filled and ready to be used — a highly sustainable organic choice.
- **Background & Lighting:** The background is明亮的、无纹理的白色素面 (white expanse), with soft shadows beneath the fruits and bag, giving depth and natural lighting. The overall color palette is minimal, dominated by citrus orange and beige tones, including the natural canvas color of the bag.
- **Style & Composition:** The lighting is bright and directional, producing even illumination and subtle highlights on the textured rind of the oranges and the weave of the mesh. The composition is deliberate, placing the most prominent fruit near the top left, while others are scattered for visual interest. There are no distracting elements — the focus remains entirely on the natural beauty of fruit and its sustainable packaging.

**Overall Impression:**
The image captures the harmony between natural complexity (orange skin) and minimalism (white, clean space). It’s both fresh and refined — perfect for aligning sustainability with culinary aesthetics. If it were purchased, it could be used as an invitation photo or visual storytelling piece about “natural, wholesome, ethically sourced food.”

We can see that although the model contains just 0.8B parameters, the description of the image is quite detailed, even if a few tokens slip into Chinese.

Video Inference

For video inference, we need to provide the full file path of the video as a file:// URL along with the text instruction.

def video_inference(instruction, video_path, max_tokens=1024, model_name='Qwen/Qwen3.5-0.8B'):
    response = client.chat.completions.create(
        model=model_name,
        messages=[{
            'role': 'user',
            'content': [
                {
                    'type': 'video_url',
                    'video_url': {
                        'url': f"file://{video_path}"
                    },
                },
                {
                    'type': 'text',
                    'text': instruction,
                }
            ]
        }],
        max_tokens=max_tokens,
        stream=True
    )

    for chunk in response:
        content = chunk.choices[0].delta.content
        if content is not None:
            print(content, end='', flush=True)

video_inference(
    'How many zebras are in the video?',
    os.path.join(os.getcwd(), 'input/video_1.mp4')
)

We are using the following video for inference.

Video 1. A video showing a group of zebras that we are using for “video understanding” with Qwen3.5.

We get the following output.

There are six zebras in the video.

The model makes a mistake here by counting six zebras in the video, whereas there are only five. Larger models will most likely give a better output in this scenario.

Image Counting Experiments

Let’s use this model for an image counting experiment and ask it to count the number of oranges in the first image that we used earlier.

image_inference(
    instruction='How many oranges are there outside the bag?', 
    image_path=os.path.join(os.getcwd(), 'input/image_1.jpg')
)

The following is the model output.

There are **3** oranges outside the bag:

1. One orange is in the foreground, slightly to the left — fully visible and near the bottom.
2. Another orange is nestled inside the net bag bowl, positioned more toward the center-lower part.
3. A third orange is on the far right edge, partially cut off by the frame.

These are the only standout visible oranges — no orange is inside the net bag itself.

✅ Final Answer: **3**

The model again makes a mistake here by counting 3 oranges outside the bag, whereas there are 4.

Telling Time on Watch Faces

One of the more interesting experiments is telling the time on analog watch faces, a task that a lot of previous models failed at.

image_inference(
    instruction='What time and date are shown on the watch face?', 
    image_path=os.path.join(os.getcwd(), 'input/image_4.png')
)

We are using the following image for inference here.

Figure 8. An image showing an analog wrist watch face for experimenting with Qwen3.5 “time telling” experiment.

We get the following output from the model.

The time and date shown on the watch face are:

**Time: 7:25**  
**Date: Wednesday, September 17**

---

### Breakdown:
- **Hour**: The index of the rightmost hand points toward the “7” (markings at 60 seconds apart), so the time is 7:25.
- **Date**:  
  - “Wednesday” is written in the upper center.  
  - “September” is below “Wednesday” — both on black background.  
  - “17” is left under “September”, indicating the day of the week (Sunday = 0, Monday = 1) and year (January = 1, February = 2, …, September = 16, and 17 = Sunday).  
  - “25°” is in the upper right corner with the temperature.

This is a smartwatch interface (recognized as Huawei Watch 5 Series) displaying a modern analog-style digital dial with chronograph-style sub-dials for steps, power, etc.

Let me know if you’d like to interpret what each sub-dial mean! 📲

It is quite amusing to see that a model with 0.8B parameters tells the correct time, even if parts of its detailed breakdown are muddled. A couple of years ago, even the largest VLMs struggled with telling the time on analog watch faces.

OCR Experiment

The final experiment that we will do with vLLM + Qwen3.5-0.8B is OCR.

image_inference(
    instruction='Carry out OCR in this receipt without additional text.', 
    image_path=os.path.join(os.getcwd(), 'input/X00016469670.jpg')
)

We have the following receipt image for OCR.

Figure 9. A receipt image for OCR experiment with Qwen3.5.

The following block contains the output.

tan chay yee
*** COPY ***
OJC MARKETING SDN BHD
ROC NO: 538358-H
NO 2 & 4, JALAN BAYU 4,
BANDAR SERI ALAM,
81750 MASAL, JOHOR
Tel:07-368 2218 Fax:07-368 8218
Email: ng{ojcgroup.com
TAX INVOICE
Invoice No :PEGIV-1030765
Date :15/01/2019 11:05:16 AM
Cashier : NG CHUAN MIN
Sales Person : FATIN
Bill To : THE PEAK QUARRY WORKS
Address : .
Description Qty Price Amount
000000111 1 193.00 193.00 SR
KINGS SAFETY SHOES KWD 805
Qty:1 Total Exclude GST:
0.10 193.00
03/07/16 35.00 Total GST @6%
03/07/16 193.00
Total Inclusive GST:
0000 000000 0000 0000 00193.00
Round Amt:
00.00
TOTAL: 193.00
VISA CARD ************4318
</TOP>
----
**** Thank You, Please Come Again.****
Goods Sold Are Not Returnable & Refundable
****

We can see that the model has made quite a few mistakes in the bottom section of the receipt. It also does not maintain the structure, and some of the lines are jumbled up. Perhaps a larger model will work much better here.

Qwen3.5 Inference using llama.cpp

Note: We cannot run video inference natively with llama.cpp.

We can run much larger models using llama.cpp because of CPU RAM offloading. Here we are running the Qwen3.5-4B INT4 GGUF model.

./build/bin/llama-server \
-hf unsloth/Qwen3.5-4B-GGUF \
-c 32000 \
--reasoning-budget 0 \
--port 8000 \
--media-path $PWD
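
A few notes on the flags used above:

  • The -hf flag downloads the GGUF model directly from the unsloth/Qwen3.5-4B-GGUF repository on the Hugging Face Hub.
  • -c 32000 sets the context length to 32,000 tokens, which is feasible here thanks to CPU RAM offloading.
  • --reasoning-budget 0 disables the model’s thinking mode so that we get direct answers.
  • --media-path plays the same role as vLLM’s --allowed-local-media-path, pointing the server at the directory containing our local media files.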

Let’s carry out inference on some of the images where the smaller Qwen3.5 model made mistakes.

The first one is the counting experiment.

image_inference(
    instruction='How many oranges are there outside the bag?', 
    image_path=os.path.join(os.getcwd(), 'input/image_1.jpg')
)

We get the following output.

Looking at the image, we can count the oranges that are **outside** the mesh bag:

- One on the far left (with a leaf and stem)
- One behind it (also with a stem and leaf)
- One in the front center (smaller, with a green leaf)
- One on the far right (with a stem and leaf)

That’s **four** oranges clearly outside the bag.

The remaining oranges are inside the mesh bag — you can see several through the netting, but they’re not “outside.”

✅ Final Answer: **4**

The answer is entirely correct this time. The model also shows spatial reasoning by describing where each orange is placed.

Finally, the OCR experiment.

image_inference(
    instruction='Carry out OCR on this receipt without additional text.', 
    image_path=os.path.join(os.getcwd(), 'input/X00016469670.jpg')
)

We have the following output from the model.

tan chay yee

*** COPY ***

OJC MARKETING SDN BHD
ROC NO: 538358-H
NO 2 & 4, JALAN BAYU 4,
BANDAR SERI ALAM,
81750 MASAI, JOHOR
Tel: 07-388 2218 Fax: 07-388 8218
Email: [email protected]

TAX INVOICE
-----------------------------
Invoice No : PEGIV-1030765
Date      : 15/01/2019 11:05:16 AM
Cashier   : NG CHUAN MIN
Sales Persor: FATIN
Bill To   : THE PEAK QUARRY WORKS
Address   : .

-----------------------------
Description        Qty  Price    Amount
-----------------------------
0000000111        1    193.00   193.00 SR
KINGS SAFETY SHOES KWD 805
-----------------------------
Qty: 1            Total Exclude GST:   193.00
                  Total GST @6%:       0.00
                  Total Inclusive GST: 193.00
                  Round Amt:           0.00
-----------------------------
                  TOTAL:              193.00
-----------------------------
VISA CARD         193.00
xxxxxx000xxxx4318
Approval Code: 000

193.00

Goods Sold Are Not Returnable & Refundable
***Thank You. Please Come Again.***

The Qwen3.5-4B model outputs largely correct OCR text this time. Not only that, it also maintains the structure wherever it is necessary.

Summary and Conclusion

In this article, we covered an introduction to the Qwen3.5 models. We started with a brief discussion of the important aspects of the official technical article. Then we moved to inference using vLLM and llama.cpp. We discovered where the smaller Qwen3.5 model was making mistakes and where the larger one was excelling. In future articles, we will focus on fine-tuning Qwen3.5 models for use cases where the pretrained ones are not performing well.

If you have any questions, thoughts, or suggestions, please leave them in the comment section. I will surely address them.

You can contact me using the Contact section. You can also find me on LinkedIn, and X.
