Fine-Tuning Qwen3.5

In this article, we will fine-tune the Qwen3.5 model for a custom use case. Specifically, we will be fine-tuning the Qwen3.5-0.8B model on the VQA-RAD dataset.

In the previous article, we introduced the Qwen3.5 model family along with inference for several multimodal tasks. Here, we will take it a step further by adapting the model to a domain-specific task.

Figure 1. Inference demo after fine-tuning Qwen3.5-0.8B.

The Qwen3.5-0.8B model, although the smallest in the family, shows strong vision-language performance. We can easily run the model with FP16/BF16 precision even with 4GB VRAM during inference. This makes it a practical option for experimenting on custom tasks and small-scale deployments.

What are we going to cover while fine-tuning Qwen3.5-0.8B?

  • Setting up the environment for Unsloth training.
  • Understanding the VQA-RAD dataset.
  • Preparing the dataset in a specific format for training.
  • Training and inference using Qwen3.5-0.8B with Unsloth.

The VQA-RAD Dataset

The VQA-RAD dataset contains radiology images of the head, chest, and abdomen, about which clinicians have asked naturally occurring questions. These questions were answered by other clinicians. Additionally, the dataset contains metadata for each question and answer pair along with the image.

We can find and download the dataset from here. The following is the directory structure after extracting the dataset.

osfstorage-archive
├── Readme.docx
├── VQA_RAD Dataset Public.json
├── VQA_RAD Dataset Public.xlsx
├── VQA_RAD Dataset Public.xml
└── VQA_RAD Image Folder
    ├── synpic100132.jpg
    ├── synpic100176.jpg
    ...
    └── synpic9872.jpg

The osfstorage-archive contains several files along with the images in the VQA_RAD Image Folder. The JSON, Excel, and XML files all contain the annotations in different formats. In this article, we will use the JSON file. There are 315 images in total.

The following block shows truncated samples from the JSON file.

[
   {
      "qid": "0",
      "phrase_type": "freeform",
      "qid_linked_id": "03f451ca-de62-4617-9679-e836026a7642",
      "image_case_url": "https://medpix.nlm.nih.gov/case?id=48e1dd0e-8552-46ad-a354-5eb55be86de6",
      "image_name": "synpic54610.jpg",
      "image_organ": "HEAD",
      "evaluation": "not evaluated",
      "question": "Are regions of the brain infarcted?",
      "question_rephrase": "NULL",
      "question_relation": "NULL",
      "question_frame": "NULL",
      "question_type": "PRES",
      "answer": "Yes",
      "answer_type": "CLOSED"
   },
   {
      "qid": 1,
      "phrase_type": "freeform",
      "qid_linked_id": "06e26b2c-04b9-42bc-8e98-1de30a0f7682",
      "image_case_url": "https://medpix.nlm.nih.gov/case?id=b197277b-6960-4175-86ee-d2cb23e381b3",
      "image_name": "synpic29265.jpg",
      "image_organ": "CHEST",
      "evaluation": "not evaluated",
      "question": "Are the lungs normal appearing?",
      "question_rephrase": "NULL",
      "question_relation": "NULL",
      "question_frame": "NULL",
      "question_type": "ABN",
      "answer": "No",
      "answer_type": "CLOSED"
   },
   ...
]

For each question sample, we have the following important attributes:

  • qid: This is the question ID/number.
  • image_name: Refers to the image name in the VQA_RAD Image Folder directory.
  • image_organ: The organ that the question is about.
  • question: The question asked by one of the radiologists.
  • question_rephrase: A rephrasing of the same question. If no rephrased question is present, it is NULL/None.
  • question_type: The type of question. We will get into this a bit later.
  • answer: The answer to the question. It can be one word, a few words, or a small phrase.

The JSON file contains 2248 such question entries because there are multiple questions for the same image. This gives the VLM plenty of varied question-answer pairs to learn from.
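
If you want to verify these numbers yourself, a quick snippet like the following (not part of the provided notebooks) counts the question entries and shows how many questions each image has.

import pandas as pd

# Optional sanity check on the annotation file.
ann = pd.read_json('input/osfstorage-archive/VQA_RAD Dataset Public.json')
print(len(ann))                                 # total number of question entries
print(ann['image_name'].nunique())              # number of unique images (315)
print(ann['image_name'].value_counts().head())  # questions per image for the top few images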

Let’s check out one example. The following figure shows an image from the image folder and one of its corresponding questions.

Figure 2. Sample data and its corresponding question from the VQA-RAD dataset.

We will get into the details of some of the attributes when covering the dataset preparation script.

Project Directory Structure

The following is the directory structure that we are following here.

├── input
│   ├── osfstorage-archive
│   │   ├── VQA_RAD Image Folder  [315 entries exceeds filelimit, not opening dir]
│   │   ├── Readme.docx
│   │   ├── VQA_RAD Dataset Public.json
│   │   ├── VQA_RAD Dataset Public.xlsx
│   │   └── VQA_RAD Dataset Public.xml
│   └── osfstorage-archive.zip
├── outputs
│   ├── checkpoint-142
│   │   ├── adapter_config.json
│   │   ├── adapter_model.safetensors
│   │   ├── chat_template.jinja
│   │   ├── optimizer.pt
│   │   ├── processor_config.json
│   │   ├── README.md
│   │   ├── rng_state.pth
│   │   ├── scheduler.pt
│   │   ├── tokenizer_config.json
│   │   ├── tokenizer.json
│   │   ├── trainer_state.json
│   │   └── training_args.bin
│   ...
│   │   └── training_args.bin
│   └── README.md
├── qwen_lora
│   ├── adapter_config.json
│   ├── adapter_model.safetensors
│   ├── chat_template.jinja
│   ├── processor_config.json
│   ├── README.md
│   ├── tokenizer_config.json
│   └── tokenizer.json
├── annotations.csv
├── qwen3_5_0_8b_ft.ipynb
├── qwen_3_5_inference_vqa_rad_fine_tuned.ipynb
├── qwen_3_5_inference_vqa_rad.ipynb
├── README.md
└── requirements.txt
  • The input directory contains the VQA-RAD dataset that we discussed in the previous section.
  • The outputs and qwen_lora directories contain the intermediate and final training checkpoints, respectively.
  • We have three Jupyter Notebooks. One for fine-tuning and the other two for inference. Among the inference notebooks, one is for testing the pretrained model and the other for testing the fine-tuned model.
  • Apart from the above, we also have a README and a requirements file.

All the Jupyter Notebooks and final LoRA checkpoints are provided along with this article as a zip file. You can download the dataset, set it up in the above directory structure, and get started with either training or inference right away.


Installing the Requirements

We can use the requirements file to install all the dependencies.

pip install -r requirements.txt
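
The exact pinned versions live in the provided requirements.txt. At a minimum, it needs to cover the libraries imported by the notebooks, roughly along these lines (indicative only, not the authoritative file contents):

# Indicative list only; use the provided requirements.txt for exact versions.
unsloth
trl
transformers
datasets
torch
pandas
numpy
matplotlib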

This is all the setup we need. From the next section onward, we will focus on the coding implementation for training Qwen3.5-0.8B.

Fine-Tuning Qwen3.5-0.8B on VQA-RAD Dataset

Let’s jump into the code implementation of the use case.

Inference before fine-tuning:

The qwen_3_5_inference_vqa_rad.ipynb Jupyter Notebook contains the inference of the pretrained model.

Here is one sample response from the model for the following question and image.

Figure 3. A sample from the VQA-RAD dataset on which we are running the pretrained model.

Here is the question from the JSON file related to the image.

Is this patient in a supine position?

And the following is the pretrained model’s response:

Based on the image provided, the patient is **not** in a supine position.

Here is the reasoning:
*   **Standard Position:** In a standard CT scan, the patient is placed in the **supine position** (lying on their back).
*   **Image Orientation:** This is an **axial (cross-sectional) slice** of the abdomen. In this view, the patient's back is at the top of the image, and their head is at the bottom.
*   **Anatomical Features:** You can see the kidneys, spine, and abdominal organs clearly. The orientation of the body parts relative to the image plane confirms this is a standard anatomical view taken from the back.

Although the model gives a lengthy response, the key point is that it states the patient is not in a supine position, whereas according to the JSON annotation, the patient is. Fine-tuning should improve such responses.

Covering the Training Code

Now, let’s cover the code in the qwen3_5_0_8b_ft.ipynb Jupyter Notebook.

The first code block contains all the imports that we need along the way.

from unsloth import FastVisionModel
from datasets import load_dataset
from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTTrainer, SFTConfig
from transformers import TextStreamer
from datasets import Dataset

import torch
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import pprint

We are importing all the necessary classes and modules from Unsloth, TRL, Transformers, and Datasets, along with a few supporting libraries.

Loading the Pretrained Model

Next, we load the Qwen3.5-0.8B pretrained model and prepare it for PEFT (Parameter Efficient Fine-Tuning).
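
Note that before this step, the notebook also loads the base model and the processor (referred to as tokenizer throughout) with FastVisionModel.from_pretrained. The following is a minimal sketch of that call; the model ID is a placeholder here, so use the exact Qwen3.5-0.8B checkpoint from the previous article.

# Minimal sketch; the model ID below is a placeholder, not the exact Hub repository name.
model, tokenizer = FastVisionModel.from_pretrained(
    'Qwen3.5-0.8B',                        # placeholder: use the actual Qwen3.5-0.8B checkpoint
    load_in_4bit=False,                    # keep FP16/BF16 weights for this small model
    use_gradient_checkpointing='unsloth',  # reduces VRAM usage during training
)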

model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=True,
    finetune_language_layers=True,
    finetune_attention_modules=True,
    finetune_mlp_modules=True,

    r=16,           
    lora_alpha=16, 
    lora_dropout=0,
    bias='none',
    random_state=3407,
    use_rslora=False,  
    loftq_config=None,
    # target_modules='all-linear',
)

The hyperparameter choices are pretty standard. We are fine-tuning the vision layers as well as the language layers. The LoRA rank and alpha are both 16. Although we could use a higher rank, this serves as a baseline training experiment.

Data Preparation

The next few code blocks focus on preparing the dataset so that it is compatible with supervised fine-tuning.

root_dir = 'input/osfstorage-archive'
image_folder = 'VQA_RAD Image Folder'
annotation_file = 'VQA_RAD Dataset Public.json'

annotations = pd.read_json(f"{root_dir}/{annotation_file}")

We start by reading the annotation JSON file from the root data directory.

Let’s read one question from the JSON file and visualize the corresponding image.

# We can extract image name from image_name column.
sample_image_name = annotations['image_name'][0]
print(sample_image_name)

# Print the question and the answer.
print("Question:", annotations['question'][0])
print("Answer:", annotations['answer'][0])

# Read the image.
image = plt.imread(f"{root_dir}/{image_folder}/{sample_image_name}")
plt.imshow(image)
plt.axis('off')
plt.show()

The following is the ground truth response and the radiology image associated with it.

Question: Are regions of the brain infarcted?
Answer: Yes

Figure 4. A ground truth sample from the VQA-RAD dataset.

Next, we convert the dataframe into the Hugging Face Datasets format, which makes shuffling, splitting, and the later conversion easier.

# Convert dataframe to HF dataset format.
# To CSV.
annotations.to_csv('annotations.csv', index=False)
dataset = load_dataset('csv', data_files='annotations.csv')['train']

# Shuffle the dataset.
dataset = dataset.shuffle(seed=3407)

# Split into train and eval.
dataset = dataset.train_test_split(test_size=0.1, seed=3407)
train_dataset = dataset['train']
eval_dataset = dataset['test']

print(f"Train dataset size: {len(train_dataset)}")
print(f"Eval dataset size: {len(eval_dataset)}")

After splitting, we have 2023 samples for training and 225 samples for validation. However, note that this is a random split at the question level, so different or rephrased questions about the same image may end up in both the training and validation sets.
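
If you want to avoid that overlap, one alternative (not used in this article) is to split at the image level before writing the CSV, so that every question about a given image lands in exactly one split. The following is a minimal sketch that reuses the annotations dataframe and the same seed.

# Hypothetical image-level split: no image appears in both train and eval.
unique_images = annotations['image_name'].unique()
rng = np.random.default_rng(3407)
rng.shuffle(unique_images)

num_train_images = int(0.9 * len(unique_images))
train_images = set(unique_images[:num_train_images])

train_df = annotations[annotations['image_name'].isin(train_images)]
eval_df = annotations[~annotations['image_name'].isin(train_images)]
print(len(train_df), len(eval_df))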

In the next code block, we have the logic to create the converted dataset that we will feed to the training pipeline.

instruction = """Answer the following question based on the given image.
Here is some additional information about the type of question that you will encounter:
Type of question:
MODALITY
PLANE
ORGAN (Organ System)
ABN (Abnormality)
PRES (Object/Condition Presence)
POS (Positional Reasoning)
COLOR
SIZE
ATTRIB (Attribute Other)
COUNT (Counting)
Other
"""

def convert_to_conversation(sample):
    image_name = sample['image_name']
    image = f"{root_dir}/{image_folder}/{image_name}"
    # Question to model.
    question = sample['question']
    # Managing model's answers.
    answer = str(sample['answer'])
    rephrased_question = sample['question_rephrase'] if sample['question_rephrase'] is not None else ''
    organ = sample['image_organ'] if sample['image_organ'] is not None else ''
    question_type = sample['question_type'] if sample['question_type'] is not None else ''
    if rephrased_question == '':
        final_answer = f"This is question about {organ} and the question type is {question_type}. The answer is {answer}."
    else:
        final_answer = f"This is question about {organ} and the question type is {question_type}. The question can also be rephrased as: {rephrased_question}. The answer is {answer}."

    # print("Question:", question)
    # print("Rephrased Question:", rephrased_question)
    # print("Answer:", answer)

    conversation = [
        { 'role': 'user',
          'content' : [
            {'type' : 'text',  'text'  : instruction + "QUESTION: " + question},
            {'type' : 'image', 'image' : image} ]
        },
        { 'role' : 'assistant',
          'content' : [
            {'type' : 'text',  'text'  : final_answer} ]
        },
    ]
    return { 'messages' : conversation }
pass

We have a common instruction that gets prepended to every question. In it, we list the possible question types; this additional information might help the model learn more about the data. Furthermore, if a rephrased question is present, we append it to the answer, which encourages the model to produce a rephrased question as part of its response. Note that the literal "NULL" strings from the JSON become missing values during the CSV round-trip (pandas treats "NULL" as NA by default), which is why the is not None checks in the function work as intended.

Finally, we apply the conversion to the training and eval datasets and print a few random samples to verify the format.

converted_dataset_train = [convert_to_conversation(sample) for sample in train_dataset]
converted_dataset_eval = [convert_to_conversation(sample) for sample in eval_dataset]

for i in range(3):
    qid = np.random.randint(0, len(converted_dataset_train))
    pprint.pp(converted_dataset_train[qid])
    print('-----------------------------')  

This is what the converted samples look like.

{'messages': [{'role': 'user',
               'content': [{'type': 'text',
                            'text': 'Answer the following question based on '
                                    'the given image.\n'
                                    'Here is some additional information about '
                                    'the type of question that you will '
                                    'encounter:\n'
                                    'Type of question:\n'
                                    'MODALITY\n'
                                    'PLANE\n'
                                    'ORGAN (Organ System)\n'
                                    'ABN (Abnormality)\n'
                                    'PRES (Object/Condition Presence)\n'
                                    'POS (Positional Reasoning)\n'
                                    'COLOR\n'
                                    'SIZE\n'
                                    'ATTRIB (Attribute Other)\n'
                                    'COUNT (Counting)\n'
                                    'Other\n'
                                    'QUESTION: Is there biliary duct '
                                    'dilation?'},
                           {'type': 'image',
                            'image': 'input/osfstorage-archive/VQA_RAD Image '
                                     'Folder/synpic33889.jpg'}]},
              {'role': 'assistant',
               'content': [{'type': 'text',
                            'text': 'This is question about ABD and the '
                                    'question type is SIZE. The question can '
                                    'also be rephrased as: Are the biliary '
                                    'ducts dilated?. The answer is Yes.'}]}]}

It is important to note that the response format we have created is not very refined. Because the original answers may contain short phrases, the composed answer can sound a bit awkward at times. However, combining all the components, as we have done, gives the model the best chance to learn as much as possible about each image.

The goal here is not only to teach medical knowledge, but also to align the model’s responses to the dataset distribution.

Checking Model Response Before Fine-Tuning

Let’s run the model through the first sample from the evaluation dataset and check its response before fine-tuning.

FastVisionModel.for_inference(model) # Enable for inference!

image = converted_dataset_eval[0]['messages'][0]['content'][1]['image'] # Extract the image.
instruction = converted_dataset_eval[0]['messages'][0]['content'][0]['text'] # Extract the instruction text.

print("Instruction:", instruction)

messages = [
    {'role': 'user', 'content': [
        {'type': 'image', 'image': image},
        {'type': 'text', 'text': instruction}
    ]}
]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
inputs = tokenizer(
    image,
    input_text,
    add_special_tokens=False,
    return_tensors='pt',
).to('cuda')

text_streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(
    **inputs, 
    streamer=text_streamer, 
    max_new_tokens=1024,
    use_cache=True, 
    temperature=1.5, 
    min_p=0.1
)

print('#' * 50)
print('GROUND TRUTH:')
print(converted_dataset_eval[0]['messages'][1]['content'][0]['text'])

We are printing the instruction, the response from the model, and the ground truth in the next block.

Instruction: Answer the following question based on the given image.
Here is some additional information about the type of question that you will encounter:
Type of question:
MODALITY
PLANE
ORGAN (Organ System)
ABN (Abnormality)
PRES (Object/Condition Presence)
POS (Positional Reasoning)
COLOR
SIZE
ATTRIB (Attribute Other)
COUNT (Counting)
Other
QUESTION: What is the size of the lesion?
Based on the provided CT scan image, we can analyze the lesion's size by observing its dimensions relative to the surrounding structures.

- The lesion is located in the left upper quadrant of the abdomen, adjacent to the spleen.
- It appears as a well-defined, hypodense (darker) area compared to the surrounding liver parenchyma.
- By comparing the lesion's size to the adjacent spleen and the vertebral body, it can be estimated to be approximately 3-4 cm in diameter.

Therefore, the size of the lesion is approximately **3-4 cm**.

##################################################
GROUND TRUTH:
This is question about ABD and the question type is SIZE. The question can also be rephrased as: Describe the size of this lesion?. The answer is 5.6cm focal, predominantly hypodense.

The model currently answers in its own verbose format. More importantly, it says the size of the lesion is 3-4 cm, while the ground truth states 5.6 cm. Fine-tuning on varied samples should help the model learn this kind of domain-specific information.

Training the Qwen3.5-0.8B Model

All the training happened on an RTX 5050 8GB VRAM laptop GPU.

The next code block initializes the SFTTrainer and SFTConfig with the appropriate hyperparameters.

FastVisionModel.for_training(model) # Enable for training!

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    data_collator=UnslothVisionDataCollator(model, tokenizer),
    train_dataset=converted_dataset_train,
    eval_dataset=converted_dataset_eval,
    args=SFTConfig(
        per_device_train_batch_size=24,
        per_device_eval_batch_size=4,
        gradient_accumulation_steps=1,
        warmup_steps=5,
        num_train_epochs=4,
        learning_rate=2e-4,
        logging_steps=50,
        eval_steps=50,
        eval_strategy='steps',
        do_eval=True,
        optim='adamw_8bit',
        weight_decay=0.001,
        lr_scheduler_type='linear',
        seed=3407,
        output_dir='outputs',
        report_to='none',

        # You MUST put the below items for vision finetuning:
        remove_unused_columns=False,
        dataset_text_field='',
        dataset_kwargs={'skip_prepare_dataset': True},
        max_length=2048,
    ),
)

trainer_stats = trainer.train()

The following are some of the hyperparameter choices that we make:

  • Training and validation batch sizes: 24 and 4, respectively
  • Gradient accumulation step: 1
  • Number of epochs: 4
  • Logging and evaluation steps: 50
  • Context length: 2048
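
As a rough sanity check on the schedule: 2023 training samples at a batch size of 24 with no gradient accumulation comes to about 85 optimizer steps per epoch, or roughly 340 steps over 4 epochs, so logging and evaluating every 50 steps gives several validation points across the run.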

Here are the training logs.

Figure 5. Qwen3.5 fine-tuning logs.

We get the lowest validation loss at around 250 steps. However, since we are not loading the best model at the end, we save a slightly overfit model. This might even be acceptable for this use case, as the dataset and response format are quite niche.

Saving the Final LoRA Model

model.save_pretrained('qwen_lora')
tokenizer.save_pretrained('qwen_lora')

Saving the Evaluation Dataset in Hugging Face Format

Finally, let’s save the evaluation dataset in Hugging Face format so that we can load it in a different inference notebook and run the trained model through it.

# Save eval dataset in HF format to be loaded later.
hf_eval_dataset = Dataset.from_list(converted_dataset_eval)
hf_eval_dataset.save_to_disk('hf_eval_dataset')

This completes all the training-related workflows.

Inference Using the Trained Qwen3.5-0.8B Model on the VQA-RAD Dataset

The qwen_3_5_inference_vqa_rad_fine_tuned.ipynb Jupyter Notebook contains the code to use the fine-tuned model and run inference on the saved evaluation dataset.

The code for that is pretty straightforward, so we are not covering it in detail here; a minimal sketch of the core steps follows, and you can check out the notebook for the full version. After that, let’s go through a few responses that the trained model gave on the evaluation dataset.
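
This sketch assumes the qwen_lora adapter directory and the hf_eval_dataset folder saved above; the actual notebook may differ in details such as generation settings.

from unsloth import FastVisionModel
from transformers import TextStreamer
from datasets import load_from_disk
from PIL import Image

# Load the fine-tuned LoRA adapter; Unsloth resolves the base model from the adapter config.
model, tokenizer = FastVisionModel.from_pretrained('qwen_lora')
FastVisionModel.for_inference(model)

# Load the evaluation split saved at the end of training.
eval_dataset = load_from_disk('hf_eval_dataset')

sample = eval_dataset[0]['messages']
image_path = sample[0]['content'][1]['image']   # the converted dataset stores image paths
prompt = sample[0]['content'][0]['text']
image = Image.open(image_path).convert('RGB')

messages = [
    {'role': 'user', 'content': [
        {'type': 'image', 'image': image},
        {'type': 'text', 'text': prompt},
    ]}
]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
inputs = tokenizer(image, input_text, add_special_tokens=False, return_tensors='pt').to('cuda')

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(**inputs, streamer=streamer, max_new_tokens=256, use_cache=True)

print('GROUND TRUTH:', sample[1]['content'][0]['text'])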

At this stage, evaluation is manual due to variability in generated responses. A more robust approach would involve constraining the model to produce structured outputs (e.g., JSON), which would allow fully automated evaluation pipelines.
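
As a rough illustration of that idea (hypothetical, not what was trained above), the assistant target built in convert_to_conversation could be replaced with a JSON string, and an evaluation script could then parse each prediction and compare the answer field automatically.

import json

# Hypothetical structured target: a JSON answer makes automated scoring straightforward.
def build_structured_answer(sample):
    return json.dumps({
        'organ': sample['image_organ'],
        'question_type': sample['question_type'],
        'answer': str(sample['answer']),
    })

def exact_match(prediction_text, sample):
    # Parse the model's JSON output and compare the answer field case-insensitively.
    try:
        predicted = json.loads(prediction_text)
    except json.JSONDecodeError:
        return False
    return str(predicted.get('answer', '')).strip().lower() == str(sample['answer']).strip().lower()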

Figure 6. Qwen3.5 fine-tuned inference result 1.

This is the same question that we ran the model on in the training notebook before fine-tuning. We can see that the model now responds correctly with the lesion size of 5.6 cm. Along with that, it has also learned the response format and rephrased the original question.

Figure 7. Qwen3.5 fine-tuned inference result 2.

In the above figure, the model recognizes the organ and the type of question. However, it answers the question incorrectly. More training might help here.

Figure 8. Qwen3.5 fine-tuned inference result 3.

For Figure 8, the model needs proper spatial understanding, and it seems it can identify where the ribs are and answer the question correctly.

Takeaways and Further Improvements

In the above experiments, we followed a very simple process to create the question and response format. We can improve this further by expanding the abbreviations for the organ and question type, which are documented in the Readme.docx file that comes with the dataset. Furthermore, we can train the model with a higher LoRA rank and check whether the results improve.

Training a larger model might also help here. Nonetheless, for a 0.8B model, with just 4 epochs of training, the above results look like a good starting point.

Summary and Conclusion

In this article, we covered fine-tuning the Qwen3.5-0.8B model on the niche VQA-RAD dataset. We started with dataset exploration and preparation, then covered the fine-tuning process and the inference results in detail. We will carry out more such experiments in future articles.

If you have any questions, thoughts, or suggestions, please leave them in the comment section. I will surely address them.

You can contact me using the Contact section. You can also find me on LinkedIn and X.
