Fine-Tuning Phi-3.5 Vision Instruct



Phi-3.5 Vision Instruct is one of the most popular small VLMs (Vision Language Models) out there. With around 4B parameters, it is easy to run within 10GB of VRAM, and it gives good results out of the box. However, it falters on OCR tasks involving small text, such as receipts and forms. We will tackle this problem in this article by fine-tuning Phi-3.5 Vision Instruct on a receipt OCR dataset to improve its accuracy.

Figure 1. Results for before and after fine-tuning Phi-3.5 Vision Instruct for OCR.

OCR is a complex task in itself. Most small VLMs are not good at it because they are generally pretrained for image/video explanation and captioning. That’s why fine-tuning the Phi-3.5 Vision Instruct model will reveal the benefits firsthand.

We will cover the following topics while fine-tuning Phi-3.5 Vision Instruct:

  • Discussing the dataset and how the annotations were created. We will use the SROIEv2 dataset.
  • Preparing the Phi-3.5 Vision Instruct model for fine-tuning on the OCR task.
  • Training and inference on receipt images.
  • Running evaluation for calculating CER on the ground truth test data.

Let’s jump into the article.

The SROIEv2 Dataset for Fine-Tuning Phi-3.5 Vision Instruct

We will use a slightly modified version of the SROIEv2 (Scanned Receipts OCR and Information Extraction) dataset.

In this dataset, the annotations have been generated by another VLM and stored in plain text file format.

The dataset is available here on Kaggle.

The following is the directory structure after downloading and extracting the dataset.

input/
├── claude35haiku_annots
│   ├── test_annots  [347 entries exceeds filelimit, not opening dir]
│   └── train_annots  [626 entries exceeds filelimit, not opening dir]
├── gemini-2.0-flash
│   ├── test_annots  [347 entries exceeds filelimit, not opening dir]
│   └── train_annots  [626 entries exceeds filelimit, not opening dir]
├── input.txt
├── qwen2_5_vl_3b_annots
│   ├── test_annots  [347 entries exceeds filelimit, not opening dir]
│   └── train_annots  [626 entries exceeds filelimit, not opening dir]
├── qwen2_vl_2b_annots
│   ├── test_annots  [347 entries exceeds filelimit, not opening dir]
│   └── train_annots  [626 entries exceeds filelimit, not opening dir]
├── qwen2_vl_7b_annots
│   ├── test_annots  [347 entries exceeds filelimit, not opening dir]
│   └── train_annots  [626 entries exceeds filelimit, not opening dir]
└── sroie_v2
    └── SROIE2019
        ├── test
        │   ├── box
        │   ├── entities
        │   └── img
        └── train
            ├── box
            ├── entities
            └── img
  • The sroie_v2 folder contains the original data with a SROIE2019 subdirectory. The train and test subdirectories contain the splits, and the img directories contain the images.
  • The other top-level directories contain the annotations generated by different models, such as Claude, Gemini, etc. However, we will use the annotations in the qwen2_5_vl_3b_annots directory, as Qwen2.5 VL 3B gave the best results. All the annotations are saved as plain text files. The following is an example.

tan woon yann
BOOK TA .K (TAMAN DAYA) SDN BHD
789417-W
NO.5: 55,57 & 59, JALAN SAGU 1B,
TAMAN DAYA,
81100 JOHOR BAHRU,
JOHOR.
Document No : TD01167104
Date : 25/12/2018 8:13:39 PM
Cashier : MANIS
Member :
CASH BILL
CODE/DESC PRICE Disc AMOUNT
QTY RM
95569390-H0118 RF MODELLING CLAY KIDDY FISH
1 PC * 9.000 0.00 9.00
Total : 9.00
Rounding Adjustment : 0.00
Round d Total (RM): 9.00
Cash 90.00
CHANGE 0.00
GOODS SOLD ARE NOT RETURNABLE OR
EXCHANGEABLE
如有问题,敬请原谅。谢谢!
THANK YOU
PLEASE COME AGAIN!

And the following is the corresponding receipt image.

Figure 2. Sample training image from the SROIEv2 dataset corresponding to the above text.

It is the same dataset that we used for fine-tuning SmolVLM for receipt OCR.

There are 626 training and 347 validation samples. Please go through the directory structure once after downloading the dataset. It will help you understand the dataset better.

The Project Directory Structure

Let’s take a look at the entire project directory structure.

├── evals
│   ├── fine_tuned_test_cer.ipynb
│   ├── ground_truth_cer.ipynb
│   └── readme.txt
├── inference_data
│   ├── image_1.jpeg
│   ├── image_2_cropped.jpeg
│   └── image_2.jpeg
├── input
│   ├── claude35haiku_annots
│   ├── gemini-2.0-flash
│   ├── qwen2_5_vl_3b_annots
│   ├── qwen2_vl_2b_annots
│   ├── qwen2_vl_7b_annots
│   ├── sroie_v2
│   └── input.txt
├── notebooks
│   ├── annot_notebooks
│   ├── trained_models
│   ├── wandb
│   ├── phi_3_5_vision_inference_pretrained.ipynb
│   └── phi_3_5_vision_lora_sft_sroie.ipynb
└── requirements.txt
  • The evals directory contains the notebooks that we use to carry out CER evaluation on the validation dataset after training.
  • The inference_data directory contains a few sample receipt images from the internet for testing after training.
  • We have already seen the input directory structure above.
  • The notebooks directory contains the training and inference notebooks along with the trained adapters in the trained_models folder.

If you want to run training or evaluation, please download the dataset from Kaggle and organize it according to the above structure. The rest of the notebooks and inference dataset will be provided for download as a zip file.

Download Code

Setup and Installing Requirements

We can install all the necessary libraries using the requirements file.

pip install -r requirements.txt

Additionally, we log everything to Weights & Biases. So, create a .env file and add your Weights & Biases API key to the file.

WANDB_API_KEY=YOUR_API_KEY_HERE

This is all we need for the setup. Let’s move on to the training notebook.

Fine-Tuning Phi-3.5 Vision Instruct for Receipt OCR

The code for the Phi-3.5 Vision Instruct fine-tuning is present in the phi_3_5_vision_lora_sft_sroie.ipynb Jupyter Notebook. Let’s go through the important sections of the code.

Import Statements and Setup

The following are all the libraries and modules that we need along the way.

import glob
import pandas as pd
import os
import torch
import gc
import time
import matplotlib.pyplot as plt
import transformers

from datasets import Dataset
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor
from transformers import BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

Let’s also set the Transformers tokenizer’s parallelism to false to avoid a few warnings.

os.environ['TOKENIZERS_PARALLELISM'] = 'false'

from dotenv import load_dotenv

load_dotenv()

wandb_api_key = os.getenv('WANDB_API_KEY')

%%bash
wandb login $wandb_api_key

Along with that, we log in to Weights & Biases for experiment tracking.

Define Data Paths and Load Dataset

The following code block defines the paths to the annotation files and the image directory.

train_annot_paths = '../input/qwen2_5_vl_3b_annots/train_annots/'
test_annot_paths = '../input/qwen2_5_vl_3b_annots/test_annots/'

root_image_path = '../input/sroie_v2/SROIE2019'

Next, we define a preprocessing function that converts the data into the chat format that Phi-3.5 Vision Instruct expects.

def format_data(samples):
    data_samples = []
    image_paths = []

    query = 'You are a vision language model expert at OCR of receipts, invoices, and forms. OCR this image accurately.'

    for i in range(len(samples['image_paths'])):
        image = samples['image_paths'][i]
        label = samples['text_paths'][i]

        message = [
            {'role': 'user', 'content': f'<|image_1|>\n{query}'},
            {'role': 'assistant', 'content': label}
        ]

        image_paths.append(image)
        data_samples.append(message)
    return {
        'data': data_samples,
        'images': image_paths
    }

In the above code block, we set a default prompt to steer the model in the correct direction. As we are dealing with both images and prompts here, we also:

  • Set the roles of the user and assistant correctly.
  • This necessitates a default <|image_1|>\n placeholder followed by the user query for the user role.
  • For the assistant role, the content is the label from the text file corresponding to the image.

We maintain two lists: data_samples for the prompt data and image_paths for the image paths. We will later read the images and feed them to the processor in the collation function.

Next, we need to read the text files and prepare the paths for images & corresponding text labels.

train_text_files = glob.glob(os.path.join(train_annot_paths, '*.txt'))
test_text_files = glob.glob(os.path.join(test_annot_paths, '*.txt'))

train_text_files.sort()
test_text_files.sort()

def prepare_img_txt_list(text_files, split='train'):
    all_images, all_texts = [], []
    
    for text_file in text_files:
        text = open(text_file).read()
    
        if len(text) > 200:
            text_file_name = text_file.split(os.path.sep)[-1].split('.txt')[0]
    
            image_file_name = os.path.join(f'{root_image_path}/{split}/img/', text_file_name+'.jpg')
    
            all_images.append(image_file_name)
            all_texts.append(text)

    return all_images, all_texts

all_train_images, all_train_texts = prepare_img_txt_list(train_text_files, split='train')
all_test_images, all_test_texts = prepare_img_txt_list(test_text_files, split='test')

train_df = pd.DataFrame(
    {
        'image_paths': all_train_images,
        'text_paths': all_train_texts
    }
)

test_df = pd.DataFrame(
    {
        'image_paths': all_test_images,
        'text_paths': all_test_texts
    }
)

train_from_df = Dataset.from_pandas(train_df)
test_from_df = Dataset.from_pandas(test_df)

train_dataset = train_from_df.map(
    format_data, 
    batched=True, 
    batch_size=1,
    num_proc=8,
    remove_columns=train_from_df.column_names
)
test_dataset = test_from_df.map(
    format_data, 
    batched=True, 
    batch_size=1,
    num_proc=8,
    remove_columns=test_from_df.column_names
)

In the above code block, we first create Hugging Face datasets from the train and test data frames. Then we map the format_data function over both datasets to create the final ones.

Printing a single sample from train_dataset looks like the following.

{
   "data":[
      {
         "content":"<|image_1|>\nYou are a vision language model expert at OCR of receipts, invoices, and forms. OCR this image accurately.",
         "role":"user"
      },
      {
         "content":"tan woon yann\nBOOK TA .K (TAMAN DAYA) SDN BHD\n789417-W\nNO.5: 55,57 & 59, JALAN SAGU 1B,\nTAMAN DAYA,\n81100 JOHOR BAHRU,\nJOHOR.\nDocument No : TD01167104\nDate : 25/12/2018 8:13:39 PM\nCashier : MANIS\nMember :\nCASH BILL\nCODE/DESC PRICE Disc AMOUNT\nQTY RM\n95569390-H0118 RF MODELLING CLAY KIDDY FISH\n1 PC * 9.000 0.00 9.00\nTotal : 9.00\nRounding Adjustment : 0.00\nRound d Total (RM): 9.00\nCash 90.00\nCHANGE 0.00\nGOODS SOLD ARE NOT RETURNABLE OR\nEXCHANGEABLE\n如有问题,敬请原谅。谢谢!\nTHANK YOU\nPLEASE COME AGAIN!",
         "role":"assistant"
      }
   ],
   "images":"../input/sroie_v2/SROIE2019/train/img/X00016469612.jpg"
}

We have a dictionary with data and images keys. The former contains the preprocessed data and the latter contains the image path.

Load the Phi-3.5 Vision Instruct Model

The following code block loads the model from sovitrath/Phi-3.5-vision-instruct. This has a slightly modified version of the model script, which works correctly with more recent versions of Transformers. The architecture remains unchanged.

model_id = 'sovitrath/Phi-3.5-vision-instruct'

# BitsAndBytesConfig int-4 config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map='auto',
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    quantization_config=bnb_config,
    # _attn_implementation='flash_attention_2', # Use `flash_attention_2` on Ampere GPUs and above and `eager` on older GPUs.
    _attn_implementation='eager', # Use `flash_attention_2` on Ampere GPUs and above and `eager` on older GPUs.
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
processor.tokenizer.pad_token = processor.tokenizer.eos_token
processor.tokenizer.padding_side = 'left'

We will carry out QLoRA training of the Phi model. We also set the processor’s padding token to the end-of-sequence token and configure left-sided padding.

Let’s define the LoRA configuration.

# Configure LoRA
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=['down_proj','o_proj','k_proj','q_proj','gate_proj','up_proj','v_proj'],
    use_dora=True,
    init_lora_weights='gaussian'
)

# Apply PEFT model adaptation
peft_model = get_peft_model(model, peft_config)

# Print trainable parameters
peft_model.print_trainable_parameters()

We set a rank of 8 and alpha of 16. This amounts to around 5.9 million trainable parameters in the model.

Fine-Tuning the Model

Before fine-tuning, we need to define the TrainingArguments and the collate function.

output_dir = 'trained_models/adapters/phi_3_5_vision_qwen2_5_vl_3b_gt_20251008'

# Configure training arguments using SFTConfig
training_args = transformers.TrainingArguments(
    output_dir=output_dir,
    logging_dir=output_dir,
    # num_train_epochs=1,
    max_steps=1200, # 625,
    per_device_train_batch_size=1, # Batch size MUST be 1 for Phi 3.5 Vision Instruct fine-tuning
    per_device_eval_batch_size=1, # Batch size MUST be 1 for Phi 3.5 Vision Instruct fine-tuning
    gradient_accumulation_steps=4, # 4
    warmup_steps=50,
    learning_rate=1e-4,
    weight_decay=0.01,
    logging_steps=400,
    eval_steps=400,
    save_steps=400,
    logging_strategy='steps',
    eval_strategy='steps',
    save_strategy='steps',
    save_total_limit=2,
    optim='adamw_torch_fused',
    bf16=True,
    report_to='wandb',
    remove_unused_columns=False,
    gradient_checkpointing=True,
    dataloader_num_workers=4,
    # dataset_text_field='',
    # dataset_kwargs={'skip_prepare_dataset': True},
    load_best_model_at_end=True,
    save_safetensors=True,
)

One major thing to note in the above code block is that we are using a batch size of 1. Phi-3.5 Vision Instruct model’s processor does not accept a list of images, so it will throw an error with multi-image batches.

We have set gradient accumulation steps to 4 to compensate for the small batch size, giving an effective batch size of 4. However, a batch size of 1 is still inefficient in terms of compute usage.
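To see why this compensation works, here is a minimal, self-contained sketch (using a toy linear model, not the Phi model): four accumulation steps with a micro-batch of 1 produce the same gradients as a single batch of 4, as long as each micro-batch loss is divided by the number of accumulation steps. The Hugging Face Trainer handles this scaling internally.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
data, target = torch.randn(4, 4), torch.randn(4, 1)
loss_fn = torch.nn.MSELoss()  # mean reduction

# Gradient from a single batch of 4.
model.zero_grad()
loss_fn(model(data), target).backward()
full_grad = model.weight.grad.clone()

# Four micro-batches of 1 with gradient accumulation.
model.zero_grad()
accum_steps = 4
for i in range(accum_steps):
    # Scale each micro-batch loss so the summed gradients match the full-batch mean.
    micro_loss = loss_fn(model(data[i:i+1]), target[i:i+1]) / accum_steps
    micro_loss.backward()  # gradients add up across backward() calls

accum_grad = model.weight.grad.clone()
print(torch.allclose(full_grad, accum_grad, atol=1e-6))  # True
```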

Next, the collate function.

def collate_fn(examples):
    user_prompt = examples[0]['data'][0]['content']
    messages = [{'role': 'user', 'content': user_prompt}]
    
    ground_truth = examples[0]['data'][1]['content']
    answer = f"{ground_truth}<|end|>\n<|endoftext|>"
    
    prompt = processor.tokenizer.apply_chat_template(
        messages, 
        tokenize=False, 
        add_generation_prompt=True
    )

    image = Image.open(examples[0]['images'])
    
    # Process prompt and image
    batch = processor(prompt, [image], return_tensors='pt', padding=True)
    
    # Tokenize answer
    answer_input_ids = processor.tokenizer(
        answer, add_special_tokens=False, return_tensors='pt'
    )['input_ids']

    # Concatenate input_ids
    batch['input_ids'] = torch.cat([batch['input_ids'], answer_input_ids], dim=1)
    
    # Create labels with proper masking
    ignore_index = -100
    prompt_length = batch['input_ids'].size(1) - answer_input_ids.size(1)
    labels = torch.cat([
        torch.full((1, prompt_length), ignore_index, dtype=torch.long),
        answer_input_ids
    ], dim=1)
    
    batch['labels'] = labels
    
    # Update attention mask
    batch['attention_mask'] = torch.ones_like(batch['input_ids'])

    ######### THE FOLLOWING WILL NOT BE NEEDED WITH SFTTrainer and SFTConfig #########
    ######### ONLY NEEDED FOR WHEN USING TrainingArguments and Trainer #########
    # Ensure only floating-point tensors require gradients
    for key, value in batch.items():
        if isinstance(value, torch.Tensor) and torch.is_floating_point(value):
            batch[key] = value.clone().detach().requires_grad_(True)
    
    return batch

The function accepts one example at a time. We extract the user prompt, the ground-truth answer, and the image path, apply the chat template to the prompt, and pass the prompt and image to the processor. The prompt positions in the labels are masked with -100 so that the loss is computed only on the answer tokens.

We have borrowed a few code snippets from this article, such as the last lines regarding the gradients. This is quite necessary for the correct training of the model.
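To make the label masking concrete, here is the same concatenation and masking logic run on hypothetical token IDs (in the real collate function, these come from the processor and tokenizer):

```python
import torch

ignore_index = -100
prompt_ids = torch.tensor([[1, 2, 3, 4, 5]])   # hypothetical prompt tokens
answer_ids = torch.tensor([[10, 11, 12]])      # hypothetical answer tokens

# Concatenate exactly as in collate_fn.
input_ids = torch.cat([prompt_ids, answer_ids], dim=1)

# Mask the prompt positions so the loss is computed only on the answer.
prompt_length = input_ids.size(1) - answer_ids.size(1)
labels = torch.cat([
    torch.full((1, prompt_length), ignore_index, dtype=torch.long),
    answer_ids
], dim=1)

print(labels.tolist())  # [[-100, -100, -100, -100, -100, 10, 11, 12]]
```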

Finally, we define the Trainer and start the training. The training happened on a 10GB RTX 3080 GPU.

trainer = transformers.Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    data_collator=collate_fn
)

trainer.can_return_loss = True

trainer.train()

The following are the training logs.

Figure 3. Phi-3.5 Vision Instruct fine-tuning logs.

The lowest evaluation loss is achieved at 400 steps. We will use the model saved at that point for inference.

We can also use the following code to save the best model in the end.

trainer.save_model(training_args.output_dir)
processor.image_processor.save_pretrained(training_args.output_dir)
processor.tokenizer.save_pretrained(training_args.output_dir)

This ends all the training steps needed. Next, we will move on to inference.

Inference Using the Phi-3.5 Vision Instruct Model

Let’s jump into the inference notebook now. The code for this is present in the phi_3_5_vision_ft_inference.ipynb Jupyter Notebook.

Imports:

import torch
import matplotlib.pyplot as plt

from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

Loading the trained Phi-3.5 Vision Instruct model:

model_id = 'trained_models/adapters/phi_3_5_vision_qwen2_5_vl_3b_gt_20251008'
# Hub path also works.
# model_id = 'sovitrath/Phi-3.5-Vision-Instruct-OCR'

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map='auto',
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    # _attn_implementation='flash_attention_2', # Use `flash_attention_2` on Ampere GPUs and above and `eager` on older GPUs.
    _attn_implementation='eager', # Use `flash_attention_2` on Ampere GPUs and above and `eager` on older GPUs.
)

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

The model has also been published on the Hugging Face Hub as Phi-3.5-Vision-Instruct-OCR.

So, you can either use the local directory path for model_id, or the Hub model path if you are not training the model yourself.

Test function:

def test(model, processor, image, max_new_tokens=1024, device='cuda'):
    placeholder = f"<|image_1|>\n"
    messages = [
        {
            'role': 'user',
            'content': placeholder + 'OCR this image accurately'
        },
    ]
    
    # Prepare the text input by applying the chat template
    text_input = processor.tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=False
    )

    if image.mode != 'RGB':
        image = image.convert('RGB')
        
    # Prepare the inputs for the model
    model_inputs = processor(
        text=text_input,
        images=[image],
        return_tensors='pt',
    ).to(device)  # Move inputs to the specified device

    # Generate text with the model
    generated_ids = model.generate(**model_inputs, max_new_tokens=max_new_tokens)

    # Trim the generated ids to remove the input ids
    trimmed_generated_ids = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(model_inputs.input_ids, generated_ids)
    ]

    # Decode the output text
    output_text = processor.batch_decode(
        trimmed_generated_ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False
    )

    return output_text[0]  # Return the first decoded output text

The test function accepts the image, defines the messages list, and takes care of the forward pass. It returns the final text output after trimming off the user input tokens.
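The trimming step deserves a note: generate() returns the prompt tokens followed by the newly generated tokens, so we slice off the prompt length before decoding. On hypothetical token IDs:

```python
# generate() output = prompt tokens + newly generated tokens.
input_ids = [[101, 5, 7, 9]]                  # hypothetical prompt IDs we fed in
generated_ids = [[101, 5, 7, 9, 42, 43, 44]]  # hypothetical generate() output

# Keep only the tokens produced after the prompt, as in the test function.
trimmed = [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]
print(trimmed)  # [[42, 43, 44]]
```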

Running inference:

We have a few images from the internet on which we have run inference. Let’s check the results.

Result 1:

Figure 4. Inference result 1 after training Phi-3.5 Vision Instruct on the receipt OCR dataset.

We can see that the model has captured the overall structure of the document’s layout. However, the OCR is not entirely correct. For example, the prices of the items captured are not correct in all cases.

Result 2:

Figure 5. Inference result 2 for Phi-3.5 receipt OCR.

This is a slightly more complex scenario with a crumpled receipt. In this case, the model is able to capture the textual components correctly; however, it is struggling to capture the numerical components accurately.

Result 3:

Figure 6. Inference result 3 for Phi-3.5 receipt OCR.

In this case, we have a lot of small text. Interestingly, the model captures many nuanced details, such as the email address. Still, we can see it struggling to capture the numbers correctly, or missing them entirely.

Applying some form of augmentation to the original dataset would likely help the model capture numerical components under varied conditions.
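As a starting point, here is a sketch of such augmentations using only Pillow. The augment_receipt function and its parameter ranges are illustrative assumptions, not what was used in training; a library like Albumentations would offer richer transforms.

```python
import random
from PIL import Image, ImageEnhance, ImageFilter

def augment_receipt(image, seed=None):
    """Apply light, receipt-style photo distortions to a PIL image."""
    rng = random.Random(seed)
    # Small rotations simulate skewed photos of receipts.
    image = image.rotate(rng.uniform(-3, 3), expand=True, fillcolor='white')
    # Brightness and contrast jitter simulate lighting differences.
    image = ImageEnhance.Brightness(image).enhance(rng.uniform(0.8, 1.2))
    image = ImageEnhance.Contrast(image).enhance(rng.uniform(0.8, 1.2))
    # Occasional mild blur simulates focus issues with small text.
    if rng.random() < 0.3:
        image = image.filter(ImageFilter.GaussianBlur(radius=1))
    return image

# Demo on a blank receipt-shaped canvas.
sample = Image.new('RGB', (400, 800), 'white')
augmented = augment_receipt(sample, seed=0)
print(augmented.size)
```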

Note: This model is now part of the local_file_search project.

CER Evaluation

We also have a notebook under the evals directory for running CER (Character Error Rate) Evaluation. The code is present in the fine_tuned_test_cer.ipynb Jupyter Notebook.

We are not going through the notebook here; however, I highly recommend taking a look at it on your own. We use the Qwen2.5 VL 3B annotations of the test set as the ground truth.

The current trained model reaches a CER of 0.355. This is obviously not ideal, and there is much room for improvement. For a fuller picture, we can also run ROUGE and BLEU evaluations.
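While we are not reproducing the notebook, the metric itself is simple: CER is the character-level edit distance between the prediction and the ground truth, divided by the ground-truth length. Here is a minimal pure-Python sketch (the notebook may well use a library such as jiwer instead):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: Levenshtein distance / reference length."""
    # Dynamic-programming edit distance over characters.
    prev = list(range(len(hypothesis) + 1))
    for i, rc in enumerate(reference, start=1):
        curr = [i]
        for j, hc in enumerate(hypothesis, start=1):
            cost = 0 if rc == hc else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1] / max(len(reference), 1)

# One wrong character out of twelve.
print(round(cer('Total : 9.00', 'Total : 9.08'), 3))  # 0.083
```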

Summary and Conclusion

In this article, we covered the fine-tuning of Phi-3.5 Vision Instruct for receipt OCR. We started with a discussion of the SROIEv2 dataset, along with how the annotations were created. Next, we covered the training code, followed by running inference on unseen images. We also discussed the weaknesses of the current model. Finally, we covered the CER evaluation and a few points on how to improve the model.

If you have any questions, thoughts, or suggestions, please leave them in the comment section. I will surely address them.

You can contact me using the Contact section. You can also find me on LinkedIn, and X.
