Fine-Tuning DeepSeek-OCR 2



In the last two articles, we covered the fundamentals of DeepSeek-OCR 2 in detail: running the inference pipeline with DeepSeek-OCR 2 and understanding its architecture. Both articles gave us a good idea of how the model performs, which components are novel, and how the model works in code. This article takes the next step: fine-tuning DeepSeek-OCR 2 using Unsloth.

Figure 1. Gradio inference demo after fine-tuning DeepSeek-OCR 2.

Although DeepSeek-OCR 2 performs well on the English language out of the box, Indic languages are not its strong suit. Here, we will address that use case by fine-tuning the model on Hindi-language images and their corresponding text.

What are we going to cover while fine-tuning DeepSeek-OCR 2?

  • Preparing the Hindi-language dataset and splitting it into training and test sets.
  • Training the DeepSeek-OCR 2 model on the dataset using the Unsloth library.
  • Analyzing performance via validation loss.
  • Running inference on test images using a simple Gradio UI.

Note: Although we are going to use the Unsloth version of DeepSeek-OCR 2, some of its inference components appear to be broken at the moment. Both Free-OCR and Grounded-OCR via Unsloth give inferior results compared to the pure Hugging Face Transformers implementation. Fine-tuning solves some of these pain points; however, we are sticking with Unsloth primarily for its GPU memory efficiency while training the model. I have also raised an issue on Hugging Face regarding this.

The Hindi OCR Dataset That We Are Using for Fine-Tuning DeepSeek-OCR 2

We will be using an OCR dataset from Hugging Face. This contains 388 images, primarily from the Bhagavad Gita in Devanagari script.

The text is rich in ligatures and is representative of formal Hindi typography. All in all, this is a good use case for fine-tuning the model.

Following is an example image and its OCR response using the pretrained Unsloth DeepSeek-OCR 2 model.

Figure 2. Hindi language OCR using pretrained Unsloth DeepSeek-OCR 2.

We can clearly see that the pretrained model is not capable of recognizing the text. Fine-tuning will help the model handle dense paragraphs of Devanagari script.

Project Directory Structure

Following is the project directory structure:

├── deepseek_ocr2
│   ├── assets
│   ...
│   ├── model.safetensors.index.json
│   ├── processor_config.json
│   ├── README.md
│   ├── special_tokens_map.json
│   ├── tokenizer_config.json
│   └── tokenizer.json
├── deepseek_ocr_lora
│   ├── adapter_config.json
│   ├── adapter_model.safetensors
│   ├── README.md
│   ├── special_tokens_map.json
│   ├── tokenizer_config.json
│   └── tokenizer.json
├── inference
│   ├── assets
│   ├── images
│   ├── outputs
│   ├── app.py
│   ├── config.py
│   ├── LICENSE
│   ├── README.md
│   ├── requirements.txt
│   └── run.py
├── test_data  [156 entries exceeds filelimit, not opening dir]
└── deepseek_ocr_2_unsloth_ft_hindi_ocr.ipynb
  • We have the training code in the deepseek_ocr_2_unsloth_ft_hindi_ocr.ipynb notebook.
  • The deepseek_ocr2 directory contains the Unsloth version of the locally downloaded DeepSeek-OCR 2 model.
  • deepseek_ocr_lora contains the fine-tuned LoRA.
  • The inference directory contains the inference code for DeepSeek-OCR 2. This covers both a CLI script and a simple Gradio application that we covered in one of the previous articles. It is a slightly modified version of that codebase to cater to the current article.
  • Finally, the test_data directory contains all the test images and their ground truth text after we split the dataset into a training and test set.

The article provides a zip file containing the training Jupyter Notebook, the inference scripts, the trained LoRA, and test data. You can unzip the file, install the requirements, and start experimenting right away.

Download Code

Installing Dependencies

All the training dependencies are installed via the Jupyter Notebook.

For inference, we can install all the necessary libraries using the requirements.txt file in the inference directory.

pip install -r inference/requirements.txt

This is all that we need to know before diving into the code.

Code for Fine-Tuning DeepSeek-OCR 2

All the fine-tuning code that we are going to discuss here is present in the deepseek_ocr_2_unsloth_ft_hindi_ocr.ipynb Jupyter Notebook.

We will only discuss the crucial parts of the notebook in detail, as some of the code is standard boilerplate LLM code.

Installation, Imports, and Download the Model

The first few cells focus on the installation of the libraries, the necessary imports, and downloading the model into the current working directory.

from huggingface_hub import snapshot_download
from unsloth import FastVisionModel 
from transformers import AutoModel
from datasets import load_dataset
from transformers import Trainer, TrainingArguments
from unsloth import is_bf16_supported

import torch
import os

os.environ['UNSLOTH_WARN_UNINITIALIZED'] = '0'
snapshot_download('unsloth/DeepSeek-OCR-2', local_dir='deepseek_ocr2')
# For Unsloth Data Collator.
import math
import io

from dataclasses import dataclass
from typing import Dict, List, Any, Tuple
from PIL import Image, ImageOps
from torch.nn.utils.rnn import pad_sequence
from deepseek_ocr2.modeling_deepseekocr2 import (
    format_messages,
    text_encode,
    BasicImageTransform,
    dynamic_preprocess,
)

The above code block imports the necessary modules from the locally downloaded Unsloth DeepSeek-OCR 2 model directory. This is necessary to create the data collator that we will cover later in the article.

Loading the DeepSeek-OCR 2 Model

The next code block loads the DeepSeek-OCR 2 model.

model, tokenizer = FastVisionModel.from_pretrained(
    './deepseek_ocr2',
    load_in_4bit=False,
    auto_model=AutoModel,
    trust_remote_code=True,
    unsloth_force_compile=True,
    use_gradient_checkpointing='unsloth',
)

Currently, Unsloth does not support loading the model in INT4 format. So, loading and running inference requires between 7.5GB and 8GB VRAM.
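As a rough sanity check on that number, the weight-only footprint can be estimated from the parameter count and dtype. The parameter count below (~3B) is an assumption for illustration, not a figure from the model card: roughly 5.6 GB of weights in bf16, with activations, vision-encoder buffers, and CUDA overhead accounting for the rest.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Rough weight-only footprint; bf16/fp16 use 2 bytes per parameter."""
    return num_params * bytes_per_param / 1024**3

# Hypothetical parameter count of roughly 3B; check the model card for the real number.
print(round(weight_memory_gb(3e9), 2))  # → 5.59
```

The gap between this estimate and the observed 7.5GB to 8GB is the runtime overhead that no back-of-the-envelope calculation captures exactly.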

Loading the Dataset

Next, let’s download the dataset and create training and test splits.

dataset = load_dataset('damerajee/hindi-ocr')

train_dataset = dataset['train']

split_dataset = train_dataset.train_test_split(test_size=0.10, seed=42)

train_data = split_dataset['train']
test_data = split_dataset['test']

print(f"Training samples: {len(train_data)}")
print(f"Test samples: {len(test_data)}")

# Save an image
test_data[0]['image'].save('image.jpg')

# Save all the test data in a directory.
os.makedirs('test_data', exist_ok=True)
for i in range(len(test_data)):
    test_data[i]['image'].save(f"test_data/test_data_{str(i)}.jpg")

    with open(f"test_data/test_data_{str(i)}.txt", 'w') as f:
        f.write(test_data[i]['text'])

We reserve 10% of the data for testing and will use the rest for training. This amounts to 349 samples for training and 39 for testing. Along with that, we also save all the test images and their corresponding ground truth text locally so that we can test and compare the results later.
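The split sizes follow directly from the 10% test fraction. The exact rounding rule is the library's internal choice, but the arithmetic lands on the same counts here:

```python
# Split arithmetic for the 388-sample dataset with test_size=0.10.
total = 388
test_size = round(total * 0.10)   # 39 test samples
train_size = total - test_size    # 349 training samples
print(train_size, test_size)      # → 349 39
```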

Creating the PEFT Model

Next, we will create the PEFT version of the DeepSeek-OCR 2 model.

model = FastVisionModel.get_peft_model(
    model,
    target_modules = [
        'q_proj',
        'k_proj',
        'v_proj',
        'o_proj',
        'gate_proj',
        'up_proj',
        'down_proj',
    ],

    r=16,
    lora_alpha=16, # alpha=r at least
    lora_dropout=0,
    bias='none',
    random_state=3407,
    use_rslora=False,
    loftq_config=None,
    # target_modules = 'all-linear', # Optional now! Can specify a list if needed
)

We are using a rank and alpha of 16. Note also that we are not fine-tuning any of the vision modules.
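To get a feel for how small the adapter is, recall that LoRA adds r*(d_in + d_out) parameters per adapted linear layer. A sketch with a hypothetical layer shape (the real dimensions depend on the model config):

```python
def lora_extra_params(d_in: int, d_out: int, r: int = 16) -> int:
    # LoRA adds two low-rank factors per adapted layer: A (r x d_in) and B (d_out x r).
    return r * (d_in + d_out)

# Hypothetical 2048 -> 2048 projection at rank 16:
full = 2048 * 2048                      # 4,194,304 frozen weights
extra = lora_extra_params(2048, 2048)   # 65,536 trainable weights
print(extra, round(extra / full * 100, 1))  # → 65536 1.6
```

At rank 16, each such layer gains well under 2% of its frozen parameter count, which is why the saved adapter stays small.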

Instruction-Based Data Preparation

We need to format the current dataset into an instruction format for training. For this, we will follow the standard Unsloth vision fine-tuning format shown below.

[
{ "role": "<|User|>",
  "content": "",
  "images": []
},
{ "role": "<|Assistant|>",
  "content": ""
},
]

The following code block creates the formatted dataset.

instruction = "<image>\nFree OCR. "

def convert_to_conversation(sample):
    """Convert dataset sample to conversation format"""
    conversation = [
        {
            "role": "<|User|>",
            "content": instruction,
            "images": [sample['image']]
        },
        {
            "role": "<|Assistant|>",
            "content": sample["text"]
        },
    ]
    return {"messages": conversation}

final_train = [convert_to_conversation(sample) for sample in train_data]
final_test = [convert_to_conversation(sample) for sample in test_data]

print(len(final_train))
print(len(final_test))

One of the most important aspects of dataset preparation is the data collator. We use a custom data collator here, provided as part of the official Unsloth DeepSeek-OCR 2 fine-tuning example. We use the class as is; the next code block shows a truncated version of it.

@dataclass
class DeepSeekOCR2DataCollator:
    """
    Args:
        tokenizer: Tokenizer
        model: Model
        image_size: Size for image patches (default: 768)
        base_size: Size for global view (default: 1024)
        crop_mode: Whether to use dynamic cropping for large images
        train_on_responses_only: If True, only train on assistant responses (mask user prompts)
    """
    tokenizer: Any
    model: Any
    image_size: int = 768
    base_size: int = 1024
    crop_mode: bool = True
    image_token_id: int = 128815
    train_on_responses_only: bool = True

    def __init__(
        self,
        tokenizer,
        model,
        image_size: int = 768,
        base_size: int = 1024,
        crop_mode: bool = True,
        train_on_responses_only: bool = True,
    ):
        self.tokenizer = tokenizer
        self.model = model
        self.image_size = image_size
        self.base_size = base_size
        self.crop_mode = crop_mode
        self.image_token_id = 128815
        self.dtype = model.dtype  # Get dtype from model
        self.train_on_responses_only = train_on_responses_only

        self.image_transform = BasicImageTransform(
            mean = (0.5, 0.5, 0.5),
            std = (0.5, 0.5, 0.5),
            normalize = True
        )
        self.patch_size = 16
        self.downsample_ratio = 4

        # Get BOS token ID from tokenizer
        if hasattr(tokenizer, 'bos_token_id') and tokenizer.bos_token_id is not None:
            self.bos_id = tokenizer.bos_token_id
        else:
            self.bos_id = 0
            print(f"Warning: tokenizer has no bos_token_id, using default: {self.bos_id}")
    # ... (remaining helper methods truncated) ...
    def __call__(self, features: List[Dict[str, Any]]) -> Dict[str, torch.Tensor]:
        """Collate batch of samples"""
        batch_data = []

        # Process each sample
        for feature in features:
            try:
                processed = self.process_single_sample(feature['messages'])
                batch_data.append(processed)
            except Exception as e:
                print(f"Error processing sample: {e}")
                continue

        if not batch_data:
            raise ValueError("No valid samples in batch")

        # Extract lists
        input_ids_list = [item['input_ids'] for item in batch_data]
        images_seq_mask_list = [item['images_seq_mask'] for item in batch_data]
        prompt_token_counts = [item['prompt_token_count'] for item in batch_data]

        # Pad sequences
        input_ids = pad_sequence(input_ids_list, batch_first = True, padding_value = self.tokenizer.pad_token_id)
        images_seq_mask = pad_sequence(images_seq_mask_list, batch_first = True, padding_value = False)

        # Create labels
        labels = input_ids.clone()

        # Mask padding tokens
        labels[labels == self.tokenizer.pad_token_id] = -100

        # Mask image tokens (model shouldn't predict these)
        labels[images_seq_mask] = -100

        # Mask user prompt tokens when train_on_responses_only = True (only train on assistant responses)
        if self.train_on_responses_only:
            for idx, prompt_count in enumerate(prompt_token_counts):
                if prompt_count > 0:
                    labels[idx, :prompt_count] = -100

        # Create attention mask
        attention_mask = (input_ids != self.tokenizer.pad_token_id).long()

        # Prepare images batch (list of tuples)
        images_batch = []
        for item in batch_data:
            images_batch.append((item['images_crop'], item['images_ori']))

        # Stack spatial crop info
        images_spatial_crop = torch.cat([item['images_spatial_crop'] for item in batch_data], dim = 0)

        return {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "labels": labels,
            "images": images_batch,
            "images_seq_mask": images_seq_mask,
            "images_spatial_crop": images_spatial_crop,
        }

The above class returns a dictionary containing the input IDs, the attention mask, labels, images, image sequence masks, and the image spatial crops.
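The masking rules inside __call__ are easy to verify in isolation. Below is a minimal sketch with plain Python lists instead of tensors; the pad token id of 0 is an assumption for illustration, while the ignore index of -100 mirrors the collator:

```python
IGNORE_INDEX = -100  # positions the loss should skip, as in the collator
PAD_ID = 0           # assumed pad token id for this illustration

def collate_labels(seqs, prompt_lens):
    """Pad sequences, then mask pad and prompt positions, as the collator does."""
    max_len = max(len(s) for s in seqs)
    padded = [s + [PAD_ID] * (max_len - len(s)) for s in seqs]
    labels = [
        [IGNORE_INDEX if (i < p or tok == PAD_ID) else tok
         for i, tok in enumerate(row)]
        for row, p in zip(padded, prompt_lens)
    ]
    return padded, labels

# Two sequences with prompt lengths 2 and 1; only response tokens keep their ids.
padded, labels = collate_labels([[5, 6, 7, 8], [5, 6]], [2, 1])
print(labels)  # → [[-100, -100, 7, 8], [-100, 6, -100, -100]]
```

With train_on_responses_only enabled, the loss is computed only over the assistant's text, exactly as the surviving label ids above show.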

Training the Model

Let’s prepare the model for training, initialize the data collator, and the Trainer API.

FastVisionModel.for_training(model)

data_collator = DeepSeekOCR2DataCollator(
    tokenizer=tokenizer,
    model=model,
    image_size=768,
    base_size=1024,
    crop_mode=True,
    train_on_responses_only=True,
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    data_collator=data_collator,
    train_dataset=final_train,
    eval_dataset=final_test,
    args = TrainingArguments(
        per_device_train_batch_size=4,
        per_device_eval_batch_size=4,
        gradient_accumulation_steps=1,
        warmup_steps=5,
        max_steps=500,
        # num_train_epochs = 1, # Set this instead of max_steps for full training runs
        learning_rate=2e-4,
        logging_steps=1,
        optim='adamw_8bit',
        weight_decay=0.001,
        lr_scheduler_type='linear',
        seed=3407,
        fp16=not is_bf16_supported(),  # Use fp16 if bf16 is not supported
        bf16=is_bf16_supported(),  # Use bf16 if supported
        output_dir='training_outputs',
        report_to='none',     # Change to 'wandb' for Weights and Biases
        dataloader_num_workers=4,
        remove_unused_columns=False,
        save_steps=100,
        eval_steps=100,
        eval_strategy='steps',
        save_strategy='steps',
        load_best_model_at_end=True,
        metric_for_best_model='eval_loss',
        greater_is_better=False,
    ),
)

In the above, we are using:

  • A batch size of 4 for both training and evaluation data loaders
  • Gradient accumulation step of 1
  • Maximum training steps of 500
  • 4 data loader workers

With the above batch size and gradient accumulation configuration, the GPU requirement is just above 22GB. An L4 GPU was used for this training run. However, you can reduce the batch size to 1 and increase the gradient accumulation steps to 4 to train on a 16GB T4 GPU on either Colab or Kaggle notebooks.
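The T4 suggestion works because the effective batch size is the per-device batch size times the gradient accumulation steps, so both configurations update the weights with the same number of samples per optimizer step:

```python
def effective_batch_size(per_device: int, grad_accum: int, num_devices: int = 1) -> int:
    # Number of samples contributing to each optimizer step.
    return per_device * grad_accum * num_devices

print(effective_batch_size(4, 1))  # → 4 (L4 configuration above)
print(effective_batch_size(1, 4))  # → 4 (T4-friendly alternative, same updates)
```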

The following code cell starts the training.

trainer_stats = trainer.train()

The following are the training stats.

Figure 3. Training stats from fine-tuning DeepSeek-OCR 2.

As we can see, although we trained the model for 500 steps, we got the best model after 200 steps. We are saving the best model here and will use the same for inference later.

Saving the LoRA

Finally, we save the LoRA locally for loading and carrying out inference later.

model.save_pretrained("deepseek_ocr_lora")
tokenizer.save_pretrained("deepseek_ocr_lora")

With this, we complete all the necessary training code. In the next section, we will carry out inference using the fine-tuned model and extensively test the model on different test images.

Inference Using the Fine-Tuned DeepSeek-OCR 2 Model

We will use the Gradio app for running inference here, which we have modified to accommodate loading the Unsloth model. In this article, we will not cover the inference code in detail; however, please feel free to visit the previous articles in the series to understand the scripts.

Within the inference directory, we can run it using the following command:

cd inference

python app.py

Then navigate to http://127.0.0.1:7860/ in your browser.

Figure 4. Gradio UI for testing the trained DeepSeek-OCR 2 model.

If we wish, we can also load the pretrained weights by checking the appropriate option from the UI. By default, the trained LoRA is loaded from the parent project directory.

The following video shows one complete workflow of the application, where we upload an image and run inference.

Note: Be sure to always choose “free ocr” as the prompt for this application, as that is the prompt the model was trained with.

Video 1. Uploading test image and testing the fine-tuned DeepSeek-OCR 2 model.

After the inference is complete, we get the rendered output in the Rendered Output tab. We can visualize the raw markdown and inference statistics in the respective tabs. However, we can see right away that the inference is not perfect: the OCR output starts with “like this:” before the Hindi text begins.

Let’s compare the generated and the original text. On top of the Gradio application, we have a Diff tab. This is for analyzing the difference between the inference text and the original text. We copied the inference text from the Raw Markdown tab. Let’s analyze the difference between the two.

In the above video, we paste the text generated by the fine-tuned model in text box 2 and the original text in text box 1. We can see the text that has been removed or added by the model by hovering over the + and − markers. By analyzing the results, we can infer that there is a lot of room for improvement.
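Beyond eyeballing the diff, a quick character-level similarity score can quantify how close each prediction is to its ground truth file in test_data. This stdlib difflib sketch is a rough proxy, not a proper character error rate metric:

```python
import difflib

def char_similarity(reference: str, hypothesis: str) -> float:
    """Similarity ratio in [0, 1]; 1.0 means the strings are identical."""
    return difflib.SequenceMatcher(None, reference, hypothesis).ratio()

print(char_similarity("नमस्ते", "नमस्ते"))  # → 1.0
```

Running this over all 39 test pairs gives a single number per image to track as you experiment with higher LoRA ranks or vision-module fine-tuning.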

Further Improvements and Actions

  • Try to play around with the inference application. You can do batch inference by providing the absolute path to a directory in the Batch Processing tab.
  • You can also train with a LoRA rank of 32 and alpha of 64 to check if the performance improves.
  • Furthermore, there is a high chance that the results will improve by fine-tuning the vision modules as well.

Summary and Conclusion

In this article, we covered the fine-tuning of DeepSeek-OCR 2 on Hindi images and text, specifically, the Devanagari script. We started with the discussion of the dataset, followed by the training and inference. We also analyzed the inference and discussed the room for improvement.

If you have any questions, thoughts, or suggestions, please leave them in the comment section. I will surely address them.

You can contact me using the Contact section. You can also find me on LinkedIn, and X.
