Training Gemma 3n for Transcription and Translation

In continuation of the previous article, this week, we will cover training the Gemma 3n model for audio transcription and translation. Gemma 3n models, although multimodal, are not adept at transcribing German audio. Furthermore, even after fine-tuning Gemma 3n for transcription, the model cannot correctly translate those into English. That’s what we are targeting here. To teach the Gemma 3n model to transcribe and translate German audio samples, end-to-end.

Jump to Download Code

Figure 1. Demo of German to English transcription and translation after fine-tuning Gemma 3n.

Our tech stack will use a mix of Hugging Face libraries, Unsloth, and Together for the German to English translation dataset creation.

What will we cover while training Gemma 3n for transcription and translation:

Discussing the dataset in brief.
Setting up the environment.
Preparing the dataset for transcription and translation training. How do we create the German to English translation data for fine-tuning the model?
Training, inference, and gradio demo.

The German Transcription Dataset

We will use the same dataset from Hugging Face that we used in the last article for fine-tuning Gemma 3n for German audio transcription.

The dataset is available on Hugging Face and it contains around 12000 samples. I recommend going through the previous article to know a bit more about the dataset.

The Project Directory Structure

Let’s take a look at the project directory structure before jumping into the coding part.

├── gemma-3n-finetuned
├── outputs
├── gemma3n_e2b_german_translate_finetune.ipynb
├── german_to_english.csv
├── requirements.txt
├── app.py
└── german_to_english_translate.ipynb

The gemma-3n-finetuned and outputs directories contain the final and intermediate model outputs from training.
The gemma3n_e2b_german_translate_finetune.ipynb is the Jupyter Notebook containing the code for training the model for transcription and translation.
In the german_to_english_translate.ipynb Jupyter Notebook, we have the code to create the translation dataset. The final translation samples are present in the german_to_english.csv file.
The app.py Python file contains the Gradio application that we will create after training the model.

All the Jupyter Notebooks, trained adapter weights, and the requirements file are available via the download section.

Download Code

Installing the Dependencies

Just as we discussed in the previous post, we need to install the correct versions of all libraries for the training and inference to work correctly.

When working locally, you can install Unsloth first using the folliwing following command:

pip install unsloth unsloth-zoo

The above will install the latest versions of Unsloth and Unsloth Zoo, which should work, almost always. If you face issues, pin the installation to the following versions, which were used for the codebase in this article.

unsloth==2025.8.1
unsloth-zoo==2025.8.1

Next, install the rest of the libraries.

pip install -r requirements.txt

The above will also install the Together SDK that we need for creating the ground truth translation data.

Creating the Ground Truth German to English Translation Data

The very first step is creating the ground truth data for German to English translation. Without this, we cannot teach the model both transcription and translation.

For this, we will use the Together Serverless API service. The linked article also discusses free credits, free models, and making API calls for text and image generation, along with the setup steps.

For creating the translation dataset, we will make API calls to the Meta-Llama-3.1-8B-Instruct-Turbo model. In case you are short on credits and cannot create the dataset, the downloadable codebase already comes bundled with the CSV file.

The code for this is present in the german_to_english.ipynb Jupyter Notebook. Let’s go through the code briefly, as it is pretty much self-explanatory.

Import Statements and Initializing the Together Client

Let’s import all the libraries that we need and initialize the Together AI client.

from together import Together
from dotenv import load_dotenv
from datasets import load_dataset
from tqdm.auto import tqdm

import time
import pandas as pd

load_dotenv()

client = Together()

Before executing the above code, make sure to create a .env file and add your Together API key with the variable TOGETHER_API_KEY.

Function to Define the Prompt and Get Response

The following function accepts the German transcription from the dataset along with the message and returns the response.

def get_response(message):
    response = client.chat.completions.create(
        model='meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo',
        messages=[
          {
            'role': 'user',
            'content': message
          }
        ]
    )

    return response.choices[0].message.content

Loading the Dataset and Making the API Calls

Let’s load the dataset, create lists to store the data, and make the API calls.

dataset = load_dataset('kadirnar/Emilia-DE-B000000')

german_sentences = []
english_sentences = []
sample_ids = []

for i, data in tqdm(enumerate(dataset['train']), total=len(dataset['train'])):
    # if i == 3:
    #     break
    text  = data['text']
    sample_id = data['_id']

    message = (
        'Translate this sentence from German to English. Give no other text:\n'
        f'German: {text}\n'
        f'English: '
    )

    response = get_response(message)

    german_sentences.append(text)
    english_sentences.append(response)
    sample_ids.append(sample_id)

    time.sleep(2)

We have a sleep time of 2 seconds between each API call to avoid rate limit issues. The entire process took around 12 hours, along with a credit cost of around $2.

Larger models like Llama 70B will give better translation results, but will be much slower in response and will also be costlier.

Save the Results in a CSV File

Finally, we save the results in a CSV file.

df = pd.DataFrame(columns=['_id', 'german', 'english'])

df['_id'] = sample_ids
df['german'] = german_sentences
df['english'] = english_sentences

df.to_csv('german_to_english.csv', index=False)

Figure 2. The German to English translation CSV file samples.

We also save the IDs in the _id column that we can use as the index column later.

Training Gemma 3n for German Audio Transcription and Translation

Let’s move over to the primary objective of this article: training the Gemma 3n model for transcription and translation.

The code is present in the gemma3n_e2b_german_translate_finetune.ipynb Jupyter Notebook. The first few cells contain the code for the installation of libraries in case you are running on Colab or Kaggle. We are skipping them here.

The code will remain mostly similar to the previous article, with changes mostly to the dataset preparation steps. So, we will keep the code discussion of the other parts brief.

Importing the Necessary Libraries for Training

The following code cell contains all the necessary imports.

from unsloth import FastModel
from huggingface_hub import snapshot_download
from datasets import load_dataset, Audio
from IPython.display import Audio, display
from transformers import WhisperProcessor
from evaluate import load
from trl import SFTTrainer, SFTConfig
from transformers import TextStreamer
from functools import partial

import torch
import pandas as pd

# Because of FailOnRecompileLimitHit: recompile_limit reached with one_graph=True
# Solution found here => https://github.com/huggingface/transformers/issues/39427
torch._dynamo.config.cache_size_limit = 32

On RTX GPUs, we may face torch._dynamo.exc.FailOnRecompileLimitHit. The last line of code in the above code block handles that.

Loading the Model

Next, we load the Gemma 3n E2B model in 4-bit quantized format.

model, processor = FastModel.from_pretrained(
    model_name='unsloth/gemma-3n-E2B-it-unsloth-bnb-4bit',
    dtype=None,
    max_seq_length=1024,
    load_in_4bit=True,
    full_finetuning=False
)

Preparing the Dataset

Loading and preparing the dataset for transcription and translation training is the core of this article. We will tackle this next.

# Load dataset.
dataset = load_dataset('kadirnar/Emilia-DE-B000000', split='train')

# Divide into train and test split.
train_samples = 11500
train_dataset = dataset.select(range(train_samples))
test_dataset = dataset.select(range(train_samples, len(dataset)))

Just like in the previous article, we divide it into a training set containing 11500 samples and the rest for testing.

The next step involves creating the data formatting function.

def format_intersection_data(samples: dict, df: pd.DataFrame) -> dict[str, list]:
    """Format intersection dataset to match expected message format"""
    formatted_samples = {'messages': []}
    for idx in range(len(samples['audio'])):
        # Extract audio and text data from HF dataset.
        audio = samples['audio'][idx]['array']
        label = str(samples['text'][idx])

        # Extract translation data from the CSV.
        _id = str(samples['_id'][idx])
        eng_data = df.loc[_id, 'english']
        # de_data = df.loc[_id, 'german']
        # print('Orig german: ', label)
        # print('DF german: ', de_data)
        # print('DF english: ', eng_data)

        content_assistant_text = (
            f'GERMAN TRANSCRIPTION: {label}\n'
            f'ENGLISH TRANSLATION: {eng_data}'
        )

        message = [
            {
                'role': 'system',
                'content': [
                    {
                        'type': 'text',
                        'text': 'You are an assistant that transcribes and translates speech accurately.',
                    }
                ],
            },
            {
                'role': 'user',
                'content': [
                    {'type': 'audio', 'audio': audio},
                    {'type': 'text', 'text': 'Please transcribe this audio and then translate from German to English.'}
                ]
            },
            {
                'role': 'assistant',
                'content':[{'type': 'text', 'text': content_assistant_text}]
            }
        ]

        # print(message)
        
        formatted_samples['messages'].append(message)
    return formatted_samples

In this case, along with the German transcription text, we also pass the English translated text to the assistant. This is what the model will predict. We have also carefully tuned the system and user prompts so that the model knows the task clearly.

Let’s load the CSV file and map the datasets to the above formatting function.

df = pd.read_csv('german_to_english.csv')
df = df.set_index('_id')

train_dataset = train_dataset.map(
    partial(format_intersection_data, df=df), 
    batched=True, 
    batch_size=4, 
    num_proc=8
)

test_dataset = test_dataset.map(
    partial(format_intersection_data, df=df), 
    batched=True, 
    batch_size=4, 
    num_proc=8
)

This is what the samples look like after they are formatted for model training.

user
You are an assistant that transcribes and translates speech accurately.










































Please transcribe this audio and then translate from German to English.
model
GERMAN TRANSCRIPTION:  Das dritte Gate, da ist das vierte Gate. Da ist das Joban Eerie, da muss ich hin.
ENGLISH TRANSLATION: The third gate, that is the fourth gate. That is the Joban Eerie, I must go there.

The entire audio array gets converted to . The user prompt follows it, and then the assistant’s response.

The final step of data preparation is the collation function for managing the text and audio batches.

def collate_fn(examples):
    texts = []
    audios = []

    for example in examples:
        # Apply chat template to get text
        text = processor.apply_chat_template(
            example['messages'], tokenize=False, add_generation_prompt=False
        ).strip()
        texts.append(text)

        # Extract audios
        audios.append(example['audio']['array'])

    # Tokenize the texts and process the images
    batch = processor(
        text=texts, audio=audios, return_tensors='pt', padding=True
    )

    # The labels are the input_ids, and we mask the padding tokens in the loss computation
    labels = batch['input_ids'].clone()

    # Use Gemma3n specific token masking
    labels[labels == processor.tokenizer.pad_token_id] = -100
    if hasattr(processor.tokenizer, 'image_token_id'):
        labels[labels == processor.tokenizer.image_token_id] = -100
    if hasattr(processor.tokenizer, 'audio_token_id'):
        labels[labels == processor.tokenizer.audio_token_id] = -100
    if hasattr(processor.tokenizer, 'boi_token_id'):
        labels[labels == processor.tokenizer.boi_token_id] = -100
    if hasattr(processor.tokenizer, 'eoi_token_id'):
        labels[labels == processor.tokenizer.eoi_token_id] = -100


    batch['labels'] = labels
    return batch

This ends all the data preparation steps we need for training the Gemma 3n model for transcription and translation.

Preparing the Model, SFTTrainer, and SFTConfig for Training Gemma 3n for Transcription and Translation

We will be fine-tuning both the text and audio layers of the Gemma 3n E2B model using QLoRA training methodology.

model = FastModel.get_peft_model(
    model,
    finetune_vision_layers=False,
    finetune_language_layers=True,
    finetune_attention_modules=True,
    finetune_mlp_modules=True,
    r=8,
    lora_alpha=16,
    lora_dropout=0,
    bias='none',
    random_state=3407,
    use_rslora=False,
    loftq_config=None,
    target_modules=[
        'q_proj', 'k_proj', 'v_proj', 'o_proj',
        'gate_proj', 'up_proj', 'down_proj',

        # Audio layers
        'post', 'linear_start', 'linear_end',
        'embedding_projection',
    ],
    modules_to_save=[
        'lm_head',
        'embed_tokens',
        'embed_audio',
    ],
)

We need to initialize the SFTTrainer class with the correct configurations.

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    processing_class=processor.tokenizer,
    data_collator=collate_fn,
    args=SFTConfig(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=1,
        # Use reentrant checkpointing.
        gradient_checkpointing_kwargs={'use_reentrant': False},
        warmup_ratio=0.1,
        # max_steps=10,
        num_train_epochs=1,
        learning_rate=5e-5,
        logging_steps=200,
        eval_strategy='steps',
        eval_steps=500,
        save_strategy='steps',
        save_steps=500,
        optim='adamw_8bit',
        weight_decay=0.01,
        lr_scheduler_type='cosine',
        seed=3407,
        output_dir='outputs',
        report_to='none',
        save_total_limit=2,
        load_best_model_at_end=True,
        metric_for_best_model='eval_loss',

        # For audio finetuning.
        remove_unused_columns=False,
        dataset_text_field='',
        dataset_kwargs={'skip_prepare_dataset': True},
        dataset_num_proc=8,
        max_length=1024,
    )
)

We will be training the model for just one epoch, which should be enough to get started.

Finally, let’s start the training.

The entire training and inference was done on a system with a 10GB RTX 3080 GPU, 32GB RAM, and 10th generation i7 CPU.

trainer_stats = trainer.train()

Here is a sample of the logs.

Figure 3. Gemma 3n for translation training logs.

The training took around 3 hours and 30 minutes to complete.

Inference Using the Trained Model

Before going into the inference part, please go through the inference section of the previous article. We discuss the issues with the Gemma 3 Processor that Unsloth has and how to rectify them.

The rest of the inference part remains similar.

# Loading the saved model.
from unsloth import FastModel
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained('google/gemma-3n-E2B-it')

model, _ = FastModel.from_pretrained(
    model_name='gemma-3n-finetuned',
    max_seq_length=512,
    load_in_4bit=True,
    dtype=torch.bfloat16
)

def do_gemma_3n_inference(messages, max_new_tokens=128):
    _ = model.generate(
        **processor.apply_chat_template(
            messages,
            add_generation_prompt=True, # Must add for generation
            tokenize=True,
            return_dict=True,
            return_tensors='pt',
            truncation=False
        ).to('cuda', dtype=torch.bfloat16),
        max_new_tokens=max_new_tokens,
        do_sample=False,
        streamer=TextStreamer(processor, skip_prompt=True),
    )

messages = [
    {
        'role': 'system',
        'content': [
            {
                'type': 'text',
                'text': 'You are an assistant that transcribes and translates speech accurately.',
            }
        ],
    },
    {
        'role': 'user',
        'content': [
            {'type': 'audio', 'audio': test_dataset[-1]['audio']['array']},
            {'type': 'text', 'text': 'Please transcribe this audio and translate to English. Give both, the trancription and the translation'}
        ]
    }
]

do_gemma_3n_inference(messages, max_new_tokens=256)

We use the last sample from the test set and get the following result.

Transcription: Und ich hab so auf ein Stück Papier geschrieben. Ich hab viele Blockaden und kannst du mir helfen, diese Blockaden aufzulösen?
Translation: And I wrote it down on a piece of paper. I have many blockages and can you help me to resolve these blockages?

The model gives both the transcription and the translation. Checking with Google Translate, the answer looks pretty accurate.

At this moment, we have a good model that can transcribe German audio and translate it to English. Let’s move on to create an interactive Gradio application where we can upload audio files and carry out the entire process with a single click.

Creating a Gradio Application for Gemma 3n Transcription and Translation

The code for this is present in the app.py script.

The codebase also comes with convert_to_wav.ipynb Jupyter Notebook. This converts all the test samples in the dataset to .wav audio files so that we can easily upload and check the results in the app. Running the notebook will store all the audio files in the test_wav_files directory.

Let’s import the packages, load the model and processor, and define the inference function.

Imports:

import gradio as gr
import torch
import librosa
import numpy as np
import glob

from unsloth import FastModel
from transformers import AutoProcessor, TextIteratorStreamer
from threading import Thread

Load model and processor:

TARGET_SAMPLING_RATE = 16000

print("Loading model and processor...")

processor = AutoProcessor.from_pretrained('google/gemma-3n-E2B-it')

model, _ = FastModel.from_pretrained(
    model_name='gemma-3n-finetuned',
    max_seq_length=512,
    load_in_4bit=True,
    dtype=torch.bfloat16,
)
print("Model and processor loaded successfully.")

Inference function:

def transcribe_and_translate(audio_input):
    """
    This function takes audio data from the Gradio component, processes it,
    and then streams the model's transcription and translation back to the UI.
    """
    if audio_input is None:
        yield "Error: Please upload or record a German audio file first."
        return

    sample_rate, audio_array = audio_input

    if audio_array.ndim > 1:
        audio_array = audio_array.mean(axis=1)

    audio_array = audio_array.astype(np.float32)

    if sample_rate != TARGET_SAMPLING_RATE:
        audio_array = librosa.resample(
            y=audio_array,
            orig_sr=sample_rate,
            target_sr=TARGET_SAMPLING_RATE
        )

    messages = [
        {
            'role': 'system',
            'content': [
                {
                    'type': 'text',
                    'text': 'You are an assistant that transcribes and translates speech accurately.',
                }
            ],
        },
        {
            'role': 'user',
            'content': [
                {'type': 'audio', 'audio': audio_array},
                {'type': 'text', 'text': 'Please transcribe this audio and translate it to English. Give both, the transcription and the translation.'}
            ]
        }
    ]

    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors='pt'
    ).to('cuda', dtype=torch.bfloat16)

    streamer = TextIteratorStreamer(processor, skip_prompt=True, skip_special_tokens=True)

    generation_kwargs = dict(
        **inputs,
        streamer=streamer,
        max_new_tokens=1024,
        do_sample=False,
    )
    thread = Thread(target=model.generate, kwargs=generation_kwargs)
    thread.start()

    output_text = ""
    for new_text in streamer:
        output_text += new_text
        yield output_text

We also grab all the .wav file from the directory so that we do not need to upload them every time. And finally, create and launch the app.

# Grab all wav files in the directory
example_audios = glob.glob('test_wav_files/*.wav')

example_list = [ for audio in example_audios]

with gr.Blocks(theme=gr.themes.Soft()) as demo:
    gr.Markdown(
        """
        # German Audio Transcription & Translation
        Upload a German audio file, and the model will provide both the German transcription and its English translation.
        This uses a Gemma-3N model fine-tuned with Unsloth.
        """
    )
    with gr.Row():
        audio_input = gr.Audio(sources=["upload", "microphone"], type="numpy", label="German Audio")
        text_output = gr.Textbox(label="Transcription and Translation", lines=10, interactive=False)

    submit_btn = gr.Button("Transcribe and Translate", variant="primary")
    submit_btn.click(
        fn=transcribe_and_translate,
        inputs=audio_input,
        outputs=text_output
    )

    gr.Examples(
        examples=example_list,
        inputs=audio_input,
        outputs=text_output,
        fn=transcribe_and_translate,
        cache_examples=False
    )

if __name__ == "__main__":
    demo.launch(share=True)

We can launch the app via the following command.

python app.py

This is what the UI looks like.

Figure 4. Gemma 3n transcription and translation Gradio app UI.

The following video shows the Gemma 3n transcription and translation application in action.

The Gemma 3n translation and transcription Gradio app in action.

Key Takeaways

Although we trained a good model for transcription and translation, we are missing one core component here: a standard evaluation framework.

The evaluation code is present in the gemma3n_e2b_finetuned_eval.ipynb Jupyter Notebook. Running this gives a WER of 49.16%.

However, it is difficult to evaluate this model in a standard way while calculating WER (Word Error Rate). This is because the model does not always follow the format we have defined in the format_intersection_data. So, the final WER is not representative of the real number.

This brings another point into discussion. Training small models for complex tasks requires more data, and often, they need to be fully trained so that they can follow all instructions clearly. The instruction format changing during inference can be one of many drawbacks of QLoRA training.

Summary and Conclusion

In this article, we trained the Gemma 3n E2B model for German audio transcription and translation to English. Starting from the preparation of the translation data, the fine-tuning, to creating a Gradio application, we covered a lot. We also discussed some of the challenges of evaluating our model.

If you have any questions, thoughts, or suggestions, please leave them in the comment section. I will surely address them.

You can contact me using the Contact section. You can also find me on LinkedIn, and X.

Liked it? Take a second to support Sovit Ranjan Rath on Patreon!

Training Gemma 3n for Transcription and Translation

The German Transcription Dataset

The Project Directory Structure

Download Code

Installing the Dependencies

Creating the Ground Truth German to English Translation Data

Import Statements and Initializing the Together Client

Function to Define the Prompt and Get Response

Loading the Dataset and Making the API Calls

Save the Results in a CSV File

Training Gemma 3n for German Audio Transcription and Translation

Importing the Necessary Libraries for Training

Loading the Model

Preparing the Dataset

Preparing the Model, SFTTrainer, and SFTConfig for Training Gemma 3n for Transcription and Translation

Inference Using the Trained Model

Creating a Gradio Application for Gemma 3n Transcription and Translation

Key Takeaways

Summary and Conclusion

Leave a Reply Cancel reply