Continuing from the previous article, this week we will cover training the Gemma 3n model for audio transcription and translation. Gemma 3n models, although multimodal, are not adept at transcribing German audio. Furthermore, even after fine-tuning Gemma 3n for transcription, the model cannot correctly translate those transcriptions into English. That’s what we are targeting here: teaching the Gemma 3n model to transcribe and translate German audio samples, end-to-end.
Our tech stack will use a mix of Hugging Face libraries, Unsloth, and Together for the German to English translation dataset creation.
What we will cover while training Gemma 3n for transcription and translation:
- Discussing the dataset in brief.
- Setting up the environment.
- Preparing the dataset for transcription and translation training. How do we create the German to English translation data for fine-tuning the model?
- Training, inference, and a Gradio demo.
The German Transcription Dataset
We will use the same dataset from Hugging Face that we used in the last article for fine-tuning Gemma 3n for German audio transcription.
The dataset is available on Hugging Face and it contains around 12000 samples. I recommend going through the previous article to know a bit more about the dataset.
The Project Directory Structure
Let’s take a look at the project directory structure before jumping into the coding part.
├── gemma-3n-finetuned
├── outputs
├── gemma3n_e2b_german_translate_finetune.ipynb
├── german_to_english.csv
├── requirements.txt
├── app.py
└── german_to_english_translate.ipynb
- The gemma-3n-finetuned and outputs directories contain the final and intermediate model outputs from training.
- The gemma3n_e2b_german_translate_finetune.ipynb file is the Jupyter Notebook containing the code for training the model for transcription and translation.
- In the german_to_english_translate.ipynb Jupyter Notebook, we have the code to create the translation dataset. The final translation samples are present in the german_to_english.csv file.
- The app.py Python file contains the Gradio application that we will create after training the model.
All the Jupyter Notebooks, trained adapter weights, and the requirements file are available via the download section.
Installing the Dependencies
Just as we discussed in the previous post, we need to install the correct versions of all libraries for the training and inference to work correctly.
When working locally, you can install Unsloth first using the following command:
pip install unsloth unsloth-zoo
The above will install the latest versions of Unsloth and Unsloth Zoo, which should work in most cases. If you face issues, pin the installation to the following versions, which were used for the codebase in this article.
unsloth==2025.8.1 unsloth-zoo==2025.8.1
Next, install the rest of the libraries.
pip install -r requirements.txt
The above will also install the Together SDK that we need for creating the ground truth translation data.
Creating the Ground Truth German to English Translation Data
The very first step is creating the ground truth data for German to English translation. Without this, we cannot teach the model both transcription and translation.
For this, we will use the Together Serverless API service. The linked article also discusses free credits, free models, and making API calls for text and image generation, along with the setup steps.
For creating the translation dataset, we will make API calls to the Meta-Llama-3.1-8B-Instruct-Turbo model. In case you are short on credits and cannot create the dataset, the downloadable codebase already comes bundled with the CSV file.
The code for this is present in the german_to_english_translate.ipynb Jupyter Notebook. Let’s go through the code briefly, as it is mostly self-explanatory.
Import Statements and Initializing the Together Client
Let’s import all the libraries that we need and initialize the Together AI client.
from together import Together
from dotenv import load_dotenv
from datasets import load_dataset
from tqdm.auto import tqdm

import time
import pandas as pd

load_dotenv()

client = Together()
Before executing the above code, make sure to create a .env file and add your Together API key with the variable TOGETHER_API_KEY.
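A small guard can make a missing key obvious before any API calls are made. The following is only a sketch; the helper name require_env is hypothetical and not part of the original notebook or the Together SDK.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name, '').strip()
    if not value:
        raise RuntimeError(
            f'{name} is not set. Add it to your .env file or export it in the shell.'
        )
    return value
```

Calling require_env('TOGETHER_API_KEY') right after load_dotenv() would surface the problem immediately instead of somewhere inside the data creation loop.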
Function to Define the Prompt and Get Response
The following function accepts a prompt message that contains the German transcription from the dataset, sends it to the model, and returns the text of the response.
def get_response(message):
    response = client.chat.completions.create(
        model='meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo',
        messages=[
            {
                'role': 'user',
                'content': message
            }
        ]
    )
    return response.choices[0].message.content
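Since the data creation runs for many hours, a single transient API error could end it early. One possible safeguard, sketched below, is a generic retry wrapper with exponential backoff. The helper call_with_retries and its parameters are assumptions for illustration and do not appear in the original notebook.

```python
import time


def call_with_retries(fn, *args, max_attempts=4, base_delay=2.0, sleep=time.sleep):
    """Call fn(*args), retrying with exponential backoff on any exception."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args)
        except Exception:
            # Give up and re-raise once the last attempt has failed.
            if attempt == max_attempts:
                raise
            # Wait base_delay, then 2x, then 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

In the loop, get_response(message) could then be replaced with call_with_retries(get_response, message). The sleep argument is injectable only so that the wrapper can be tested without actually waiting.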
Loading the Dataset and Making the API Calls
Let’s load the dataset, create lists to store the data, and make the API calls.
dataset = load_dataset('kadirnar/Emilia-DE-B000000')
german_sentences = []
english_sentences = []
sample_ids = []
for i, data in tqdm(enumerate(dataset['train']), total=len(dataset['train'])):
    # if i == 3:
    #     break

    text = data['text']
    sample_id = data['_id']

    message = (
        'Translate this sentence from German to English. Give no other text:\n'
        f'German: {text}\n'
        f'English: '
    )

    response = get_response(message)

    german_sentences.append(text)
    english_sentences.append(response)
    sample_ids.append(sample_id)

    time.sleep(2)
We have a sleep time of 2 seconds between each API call to avoid rate limit issues. The entire process took around 12 hours, along with a credit cost of around $2.
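To see where the time goes: with around 12000 samples and a 2-second pause per call, the sleep alone accounts for roughly 6.7 hours, and the remainder is time spent waiting for responses. The sketch below does this arithmetic. The 1.6-second latency in the usage note is not a measured value, only the figure implied by the numbers above, and estimate_hours is a hypothetical helper.

```python
def estimate_hours(num_samples: int, sleep_s: float, latency_s: float) -> float:
    """Estimate total runtime in hours for a sequential loop of API calls."""
    seconds_per_sample = sleep_s + latency_s
    return num_samples * seconds_per_sample / 3600
```

For example, estimate_hours(12000, 2, 0) gives the sleep-only lower bound, while estimate_hours(12000, 2, 1.6) comes out at about 12 hours. Reducing the sleep time is therefore the easiest lever, as long as the rate limits allow it.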
Larger models like Llama 70B will give better translation results, but will be much slower in response and will also be costlier.
Save the Results in a CSV File
Finally, we save the results in a CSV file.
df = pd.DataFrame(columns=['_id', 'german', 'english'])
df['_id'] = sample_ids
df['german'] = german_sentences
df['english'] = english_sentences
df.to_csv('german_to_english.csv', index=False)
We also save the IDs in the _id column that we can use as the index column later.
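Because the translations come from a language model, it may be worth scanning the CSV for rows that look off before training on them. The sketch below uses two simple heuristics that are assumptions on my part, not checks from the original notebook: an empty translation, or a translation that spans several lines (which can indicate extra text despite the "Give no other text" instruction). The function name find_suspect_rows is hypothetical.

```python
import csv


def find_suspect_rows(csv_path: str) -> list[str]:
    """Return the _id of rows whose translation is empty or spans several lines."""
    suspect_ids = []
    with open(csv_path, newline='', encoding='utf-8') as f:
        for row in csv.DictReader(f):
            english = (row.get('english') or '').strip()
            # Heuristic: flag missing translations and multi-line answers.
            if not english or '\n' in english:
                suspect_ids.append(row['_id'])
    return suspect_ids
```

Running find_suspect_rows('german_to_english.csv') returns a list of IDs that can be inspected manually or sent through the API again.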
Training Gemma 3n for German Audio Transcription and Translation
Let’s move over to the primary objective of this article: training the Gemma 3n model for transcription and translation.
The code is present in the gemma3n_e2b_german_translate_finetune.ipynb Jupyter Notebook. The first few cells contain the code for the installation of libraries in case you are running on Colab or Kaggle. We are skipping them here.
The code remains largely similar to the previous article, with changes mainly in the dataset preparation steps. So, we will keep the discussion of the other parts brief.
Importing the Necessary Libraries for Training
The following code cell contains all the necessary imports.
from unsloth import FastModel
from huggingface_hub import snapshot_download
from datasets import load_dataset, Audio
from IPython.display import Audio, display
from transformers import WhisperProcessor
from evaluate import load
from trl import SFTTrainer, SFTConfig
from transformers import TextStreamer
from functools import partial

import torch
import pandas as pd

# Because of FailOnRecompileLimitHit: recompile_limit reached with one_graph=True
# Solution found here => https://github.com/huggingface/transformers/issues/39427
torch._dynamo.config.cache_size_limit = 32
On RTX GPUs, we may face torch._dynamo.exc.FailOnRecompileLimitHit. The last line of code in the above code block handles that.
Loading the Model
Next, we load the Gemma 3n E2B model in 4-bit quantized format.
model, processor = FastModel.from_pretrained(
    model_name='unsloth/gemma-3n-E2B-it-unsloth-bnb-4bit',
    dtype=None,
    max_seq_length=1024,
    load_in_4bit=True,
    full_finetuning=False
)
Preparing the Dataset
Loading and preparing the dataset for transcription and translation training is the core of this article. We will tackle this next.
# Load dataset.
dataset = load_dataset('kadirnar/Emilia-DE-B000000', split='train')
# Divide into train and test split.
train_samples = 11500
train_dataset = dataset.select(range(train_samples))
test_dataset = dataset.select(range(train_samples, len(dataset)))
Just like in the previous article, we use the first 11500 samples for training and keep the rest for testing.
The next step involves creating the data formatting function.
def format_intersection_data(samples: dict, df: pd.DataFrame) -> dict[str, list]:
    """Format intersection dataset to match expected message format"""
    formatted_samples = {'messages': []}
    for idx in range(len(samples['audio'])):
        # Extract audio and text data from HF dataset.
        audio = samples['audio'][idx]['array']
        label = str(samples['text'][idx])

        # Extract translation data from the CSV.
        _id = str(samples['_id'][idx])
        eng_data = df.loc[_id, 'english']
        # de_data = df.loc[_id, 'german']

        # print('Orig german: ', label)
        # print('DF german: ', de_data)
        # print('DF english: ', eng_data)

        content_assistant_text = (
            f'GERMAN TRANSCRIPTION: {label}\n'
            f'ENGLISH TRANSLATION: {eng_data}'
        )

        message = [
            {
                'role': 'system',
                'content': [
                    {
                        'type': 'text',
                        'text': 'You are an assistant that transcribes and translates speech accurately.',
                    }
                ],
            },
            {
                'role': 'user',
                'content': [
                    {'type': 'audio', 'audio': audio},
                    {'type': 'text', 'text': 'Please transcribe this audio and then translate from German to English.'}
                ]
            },
            {
                'role': 'assistant',
                'content':[{'type': 'text', 'text': content_assistant_text}]
            }
        ]
        # print(message)
        formatted_samples['messages'].append(message)
    return formatted_samples
In this case, along with the German transcription text, we also pass the English translated text to the assistant. This is what the model will predict. We have also carefully tuned the system and user prompts so that the model knows the task clearly.
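Since the assistant turn always follows the same two-label layout, a generated answer can be split back into its parts at inference time. The following is a minimal sketch under the assumption that the fine-tuned model reproduces the GERMAN TRANSCRIPTION and ENGLISH TRANSLATION labels exactly. The helper parse_model_output is hypothetical and not part of the original code.

```python
def parse_model_output(text: str) -> tuple[str, str]:
    """Split a generated answer into (german_transcription, english_translation)."""
    de_tag = 'GERMAN TRANSCRIPTION:'
    en_tag = 'ENGLISH TRANSLATION:'
    if de_tag not in text or en_tag not in text:
        raise ValueError('Output does not follow the expected two-label format.')
    # Keep everything after the German label, then cut at the English label.
    after_de = text.split(de_tag, 1)[1]
    german, english = after_de.split(en_tag, 1)
    return german.strip(), english.strip()
```

This can be handy when computing a transcription metric on the German part alone, or when showing the two outputs in separate fields of a demo application.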
Let’s load the CSV file and map the datasets to the above formatting function.
df = pd.read_csv('german_to_english.csv')
df = df.set_index('_id')
train_dataset = train_dataset.map(
    partial(format_intersection_data, df=df),
    batched=True,
    batch_size=4,
    num_proc=8
)
test_dataset = test_dataset.map(
    partial(format_intersection_data, df=df),
    batched=True,
    batch_size=4,
    num_proc=8
)
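Each df.loc[_id, 'english'] lookup in the formatting function raises a KeyError if an ID has no row in the CSV, which would stop the map partway through. Before running the mapping, it may help to confirm that every dataset ID is covered. The helper below is a hypothetical sketch that works on any two iterables of IDs.

```python
def missing_ids(dataset_ids, csv_ids) -> list[str]:
    """Return dataset IDs (as strings) that have no row in the translation CSV."""
    known = {str(i) for i in csv_ids}
    return [str(i) for i in dataset_ids if str(i) not in known]
```

With the objects above, it could be called as missing_ids(dataset['_id'], df.index), and an empty list means all samples have a translation. Note that it compares the string form of the IDs, so it will not detect a dtype mismatch between the dataset IDs and the DataFrame index.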
This is what the samples look like after they are formatted for model training.
user
You are an assistant that transcribes and translates speech accurately.
Please transcribe this audio and then translate from German to English.
model
GERMAN TRANSCRIPTION: Das dritte Gate, da ist das vierte Gate. Da ist das Joban Eerie, da muss ich hin.
ENGLISH TRANSLATION: The third gate, that is the fourth gate. That is the Joban Eerie, I must go there.


