Instruction Tuning OPT-125M

Large language models are pretrained on terabytes of text data. However, the pretraining dataset and strategy only teach the model to generate the next token or word, which by itself is of limited practical use: in the end, we want to accomplish a task with the LLM, either through chat or instructions. We can achieve this by fine-tuning the LLM, a process generally called instruction tuning. To this end, in this article, we will use the OPT-125M model for instruction tuning.

Figure 1. Output sample after instruction tuning OPT-125M on the Open Assistant Guanaco dataset.

Note: This is the very first article on tuning/using Language Models/Large Language Models. Here, we will not focus much on theory, but rather on the dataset, the code, and understanding the pipeline so you can get started right away. Specifically, we will use a sample of the Open Assistant dataset along with the Hugging Face Supervised Fine-Tuning Trainer.

We will cover the following topics in this article:

  • We will start with a short discussion of the dataset that we will use for instruction tuning of the OPT-125M model.
  • Next, we will discuss various ways to prepare the dataset for instruction tuning a language model.
  • Then, we will move to the coding part. Here, we will cover:
    • Dataset and training configurations.
    • Dataset preparation.
    • Training the OPT-125M model using the Hugging Face SFT Trainer.
  • Finally, we will carry out inference using the trained model and finish with some potential improvement points.

The Open Assistant Guanaco Dataset

In this article, we will use the Open Assistant Guanaco dataset by Tim Dettmers, available on Hugging Face. It is a subset of the popular Open Assistant dataset for training LLMs to follow instructions.

The dataset that we will use contains roughly 10,000 samples: around 9,800 for training and 500 for validation. Unlike the original dataset, it contains only the text field, which makes it easier to experiment with.

For example, here is a sample from the dataset.

Figure 2. A human-assistant instruction sample from the Open Assistant Guanaco dataset.

As we can see, these are conversations between a human and a language model (assistant). The ### Human: tag precedes each human instruction/chat, and the LLM’s answer is preceded by the ### Assistant: tag. These tags or separators are crucial for teaching the model which part of the conversation it should generate.

There are hundreds of instruction-following/assistant/chat datasets on Hugging Face right now, and the above is just one format. Throughout the next few articles, we will explore more varied datasets.

Beyond chat, we can also train LLMs for specific tasks like text classification, question answering, and even text summarization.


Ways to Prepare Dataset for Instruction Tuning a Model

There are several ways to prepare a dataset for instruction tuning of the OPT-125M model. However, as we will be using the Supervised Fine-Tuning (SFT) Trainer pipeline here, there are two primary methods.

One is per-sample padding, where we pad each row up to the maximum context length that we provide. This means that if the batch size is 8, there will be 8 samples in a batch, and each sample will be padded independently with the tokenizer’s padding token. With the Hugging Face SFT pipeline, we can achieve this by passing packing=False when initializing the trainer.

Here is an example of the same.

Figure 3. Padded text sample when preparing instruction tuning dataset.

The next method is continuous random sampling from the dataset, just as in LLM pretraining. Here, we sample rows from the dataset randomly, append the end-of-sequence (EOS) token after each row’s text, and keep adding samples until the context length is reached. In this case, if the batch size is 8, each sample in a batch can contain chats from more than one row, separated by the EOS token. Here is an example to make the concept clearer.

Figure 4. Packed sample when preparing instruction tuning dataset.

In the above example, </s> is the EOS token. We can achieve this behavior by passing packing=True when initializing the SFT Trainer pipeline.

It is worth noting that in both cases, the model learns to predict the next token. However, in most cases, we will get better results with the latter method, as the model always gets to predict tokens from the real conversation rather than pad tokens, as may happen with the former.

In this tutorial, we will cover the usage of the latter method.
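To make the packing idea concrete, the following is a minimal sketch of the kind of chunking the trainer performs internally. It assumes rows that are already tokenized into lists of token IDs; the function name pack_samples and the detail of dropping the final partial chunk are illustrative, not the exact trl implementation.

def pack_samples(tokenized_rows, eos_token_id, context_length):
    """Concatenate tokenized rows, separated by EOS tokens, and
    yield fixed-length chunks of `context_length` tokens."""
    buffer = []
    for row in tokenized_rows:
        buffer.extend(row + [eos_token_id])
        # Emit a chunk whenever the buffer holds enough tokens.
        while len(buffer) >= context_length:
            yield buffer[:context_length]
            buffer = buffer[context_length:]

# Toy example: three "rows" of token IDs, EOS ID 2 (OPT's </s>),
# and a context length of 8.
rows = [[10, 11, 12], [20, 21], [30, 31, 32, 33]]
for chunk in pack_samples(rows, eos_token_id=2, context_length=8):
    print(chunk)  # [10, 11, 12, 2, 20, 21, 2, 30]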

Project Directory Structure

Let’s take a look at the directory structure of the project.

├── outputs
│   ├── opt_openassistant_guanaco_packed
│   │   ├── best_model
│   │   │   ├── config.json
│   │   │   ├── generation_config.json
│   │   │   ├── merges.txt
│   │   │   ├── model.safetensors
│   │   │   ├── special_tokens_map.json
│   │   │   ├── tokenizer_config.json
│   │   │   └── vocab.json
│   │   └── logs
│   │       ├── checkpoint-1358
│   │       ├── checkpoint-3390
│   │       └── runs
│   └── opt_openassistant_guanaco_per_sample_padding
│       ├── best_model
│       │   ├── config.json
│       │   ├── generation_config.json
│       │   ├── merges.txt
│       │   ├── model.safetensors
│       │   ├── special_tokens_map.json
│       │   ├── tokenizer_config.json
│       │   └── vocab.json
│       └── logs
│           ├── checkpoint-1231
│           ├── checkpoint-3070
│           └── runs
├── inference.ipynb
├── opt_openassistant_guanaco_packed_text.ipynb
└── opt_openassistant_guanaco_per_sample_padding.ipynb
  • In the root project directory, we have three notebooks: two training notebooks, one for packed sample training and one for per sample padded training, plus an inference notebook. Although we cover the packed sample training here, both training notebooks are available for readers to download.
  • The outputs directory contains the best model and logs.

All the notebooks and the packed sample training models are available for download via the download section.

Download Code

Libraries and Dependencies

We will use PyTorch as the base framework here. On top of that, there are several Hugging Face libraries that we need for running the notebooks successfully. They are:

  • accelerate
  • transformers
  • trl
  • datasets

You can use the requirements.txt file that comes with the code base to install all of them in one go.

pip install -r requirements.txt
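If you do not have the requirements file handy, installing the four libraries directly should also work; note that exact versions may matter, since the SFTTrainer arguments used below have changed across trl releases.

pip install accelerate transformers trl datasets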

That’s all the setup we need; now we can focus on the coding section.

Instruction Tuning OPT-125M

As discussed earlier, we will use the OPT-125M model for instruction tuning. This model is from Meta AI, published in the paper OPT: Open Pre-trained Transformer Language Models. It is the smallest model in the family of OPT models and is perfect for starting and experimenting with instruction tuning.

If you wish to know how to fine-tune a GPT-2 model for text generation, check out the previous article where we fine-tune DistilGPT2 to generate detective stories.

All the code in the following training sections is present in the opt_openassistant_guanaco_packed_text.ipynb notebook.

Import Statements

Let’s start with importing the necessary modules and libraries.

import os
import torch

from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    pipeline,
    logging,
)
from trl import SFTTrainer

The SFTTrainer class from the trl library is the most important one here. This library is specifically meant for fine-tuning LLMs.

On a side note, if we can prepare the dataset carefully, then we can also use the default Trainer API to fine-tune language models in a supervised manner for chat and instruction following. We will cover that in a future post.

Training and Dataset Hyperparameters

The following are all the configurations that we are going to use for dataset preparation and training our OPT-125M model for instruction following.

epochs = 10
batch_size = 8
num_workers = os.cpu_count()
bf16 = False
fp16 = True
gradient_accumulation_steps = 4
context_length = 512
learning_rate = 0.0001
model_name = 'facebook/opt-125m'
out_dir = 'outputs/opt_openassistant_guanaco'

Most of them are self-explanatory. Let’s go through some of the more important ones:

  • bf16 and fp16: Instead of training in full precision (FP32), using either BF16 or FP16 consumes less GPU memory and often trains faster. Newer NVIDIA GPUs (RTX series/Ampere architecture and later) support BF16, which is numerically more stable than FP16 training. If you are training on GTX GPUs, or on a P100 or T4, consider using FP16. Only one of the two can be True at a time.
  • gradient_accumulation_steps: Although we specify a batch size, we can additionally accumulate gradients over several batches before a backpropagation step. With gradient_accumulation_steps, we tell the trainer API how many batches to accumulate before backpropagating. So, in this case, our effective batch size becomes 8*4=32. This makes training more stable, and in LLM training a larger batch size often results in a better training run.
  • context_length: Each sample (with the packing option that we use) can be many tokens long, and the GPU may not be able to accommodate very long samples at a time. The context length tells the tokenizer where to truncate samples and which samples to pad when they are shorter than the context length.
  • model_name: This is the tag of the OPT-125M model on the Hugging Face Hub.

Loading the Open Assistant Guanaco Dataset

Using the load_dataset function, it is extremely simple to load any dataset available on the Hugging Face Hub.

dataset = load_dataset('timdettmers/openassistant-guanaco')

print(dataset)

Printing the dataset shows the number of samples and the splits that it contains.

Figure 5. Open Assistant Guanaco dataset split.

Now, let’s check one sample from the training set.

print(dataset['train']['text'][0])

This is a truncated part of the larger human-assistant chat.

### Human: Can you write a short introduction about the relevance of the term 
"monopsony" in economics? Please use examples related to potential monopsonies 
in the labour market and cite relevant research.### Assistant: "Monopsony" 
refers to a market structure where there is only one buyer for a particular 
good or service. In economics, this term is particularly relevant in the labor
 market, where a monopsony employer has significant power over the wages and 
working conditions of their employees. The presence of a monopsony can result 
in lower wages and reduced employment opportunities for workers, as the 
employer has little incentive to increase wages or provide better working conditions.

As we can see, each sample starts with a ### Human: tag, and each assistant answer starts with a ### Assistant: tag. There may be several instances where one conversation continues in the following row. I highly recommend going through some of the samples in the dataset to get a clearer picture of the structure and flow.
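A quick way to do that is to print the beginning of a few training samples; the slice size and character count below are arbitrary choices for illustration.

for text in dataset['train']['text'][:3]:
    print(text[:300])  # Only the first 300 characters of each sample.
    print('-' * 50)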

Preparing the OPT-125M Model for Instruction Tuning

Next, we need to load the OPT-125M model.

if bf16:
    model = AutoModelForCausalLM.from_pretrained(model_name).to(dtype=torch.bfloat16)
else:
    model = AutoModelForCausalLM.from_pretrained(model_name)

In the above code block, we load the model based on whether we choose FP16 or BF16 in the configurations. Further, the following is the architecture and the number of parameters in the model.

Figure 6. Model architecture and number of parameters of OPT-125M.
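If you want to verify the parameter count shown above yourself, a one-liner over the model’s parameters does it:

total_params = sum(p.numel() for p in model.parameters())
print(f'{total_params:,} total parameters')  # Roughly 125 million for OPT-125M.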

Loading the Tokenizer

When using the SFTTrainer API, we simply need to load the tokenizer and need not tokenize any samples ourselves. The trainer API will handle that on its own.

tokenizer = AutoTokenizer.from_pretrained(
    model_name, 
    trust_remote_code=True,
    use_fast=False
)

That’s all we need to do for the tokenization here.
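As a quick sanity check, we can also confirm the special tokens the tokenizer will use. For OPT, the EOS token is </s>, the same separator we saw in the packed samples earlier:

print(tokenizer.eos_token, tokenizer.pad_token)  # Expected: </s> <pad>
print(tokenizer.eos_token_id)  # Expected: 2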

Note that we are not covering the details of tokenizers here. However, the following articles can help you understand the concepts behind creating a simple tokenizer for next character & next word generation using LSTM, and next word generation using Transformers.

Defining the Training Arguments, the Trainer API, and Training the Model

The final step is training the OPT-125M model to follow instructions. But first, we need to define the training arguments and the trainer API.

The following code block initializes the TrainingArguments with all the desired configurations.

training_args = TrainingArguments(
    output_dir=f"{out_dir}/logs",
    evaluation_strategy='epoch',
    weight_decay=0.01,
    load_best_model_at_end=True,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    logging_strategy='epoch',
    save_strategy='epoch',
    save_total_limit=2,
    bf16=bf16,
    fp16=fp16,
    report_to='tensorboard',
    dataloader_num_workers=num_workers,
    gradient_accumulation_steps=gradient_accumulation_steps,
    learning_rate=learning_rate,
    lr_scheduler_type='constant',
    num_train_epochs=epochs
)

Next is initializing the SFTTrainer.

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset['train'],
    eval_dataset=dataset['test'],
    dataset_text_field='text',
    max_seq_length=context_length,
    tokenizer=tokenizer,
    args=training_args,
    packing=True
)

This part is important when using the SFT Trainer API. Let’s go through the arguments.

  • train_dataset and eval_dataset: These are the training and test splits of the dataset at hand. However, they can also be different dataset instances, in case we prepare a custom dataset.
  • dataset_text_field: If we look at the dataset instance, we can observe that each sample sits in the text field/column. This argument tells the trainer which field to extract the data from when a dataset contains multiple fields.
  • packing: As discussed at the beginning of the article, we are using packing=True, which appends samples together until the context_length is reached.

There is one important point here. We did not prepare the dataset manually, so we do not know how the sequences look before being fed to the model. However, we can check that using the get_train_dataloader method of the trainer instance.

dataloader = trainer.get_train_dataloader()
for i, sample in enumerate(dataloader):
    print(tokenizer.decode(sample['input_ids'][0]))
    print('#'*50)
    if i == 5:
        break

This code block prints the first sample from each of the first six batches of the prepared dataset.

Here is a sample.

 iPhone oder iPad herunterladen und dort direkt mit einem Support-Mitarbeiter 
chatten oder telefonieren. Die Apple Support-App bietet auch eine bequeme 
Möglichkeit, den Reparaturstatus deiner Geräte zu überwachen und Termine im 
Apple Store zu vereinbaren.

Ich hoffe, diese Informationen sind hilfreich für dich!</s></s>### Human: I 
am trying to learn Portal 2 speedrunning. Can you help me with some of the 
basics, such as bhopping?### Assistant: Sure, I can help you with that!

Bhopping, or bunny hopping, is a technique in Portal 2 speedrunning where 
you jump repeatedly in order to maintain momentum and move faster. Here are 
some basic steps to follow:

1. Begin by moving forward and jumping. As you land, jump again immediately.
2. As you jump, strafe to one side using the A or D key on your keyboard. 
This will help you maintain your speed.
3. As soon as you land, jump again and strafe to the other side using the opposite key.
4. Repeat this pattern, jumping and strafing back and forth, to maintain your speed.

It can take some practice to get the timing right, but once you master the 
technique, it can be very useful for quickly moving through levels in Portal 2.

Here are a few additional tips for successful bhopping:

- Try to time your jumps so that you land just as your previous jump ends. 
This will help you maintain your speed and avoid losing momentum.
- Use your mouse to adjust your view as you jump, so that you can see where 
you're going and avoid obstacles.
- Be careful not to jump too often or you may lose control of your character 
and slow down.
- Practice on different surfaces, as some surfaces may be easier to bhop on than others.

Good luck with your speedrunning!### Human: I cannot, sadly.</s></s>### Human: Tại 
sao con đường thông thương cổ đại được gọi là “Con đường t

There are some important points to observe here.

  • Several samples have been appended randomly.
  • After each conversation ends, the end of sequence token (</s>) is appended by the tokenizer. This is important during the training phase, as it tells the model where a conversation ends.

Start the Training

Now, we simply need to call the train method to start the training.

history = trainer.train()

Here are the logs.

Figure 7. OPT-125M instruction tuning training logs.

The model reached the lowest validation loss on the second epoch. The following is the validation loss graph from the training run.

Figure 8. Evaluation loss during instruction tuning training of OPT-125M.

It looks like we can use a learning rate scheduler and a larger dataset to train an even better model. In case you run the experiments, use the comment section to let others know of your findings.
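For example, switching from the constant schedule to a cosine schedule with warmup only requires changing the scheduler arguments of TrainingArguments; the warmup ratio below is a common starting value, not a tuned one.

training_args = TrainingArguments(
    output_dir=f"{out_dir}/logs",
    lr_scheduler_type='cosine',  # Instead of 'constant'.
    warmup_ratio=0.03,  # Warm up over the first ~3% of steps (illustrative value).
    # ...all remaining arguments unchanged from the block shown earlier...
)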

Finally, we save the best model and the tokenizer to disk.

model.save_pretrained(f"{out_dir}/best_model")
tokenizer.save_pretrained(f"{out_dir}/best_model")

We are done with the training. Now we can move on to the inference section.

Inference using the Instruction Tuned OPT-125M

The inference code resides in the inference.ipynb notebook, and it is extremely simple to load the model and run inference.

First, we import all the necessary modules, define the computation device, and load the trained model & the tokenizer.

from transformers import (
    AutoModelForCausalLM, 
    logging, 
    pipeline,
    AutoTokenizer
)

import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = AutoModelForCausalLM.from_pretrained('outputs/opt_openassistant_guanaco_packed/best_model/')
tokenizer = AutoTokenizer.from_pretrained('outputs/opt_openassistant_guanaco_packed/best_model/')

Second, we initialize the text generation pipeline. The Transformers library makes it seamless to generate text using any fine-tuned model.

pipe = pipeline(
    task='text-generation', 
    model=model, 
    tokenizer=tokenizer, 
    max_length=300, # Prompt + new tokens to generate.
    device=device
)

We pass the task, the model & tokenizer, the maximum length (prompt plus newly generated tokens), and the computation device as arguments.

Third, we need to define a prompt template. Remember that our dataset had the conversation in a specific format. We need to replicate the same format during inference to get a proper response from the instruction tuned OPT-125M model.

template = '### Human: {}### Assistant:'

Later, we will format the text with the human prompt.

Finally, define a prompt, format the prompt template, and generate the text.

prompt = 'Write a few tips for staying healthy.'
prompt = template.format(prompt)

outputs = pipe(
    prompt, 
    do_sample=True, 
    temperature=0.7, 
    top_k=50, 
    top_p=0.95,
    repetition_penalty=1.1,
)
print(outputs[0]['generated_text'])

We also pass some additional arguments to the text generation pipeline.

  • do_sample: This argument enables different sampling methods during text generation. I would recommend going through the documentation for a thorough overview, as explaining each sampling method is beyond the scope of this article.
  • temperature: A higher temperature adds variation to the generated text.
  • top_k and top_p: Both of these are sampling methods and are closely connected with the do_sample argument.
  • repetition_penalty: Sometimes, models keep repeating the same text during generation. We can provide a repetition penalty (a value higher than the default 1.0) to avoid repetition in the generated content.
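As a side note, if you need reproducible output while debugging, you can disable sampling entirely; greedy decoding is less varied but deterministic:

outputs = pipe(prompt, do_sample=False, repetition_penalty=1.1)
print(outputs[0]['generated_text'])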

Here are the points generated by the model for the above prompt.

Figure 9. Inference result when prompting the instruction tuned OPT-125M model.

The model generates some really good points for staying healthy. The only issue is that the model does not finish the conversation properly. Using a larger model or a larger dataset, or training the current model for longer, may rectify this. We can also add some manual post-processing to remove the unfinished sentences.
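Here is a minimal post-processing sketch for the unfinished-conversation issue, assuming the model tends to start a new ### Human: turn instead of stopping cleanly. The extract_reply helper is a heuristic for illustration, not part of the training pipeline.

def extract_reply(generated_text):
    """Keep only the first assistant reply from the generated text."""
    # Everything after the first assistant tag...
    reply = generated_text.split('### Assistant:', 1)[-1]
    # ...and before any follow-up human turn the model starts on its own.
    reply = reply.split('### Human:', 1)[0]
    return reply.strip()

print(extract_reply(outputs[0]['generated_text']))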

Further Steps

  • For further experiments, we can add learning rate scheduling, use larger context lengths, and train for longer.
  • We trained a smaller model (OPT-125M) for instruction tuning. We can also try larger models like OPT-350M and check their performance on instruction tuning.

Summary and Conclusion

In this article, we trained the OPT-125M model for instruction tuning. Starting from the discussion of the dataset and the text format, through training and inference, we covered each part in detail. We also discussed some further steps that can be taken to train a better model. I hope this article was worth your time.

If you have any doubts, thoughts, or suggestions, please leave them in the comment section. I will surely address them.

You can contact me using the Contact section. You can also find me on LinkedIn, and Twitter.
