Grammar Correction using Hugging Face Transformers T5 on FCE Dataset


Grammar Correction is one of the major problems in NLP (Natural Language Processing). Tools like Grammarly that help with automated grammar correction are invaluable in modern online writing. A lot of online tools like Grammarly pop up almost every few months. And guess what? They are all powered by AI, or NLP to be precise. But how do they work? Although it is difficult to pinpoint how grammar correction tools work, we can take some safe guesses. Most probably, they have a Transformer model under the hood. Grammarly has an official blog post on how Transformers help in GEC (Grammatical Error Correction). In this article, although we will not be building any state-of-the-art grammar correction model, we will train a very simple model using T5.

Figure 1. Example of grammar correction using T5.

We covered spelling correction using T5 in one of the previous articles. It was a minimal example to show how we can use Transformers for spelling correction. Similarly, in this article, we will touch upon every point briefly. This includes the dataset, the dataset preparation process, and the training. Our main focus is on creating a working solution with a code-first approach to grammar correction using Transformers. This will lead the way to future articles where we will dive deeper into this topic.

We will cover the following points for grammar correction using Hugging Face Transformers:

We will start with the dataset discussion. To be precise, we will use the First Certificate in English (FCE) dataset in this article.
Next is the dataset preparation part. We need to prepare the dataset in such a way that we can feed it to the T5 model easily for training.
Then comes the training of the T5 Transformer model for grammar correction.
Finally, we will run inference using the trained model.

The FCE Dataset

The FCE (First Certificate in English) dataset is a subset of the Cambridge Learner Corpus (CLC). It is a part of the Building Educational Applications 2019 Shared Task: Grammatical Error Correction competition. The website hosts other datasets as well but we are interested in the FCE v2.1 under the Data section.

Downloading and extracting the dataset will reveal the following format.

fce_v2.1.bea19
└── fce
    ├── json
    │   ├── fce.dev.json
    │   ├── fce.test.json
    │   └── fce.train.json
    ├── json_to_m2.py
    ├── licence.txt
    ├── m2
    │   ├── fce.dev.gold.bea19.m2
    │   ├── fce.test.gold.bea19.m2
    │   └── fce.train.gold.bea19.m2
    └── readme.txt

All the data files will be extracted into the fce_v2.1.bea19/fce directory. It contains a json and an m2 subdirectory. We will work only with the JSON format of the dataset in the json subdirectory.

It contains a training, a dev, and a test set. Now, let’s look at an example from the training set.

{"text": "Dear Sir or Madam,\n\nI am writing in order to express my 
disappointment about your musical show \"Over the Rainbow\".\n\nI 
saws the show's advertisement hanging up of a wall in London where 
I was spending my holiday with some friends. I convinced them to go
there with me because I had heard good references about your 
Company and, above all, about the main star, Danny Brook.\n\nThe 
problems started in the box office, where we asked for the discounts
you announced in the advertisement, and the man who was selling 
the tickets said that they didn't exist.\n\nMoreover, the show was 
delayed forty-five minutes and the worst of all was that Danny Brook
had been replaced by another actor.\n\nOn the other hand, the 
theatre restaurant was closed because unknown reasons.\n\nYou 
promised a perfect evening but it became a big disastrous!\n\nI 
would like some kind of explanation and receive my money back. If 
you don't agree, I will act consequently.\n\nI look forward to 
hearing from you.\n\nYours faithfully,", "age": "21-25", "q": "1",
"script-s": "31", "edits": [[0, [[71, 76, "with", "RT"], 
[118, 122, "saw", "IV"], [159, 161, "on", "RT"], [292, 302, 
"reviews", "RN"], [303, 308, "of", "RT"], [338, 343, 
"because of", "RT"], [394, 396, "at", "RT"], [681, 698, 
"In addition", "ID"], [734, 741, "for", "R"], 
[811, 821, "disaster", "DN"], [866, 873, "to get", "FV"], 
[920, 932, "", "UY"]]]], "l1": "ca", "id": "TR1*0102*2000*01", 
"answer-s": "4.3"}

Each sample is in a dictionary format with several key-value pairs. Among them, we are interested in the "text" and "edits" key-value pairs.

The "text" key contains the essay as it was originally written by the author with a few grammatical errors. The "edits" key contains the correction edits in the following format:

[[annotator_id, [[char_start_offset, char_end_offset, correction], …]], …]

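We are most interested in the char_start_offset, char_end_offset, and correction values. The first two are character offsets into the "text" string that mark the span to replace. The end offset is exclusive, and each newline symbol ('\n') counts as one character. For example, in the first edit, the characters from 71 to 76 correspond to the word about, which should be replaced with with. In the sample above, each edit also carries a fourth element (for example, "RT"). This is an error-type code that we will not use.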
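To make the offsets concrete, here is a minimal sketch that applies the first edit from the sample above to the opening of the essay. The variable names are chosen only for illustration.

```python
# Opening of the sample essay; each "\n" is a single character in the string.
text = (
    "Dear Sir or Madam,\n\nI am writing in order to express my "
    "disappointment about your musical show"
)

# First edit from the sample: [71, 76, "with", "RT"]
start, end, correction = 71, 76, "with"

# The end offset is exclusive, so a plain Python slice selects the span.
wrong_span = text[start:end]
corrected = text[:start] + correction + text[end:]

print(repr(wrong_span))
print(corrected)
```

The slice picks out the word about, and the corrected string reads "disappointment with your musical show".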

In total, there are 2116 samples in the training set, 159 samples in the dev set, and 194 samples in the test set.

In its current format, the dataset is not easy to feed to the T5 model. For this reason, we will first preprocess it into a simpler format that is easier to work with when preparing the data for training. We will do this in the coding section of the article.

For now, you can go ahead and download the dataset.

Project Directory Structure

Let’s take a look at the complete directory structure.

├── final_model_t5_small
│   ├── added_tokens.json
│   ├── config.json
│   ├── generation_config.json
│   ├── pytorch_model.bin
│   ├── special_tokens_map.json
│   ├── spiece.model
│   └── tokenizer_config.json
├── input
│   ├── fce_v2.1.bea19
│   └── final
│       ├── test.json
│       ├── train.json
│       └── valid.json
├── results_t5_small
│   ├── checkpoint-5500
│   ├── checkpoint-6500
│   └── events.out.tfevents.1703464284.sovitdl.18962.0
├── preprocess_fce.py
└── t5_small.ipynb

The final_model_t5_small directory contains the final trained model and tokenizer, and results_t5_small contains the intermediate checkpoints and TensorBoard logs saved during training.
The input directory contains the FCE dataset that we explored in the previous section. Along with that, it contains a final directory with three JSON files. We will obtain these after executing the preprocessing script.
Directly inside the project directory, we have two files:
- t5_small.ipynb which contains the code to train the T5 model for grammar correction and run inference.
- preprocess_fce.py script that we will use to obtain the JSON files in the input/final directory.

You can download the Jupyter Notebook for training & inference along with the best weights via the “Download Code” section.

Dependencies

Before we move forward, we need to ensure that the environment is properly set up. We need the PyTorch framework for running the code in this article. Please go ahead and install it according to your configuration from the official site.

Along with that, we need to install the Hugging Face transformers and datasets libraries.

pip install transformers

pip install datasets

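That’s it for the major dependencies. Note that the T5Tokenizer class used later also requires the sentencepiece package, which you can install with pip install sentencepiece if it is missing.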

Download Code

Download the Source Code for this Tutorial

Grammar Correction using Hugging Face Transformers and the T5 Model

Let’s get into the coding part of the article. The first step, as we discussed earlier, is to bring the FCE dataset into a simpler format.

Preprocessing the FCE Dataset

For that, we have a simple preprocessing script in the preprocess_fce.py file. Here is the entire content of the file.

import json
import os

ROOTS = [
    'input/fce_v2.1.bea19/fce/json/fce.train.json',
    'input/fce_v2.1.bea19/fce/json/fce.dev.json',
    'input/fce_v2.1.bea19/fce/json/fce.test.json'
]
SPLITS = [
    'train', 
    'valid', 
    'test'
]

save_dir = 'input/final'
os.makedirs(save_dir, exist_ok=True)

def replace_multiple_substrings(original_string, replacements):
    # replacements is expected to be a list of tuples, each containing:
    # (start_index, end_index, new_substring)

    # Sort replacements by start_index to handle replacements in order
    replacements.sort(key=lambda x: x[0])

    result = original_string
    offset = 0  # This offset is necessary because the string length may change

    for start_index, end_index, new_substring in replacements:
        # Adjust indices based on the current offset
        adjusted_start = start_index + offset
        adjusted_end = end_index + offset

        # Check for invalid indices
        if adjusted_start < 0 or adjusted_end > len(result) or adjusted_start > adjusted_end:
            print(f"Error: Invalid indices for replacement '{new_substring}'. Skipping.")
            continue

        # Replace the specified part of the string
        result = result[:adjusted_start] + str(new_substring) + result[adjusted_end:]

        # Update the offset based on how the length of the string has changed
        offset += len(str(new_substring)) - (end_index - start_index)

    return result

for root, split in zip(ROOTS, SPLITS):
    data = []
    data_points = []
    with open(root, 'r') as f:
        for line in f:
            data.append(json.loads(line))
    
    for i in range(len(data)):
        str_data = data[i]['text']
        re_data = data[i]['edits'][0][1]
        # print('STR: ', str_data)
        # print('RE: ', re_data)
        modified_string = replace_multiple_substrings(str_data, [edit[:3] for edit in re_data])
        # print(modified_string)
        
        data_point = {
            'original': str_data,
            'corrected': modified_string
        }
        data_points.append(data_point)
    
    with open(os.path.join(save_dir, split+'.json'), 'w') as f:
        json.dump(data_points, f, indent=4)

This is a simple script that converts the original FCE dataset into a much simpler format. A simpler dataset format will later reduce the code that we need to write while training the model.

In short, the above script:

Takes the original train, dev, and test files of the FCE dataset.
Reads the original text and the edits from the JSON files.
According to the edits, it creates a new text with the corrected words in place.
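The script above tracks a running offset because every replacement can change the length of the string. As a point of comparison, here is a minimal sketch of an alternative that is not part of preprocess_fce.py: applying the edits from right to left leaves the earlier offsets valid, so no offset bookkeeping is needed. The function name apply_edits_reversed and the short sample sentence are hypothetical, and the sketch assumes the edits do not overlap.

```python
def apply_edits_reversed(text, edits):
    """Apply (start, end, replacement) edits from right to left.

    Editing the rightmost span first never shifts the offsets of the
    spans to its left. Assumes the edits do not overlap.
    """
    result = text
    for start, end, replacement in sorted(edits, key=lambda e: e[0], reverse=True):
        result = result[:start] + replacement + result[end:]
    return result


# A short made-up sentence with two edits in the FCE style.
sample = "I saws the show hanging up of a wall."
edits = [(2, 6, "saw"), (27, 29, "on")]

fixed = apply_edits_reversed(sample, edits)
print(fixed)
```

Both approaches should produce the same text for non-overlapping edits. The offset-tracking version in the script additionally skips edits whose indices fall outside the string.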

After executing the above script, you will find the train.json, valid.json, and test.json files inside the input/final directory.

Following is a sample from the training split.

{
        "original": "Dear Sir or Madam,\n\nI am writing in order to express my 
disappointment about your musical show \"Over the Rainbow\".\n\nI saws the 
show's advertisement hanging up of a wall in London where I was spending my 
holiday with some friends. I convinced them to go there with me because I had 
heard good references about your Company and, above all, about the main star, 
Danny Brook.\n\nThe problems started in the box office, where we asked for the 
discounts you announced in the advertisement, and the man who was selling the 
tickets said that they didn't exist.\n\nMoreover, the show was delayed forty-five 
minutes and the worst of all was that Danny Brook had been replaced by another 
actor.\n\nOn the other hand, the theatre restaurant was closed because unknown 
reasons.\n\nYou promised a perfect evening but it became a big disastrous!\n\nI 
would like some kind of explanation and receive my money back. If you don't agree, 
I will act consequently.\n\nI look forward to hearing from you.\n\nYours faithfully,",
        "corrected": "Dear Sir or Madam,\n\nI am writing in order to express my 
disappointment with your musical show \"Over the Rainbow\".\n\nI saw the show's 
advertisement hanging up on a wall in London where I was spending my holiday with 
some friends. I convinced them to go there with me because I had heard good reviews 
of your Company and, above all, because of the main star, Danny Brook.\n\nThe 
problems started at the box office, where we asked for the discounts you announced 
in the advertisement, and the man who was selling the tickets said that they didn't 
exist.\n\nMoreover, the show was delayed forty-five minutes and the worst of all 
was that Danny Brook had been replaced by another actor.\n\nIn addition, the theatre 
restaurant was closed for unknown reasons.\n\nYou promised a perfect evening but it 
became a big disaster!\n\nI would like some kind of explanation and to get my money 
back. If you don't agree, I will act .\n\nI look forward to hearing from you.\n\nYours 
faithfully,"
}

So, for each sample, we now have a dictionary-like format with an "original" key and a "corrected" key.

T5 for Grammar Correction

Now, let’s jump into the actual training notebook. The code here follows the content in the t5_small.ipynb Jupyter Notebook.

In case you want a brief about the T5 Transformer model, please take a look at the spelling correction article. In that article, we used the T5 model for single-word spelling correction, and it may be a good starting point if you are new to this topic.

Let’s start with the import statements.

import torch

from transformers import (
    T5Tokenizer,
    T5ForConditionalGeneration,
    TrainingArguments, 
    Trainer
)
from datasets import load_dataset

From the transformers library, we import:

T5Tokenizer: To tokenize the dataset which consists of grammatically incorrect and correct sentences.
T5ForConditionalGeneration: This is for the loading of the T5 model.
TrainingArguments: This class initializes all the training arguments before starting the training.
Trainer: To initialize the Trainer object so that we can train the T5 model.

We also import the load_dataset function from the datasets library to load the prepared JSON files in a format that is directly compatible with the rest of the transformers pipeline.

Loading the Dataset

Next, we load the preprocessed training, validation, and test datasets.

dataset_train = load_dataset(
    'json', 
    data_files='input/final/train.json', 
    split='train'
)
dataset_valid = load_dataset(
    'json', 
    data_files='input/final/valid.json', 
    split='train'
)
dataset_test = load_dataset(
    'json', 
    data_files='input/final/test.json', 
    split='train'
)

When using the load_dataset function, the first argument is the dataset name or, for local files, the name of the builder that matches the file format. As our dataset is in JSON format, we pass 'json'.

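One other point to note here is the split argument. When we load local files through data_files without naming any splits, the datasets library places all the samples under a default split called 'train'. Passing split='train' simply returns that data as a single dataset object. It does not change any attribute of the data, so we can still use each file as originally intended.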

Defining Dataset and Training Configurations

The following code block contains a few dataset and training related configurations.

MODEL = 't5-small'
BATCH_SIZE = 16
MAX_LENGTH = 256
EPOCHS = 50
NUM_WORKERS = 8
OUT_DIR = 'results_t5_small'

MODEL: This is the model name that we will pass while loading the tokenizer and the model weights. For our grammar correction use case, we are using the T5 Small model.
BATCH_SIZE: We are using a batch size of 16 for the data loaders.
MAX_LENGTH: This is the maximum sequence length, in tokens, for each sample in the JSON files. Longer samples are truncated to this length and shorter ones are padded up to it.
EPOCHS: The number of epochs to train the model for.
NUM_WORKERS: The number of parallel workers for the data loaders.
OUT_DIR: This is the output directory to save intermediate results.
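To illustrate what MAX_LENGTH does, here is a simplified sketch of truncation and padding on a list of token IDs. The helper pad_or_truncate and the token IDs are made up for illustration. The real tokenizer does more than this: for example, it keeps the end-of-sequence token when truncating and also returns an attention mask.

```python
PAD_ID = 0  # T5 uses 0 as its pad token ID; used here as a plain constant


def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Clip a sequence to max_length, then right-pad it with pad_id."""
    clipped = token_ids[:max_length]
    return clipped + [pad_id] * (max_length - len(clipped))


# Made-up token IDs; a short and a long sequence with max_length=6.
short = pad_or_truncate([37, 1712, 19, 1], max_length=6)
long = pad_or_truncate(list(range(1, 11)), max_length=6)

print(short)
print(long)
```

With MAX_LENGTH set to 256, any essay longer than 256 tokens loses its tail, which is worth keeping in mind because the FCE samples are full essays.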

Tokenizing the FCE Dataset

Tokenization breaks text into smaller units (whole words or pieces of words) and assigns an integer ID to each unit. This is a highly simplified description of what goes on inside, as explaining the entire process of tokenization is out of the scope of this article.
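The idea can be sketched with a toy example. The vocabulary and the greedy longest-match rule below are invented purely for illustration. They are not how the T5 tokenizer works internally (it uses a SentencePiece model), but they show how a word can be split into smaller pieces and mapped to integer IDs.

```python
# Toy vocabulary, invented for this example only.
TOY_VOCAB = {"<unk>": 0, "correct": 1, "ion": 2, "grammar": 3, "s": 4, "ing": 5}


def toy_tokenize(word, vocab):
    """Split a word into the longest vocabulary pieces, left to right."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No piece matches: emit the unknown token and skip one character.
            pieces.append("<unk>")
            i += 1
    return pieces


def toy_encode(text, vocab):
    """Lowercase, split on whitespace, and map every piece to its ID."""
    ids = []
    for word in text.lower().split():
        ids.extend(vocab[piece] for piece in toy_tokenize(word, vocab))
    return ids


print(toy_tokenize("correction", TOY_VOCAB))
print(toy_encode("Grammar correction", TOY_VOCAB))
```

Here, "correction" is not in the vocabulary as a whole word, so it is split into "correct" and "ion" before the IDs are looked up.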

Let’s see how we can tokenize the dataset that we have just loaded above.

tokenizer = T5Tokenizer.from_pretrained(MODEL)

# Function to convert text data into model inputs and targets
def preprocess_function(examples):
    inputs = [f"rectify: {inc}" for inc in examples['original']]
    model_inputs = tokenizer(
        inputs, 
        max_length=MAX_LENGTH, 
        truncation=True,
        padding='max_length'
    )

    # Set up the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            examples['corrected'], 
            max_length=MAX_LENGTH, 
            truncation=True,
            padding='max_length'
        )

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Apply the function to the whole dataset
tokenized_train = dataset_train.map(
    preprocess_function, 
    batched=True,
    num_proc=8
)
tokenized_valid = dataset_valid.map(
    preprocess_function, 
    batched=True,
    num_proc=8
)
tokenized_test = dataset_test.map(
    preprocess_function, 
    batched=True,
    num_proc=8
)

The first step is loading the tokenizer. We load the T5Tokenizer on the first line of the above code cell while passing the model name that we defined earlier.

In the second step, we have a preprocess_function. This accepts samples from the loaded dataset. Each sample consists of the original grammatically incorrect text and the modified grammatically correct text. Note that we are prepending the rectify: prefix to each original incorrect text. T5 is a text-to-text model that expects a short task prefix at the start of each input. As we are correcting grammatical errors here, we use the above prefix.

Theoretically, it is possible to use any string as the prefix. However, using something that aligns with the task is much better.
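As a small illustration of the prefixing step, here is a sketch that runs on a hand-written batch. With batched=True, the map method passes the function a dictionary of column lists, which is the shape mimicked below. The two sentences are made up and do not come from the FCE dataset.

```python
PREFIX = "rectify: "

# A made-up batch in the same shape that map(batched=True) passes in:
# one list per column.
batch = {
    "original": ["He don't like vegetables.", "They was going home."],
    "corrected": ["He doesn't like vegetables.", "They were going home."],
}

# The prefix goes only on the model inputs; the targets stay untouched.
inputs = [f"{PREFIX}{text}" for text in batch["original"]]
targets = batch["corrected"]

for source, target in zip(inputs, targets):
    print(f"{source!r} -> {target!r}")
```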

The inputs to the T5 model are going to be the incorrect sentences and the targets (labels) will be the correct sentences. Finally, we return a dictionary that contains both, the tokenized input and the tokenized targets.

The third step involves mapping all three splits to the preprocessing function. The num_proc argument defines how many parallel processes are being used for tokenization.

If you are new to NLP, then you can start with the following text classification articles which will help you better understand the pipeline of tokenization.

Loading the T5 Model

Now, let’s load the T5 model and check the number of trainable parameters.

# Load the pre-trained T5 model
model = T5ForConditionalGeneration.from_pretrained(MODEL)

# Specify the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Total parameters and trainable parameters.
total_params = sum(p.numel() for p in model.parameters())
print(f"{total_params:,} total parameters.")
total_trainable_params = sum(
    p.numel() for p in model.parameters() if p.requires_grad)
print(f"{total_trainable_params:,} training parameters.")

We are using the from_pretrained method of the T5ForConditionalGeneration class to load the pretrained T5 Small model. It contains around 60.5 million parameters which is enough for getting started with our journey of GEC (Grammatical Error Correction).
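For a rough sense of scale, the following back-of-the-envelope sketch estimates the size of the weights. It assumes the approximate count of 60.5 million parameters quoted above and 32-bit floats, and it covers the weights only, not the gradients, optimizer states, or activations needed during training.

```python
NUM_PARAMS = 60.5e6   # approximate parameter count quoted above
BYTES_PER_PARAM = 4   # assumption: weights stored as float32

weights_mb = NUM_PARAMS * BYTES_PER_PARAM / 1e6
print(f"~{weights_mb:.0f} MB for the weights alone")
```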

Defining the Training Arguments

We will use the TrainingArguments class to initialize all the training arguments.

# Define the training arguments
training_args = TrainingArguments(
    output_dir=OUT_DIR,          
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE*2,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir=OUT_DIR,
    evaluation_strategy='steps',
    save_steps=500,
    eval_steps=500,
    load_best_model_at_end=True,
    save_total_limit=2,
    report_to='tensorboard',
    dataloader_num_workers=NUM_WORKERS
)

It accepts a large number of arguments (more than 100). However, in the above code block, we only pass the ones necessary for our usage.

According to the arguments:

The model will be evaluated and saved every 500 steps. Only two checkpoints are kept on disk at any time, and older ones are deleted.
The best model will be loaded at the end so that we can save it one final time before proceeding to the inference section.
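To see what these settings mean in terms of step counts, here is a small arithmetic sketch. It assumes training on a single device without gradient accumulation, and that the last partial batch of each epoch is kept.

```python
import math

TRAIN_SAMPLES = 2116   # size of the FCE training split
BATCH_SIZE = 16
EPOCHS = 50
SAVE_STEPS = 500       # same value is used for eval_steps

steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
num_evaluations = total_steps // SAVE_STEPS
last_checkpoint_step = num_evaluations * SAVE_STEPS

print(f"Steps per epoch: {steps_per_epoch}")
print(f"Total steps: {total_steps}")
print(f"Evaluations/checkpoints: {num_evaluations}")
print(f"Last checkpoint at step: {last_checkpoint_step}")
```

Under these assumptions, the last checkpoint falls at step 6500, which is consistent with the checkpoint-6500 directory shown in the project structure.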

Starting the Training for Grammar Correction using T5

Before starting the training, we need to initialize the Trainer API as well. The next code cell does that and starts the training.

# Create the Trainer instance
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_valid,
)

# Start training
history = trainer.train()

The Trainer class accepts the model, the above defined training arguments, and training & validation datasets.

We invoke the train method of the instance to start the training.

Here are the training logs after 50 epochs.

Figure 2. T5 training logs for grammar correction.

The model reached its best loss at 5500 steps, after which the loss began to deteriorate. But as we load the best model at the end of training, we can now save that best model and the tokenizer to disk for use at inference time. Let’s do that.

tokenizer.save_pretrained('final_model_t5_small')
model.save_pretrained('final_model_t5_small')

Next, we can evaluate the model on the test set as well.

trainer.evaluate(tokenized_test)

Figure 3. Grammar correction evaluation loss logs using the trained T5 model.

The evaluation loss on the test set is 0.47. One important point to note here is that we are judging the grammar correction model by its loss alone, which is not an ideal measure of correction quality. In future posts, we will explore more accurate metrics for grammar correction models.
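As a very rough complement to the loss, one could count how often the model output matches the reference text exactly. The sketch below is only a crude sanity check under that assumption. It is not a standard grammar correction metric, and the function name and sample sentences are made up.

```python
def exact_match_rate(predictions, references):
    """Fraction of predictions identical to their reference after stripping."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    matches = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return matches / len(references)


# Made-up model outputs and references.
preds = ["He doesn't like vegetables.", "They was going home."]
refs = ["He doesn't like vegetables.", "They were going home."]

score = exact_match_rate(preds, refs)
print(f"Exact match rate: {score:.2f}")
```

Exact match is harsh on long essays, where a single missed edit marks the whole sample as wrong, so it is more informative on short sentences.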

Following is the evaluation loss graph from the above training.

Figure 4. Evaluation loss graph after training T5 on FCE dataset.

As we can see, the loss kept decreasing for most of the training. To continue training further, we would most probably have to adjust the learning rate schedule.

Grammar Correction Inference using the Trained T5 Model

Let’s move on to the inference phase now. Following are the steps to carry out for grammar correction inference:

First, we will load the best model weights and tokenizer from the disk.
Second, we will write a helper function for grammar correction inference.
Third, we will pass a list of sentences to the helper function to get the corrected sentences.

from transformers import T5ForConditionalGeneration, T5Tokenizer
    
model_path = 'final_model_t5_small'  # the path where you saved your model
model = T5ForConditionalGeneration.from_pretrained(model_path)
tokenizer = T5Tokenizer.from_pretrained(model_path)

We load the final model and tokenizer from the disk in the above code block.

Now, let’s write a simple helper function called do_correction.

def do_correction(text, model, tokenizer):
    input_text = f"rectify: {text}"
    inputs = tokenizer.encode(
        input_text,
        return_tensors='pt',
        max_length=256,
        padding='max_length',
        truncation=True
    )

    # Get correct sentence ids.
    corrected_ids = model.generate(
        inputs,
        max_length=384,
        num_beams=5,
        early_stopping=True
    )

    # Decode.
    corrected_sentence = tokenizer.decode(
        corrected_ids[0],
        skip_special_tokens=True
    )
    return corrected_sentence

It prepends the prefix to the input text and tokenizes it, generates the output token IDs using beam search through the model's generate method, and decodes the IDs to return the final text.

Please note that we prepend the same rectify: prefix here as we did during training.

Finally, we define a few sentences in a list and pass them through the model.

sentences = [
    "He don't like to eat vegetables.",
    "They was going to the store yesterday.",
    "She don't sings very well.",
    "Between you and I, the decision not well received.",
    "The book I borrowed from the library, it was really interesting.",
    "Despite of the rain, they went for a picnic."
]

for sentence in sentences:
    corrected_sentence = do_correction(sentence, model, tokenizer)
    print(f"ORIG: {sentence}\nCORRECT: {corrected_sentence}")

Here are the outputs.

Figure 5. Grammar correction inference results using the trained T5 model.

The results are really good. The model can correct all the grammatical mistakes in the sentences. However, the T5 Small model does not do very well on long sentences with multiple errors. Here is an example.

Figure 6. An example of T5 grammar correction sub-optimal performance.

There are two errors in the corrected sentence. The model failed to rectify “data it” and in the final sentence “Its” should have been either “It is” or “It’s”.
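One possible workaround, which was not tested in this article, is to split a long text into sentences and correct each one separately. The sketch below uses a naive regular expression for splitting and a stand-in correction function so that it runs on its own. The helper names correct_by_sentence and dummy_correct are hypothetical, and whether this actually improves the results is an assumption.

```python
import re


def correct_by_sentence(text, correct_fn):
    """Split text after ., ! or ? and correct each sentence on its own."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return " ".join(correct_fn(s) for s in sentences)


def dummy_correct(sentence):
    # Stand-in for the real model call, used only to keep this runnable.
    return sentence.replace("was going", "were going")


result = correct_by_sentence(
    "They was going home. It rained! Did it stop?", dummy_correct
)
print(result)
```

With the trained model, the stand-in could be replaced by lambda s: do_correction(s, model, tokenizer). Note that the naive split will also break on abbreviations such as "e.g.", so a proper sentence splitter would be a better choice in practice.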

In future articles, we will see how larger models with better training strategies can handle such cases.

Summary and Conclusion

In this article, we went through a code-first approach for grammar correction using Hugging Face Transformers. We trained the T5 Small model on the FCE dataset and ran inference on some unseen sentences. In the end, we also saw where the model falls short, that is, on long text with multiple errors. We will tackle these issues in future articles. I hope this article was worth your time.

If you have any doubts, thoughts, or suggestions, please leave them in the comment section. I will surely address them.

You can contact me using the Contact section. You can also find me on LinkedIn and Twitter.
