Text Preprocessing with Keras: 4 Simple Ways



Deep learning with text can be fascinating, but it poses two main challenges. First, deep learning models do not take raw text as input. Second, text data is often messy and needs cleaning before it can be fed to a model. Keras ships with utilities that handle much of this preprocessing for you. Therefore, in this article, I am going to share 4 ways in which you can easily preprocess text data using Keras for your next deep learning project.

An overview of what is to follow:

  • Keras text_to_word_sequence.
  • Keras hashing_trick.
  • Encoding with one_hot in Keras.
  • Keras Tokenizer.

So, let’s get started.

Keras text_to_word_sequence

Keras provides the text_to_word_sequence() function to split text into a list of word tokens. When preprocessing text, this may well be the very first step you take before moving further.

text_to_word_sequence() splits the text on whitespace. It also filters out punctuation marks and converts all characters to lowercase. The default list of characters that it removes is !"#$%&()*+,-./:;<=>?@[\]^_`{|}~ plus tabs and newlines. Getting all of this from a single function is really convenient.

Now, let’s take a look at an example:

from keras.preprocessing.text import text_to_word_sequence
# define the text 
text = 'Text to Word Sequence Function works really well'
# tokenizing the text 
tokens = text_to_word_sequence(text)
print(tokens)

You should be getting the following output:

['text', 'to', 'word', 'sequence', 'function', 'works', 'really', 'well']

As you can see, the function converts the text into a list of words.
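If you need different behavior, text_to_word_sequence() also accepts optional filters, lower, and split arguments. Here is a minimal sketch (the sample sentence is just for illustration):

from keras.preprocessing.text import text_to_word_sequence
# define a text with mixed case and punctuation
text = 'Hello, World! Keep UPPER-case?'
# keep the default punctuation filters but switch off lowercasing
tokens = text_to_word_sequence(text, lower=False)
print(tokens)

This should print ['Hello', 'World', 'Keep', 'UPPER', 'case'], since the hyphen is in the default filter list and gets replaced by the split character. Now, let's look at the other functions that Keras provides.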

Keras hashing_trick

The Keras hashing_trick() function converts a text to a sequence of integer indexes in a fixed-size hashing space.

This function is useful because, as we know, deep learning models do not take text inputs. Converting the text into a list of tokens with text_to_word_sequence() is only the first step; with the hashing_trick() function, we can get back a list of integer word indexes. Let's see a simple example.

from keras.preprocessing.text import text_to_word_sequence
from keras.preprocessing.text import hashing_trick
# define the text 
text = 'An example for keras hashing trick function'
# tokenizing the text 
tokens = text_to_word_sequence(text)
length = len(tokens)
# hash the words into a space the size of the vocabulary
final_text = hashing_trick(text, length)
print(final_text)

You should get output similar to the following:

[4, 5, 2, 1, 5, 2, 4]

In the above example, you can see that some of the words have been assigned the same index. This is caused by collisions in the hashing function: with a hashing space only as large as the vocabulary, two different words can easily hash to the same index. You can also provide hash_function as a parameter; by default, it is Python's built-in hash function, which is not stable across runs, so Keras also accepts 'md5' as a stable alternative.
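One common way to reduce collisions is to oversize the hashing space relative to the vocabulary and to use the stable 'md5' hash. A quick sketch, where the ~30% padding is just a rule of thumb and not a Keras requirement:

from keras.preprocessing.text import text_to_word_sequence
from keras.preprocessing.text import hashing_trick
# define the text
text = 'An example for keras hashing trick function'
tokens = text_to_word_sequence(text)
# oversize the hashing space by ~30% to make collisions less likely
vocab_size = round(len(tokens) * 1.3)
# 'md5' produces the same indexes on every run, unlike python's hash
indexes = hashing_trick(text, vocab_size, hash_function='md5')
print(indexes)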

Encoding with one_hot in Keras

It is very common to encode text data to integer data when working with deep learning models.

The one_hot() function in Keras allows us to do that with ease. The function takes two mandatory arguments: the first is the text and the second is the size of the vocabulary.

The following example explains the working of the one_hot() function.

from keras.preprocessing.text import one_hot
from keras.preprocessing.text import text_to_word_sequence

#define the text
text = 'One hot encoding in Keras'
#tokenize the text
tokens = text_to_word_sequence(text)
length = len(tokens)
# avoid naming the result one_hot, or it will shadow the imported function
encoded = one_hot(text, length)
print(encoded)

You should get output similar to the following:

[3, 4, 1, 3, 3]

Despite its name, one_hot() does not return one-hot vectors: it returns a list of integer word indexes, and is simply a wrapper around hashing_trick() that uses Python's hash function. Also, by default it filters the text with !"#$%&()*+,-./:;<=>?@[\]^_`{|}~ plus tabs and newlines, i.e. basic punctuation.
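Because the mapping is just a hash of each word, repeated words always receive the same index for a given vocabulary size. A quick sketch to illustrate (the sentence is arbitrary):

from keras.preprocessing.text import one_hot
# 'the' appears twice and will receive the same index both times
text = 'the cat sat on the mat'
encoded = one_hot(text, 10)
print(encoded)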

Keras Tokenizer

Keras also provides the Tokenizer class, which we can use to encode multiple text documents at once. This becomes very useful when handling large collections of documents.

After the Tokenizer has been fit on the text documents, we have access to four attributes that help us analyze how the text has been preprocessed:

  • word_counts: a dictionary mapping each word to the number of times it appeared in the text.
  • word_docs: a dictionary mapping each word to the number of documents it appeared in.
  • word_index: a dictionary mapping each word to its unique integer index.
  • document_count: the total number of documents.

Let’s look at an example to have a better idea of the working of the Tokenizer class.

from keras.preprocessing.text import Tokenizer

# define the text
text = ['You are learning a lot', 'That is a good thing',
       'This will help you a lot']
# creating tokenizer
tokenizer = Tokenizer()
# fit the tokenizer on the document
tokenizer.fit_on_texts(text)
# print the attributes for the text and encode the document
print(tokenizer.word_counts)
print(tokenizer.word_docs)
print(tokenizer.word_index)
print(tokenizer.document_count)
encoded_text = tokenizer.texts_to_matrix(text)
print(encoded_text)
You should see output similar to the following:

OrderedDict([('you', 2), ('are', 1), ('learning', 1), ('a', 3), ('lot', 2), ('that', 1), ('is', 1), ('good', 1), ('thing', 1), ('this', 1), ('will', 1), ('help', 1)])

defaultdict(<class 'int'>, {'a': 3, 'are': 1, 'lot': 2, 'you': 2, 'learning': 1, 'good': 1, 'is': 1, 'that': 1, 'thing': 1, 'help': 1, 'will': 1, 'this': 1})

{'a': 1, 'you': 2, 'lot': 3, 'are': 4, 'learning': 5, 'that': 6, 'is': 7, 'good': 8, 'thing': 9, 'this': 10, 'will': 11, 'help': 12}

3

[[0. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 1. 1. 1. 1. 0. 0. 0.]
 [0. 1. 1. 1. 0. 0. 0. 0. 0. 0. 1. 1. 1.]]
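texts_to_matrix() encodes each document as a fixed-size vector; the default mode is 'binary', where a 1 marks the presence of a word. If you instead need variable-length integer sequences, for example to feed an Embedding layer, the Tokenizer also provides texts_to_sequences(). A short sketch reusing the tokenizer fit above:

# convert each document to a list of word indexes
sequences = tokenizer.texts_to_sequences(text)
# with the word_index above, this prints
# [[2, 4, 5, 1, 3], [6, 7, 1, 8, 9], [10, 11, 12, 2, 1, 3]]
print(sequences)
# texts_to_matrix also supports 'count', 'freq' and 'tfidf' modes
counts = tokenizer.texts_to_matrix(text, mode='count')
print(counts)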

This article should get you started on preprocessing text for your own deep learning projects. If you want more details, you can always look at the Keras documentation.

Conclusion

In this article, you learned how to preprocess text data for a deep learning model. Now you can get started with your own project as well. Don’t forget to like, share and subscribe to the newsletter. Also, you can follow me on Twitter and Facebook to get regular updates about new articles.
