Deep learning with text can be fascinating, but there are two main problems to address. First, deep learning models do not take text data directly as input. Second, text can be really messy to deal with when designing deep learning models. Keras can help with the preprocessing of text data, so in this article I am going to share four ways in which you can easily preprocess text data using Keras for your next deep learning project.
An overview of what is to follow:
- Keras text_to_word_sequence.
- Keras hashing_trick.
- Encoding with one_hot in Keras.
- Keras Tokenizer.
So, let’s get started.
Keras text_to_word_sequence
Keras provides the text_to_word_sequence() function to convert text into a list of word tokens. While preprocessing text, this may well be the very first step to take before moving further. text_to_word_sequence() splits the text on white space. It also filters out punctuation marks and converts all characters to lower case. The default list of punctuation that it removes is !"#$%&()*+,-./:;<=>?@[\]^_`{|}~ along with tabs and newlines. One function providing so much functionality is really handy.
Now, let’s take a look at an example:
from keras.preprocessing.text import text_to_word_sequence

# define the text
text = 'Text to Word Sequence Function works really well'

# tokenize the text
tokens = text_to_word_sequence(text)
print(tokens)
You should be getting the following output:
['text', 'to', 'word', 'sequence', 'function', 'works', 'really', 'well']
As you can see, the function converts the text into a list of lowercase words.
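If the defaults do not suit your data, text_to_word_sequence() also accepts filters and lower arguments to override them. A minimal sketch, where the custom filter string (the default minus '!') and the sample text are illustrative choices:

from keras.preprocessing.text import text_to_word_sequence

# define the text
text = 'Text to Word Sequence Function works really well!'

# keep the original casing and stop filtering out '!'
tokens = text_to_word_sequence(
    text,
    filters='"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
    lower=False,
)
print(tokens)

With tokenization covered, let’s look at the other functions that Keras provides.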
Keras hashing_trick
The Keras hashing_trick() function converts text to a sequence of indexes in a fixed-size hashing space. This function is useful because, as we know, deep learning models do not take text inputs. Converting the text into a list of words with text_to_word_sequence() is only the first step; with the hashing_trick() function, we can get back a list of word indexes. Let’s see a simple example.
from keras.preprocessing.text import text_to_word_sequence
from keras.preprocessing.text import hashing_trick

# define the text
text = 'An example for keras hashing trick function'

# tokenize the text to determine the vocabulary size
tokens = text_to_word_sequence(text)
length = len(tokens)

# hash the words into a space of `length` indexes
final_text = hashing_trick(text, length)
print(final_text)
[4, 5, 2, 1, 5, 2, 4]
In the above example, you can see that some of the words have been assigned the same index. This is due to collisions in the hashing function, which become likely when the hashing space is only as large as the number of words. You can also provide a hash_function as a parameter; by default it is Python's built-in hash function.
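One thing to be aware of: Python's built-in hash is randomized between interpreter runs, so the indexes above can change every time you run the script. Passing hash_function='md5' makes the mapping stable across runs. A minimal sketch, where the vocabulary size of 10 is an arbitrary illustrative choice:

from keras.preprocessing.text import hashing_trick

# define the text
text = 'An example for keras hashing trick function'

# 'md5' is a stable hash, so the same text always maps to the
# same indexes, unlike Python's run-randomized built-in hash()
final_text = hashing_trick(text, 10, hash_function='md5')
print(final_text)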
Encoding with one_hot in Keras
It is very common to encode text data to integer data when working with deep learning models.
The one_hot() function in Keras allows us to do that with ease. The function takes two mandatory arguments: the first is the text and the second is the size of the vocabulary.
The following example shows the one_hot() function at work.
from keras.preprocessing.text import one_hot
from keras.preprocessing.text import text_to_word_sequence

# define the text
text = 'One hot encoding in Keras'

# tokenize the text to determine the vocabulary size
tokens = text_to_word_sequence(text)
length = len(tokens)

# encode the text (avoid shadowing the one_hot function name)
encoded = one_hot(text, length)
print(encoded)
[3, 4, 1, 3, 3]
This function hashes the words using Python's hash function. By default it also filters the text using !"#$%&()*+,-./:;<=>?@[\]^_`{|}~ plus tabs and newlines, so basic punctuation is stripped just as with text_to_word_sequence().
Keras Tokenizer
Keras also provides the Tokenizer class, which we can use to encode multiple text documents at once. This becomes very useful when handling large collections of documents. After the tokenizer has been fit on the text documents, we have access to four attributes that help us analyze how the text has been preprocessed:
- word_counts: a dictionary of words and the number of times each appears across the texts.
- word_docs: a dictionary of words and the number of documents each appears in.
- word_index: a dictionary of words and their indexes.
- document_count: the total number of documents.
Let’s look at an example to get a better idea of how the Tokenizer class works.
from keras.preprocessing.text import Tokenizer

# define the text documents
text = ['You are learning a lot',
        'That is a good thing',
        'This will help you a lot']

# create the tokenizer and fit it on the documents
tokenizer = Tokenizer()
tokenizer.fit_on_texts(text)

# print the attributes for the text and encode the documents
print(tokenizer.word_counts)
print(tokenizer.word_docs)
print(tokenizer.word_index)
print(tokenizer.document_count)

encoded_text = tokenizer.texts_to_matrix(text)
print(encoded_text)
OrderedDict([('you', 2), ('are', 1), ('learning', 1), ('a', 3), ('lot', 2), ('that', 1), ('is', 1), ('good', 1), ('thing', 1), ('this', 1), ('will', 1), ('help', 1)])
defaultdict(<class 'int'>, {'a': 3, 'are': 1, 'lot': 2, 'you': 2, 'learning': 1, 'good': 1, 'is': 1, 'that': 1, 'thing': 1, 'help': 1, 'will': 1, 'this': 1})
{'a': 1, 'you': 2, 'lot': 3, 'are': 4, 'learning': 5, 'that': 6, 'is': 7, 'good': 8, 'thing': 9, 'this': 10, 'will': 11, 'help': 12}
3
[[0. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 1. 1. 1. 1. 0. 0. 0.]
 [0. 1. 1. 1. 0. 0. 0. 0. 0. 0. 1. 1. 1.]]
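texts_to_matrix() defaults to a binary bag-of-words encoding, and column 0 is reserved, which is why the matrix above has 13 columns for 12 words. The Tokenizer class can also produce integer sequences and count- or TF-IDF-weighted matrices. A minimal sketch of both:

from keras.preprocessing.text import Tokenizer

# define the text documents
text = ['You are learning a lot',
        'That is a good thing',
        'This will help you a lot']

# create the tokenizer and fit it on the documents
tokenizer = Tokenizer()
tokenizer.fit_on_texts(text)

# integer sequences: one list of word indexes per document,
# the usual input format for an Embedding layer
print(tokenizer.texts_to_sequences(text))

# texts_to_matrix also supports 'count', 'freq' and 'tfidf'
# modes besides the default 'binary'
print(tokenizer.texts_to_matrix(text, mode='count'))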
This article should get you started on preprocessing text for your own projects. If you want more details, you can always look at the Keras documentation.
Conclusion
In this article, you learned how to preprocess text data for a deep learning model. Now you can get started with your own project as well. Don’t forget to like, share and subscribe to the newsletter. Also, you can follow me on Twitter and Facebook to get regular updates about new articles.