Building a Predictive Typing Model
Anyone who uses a smartphone or other mobile device has used an onscreen smart keyboard that tries to predict the next word the user might want to type. Typically, up to 3 candidate words are predicted and displayed in a row at the top of the keyboard. Typing on a glass pane without tactile feedback can be frustrating at times, and the smart keyboard goes a long way towards alleviating that frustration.
But how does predictive typing work? At its core, it’s an ngram model with backoff. Makes complete sense, right? Well, it’s less complicated than it sounds, and not hard to understand, as we shall see.
Corpus
To begin, we need a large corpus of text with a wide-ranging vocabulary. The first name that comes to mind is William Shakespeare. Shakespeare wrote a number of plays and poems that are classics of the English language, all in the public domain and easily accessible from a number of sources (see here and here). By some accounts, Shakespeare’s vocabulary ranged from 15,000 to 30,000 words, which makes his works an excellent choice for our corpus.
David Robinson’s gutenbergr package makes it easy to download ‘The Complete Works Of Shakespeare’ from Project Gutenberg. The downloaded dataset is a data frame with the text in rows, along with the Gutenberg ID.
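The code in this post assumes a handful of packages are loaded; a minimal setup might look like this (a sketch):
library(gutenbergr)   # download texts from Project Gutenberg
library(dplyr)        # data manipulation verbs
library(stringr)      # string cleaning helpers
library(tidyr)        # separate()
library(tibble)       # as_tibble()
library(tidytext)     # unnest_tokens()
library(data.table)   # fast keyed lookup tables used later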
# Gutenberg ID 100 is 'The Complete Works of William Shakespeare'
dfRaw <- gutenberg_download(100)
# peek at the first few rows
knitr::kable(head(dfRaw, 12))
gutenberg_id | text |
---|---|
100 | Shakespeare |
100 | |
100 | This Etext has certain copyright implications you should read! |
100 | |
100 | <<THIS ELECTRONIC VERSION OF THE COMPLETE WORKS OF WILLIAM |
100 | SHAKESPEARE IS COPYRIGHT 1990-1993 BY WORLD LIBRARY, INC., AND IS |
100 | PROVIDED BY PROJECT GUTENBERG ETEXT OF ILLINOIS BENEDICTINE COLLEGE |
100 | WITH PERMISSION. ELECTRONIC AND MACHINE READABLE COPIES MAY BE |
100 | DISTRIBUTED SO LONG AS SUCH COPIES (1) ARE FOR YOUR OR OTHERS |
100 | PERSONAL USE ONLY, AND (2) ARE NOT DISTRIBUTED OR USED |
100 | COMMERCIALLY. PROHIBITED COMMERCIAL DISTRIBUTION INCLUDES BY ANY |
100 | SERVICE THAT CHARGES FOR DOWNLOAD TIME OR FOR MEMBERSHIP.>> |
knitr::kable(dfRaw[sample(nrow(dfRaw), 6),])
gutenberg_id | text |
---|---|
100 | Assure yourselves, will never be unkind. |
100 | |
100 | Like one besotted on your sweet delights. |
100 | And spurn in pieces posts of adamant; |
100 | Whereto, when they shall know what men are rich, |
100 | TITUS. Marcus, even thou hast struck upon my crest, |
Data Preparation
The copyright notice appears multiple times throughout the text. Fortunately, it doesn’t prevent us from using this data for non-commercial purposes. The first step is to merge the entire text and remove the copyright notices, which are helpfully enclosed within << … >>. Secondly, each play lists its characters, their relationships and their titles under ‘DRAMATIS PERSONAE’; these lines have to be removed too.
After doing some quick cleaning, the text is reconverted to a data frame.
# collapse all rows into one string, separating lines with a period
complete_works <- str_c(dfRaw$text, collapse = ".")
# split on the copyright notices enclosed in << ... >> and drop them
complete_works <- str_split(complete_works, "<<[^>]*>>")[[1]]
# drop the chunks containing the DRAMATIS PERSONAE listings
complete_works <- complete_works[!str_detect(complete_works, "Dramatis Personae|DRAMATIS PERSONAE")]
# replace stage-direction brackets with periods
complete_works <- str_replace_all(complete_works, "\\[|\\]", ".")
# back to a data frame, dropping the first two chunks of front matter
dfContent <- as_tibble(complete_works) %>% slice(3:n())
rm(complete_works)
Finally, we parse the content into sentences. The tidytext package by David Robinson and Julia Silge is most useful for doing this within the tidyverse framework.
Here’s what the function below does:
- Replaces all ‘logical’ sentence break delimiters by a period for easier tokenization by sentence
- Removes extra periods and spaces
- Tokenizes into sentences
- Removes all sentences containing ACT or SCENE
- Removes all digits from sentences
- Finally adds line numbers to each row (more on the importance of this later…)
A sample of 10 sentences is shown below.
getSentences <- function(dfContent) {
  dfContent %>%
    # replace 'logical' sentence breaks (; : ! ? and dashes) with periods
    mutate(value = str_replace_all(value, ";|:|!|\\?| \\-|\\- | \\- ", "\\. ")) %>%
    # collapse runs of periods and whitespace
    mutate(value = str_replace_all(value, "[\\.]+", "\\.")) %>%
    mutate(value = str_replace_all(value, "[[:space:]]+", " ")) %>%
    # tokenize into sentences, preserving case
    unnest_tokens(sentence, value, token = "sentences", to_lower = FALSE) %>%
    # drop act/scene markers and any digits
    dplyr::filter(!str_detect(sentence, "Act|ACT|Scene|SCENE")) %>%
    mutate(sentence = str_replace_all(sentence, "[0-9]", "")) %>%
    # number each sentence; we will group by this when building ngrams
    mutate(lineNumber = row_number()) %>%
    select(lineNumber, everything())
}
dfSentences <- getSentences(dfContent)
knitr::kable(dfSentences[sample(nrow(dfSentences), 10),])
lineNumber | sentence |
---|---|
55991 | That with the King here resteth in his tent. . |
16097 | You are a merry man, sir. fare you well. |
9717 | He did ask favour. |
100749 | To the health of our general. . |
40908 | You are, I think, assur’d I love you not. |
117828 | Laurence. and another Watchman. |
19243 | Ere now denied the asker, and now again,. |
145024 | I swear to do this, though a present death. |
70942 | Stocking his messenger. |
122957 | For bringing wood in slowly. |
Creating ngrams
“An ngram is a contiguous sequence of n items from a given sequence of text…” - Wikipedia
In this case, an ngram is a contiguous sequence of n words. In the simplest form, an ngram of size 1 is a single word; an ngram of size 2 or more is a sequence of words that appear one after another.
For instance, given the sentence ‘how are you doing?’:
- There are 4 Unigrams: how, are, you, doing
- There are 3 Bigrams: how are, are you, you doing
- There are 2 Trigrams: how are you, are you doing
- There is 1 Quadrigram: how are you doing
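tidytext’s unnest_tokens function can generate these directly; as a quick sketch of the bigram case:
# bigrams of the example sentence: "how are", "are you", "you doing"
tibble(value = "how are you doing") %>%
  unnest_tokens(word, value, token = "ngrams", n = 2)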
So, using the data frame of clean sentences that we parsed, the next step is to create ngrams.
Once again, this is made simple by the unnest_tokens function from the tidytext package. This is also where grouping by line number matters: we want ngrams built from sequences of words within the same sentence only, never spanning a sentence break.
getNgrams <- function(dfSentences, n) {
  dfSentences %>%
    # keep ngrams within a single sentence; never across a sentence break
    group_by(lineNumber) %>%
    unnest_tokens(word, sentence, token = "ngrams", n = n) %>%
    ungroup() %>%
    # count how often each ngram occurs across the whole corpus
    count(word, sort = TRUE) %>%
    rename(freq = n)
}
All Words
df1grams <- getNgrams(dfSentences, 1)
dim(df1grams)
[1] 26453 2
# count of words that appear exactly once in the corpus
wordsOfFreq1 <- df1grams %>% group_by(freq) %>% summarise(n = sum(freq)) %>% dplyr::filter(freq == 1) %>% .$n
There are 26453 distinct words in the corpus, which shows how vast Shakespeare’s vocabulary was. Let’s look at a wordcloud of the top 100 words. No surprise, the most commonly used words are: the, and, i, to, of. Excluding these common ‘stop words’, the second wordcloud shows the words Shakespeare used most frequently. We can see many references to royalty and nobility, present especially in the Shakespearean tragedies. Some names are mentioned quite frequently - Richard, John, Caesar, Brutus and Henry - and some English counties or shires come up often - Warwick, York and Gloucester.
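A sketch of how such wordclouds can be generated, assuming the wordcloud package and tidytext’s built-in stop_words list:
library(wordcloud)
# wordcloud of the 100 most frequent words
with(head(df1grams, 100), wordcloud(word, freq))
# the same, after removing common English stop words
df1grams %>%
  anti_join(tidytext::stop_words, by = "word") %>%
  head(100) %>%
  with(wordcloud(word, freq))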
Unigrams
Let’s look at the distribution of words in the corpus by how often they are used.
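One way to draw such a plot with ggplot2 (a sketch, not necessarily the original figure) is to count how many distinct words occur at each frequency and plot on log scales:
library(ggplot2)
df1grams %>%
  count(freq, name = "nWords") %>%   # how many distinct words occur at each frequency
  ggplot(aes(freq, nWords)) +
  geom_point(alpha = 0.5) +
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "times a word appears in the corpus", y = "number of distinct words")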
Here’s a strategy for building ngram tables:
- There is a long tail of 10522 words in the corpus that appear exactly once. Replace every word that appears just once with UNK, a placeholder for a word not known to the corpus. There are a couple of motivations behind this:
- It reduces the size of ngram tables.
- More importantly, when an unknown word is encountered, the model knows how to tackle it before predicting the next set of words.
- Re-summarise and rearrange the table by frequency.
- Assign an index to each word.
- Convert to data.table format. Some of the biggest benefits of this approach are:
- Rows can be assigned and referenced by a key, which makes looking up a row value extremely simple.
- Lookup by a key is very fast compared to matching character strings.
- When we create tables for bigrams and higher-order ngrams, we can replace the words with their corresponding integer indexes, which saves a lot of memory compared to tibbles of character strings.
vTopUnigrams <- df1grams %>% dplyr::filter(freq >= 2) %>% .$word
df1grams <- df1grams %>%
  # collapse all words of frequency 1 into the UNK placeholder
  mutate(word = ifelse(word %in% vTopUnigrams, word, "UNK")) %>%
  group_by(word) %>%
  summarise(freq = sum(freq)) %>%
  arrange(desc(freq)) %>%
  # assign an integer index to each word, in order of frequency
  mutate(index1 = row_number()) %>%
  select(index1, freq, word)
dt1grams <- as.data.table(df1grams)
# key the table by word, for fast word-to-index lookups
dt1grams <- dt1grams[, .(index1, freq), keyby = word]
# a second table keyed by index, for index-to-word lookups
dt1gramsByIndex <- dt1grams[, .(word, freq), keyby = index1]
# print dimensions
dim(dt1grams)
[1] 15932 3
UNK now replaces the long tail of words that appear only once, leaving 15932 unigrams in our table. Here are the top 10 unigrams.
 | index1 | word | freq |
---|---|---|---|
1 | 1 | the | 26821 |
2 | 2 | and | 25670 |
3 | 3 | i | 20473 |
4 | 4 | to | 19377 |
5 | 5 | of | 16954 |
6 | 6 | a | 14351 |
7 | 7 | you | 13568 |
8 | 8 | my | 12449 |
9 | 9 | in | 10881 |
10 | 10 | that | 10869 |
Bigrams
A data table of bigrams is created, with the index of the first word as the lookup key. The second word is never UNK, since it is the word that will be predicted from the bigram table.
df2grams <- getNgrams(dfSentences, 2) %>%
  separate(word, c("word1", "word2"), sep = " ") %>%
  # the context word may be UNK, but the predicted word may not
  mutate(word1 = ifelse(word1 %in% vTopUnigrams, word1, "UNK")) %>%
  dplyr::filter(word2 %in% vTopUnigrams) %>%
  group_by(word1, word2) %>%
  summarise(freq = sum(freq)) %>%
  arrange(desc(freq))
dt2grams <- as.data.table(df2grams)
# replace words by their integer indexes from the unigram table
dt2grams$index1 <- dt1grams[dt2grams$word1]$index1
dt2grams$index2 <- dt1grams[dt2grams$word2]$index1
dt2grams <- dt2grams[,.(index1,index2,freq)]
# key on the first word's index for fast lookups
setkey(dt2grams, index1)
# print dimensions
dim(dt2grams)
[1] 254884 3
Here are the top 10 bigrams:
 | word1 | word2 | freq |
---|---|---|---|
1 | | | 30512 |
2 | i | am | 1846 |
3 | my | lord | 1640 |
4 | in | the | 1611 |
5 | i | have | 1603 |
6 | i | will | 1553 |
7 | of | the | 1417 |
8 | to | the | 1380 |
9 | it | is | 1066 |
10 | to | be | 973 |
Trigrams
Similarly, a data table of trigrams is created, with the indexes of the first and second words forming the lookup key. The third word is never UNK, since it is the word that will be predicted from the trigram table.
df3grams <- getNgrams(dfSentences, 3) %>%
separate(word, c("word1", "word2", "word3"), sep = " ") %>%
mutate(word1 = ifelse(word1 %in% vTopUnigrams, word1, "UNK")) %>%
mutate(word2 = ifelse(word2 %in% vTopUnigrams, word2, "UNK")) %>%
dplyr::filter(word3 %in% vTopUnigrams) %>%
group_by(word1, word2, word3) %>%
summarise(freq = sum(freq)) %>%
arrange(desc(freq))
dt3grams <- as.data.table(df3grams)
dt3grams$index1 <- dt1grams[dt3grams$word1]$index1
dt3grams$index2 <- dt1grams[dt3grams$word2]$index1
dt3grams$index3 <- dt1grams[dt3grams$word3]$index1
dt3grams <- dt3grams[,.(index1,index2,index3,freq)]
setkey(dt3grams, index1,index2)
# print dimensions
dim(dt3grams)
[1] 476098 4
The top 10 trigrams give an idea of how UNK is useful in this context. ‘the UNK of’ is one of the most frequent word sequences in the trigram table, which means that if we encounter an unknown word after ‘the’, then ‘of’ is the most likely next word predicted from the trigram table.
 | word1 | word2 | word3 | freq |
---|---|---|---|---|
1 | | | | 38526 |
2 | i | pray | you | 238 |
3 | i | will | not | 213 |
4 | i | know | not | 159 |
5 | i | do | not | 157 |
6 | the | UNK | of | 155 |
7 | the | duke | of | 141 |
8 | i | am | not | 139 |
9 | i | am | a | 138 |
10 | i | would | not | 127 |
Quadrigrams
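The quadrigram and pentagram construction code isn’t shown above, but it follows the same pattern as the trigram table. A sketch of the quadrigram version (df5grams/dt5grams would be built analogously, with a fifth word column and a key on the first four indexes):
df4grams <- getNgrams(dfSentences, 4) %>%
  separate(word, c("word1", "word2", "word3", "word4"), sep = " ") %>%
  # context words may be UNK, the predicted (fourth) word may not
  mutate(word1 = ifelse(word1 %in% vTopUnigrams, word1, "UNK")) %>%
  mutate(word2 = ifelse(word2 %in% vTopUnigrams, word2, "UNK")) %>%
  mutate(word3 = ifelse(word3 %in% vTopUnigrams, word3, "UNK")) %>%
  dplyr::filter(word4 %in% vTopUnigrams) %>%
  group_by(word1, word2, word3, word4) %>%
  summarise(freq = sum(freq)) %>%
  arrange(desc(freq))
dt4grams <- as.data.table(df4grams)
dt4grams$index1 <- dt1grams[dt4grams$word1]$index1
dt4grams$index2 <- dt1grams[dt4grams$word2]$index1
dt4grams$index3 <- dt1grams[dt4grams$word3]$index1
dt4grams$index4 <- dt1grams[dt4grams$word4]$index1
dt4grams <- dt4grams[,.(index1,index2,index3,index4,freq)]
setkey(dt4grams, index1, index2, index3)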
Here are the top 10 quadrigrams:
 | word1 | word2 | word3 | word4 | freq |
---|---|---|---|---|---|
1 | | | | | 47569 |
2 | with | all | my | heart | 46 |
3 | i | know | not | what | 38 |
4 | give | me | your | hand | 31 |
5 | i | do | beseech | you | 31 |
6 | give | me | thy | hand | 30 |
7 | the | duke | of | york | 29 |
8 | i | do | not | know | 28 |
9 | i | would | not | have | 27 |
10 | ay | my | good | lord | 25 |
Pentagrams
Here are the top 10 pentagrams:
 | word1 | word2 | word3 | word4 | word5 | freq |
---|---|---|---|---|---|---|
1 | | | | | | 60534 |
2 | i | am | glad | to | see | 15 |
3 | i | thank | you | for | your | 12 |
4 | i | had | rather | be | a | 9 |
5 | am | glad | to | see | you | 8 |
6 | as | i | am | a | gentleman | 8 |
7 | for | mine | own | part | i | 8 |
8 | i | pray | you | tell | me | 8 |
9 | know | not | what | to | say | 8 |
10 | and | so | i | take | my | 7 |
These ngram tables comprise all the information needed for the prediction model.
Memory Usage Comparison
As mentioned above, storing words as integer indexes in a data table is far more efficient than a data frame of character strings. For instance, here’s what the pentagram data table looks like when it contains just the integer indexes of the words:
 | index1 | index2 | index3 | index4 | index5 | freq |
---|---|---|---|---|---|---|
1 | 1 | 1 | 1 | 9049 | 5 | 1 |
2 | 1 | 1 | 9049 | 5 | 1 | 1 |
3 | 1 | 4 | 2 | 9384 | 12918 | 1 |
4 | 1 | 4 | 687 | 31 | 4758 | 1 |
5 | 1 | 5 | 31 | 6421 | 29 | 1 |
6 | 1 | 9 | 183 | 2 | 1 | 1 |
7 | 1 | 11 | 1 | 46 | 15 | 1 |
8 | 1 | 11 | 1 | 57 | 16 | 1 |
9 | 1 | 11 | 1 | 1267 | 11402 | 1 |
10 | 1 | 11 | 1 | 1403 | 10842 | 1 |
Comparing the memory needed for all the ngram tables, we see a difference of roughly a factor of 5 between the two approaches.
print(object.size(df1grams)+object.size(df2grams)+object.size(df3grams)+
object.size(df4grams)+object.size(df5grams), units = "auto")
158 Mb
print(object.size(dt1grams)+object.size(dt1gramsByIndex)+object.size(dt2grams)+object.size(dt3grams)+
object.size(dt4grams)+object.size(dt5grams), units = "auto")
30.4 Mb
Ngram With Backoff Prediction Algorithm
So, given the ngram data tables, here’s how the prediction algorithm works:
- Sanitize the input text as done above
- Parse the text into sentences
- Determine the words in the last sentence of the input text, using at most the last 4 words.
- Predictions start from the ngram table with the longest word sequence, provided the input text contains at least n-1 words. So, if the last 4 words of the input text match the first 4 words of a row in the pentagram table, the fifth word with the highest frequency of occurrence after those 4 words is chosen. If more than one prediction is desired, the fifth words with the next highest frequencies of occurrence are also chosen.
- If no matches, or fewer than the desired number of matches, are found in the pentagram table, the algorithm backs off to the quadrigram table using the last 3 words typed, then to the trigram table using the last 2 words, then to the bigram table using the last word, and finally to the unigram table, until the desired number of predictions is reached (a sketch of this backoff lookup follows this list).
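The actual predictNextWords() function used below isn’t reproduced here; besides the backoff lookup it also sanitizes the raw input, detects whether the last token is still being typed, and filters candidates by that partial word. A simplified sketch of just the backoff lookup, assuming the keyed tables built above (with dt4grams and dt5grams keyed analogously on their leading index columns) and that the input has already been mapped to unigram indexes, with unknown words mapped to the index of UNK:
predictBackoff <- function(wordIdx, nPred = 3) {
  stopifnot(length(wordIdx) >= 1)
  # ngramTables[[n]] predicts the (n+1)th word from the last n context words
  ngramTables <- list(dt2grams, dt3grams, dt4grams, dt5grams)
  preds <- integer(0)
  # start with the longest usable context (at most 4 words) and back off
  for (n in seq(min(length(wordIdx), 4L), 1L)) {
    context <- as.list(tail(wordIdx, n))
    hits <- ngramTables[[n]][context, nomatch = 0L]   # keyed lookup on index1..index<n>
    hits <- hits[order(-freq)]
    preds <- union(preds, hits[[paste0("index", n + 1L)]])
    if (length(preds) >= nPred) break
  }
  # last resort: the most frequent unigrams, ignoring the UNK placeholder
  if (length(preds) < nPred) {
    preds <- union(preds, dt1gramsByIndex[word != "UNK"][order(-freq)]$index1)
  }
  # translate the winning indexes back into words
  dt1gramsByIndex[.(head(preds, nPred))]$word
}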
Let’s see how this works concretely with some test cases:
Test Cases
The first test is to confirm that the word predictions from the unigram table match what we expect.
The top 10 unigrams contain 3 words beginning with the letter ‘t’: the (1), to (4), that (10). And sure enough, this is what we get from our prediction function.
predictNextWords("t")
[1] "Predictions from 1grams: the,to,that,this,thou"
[1] "Top 3 Final Predictions: the,to,that"
From the bigram table, we saw that i am, i have and i will are among the top 10. So, the next test is to confirm these.
predictNextWords("i ")
[1] "Predictions from 2grams: am,have,will,do,would"
[1] "Top 3 Final Predictions: am,have,will"
Narrowing down to words that begin with the letter ‘h’ gives us:
predictNextWords("i h")
[1] "Predictions from 2grams: have,had,hope,hear,heard"
[1] "Top 3 Final Predictions: have,had,hope"
Let’s make some predictions from the trigram table. i am not and i am a are among the top 10 trigrams, so that’s what we expect to see.
predictNextWords("i am ")
[1] "Predictions from 3grams: not,a,sure,glad,the"
[1] "Top 3 Final Predictions: not,a,sure"
Now let’s examine the UNK feature. We know that the UNK of is the fifth most common sequence in the trigram table, so we should expect to see ‘of’ as the first prediction when the input text has an unknown word after ‘the’. Let’s type some gibberish after ‘the’ to make sure the word doesn’t exist in our vocabulary:
predictNextWords("the sjdhsbhbfh ")
[1] "Predictions from 3grams: of,and,in,that,the"
[1] "Top 3 Final Predictions: of,and,in"
Now that the known test cases are working well, let’s see the backoff algorithm in action:
predictNextWords("to be or not ")
[1] "Predictions from 5grams: to"
[1] "Predictions from 4grams: to"
[1] "Predictions from 3grams: to,at,i,allow'd,arriv'd"
[1] "Top 3 Final Predictions: to,at,i"
predictNextWords("to be or not to ")
[1] "Predictions from 5grams: be"
[1] "Predictions from 4grams: be,crack"
[1] "Predictions from 3grams: be,crack,the,have,me,do"
[1] "Top 3 Final Predictions: be,crack,the"
predictNextWords("be or not to ")
[1] "Predictions from 5grams: be"
[1] "Predictions from 4grams: be,crack"
[1] "Predictions from 3grams: be,crack,the,have,me,do"
[1] "Top 3 Final Predictions: be,crack,the"
predictNextWords("to be or not to be ")
[1] "Predictions from 5grams: that"
[1] "Predictions from 4grams: that,found,gone,endur'd,endured,seen"
[1] "Top 3 Final Predictions: that,found,gone"
predictNextWords("to be or not to be, that ")
[1] "Predictions from 5grams: is"
[1] "Predictions from 4grams: is,which"
[1] "Predictions from 3grams: is,which,you,he,i,thou,can"
[1] "Top 3 Final Predictions: is,which,you"
predictNextWords("to be or not to be, that is ")
[1] "Predictions from 5grams: the"
[1] "Predictions from 4grams: the"
[1] "Predictions from 3grams: the,not,my,to,a"
[1] "Top 3 Final Predictions: the,not,my"
predictNextWords("to be or not to be, that is the ")
[1] "Predictions from 5grams: question"
[1] "Predictions from 4grams: question,very,way,humour,best,brief"
[1] "Top 3 Final Predictions: question,very,way"
predictNextWords("be that is the ")
[1] "Predictions from 5grams: question"
[1] "Predictions from 4grams: question,very,way,humour,best,brief"
[1] "Top 3 Final Predictions: question,very,way"
Using one of the best-known speeches from Hamlet, we can see that the algorithm gets the first word prediction from the pentagram table. But since we are looking for 3 candidate words, the algorithm backs off to the quadrigram and trigram tables until it has at least 3 predictions. In this case, the pentagram table gives us the best match we are looking for. We can also observe that once we have at least 4 words in the input text, the words before them do not matter for the next-word predictions.
Let’s see another example:
predictNextWords("What a lovely day ")
[1] "Predictions from 5grams: "
[1] "Predictions from 4grams: "
[1] "Predictions from 3grams: "
[1] "Predictions from 2grams: and,to,is,of,i"
[1] "Top 3 Final Predictions: and,to,is"
predictNextWords("a lovely day ")
[1] "Predictions from 4grams: "
[1] "Predictions from 3grams: "
[1] "Predictions from 2grams: and,to,is,of,i"
[1] "Top 3 Final Predictions: and,to,is"
predictNextWords("lovely day ")
[1] "Predictions from 3grams: "
[1] "Predictions from 2grams: and,to,is,of,i"
[1] "Top 3 Final Predictions: and,to,is"
predictNextWords("day ")
[1] "Predictions from 2grams: and,to,is,of,i"
[1] "Top 3 Final Predictions: and,to,is"
Here we see that even though there are enough words to start matching from the higher-order ngram tables, matches are found only in the bigram table, using ‘day’ as the first word.
Another one:
predictNextWords("Beware the ides of ")
[1] "Predictions from 5grams: march"
[1] "Predictions from 4grams: march"
[1] "Predictions from 3grams: march"
[1] "Predictions from 2grams: march,the,my,his,a,this"
[1] "Top 3 Final Predictions: march,the,my"
predictNextWords("Beware the ides of mar")
[1] "Predictions from 5grams: march"
[1] "Predictions from 4grams: march"
[1] "Predictions from 3grams: march"
[1] "Predictions from 2grams: march,marriage,marcius,marcus,mars"
[1] "Top 3 Final Predictions: march,marriage,marcius"
predictNextWords("Beware the ides of march")
[1] "Predictions from 5grams: march"
[1] "Predictions from 4grams: march"
[1] "Predictions from 3grams: march"
[1] "Predictions from 2grams: march,marching"
[1] "Predictions from 1grams: march,marching,march'd,marches,marchioness"
[1] "Top 3 Final Predictions: march,marching,march'd"