Auto-Complete

Emre Çetiner
5 min read · Apr 3, 2021

When we type something into Google, we see our query being completed as we write. So how does this algorithm work? How does Google understand what we are typing and complete it? In this article, I will develop a demo auto-complete algorithm in Python.

Pre-process the data

First, we need to split the raw text into sentences, and then split each sentence into individual word tokens. For example, “I have a dog” becomes [‘i’, ‘have’, ‘a’, ‘dog’]. In other words, every sentence is converted into a list of the words it contains.

Tokenized sentence:

import nltk

tokenized_sentences = []
for sentence in sentences:
    # lowercase the sentence, then split it into word tokens with NLTK
    sentence = sentence.lower()
    tokenized = nltk.word_tokenize(sentence)
    tokenized_sentences.append(tokenized)
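
For example, running the same steps on a single toy sentence (my own example, not a sample from the article’s dataset) produces:

# nltk.download('punkt') may be required the first time
sentences = ["I have a dog."]
tokenized_sentences = []
for sentence in sentences:
    tokenized_sentences.append(nltk.word_tokenize(sentence.lower()))
print(tokenized_sentences)   # [['i', 'have', 'a', 'dog', '.']]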

Next, we split the tokenized data into training and test sets. Printing the sizes of the two splits and the first sample of each gives the following output.

616296 data are split into 493036 train and 123260 test set
First training sample:
['disappointing', '!', '!', '!', '!']
First test sample
['star']
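
The article does not show the splitting code itself; a minimal sketch of my own, assuming a random shuffle and roughly an 80/20 split (which matches the numbers above), could look like this:

import random

random.seed(87)                      # arbitrary seed, for reproducibility
random.shuffle(tokenized_sentences)
train_size = int(len(tokenized_sentences) * 0.8)
train_data = tokenized_sentences[:train_size]
test_data = tokenized_sentences[train_size:]
print(f"{len(tokenized_sentences)} data are split into "
      f"{len(train_data)} train and {len(test_data)} test set")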

Then we need to count how many times each word occurs in the training data, building a dictionary of word counts. This will let us find the most frequently used words.

word_counts = {}
for sentence in tokenized_sentences:
    for token in sentence:
        if token not in word_counts:
            word_counts[token] = 1
        else:
            word_counts[token] += 1
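
To see which words dominate the corpus, we can sort this dictionary by count; a small usage example of my own:

# ten most frequent words and how many times each occurs
most_common = sorted(word_counts.items(), key=lambda item: item[1], reverse=True)[:10]
print(most_common)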

Modelling the Out of Vocabulary (OOV) case

One of the problems seen in auto-prediction models is that if the user enters a word that does not exist in the vocabulary the model was trained on, the model may be unable to predict the next word. Such an unseen word is called an “OOV” word. We therefore need to prepare the training data so that it contains some ‘unknown’ tokens for the model to learn from.

We will create a list of the most frequently used words in the training set, called the closed vocabulary, and convert every word that is not part of the closed vocabulary into the ‘<unk>’ marker (unk → unknown).

First, we need to find the OOV words. The code below builds the closed vocabulary from the words that occur at least count_threshold times; everything outside it will be treated as OOV.

closed_vocab = []
word_counts = count_words(tokenized_sentences)   # the counting loop above, wrapped in a function
for word in word_counts:
    cnt = word_counts[word]
    # keep only words that appear at least count_threshold times
    if cnt >= count_threshold:
        closed_vocab.append(word)

vocabulary = set(closed_vocab)
replaced_tokenized_sentences = []
for sentence in tokenized_sentences:
    replaced_sentence = []
    for token in sentence:
        if token in vocabulary:
            replaced_sentence.append(token)
        else:
            replaced_sentence.append("<unk>")
    replaced_tokenized_sentences.append(replaced_sentence)
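
As a quick sanity check (a toy example of my own, not taken from the article’s data), any token outside the closed vocabulary gets mapped to ‘<unk>’:

vocabulary = set(["i", "like", "a", "cat"])
sentence = ["i", "like", "a", "dragon"]
replaced = [token if token in vocabulary else "<unk>" for token in sentence]
print(replaced)   # ['i', 'like', 'a', '<unk>']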

N-gram based language models

Count-based models make an order-n Markov assumption and then estimate the n-gram probabilities, applying a smoothing technique. In an n-gram model, predicting a word sequence is reduced to predicting one word at a time. The language-model probability p(w1, w2, …, wn) is a product of word probabilities, each conditioned on the history of the preceding words. This is also called a Markov chain, where the number of previous states (here, previous words) is the order of the model.

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words, or base pairs, depending on the application. N-grams are typically collected from a text or speech corpus. Using Latin numerical prefixes, an n-gram of size 1 is referred to as a “unigram”, size 2 is a “bigram” (or, less commonly, a “digram”), and size 3 is a “trigram”.

Given a text, n-gram extraction is simply the process of dividing it into groups of 1, 2, 3, … consecutive words:
*Unigrams: groups of 1 word
*Bigrams: groups of 2 words
*Trigrams: groups of 3 words

n_grams = {}
for sentence in data:
    # prepend n start tokens <s> (one for a unigram model); there is always exactly one end token <e>
    sentence = ["<s>"]*n + sentence + ["<e>"]
    sentence = tuple(sentence)
    for i in range(len(sentence) - n + 1):
        n_gram = sentence[i:i+n]
        if n_gram in n_grams:
            n_grams[n_gram] += 1
        else:
            n_grams[n_gram] = 1

input: [['i', 'like', 'a', 'cat'], ['this', 'dog', 'is', 'like', 'a', 'cat']]

→Uni-gram:
{(‘<s>’,): 2, (‘i’,): 1, (‘like’,): 2, (‘a’,): 2, (‘cat’,): 2, (‘<e>’,): 2, (‘this’,): 1, (‘dog’,): 1, (‘is’,): 1}

→Bi-gram:

{(‘<s>’, ‘<s>’): 2, (‘<s>’, ‘i’): 1, (‘i’, ‘like’): 1, (‘like’, ‘a’): 2, (‘a’, ‘cat’): 2, (‘cat’, ‘<e>’): 2, (‘<s>’, ‘this’): 1, (‘this’, ‘dog’): 1, (‘dog’, ‘is’): 1, (‘is’, ‘like’): 1}

While calculating the probability of a word, we also need to use k-smoothing to get rid of the zero counts that unseen (“unknown”) n-grams would otherwise produce. With smoothing, the estimated probability of a word after a given n-gram is

P(word | previous n-gram) = (count(previous n-gram + word) + k) / (count(previous n-gram) + k·|V|)

where |V| is the number of words in the vocabulary. As we can see from this formula, when both counts are zero the probability of the unknown case is always the same, 1/|V|.

previous_n_gram = tuple(previous_n_gram)
previous_n_gram_count = n_gram_counts.get(previous_n_gram, 0)
denominator = previous_n_gram_count + k*vocabulary_size
n_plus1_gram = previous_n_gram+(word,)
n_plus1_gram_count = n_plus1_gram_counts.get(n_plus1_gram, 0)
numerator = n_plus1_gram_count+k
probability = numerator/denominator
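
Wrapped into a function and applied to the toy bi-gram counts from the example above (the function name, the 7-word vocabulary, and the call are my own illustration, using k = 1):

def estimate_probability(word, previous_n_gram, n_gram_counts, n_plus1_gram_counts, vocabulary_size, k=1.0):
    # smoothed probability of `word` following `previous_n_gram`
    previous_n_gram = tuple(previous_n_gram)
    previous_n_gram_count = n_gram_counts.get(previous_n_gram, 0)
    denominator = previous_n_gram_count + k * vocabulary_size
    n_plus1_gram = previous_n_gram + (word,)
    n_plus1_gram_count = n_plus1_gram_counts.get(n_plus1_gram, 0)
    numerator = n_plus1_gram_count + k
    return numerator / denominator

# toy counts taken from the bi-gram example above
unigram_counts = {('a',): 2}
bigram_counts = {('a', 'cat'): 2}
# assuming a 7-word vocabulary: i, like, a, cat, this, dog, is
p = estimate_probability('cat', ('a',), unigram_counts, bigram_counts, vocabulary_size=7, k=1.0)
print(p)   # (2 + 1) / (2 + 1*7) = 3/9 ≈ 0.33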

Perplexity

In natural language processing, perplexity is a way of evaluating language models. A language model is a probability distribution over entire sentences or texts. The lower the perplexity, the better the model predicts the held-out text, and it is often possible to achieve lower perplexity on more specialized corpora, as they are more predictable.
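
The article does not show the perplexity computation itself; a minimal sketch of my own, reusing the estimate_probability function sketched above and the same <s>/<e> padding as in the counting code:

def calculate_perplexity(sentence, n_gram_counts, n_plus1_gram_counts, vocabulary_size, k=1.0):
    # size of the context: 1 for a bigram model, 2 for a trigram model, ...
    n = len(list(n_gram_counts.keys())[0])
    # pad the sentence the same way as in the n-gram counting code
    sentence = tuple(["<s>"] * n + sentence + ["<e>"])
    N = len(sentence)
    product_pi = 1.0
    for t in range(n, N):
        previous_n_gram = sentence[t - n:t]
        word = sentence[t]
        probability = estimate_probability(word, previous_n_gram,
                                           n_gram_counts, n_plus1_gram_counts,
                                           vocabulary_size, k=k)
        # perplexity is the N-th root of the product of inverse probabilities
        product_pi *= 1 / probability
    return product_pi ** (1 / float(N))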

Auto-complete system

As the last step, we will look at the algorithm of the autocomplete system itself. Logically, it is enough for this step to suggest the word with the highest probability according to the model we constructed above.

n = len(list(n_gram_counts.keys())[0])
previous_n_gram = previous_tokens[-n:]

probabilities = estimate_probabilities(previous_n_gram,
                                       n_gram_counts, n_plus1_gram_counts,
                                       vocabulary, k=k)
suggestion = None
max_prob = 0

if start_with is None:
    start_with = ''
# keep only the candidates that match the typed prefix, then take the most probable one
for word, prob in probabilities.items():
    if word.startswith(start_with):
        if prob > max_prob:
            max_prob = prob
            suggestion = word
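
The snippet above relies on an estimate_probabilities helper that is not shown in the article; one possible sketch, reusing the estimate_probability function from the smoothing section and assuming that the end and unknown tokens are also candidates:

def estimate_probabilities(previous_n_gram, n_gram_counts, n_plus1_gram_counts, vocabulary, k=1.0):
    # score every candidate word with the smoothed n-gram formula
    vocabulary = vocabulary + ["<e>", "<unk>"]   # assumption: <e> and <unk> can also be suggested
    vocabulary_size = len(vocabulary)
    return {word: estimate_probability(word, previous_n_gram, n_gram_counts,
                                       n_plus1_gram_counts, vocabulary_size, k=k)
            for word in vocabulary}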

For an input such as ‘i like’, our model output will tell us that ‘a’ has the highest probability, completing the sentence toward ‘i like a cat’.

I hope you enjoy the read
