Word Suggestion Algorithm

We use automatic correction (Autocorrect) every day on our mobile phones and computers. So how does the autocorrect mechanism work? In this article, I will try to explain an example structure using the Python programming language regarding the autocorrect algorithm used in NLP. Even if the algorithm that I will design is not the same as the algorithms used in phones, it is useful to say that it is a very efficient structure :)

Data Preprocessing

In most machine learning processes, cleaning the data with preprocessing and making it meaningful is one of the most important steps. Let’s use a very famous quote from Shakespeare previously defined in our NLP algorithm.

“O for a Muse of fire, that would ascend The brightest heaven of invention,
A kingdom for a stage, princes to act …… “

https://tr.wikipedia.org/wiki/William_Shakespeare

In this step, we have actually received a reference source. In this reference resource, we first need to find out which words are used and put them into a list structure (not a difficult step, actually :)) Then we need to find the words that are meaningful to us as output count by count.
We can summarize this step with an example as follows.
I am very eager to learn and explain → After entering the process function, you may think as if you leave “I am willing to learn, tell,”. The important thing in this step is not to take the repetitive words, if we do not omit the repetitive words, it will have a negative effect on the existing probability state.
Then we need to find the word frequency values ​​from the txt file that we defined with the count function.
“I am going, I am coming, I love” It should be noted that it counts as I → 1 because we receive unique words.
Then the probability of a word to be found in the text should be calculated. For example, the word x is in the text file with a probability of 0.0045.
We need to extract the character probability as in the table below.

String Manipulations

If the input entered in this step is taught to the NLP algorithm, it should be output as an existed word. Otherwise, we will examine how to find the correct character with string manipulation.

  • delete_letter: one character removed.
  • switch_letter: two adjacent letters switched.
  • replace_letter: one character replaced by another different letter.
  • insert_letter: an additional character inserted.

Below you can see the shortcode about the delete operation as an example. Other manipulation operations will be coded in a similar logical manner.

def delete_letter(word):

delete_l = []
split_l = []

split_l=[(word[:i],word[i:]) for i in range(len(word)+1) ]
delete_l=[(L + R[1:]) for L, R in split_l if R ]

return delete_l

Let’s take the case of accidentally spelling hoappy instead of happy

delete = [‘oappy’, ‘happy’, ‘hoppy’, ‘hoapy’, ‘hoapy’, ‘hoapp’]

After switch operation :

switch = ['ohappy', 'haoppy', 'hopapy', 'hoappy', 'hoapyp']

And the output of the “replace” operation is as follows:

replace= [‘aoappy’, ‘boappy’, ‘coappy’, ‘doappy’, ‘eoappy’, ‘foappy’, ‘goappy’, ‘haappy’, ‘hbappy’, ‘hcappy’, ‘hdappy’, ‘heappy’, ‘hfappy’, ‘hgappy’, ‘hhappy’, ‘hiappy’, ‘hjappy’, ‘hkappy’, ‘hlappy’, ‘hmappy’, ‘hnappy’, ‘hoaapy’, ‘hoabpy’, ‘hoacpy’, ‘hoadpy’, ‘hoaepy’, ‘hoafpy’, ‘hoagpy’, ‘hoahpy’, ‘hoaipy’, ‘hoajpy’, ‘hoakpy’, ‘hoalpy’, ‘hoampy’, ‘hoanpy’, ‘hoaopy’, ‘hoapay’, ‘hoapby’, ‘hoapcy’, ‘hoapdy’, ‘hoapey’, ‘hoapfy’, ‘hoapgy’, ‘hoaphy’, ‘hoapiy’, ‘hoapjy’, ‘hoapky’, ‘hoaply’, ‘hoapmy’, ‘hoapny’, ‘hoapoy’, ‘hoappa’, ‘hoappb’, ‘hoappc’, ‘hoappd’, ‘hoappe’, ‘hoappf’, ‘hoappg’, ‘hoapph’, ‘hoappi’, ‘hoappj’, ‘hoappk’, ‘hoappl’, ‘hoappm’, ‘hoappn’, ‘hoappo’, ‘hoappp’, ‘hoappq’, ‘hoappr’, ‘hoapps’, ‘hoappt’, ‘hoappu’, ‘hoappv’, ‘hoappw’, ‘hoappx’, ‘hoappy’, ‘hoappya’, ‘hoappyb’, ‘hoappyc’, ‘hoappyd’, ‘hoappye’, ‘hoappyf’, ‘hoappyg’, ‘hoappyh’, ‘hoappyi’, ‘hoappyj’, ‘hoappyk’, ‘hoappyl’, ‘hoappym’, ‘hoappyn’, ‘hoappyo’, ‘hoappyp’, ‘hoappyq’, ‘hoappyr’, ‘hoappys’, ‘hoappyt’, ‘hoappyu’, ‘hoappyv’, ‘hoappyw’, ‘hoappyx’, ‘hoappyy’, ‘hoappyz’, ‘hoappz’, ‘hoapqy’, ‘hoapry’, ‘hoapsy’, ‘hoapty’, ‘hoapuy’, ‘hoapvy’, ‘hoapwy’, ‘hoapxy’, ‘hoapyy’, ‘hoapzy’, ‘hoaqpy’, ‘hoarpy’, ‘hoaspy’, ‘hoatpy’, ‘hoaupy’, ‘hoavpy’, ‘hoawpy’, ‘hoaxpy’, ‘hoaypy’, ‘hoazpy’, ‘hobppy’, ‘hocppy’, ‘hodppy’, ‘hoeppy’, ‘hofppy’, ‘hogppy’, ‘hohppy’, ‘hoippy’, ‘hojppy’, ‘hokppy’, ‘holppy’, ‘homppy’, ‘honppy’, ‘hooppy’, ‘hopppy’, ‘hoqppy’, ‘horppy’, ‘hosppy’, ‘hotppy’, ‘houppy’, ‘hovppy’, ‘howppy’, ‘hoxppy’, ‘hoyppy’, ‘hozppy’, ‘hpappy’, ‘hqappy’, ‘hrappy’, ‘hsappy’, ‘htappy’, ‘huappy’, ‘hvappy’, ‘hwappy’, ‘hxappy’, ‘hyappy’, ‘hzappy’, ‘ioappy’, ‘joappy’, ‘koappy’, ‘loappy’, ‘moappy’, ‘noappy’, ‘ooappy’, ‘poappy’, ‘qoappy’, ‘roappy’, ‘soappy’, ‘toappy’, ‘uoappy’, ‘voappy’, ‘woappy’, ‘xoappy’, ‘yoappy’, ‘zoappy’]

After inserting step:

insert = [‘ahoappy’, ‘bhoappy’, ‘choappy’, ‘dhoappy’, ‘ehoappy’, ‘fhoappy’, ‘ghoappy’, ‘hhoappy’, ‘ihoappy’, ‘jhoappy’, ‘khoappy’, ‘lhoappy’, ‘mhoappy’, ‘nhoappy’, ‘ohoappy’, ‘phoappy’, ‘qhoappy’, ‘rhoappy’, ‘shoappy’, ‘thoappy’, ‘uhoappy’, ‘vhoappy’, ‘whoappy’, ‘xhoappy’, ‘yhoappy’, ‘zhoappy’, ‘haoappy’, ‘hboappy’, ‘hcoappy’, ‘hdoappy’, ‘heoappy’, ‘hfoappy’, ‘hgoappy’, ‘hhoappy’, ‘hioappy’, ‘hjoappy’, ‘hkoappy’, ‘hloappy’, ‘hmoappy’, ‘hnoappy’, ‘hooappy’, ‘hpoappy’, ‘hqoappy’, ‘hroappy’, ‘hsoappy’, ‘htoappy’, ‘huoappy’, ‘hvoappy’, ‘hwoappy’, ‘hxoappy’, ‘hyoappy’, ‘hzoappy’, ‘hoaappy’, ‘hobappy’, ‘hocappy’, ‘hodappy’, ‘hoeappy’, ‘hofappy’, ‘hogappy’, ‘hohappy’, ‘hoiappy’, ‘hojappy’, ‘hokappy’, ‘holappy’, ‘homappy’, ‘honappy’, ‘hooappy’, ‘hopappy’, ‘hoqappy’, ‘horappy’, ‘hosappy’, ‘hotappy’, ‘houappy’, ‘hovappy’, ‘howappy’, ‘hoxappy’, ‘hoyappy’, ‘hozappy’, ‘hoaappy’, ‘hoabppy’, ‘hoacppy’, ‘hoadppy’, ‘hoaeppy’, ‘hoafppy’, ‘hoagppy’, ‘hoahppy’, ‘hoaippy’, ‘hoajppy’, ‘hoakppy’, ‘hoalppy’, ‘hoamppy’, ‘hoanppy’, ‘hoaoppy’, ‘hoapppy’, ‘hoaqppy’, ‘hoarppy’, ‘hoasppy’, ‘hoatppy’, ‘hoauppy’, ‘hoavppy’, ‘hoawppy’, ‘hoaxppy’, ‘hoayppy’, ‘hoazppy’, ‘hoapapy’, ‘hoapbpy’, ‘hoapcpy’, ‘hoapdpy’, ‘hoapepy’, ‘hoapfpy’, ‘hoapgpy’, ‘hoaphpy’, ‘hoapipy’, ‘hoapjpy’, ‘hoapkpy’, ‘hoaplpy’, ‘hoapmpy’, ‘hoapnpy’, ‘hoapopy’, ‘hoapppy’, ‘hoapqpy’, ‘hoaprpy’, ‘hoapspy’, ‘hoaptpy’, ‘hoapupy’, ‘hoapvpy’, ‘hoapwpy’, ‘hoapxpy’, ‘hoapypy’, ‘hoapzpy’, ‘hoappay’, ‘hoappby’, ‘hoappcy’, ‘hoappdy’, ‘hoappey’, ‘hoappfy’, ‘hoappgy’, ‘hoapphy’, ‘hoappiy’, ‘hoappjy’, ‘hoappky’, ‘hoapply’, ‘hoappmy’, ‘hoappny’, ‘hoappoy’, ‘hoapppy’, ‘hoappqy’, ‘hoappry’, ‘hoappsy’, ‘hoappty’, ‘hoappuy’, ‘hoappvy’, ‘hoappwy’, ‘hoappxy’, ‘hoappyy’, ‘hoappzy’, ‘hoappya’, ‘hoappyb’, ‘hoappyc’, ‘hoappyd’, ‘hoappye’, ‘hoappyf’, ‘hoappyg’, ‘hoappyh’, ‘hoappyi’, ‘hoappyj’, ‘hoappyk’, ‘hoappyl’, ‘hoappym’, ‘hoappyn’, ‘hoappyo’, ‘hoappyp’, ‘hoappyq’, ‘hoappyr’, ‘hoappys’, ‘hoappyt’, ‘hoappyu’, ‘hoappyv’, ‘hoappyw’, ‘hoappyx’, ‘hoappyy’, ‘hoappyz’]

After this step, we will define a single character & double character correction algorithm to combine all steps.
In single character correction, we will develop all the steps above in double character correction, and after single character correction, we will develop one more character correction.
We will give the output after double character correction as input to the suggestion algorithm.

Suggestion algorithm:

-If the word is in vocabulary, it suggests the word.
- Otherwise, if there are suggestions from edit_one_letter in the vocabulary, it will use them.
- If it is still not correct, if there are suggestions from edit_two_letters in the vocabulary, it will use them.
- Otherwise, suggest whatever you have entered into hand prisoner :)
And will give us the following suggestion:) Nice suggestion accepted

But in real NLP algorithms, such a clean process does not work; Because the algorithm must also understand the written sentence and decide whether the word written after it is wrong or not. Many algorithms related to Autocorrection are the most common minimum edit distance and Levenshtein distance algorithms available. These algorithms are basically an algorithm based on expressing the degree of closeness between two possible words vectorially and proposing the most logical word in the least costly (ie the least letter replacement) way. In Python, I preferred to share the existing function instead of sharing the code I wrote in dozens of lines :) from Levenshtein, import distance is one of the best libraries you can use.
I prefer to give information in my articles without getting too boring. You can ask detailed questions about the code :)

Boun MSC.- Metu

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store