“Hello World” of Natural Language Processing (NLP) Auto Correction

Introduction:

Natural language processing (NLP) is the field of artificial intelligence that relates linguistics to computer science. After understanding the concepts of NLP, we will start to know how powerful are numbers. Some of NLP applications are spelling correction, Sentiment analysis, Fake news detection, Neural Machine Translation, Question and Answering, Chatbot, etc.

Auto-Correction:

We will start with one of the primary applications of NLP that is autocorrection. This is like a hello world problem in NLP. So the question is as follows: we should develop an algorithm that auto-corrects the wrong words and suggests the correct ones.

For example: if the input is given as “alphabat,” the algorithm should return “alphabet”. Now let us understand the algorithm. Basically, it involves four steps.

  1. To identify the misspelled word.
  2. To find all the words which are n distance away.
  3. Filter them.
  4. Calculate word probabilities.

Let us deeply understand how this algorithm works with an example. Consider the misspelled word ‘ het ‘ and the correct word ‘ hat ‘.Now we know that the misspelled word should be close to the right word. Here a question arises: how do we quantify the closeness of two words.. We quantify this using a minimum edit distance algorithm. The minimum edit distance algorithm takes two words as input and outputs the minimum number of edits required to get one-word from the other. We store all the words that are 1, 2 edit distance away from the misspelled word. Since the correct word should be close to a misspelled word, the correct word should be among the stored words, so we search it and output it.

Minimum Edit Distance Algorithm :

Minimum edit distance algorithm: This algorithm is used to quantify the relation between two words, i.e., it tells us how close the words are considering the spelling ( Note: Not Meaning ). This algorithm gives us the minimum number of edits required to change one word to another. Python implementation of this algorithm is as follows.

def min_edit_distance( str1 , str2 , l1 , l2):
  if l1==0:
    return l2
  if l2==0:
    return l1
  if str1[l1-1]==str2[l2-1]:
    return correction(str1,str2,l1-1,l2-1)
  return 1 + min(min_edit_distance(str1,str2,l1-1,l2) , min_edit_distance(str1,str2,l1-1,l2) , min_edit_distance(str1,str2,l1-1,l2-1) ) 

STEP 1:

To identify the misspelled word, we check whether the word is present in the dictionary or not. We get this dictionary from novels, articles, etc., for learning purposes; we can create a dictionary from stories I used Shakespeare’s text.

python implementation for step 1 :

  1. First, open the file and read all the text.
  2. Store all the unique words in a list. 
  3. Write a function that takes a list as input and gives a dictionary whose keys are words and values are the frequency of words.
  4. Write a function that takes the above-obtained dictionary, outputs t probability of words, i.e., the ratio of the frequency of that word and sum of all frequencies. 
import re
from collections import Counter
import numpy as np
import pandas as pd

def process(fname):
    words = [] 
    with open(fname) as fl:
        data = fl.read()
    data = data.lower()
    words = re.findall('\w+',data)    
    return words

words = process('shakespeare.txt')
vocab = set(words)

def count(words):
    countd = {}  
    countd = Counter(words)          
    return countd

count_dict = count(words)

def probabilities(count_dict):
    p = {} 
    l = sum(count_dict.values())
    for k in count_dict.keys():
        p[k] = count_dict[k] / l    
    return p
probs = probabilities(count_dict)

STEP 2:

This is the step where the beauty of the algorithm lies. In this algorithm, we store all the words which are n edit distance away. Let us know the types of edits possible.

  1. Insert: inserting an alphabet. 
  2. Delete: Deleting an alphabet.
  3. Swap: swapping two adjacent alphabets.
  4. Replace: replacing an alphabet.

For example, consider the word ‘het.’

Inserting : ‘ahet’, ‘bhet’ etc.

Delete : ‘et’, ‘ht ‘, ‘he ‘.

Swap : ‘eht ‘, ‘hte ‘etc.

Replace : ‘hat ‘, ‘hbt ‘, ‘aet ‘. 

All the words obtained above are 1 edit distance away. 

In this algorithm, we consider all the words which are 1, 2 distance away.

all we have is a list of words at most 2 edit distance away from the misspelled word, this completes the second step.

python implementation for step 2:

  1. First, write 4 functions that return all possible words corresponding to the edits insert, delete, replace, swap.
  2. Write a function that returns the words which are one edit distance away given the word.
  3. Using the above function write a function which returns the words which are two edit distance away from the word.
  4. Using the above two functions, write a function that returns all the words which are at most two edits away.
def delete(word):    
    d = []
    s = []
    for i in range(len(word)):
        s.append((word[:i],word[i:]))
    for j,k in s:
        d.append(j+k[1:])    
    return d

def switch(word):   
    sw = []
    sp = []
    l=len(word)    
    for i in range(l):
        sp.append((word[:i],word[i:]))
    sw = [j + k[1] + k[0] + k[2:] for j,k in sp if len(k) >= 2]     
    return sw

def replace(word):
    letters = 'abcdefghijklmnopqrstuvwxyz'
    r = []
    s = []
    for i in range(len(word)):
        s.append((word[0:i],word[i:]))
    r = [j + l + (k[1:] if len(k)> 1 else '') for j,k in s if k for l in letters]
    rs=set(r)    
    r = sorted(list(rs))      
    return r

def insert(word):
    letters = 'abcdefghijklmnopqrstuvwxyz'
    i = []
    s = []
    for j in range(len(word)+1):
        s.append((word[0:j],word[j:]))
    i = [ m + l + n for m,n in s for l in letters] 
    return i

def edit1(word): 
    edit1s = set()
    edit1s.update(delete(word))
    edit1s.update(replace(word))
    edit1s.update(insert(word))
    edit1s.update(switch(word))
    return edit1s

def edit2(word):   
    edit2s = set()
    edit_one = edit1(word)
    for w in edit_one:
        if w:
            edit_two = edit1(w)
            edit2s.update(edit_two)
    return edit2s

STEP 3 : 

In this step, we take the list and compare it with the dictionary and store those in the dictionary.

STEP 4 :

This is the last step in calculating the probabilities of all the words present in the list using the corpus and printing the word with the highest probability.

Code for Step 3 and Step 4 is as follows:

def correct(word, probs, vocab, n=2):

    suggestions = []
    n_best = []
    suggestions = list((word in vocab and word) or edit1(word).intersection(vocab) or edit2(word).intersection(vocab))
    n_best = [[s,probs[s]] for s in list(reversed(suggestions))] 
    print("input word = ", word, "\nsuggestions = ", suggestions)
    return n_best

my_word = 'nme' 
corrections = correct(my_word, probs, vocab, 4)

This is how the autocorrect algorithm works. 


Note: This is a guest post, and opinion in this article is of the guest writer. If you have any issues with any of the articles posted at www.marktechpost.com please contact at asif@marktechpost.co

Shivesh Kodali is a content writing consultant at MarktechPost. He is currently pursuing his B.Tech in Electronics and Communication Engineering from Indian Institute of Technology(IIT), Kharagpur. He is a Deep Learning fanatic who loves understanding and implementing its complex algorithms.