A simple but fully functional spell checker implemented with 21 lines of python code

Introduce

When you use Google or Baidu to search, Google can always provide a very good spell check when you enter the search content. For example, if you enter speling, Google will return spelling immediately.

Below is a simple but fully functional spell checker implemented with 21 lines of python code.

Code

import re, collections

def words(text): return re.findall('[az]+', text.lower())

def train(features):

model = collections.defaultdict(lambda: 1)

for f in features:

model[f] += 1

return model

NWORDS = train(words(file('big.txt').read()))

alphabet ='abcdefghijklmnopqrstuvwxyz'

def edits1(word):

splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]

deletes = [a + b[1:] for a, b in splits if b]

transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]

replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]

inserts = [a + c + b for a, b in splits for c in alphabet]

return set(deletes + transposes + replaces + inserts)

def known_edits2(word):

return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):

candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]

return max(candidates, key=NWORDS.get)

The correct function is the entry point of the program, and the incorrectly spelled words passed in will return the correct ones. Such as:

>>> correct("cpoy")

'copy'

>>> correct("engilsh")

'english'

>>> correct("sruprise")

'surprise'

In addition to this code, as part of machine learning, there must be a lot of sample data. Big.txt is prepared as our sample data.

Principle behind

The above code is implemented based on Bayesian. In fact, the spell check implemented by Google and Baidu is also implemented by Bayesian, but it is definitely more complicated than this.

First, briefly introduce the principle behind it. If the reader has known it before, you can skip this paragraph.

Given a word, we try to choose the most likely correct spelling suggestion (the suggestion may also be the input word). Sometimes it is not clear (for example, should lates be corrected to late or latest?), we use probability to decide which one to suggest. We find the most likely spelling suggestion c from all possible correct spellings related to the original word w:

argmaxc P(c|w)

By Bayes' theorem, the above formula can be transformed into

argmaxc P(w|c) P(c) / P(w)

The following describes the meaning of the above formula:

P(c|w) represents the probability that you originally wanted to enter the word c when you entered the word w.

P(w|c) represents the probability that the user wants to enter the word c but enters w, which can be considered given by us.

P(c) represents the probability that the word c appears in the sample data

P(w) represents the probability that the word w appears in the sample number. It can be determined that the probability of P(w) is the same for all possible words c, so the above formula can be converted to

argmaxc P(w|c) P(c)

All our codes are based on this formula, the following analysis of specific code implementation

Code analysis

Use the words() function to extract the words in big.txt

def words(text): return re.findall('[az]+', text.lower())

re.findall('[az]+' uses the python regular expression module to extract all words that meet the condition of'[az]+', that is, words composed of letters. (Regular expressions are not described in detail here, there are Interested students can read the introduction to regular expressions. text.lower() converts text into lowercase letters, that is, "the" and "The" are defined as the same word.

Use the train() function to calculate the number of occurrences of each word and then train a suitable model

def train(features):

model = collections.defaultdict(lambda: 1)

for f in features:

model[f] += 1

return model

NWORDS = train(words(file('big.txt').read()))

Thus NWORDS[w] represents the number of times the word w appears in the sample. What if a word does not appear in our sample? The processing method is to set their times to 1 by default, which is implemented here through the collections module and lambda expressions.

collections.defaultdict() creates a default dictionary, lambda:1 sets each value in this dictionary to 1 by default. (For lambda expressions, see Introduction to Lambda)

Now we have processed the P(c) in the formula argmaxc P(w|c) P(c), and then processed P(w|c), that is, the probability of entering the word c but incorrectly entering the word w, through the "edit distance "--The number of edits required to change a word into another is measured. An edit may be a deletion, an exchange (two adjacent letters), an insertion, and a modification. The following function returns a set of all possible words w from editing c once:

def edits1(word):

splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]

deletes = [a + b[1:] for a, b in splits if b]

transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]

replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]

inserts = [a + c + b for a, b in splits for c in alphabet]

return set(deletes + transposes + replaces + inserts)

Related papers show that 80-95% of spelling errors are only one edit distance from the word you want to spell. If you feel that one edit is not enough, then we will do it again.

def known_edits2(word):

return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

At the same time, there may be those whose edit distance is 0 times, which means they are spelled correctly:

def known(words):

return set(w for w in words if w in NWORDS)

We assume that the probability of editing distance 1 time is much greater than that of 2 times, and that the probability of editing distance is much greater than that of 1 time. The following uses the correct function to first select the word with the smallest edit distance, and the corresponding P(w|c) will be larger as a candidate word, and then the word with the largest P(c) will be selected as the spelling suggestion

def correct(word):

candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]

return max(candidates, key=NWORDS.get)

Ningbo Autrends International Trade Co.,Ltd. , https://www.mosvapor.com