A modern replacement for spell(1)

Abhinav Upadhyay <abhinav@NetBSD.org>

AsiaBSDCon 2017

Problems with old spell

Very ancient - dates back to Unix version 7

Problems with old spell

Very ancient - dates back to Unix version 7
Uses inflection rules to do spell check

Problems with old spell

Very ancient - dates back to Unix version 7
Uses inflection rules to do spell check
Rules not 100% accurate, for example mis-spellings like appled, coffeed, undoubt, repremanded, undoubtedlys are not caught

Problems with old spell

Very ancient - dates back to Unix version 7
Uses inflection rules to do spell check
Rules not 100% accurate, for example mis-spellings like appled, coffeed, undoubt, repremanded, undoubtedlys are not caught
Rules only work for English language

Problems with old spell

Very ancient - dates back to Unix version 7
Uses inflection rules to do spell check
Rules not 100% accurate, for example mis-spellings like appled, coffeed, undoubt, repremanded, undoubtedlys are not caught
Rules only work for English language
No support for spell corrections

Problems with old spell

Very ancient - dates back to Unix version 7
Uses inflection rules to do spell check
Rules not 100% accurate, for example mis-spellings like appled, coffeed, undoubt, repremanded, undoubtedlys are not caught
Rules only work for English language
No support for spell corrections
No library interface for other applications to add spell check support - shells, pkgin, pkg_add, apropos could benefit

A modern spell(1)

Uses a expanded dictionary instead of inflection rules (size of new dictionary 5.1M compared to 2.4 M of the old dictionary)

A modern spell(1)

Uses a expanded dictionary instead of inflection rules
Levenshtein distance and soundex techniques used for finding possible corrections

A modern spell(1)

Uses a expanded dictionary instead of inflection rules
Levenshtein distance and soundex techniques used for finding possible corrections
Also support for n-gram models to do context sensitive corrections and grammar checks (experimental WIP)

A modern spell(1)

Uses a expanded dictionary instead of inflection rules
Levenshtein distance and soundex techniques used for finding possible corrections
Also support for n-gram models to do context sensitive corrections and grammar checks (experimental WIP)
A tool to parse any corpus and generate dictionary to do application specific spell check

A modern spell(1)

Uses a expanded dictionary instead of inflection rules
Levenshtein distance and soundex techniques used for finding possible corrections
Also support for n-gram models to do context sensitive corrections and grammar checks (experimental WIP)
A tool to parse any corpus and generate dictionary to do localized or application specific spell check
The core spell checking and correction functionality available as a reusable library

How spell correction works

Levenshtein distance - minimum number of edits required to convert one string into another
Generate all possible words at distance 1 or 2 and see which ones of them are in the dictionary
Lower weight to corrections involving a change in the 1st character or replacement of a character given
Higher weight to corrections having the same soundex code
On no match at distance 1, same process done at distance 2
If still no match, word having the same soundex code with minimum edit distance selected.

Support for other languages

Levenshtein distance is language agnostic
Dictionary for any language can be generated and used for spell checking
But before that some work needed to add support for wide chars

Comparison with GNU aspell

Total number of tests: 3945
Matches at first place: 91.33% (aspell 74%)
Matches at positions 1-5: 95.26% (aspell 96.6%)
Matches at positions 1-10: 95.59% (aspell 98.2%)
Matches at positions 1-25: 95.77% (aspell 99%)
Matches at positions 1-50: 95.84% (aspell 99.2%)
Matches at 1-100: 95.92% (aspell 99.2%)

A modern replacement for spell(1)

Problems with old spell

Problems with old spell

Problems with old spell

Problems with old spell

Problems with old spell

Problems with old spell

Problems with old spell

A modern spell(1)

A modern spell(1)

A modern spell(1)

A modern spell(1)

A modern spell(1)

A modern spell(1)

How spell correction works

Support for other languages

Comparison with GNU aspell

Questions?

Thank you :-)