Recently, I decided to write a simple spellchecker app in Haskell. I decided to do this because I wanted to increase my Haskell skills in completing a small scale project, and also because I haven’t done a small project in a while and I wanted to prove to myself that I can still do that.
As part of my Haskell skill set, I want to include Hedgehog as part of the project, to learn about property testing. It was while I was writing properties about my spellchecker that I discovered the power of property testing.
Instead of going through the code development linearly, I’m going to pair up each function with it’s properties, up until I get to the point where I have stopped to write this blog post. I will make a followup post later when I finish the rest of the spellchecker.
Levenshtein Distance Calculation
Implementation
To begin with, I researched implementations of edit distance calculations in Haskell, because I knew about Levenshtein distance already. I found this code on Wikipedia:
lDistance :: (Eq a) => [a] -> [a] -> Int
lDistance [] t = length t
lDistance s [] = length s
lDistance (a:s') (b:t') =
if
a == b
then
lDistance s' t'
else
1 + minimum
[ lDistance (a:s') t'
, lDistance s' (b:t')
, lDistance s' t'
]
Since this is not an intro to Haskell post, I’m going to not explain this line by line. The gist of this code is that this function called lDistance takes two lists (containing any type) and recursively checks the edit distances of smaller and smaller parts of the list, summing up the number of edits allowed in Levenshtein’s rules (adding an element, removing an element, or substituting one element for another), and then returns the smallest number it finds. That number is the Levenshtein distance between those two lists.
This particular implementation is recursive, and not efficient. It takes my low-end laptop a relatively long time to calculate large lists’ edit distances.
Property Test
Copy-pasting the code above was the easy part. Writing the property tests for it took a lot of thinking on my part, since I’m new to this and I didn’t look for existing property tests to copy or emulate. I was able to get a head start on writing some of the Hedgehog code, since there was an example code that involves generating a string to test against. Since a String is a List of Chars in Haskell, I was able to adapt that to my needs. So, the first part of my test suite defines what a “word” is:
import Hedgehog
import qualified Hedgehog.Gen as Gen
import qualified Hedgehog.Range as Range
import Lib
genWord :: Gen String
genWord =
Gen.list (Range.linear 0 10) Gen.alpha
What this code does (besides our imports) is to define a Hedgehog generator that creates a string (Gen String). The string is a list of between 0 and 10 alphabetical characters. Each time a property test is run, it creates a new string of random characters to use in your tests. The letters come from the set of Latin characters in both upper and lower case (without any diacritics). But, you don’t have just one test run each time you run your test suite; tests are run hundreds if not thousands of times.
Because of this, I don’t have to write as many unit tests. All 1052 combinations of inputs for that one word are possible candidates for testing that I don’t have to write by hand.
Now, there is a possibility that there is an edge case that deserves its own unit test, but that can be added when it is found empirically. Also, I chose a small number for the upper bound length of a word because my currently implementation is really inefficient.
Enough about the generator, what about the actual property test?
prop_lDistance_of_a_word_with_itself_is_zero :: Property
prop_lDistance_of_a_word_with_itself_is_zero =
property $ do
w <- forAll genWord
lDistance w w === 0
What this property test says is this: “the lDistance between a word and itself is zero.” Each time that this test is run, it will generate a string somewhere between 0 and 10 characters long and call lDistance on itself. Every time I’ve run the test since then, it has passed with 100 test runs.
But, how do I test that lDistance does what it’s supposed to do with different words? For this property, I appended “a” to the generated word, and calculate the lDistance, which should be one:
prop_lDistance_of_two_words_one_char_apart_is_one :: Property
prop_lDistance_of_two_words_one_char_apart_is_one =
property $ do
w <- forAll genWord
let w' = 'a':w in lDistance w w' === 1
Great! If I was really thorough, I’d write a property for more branches of the lDistance function, but I know that I’m going to reimplement it in a more efficient manner, as suggested in the Wikipedia article.
Spelling Suggestions
Implementation
The next step in my spellchecker implementation that I thought of is a function that returns spelling suggestions based on the distance of each word. However, I didn’t know what the best distance to use was, I decided to write a function that could take any number as the “maximum distance allowed” and make suggestions based on that number.
import qualified Data.Text as T
suggestions :: Int -> [T.Text] -> T.Text -> Maybe [T.Text]
suggestions maxDist dict word = if elem word dict then Nothing else Just $ filter distance dict
where
distance entry (lDistance (T.unpack word) (T.unpack entry)) <= maxDist
Now, what this code does is that if a word is in the dictionary, then return Nothing. Nothing in this case means “no suggestions to make” and my plan is to check for Nothing in other parts of the code base when looking at suggestions. Otherwise, if the word is not in the dictionary, then create a list of suggestions that are within the maxDist of the word.
Property Test
The property tests for this actually reveal a logic bug, and that makes property testing exciting for me.
Here are the tests:
genDict :: Gen [String]
genDict =
Gen.list (Range.linear 0 100) genWord
prop_if_a_word_is_in_dictionary_suggestions_returns_Nothing :: Property
prop_if_a_word_is_in_dictionary_suggestions_returns_Nothing =
property $ do
w <- forAll genWord
let text = T.pack w in suggestions 1 [text] text === Nothing
prop_given_an_empty_word_suggestions_returns_whole_dictionary :: Property
prop_given_an_empty_word_suggestions_returns_whole_dictionary =
property $ do
d <- forAll genDict
let word = T.pack ""
dict = Prelude.map T.pack d
in suggestions 100 dict word === Just dict
Here is the output of what happens when I run these tests:
Running 1 test suites...
Test suite spell-test: RUNNING...
Main
lDistance of a word with itself is zero: OK (0.01s)
✓ lDistance of a word with itself is zero passed 100 tests.
lDistance of two words one char apart is one: OK (0.04s)
✓ lDistance of two words one char apart is one passed 100 tests.
if a word is in dictionary suggestions returns Nothing: OK (0.02s)
✓ if a word is in dictionary suggestions returns Nothing passed 100 tests.
given an empty word suggestions returns whole dictionary: FAIL (0.01s)
✗ given an empty word suggestions returns whole dictionary failed at test/Spec.hs:47:16
after 3 tests.
┏━━ test/Spec.hs ━━━
41 ┃ prop_given_an_empty_word_suggestions_returns_whole_dictionary :: Property
42 ┃ prop_given_an_empty_word_suggestions_returns_whole_dictionary =
43 ┃ property $ do
44 ┃ d <- forAll genDict
┃ │ [ "" ]
45 ┃ let word = T.pack ""
46 ┃ dict = Prelude.map T.pack d
47 ┃ in suggestions 100 dict word === Just dict
┃ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
┃ │ ━━━ Failed (- lhs) (+ rhs) ━━━
┃ │ - Nothing
┃ │ + Just [ "" ]
This failure can be reproduced by running:
> recheck (Size 2) (Seed 12497716144314615761 423022239135706347) given an empty word suggestions returns whole dictionary
Use '--hedgehog-replay "Size 2 Seed 12497716144314615761 423022239135706347"' to reproduce.
1 out of 4 tests failed (0.04s)
Test suite spell-test: FAIL
In this case, having a dictionary that contains only an empty string, suggestions will match the “word” of empty string with the “dictionary entry” of empty string and return Nothing.
What’s amazing to me is that I could have caught that logic bug just from either thinking about it really hard or did really extensive repl-testing or unit-testing, but because Hedgehog generates the inputs for me, I don’t have to think of these edge cases myself.
Conclusions
I have to go back to the drawing board with my implementation, but because I’m getting more comfortable with writing property tests, I can do more risky things like reimplement the lDistance function, or make my suggestions function more parametric without worrying about my properties breaking while I do it.
I hope that this post was useful for someone. If it was, please comment below.
Un buen blog! Voy a marcar unos pocos de estos .. Mollee Johannes Spiro
LikeLike