Assignment 5: Spell checking with FSTs

Deadline: Feb 21, 2022 14:00 CET
Late deadline: Feb 25, 2022 14:00 CET

A common application of finite state transducers (FSTs) is spelling correction. Given a misspelled word, we can view spelling correction as a two-step process:

Generating candidate words
Ranking them (and selecting top candidate in case of automatic correction)

For the first step, a common approach is finding words that are close to the misspelled word in terms of edit distance. In this assignment we will use an FST to generate strings that are one or two edit distance away from a given (misspelled) word. For example, the following FST generates all strings where one of the following edit operations are allowed: (1) ‘p’ is replaced with ‘b’, (2) ‘a’ is deleted, (3) a ‘k’ is inserted. The symbol Σ indicates all identity mappings in the alphabet, e.g., a:a, b:b, k:k, p:p for the example below.

This is, of course, just a part of a practical spell checker. We typically use a weighted FST to obtain a ranked list, and it is also common to represent the lexicon as an FST as well, so that (the inverse of) the composition of the lexicon and the error model produces only the valid words that are similar to the misspelled word. For more information on use of FSTs for spell checking, see, for example, Hassan et al. (2008) or Pirinen and Lindén et al. (2014).

Exercises

5.1 Implementing a simple FST class

Implement the FST class in the template and its add_transition() method. In our use case, we will typically want to know which target state to go and which output symbol to emit when we are in a particular source state and receive a particular input symbol. Note that we may have multiple (target_state, output_symbol) pairs for a single (source_state, input_symbol). You are required to keep all edges (transitions) in a Python dictionary, where the keys are (source_state, input_symbol) tuples, and the values are sets of (target_state, output_symbol) tuples.

5.2 Dot files again!

Implement the method write() in the FST class that writes out the Graphviz dot representation of the FST.

5.3 FST transduction

Implement the method transduce() that returns an iterable containing all possible transductions of the input string according to the FST.

5.4 Levenshtein FST

Implement the function levenshtein_fst() that builds and returns a Levenshtein FST over a given alphabet with the given distance. The FST should transduce a given string to all strings that are ‘n’ edit operations away from the given string. The generated FST should allow insertion, deletion and substitution operations.

You are not required to implement it, but you are strongly recommended to also think about how to represent transposition, which is a typical spelling error where two consecutive characters are swapped (e.g., linguistics misspelled as linguistisc).

5.5 Putting the pieces together

Implement the function generate_alternatives(), which generates all alternative spellings of a given (misspelled) word allowed by the given FST and valid according to the given lexicon.

An example lexicon file sample-lexicon.txt is provided for your convenience. However, you are recommended to debug your system on a smaller data set (a smaller alphabet allows easier visual inspection of the resulting FSTs).

Note that this, again, is a simplification. Ideally, we also represent the lexicon as an FST, and compose it with the Levenshtein FST to generate only valid word forms.

Wrapping up

Do not forget to

indicate your name(s) in the file header
commit all your changes
push it to your GitHub repository
tag it as final (and do not forget to push the tag)