Assignment 5: Spell checking with FSTs
Deadline: Feb 21, 2022 14:00 CET
Late deadline: Feb 25, 2022 14:00 CET
A common application of finite state transducers (FSTs) is spelling correction. Given a misspelled word, we can view spelling correction as a two-step process:
- Generating candidate words
- Ranking them (and selecting top candidate in case of automatic correction)
For the first step, a common approach is finding words that are close to the misspelled word in terms of edit distance. In this assignment we will use an FST to generate strings that are one or two edit distance away from a given (misspelled) word. For example, the following FST generates all strings where one of the following edit operations are allowed: (1) ‘p’ is replaced with ‘b’, (2) ‘a’ is deleted, (3) a ‘k’ is inserted. The symbol Σ indicates all identity mappings in the alphabet, e.g., a:a, b:b, k:k, p:p for the example below.
This is, of course, just a part of a practical spell checker. We typically use a weighted FST to obtain a ranked list, and it is also common to represent the lexicon as an FST as well, so that (the inverse of) the composition of the lexicon and the error model produces only the valid words that are similar to the misspelled word. For more information on use of FSTs for spell checking, see, for example, Hassan et al. (2008) or Pirinen and Lindén et al. (2014).
Exercises
5.1 Implementing a simple FST class
Implement the FST
class in the template and its
add_transition()
method.
In our use case, we will typically want to know
which target state to go and which output symbol to emit
when we are in a particular source state and receive a particular input symbol.
Note that we may have multiple (target_state, output_symbol)
pairs
for a single (source_state, input_symbol)
.
You are required to keep all edges (transitions) in a Python dictionary,
where the keys are (source_state, input_symbol)
tuples,
and the values are sets of (target_state, output_symbol)
tuples.
5.2 Dot files again!
Implement the method write()
in the FST class that writes out
the Graphviz
dot
representation of the FST.
5.3 FST transduction
Implement the method transduce()
that returns an iterable
containing all possible transductions of the input string
according to the FST.
5.4 Levenshtein FST
Implement the function levenshtein_fst()
that
builds and returns a Levenshtein FST over a given alphabet
with the given distance.
The FST should transduce a given string to all strings
that are ‘n’ edit operations away from the given string.
The generated FST should
allow insertion, deletion and substitution operations.
You are not required to implement it,
but you are strongly recommended to also think about
how to represent transposition,
which is a typical spelling error
where two consecutive characters are swapped
(e.g., linguistics
misspelled as linguistisc
).
5.5 Putting the pieces together
Implement the function generate_alternatives()
,
which generates all alternative spellings of a given (misspelled) word
allowed by the given FST and valid according to the given lexicon.
An example lexicon file sample-lexicon.txt is provided for your convenience. However, you are recommended to debug your system on a smaller data set (a smaller alphabet allows easier visual inspection of the resulting FSTs).
Note that this, again, is a simplification. Ideally, we also represent the lexicon as an FST, and compose it with the Levenshtein FST to generate only valid word forms.
Wrapping up
Do not forget to
- indicate your name(s) in the file header
- commit all your changes
- push it to your GitHub repository
- tag it as
final
(and do not forget to push the tag)