Assignment 2: Dictionary-based segmentation
Deadline: Dec 13, 2021 14:00 CET
Late deadline: Dec 17, 2021 14:00 CET
In this set of exercises you will experiment with a few different approaches to segmenting a string based on a dictionary. Segmenting a string into useful units is a prevalent problem in computational linguistics, and it is also a first step towards parsing, another important problem in the field.
Later exercises in this assignment build on the earlier ones, and it is possible to use your solution to an earlier exercise as the starting point.
General information
The policy and general information about the assignments are provided on the course web page. Please also check the information provided in the introductory assignment (ungraded).
Exercises
2.1 Checking if a string can be constructed from a given lexicon (2p)
Implement the function is_sentence() in the template, which determines whether the given string can be constructed by concatenating the “words” in the provided lexicon. The words can be used in any order, and they can be repeated. As long as there is a way to construct the given string by combining words from the lexicon, we assume that it is a valid “sentence” and the function returns True; otherwise it should return False.
For example, given the lexicon {'a', 'ab', 'bc', 'cd', 'd'}, the strings aad and abcd are sentences. The first string, aad, can be constructed with a sequence of words from the lexicon as a-a-d, and abcd can be constructed in two alternative ways: a-bc-d and ab-cd. On the other hand, bdc and e are not sentences, since there is no way to build these strings from the given lexicon.
You are free to accept or reject the empty string ''.
For this exercise, implement a recursive algorithm that returns True for valid sentences based on the given lexicon, and False if it is not possible to arrange the words in the lexicon to construct the input string. Do not use dynamic programming or any other form of optimization. However, your function should return True as soon as it identifies a possible combination, without examining other possible combinations, and you should avoid testing splits of the input string that are not supported by the lexicon.
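A minimal recursive sketch along these lines is shown below (the exact function signature in the template may differ; the argument order and the handling of the empty string here are assumptions):

    def is_sentence(s, lexicon):
        """Return True if s can be built by concatenating words from lexicon."""
        if s == '':
            return True          # accepting the empty string here; rejecting it is also fine
        # Only try prefixes that are actually in the lexicon, and stop at the
        # first split that leads to a complete sentence.
        for i in range(1, len(s) + 1):
            if s[:i] in lexicon and is_sentence(s[i:], lexicon):
                return True
        return False

With the example lexicon above, is_sentence('abcd', {'a', 'ab', 'bc', 'cd', 'd'}) returns True, while is_sentence('bdc', {'a', 'ab', 'bc', 'cd', 'd'}) returns False.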
2.2 Enumerate all possible ways the string can be constructed (2p)
Implement the function tokenize(), which returns all possible ways the given string can be constructed from the given lexicon. Using the example from exercise 2.1, the function should return [['a', 'bc', 'd'], ['ab', 'cd']] for the string abcd with the example lexicon (you are free to use another sequence type, e.g., tuple). For strings with no possible tokenization based on the lexicon, the function should return the empty sequence, e.g., [].
Similar to 2.1, use a recursive brute-force algorithm, without any optimization.
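A possible recursive sketch (again, the exact signature is an assumption) collects, for each lexicon word that matches a prefix, all tokenizations of the remaining suffix:

    def tokenize(s, lexicon):
        """Return a list of all tokenizations of s, or [] if there are none."""
        if s == '':
            return [[]]          # exactly one way to tokenize the empty string: no tokens
        analyses = []
        for i in range(1, len(s) + 1):
            prefix = s[:i]
            if prefix in lexicon:
                # Every tokenization of the rest yields one tokenization of s.
                for rest in tokenize(s[i:], lexicon):
                    analyses.append([prefix] + rest)
        return analyses

With the lexicon from 2.1, tokenize('abcd', lexicon) returns [['a', 'bc', 'd'], ['ab', 'cd']], and tokenize('bdc', lexicon) returns [].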
2.3 Greedy tokenization (2p)
Implement the function tokenize_greedy(), which returns only a single tokenization (or an empty sequence if the procedure fails to find one) by using the longest token that matches the beginning of the string, and then looking for another token starting at the end of the token just identified. For example, with the example lexicon in 2.1, the function should return [['ab', 'cd']] for the string abcd: it matches the longest possible prefix in the lexicon (ab), continues matching after ab with the same strategy, and finds another match, cd. Since the whole string can be constructed (we reach the end of the input) using this strategy, the function returns success with the segmentation ab-cd, even though it misses the other possible segmentation a-bc-d.
Note that this strategy may also fail to find a valid tokenization. For example, for the string abc, the initial maximal match is ab, leaving c, which is not in the lexicon. As a result, this greedy strategy fails (returns []), even though a valid segmentation, a-bc, exists.
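One way to realize this greedy strategy is an iterative longest-prefix loop, sketched below (signature again an assumption):

    def tokenize_greedy(s, lexicon):
        """Greedy longest-prefix matching: [tokens] on success, [] on failure."""
        tokens = []
        while s:
            # Find the longest lexicon word matching the start of the remaining string.
            for i in range(len(s), 0, -1):
                if s[:i] in lexicon:
                    tokens.append(s[:i])
                    s = s[i:]
                    break
            else:                # no prefix matched: the greedy search fails
                return []
        return [tokens]

There is no backtracking: once ab is chosen for abc, the leftover c cannot be matched and the function returns [].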
2.4 Memoization (dynamic programming) (2p)
Implement the function tokenize_memo(), which turns the original brute-force approach in 2.2 into a dynamic programming approach. Your function should store the earlier solutions, and should never repeat the computation for the same substring during the recursive calls.
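One way to memoize the recursion from 2.2 is to index the results by the starting position of the suffix, so each suffix of the input is analysed at most once (a sketch under the same signature assumption as above):

    def tokenize_memo(s, lexicon):
        """Like tokenize(), but each suffix of s is analysed only once."""
        memo = {}                              # start index -> list of tokenizations

        def solve(start):
            if start == len(s):
                return [[]]
            if start in memo:                  # reuse the stored solution
                return memo[start]
            analyses = []
            for end in range(start + 1, len(s) + 1):
                word = s[start:end]
                if word in lexicon:
                    for rest in solve(end):
                        analyses.append([word] + rest)
            memo[start] = analyses
            return analyses

        return solve(0)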
2.5 Testing running time and success for typical use (1p)
Your assignment repository contains an example lexicon and an example set of sentences (or utterances). Both files are constructed from real-world child-directed utterances obtained from the CHILDES database, a large collection of corpora used (mainly) for child language acquisition. Each utterance is on a single line, where words are separated by white space. The lexicon is constructed from another set of utterances. As a result, there are some unknown words in the test sentences that are not listed in the lexicon. The lexicon file simply contains a single word per line.
Test all three tokenization algorithms, reporting two quantities:
- average running time of the algorithm for each sentence
- percentage of correct results. We consider the output of an algorithm correct if the “gold-standard” tokenization of the string (according to the way it is indicated in the test file) matches one of the tokenizations offered by the algorithm exactly.
Your function compare_average() should calculate these values for each algorithm, and print the results to the output.
An example run is provided below:
Greedy: 0.00001226 seconds, 84.40% correct.
Brute-force: 0.00003485 seconds, 94.00% correct.
Memo: 0.00003425 seconds, 94.00% correct.
Make sure that the reported running times are averages of multiple runs over the sentences in the test data (you may also consider randomizing the order of the test sentences before each run). Also note that you will need to remove the spaces from the sentences before calling your tokenize*() functions (you need the original, space-separated versions to test the correctness of the algorithms).
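A sketch of such a comparison is given below; the sentences list, the tokenizers mapping and the number of runs are assumptions about how you organize your own code, not part of the template:

    import random
    import time

    def compare_average(sentences, lexicon, tokenizers, runs=5):
        """Print average per-sentence running time and accuracy for each tokenizer.

        sentences: list of space-separated gold-standard utterances
        tokenizers: mapping from a label to a tokenize*-style function
        """
        for name, tok in tokenizers.items():
            total_time, correct = 0.0, 0
            for _ in range(runs):
                random.shuffle(sentences)      # optional: randomize order each run
                for sent in sentences:
                    gold = sent.split()
                    joined = ''.join(gold)     # spaces removed before tokenizing
                    start = time.perf_counter()
                    analyses = tok(joined, lexicon)
                    total_time += time.perf_counter() - start
                    if gold in analyses:       # gold tokenization among the outputs
                        correct += 1
            n = runs * len(sentences)
            print(f"{name}: {total_time / n:.8f} seconds, {100 * correct / n:.2f}% correct.")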
2.6 Testing worst-case running time (1+1p)
Identify a worst-case input (a string and a lexicon) for the brute-force tokenization algorithm above, and calculate the average running times of all three tokenization algorithms for input strings of increasing length (each string should be the worst-case input given the lexicon and the length).
The function compare_worstcase() in the template should include the worst-case lexicon and plot the running time of each algorithm on the worst-case strings of increasing length (lengths from 1 to 20 should be feasible on a reasonable computer, but you are encouraged to explore the limits of the algorithms using longer strings). The plot, which shows the running time of all three algorithms for increasing string length n, should be saved as a PDF file named worst-case.pdf. For plotting, you are recommended to use the Python matplotlib library, and it is recommended to scale the y-axis logarithmically.
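A plotting skeleton along these lines is sketched below. In the actual template, compare_worstcase() is expected to contain the worst-case lexicon itself; here the lexicon and a make_string(n) helper that builds the worst-case string of length n are passed in as hypothetical parameters, since identifying them is part of the exercise:

    import time
    import matplotlib.pyplot as plt

    def compare_worstcase(tokenizers, lexicon, make_string, max_len=20, repeats=3):
        """Time each tokenizer on worst-case strings of length 1..max_len and plot."""
        lengths = list(range(1, max_len + 1))
        for name, tok in tokenizers.items():
            times = []
            for n in lengths:
                s = make_string(n)             # worst-case string of length n
                start = time.perf_counter()
                for _ in range(repeats):
                    tok(s, lexicon)
                times.append((time.perf_counter() - start) / repeats)
            plt.plot(lengths, times, label=name)
        plt.xlabel('string length n')
        plt.ylabel('average running time (seconds)')
        plt.yscale('log')                      # logarithmic y-axis, as recommended
        plt.legend()
        plt.savefig('worst-case.pdf')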
Wrapping up
Do not forget to
- indicate your name(s) in the file header
- commit all your changes
- push it to your GitHub repository
- tag it as final (and do not forget to push the tag)