Assignment 2: Dictionary-based segmentation

Deadline: Dec 13, 2021 14:00 CET
Late deadline: Dec 17, 2021 14:00 CET

In this set of exercises you will experiment with a few different approaches to segmenting a string based on a dictionary. The task studied, segmenting a string into useful units, is a prevalent problem in computational linguistics. It is also a first step towards parsing, another important problem in the field.

Later exercises in this assignment build on the earlier ones, and you may use your solution to an earlier exercise as the starting point for a later one.

General information

The policy and general information about the assignments are provided on the course web page. Please also check the information provided in the introductory assignment (ungraded).

Exercises

2.1 Checking if a string can be constructed from a given lexicon (2p)

Implement the function is_sentence() in the template, which determines whether the given string can be constructed by concatenating the “words” in the provided lexicon. The words can be in any order, and they can be repeated. As long as there is a way to construct the given string by combining the words from the lexicon, we assume that it is a valid “sentence”, and the function returns True; otherwise it should return False.

For example, given the lexicon {'a', 'ab', 'bc', 'cd', 'd'}, the strings aad and abcd are sentences. The first string, aad, can be constructed from the lexicon as the word sequence a-a-d, and abcd can be constructed in two alternative ways: a-bc-d and ab-cd. On the other hand, bdc and e are not sentences, since there is no way to build these strings from the given lexicon. You are free to accept or reject the empty string ''.

For this exercise, implement a recursive algorithm that returns True for valid sentences with respect to the given lexicon, and False if the words in the lexicon cannot be arranged to construct the input string. Do not use dynamic programming or any other form of optimization. However, your function should return True as soon as it identifies a possible combination, without examining the remaining combinations, and you should avoid testing splits of the input string that are not supported by the lexicon.
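
To illustrate one possible shape of the recursion, here is a minimal sketch, assuming the lexicon is passed in as a set of strings (the actual signature is fixed by the template):

    def is_sentence(s, lexicon):
        if s == '':
            return True          # empty remainder: the whole string was matched
        for i in range(1, len(s) + 1):
            # only recurse on splits whose prefix is an actual lexicon word
            if s[:i] in lexicon and is_sentence(s[i:], lexicon):
                return True      # stop at the first successful combination
        return False

Trying only prefixes that are actual lexicon entries keeps the search from exploring unsupported splits, and returning from inside the loop stops the search at the first successful combination.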

2.2 Enumerate all possible ways the string can be constructed (2p)

Implement the function tokenize(), which returns all possible ways the given string can be constructed from the given lexicon. Using the example from exercise 2.1, the function should return [['a', 'bc', 'd'], ['ab', 'cd']] for the string abcd with the example lexicon (you are free to use another sequence type, e.g., tuples). For strings with no possible tokenization based on the lexicon, the function should return an empty sequence, e.g., [].

As in 2.1, use a recursive brute-force algorithm without any optimization.
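
A minimal sketch of such an enumerator, under the same set-based lexicon assumption as in 2.1:

    def tokenize(s, lexicon):
        if s == '':
            return [[]]          # exactly one (empty) tokenization of the empty remainder
        analyses = []
        for i in range(1, len(s) + 1):
            word = s[:i]
            if word in lexicon:
                # every tokenization of the rest extends to one of the full string
                for rest in tokenize(s[i:], lexicon):
                    analyses.append([word] + rest)
        return analyses

Note the base case: the empty remainder yields one empty tokenization ([[]]), while a remainder with no matching prefix yields none ([]).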

2.3 Greedy tokenization (2p)

Implement the function tokenize_greedy(), which returns only a single tokenization (or an empty sequence if the procedure fails to find one) by repeatedly taking the longest token in the lexicon that matches the beginning of the remaining string, and continuing the search right after the token just identified.

For example, with the example lexicon from 2.1, the function should return [['ab', 'cd']] for the string abcd: it matches the longest possible prefix in the lexicon (ab), and continues matching after ab with the same strategy, resulting in another match, cd. Since the whole string can be constructed this way (we reach the end of the input), the function returns the segmentation ab-cd, even though it misses the other possible segmentation, a-bc-d.

Note that this strategy may also fail to find a valid tokenization. For example, for the string abc, the initial maximal match is ab, leaving c, which is not in the lexicon. As a result, the greedy strategy fails (returns []), even though a valid segmentation, a-bc, exists.
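
A minimal sketch of this maximal-munch strategy (again assuming a set-based lexicon):

    def tokenize_greedy(s, lexicon):
        tokens = []
        while s:
            # take the longest prefix of the remaining string that is a word
            for i in range(len(s), 0, -1):
                if s[:i] in lexicon:
                    tokens.append(s[:i])
                    s = s[i:]
                    break
            else:                # no prefix matched: the greedy strategy fails
                return []
        return [tokens]

The for-else construct is idiomatic here: the else branch runs only when no prefix of the remaining string is in the lexicon.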

2.4 Memoization (dynamic programming) (2p)

Implement the function tokenize_memo(), which turns the brute-force approach from 2.2 into a dynamic programming solution. Your function should store earlier results, and should never repeat the computation for the same substring during the recursive calls.
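
One way to structure the memoization is to key the table on the start position of the remaining suffix; a minimal sketch (the inner helper solve() is only illustrative):

    def tokenize_memo(s, lexicon):
        memo = {}                          # start index -> list of analyses

        def solve(start):
            if start == len(s):
                return [[]]
            if start not in memo:          # compute each suffix only once
                analyses = []
                for end in range(start + 1, len(s) + 1):
                    word = s[start:end]
                    if word in lexicon:
                        for rest in solve(end):
                            analyses.append([word] + rest)
                memo[start] = analyses
            return memo[start]

        return solve(0)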

2.5 Testing running time and success for typical use (1p)

Your assignment repository contains an example lexicon and an example set of sentences (or utterances). Both files are constructed from real-world child-directed utterances obtained from the CHILDES database, a large collection of corpora used (mainly) for research on child language acquisition. Each utterance is on a single line, with words separated by white space. The lexicon is constructed from a different set of utterances; as a result, the test sentences contain some unknown words that are not listed in the lexicon. The lexicon file contains a single word per line.

Test all three tokenization algorithms, reporting two quantities for each: the average running time per sentence, and the percentage of sentences tokenized correctly (i.e., where the algorithm's output contains the original sequence of words).

Your function compare_average() should calculate these values for each algorithm and print the results to the output. An example run is provided below:

Greedy:      0.00001226 seconds, 84.40% correct.
Brute-force: 0.00003485 seconds, 94.00% correct.
Memo:        0.00003425 seconds, 94.00% correct.

Make sure that the reported running times are averages over multiple runs through the sentences in the test data (you may also consider shuffling the test sentences before each run). Also note that you will need to remove the spaces from the sentences before calling your tokenize*() functions, while keeping the original word sequences, since you need them to test the correctness of the algorithms.
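
A minimal sketch of such a measurement, assuming the tokenizers take (string, lexicon) arguments; the helper name evaluate() and the number of runs are only illustrative:

    import time

    def evaluate(tokenize_fn, sentences, lexicon, runs=5):
        total_time, correct = 0.0, 0
        for _ in range(runs):
            for sentence in sentences:
                words = sentence.split()       # gold-standard segmentation
                s = ''.join(words)             # spaces removed for the tokenizer
                start = time.perf_counter()
                analyses = tokenize_fn(s, lexicon)
                total_time += time.perf_counter() - start
                if words in analyses:          # gold sequence among the results
                    correct += 1
        n = runs * len(sentences)
        return total_time / n, correct / n     # avg time, fraction correct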

2.6 Testing worst-case running time (1+1p)

Identify a worst-case input (a string and a lexicon) for the brute-force tokenization algorithm above, and calculate the average running times of all three tokenization algorithms on input strings of increasing length (each string should be the worst-case input for its length, given the lexicon).

The function compare_worstcase() in the template should include the worst-case lexicon and plot the running time of each algorithm on worst-case strings of increasing length (lengths from 1 to 20 should be feasible on a reasonable computer, but you are encouraged to explore the limits of the algorithms with longer strings). The plot, showing the running time of all three algorithms as a function of the string length n, should be saved as a PDF file named worst-case.pdf. For plotting, the Python matplotlib library is recommended; scaling the y-axis logarithmically is also recommended.
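
As a starting point, one classic worst case (an assumption you should verify and justify yourself) is the lexicon {'a', 'aa'} with input strings of the form 'aa...ab': the trailing b makes every segmentation fail, so the brute-force search explores exponentially many dead ends, while memoization collapses them. A plotting sketch along these lines, using the tokenizers from 2.2-2.4:

    import time
    import matplotlib
    matplotlib.use('Agg')                  # render to file, no display needed
    import matplotlib.pyplot as plt

    def compare_worstcase(max_n=20, runs=3):
        lexicon = {'a', 'aa'}              # assumed worst-case lexicon
        lengths = list(range(1, max_n + 1))
        for name, fn in [('Greedy', tokenize_greedy),
                         ('Brute-force', tokenize),
                         ('Memo', tokenize_memo)]:
            times = []
            for n in lengths:
                s = 'a' * (n - 1) + 'b'    # worst-case string of length n
                start = time.perf_counter()
                for _ in range(runs):
                    fn(s, lexicon)
                times.append((time.perf_counter() - start) / runs)
            plt.plot(lengths, times, label=name)
        plt.xlabel('input length n')
        plt.ylabel('average running time (s)')
        plt.yscale('log')                  # exponential growth appears linear
        plt.legend()
        plt.savefig('worst-case.pdf')

On a logarithmic y-axis, the exponential running time of the brute-force algorithm shows up as a roughly straight line, while the greedy and memoized variants stay near the bottom of the plot.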

Wrapping up

Do not forget to