Assignment 6: CKY parsing

Deadline: Feb 28, 2022 14:00 CET
Late deadline: Mar 4, 2022 14:00 CET

In this set of exercises, you will build a CKY recognizer for a given grammar. The example application follow is “delexicalized parsing”. That is, the sequences we will parse are not the words, but part of speech tags. Our aim here is simplification of the exercise (e.g., by reducing the size of the lexicon). However, delexicalized parsing is useful in some applications. For example, one can use a large treebank to build a delexicalized parser for “resource-rich” language, and use the same parser for a related “low-resource” language. If the languages share the grammatical structure to a large extent, but the lexicons are substantially different, such an approach may help building parsers for low-resource languages.

Throughout the following exercises, we will follow the convention that lower case strings, such as n, v, prn, are the terminal symbols, while the upper case strings, NP, VP, but also N, V, PRN are nonterminal symbols.

We will work with a simple text file format for storing context-free grammars. The file contains a single grammar rule per line, and the format of each line is LHS -> RHS, where LHS stands for “left-hand-side” and RHS stands for “right-hand-side” (of a rule). Since we are working with context free grammars, LHS has to contain a single nonterminal, but RHS may be one or more terminal or nonterminal symbols (we only work with grammars having at least one symbol on the RHS). Any line starting with ‘#’ is a comment and should be ignored. An example grammar is provided in your repository.

As usual, please follow the additional instructions in the template.

Exercises

6.1 Reading the grammar [1p]

Implement the function read_grammar() which takes a filename (containing a grammar description as defined above), and returns a list of grammar rules represented as tuples (LHS, RHS). Remember that RHS may contain multiple symbols, hence it has to be another sequence (tuple or list).

6.2 Generating sentences from a context-free grammar [2p]

Implement the function generate_sentence() which generates a random sentence from the given grammar (as returned from read_grammar()).

The generation procedure should start with a sentential form that contains only S, and rewriting the first nonterminal symbol in the sentential form using a randomly chosen rule with matching LHS until there are no nonterminals left. Our implementation also includes a maxlen parameter. If the length of the sentential form reaches maxlen, you should rest the sentential form to S, and try to generate another sentence which is at most maxlen words long.

You can use the example grammar for testing. However, your implementation should be general enough to work with any CFG with the restrictions defined above (e.g., no ε rules).

Questions (not required for the assignment)

Is the generation procedure defined above guaranteed to terminate in general?
Is the generation procedure defined above guaranteed to terminate for our example grammar?
The data structure you are asked to use for the grammar is not efficient. What would be a better data structure for the purpose of this exercise?

6.3 Adjusting the grammar: CNF [2p]

Our example grammar is not in Chomsky normal form (CNF). Since CKY requires the input grammar to be in CNF, convert the grammar to CNF, write the resulting grammar as grammar-cnf.txt following the same format used for grammar.txt. Note that you do not need to write any Python code for this exercise. You can use your favorite editor to create/edit grammar-cnf.txt.

You may want to verify the well-formedness of your grammar by generating some sentences from it.

Do not forget to add the file grammar-cnf.txt to your repository.

6.4 CKY recognition [5p]

Implement the cky_recognize() function, which returns True if the provided sentence is in the language, False otherwise.

Questions (not required for the assignment)

Again, the data structure you are asked to use for the grammar is not efficient for this exercise either. What would be a better data structure for the purpose of this exercise?
What changes would you need if we wanted to do parsing, i.e., if we wanted to be able to recover the parse tree(s) from the CKY chart.

Wrapping up

Do not forget to

indicate your name(s) in the file header
commit all your changes
push it to your GitHub repository
tag it as final (and do not forget to push the tag)