Assignment 3: Binary trees
Deadline: Jan 10, 2022 14:00 CET
Late deadline: Jan 14, 2022 14:00 CET
In this set of exercises you will practice working with binary trees. We will work with examples of binary phrase structure, or constituency trees (syntactic analyses of natural language sentences). However, binary trees are useful data structures for many other applications.
General information
The policy and general information about the assignments are provided at the course web page. Please also check the information provided in the introductory assignment (ungraded).
Exercises
3.1 Implement an array-based binary tree
Implement the following functions in the template, which provide utilities for manipulating binary trees represented as a Python list (used as an array). Please follow the standard procedure for implementing binary trees with arrays as described in the lectures. Throughout the implementation, we use the index of the array slot where a node is stored as the ID of the node.
add_node()
: Add a left or right child with the given label to a given node, and return the new node's index. This function should also extend the array when needed. The size of the array should be limited to the minimum storage needed for the current tree. An index value out of bounds indicates a non-existing node.
find_node()
: Return an iterable (a sequence, or better, a generator) that returns/yields the IDs (indices) of the nodes with the given label. A match found at the left child (or one of its descendants) should be returned before the node itself. Possible matches at the right child or its descendants should be returned after the node.
get_height()
: Return the height of the node with the given index.
get_depth()
: Return the depth of the node with the given index.
See the assignment template for a detailed description of each function. For empty slots in the array (the locations corresponding to non-existing nodes), use None as a placeholder.
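The standard index arithmetic (root at index 0, children of node i at 2i + 1 and 2i + 2) can be sketched as follows. This is only an illustration under those assumptions; the exact signatures and behaviour required by the template may differ, so follow the template's docstrings for your actual submission.

```python
def add_node(tree, label, parent=None, left=True):
    """Add a node with `label` and return its index.

    With parent=None the node becomes the root (index 0); otherwise it
    is added as the left or right child of `parent`.
    """
    if parent is None:
        idx = 0
    else:
        idx = 2 * parent + 1 if left else 2 * parent + 2
    if idx >= len(tree):
        # grow the array just enough, padding empty slots with None
        tree.extend([None] * (idx + 1 - len(tree)))
    tree[idx] = label
    return idx

def find_node(tree, label):
    """Yield indices of nodes with `label`, left subtree first,
    then the node itself, then the right subtree (in-order)."""
    def inorder(i):
        if i < len(tree) and tree[i] is not None:
            yield from inorder(2 * i + 1)
            yield i
            yield from inorder(2 * i + 2)
    for i in inorder(0):
        if tree[i] == label:
            yield i

def get_height(tree, i):
    """Height of the subtree rooted at index i (a leaf has height 0)."""
    children = [c for c in (2 * i + 1, 2 * i + 2)
                if c < len(tree) and tree[c] is not None]
    if not children:
        return 0
    return 1 + max(get_height(tree, c) for c in children)

def get_depth(i):
    """Depth of the node at index i (the root has depth 0)."""
    depth = 0
    while i > 0:
        i = (i - 1) // 2  # parent index
        depth += 1
    return depth
```

Note how find_node() realizes the required ordering simply by doing an in-order traversal and filtering by label.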
3.2 Building a tree from bracketed representation
A popular format for storing parse trees is the so-called Penn treebank notation. The nodes represent either a phrase label (like NP below) or a word, i.e., a terminal (like John below). The format encodes the hierarchical structure through nested parenthetical expressions. The label of an internal node appears right after an opening parenthesis. The leaf nodes, the words, are not placed inside parentheses of their own; each appears next to its immediate parent. Unary rules (internal nodes with a single child) are only allowed if the child is a leaf node (a word). Hence, every expression inside a pair of parentheses is either two labels (a leaf node preceded by its parent, no further parentheses) or a node label followed by its left and right subtrees (both subtrees surrounded by parentheses).
An example parse tree and its Penn treebank representation are provided below.
(S (NP John)
(VP (V ate)
(NP (Det a)
(N pizza)
)
)
)
Note that we are only concerned with binary trees of the form above (actual Penn treebank format allows n-ary trees with arbitrary n).
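Since every parenthesized expression is either a label plus a word, or a label plus two parenthesized subtrees, the format can be parsed with a single recursive function. The sketch below builds nested tuples rather than the array representation of exercise 3.1, so it is only an illustration of the recursive structure, not the expected solution.

```python
import re

def parse(penn):
    """Parse the restricted (binary) Penn treebank notation into nested
    tuples: (label, left, right) for binary nodes, (label, word) for
    internal nodes whose single child is a word."""
    # tokens are '(', ')', or any maximal run of non-space, non-paren chars
    tokens = re.findall(r'\(|\)|[^\s()]+', penn)
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == '(', 'expected opening parenthesis'
        pos += 1
        label = tokens[pos]
        pos += 1
        if tokens[pos] == '(':            # binary: two subtrees follow
            left, right = node(), node()
            result = (label, left, right)
        else:                             # unary: the label's child is a word
            result = (label, tokens[pos])
            pos += 1
        assert tokens[pos] == ')', 'expected closing parenthesis'
        pos += 1
        return result

    return node()
```

For the example above, parse("(S (NP John) (VP (V ate) (NP (Det a) (N pizza))))") yields ('S', ('NP', 'John'), ('VP', ('V', 'ate'), ('NP', ('Det', 'a'), ('N', 'pizza')))).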
3.3 Write a graphical representation in “dot” language
The Graphviz dot language is a well-known tool for visualizing graphs. Implement the function tree2dot(), which outputs a representation of the tree in the dot language.
Follow the directions in the template for the details.
An example dot file representing the above tree is provided for reference.
Tip: the solution can be based on multiple ways to traverse the tree, but the memory representation of the tree also helps with converting it to a ‘dot file’ in a straightforward way.
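To illustrate the tip above: with the array representation of exercise 3.1, the parent of any occupied slot i > 0 is at (i - 1) // 2, so the dot output can be produced in a single pass over the array, with no traversal at all. This sketch assumes that representation and makes no claim about the exact output format the template expects (node naming and attributes here are illustrative choices).

```python
def tree2dot_sketch(tree):
    """Emit a Graphviz dot representation of an array-backed tree
    (a list with None in the empty slots, as in exercise 3.1)."""
    lines = ['digraph tree {']
    for i, label in enumerate(tree):
        if label is None:
            continue
        # use the array index as the dot node ID, the stored label
        # as the displayed label
        lines.append(f'  n{i} [label="{label}"];')
        if i > 0:  # every non-root node gets an edge from its parent
            lines.append(f'  n{(i - 1) // 2} -> n{i};')
    lines.append('}')
    return '\n'.join(lines)
```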
3.4 Calculate the PARSEVAL metrics
A common use for the tree representation we implement here is parsing, an important and well-studied task in computational linguistics.
The task of a parser is to generate a tree representation, like the ones exemplified above, for a given sentence. As with any automatic method, we need to evaluate the success of our parsers. The PARSEVAL metrics are a standard way to evaluate parser output. Given the correct, 'gold-standard' analysis (often created manually) and a parser output, we calculate two scores, precision and recall. The metrics are defined as:
                TP                    TP
precision = ---------    recall = ---------
             TP + FP               TP + FN
where
- true positives (TP) is the number of internal nodes of the parser output that have a matching node in the gold-standard tree
- false positives (FP) is the number of internal nodes of the parser output that do not match any node in the gold-standard tree
- false negatives (FN) is the number of internal nodes of the gold-standard tree that do not match any node in the parser output
Two nodes match if they have the same label, and their span (the words under the subtree rooted by the node) is the same.
For example, in the trees below, we have a false positive (FP) count of 2 (indicated in red in the figure). The first one (the higher-level NP) is a false positive, since there is no NP corresponding to the same span (the word sequence the pizza with a friend) in the gold standard. For the second NP, the label of the node does not match (it is a PP in the gold standard). Similarly, the false negative count is also 2 (the blue nodes in the gold-standard analysis), since the VP node spanning ate the pizza is not predicted, and the label of the PP is predicted wrongly. The true positives, the internal nodes with the same span and the same label in both trees, number 11 (including the root node). So both precision and recall are 11/13.
Gold standard tree in Penn treebank format:
(S (NP John)
(VP (VP (V ate)
(NP (Det the) (N pizza))
)
(PP (P with) (NP (Det a) (N friend)))
)
)
Predicted tree in Penn treebank format:
(S (NP John)
(VP (V ate)
(NP (NP (Det the) (N pizza))
(NP (P with) (NP (Det a) (N friend)))
)
)
)
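One way to compute the metrics is to reduce each tree to its set of labeled spans (label, start, end) and compare the sets. The sketch below works on the nested-tuple form of the trees rather than the array representation, and for simplicity compares plain sets (duplicate identical spans within one tree would be collapsed), so treat it as an illustration of the idea rather than the required implementation.

```python
def spans(tree, start=0):
    """Yield (label, start, end) for every internal node of a
    nested-tuple tree: (label, word) for pre-terminals,
    (label, left, right) for binary nodes. `end` is exclusive."""
    if len(tree) == 2:                       # pre-terminal covers one word
        yield (tree[0], start, start + 1)
        return
    label, left, right = tree
    left_spans = list(spans(left, start))
    left_end = max(end for _, _, end in left_spans)
    right_spans = list(spans(right, left_end))  # right subtree starts here
    end = max(e for _, _, e in right_spans)
    yield (label, start, end)
    yield from left_spans
    yield from right_spans

def parseval(gold, predicted):
    """PARSEVAL precision and recall over labeled spans."""
    g, p = set(spans(gold)), set(spans(predicted))
    tp = len(g & p)                # matching label and span
    return tp / len(p), tp / len(g)
```

Applied to the two trees above, this gives precision = recall = 11/13, matching the worked example.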
Wrapping up
Do not forget to
- indicate your name(s) in the file header
- commit all your changes
- push them to your GitHub repository
- tag the final commit as final (and do not forget to push the tag)