Assignment 3: Binary trees

Deadline: Jan 10, 2022 14:00 CET
Late deadline: Jan 14, 2022 14:00 CET

In this set of exercises you will practice working with binary trees. We will work with examples of binary phrase-structure, or constituency, trees (syntactic analyses of natural language sentences). However, binary trees are useful data structures for many other applications.

General information

The policy and general information about the assignments are provided on the course web page. Please also check the information provided in the introductory assignment (ungraded).

Exercises

3.1 Implement an array-based binary tree

Implement the functions in the template that provide utilities for manipulating binary trees represented as a Python list (used as an array). Please follow the standard procedure for implementing binary trees with arrays, as described in the lectures. Throughout the implementation, we use the index of the array slot where a node is stored as the ID of that node.

See the assignment template for a detailed description of each function.

For empty slots in the array (the locations corresponding to non-existing nodes), use None as a placeholder.
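
As a reminder of the standard array layout, the sketch below shows the index arithmetic involved. The function names are illustrative only; the actual names and signatures to implement are given in the template.

def parent_index(i):
    """Return the index of the parent of node i, or None for the root."""
    return (i - 1) // 2 if i > 0 else None

def left_index(i):
    """Return the index of the left child of node i."""
    return 2 * i + 1

def right_index(i):
    """Return the index of the right child of node i."""
    return 2 * i + 2

def get_node(tree, i):
    """Return the label stored at index i, or None if the slot is empty or out of range."""
    return tree[i] if 0 <= i < len(tree) else None

# Example: the root 'S' at index 0 has its children at indices 1 and 2.
tree = ['S', 'NP', 'VP', 'John', None, 'V', 'NP']
assert get_node(tree, left_index(0)) == 'NP'
assert parent_index(5) == 2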

3.2 Building a tree from a bracketed representation

A popular format for storing parse trees is the so-called Penn treebank notation. The nodes represent either a phrase label (like NP below) or a word, i.e. a terminal (like John below). The format encodes the hierarchical structure through nested parenthetical expressions. The label of an internal node is given just after the opening parenthesis. The leaf nodes, the words, are not placed inside their own parentheses, but written next to their immediate parent. Unary rules (internal nodes with a single child) are only allowed if the child is a leaf node (word). Hence, every expression inside a pair of parentheses is either two labels (a leaf node preceded by its parent, with no further parentheses) or a node label followed by its left and right subtrees (both subtrees surrounded by parentheses).

An example parse tree and its Penn treebank representation are provided below.

(S (NP John)
   (VP (V ate)
       (NP (Det a)
           (N pizza)
       )
   )
)

Note that we are only concerned with binary trees of the form above (the actual Penn treebank format allows n-ary trees for arbitrary n).
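
To illustrate the general idea, the sketch below reads such a bracketed string into nested Python tuples. It is only meant to show the recursive structure of the format; the template prescribes its own function names and expects the array-based representation from exercise 3.1.

def tokenize(s):
    """Split a bracketed string into '(', ')' and label/word tokens."""
    return s.replace('(', ' ( ').replace(')', ' ) ').split()

def parse(tokens):
    """Parse one parenthesized expression, returning nested (label, children...) tuples."""
    assert tokens.pop(0) == '('
    label = tokens.pop(0)
    children = []
    while tokens[0] != ')':
        if tokens[0] == '(':
            children.append(parse(tokens))   # internal node: recurse
        else:
            children.append(tokens.pop(0))   # leaf node: a word
    tokens.pop(0)                            # discard the closing ')'
    return (label, *children)

# Example:
# parse(tokenize("(S (NP John) (VP (V ate) (NP (Det a) (N pizza))))"))
# -> ('S', ('NP', 'John'), ('VP', ('V', 'ate'), ('NP', ('Det', 'a'), ('N', 'pizza'))))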

3.3 Write a graphical representation in “dot” language

Graphviz and its dot language are widely used tools for visualizing graphs. Implement the function tree2dot(), which outputs a representation of the tree in the dot language. Follow the directions in the template for the details.

An example dot file representing the above tree is provided for reference.

Tip: the solution can be based on any of several tree traversals, but the array-based memory representation of the tree also makes the conversion to a dot file straightforward.
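
One possible sketch is given below, assuming the array-based representation from exercise 3.1 (array indices as node IDs, None for empty slots). The exact signature and output details are specified in the template and the provided example dot file.

def tree2dot(tree):
    """Return a string with a Graphviz dot representation of the tree."""
    lines = ['digraph tree {']
    for i, label in enumerate(tree):
        if label is None:
            continue
        lines.append('  n{} [label="{}"];'.format(i, label))
        parent = (i - 1) // 2
        if i > 0 and tree[parent] is not None:
            lines.append('  n{} -> n{};'.format(parent, i))
    lines.append('}')
    return '\n'.join(lines)

# For the example tree, the output starts roughly like:
# digraph tree {
#   n0 [label="S"];
#   n1 [label="NP"];
#   n0 -> n1;
#   ...
# }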

3.4 Calculate the PARSEVAL metrics

A common use for the tree representation we implement here is parsing, an important and well-studied task in computational linguistics.

The task of a parser is to generate a tree representation, like the ones exemplified above, for a given sentence. As with any automatic method, we need to evaluate how successful our parsers are. The PARSEVAL metrics are a standard way to evaluate parser outputs. Given the correct ('gold-standard') analysis, often created manually, and a parser output, we calculate two scores: precision and recall. The metrics are defined as:

                TP                       TP
precision = ---------        recall = ---------
             TP + FP                   TP + FN

where TP (true positives) is the number of nodes in the parser output that match a node in the gold-standard tree, FP (false positives) is the number of nodes in the parser output without a match in the gold standard, and FN (false negatives) is the number of gold-standard nodes without a match in the parser output.

Two nodes match if they have the same label and the same span (the words covered by the subtree rooted at the node).

For example, in the trees below, the false positive (FP) count is 2 (indicated in red in the figure). The first one (the higher-level NP) is a false positive, since there is no NP covering the same span (the word sequence the pizza with a friend) in the gold standard. For the second NP, the label of the node does not match (it is a PP in the gold standard). Similarly, the false negative count is also 2 (the blue nodes in the gold-standard analysis), since the VP node spanning ate the pizza is not predicted, and the label of the PP is predicted wrongly. The true positive count, the number of internal nodes with the same span and the same label in both trees, is 11 (including the root node). So both precision and recall are 11/13.

Gold standard tree in Penn treebank format:

(S (NP John)
   (VP (VP (V ate) 
           (NP (Det the) (N pizza))
       )
       (PP (P with) (NP (Det a) (N friend)))
   )
)

Predicted tree in Penn treebank format:

(S (NP John)
   (VP (V ate) 
       (NP (NP (Det the) (N pizza)) 
           (NP (P with) (NP (Det a) (N friend)))
        )
    )
)
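
As an illustration of the metric itself, the sketch below computes precision and recall from (label, span) pairs collected from the two trees. Extracting those pairs from your tree representation is part of the exercise, and the function name here is illustrative, not the template's interface.

from collections import Counter

def parseval(gold_nodes, pred_nodes):
    """Return (precision, recall) given the (label, span) pairs of both trees."""
    gold = Counter(gold_nodes)
    pred = Counter(pred_nodes)
    tp = sum((gold & pred).values())   # nodes with matching label and span
    fp = sum(pred.values()) - tp       # predicted nodes with no match in the gold standard
    fn = sum(gold.values()) - tp       # gold-standard nodes missing from the prediction
    return tp / (tp + fp), tp / (tp + fn)

# For the example trees above, both values come out to 11/13.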

Wrapping up

Do not forget to