Assignment 4: Graphs: connectivity and cycle detection

Deadline: Jan 24, 2022 14:00 CET
Late deadline: Jan 28, 2022 14:00 CET

In this assignment, we will exercise with a few simple graph algorithms. Our case study is about dependency analyses of natural language sentences. In particular, we will work with dependency trees in CoNLL-U format.

A dependency tree, as used in (computational) linguistics, is a type of directed graph. In a dependency tree, the words are the nodes, and the relations between the words are represented with labeled directed edges. One of the nodes participating in a dependency relation is called the head and the other is called the dependent. Conventionally, the edge representing the relation is from the head to the dependent.

Since dependency trees are trees, they have to be connected and acyclic. Although the CoNLL-U format restricts the types of graphs that can be represented (e.g., all nodes, except the dummy ‘root’ node, have one and only one parent), the structure can still be wrong (due to errors during manual annotation or parsing). The general idea of the exercise is to check for connectivity and cycles, and repair them (in a naive way) if the analysis provided is not a tree.

Note that you will not be able to find examples of cyclic and/or disconnected examples in published treebanks like the ones in the Universal Dependencies repositories. Making up your malformed examples for testing is part of the exercise.

Exercises

4.1 Representing a dependency tree as an adjacency matrix

Read the given CoNLL-U file (see the template for details) and return a list of adjacency matrix representation of all sentences in the file.

For this set of exercises we will ignore the words and the label of the dependencies, and work with the indices of the words as node labels and treat the dependencies as unlabeled edges. This means we will only pay attention to column 1 (ID) and column 7 (HEAD) in the CoNLL-U file. The resulting adjacency matrix is a n by n matrix a value True in index [i,j] means that the node/word i is the head of the word j. Make sure to reserve the index 0 for the artificial root node. Use a numpy array with Boolean values for adjacency matrices.

As well as the usual comment lines, we ignore so-called “multiword tokens” (indexes that include a -), and “empty words” (indexes with a decimal point).

You are not allowed to use an external library for reading the CoNLL-U files.

An example

The CoNLL-U sentence

# sent_id = 1
# text = This is a test
1	This	this	PRON	DT	Number=Sing|PronType=Dem	4	nsubj	_	_
2	is	be	AUX	VBZ	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin	4	cop	_	_
3	a	a	DET	DT	Definite=Ind|PronType=Art	4	det	_	_
4	test	test	NOUN	NN	Number=Sing	0	root	_	SpaceAfter=No

should be represented with the following matrix

array([[False, False, False, False,  True],
       [False, False, False, False, False],
       [False, False, False, False, False],
       [False, False, False, False, False],
       [False,  True,  True,  True, False]])

Question

Is the adjacency matrix representation a good idea in terms of space complexity and the time complexity of the operations below?

(You do not need to answer this question, but answering it by yourself or discussing it with your assignment partner will help you understand the concepts better.)

4.2 Checking for connectivity

Implement the function is_connected() that returns True if all nodes are reachable from the root node. Note that this is a different definition of connectivity than the definitions of connectivity on general directed graphs.

4.3 Finding cycles

Implement the function find_cycle() that finds a cycle and returns the list of nodes that participate in a cycle if the graph is cyclic, an empty list otherwise.

Question

Can you have a cyclic, but connected graph represented in a CoNLL-U file?

4.4 Breaking the cycles

If the given graph is cyclic, break all cycles by attaching an arbitrarily chosen node in the cycle to the root node.

Wrapping up

Do not forget to