Assignment 4: Graphs: connectivity and cycle detection
Deadline: Jan 24, 2022 14:00 CET
Late deadline: Jan 28, 2022 14:00 CET
In this assignment, we will exercise with a few simple graph algorithms. Our case study is about dependency analyses of natural language sentences. In particular, we will work with dependency trees in CoNLL-U format.
A dependency tree, as used in (computational) linguistics, is a type of directed graph. In a dependency tree, the words are the nodes, and the relations between the words are represented with labeled directed edges. One of the nodes participating in a dependency relation is called the head and the other is called the dependent. Conventionally, the edge representing the relation is from the head to the dependent.
Since dependency trees are trees, they have to be connected and acyclic. Although the CoNLL-U format restricts the types of graphs that can be represented (e.g., all nodes, except the dummy ‘root’ node, have one and only one parent), the structure can still be wrong (due to errors during manual annotation or parsing). The general idea of the exercise is to check for connectivity and cycles, and repair them (in a naive way) if the analysis provided is not a tree.
Note that you will not be able to find examples of cyclic and/or disconnected examples in published treebanks like the ones in the Universal Dependencies repositories. Making up your malformed examples for testing is part of the exercise.
Exercises
4.1 Representing a dependency tree as an adjacency matrix
Read the given CoNLL-U file (see the template for details) and return a list of adjacency matrix representation of all sentences in the file.
For this set of exercises
we will ignore the words and the label of the dependencies,
and work with the indices of the words as node labels
and treat the dependencies as unlabeled edges.
This means we will only pay attention to column 1 (ID)
and column 7 (HEAD) in the CoNLL-U file.
The resulting adjacency matrix is a n by n matrix
a value True
in index [i,j]
means that the node/word i
is
the head of the word j
.
Make sure to reserve the index 0
for the artificial root node.
Use a numpy array with Boolean values for adjacency matrices.
As well as the usual comment lines,
we ignore so-called “multiword tokens”
(indexes that include a -
),
and “empty words” (indexes with a decimal point).
You are not allowed to use an external library for reading the CoNLL-U files.
An example
The CoNLL-U sentence
# sent_id = 1
# text = This is a test
1 This this PRON DT Number=Sing|PronType=Dem 4 nsubj _ _
2 is be AUX VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 4 cop _ _
3 a a DET DT Definite=Ind|PronType=Art 4 det _ _
4 test test NOUN NN Number=Sing 0 root _ SpaceAfter=No
should be represented with the following matrix
array([[False, False, False, False, True],
[False, False, False, False, False],
[False, False, False, False, False],
[False, False, False, False, False],
[False, True, True, True, False]])
Question
Is the adjacency matrix representation a good idea in terms of space complexity and the time complexity of the operations below?
(You do not need to answer this question, but answering it by yourself or discussing it with your assignment partner will help you understand the concepts better.)
4.2 Checking for connectivity
Implement the function is_connected()
that returns True
if all nodes are reachable from the root node.
Note that this is a different definition of connectivity
than the definitions of connectivity on general directed graphs.
4.3 Finding cycles
Implement the function find_cycle()
that finds a cycle
and returns the list of nodes that participate in a cycle
if the graph is cyclic, an empty list otherwise.
Question
Can you have a cyclic, but connected graph represented in a CoNLL-U file?
4.4 Breaking the cycles
If the given graph is cyclic, break all cycles by attaching an arbitrarily chosen node in the cycle to the root node.
Wrapping up
Do not forget to
- indicate your name(s) in the file header
- commit all your changes
- push it to your GitHub repository
- tag it as
final
(and do not forget to push the tag)