Assignment 1: Searching in a dictionary

Deadline: Nov 22, 2021 14:00 CET
Late deadline: Nov 26, 2021 14:00 CET

In this assignment, you are asked to implement a(n approximate) search in a dictionary. The objectives of the assignment are to refresh your Python knowledge, and familiarize yourself with the setup we will use throughout this course by implementing a few simple algorithms.

Note that all instances of the word “dictionary” below refer to a dictionary in the sense of words and their definitions. We use dict to refer to the Python data structure.

General information

The policy and general information about the assignments are provided at the course web page. Please also check the information provided in the introductory assignment (ungraded).

Exercises

1.1 Read and process the data

The data format we use in this exercise is a text file containing a dictionary item on each line, where the word and the associated data (e.g., its description) is separated by a single tab character. An example data file is provided, where each record contains a number as the data associated with each word (a placeholder for the definition).

After reading the dictionary records to a list of tuples (word, data), check if the word list is sorted according to lexicographic order, and sort the list only if the list is not already sorted. Return the sorted dictionary records.

Throughout all exercises, assume that the data file does not include any duplicate keys.

You should implement this exercise in the function read_data() provided in the template.

1.2 Search a word in a sorted word list

Implement the function search_word() in the template, that searches a word in a list returned by read_data() function, and returns the index of the record if the word is in the list. If the word is not in the list, the function should return the index of the previous record that would be immediately before the word if it was in the dictionary. For example, if words compute and computers placed consecutively in the word list, and we are looking for the unknown word computer, the function should return the index of compute. If the unknown word would be placed as the first item in the sorted dictionary, the function should return -1.

The worst-case complexity of your search should be better than O(n).

1.3 Lookup and display a word

Implement the function lookup_word() in the template, which searches for the given word using search_word() implemented above, and returns prints the word and the associated data (description) if the word is in the dictionary, and otherwise it prints the k words that would be placed just before the word, and k words that would be placed just after the word if the word was in the dictionary.

Using the example above, if we lookup the unknown word computer with k=1, the output should look like:

compute: data associated with compute
computer (not found)
computers: data associated with computers

1.4 Insert a new item to the dictionary

Implement the function insert() in the template that inserts a (word, description) pair to the correct location in the given sorted list of dictionary records.

You are not allowed to use the built insert() method of the Python lists, for this exercise.

1.4 Soundex algorithm

Not correctly spelling the words that one wants to lookup in a dictionary is a common problem. The Soundex algorithm is designed to allow finding words that “sounds like” a given word. The algorithm clusters letters into the classes:

There are a number of alternative implementations. For this exercise we will use the following variant:

  1. The first letter of the code is the first letter of the word
  2. Replace the letters after the first letter with their numeric codes given above
  3. Replace repeated consecutive codes with a single code. Also make sure that repeated letters at the beginning of the word are also contracted to a single letter or code. Note that repeated codes with a class 0 letter between them are kept. Only the instances of repeated letters that immediately follow each other are contracted to a single code
  4. Remove all 0’s from the code
  5. Pad the code with 0’s if it is shorter than 4 characters, or truncate it to 4 characters if it is longer

Delete any character (e.g., numbers, punctuation, uppercase letters) that is not one of the characters listed above. However, treat them as class 0 when removing duplicates. For example ac-cd should be coded as a223 (not a230).

Implement the above algorithm in function code_soundex() in the template, which returns the Soundex code for a given word.

1.5 Approximate lookup with Soundex

We will implement an approximate lookup function in this exercise. As a preparatory step, implement the function build_sdict() function that creates and returns a dict for the given list of dictionary items (a list of word-definition tuples), The keys of the dict is the Soundex codes, and the values should be list of words that share the same Soundex code.

Implement the function lookup_soundex() that takes a word, a list of dictionary items, and a dict returned from build_sdict(), and prints out the results of lookup similar to lookup_word() defined above, except it tries to print all words with the same Soundx code if there are words with the same Soundex code in the dictionary.

computer (not found)
compute: data associated with compute
computers: data associated with computers
computing: data associated with computers
... possibly others with the same Soundex code ...

If the word is not in the dictionary, and and if it does not “sound like” any word in the dictionary, the function should declare that the word is not found.

Wrapping up

Do not forget to