Read chapter 12.
Default function arguments can specify values for function arguments that usually have the same value. For example, the function int foo(int x, int y = 3) can be called using either foo(4,5) or foo(4). In the former case, the parameter values are x = 4 and y = 5. In the latter case, y's parameter value is 3. See also the textbook pp. 60-62.
You and possibly a CS1321 classmate are to implement an open-chained hash table and incorporate it into a program to find duplicate WWW pages. You may solve the problem in any way except that you may not use any code available in written or electronic form.
A hash table is a data structure supporting insertion, removal, and querying of elements in expected constant time by using a hash function. A hash function converts an element into a number specifying where the element should be stored. You are to implement an open-chained hash table storing URL objects, which conceptually are names of WWW pages. The algorithms and hash function to use were presented during lecture.
(The hash table described below has been designed to support finding URLs with the same contents. Other hash tables, e.g., the STL hash table, are similarly designed but details may vary. For example, most hash tables store key-value pairs.)
The hashTable class should support the operations listed in Table 1. These operations are similar to but not identical to these provided by the STL hash tables.
All hashTable operations use == or != to compare keys.
The insert function ensures that the given key is in the hash table. If a matching key, i.e., one equivalent using ==, is already in the hash table, the given key is not inserted. Thus, all keys in the hash table are always unique. insert returns a pair in which
As discussed in lecture, our hash function converts a string, e.g., a URL's contents, to a position in the hash table's main vector. There exist many different hash functions, but we describe one that has proved to work well in practice.
Conceptually, our hash function
In practice, converting the string into a base-256 number yields a very large number that can overflow the largest integer that most computers can store. Instead, we combine the first two steps. After converting each string's character into its ASCII code, multiply by the golden ratio. After each multiplication or addition drop the integral portion of the result. This will ensure that the numbers will always have reasonable size.
A good hash function spreads out a hash table's contents, but if any significant number of table positions have too many entries, using the hash table will take too much time. Thus, if the hash table becomes too full, we should enlarge the table. If twice the number of items in the hash table exceeds the size of the hash table's main vector, double the vector's size. If the number of items is less than one-eighth of the main vector's size, halve its size.
When resizing the hash table, every URL in the hash table must be rehashed, since the hash function's values depend on the table's size. One implementation strategy is to insert the URLs into a newly created vector. Another strategy is to always have two vectors in the hash table object, plus a variable indicating which vector currently contains the URLs. To resize, one can then clear the other vector, resize it to the new size, insert into it all the URLs from the vector currently containing the URLs, and mark it as the vector currently containing the URLs.
Hash tables are commonly used as dictionaries, i.e., data structures supporting insertion, querying, and removal. They are also useful in determining duplication. For example, one problem faced by WWW search engines is identification of mirrored WWW pages, i.e., duplicates of other WWW pages, because it is not desirable to present the user with a long list of duplicate WWW pages. Directly comparing all pairs of WWW pages takes too long. Instead, WWW pages can be hashed and pages compared only if they hash to the same location. This reduces the running time from O(n2) to O(n), where nis the number of WWW pages.
findDuplicates.cc uses a hashTable containing URLs to identify duplicates. Given the name of a file containing a list of URLs as its one command-line argument, the program identifies all duplicate WWW pages. For example, if three URLS ``foo,'' ``bar,'' and ``baz'' have the same contents, two matches will be printed.
A URL (uniform resource locator) object stores a name. Conceptually, this is the name of a WWW page, FTP file, etc. Because Trinity's Internet connection is often slow, we just use filenames rather than WWW addresses.
Two URLs are equal (==) if and only if their files have exactly the same contents. Broken links, i.e., URLs without a file in the local directory, are not equal to any other URL. The hash table should hash the URL's contents, not its name.
As far as Jeffrey D. Oldham can tell, WWW search engines use technology similar to this homework to eliminate duplicates. The scheme, however, is fooled by ``almost mirrors'' which differ in small ways, e.g., hypertext links or use counters. Recent research shows that careful choice of hash functions can help automatically group related WWW pages. Other uses remain undiscovered.
Using STL containers and functions may ease implementation. We suggest using a vector of vectors of strings. Let the main vector be the vector of vectors.
Useful vector functions include:
The number of distinct characters is UCHAR_MAX, which is defined in limits.h.
A possible order for implementation is
The cmp UNIX command compares two specified filenames, printing whether the files differ or not. For more information, see the ``diff'' info pages accessible via ``Control-h i'' inside Emacs.
If you find that using URLs is confusing, try implementing and testing hashTable storing another type, e.g., key_type. Then revise to use URLs.
We have provided code to find duplicate WWW pages (findDuplicates.cc), a URL class (URL.h), and a minimal skeleton for your hashtable class hashTable.h). Add code to hashTable.h to implement a hash table storing URLs. This Makefile may be useful.
As for previous homeworks, working with other people during your planning phase is encouraged. For this homework, you are permitted to write the code, i.e., program, with one other person in CS1321. To learn the material, both of you should be actively involved in the programming. Failure to do so will almost certainly hurt your comprehension of the material in the rest of the course.
Each one- or two-person team of programmers should submit only its completed implementation, consisting of the file hashTable.h. You do not need to send any other files. Please send only text documents, do not send Microsoft Word documents, PDF documents, HTML documents, etc. Please include both your names and email addresses at the top of your program.
We will test your program using our own input. Please be sure your code compiles without warning when using g++ -Wall -pedantic.
See the guidelines for programming assignments for instructions on how to e-mail the programs. For this assignment use a subject line of ``cs1321 homework 9''. If a team of two are in different sections, submit exactly once to one of the two permissible email addresses.