Here is one way to approach a homework assignment that has more pages of explanation than lines of code.
Remember that discussions with classmates are allowed, up to the point at which you begin writing code.
To create a pair holding a double and an int, use
pair<double,int>()
To specify the pair's contents upon creation, use
pair<double,int>(3.4,12)
If you do not want to explicitly write the pair's types, you can use
make_pair(3.4,12)
and the compiler will take its best guess about the pair's types. To access a pair p's first item, use p.first; to access its second item, use p.second.
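As a quick illustration (a minimal sketch, not part of the assignment's code), the following program creates pairs both ways and reads back their items:

#include <iostream>
#include <utility>
using namespace std;

int main()
{
    pair<double,int> p(3.4, 12);              // types and contents given explicitly
    pair<double,int> q = make_pair(5.6, 7);   // the compiler deduces the types
    cout << p.first << " " << p.second << endl;   // prints 3.4 12
    cout << p.first + q.first << endl;            // prints 9
    return 0;
}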
To create a vector holding elements of type T, use
vector<T> v
To create a vector of size n, use
vector<T> v(n)
To change a vector so it can hold n elements, use
v.resize(n);
After doing this, accessing positions 0, 1, ..., n - 1 is legal. If the vector has no elements,
v.empty()
yields true. If
v.size()
yields s, then positions 0, 1, ..., s - 1 may be accessed. To access element i, use
v[i]
An alternative syntax for accessing the element is v.at(i). This has the nice feature that, if the vector does not have a position i, it kills the program rather than just returning a garbage value.
v.push_back(item)
enlarges the vector by one position, inserting item into the last position.
v.pop_back()
shrinks the vector by one position, eliminating the item in the last position. (Note that pop_back takes no argument.)
v.begin() and v.end() yield iterators for the vector's beginning and one past the end.
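The following minimal sketch exercises the vector operations listed above (the values are arbitrary):

#include <iostream>
#include <vector>
using namespace std;

int main()
{
    vector<double> v(3);          // three elements, positions 0, 1, and 2
    v[0] = 1.5;
    v[1] = 2.5;
    v.push_back(4.0);             // now v holds four elements
    v.resize(6);                  // now positions 0 through 5 are legal
    cout << v.size() << endl;     // prints 6
    cout << v.at(1) << endl;      // checked access: prints 2.5
    v.pop_back();                 // back to five elements
    for (vector<double>::iterator it = v.begin(); it != v.end(); ++it)
        cout << *it << " ";       // visits every element, beginning to end
    cout << endl;
    return 0;
}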
code | meaning |
string s; | creates a string with no characters |
string t("hello"); | creates a string containing ``hello'' |
string t = "hello"; | another way of doing the same thing |
s = t; | makes s equal ``hello'' |
cout << s[1]; | prints the letter `e' |
s = "good"; | changes s's contents |
s = s + "bye"; | changes s to ``goodbye'' |
cout << s + t; | prints ``goodbyehello'' |
s.empty(); | yields false because s has characters |
t.size(); | yields 5 because it has five characters |
t.push_back('s'); | appends the character `s', making t ``hellos'' |
t.clear(); | shrinks t to the empty string |
t.c_str(); | converts t to a C-string. |
When using the .c_str() function to convert a string to a C-style string, be sure to use the result immediately; it may ``magically'' disappear by the time the next statement is executed. This function is seldom needed, but it is useful with the .open(filename) function for istreams and ostreams, which accepts only a const char [] (a C-style string), not a string.
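For example (a minimal sketch; the filename is hypothetical), the conversion happens inside the call to open, so the C-string is used immediately:

#include <fstream>
#include <iostream>
#include <string>
using namespace std;

int main()
{
    string filename = "documents.db";   // hypothetical filename
    ifstream in;
    in.open(filename.c_str());          // open wants a const char [], not a string
    if (!in)
        cout << "could not open " << filename << endl;
    return 0;
}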
The .size() member function yields the number of elements in the hash table. Be sure to #include <hash_map> near the top of your file.
The problem is to finish writing a simple Web search engine. Before describing what code needs to be written, we present the model and the algorithmic ideas.
We call two Web documents similar if they contain many of the same words used with similar frequency. Each Web document is modeled using a very long vector, with each vector component representing the occurrence frequency of a particular word in the document. For example, if the word ``molasses'' occurs twice as frequently as the word ``jam,'' the molasses component will be twice as large as the jam component.
Technically, two Web documents are similar if the angle between their vectors is small. To understand what this means, first consider the dot product of two vectors, which is the sum of the pairwise products of their components. For example, (3, 4, 5) . (6, 7, 8) = 3*6 + 4*7 + 5*8 = 86. The dot product is large if the two documents have many of the same words. For the computation, we actually use the relative frequency of words within a document, e.g., ``molasses'' forms 20% of the document's words while ``jam'' forms 10%. Using the dot product formula A . B = |A| |B| cos(theta), where theta is the angle between A and B, we see that the angle is small exactly when (A/|A|) . (B/|B|) is large (close to 1).
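The following is a minimal sketch of the dot product and of the normalization that turns the dot product into the cosine of the angle (the function names are illustrative, not the assignment's):

#include <cmath>
#include <iostream>
#include <vector>
using namespace std;

double dotProduct(const vector<double>& a, const vector<double>& b)
{
    double sum = 0.0;
    for (vector<double>::size_type i = 0; i < a.size(); ++i)
        sum += a[i] * b[i];                   // sum of pairwise products
    return sum;
}

void normalize(vector<double>& a)
{
    double length = sqrt(dotProduct(a, a));   // |A| = sqrt(A . A)
    if (length > 0.0)
        for (vector<double>::size_type i = 0; i < a.size(); ++i)
            a[i] /= length;                   // scale by 1/|A|
}

int main()
{
    double x[] = {3, 4, 5};
    double y[] = {6, 7, 8};
    vector<double> a(x, x + 3), b(y, y + 3);
    cout << dotProduct(a, b) << endl;   // prints 86
    normalize(a);
    normalize(b);
    cout << dotProduct(a, b) << endl;   // the cosine of the angle between a and b
    return 0;
}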
The two parts of a search engine are preprocessing the documents and answering search queries.
Preprocessing the documents requires collecting the documents, extracting the documents' words for use as vector components, and then computing each document's vector. For this homework, we just used the wget command to snarf a collection of documents. We then collected all of the documents' words into a hash table and finally converted each document into a vector.
To convert a document into a vector, for every word we read from the document, we increment that word's component in the vector. To determine the component number, we ask the hash table for the word's component. For example, if the hash table is called ht, we can look up ``hello'' using ht.find("hello"); the returned iterator points to a (word, component number) pair, so the component (an integer) is its second item. We then normalize the vector A by scaling by the reciprocal of its length |A| = sqrt(A . A), the square root of the sum of the squares of its components. That is, we multiply every component of A by 1/|A|.
For example, if a document contains only the words ``bonjour'' (three times) and ``hello'' (four times) and the components of ``bonjour'' and ``hello'' are 12 and 20, respectively, then the unnormalized document vector will have a 3 in component 12, a 4 in component 20, and zeroes everywhere else. (The number of vector components is determined by the number of words in the hash table.) The normalized vector then has 0.6 in component 12, 0.8 in component 20, and zeroes everywhere else.
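Here is a minimal sketch of that conversion. The names are illustrative, and std::map stands in for the assignment's hash table so the sketch is self-contained with standard headers; the lookup (find and ->second) is used the same way:

#include <cmath>
#include <map>
#include <string>
#include <vector>
using namespace std;

typedef map<string, vector<double>::size_type> WordIndex;   // word -> component number

vector<double> makeDocumentVector(const vector<string>& words, const WordIndex& index)
{
    vector<double> v(index.size(), 0.0);              // one component per word in the table
    for (vector<string>::size_type i = 0; i < words.size(); ++i) {
        WordIndex::const_iterator it = index.find(words[i]);
        if (it != index.end())
            v[it->second] += 1.0;                     // increment the word's component
    }
    double length = 0.0;                              // normalize: scale by 1/|A|
    for (vector<double>::size_type i = 0; i < v.size(); ++i)
        length += v[i] * v[i];
    length = sqrt(length);
    if (length > 0.0)
        for (vector<double>::size_type i = 0; i < v.size(); ++i)
            v[i] /= length;
    return v;
}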
Given a list L of search words, we wish to determine the closest Web documents. To do so, we first construct a search vector from L: for each word w in L, we use the hash table to increase w's component by one, and search words not in the hash table are simply ignored. Then we normalize the search vector by scaling by the reciprocal of its length. A document is similar to the query if its dot product with the search vector is large.
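A minimal sketch of the ranking step follows. The type and function names are illustrative (the assignment's actual types come from types.h); documents is assumed to hold (document name, normalized vector) pairs:

#include <algorithm>
#include <iostream>
#include <string>
#include <utility>
#include <vector>
using namespace std;

typedef vector<pair<string, vector<double> > > DocumentList;   // illustrative name

double dotProduct(const vector<double>& a, const vector<double>& b)
{
    double sum = 0.0;
    for (vector<double>::size_type i = 0; i < a.size(); ++i)
        sum += a[i] * b[i];
    return sum;
}

void printClosest(const vector<double>& searchVector, const DocumentList& documents)
{
    // Pair each document's similarity score with its name, sort ascending,
    // then print from the back so the most similar documents come first.
    vector<pair<double, string> > scores;
    for (DocumentList::size_type i = 0; i < documents.size(); ++i)
        scores.push_back(make_pair(dotProduct(searchVector, documents[i].second),
                                   documents[i].first));
    sort(scores.begin(), scores.end());
    for (vector<pair<double, string> >::size_type i = scores.size(); i > 0; --i)
        cout << scores[i-1].second << "  " << scores[i-1].first << endl;
}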
(Aside: Although I had not previously considered this, I suppose a user could type the same search word repeatedly. For example, searching for ``hello hello hello hello hello goodbye goodbye'' specifies that ``hello'' is 2.5 times more important than ``goodbye.'' Is the ability to weight search words this way a useful feature? Do any current search engines provide it?)
We will provide software for preprocessing sets of documents, the results of the preprocessing, and a few sets of example documents. Your job is to finish writing the code that queries the user for search words, determines which documents are most similar, and prints the results.
For each set of documents, we will provide a file containing the set's words and, for each document, its vector components. We will also provide code that reads in this file, storing its contents in a hash table and in a vector of document-vector pairs, with one entry per document.
Although you probably do not need to know how to preprocess the documents to complete the assignment, we describe it here for completeness and so you can process your own set of documents if you desire. The prepareDatabase program takes one or two command-line arguments:
Your job is to finish the search engine code that queries the user for search words, computes how close each document is to the search vector, and prints the closest documents. More specifically, this code is supposed to:
Using STL containers can easily lead to very long names for types. For example, hash_map <const string, vector<double>::size_type> is the type of the hash table translating words to vector components. Instead of typing this forty-eight character type name, we say
typedef hash_map<const string, vector<double>::size_type> KeywordMapping;
This creates a new, equivalent type named KeywordMapping. Thus, declaring variables with a type of KeywordMapping is the same as using the forty-eight character type.
(The syntax of the type definition statement typedef is
typedef type new-synonym;)
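For instance, here is a short sketch with another long type name used in this assignment; Component is an illustrative synonym, not necessarily one defined in types.h:

#include <vector>
using namespace std;

typedef vector<double>::size_type Component;   // the new synonym

Component c = 12;                              // same as vector<double>::size_type c = 12;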
types.h contains several type definitions:
Your job is to finish writing the search engine code in search-engine.cc. You will also need the type declaration file types.h. Be sure it is called ``types.h'' and is in the current directory. To compile, use a command similar to g++ -Wall -pedantic search-engine.cc -o search-engine.
If you want to process your own set of documents, download prepareDatabase.cc and types.h. Compile using a command similar to g++ -Wall -pedantic prepareDatabase.cc -o prepareDatabase.
To ease compilation, use the Makefile. To create an executable called search-engine, use
make search-engine
You will need data to test your code; we have provided three sets of documents and one program that generates random sets.
Each set's database file for use with search-engine has a ``.db'' suffix. Compressed archive files ending with a ``.tgz'' suffix are provided in case you want to copy a set of documents to your own computer. To extract the files, use the command tar xzvf filename.
Please do not copy the large set of documents to your home directory on the CS department's computers. If any significant fraction of students do so, the department's disk will quickly fill up. Instead copy just the database file to your home directory. If you really do want all the files, please store them in the directory called /tmp so that (1) they will not fill the computer science disk and (2) they will automatically be erased when the machine is rebooted.
Our search engine uses one approach to solve the most important and most difficult task performed by search engines: determining which Web documents are closely related to each other. Our code, however, uses a very simple ranking scheme, similar to what Altavista probably used to use. More complicated ranking schemes can yield more usable results, such as those returned by Google. Also, our preprocessor makes only a very limited attempt to filter out uninteresting words: it omits all words of three or fewer characters, but it does not stem suffixes from words such as ``played'' and ``playing'' so that they would match. It makes a heuristic attempt to remove punctuation and to ignore case. We also do not filter unacceptable Web documents from the document pool.
Commercial search engines must accept more complicated input syntax (usually including boolean operators), have at least 99.9% uptime, handle large numbers of simultaneous queries, and deal with network issues.
We will test your code on our sample data. Please be sure it compiles without warnings when using g++ -Wall -pedantic.
See the guidelines for programming assignments for instructions on how to e-mail the programs. For this assignment use a subject line of ``cs1321 homework 3''.