For this assignment you are going to do some code using the graph algorithms that have been discussed in class. As with all the assignments, you can write this in whatever language you want, but I have to be able to compile and run it under Linux on the lab machines. What you turn in will be graded for correctness and speed. It might also be used in a future assignment so keep that in mind and try to build flexible and robust data structures. You will use this data structure to solve the following problem. You might also use some other data structures to go along with it, but you don't have to write those if you can find a good library for them.
This problem is actually taken from work that Kevin Livingstone is doing in Biology. You will be playing with graphs that represent relationships between genes in a number of different species. You will be given a data file that includes the specifications of the species and what genes in other species they are genetically related to. It will be given as a number telling you how closely related they are. I want you to find groupings of genes that are closely related to one another (strongly connected components). Then for each each grouping I want you to return a metric of how closely bound that group is (total length on a minimum spanning tree assuming edges aren't directed).
The interface for this program will be all text. You will read from standard input and write to standard output. Unlike earlier assignments, the standard input will simply be the data for the gene relations. You will also be provided with a command line argument that is a single number giving the threshold similarity that you should consider. A second command line argument is the minimum size for strongly connected components to consider.
The input will have one gene per line. The first word on the line will be the gene in question. It will be followed by a arbitrary number of pairs where each pair includes a gene code and a similarity score. Technically, the similarity value is called an expectation value. It tells you how many random matches one would expect with the same similarity. Smaller values means things are more closely related. A value of zero says the similarity would not happen randomly. This is basically an adjacency list format where the gene codes specify vertices and the expectation values are the weights on edges. When you do the processing you only consider edges with an expectation value at or below the value specified on the command line.
Your output should consist of the following. Begin with one line listing the number of components that meet the size requirement. Follow that with one line for each strongly connected component with a size at or above the given value. That line will include the size of the component in genes followed by the sum of the weights on the minimum spanning tree for that component followed by the gene specifications for each of the genes in that component in alphabetical order. The components should be output in the following order. Sort first from largest component to smallest component. Break ties in size by the sum of the weights in the minimum spanning tree. Break ties on that by alphabetical order of all the gene IDs in the component.
So a small input might look like the following.
G1 G2 1e-5 G3 1e-3
G2
G1 1e-4
G3 G2 2e-5 G1 0
If we process this with a threshold (first command line argument) no smaller than 1e-4 then this has one strongly connected component including all the genes. Assuming our second argument is 3 or smaller we would get the following output.
1
3 1e-5 G1 G2 G3
The component has 3 elements and we included the edges from G3 to G1 and from G1 to G2 giving a sum of 1e-5. Note that technically a minimum spanning tree only makes sense on an undirected graph so for this step we view edges simply as connections without direction.
Were we to run this with the arguments 1e-7 2 we would get 0 as the only output. The 1e-7 would throw out all connections other than the one from G3 to G1 and we can't have a strongly connected component with 2 vertices or more with only a single edge.