CS200: Problem Set 7: Building Web Communities

University of Virginia, Department of Computer Science
CS200: Computer Science, Spring 2004

Problem Set 7: Building Web Communities — Comments

Question 1: For each question, provide the SQL command that performs the requested action and run your command on your database. Note that the commands modify the state of the database, so you need to do them in order.
Insert a user into your users table with user name alyssa, email aph@cs.virginia.edu, last name Hacker and firstnames Alyssa P..
 INSERT INTO users (user, email, lastname, firstnames)
 VALUES ('alyssa', 'aph@cs.virginia.edu', 'Hacker', 'Alyssa P.')
Insert a user into your users table with user name ben, email bb@cs.virginia.edu, last name Bitdiddle and firstnames Ben.
 INSERT INTO users (user, email, lastname, firstnames)
 VALUES ('ben', 'bb@cs.virginia.edu', 'Bitdiddle', 'Ben')
Select the lastname of all users in your table.
 SELECT lastname FROM users
Select the lastname and firstnames of all users in your table in alphabetical order by firstnames.
 SELECT firstnames, lastname FROM users ORDER BY firstnames
Select the email address of all users in your table with lastname matching Hacker.
 SELECT email FROM users WHERE lastname='Hacker'
Delete all entries from your table whose id does not equal the id for your entry.
 DELETE FROM users WHERE id <> 1 LIMIT 2
Question 2: Insert the code needed to add the new entry to the links table in invite-process.php (it should go where Question 2 is marked, and require no more than two statements).
Here is the code we inserted:
	$newidno = mysql_result ($newid, 0, 0);
	executeQuery ("INSERT INTO links (fromid, toid)
 VALUES ('$uid', '$newidno')");
Question 3: Insert the code needed to display the community members as links to their personal pages in community-members.php (it should go where Question 3 is marked).
 for ($entry = 0; $entry < mysql_num_rows ($result); $entry++) {
 print "<a href=\"personal-page.php?id=" .
	mysql_result ($result, $entry, 2) . "\">" .
	mysql_result ($result, $entry, 0) . " " .
	mysql_result ($result, $entry, 1) . "</a> ";
 }
Question 4: How much work is your community-members.php code? (Express your answer using Θ notation, but make sure to clearly state your assumptions and what all variables you use mean.)
Our community-members.php code is &Theta(n) where n is the number of members in our community. For this to be true, we need the following assumptions:

mysql_result is O(1) — each call to mysql_result is constant time and doesn't depend on the size of the result. This is a reasonable assumption since PHP has support for arrays, and any element of an array can be accessed in O(1) work. Note that this would not be a reasonable assumption if we were using Scheme lists instead, since accessing elements of a list is Θ(m) where m is the length of the list.
Executing a SQL SELECT query if O(n) where n is the size of the table we are selecting on. The community-members.php code makes several SELECT queries on the users table before entering the loop. If any of these is worse than Θ(n), then the total work would need to be worse than Θ(n). In fact, SQL SELECT queries may be faster than O(n) because of indexing. It is probably reasonable to assume that a SELECT query on an indexed field is Θ(r) where r is the number of rows in the result of the query. So, a query that finds a single entry is Θ(1).
print is O(1) — this is an unreasonable assumption, since time to print scales linearly with the length of the printed string. But, since we assume the names in the database are roughly the same length, we can assume all the prints are approximately the same length, and printing work is constant.
With these assumptions, each iteration of the for loop is constant work, so the work scales with the number of times through the for loop. The for loop executes mysql_num_rows ($result) times, which is the number of entries in the user table that are not null. Since we expect a constant fraction of the entries will be null, this scales with the number of entries in the user table. Hence, we can say that executing community-members.php is Θ(n) where n is the number of community members (entries in the user table).
Question 5: Assuming the main constraint on the size of the community is that evaluating find-links-process.php for any two members of the community cannot take more than 60 seconds, how large a community can our current implementation support?
To answer this question, you will need to make lots of assumptions. Make all the assumptions you need, but the more grounded your answer is in reality, the better. If you succeed in getting a lot of people to join your community, you may be able to do some useful experiments also (but please try not to crash UVa's web server!)
Discussion:
This is a really tough question to answer well. It requires combining a theoretical analysis of the complexity of find-links-process.php with some measurements to estimate the performance for low values of n. We need the analysis to know how the time taken scales with n, but we need the concrete measurements to know what the constants are.
So, we need to figure out how the work required by find-links-process.php scales with the size of the community. The easy was was to follow the link to Dijkstra's algorithm on the problem set to find that Dijkstra's algorithm is Θ(m + n log n) where n is the number of verticies (that is, the number of community members), and m is the total number of edges (that is, the total number of links in the community where inviter-invitee links count as 2 since they are bidirectional and acquainted links count as 1 since they are one directional). Unfortunately, that result assumes a smarter implementation of Djikstra's algorithm than the provided code, and would not be correct for the actual find-links-process code we provided. (Later in the discussion it will explain why.)
Here, we dervive the complexity of Dijkstra's algorithm by analyzing the provided code. We will use the same assumptions about MySQL operations as from Question 4.
The initialization loop is Θ(n):
 $result = executeQuery ("SELECT id FROM users");
 $longest = mysql_num_rows ($result) + 1; // longer than longest path

 // Initialize $nodes to an array of the community members.
 // Initialize the distance to all nodes as the maximum value.
 // Initialize the previous node for every node as -1 (no known path).
 for ($i = 0; $i < mysql_num_rows ($result); $i++) {
 $node = mysql_result ($result, $i, 0);
 $nodes[$i] = $node;
 $dist[$node] = $longest;
 $previous[$node] = -1;
 }
The body of the loop is constant time (recall our assumption about mysql_result being O(1)), and the number of iterations is the number of elements in the users table, n.
The main loop is:
 while (count ($Q) > 0) { // more elements in $Q
 // find the node in $Q with the minimum distance
 $unode = findmin ($Q);
 if ($unode == $toid) {
 break; // Can stop working if we've reached the toid
 }

 // Remove the minimum node
 $u = array_search ($unode, $Q);
 $Q[$u] = false;
 $Q = array_filter ($Q); // removes false elements

 // Find all nodes linked to u, and see if there are better paths
 $invitees = executeQuery ("SELECT toid FROM links WHERE fromid='$unode'");
 relaxNeighbors ($unode, $invitees);
 $inviters = executeQuery ("SELECT fromid FROM links WHERE toid='$unode'");
 relaxNeighbors ($unode, $inviters);
 }
Each iteration involves a call to findmin, which looks through the array passed in to find the mininum distance. It must look at every element of the array, so it is Θ(p) where p is the number of elements in the array. The array passed in is $Q which is the set of nodes to work on, so its average size is n/2 (it starts as size n, and we remove one node each iteration until it is empty). Hence, the call to findmin is Θ(n).
The array_search and array_filter calls should be O(n). (Since we already know findmin is Θ(n), we do not need to worry about tight bounds for any operation that is no more than Θ(n) work; hence, it is enough to know they are O(n) without worrying about whether or not they are Θ(n). In fact, both calls are Θ(n) also.)
The calls to relaxNeighbors are Θ(p) where p is the number of elements in the passed array. Its hard to estimate what the actual number of elements is each call without knowing more about the community structure. But, we know that all links in the community will be considered exactly once through the whole process. The relaxNeighbors will be called for every possible $unode (that is, every commmunity member) at most once (it might be called less than onces for some members, if the destination node is found before they are considered). So, the total work of all to calls to relaxNeighbors is Θ(m) where m is the total number of links in the community. This is the total for the entire loop, so we should add it to the rest of the work for the loop, but not include it in the work per iteration.
Thus, each iteration of the loop (not counting the relaxNeighbors work) is Θ(n) work where n is the number of community members. To know the total work, we need to know how many iterations there are. Each iteration removes one element from $Q, and we keep going until $Q is empty. The initial size of $Q is n, so there are n interations before $Q is empty.
Hence, the total work of the loop not including relaxNeighbors is Θ(n²). Including the Θ(m) work for relaxNeighbors, we have Θ(m + n²).
So, why was the Θ(m + n log n) result found in the Wikipedia wrong? Their result is correct, but it assumes a smarter implementation of findmin. If we keep $Q ordered (and maintain it using faster techniques than array_filter), we can find the minimum node in Θ(log n) instead of Θ(n). Thus, the overal work would be Θ(m + n log n) as advertised.
The question asked for an estimate of the size of community we can support (n), but didn't include m, the total number of links. So, we would like a measure of work that is only in terms of n. If we assume each community member is linked to a constant number of people (for example, averages one inviter, 3 invitees, and 5 acquaintances, and these numbers don't change even as the community grows), then we can assume m is a multiple of n (in this example m = 9n). If this is true, we can remove the m term from our work measurement, since it grows slower than n². Hence, the work grows as the square of the number of community members: Θ(n²).
This means for n > some value, we can estimate the time to solve a query as: C + Wn² where C and W are unknown constants. To figure out how big a community we can support, we need to estimate what the constants are.
We could make guesses, but they are unlikely to be reasonable unless we make some actual measurements. For small n, the C value is likely to dominate, so we'd like to have a moderately large n. I don't have enough friends to get this in my community, but I'm good at making imaginary friends instead! My make-imaginary-friends-process.php code adds as many friends as I want, and inserts random inviter and acquaintance links:
 ...
 $number = $_POST['number'];
 print "Making " . $number . " imaginary friends...";

 for ($i = 0; $i < $number; $i++) {
 // Pick a random inviter (before adding the new friend)
 $randominviter = executeQuery 
 ("SELECT id FROM users WHERE user IS NOT NULL ORDER BY RAND() LIMIT 1");

 // Pick some random acquaintainces (10)
 $randomacquaints = executeQuery 
 ("SELECT id FROM users WHERE user IS NOT NULL ORDER BY RAND() LIMIT 0, 10");

 if (mysql_num_rows ($randominviter) != 1) {
	error ("No random inviter!");
 }

 executeQuery ("INSERT INTO users (password) VALUES ('marker')");
 $newid = executeQuery ("SELECT id FROM users WHERE password='marker'");

 if (mysql_num_rows ($newid) > 1) {
	error ("Multiple new user not found in table!");
 } else {
	$newidno = mysql_result ($newid, 0, 0);
	executeQuery ("UPDATE users SET password='imaginary', email='imaginary" 
 . $newidno . "@friend.com', user='imaginary"
		 . $newidno . "', firstnames='Friend" . $newidno 
		 . "', lastname='Hobbes' WHERE id='$newidno'");
 }
 
 $inviterid = mysql_result ($randominviter, 0, 0);
 executeQuery ("INSERT INTO links (fromid, toid) VALUES ('$inviterid', '$newidno')");

 for ($a = 0; $a < mysql_num_rows ($randomacquaints); $a++) {
	$acquaintidno = mysql_result ($randomacquaints, $a, 0);
	executeQuery 
	 ("INSERT INTO acquaints (fromid, toid) VALUES ('$newidno', '$acquaintidno')");
 }

 print ("Added imaginary friend: " . personName ($newidno) . " ");
 }

 $users = executeQuery ("SELECT id FROM users WHERE user IS NOT NULL");
 $links = executeQuery ("SELECT fromid FROM links");
 $acquaints = executeQuery ("SELECT fromid FROM acquaints");
 print ("Congratulations! Your community now has " . mysql_num_rows ($users) 
	 . " people and " 
	 . (mysql_num_rows ($links) + mysql_num_rows ($acquaints)) . " links.");
I added enough friends to have 200 (once you can make dynamic web sites, everyone wants to be your friend!)
In order to measure the time to respond to a find links query, we need to add some timing code to the processing code, since it is too difficult to measure it with a stop watch. Luckily, PHP provides procedures for getting the current time to microsecond accuracy (microtime).
I want to find the maximum query time, but it will be sufficient to try a few random queries and use the maximum time. With community size 200, I tried finding links between randomly selected (I used DrScheme's random procedure for this, except for the last one which selected the last two friends) imaginary friends and got these results:

From To Links Processing Time (s)

64 141 3 1.719

186 144 2 1.304

186 184 4 2.803

233 232 3 2.733

So, we can say the worst case with n=200 is likely to be around 3 seconds. Plugging values into our equation, we have:
3 = C + Wn² = C + 40000 W
We have one equation, but two unknowns, so we need another datapoint. I added 200 more imaginary friends to make n = 400. And tried finding links. Here are the results:

From To Links Processing Time (s)

154 224 3 5.95

377 3 2 0.52

427 430 3 8.76

424 431 4 10.66

433 432 5 11.00

So, with n = 400, it is likely the worst time will be around 12 seconds. Thus we have,
3 = C + 40000 W
12 = C + 160000 W
Now, we need to find an algebra expert to solve 2 equations in 2 unknowns. Subtracting the first equation from the second, we have 9 = 120000 W.
So, W = 0.000075 and C < 0.0000001. (Turns out with high enough n values, C is irrelevant.)
So, this would predict: maxtime = 0.000075 n². If we want maxtime = 60, then n = (sqrt (/ 60 0.000075)) = 894.
To check our prediction, we need to make 494 more imaginary friends. After doing so, my community has 894 people and 9872 links.
Here are the results of some find links:

From To Links Processing Time (s)

926 927 5 51.30

915 924 5 52.61

(Only a few queries this time, since its not polite to run compute-intensive processes on web servers!)
The prediction turns out to be quite accurate. Yippee!
Questions 6-8
The code for my completed site is here: community.zip

From	To	Links	Processing Time (s)
64	141	3	1.719
186	144	2	1.304
186	184	4	2.803
233	232	3	2.733

From	To	Links	Processing Time (s)
154	224	3	5.95
377	3	2	0.52
427	430	3	8.76
424	431	4	10.66
433	432	5	11.00

From	To	Links	Processing Time (s)
926	927	5	51.30
915	924	5	52.61








cs200-staff@cs.virginia.edu Using these Materials