CS340:  Assignment 1

Deadline:  4 pm, Friday, January 25
Work on this individually.  You may talk to other students about questions related to understanding the problem, i.e. what we want you to do.  But you may not talk about how to solve it, either about design or about coding.

The following is a simplified version of a problem in statistical pattern recognition or data mining. The problem is called the k-nearest neighbor (k-NN) problem.  This is a method for doing classification, a problem where each of a large set of data items has a vector of data values assigned to it, and each belongs to a category.  A data item of unknown classification must be "classified", i.e. assigned to one category based on the known classification of the previously-classified data items.  If k = 1, this is called the nearest-neighbor method.

An example:  medical data values (e.g. weight, BMI, family history, etc.) are recorded for a large number of people when they were 40 years old. The two categories are whether or not they developed diabetes by the time they turned 65.  For new 40-year-old patients, we want to predict if they will develop diabetes or not.  We use each new patient's medical data values to classify that patient, based on the large data set of people whose classification is already known.

Problem specification:

A large number, M, of data items are each represented by a pair of values (X,Y) and a label.  The label represents one of a fixed set of categories. (For this problem, we'll say there are just two categories, cat1 and cat2.)  A new data item with its own (X,Y) value is compared against each of these data items by calculating the Euclidean distance between the two (X,Y) values.  We'll find the k items that are closest to the new data item and use them as described below.  (Euclidean distance between two points is pretty simple, but if you've forgotten see Wikipedia.)
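As a reminder, the Euclidean distance between (X1,Y1) and (X2,Y2) is the square root of (X1-X2)^2 + (Y1-Y2)^2.  Here is a minimal sketch in Python; the function name euclidean_distance and the use of tuples for points are just assumptions for illustration, not something this assignment requires.

```python
import math

def euclidean_distance(p, q):
    """Return the straight-line distance between two (X, Y) points."""
    # math.hypot(dx, dy) computes sqrt(dx*dx + dy*dy).
    return math.hypot(p[0] - q[0], p[1] - q[1])

print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```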

Program inputs:
Calculations and outputs:
For each unclassified data-item that is entered, do the following:
  1. Print the k data items that are closest to the unclassified data-item, in non-decreasing order of distance (i.e. nearest first). For each item, print its category, (X,Y) values, and distance to the unclassified data item.
  2. Print which category this data-item would be assigned to, based on whether the majority of the k-nearest-neighbors belong to the first or second category.  This "voting" is how the k-nearest-neighbor algorithm classifies the data-item.
  3. For each of the two categories, print the average distance to the data-item of those k-nearest-neighbors that belong to that category.  (This is not how k-NN makes a decision about classification, but your program should do it anyway.)
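
Steps 1 and 2 above could be sketched roughly as follows.  This is only one possible approach, not a required design: the names k_nearest and majority_label, the tuple layout (label, x, y), and the small data set are all made up for illustration.  Note that the handout does not say how to break a tied vote (possible when k is even); this sketch simply keeps whichever category reached the top count first, which is an assumption.

```python
import math
from collections import Counter

def k_nearest(items, query, k):
    """Return the k items closest to query as (distance, label, x, y),
    sorted nearest first. items is a list of (label, x, y) tuples."""
    scored = [(math.hypot(x - query[0], y - query[1]), label, x, y)
              for label, x, y in items]
    # Sort by distance only; Python's sort is stable, so equally distant
    # items keep their input order.
    scored.sort(key=lambda t: t[0])
    return scored[:k]

def majority_label(neighbors):
    """Return the label that occurs most often among the neighbors."""
    votes = Counter(label for _, label, _, _ in neighbors)
    # Assumption: on a tie, the label counted first wins.
    return votes.most_common(1)[0][0]

# Hypothetical data, not taken from the assignment.
items = [("cat1", 1, 1), ("cat2", 5, 5), ("cat1", 2, 2),
         ("cat2", 1, 0), ("cat2", 0, 1)]
nearest = k_nearest(items, (0, 0), 3)
for dist, label, x, y in nearest:
    print(f"({label},{x},{y},dist={dist:.3f})")
print("assigned to:", majority_label(nearest))
```

Sorting all M items is the simplest way to get the k nearest; other approaches (e.g. keeping only the best k seen so far) are possible too.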

Example results:
Say that k=5, the unclassified data-item has value (0,0),  and the nearest neighbors turn out to be:
cat2   2  0
cat1 0.5  0
cat2   1  0
cat1   0  0
cat2   0  3
Then the output for (1) above might look something like this:
(cat1,0,0,dist=0)  (cat1,0.5,0,dist=0.5)  (cat2,1,0,dist=1.0)  (cat2,2,0,dist=2.0)  (cat2,0,3,dist=3.0)

The output for (2) might look something like this:
Data item (0,0) assigned to: cat2
(This is because the "vote" for cat2 is 3-to-2 among the k=5 nearest neighbors. You can print the vote too if you want.)

The output for (3) might look something like this:
Average distance to cat1 items:  0.25
Average distance to cat2 items:  2.0
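
The averages in this example can be checked with a short sketch like the one below, which uses the five neighbors listed above.  The function name average_distance_by_category is hypothetical.  It also assumes that a category with no items among the k nearest is simply left out of the result; the handout does not say what to print in that case.

```python
import math
from collections import defaultdict

def average_distance_by_category(neighbors, query):
    """Map each label to the mean distance from query to the
    neighbors carrying that label. neighbors holds (label, x, y)."""
    distances = defaultdict(list)
    for label, x, y in neighbors:
        distances[label].append(math.hypot(x - query[0], y - query[1]))
    # Assumption: labels with no neighbors never appear in the result.
    return {label: sum(d) / len(d) for label, d in distances.items()}

# The k=5 nearest neighbors from the example, with query point (0,0).
neighbors = [("cat2", 2, 0), ("cat1", 0.5, 0), ("cat2", 1, 0),
             ("cat1", 0, 0), ("cat2", 0, 3)]
averages = average_distance_by_category(neighbors, (0, 0))
print("Average distance to cat1 items: ", averages["cat1"])  # 0.25
print("Average distance to cat2 items: ", averages["cat2"])  # 2.0
```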

Constraints:
Report:

Also turn in a report that has your name and UVa email-ID and includes brief statements about the following.  Please number your answers using the numbers below.  Your report can be in Word, plain-text, RTF, or PDF format.  
  1. Name any abstract data types that are important in your program (if any).
  2. Briefly describe any major data structures that you use and for what purpose in your program.  If you used things from a standard library (or other reusable components), mention those here.
  3. If you can think of any ways this problem statement is a bad requirements specification, briefly list at most three problems.  This may be missing or confusing requirements.  If you list anything here and you had to make an assumption or a change to make a working program, explain that.
  4. Briefly describe your design in terms of the program components you are using.  (We are deliberately not telling you exactly how to describe this.  Describe your design in a brief but useful way -- imagine that you wanted a fellow student to understand how to do this without giving them your code.)
  5. Briefly describe any error checking you did in your program.
  6. Briefly describe three good test cases that you did (or should have done) to test that your code is correct.  Say why your set of test cases is a good choice.
  7. (The answer to this part will not affect your grade at all. We're curious.)  How many hours did you spend on coding this assignment?  (Please be honest.  We won't hold this against you in any way!)
How we'll grade this report:
We'll read your report pretty rapidly, looking to see if you have a good grasp of the high-level ideas we're looking for.  We won't be reading for too much detail.  We'll grade each part with a fairly simple grading-rubric:
5: knows it well;  4: pretty good but could be better;  3: acceptable but needs work;  2: just below acceptable;  1: well below acceptable.

We really think it will take one to one-and-a-half pages at most to do this.  If you find yourself writing more than that, try to explain it more briefly. Also, it's important for this class that you communicate clearly and concisely; padding reports or documents with extra words makes for bad written communication.