CS340: Assignment 1
Deadline: 4 pm, Friday, January 25
Work on this individually. You may talk to other students about questions related to understanding the problem, i.e. what we want you to do. But you may not talk about how to solve it, either about design or about coding.
The following is a simplified version of a problem in statistical
pattern recognition or data mining. The problem is called the k-nearest
neighbor (k-NN) problem. This is a method for doing
classification, a problem where each of a large set of data items has a
vector of data values assigned to it, and each belongs to a
category. A data item of unknown classification must be
"classified", i.e. assigned to one category based on the known
classification of the previously-classified data items. If k = 1,
this is called the nearest-neighbor method.
An example: medical data values (e.g. weight, BMI, family
history, etc.) are recorded for a large number of people when they were
40 years old. The two categories are whether or not they developed
diabetes by the time they turned 65. For new 40-year-old
patients, we want to predict if they will develop diabetes or
not. We use the medical data values for each new patient and
classify that patient based on the large data set of known classifications.
Problem specification:
Each of a large number, M, of data items is represented by a pair of values
(X,Y) and a label. The label represents one of a fixed set of
categories. (For this problem, we'll say there are just two categories,
cat1 and cat2.) A new data item with its own (X,Y) value is to be
processed against each of these data items and the Euclidean distance
between the pair of (X,Y) values is calculated. We'll find the k
items that are closest to the new data item and use them as described
below. (Euclidean distance between two points is pretty simple,
but if you've forgotten see Wikipedia.)
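As a reminder, the distance between (X1,Y1) and (X2,Y2) is the square root of (X1-X2)^2 + (Y1-Y2)^2. One way this might look in Python is sketched below; the function name euclidean_distance is our own illustration, not something your design has to include.

```python
import math

def euclidean_distance(p, q):
    """Return the straight-line distance between two (x, y) points."""
    # math.hypot computes sqrt(dx*dx + dy*dy) for us.
    return math.hypot(p[0] - q[0], p[1] - q[1])

print(euclidean_distance((0.0, 0.0), (3.0, 4.0)))  # 5.0
```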
Program inputs:
- Value of k: prompt the user for a value of k.
- Value of M: prompt the user for the value of M, the number of data items to be read from the data file.
- Data file name: prompt the user for the name of a data file
containing the classified data items. Each item will be on a line
by itself, where each line is the category value followed by the X and
Y values (all separated by 1 or more spaces). X and Y may be any
floating point values (negative, positive or zero).
- Unclassified data values: prompt the user for (X,Y) value
pairs. Keep prompting and processing these (see below) until the
user enters 1.0 and 1.0 (yes, this is kind of dumb but it's simple).
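To make the input format concrete, here is a minimal Python sketch of reading one line of the data file and of recognizing the stop values. The helper names parse_item and is_sentinel are hypothetical, and the sketch assumes every line is well formed; any error checking is up to you.

```python
def parse_item(line):
    """Split one data-file line into (category, x, y)."""
    # str.split() with no argument treats any run of spaces as one separator.
    category, x_text, y_text = line.split()
    return category, float(x_text), float(y_text)

def is_sentinel(x, y):
    """Return True when the user entered the stop values 1.0 and 1.0."""
    return x == 1.0 and y == 1.0

print(parse_item("cat2   2   0"))  # ('cat2', 2.0, 0.0)
```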
Calculations and outputs:
For each unclassified data-item that is entered, do the following:
1. Print the set of k data items that are closest to the
unclassified data-item, in non-decreasing order of distance (i.e. nearest first).
For each item, print its category, (X,Y) values, and distance to the
unclassified data item.
2. Print which category this data-item would be assigned to, based
on whether the majority of the k-nearest-neighbors belong to the first
or second category. This "voting" is how the k-nearest-neighbor
algorithm classifies the data-item.
3. Print the average distance of the k-nearest-neighbors to the data-item for each of the two categories. (This is not how k-NN makes a decision about classification, but your program should do it anyway.)
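The selection and voting steps above can be sketched as follows. This is only one possible approach in Python, not a required design: the names k_nearest and majority_category are hypothetical, items are assumed to be (category, x, y) tuples, and ties (which this handout does not address) simply fall to whichever item or category was seen first.

```python
import math
from collections import Counter

def k_nearest(items, query, k):
    """Return the k items closest to query as (distance, category, x, y), nearest first."""
    scored = [(math.hypot(x - query[0], y - query[1]), cat, x, y)
              for cat, x, y in items]
    scored.sort(key=lambda entry: entry[0])  # stable sort: equal distances keep input order
    return scored[:k]

def majority_category(neighbors):
    """Return the category that appears most often among the neighbors."""
    votes = Counter(cat for _, cat, _, _ in neighbors)
    return votes.most_common(1)[0][0]

# Small made-up data set: six items, query at the origin, k = 5.
items = [("cat2", 2, 0), ("cat1", 0.5, 0), ("cat2", 1, 0),
         ("cat1", 0, 0), ("cat2", 0, 3), ("cat1", 5, 5)]
nearest = k_nearest(items, (0, 0), 5)
print([distance for distance, _, _, _ in nearest])  # [0.0, 0.5, 1.0, 2.0, 3.0]
print(majority_category(nearest))                   # cat2
```

Sorting all M items is the simplest choice; for very large M you might prefer a structure that keeps only the k best candidates.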
Example results:
Say that k=5, the unclassified data-item has value (0,0), and the nearest neighbors turn out to be:
cat2 2 0
cat1 0.5 0
cat2 1 0
cat1 0 0
cat2 0 3
Then the output for (1) above might look something like this:
(cat1,0,0,dist=0) (cat1,0.5,0,dist=0.5) (cat2,1,0,dist=1.0) (cat2,2,0,dist=2.0) (cat2,0,3,dist=3.0)
The output for (2) might look something like this:
Data item (0,0) assigned to: cat2
(This is because the "vote" for cat2 is 3-to-2 among the k=5 nearest neighbors. You can print the vote too if you want.)
The output for (3) might look something like this:
Average distance to cat1 items: 0.25
Average distance to cat2 items: 2.0
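The averages in this example can be checked with a few lines of Python. This is a sketch under our own assumptions: the function name average_distance_by_category is hypothetical, neighbors are given as (category, distance) pairs, and a category with no items among the neighbors is simply left out of the result (the handout does not say what to print in that case).

```python
from collections import defaultdict

def average_distance_by_category(neighbors):
    """Map each category to the mean distance of its items among the neighbors."""
    distances = defaultdict(list)
    for category, distance in neighbors:
        distances[category].append(distance)
    return {cat: sum(ds) / len(ds) for cat, ds in distances.items()}

# The five nearest neighbors from the example, as (category, distance) pairs.
neighbors = [("cat1", 0.0), ("cat1", 0.5), ("cat2", 1.0), ("cat2", 2.0), ("cat2", 3.0)]
averages = average_distance_by_category(neighbors)
print(averages)  # {'cat1': 0.25, 'cat2': 2.0}
```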
Constraints:
- Use an object-oriented design for your solution. Use any OO language you wish.
- Make use of standard libraries or other reusable components as much as you can.
- Document any error checking you do in the report (see below).
- Document any assumptions or changes you make in the report (see below).
- Comments are not required other than a header in the main file listing your name and contact info.
- Submit all source files and an executable file of some sort:
an executable JAR file if you're using Java, or a .EXE file
if you're using C++ on Windows. (If you're not using one of
those systems, then explain to us how to build and run your program.
Make it as easy as possible for us to build and run your program.)
Report:
Also turn in a report that has your name and UVa email-ID and
includes brief statements about the following. Please number your
answers using the numbers below. Your report can be in Word,
plain-text, RTF, or PDF format.
1. Name any abstract data types that are important in your program (if any).
2. Briefly describe any major data structures
that you use and for what purpose in your program. If you used
things from a standard library (or other reusable components), mention
those here.
3. If you can think of any ways this problem statement is a bad requirements specification, briefly list at most three
problems. These may be missing or confusing requirements. If
you list anything here and you had to make an assumption or a change to
make a working program, explain that.
4. Briefly describe your design in terms of the program components
you are using. (We are deliberately not telling you exactly how
to describe this. Describe your design in a brief but useful way
-- imagine that you wanted a fellow student to understand how to do
this without giving them your code.)
5. Briefly describe any error checking you did in your program.
6. Briefly describe three good test cases that you did (or should have done) to test that your code is correct. Say why your set of test cases is a good choice.
7. (The answer to this part will not affect your grade at all. We're
curious.) How many hours did you spend on coding this assignment?
(Please be honest. We won't hold this against you in any
way!)
How we'll grade this report:
We'll read your report pretty rapidly, looking to see if you have a
good grasp of the high-level ideas we're looking for. We won't be
reading for too much detail. We'll grade each part with a fairly
simple grading-rubric:
5: knows it well; 4: pretty good but could be better; 3:
acceptable but needs work; 2: just below acceptable; 1:
well below acceptable.
We really think it will take one to one-and-a-half pages at most to do
this. If you find yourself writing more than that, try to
explain it more briefly. Also, it's important for this class that you
communicate clearly and concisely; padding reports or
documents with extra words makes for bad written communication.