4/12/16 - This is a revision. The only things that changed from the
original assignment are underlined below. Here are some added clarifications...
- You do not have to implement NTGrowth. However, if you decide to
implement and test it, it is worth 5% extra credit.
- You should at least try different values of k (number of nearest
neighbors) OR try distance-weighting.
- This project does not involve locally-weighted regression.
- For feature selection/weighting, you have the option of trying
any of the methods we talked about, including:
- filter-based method: prioritizing features by information gain,
KL-divergence, Fisher score, etc.
- wrapper-based method: e.g. SFS (sequential forward selection),
SBE (sequential backward elimination), beam search, DIET
- Relief
- PCA
- However, you should attempt to find an appropriate dataset where the
technique works. For example, there is no point in applying SFS or PCA
to the Iris dataset, since: a) there are only 4 features, and b) the
concepts are easily separable (accuracy without feature selection is
~95% for most algorithms). So look for a more challenging dataset where
feature selection/weighting can make an improvement.
- For datasets with mixed features (both continuous and discrete),
you might want to find a way to make the feature weights or selection
criteria comparable. For example, use the sigmoid function to map
all weights into the range 0-1 (a small sketch of this idea appears
after these notes).
- Optionally, you can just choose databases with all discrete
features or all continuous features for testing your feature
selection/weighting method.
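For illustration, here is a minimal sketch of the sigmoid idea mentioned
above. It assumes you already have one raw score per feature (from whatever
weighting scheme you use for discrete vs. continuous attributes) and simply
squashes those scores into the range 0-1 so they are comparable; the names
are illustrative only, not a required interface:

    import math

    def sigmoid(x):
        """Squash a raw score into the open interval (0, 1)."""
        return 1.0 / (1.0 + math.exp(-x))

    def comparable_weights(raw_scores):
        """Map per-feature scores (possibly on different scales) to 0-1."""
        return [sigmoid(s) for s in raw_scores]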
CSCE 633 - Spring 2016
Project 3 - Instance-Based Learning and Feature Selection
due date: Thurs, April 28 (by start of class, 3:55pm)
The goal of this assignment is to implement an Instance-Based (or nearest
neighbor) algorithm with some form of feature selection or feature weighting
(in any programming language of your choice), and test it on at least 5
datasets from the UCI Machine Learning Repository. Compare the performance of
your IBL program with your Neural Network and your Decision Tree.
The nearest-neighbor algorithm is very easy to implement. However,
there are several details with which you might want to experiment.
For example, you might want to test different values of k (for k-NN),
or use distance-weighting. You might also want to experiment with
methods such as NTGrowth.
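For illustration, here is a minimal sketch of distance-weighted k-NN in Python
(Euclidean distance, with votes weighted by inverse squared distance). The
names and the weighting scheme are only one reasonable choice, not a required
interface:

    import math
    from collections import defaultdict

    def knn_predict(train_x, train_y, query, k=5, distance_weighted=True):
        """Classify `query` using the k nearest training examples."""
        dists = sorted(((math.dist(x, query), y) for x, y in zip(train_x, train_y)),
                       key=lambda pair: pair[0])
        votes = defaultdict(float)
        for d, y in dists[:k]:
            # each neighbor's vote counts more the closer it is to the query
            votes[y] += 1.0 / (d * d + 1e-9) if distance_weighted else 1.0
        return max(votes, key=votes.get)

Setting distance_weighted=False gives plain majority voting, which makes it
easy to compare the two variants on the same folds.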
It will be helpful to normalize the input attributes so they are
roughly equally weighted.
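For example, one common choice (a sketch, not the required approach) is
min-max rescaling of each continuous attribute to [0, 1], using statistics
computed on the training set only:

    def minmax_fit(train_x):
        """Per-attribute (min, max), computed on the training set only."""
        return [(min(col), max(col)) for col in zip(*train_x)]

    def minmax_apply(x, ranges):
        """Rescale one example so each attribute lies in [0, 1]."""
        return [(v - lo) / (hi - lo) if hi > lo else 0.0
                for v, (lo, hi) in zip(x, ranges)]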
An important component of this project is to implement some form
of feature selection or feature weighting. Your goal should be to
make your IBL program as accurate as possible, using the best method
you can implement for feature selection/weighting. You should present
the results of your IBL algorithm with and without feature
selection/weighting, to determine whether it has any effect on
accuracy.
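One concrete option is the filter approach from the clarifications above:
rank the features by information gain and keep only the top-ranked ones. The
sketch below assumes discrete feature values and class labels; the function
names are illustrative:

    import math
    from collections import Counter

    def entropy(labels):
        """Shannon entropy of a list of class labels."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def information_gain(column, labels):
        """Entropy reduction obtained by splitting on one discrete feature."""
        n = len(labels)
        remainder = 0.0
        for v in set(column):
            subset = [y for x, y in zip(column, labels) if x == v]
            remainder += (len(subset) / n) * entropy(subset)
        return entropy(labels) - remainder

    def rank_by_information_gain(examples, labels):
        """Feature indices sorted by decreasing information gain."""
        gains = [information_gain([e[i] for e in examples], labels)
                 for i in range(len(examples[0]))]
        return sorted(range(len(gains)), key=gains.__getitem__, reverse=True)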
When comparing the performance of algorithms, use T-tests.
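For instance, a paired t-test over per-fold accuracies from the same
cross-validation split is one reasonable way to do this. The sketch below
assumes SciPy is available and that both algorithms were evaluated on
identical folds:

    from scipy import stats

    def paired_t_test(fold_accs_a, fold_accs_b, alpha=0.05):
        """(significant?, t statistic, p value) for matched per-fold accuracies."""
        t_stat, p_value = stats.ttest_rel(fold_accs_a, fold_accs_b)
        return p_value < alpha, t_stat, p_value

    # e.g. significant, t, p = paired_t_test(ibl_fold_accs, dtree_fold_accs)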
What to Turn In
Submit your files through CSNet:
https://wiki.cse.tamu.edu/index.php/Turning_in_Assignments_on_CSNet
(you might need to be inside the TAMU firewall to access this)
Include the following:
- Your source code. Also include a description of how to compile and
run your program, sufficient so that it can be tested by the grader.
- A write-up that describes salient details about your implementation.
- A table of results. Include the confidence intervals on accuracy for your
IBL algorithm, as well as your decision tree (with pruning) and the best
version of your neural network, on at least 5 datasets. Interpret/explain your
results and their implications. Is IBL systematically better than, the same
as, or worse than your Decision Tree or Neural Network? Is this true just for
certain datasets, and if so, why?
- In the table, include confidence intervals for your nearest neighbor
algorithm (one way to compute such an interval is sketched at the end of this
handout), and do a T-test to determine whether any of the differences in
accuracy compared to other algorithms are statistically significant. Comment
on whether you observed any benefit with feature selection/weighting.
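As mentioned above, one way (among several acceptable ones) to get a
confidence interval on accuracy is a t-interval over the per-fold accuracies
from cross-validation. This sketch assumes SciPy:

    import math
    import statistics
    from scipy import stats

    def accuracy_confidence_interval(fold_accuracies, confidence=0.95):
        """t-based confidence interval for the mean accuracy across folds."""
        n = len(fold_accuracies)
        mean = statistics.mean(fold_accuracies)
        sem = statistics.stdev(fold_accuracies) / math.sqrt(n)
        t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)
        return mean - t_crit * sem, mean + t_crit * sem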