- In assignment 4 you shall implement the Naïve Bayes machine learning algorithm and use it on some datasets
- It can be implemented in any programming language you like
- You can work alone or in a group of two students
- You shall present your application and code at an oral examination
- You are not required to build a REST web service for this assignment
See the Deadlines and Submissions page.
Note! The purpose of this assignment is for you to learn how to implement Naïve Bayes, encode label strings as integers, calculate accuracy scores, perform cross-validation and generate a confusion matrix. This functionality is often available in machine learning libraries such as Weka or Scikit-learn, which you are not allowed to use. You are allowed to use library functions for loading and shuffling the data, and for all necessary mathematical operations, data structures etc.
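As an illustration of the label-encoding step mentioned above, here is a minimal sketch in plain Python. The function name `encode_labels` and the choice to number labels in order of first appearance are assumptions made for this example, not requirements of the assignment.

```python
def encode_labels(labels):
    """Map each distinct label string to an integer, in order of first appearance."""
    mapping = {}   # label string -> integer code
    encoded = []   # integer code for every input label
    for label in labels:
        if label not in mapping:
            mapping[label] = len(mapping)  # next unused integer
        encoded.append(mapping[label])
    return encoded, mapping


labels = ["setosa", "versicolor", "setosa", "virginica"]
y, mapping = encode_labels(labels)
print(y)        # [0, 1, 0, 2]
print(mapping)  # {'setosa': 0, 'versicolor': 1, 'virginica': 2}
```

Keeping the mapping around makes it possible to translate integer predictions back to the original label strings when presenting results.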
Code structure requirements
| Method | Description |
|---|---|
| void fit ( X:float, y:int ) | Trains the model on input examples X and labels y |
| int predict ( X:float ) | Classifies examples X and returns a list of predictions |
| float accuracy_score ( preds:int, y:int ) | Calculates accuracy score for a list of predictions |
| int confusion_matrix ( preds:int, y:int ) | Generates a confusion matrix and returns it as an integer matrix |
| int crossval_predict ( X:float, y:int, folds:int ) | Runs n-fold cross-validation and returns a list of predictions |
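
One possible shape for `fit` and `predict` is sketched below. It assumes the Gaussian variant of Naïve Bayes (a natural fit for float inputs, but not stated in the assignment), uses the population standard deviation with a small floor to avoid division by zero, and works in log space to avoid numerical underflow. The class name `NaiveBayes` and the helper `_log_posterior` are hypothetical.

```python
import math


class NaiveBayes:
    """Sketch of a Gaussian Naive Bayes classifier (assumed variant)."""

    def fit(self, X, y):
        self.classes = sorted(set(y))
        self.priors, self.means, self.stds = {}, {}, {}
        for c in self.classes:
            rows = [x for x, label in zip(X, y) if label == c]
            self.priors[c] = len(rows) / len(y)
            cols = list(zip(*rows))  # one tuple of values per input variable
            self.means[c] = [sum(col) / len(col) for col in cols]
            # Population standard deviation; the 1e-9 floor is an assumption
            # that guards against zero variance in a column.
            self.stds[c] = [
                max(math.sqrt(sum((v - m) ** 2 for v in col) / len(col)), 1e-9)
                for col, m in zip(cols, self.means[c])
            ]

    def _log_posterior(self, x, c):
        # log P(c) + sum of log Gaussian densities, one per input variable
        total = math.log(self.priors[c])
        for v, m, s in zip(x, self.means[c], self.stds[c]):
            total += -math.log(s * math.sqrt(2 * math.pi)) - (v - m) ** 2 / (2 * s ** 2)
        return total

    def predict(self, X):
        # Pick the class with the highest log posterior for each example
        return [max(self.classes, key=lambda c: self._log_posterior(x, c)) for x in X]


# Toy data, invented for this example
X = [[1.0, 2.0], [1.2, 1.8], [5.0, 6.0], [5.2, 6.1]]
y = [0, 0, 1, 1]
model = NaiveBayes()
model.fit(X, y)
print(model.predict([[1.1, 1.9], [5.1, 6.0]]))  # [0, 1]
```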
Input data (a float matrix with input variables as columns and examples as rows) is usually denoted X, and the categories/labels (a list of integers) are usually denoted y. Predictions (a list of integers) shall be compared with the actual labels (y) when calculating the accuracy score (percentage of correct predictions) and generating the confusion matrix.
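The comparison between predictions and actual labels can be sketched as follows. This assumes labels are encoded as integers 0..k-1 and that the confusion matrix has actual classes as rows and predicted classes as columns; the orientation is a convention chosen for this example, so check which one your reference tool uses.

```python
def accuracy_score(preds, y):
    """Percentage of predictions that match the actual labels."""
    correct = sum(1 for p, t in zip(preds, y) if p == t)
    return 100.0 * correct / len(y)


def confusion_matrix(preds, y):
    """Integer matrix where entry [actual][predicted] counts examples."""
    n_classes = max(max(preds), max(y)) + 1  # assumes labels are 0..k-1
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for p, t in zip(preds, y):
        matrix[t][p] += 1
    return matrix


preds = [0, 1, 1, 2, 2, 2]
y = [0, 1, 2, 2, 2, 1]
print(round(accuracy_score(preds, y), 2))  # 66.67
print(confusion_matrix(preds, y))          # [[1, 0, 0], [0, 1, 1], [0, 1, 2]]
```

The diagonal of the matrix holds the correct predictions, so its sum divided by the number of examples gives the same accuracy.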
You can verify your results against those in Web ML Experimenter. The Iris dataset is built into Web ML (click the Try Iris dataset button), and the Banknote authentication dataset can be uploaded from the csv file. Note that the cross-validation results can differ slightly due to differences in how the data is split into folds, but the accuracy you get should be close to the accuracy in Web ML.
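Because fold splitting affects the result, it may help to see one way the split can be done. The sketch below is an assumption-laden illustration: `fold_indices`, the `make_model` parameter and the stand-in `MajorityClassifier` are hypothetical names introduced here, and the function takes a model factory rather than matching the required `crossval_predict ( X, y, folds )` signature exactly. It splits the examples into contiguous folds, so the data should be shuffled beforehand.

```python
def fold_indices(n_examples, folds):
    """Split range(n_examples) into `folds` contiguous, near-equal index lists."""
    base, extra = divmod(n_examples, folds)
    result, start = [], 0
    for i in range(folds):
        size = base + (1 if i < extra else 0)  # first `extra` folds get one more
        result.append(list(range(start, start + size)))
        start += size
    return result


def crossval_predict(make_model, X, y, folds):
    """Predict every example with a model trained on the other folds."""
    preds = [None] * len(y)
    for test_idx in fold_indices(len(y), folds):
        held_out = set(test_idx)
        train_X = [x for i, x in enumerate(X) if i not in held_out]
        train_y = [t for i, t in enumerate(y) if i not in held_out]
        model = make_model()  # fresh, untrained model for each fold
        model.fit(train_X, train_y)
        fold_preds = model.predict([X[j] for j in test_idx])
        for j, p in zip(test_idx, fold_preds):
            preds[j] = p
    return preds


class MajorityClassifier:
    """Stand-in model for the demo: always predicts the most common training label."""

    def fit(self, X, y):
        self.label = max(set(y), key=y.count)

    def predict(self, X):
        return [self.label] * len(X)


print(fold_indices(10, 3))  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]

X = [[float(i)] for i in range(6)]
y = [0, 0, 0, 0, 0, 1]
print(crossval_predict(MajorityClassifier, X, y, 3))  # [0, 0, 0, 0, 0, 0]
```

Since each example appears in exactly one test fold, the returned list lines up with y and can be passed directly to the accuracy and confusion matrix calculations.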