## Before the workshop

- Watch lectures 1 to 4 and read the corresponding chapters in the course literature
- Prepare questions (if you have any) on the contents of lectures 1 to 4

## Software

It is recommended that you use a suitable Docker image for the workshop assignments; see the Docker page.

## Aim of the workshop

The aim of this workshop is to discuss the contents of lectures 1 to 4 and to work through practical assignments based on what you have learned in those lectures.

## Assignments

**A1: Text classification in the Weka tool**

- Download and install Weka if you haven’t done so already
- Download and unzip the Wikipedia_300 dataset from the Datasets page
- Apply the StringToWordVector filter in the Preprocess tab, and select Articletype in the target-attribute dropdown list in the Classify tab
- Classify the dataset in Weka using the NaiveBayes and NaiveBayesMultinomial algorithms with 10-fold cross-validation. What are the differences between the two classifiers, and why do you think one performs better than the other?

**A2: Text classification using the Weka library**

- Write Java code for classifying the Wikipedia_300 dataset using the NaiveBayesMultinomial algorithm with the Weka.jar library
- Make sure you apply the StringToWordVector filter and set the correct index of the class label (it should be 0)
- Read about how to use the Weka library from Java code here

**A3: Classification of Iris dataset in the Weka tool**

- Classify the Iris dataset in Weka using the algorithms k-Nearest Neighbor (lazy/IBk), Decision Trees (trees/J48), and Naïve Bayes (bayes/NaiveBayes)
- The Iris dataset can be found in the data folder in the Weka installation or can be downloaded from the Datasets page
- Which algorithm gives the best result?

**A4: Classification of Iris dataset using the Weka library**

- Write Java code for classifying the Iris dataset using the Weka.jar library. Select the algorithm based on your findings in assignment A3.

**A5: Classification of Iris dataset in Scikit**

- Classify the Iris dataset in Scikit using the k-Nearest Neighbor algorithm
- Experiment with different values for k. Which setting gives the best accuracy?
- Test and compare the results when using a Decision Tree classifier instead. Which gives the best accuracy?
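As a starting point, the A5 steps can be sketched in scikit-learn as below. The 10-fold cross-validation mirrors the Weka assignments; the particular k values tried are just a suggestion, and you should experiment further.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Load the 150-example Iris dataset bundled with scikit-learn
X, y = load_iris(return_X_y=True)

# Try a few values of k and compare mean 10-fold cross-validation accuracy
for k in (1, 3, 5, 7):
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=10).mean()
    print(f"kNN (k={k}): {acc:.3f}")

# Compare against a decision tree evaluated the same way
tree_acc = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10).mean()
print(f"Decision tree: {tree_acc:.3f}")
```

Note that `random_state=0` only fixes the tree's internal tie-breaking so results are reproducible; it does not affect the kNN runs.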

**A6: Classification of Iris dataset in R**

- Classify the Iris dataset in R using Decision Trees (CART) and k-Nearest Neighbor (kNN) algorithms
- Which algorithm gives the best result?
- Does the result from kNN match the results from Scikit and Weka? If not, what are the reasons for the differences?

**A7: Regression using the Weka library**

- Download the GPUbenchmark dataset from the Datasets page
- Write Java code that:
  - Iterates over all 19 training examples
  - For each iteration:
    - Removes one example from the dataset (remember to re-read the dataset in each iteration, otherwise all examples will have been removed by iteration 19)
    - Trains the k-Nearest Neighbor classifier (IBk) on the remaining 18 examples in the dataset
    - Predicts the benchmark value for the training example you removed
    - Calculates the absolute difference between the predicted and actual benchmark values
  - Calculates the average absolute difference over all 19 iterations
- Experiment with different values for k. Which gives the lowest average difference?
- Experiment with removing one attribute at a time. Does the result improve when some attribute is removed?
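The leave-one-out loop described above can be sketched in plain Python with a toy kNN regressor. The synthetic data here merely stands in for the 19 GPUbenchmark examples (the feature values are made up); your actual solution should load the real dataset and use Weka's IBk from Java.

```python
import math

def knn_predict(train, query, k=3):
    """Predict the target as the mean of the k nearest training examples
    (Euclidean distance on the feature vectors)."""
    nearest = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
    return sum(target for _, target in nearest) / k

def loocv_mae(examples, k=3):
    """Leave-one-out evaluation: hold out each example in turn, train on
    the rest, and average the absolute prediction errors."""
    errors = []
    for i, (features, actual) in enumerate(examples):
        rest = examples[:i] + examples[i + 1:]  # fresh copy each iteration
        predicted = knn_predict(rest, features, k)
        errors.append(abs(predicted - actual))
    return sum(errors) / len(errors)

# Toy stand-in for the 19 GPU examples: (feature vector, benchmark value)
data = [((i, i * 2.0), 10.0 + 3.0 * i) for i in range(19)]
for k in (1, 3, 5):
    print(f"k={k}: average absolute difference = {loocv_mae(data, k):.2f}")
```

Rebuilding `rest` from the full list each iteration plays the same role as re-reading the dataset in the Java version: the held-out example is only ever removed from a copy, never from the original data.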