Workshop 1

Before the workshop

  • Watch lectures 1 to 4 and read the corresponding chapters in the course literature
  • Install Java, Python, Weka, R and Scikit on your laptop
  • Prepare questions (if you have any) on the contents in lectures 1 to 4

Aim of the workshop

The aim of this workshop is to discuss the contents of lectures 1 to 4 and to work through practical assignments based on what you have learned in the lectures.

Assignments

A1: Text classification in the Weka tool

  • Download and install Weka if you haven’t done it already
  • Download and unzip the Wikipedia_300 dataset from the Datasets page
  • You need to apply the StringToWordVector filter in the Preprocess tab, and select Articletype in the target attribute dropdown list in the Classify tab
  • Classify the dataset in Weka using the algorithms NaiveBayes and NaiveBayesMultinomial with 10-fold cross-validation. What are the differences between the two classifiers, and why do you think one performs better than the other?
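To see the distinction outside the Weka GUI, here is a minimal sketch in Python with scikit-learn (not part of the assignment; the documents and labels are made up for illustration). NaiveBayes fits a per-attribute numeric model (a Gaussian per word column), while NaiveBayesMultinomial models the word counts produced by a StringToWordVector-style filter directly:

```python
# Hypothetical toy data -- illustrates why a multinomial model fits
# word-count features better than a per-attribute Gaussian model.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB, MultinomialNB

docs = ["football match goal", "goal scored in the match",
        "stock market prices", "market prices rose today"]
labels = ["sports", "sports", "finance", "finance"]

vec = CountVectorizer()                       # analogous to StringToWordVector
X = vec.fit_transform(docs)                   # sparse document-term count matrix

gnb = GaussianNB().fit(X.toarray(), labels)   # fits a Gaussian per word column
mnb = MultinomialNB().fit(X, labels)          # models the counts as multinomial

query = vec.transform(["goal in the match"])
print(mnb.predict(query)[0])                  # -> sports
```

Running both classifiers on the sparse, integer-valued count matrix makes the mismatch visible: the Gaussian model has to pretend each count column is normally distributed, which is a poor fit for mostly-zero word counts.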



A2: Text classification using the Weka library

  • Write Java code for classifying the Wikipedia_300 dataset using the NaiveBayesMultinomial algorithm with the Weka.jar library
  • Make sure you apply the StringToWordVector filter and set the correct index of the class attribute (it should be 0)
  • Read about how to use the Weka library from Java code here
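As a sanity check before wiring up the Java code, the same pipeline (string-to-word-vector filtering, multinomial Naive Bayes, 10-fold cross-validation) can be sketched in Python with scikit-learn. The documents below are synthetic stand-ins, since Wikipedia_300 is course-specific:

```python
# Sketch of the same workflow with made-up documents (two classes,
# ten documents each, so that 10-fold cross-validation is possible).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ([f"match {i} ended with a late goal" for i in range(10)] +
        [f"market prices rose {i} percent today" for i in range(10)])
labels = ["sports"] * 10 + ["finance"] * 10

# CountVectorizer plays the role of Weka's StringToWordVector filter.
model = make_pipeline(CountVectorizer(), MultinomialNB())
scores = cross_val_score(model, docs, labels, cv=10)   # 10-fold CV
print(f"mean accuracy: {scores.mean():.2f}")
```

The Java version follows the same order of operations: apply the filter, set the class attribute, then evaluate the classifier with cross-validation.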



A3: Classification of Iris dataset in the Weka tool

  • Classify the Iris dataset in Weka using the algorithms k-Nearest Neighbor (lazy/IBk), Decision Trees (trees/J48) and Naïve Bayes (bayes/NaiveBayes)
  • The Iris dataset can be found in the data folder in the Weka installation or can be downloaded from the Datasets page
  • Which algorithm gives the best result?



A4: Classification of Iris dataset using the Weka library

  • Write Java code for classifying the Iris data using the Weka.jar library. Select an algorithm based on your findings in assignment A3.
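The Java workflow (load the data, separate out the class attribute, build a classifier, evaluate on held-out instances) has a direct counterpart in scikit-learn. A minimal sketch, here assuming k-Nearest Neighbor came out best in A3 (your result may differ):

```python
# Minimal train/predict workflow on Iris, mirroring the Weka library steps:
# load data, separate the class attribute, build a classifier, evaluate.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)    # fixed seed for reproducibility

clf = KNeighborsClassifier(n_neighbors=3)    # assumed choice from the A3 results
clf.fit(X_train, y_train)
print(f"held-out accuracy: {clf.score(X_test, y_test):.2f}")
```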



A5: Regression using the Weka library

  • Download the GPUbenchmark dataset from the Datasets page
  • Write Java code that:
    • Iterates over all 19 training examples
    • For each iteration:
      • Remove one example from the dataset (remember to re-read the dataset each iteration, otherwise all examples will have been removed by iteration 19)
      • Train the k-Nearest Neighbor classifier (IBk) on the remaining 18 examples in the dataset
      • Predict the benchmark value for the training example you removed
      • Calculate the absolute difference between the predicted and actual benchmark value
    • Calculate the average absolute difference over all 19 iterations
    • Experiment with different values for k. Which gives the lowest average difference?
    • Experiment with removing one attribute at a time. Is the result improved if you remove some attribute?
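The leave-one-out loop above can be sketched in Python, with scikit-learn's KNeighborsRegressor standing in for Weka's IBk and synthetic data standing in for the course-specific GPUbenchmark file:

```python
# Leave-one-out evaluation of a k-NN regressor over 19 examples:
# remove one example, train on the remaining 18, predict the removed one.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the GPUbenchmark data: one feature, 19 examples.
X = np.arange(19, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0                     # made-up benchmark values

errors = []
for i in range(len(X)):
    # Slice fresh copies of the full dataset each iteration (the analogue of
    # re-reading the file), so only the current example is removed.
    X_train = np.delete(X, i, axis=0)
    y_train = np.delete(y, i)
    model = KNeighborsRegressor(n_neighbors=2).fit(X_train, y_train)
    pred = model.predict(X[i:i + 1])[0]       # predict the held-out example
    errors.append(abs(pred - y[i]))           # absolute difference

print(f"average absolute difference: {np.mean(errors):.3f}")
```

To experiment with k, change `n_neighbors`; to experiment with attribute removal, drop columns from `X` before the loop and compare the resulting averages.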



A6: Classification of Iris dataset in Scikit

  • Classify the Iris dataset in Scikit using the k-Nearest Neighbor algorithm
  • Experiment with different values for k. Which setting gives the best accuracy?
  • Test and compare the results when using a Decision Tree classifier instead. Which gives the best accuracy?
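One possible shape for this experiment in scikit-learn (the k values tried and the use of 10-fold cross-validated accuracy are assumptions; adjust them to your own setup):

```python
# Compare k-NN for several k values against a decision tree on Iris,
# using 10-fold cross-validated accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

for k in (1, 3, 5, 7):
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=10).mean()
    print(f"k-NN (k={k}): {acc:.3f}")

tree_acc = cross_val_score(DecisionTreeClassifier(random_state=0),
                           X, y, cv=10).mean()
print(f"Decision tree: {tree_acc:.3f}")
```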



A7: Classification of Iris dataset in R

  • Classify the Iris dataset in R using Decision Trees (CART) and k-Nearest Neighbor (kNN) algorithms
  • Which algorithm gives the best result?
  • Does the result from kNN match the results from Scikit and Weka? If not, what are the reasons for the differences in result?
