P3 – Text Classification

This is one of the pre-defined project ideas you can choose for your project.

Text classification of Wikipedia articles

You are required to use Python and Scikit-learn for this project.

Classify the Wikipedia 300 dataset (150 articles about Video games, 150 about Programming) using machine learning. The dataset can be downloaded here.

For text classification, the bag-of-words approach, where each article is converted to a vector of word counts, is typically used. An improvement is TF-IDF (Term Frequency-Inverse Document Frequency), which converts the raw word counts to word frequencies and down-weights words that occur in many articles. TF-IDF is especially useful if the length of the articles varies a lot. Suitable algorithms for text classification are Multinomial Naïve Bayes (MultinomialNB) and Support Vector Machines with linear kernels (LinearSVC).
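To make the two representations concrete, here is a minimal sketch of how word counts and TF-IDF weights could be produced in Scikit-learn. The two toy sentences are hypothetical stand-ins for real articles, and the default settings of CountVectorizer and TfidfTransformer are assumed rather than prescribed by the assignment.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy documents standing in for two Wikipedia articles.
docs = [
    "the game has levels and the player controls a character",
    "the compiler translates source code into machine code",
]

# Bag-of-words: each document becomes a sparse vector of word counts.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

# TF-IDF: re-weight the counts and (by default) scale each row to unit length,
# so long and short documents become comparable.
tfidf = TfidfTransformer().fit_transform(counts)

# vocabulary_ maps each word to its column index in both matrices.
col = vectorizer.vocabulary_["code"]
print("count of 'code' in document 2: ", counts[1, col])
print("TF-IDF of 'code' in document 2:", tfidf[1, col])
```

Both matrices have one row per document and one column per vocabulary word, so either can be passed directly to a classifier's fit method.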

You can read about text classification in Scikit-learn here.

Grading

Grade Requirements
E
  • Classify the dataset using MultinomialNB and LinearSVC with the bag-of-words approach
  • Evaluate accuracy on the same data as used for training the algorithms
C-D
  • Also evaluate accuracy using 10-fold cross-validation
A-B
  • Use TF-IDF to convert from word counts to word frequencies
  • Does TF-IDF improve classification accuracy when using cross-validation?
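
The evaluation steps listed in the requirements could be organised roughly as in the sketch below. It is only an illustration under stated assumptions: the corpus is a small synthetic stand-in (the word lists, the make_doc and evaluate helpers, and the label names are all hypothetical), and loading the real Wikipedia 300 files is left out because their format is not described here.

```python
import random

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical stand-in corpus: short synthetic "articles" built from two
# topic vocabularies. Replace texts/labels with the real dataset.
rng = random.Random(0)
game_words = ["game", "player", "level", "console", "quest", "score"]
prog_words = ["code", "compiler", "function", "variable", "loop", "syntax"]
shared_words = ["the", "and", "with", "uses", "system"]


def make_doc(topic_words):
    """Build one synthetic document from topic words plus shared filler words."""
    words = rng.choices(topic_words, k=8) + rng.choices(shared_words, k=6)
    rng.shuffle(words)
    return " ".join(words)


texts = [make_doc(game_words) for _ in range(20)]
texts += [make_doc(prog_words) for _ in range(20)]
labels = ["Games"] * 20 + ["Programming"] * 20


def evaluate(classifier, use_tfidf):
    """Return (training accuracy, mean 10-fold cross-validation accuracy)."""
    steps = [CountVectorizer()]
    if use_tfidf:
        steps.append(TfidfTransformer())
    model = make_pipeline(*steps, classifier)
    # Accuracy on the same data that was used for training.
    train_acc = model.fit(texts, labels).score(texts, labels)
    # 10-fold cross-validation; the whole pipeline is refit on each fold.
    cv_acc = cross_val_score(model, texts, labels, cv=10).mean()
    return train_acc, cv_acc


results = {}
for name, make_clf in [("MultinomialNB", MultinomialNB), ("LinearSVC", LinearSVC)]:
    for use_tfidf in (False, True):
        results[(name, use_tfidf)] = evaluate(make_clf(), use_tfidf)
        train_acc, cv_acc = results[(name, use_tfidf)]
        features = "TF-IDF" if use_tfidf else "counts"
        print(f"{name:14s} {features:7s} train={train_acc:.3f} cv={cv_acc:.3f}")
```

Wrapping the vectorizer and classifier in one pipeline means the vocabulary and IDF weights are learned from the training folds only, so the cross-validation scores are not influenced by the held-out articles. The synthetic corpus is trivially separable, so its scores say nothing about how the classifiers behave on the real dataset.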
