P3 – Text Classification
This is one of the pre-defined project ideas you can choose for your project.
Text classification of Wikipedia articles
You are required to use Python and Scikit-learn for this project.
Classify the articles in the Wikipedia 300 dataset (150 articles about Video games, 150 about Programming) using machine learning. The dataset can be downloaded from the Datasets page.
For text classification, the bag-of-words approach, in which each article is converted to a vector of word counts, is typically used. An improvement is TF-IDF (Term Frequency-Inverse Document Frequency), which replaces raw word counts with term frequencies weighted by inverse document frequency, so that words appearing in most articles count for less. Using frequencies instead of raw counts is especially useful if the size of the articles varies a lot. Suitable algorithms for text classification are Multinomial Naïve Bayes (MultinomialNB) and Support Vector Machines with linear kernels (LinearSVC).
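The bag-of-words, TF-IDF, and classifier steps described above can be sketched as a Scikit-learn pipeline. This is only a minimal illustration, not the required solution: the six sentences in `docs`, the label strings, the `unseen` examples, and the helper `make_pipeline` are made up for the example and stand in for the Wikipedia 300 articles, whose file layout is not described here.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus standing in for the Wikipedia 300 articles.
docs = [
    "the player controls a character in the game level",
    "the console game features multiplayer levels and bosses",
    "players unlock weapons in this adventure game",
    "the compiler translates source code into machine code",
    "a function in the programming language returns a value",
    "the language supports classes, loops and recursion in code",
]
labels = ["Games", "Games", "Games",
          "Programming", "Programming", "Programming"]


def make_pipeline(classifier):
    """Chain word counting, TF-IDF weighting, and a classifier."""
    return Pipeline([
        ("counts", CountVectorizer()),   # bag-of-words: text -> word counts
        ("tfidf", TfidfTransformer()),   # word counts -> TF-IDF weights
        ("clf", classifier),
    ])


# The two algorithms named in the text, each behind the same preprocessing.
models = {
    "MultinomialNB": make_pipeline(MultinomialNB()),
    "LinearSVC": make_pipeline(LinearSVC()),
}

# Made-up unseen texts, used only to show how predict() is called.
unseen = [
    "a new game with many levels",
    "writing code in a programming language",
]

predictions = {}
for name, model in models.items():
    model.fit(docs, labels)
    predictions[name] = list(model.predict(unseen))
    print(name, predictions[name])
```

Wrapping the vectorizer and the classifier in one `Pipeline` means the vocabulary and IDF weights are learned only from the data passed to `fit`, which matters once the real dataset is split into training and test parts or evaluated with something like `cross_val_score` from `sklearn.model_selection`.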
You can read about text classification in the Scikit-learn documentation.
Grading
Grade | Requirements |
---|---|
E | |
C-D | |
A-B | |