This is one of the predefined ideas you can choose for your project.
Clustering Wikipedia articles
Modify your clustering system from Assignment 2 to use Wikipedia articles (90 articles about Programming, 90 about Games). You can download the dataset here.
To use the dataset for clustering, you need to select a number of words and calculate the frequency of each of these words in each Wikipedia article. Using all words from the articles is not recommended, since the similarity calculations would then take a long time. For example, you can use the following words:
language, programming, computer, software, hardware, data, player, online, system, development,
machine, console, developer, design, history, technology, standard, information, article, example
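As a rough illustration of the counting step, here is a minimal Python sketch. It rests on assumptions that are not part of the assignment: each article is available as a plain-text string, "frequency" means a raw occurrence count, and a token is any run of lowercase letters after lowercasing the text. The names WORDS and word_frequencies are hypothetical; adapt them to whatever language and structure your Assignment 2 system uses.

```python
import re
from collections import Counter

# The 20 example words listed above.
WORDS = [
    "language", "programming", "computer", "software", "hardware",
    "data", "player", "online", "system", "development",
    "machine", "console", "developer", "design", "history",
    "technology", "standard", "information", "article", "example",
]


def word_frequencies(text, words=WORDS):
    """Return the raw count of each selected word in text, in list order."""
    # Assumption: lowercase the text and treat every run of letters as a token.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    # A Counter returns 0 for words that never occur in the article.
    return [counts[w] for w in words]
```

Only exact matches are counted in this sketch, so "players" does not count towards "player". Whether to merge such variants (for example by stemming) is a design choice left to you.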
The article Arcade_game would then have the following frequencies: